Clearing memory in different languages for security

When studying Java I learned that Strings were not safe for storing passwords, since you can't manually clear the memory associated with them (you can't be sure they will eventually be gc'ed, interned strings may never be, and even after gc you can't be sure the physical memory contents were really wiped). Instead, I was told to use char arrays, so I could zero them out after use. I've tried to search for similar practices in other languages and platforms, but so far I haven't found the relevant info (usually all I see are code examples of passwords stored in strings with no mention of any security issue).
I'm particularly interested in the situation with browsers. I use jQuery a lot, and my usual approach is just to set the value of a password field to an empty string and forget about it:
$(myPasswordField).val("");
But I'm not 100% convinced it is enough. I also have no idea whether or not the strings used for intermediate access are safe (for instance, when I use $.ajax to send the password to the server). As for other languages, I usually see no mention of this issue (another language I'm particularly interested in is Python).
I know questions attempting to build lists are controversial, but since this deals with a common security issue that is largely overlooked, IMHO it's worth it. If I'm mistaken, I'd be happy to learn about just JavaScript (in browsers) and Python, then. I was also unsure whether to ask here, at security.SE or at programmers.SE, but since it involves the actual code to safely perform the task (not a conceptual question) I believe this site is the best option.
Note: in low-level languages, or languages that unambiguously support characters as primitive types, the answer should be obvious (Edit: not really obvious, as @Gabe showed in his answer below). I'm asking about those high-level languages in which "everything is an object" or something like that, and also about those that perform automatic string interning behind the scenes (so you may create a security hole without realizing it, even if you're reasonably careful).
Update: according to an answer in a related question, even using char[] in Java is not guaranteed to be bulletproof (or .NET SecureString, for that matter), since the gc might move the array around, so its contents might linger in memory even after clearing (SecureString at least stays at the same RAM address, guaranteeing clearing, but its consumers/producers might still leave traces).
I guess @NiklasB. is right: even though the vulnerability exists, the likelihood of an exploit is low and the difficulty of preventing it is high, which might be the reason this issue is mostly ignored. I wish I could find at least some reference to this problem concerning browsers, but googling for it has been fruitless so far (does this scenario at least have a name?).

The .NET solution to this is SecureString.
A SecureString object is similar to a String object in that it has a text value. However, the value of a SecureString object is automatically encrypted, can be modified until your application marks it as read-only, and can be deleted from computer memory by either your application or the .NET Framework garbage collector.
Note that even for low-level languages like C, the answer isn't as obvious as it seems. Modern compilers can determine that you are writing to the string (zeroing it out) but never reading the values back, and just optimize away the zeroing. To prevent the zeroing from being optimized away, Windows provides SecureZeroMemory.

For Python, there's no way to do that, according to this answer. A possibility would be using lists of characters (as length-1 strings, or maybe code units as integers) instead of strings, so you can overwrite that list after use, but that would require every piece of code that touches it to support this format (if even a single one of them creates a string with its contents, it's over).
There is also a mention of a method using ctypes, but the link is broken, so I'm unaware of its contents. This other answer also refers to it, but there's not a lot of detail.
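
For what it's worth, here is a minimal sketch of one ctypes-based approach (my own reconstruction, since the link is broken; the wipe helper is invented for illustration): keep the secret in a mutable bytearray instead of a str, and zero its backing buffer in place. Even then CPython makes no hard guarantees, since intermediate copies may exist elsewhere.

import ctypes

def wipe(buf):
    # Obtain a ctypes view that shares memory with the bytearray (no copy).
    view = (ctypes.c_char * len(buf)).from_buffer(buf)
    ctypes.memset(ctypes.addressof(view), 0, len(buf))
    del view  # release the exported buffer so the bytearray is usable again

password = bytearray(b"hunter2")  # mutable, unlike str
try:
    ...  # hand the bytearray (or a memoryview of it) to consumers, briefly
finally:
    wipe(password)

print(password)  # bytearray(b'\x00\x00\x00\x00\x00\x00\x00')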

Related

Strings and Strands in MoarVM

When running Raku code on Rakudo with the MoarVM backend, is there any way to print information about how a given Str is stored in memory from inside the running program? In particular, I am curious whether there's a way to see how many Strands currently make up the Str (whether via Raku introspection, NQP, or something that accesses the MoarVM level; does such a thing even exist at runtime?).
If there isn't any way to access this info at runtime, is there a way to get at it through output from one of Rakudo's command-line flags, such as --target, or --tracing? Or through a debugger?
Finally, does MoarVM manage the number of Strands in a given Str? I often hear (or say) that one of Raku's superpowers is that it can index into Unicode strings in O(1) time, but I've been thinking about the pathological case, and it feels like it would be O(n). For example,
(^$n).map({~rand}).join
seems like it would create a Str with a length proportional to $n that consists of $n Strands; and, if I'm understanding the data structure correctly, that means that indexing into this Str would require checking the length of each Strand, for a time complexity of O(n). But I know that it's possible to flatten a Strand-ed Str; would MoarVM do something like that in this case? Or have I misunderstood something more basic?
When running Raku code on Rakudo with the MoarVM backend, is there any way to print information about how a given Str is stored in memory from inside the running program?
My educated guess is yes, as described below for App::MoarVM modules. That said, my education came from a degree I started at the Unseen University, and a wizard had me expelled for guessing too much, so...
In particular, I am curious whether there's a way to see how many Strands currently make up the Str (whether via Raku introspection, NQP, or something that accesses the MoarVM level; does such a thing even exist at runtime?).
I'm 99.99% sure strands are purely an implementation detail of the backend, and there'll be no Raku or NQP access to that information without MoarVM specific tricks. That said, read on.
If there isn't any way to access this info at runtime
I can see there is access at runtime via MoarVM.
is there a way to get at it through output from one of Rakudo's command-line flags, such as --target, or --tracing? Or through a debugger?
I'm 99.99% sure there are multiple ways.
For example, there's a bunch of strand debugging code in MoarVM's ops.c file starting with #define MVM_DEBUG_STRANDS ....
Perhaps more interesting is what appears to be a veritable goldmine of sophisticated debugging and profiling features built into MoarVM, plus what appear to be Rakudo-specific modules that drive those features, presumably via Raku code. For a dozen or so articles discussing some aspects of those features, I suggest reading timotimo's blog. Browsing GitHub I see ongoing commits related to MoarVM's debugging features for years and on into 2021.
Finally, does MoarVM manage the number of Strands in a given Str?
Yes. I can see that the string handling code (some links are below), which was written by samcv (extremely smart and careful) and, I believe, reviewed by jnthn, has logic limiting the number of strands.
I often hear (or say) that one of Raku's superpowers is that it can index into Unicode strings in O(1) time, but I've been thinking about the pathological case, and it feels like it would be O(n).
Yes, if a backend that supported strands did not manage the number of strands.
But for MoarVM I think the intent is to set an absolute upper bound with #define MVM_STRING_MAX_STRANDS 64 in MoarVM's MVMString.h file, and logic that checks against that (and other characteristics of strings; see this else if statement as an exemplar). But the logic is sufficiently complex, and my C chops sufficiently meagre, that I am nowhere near being able to express confidence in that, even if I can say that that appears to be the intent.
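To make the trade-off concrete, here is a toy rope-like string in Python. This is purely a hypothetical sketch (the class and its flattening rule are mine; only the 64 echoes MVM_STRING_MAX_STRANDS), but it shows why concatenating by strand is cheap, why naive indexing degrades with the strand count, and how a cap bounds that cost.

MAX_STRANDS = 64  # echoes MVM_STRING_MAX_STRANDS, purely for illustration

class Rope:
    def __init__(self, chunks=()):
        self.chunks = list(chunks)  # the "strands"

    def concat(self, s):
        chunks = self.chunks + [s]        # no character copying: add a strand
        if len(chunks) > MAX_STRANDS:     # past the cap, flatten to one chunk
            chunks = ["".join(chunks)]
        return Rope(chunks)

    def __getitem__(self, i):
        # Naive indexing walks the strands: O(number of strands).
        for chunk in self.chunks:
            if i < len(chunk):
                return chunk[i]
            i -= len(chunk)
        raise IndexError(i)

r = Rope()
for piece in ("foo", "bar", "baz"):
    r = r.concat(piece)
print(r[4])  # "a", found by walking into the second strand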
For example, (^$n).map({~rand}).join seems like it would create a Str with a length proportional to $n that consists of $n Strands
I'm 95% confident that indexing into strings constructed by simple joins like that will be O(1).
This is based on me thinking that a Raku/NQP level string join operation is handled by MVM_string_join, and my attempts to understand what that code does.
But I know that it's possible to flatten a Strand-ed Str; would MoarVM do something like that in this case?
If you read the code you will find it's doing very sophisticated handling.
Or have I misunderstood something more basic?
I'm pretty sure I will have misunderstood something basic so I sure ain't gonna comment on whether you have. :)
As far as I understand it, the fact that MoarVM implements strands (i.e., concatenating two strings will only result in the creation of a strand that consists of "references" to the original strings) is really just that: an implementation detail.
You can implement the Raku Programming Language without needing to implement strands. Therefore there is no way to introspect this, at least to my knowledge.
There has been a PR to expose the nqp:: op that would actually concatenate strands into a single string, but that has been refused / closed: https://github.com/rakudo/rakudo/pull/3975

Tools for Domain Specific Language/Functions

Our users can enter questions that get answered by students. Our users need an extensible, flexible way to define the correct answers to these questions (which are stored as a simple string).
I would like to expose a library of domain specific functions that users can call on to describe the correct answer. Eg:
exact_match("puppy") // means the correct answer is the string 'puppy'
or
contains("yesterday") // means any answer with the word 'yesterday' is correct
The naive implementation would involve eval'ing user-supplied strings in a sandboxed runtime (like a JavaScript VM or Ruby VM). But I'd like to go further and only allow specific functions to be called; any other scripting would be discarded. Such that:
puts("foo"); contains("yesterday")
would be illegal, since we don't expose or allow puts().
How can I constrain the execution environment to only run a whitelist of functions? Or is there a different approach to build this kind of external-facing DSL instead of trying to constrain an existing language to a subset of functions?
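For concreteness, here is a minimal sketch of the kind of constrained evaluator I have in mind, in Python (all names here are invented for illustration): parse the expression with the ast module, allow only whitelisted calls with string-literal arguments, and reject everything else before evaluating.

import ast

# Hypothetical registry mapping DSL function names to matcher factories.
ALLOWED = {
    "exact_match": lambda expected: lambda answer: answer == expected,
    "contains": lambda word: lambda answer: word in answer.split(),
}

def compile_rule(source):
    # Parse one expression; statement lists like 'puts("x"); f()' fail here.
    tree = ast.parse(source, mode="eval")
    call = tree.body
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError("only a single function call is allowed")
    if call.func.id not in ALLOWED:
        raise ValueError("unknown function: " + call.func.id)
    if call.keywords:
        raise ValueError("keyword arguments are not allowed")
    args = []
    for node in call.args:
        if not (isinstance(node, ast.Constant) and isinstance(node.value, str)):
            raise ValueError("arguments must be string literals")
        args.append(node.value)
    return ALLOWED[call.func.id](*args)

rule = compile_rule('contains("yesterday")')
print(rule("it rained yesterday"))  # True
# compile_rule('puts("foo")')       # ValueError: unknown function: puts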
I would check out MPS by JetBrains if I were you; it's an open-source DSL creation tool. I have never used it myself, but from everything I have seen of it, it's very intuitive, and all of their other products are incredibly powerful.
Just because you're creating a DSL, that doesn't necessarily mean that you have to give the user the ability to enter the code in text.
The key to this is providing a list of method names and your special keyword for them, the funCode tag in the code example below:
Create a mapping from keyword to code, let them define everything they need, and then use it. I would also build my own XML parser, so that it's not hackable, at least not via the usual list of zero-day exploits.
<strDefs>
  <strDef><strNam>sickStr</strNam>
    <strText>sick</strText><strNum>01</strNum></strDef>
  <strDef><strNam>pupStr</strNam>
    <strText>puppy</strText><strNum>02</strNum></strDef>
</strDefs>
<funDefs>
  <funDef><funCode>pfContainsStr</funCode><funLabel>contains</funLabel>
    <funNum>01</funNum></funDef>
  <funDef><funCode>pfXact</funCode><funLabel>exact_match</funLabel>
    <funNum>02</funNum></funDef>
</funDefs>
<queries>
  <query><fun>01</fun><str>02</str></query>
</queries>
The above XML represents the idea and the structure of what to do, but in practice it would be captured through a user interface, so the user is constrained. The user interface code that allows the data entry of the above should be running on your server, and they only interact with it. Any code that runs in their browser is hackable, because they can just save the page, edit the HTML (and/or JavaScript), and run that, which is their code now, not yours anymore.
You can't really open the door (Pandora's box) and allow just anyone to write just any code and have it evaluated/interpreted by the language parser, because some hacker is going to exploit it. You must lock down the strings, probably by having users enter them into your database in an earlier step, and each string gets its own token that YOU generate (a SQL Server primary key is very simple, usable, and secure), but give them a display representation so it's readable to them.
Then give them a list of methods / functions they can use, along with a token (a primary key can also serve here, perhaps with a kind of table prefix) and also a display representation (label).
If you have them put all of their labels into yet another table, you can have SQL make sure that all of their labels are unique to each other in the whole "language", and then you can allow them to try to define their expressions in the language they want to use. This has the advantage that foreign languages can be used, but you don't have to do anything terribly special.
An important piece would be the verify button, that would translate their expression into unique tokens and back again, checking that the round-trip was successful. If it wasn't successful, there's some kind of ambiguity, and you might be able to allow them an option to use the list of tokens as the source in that case.
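For instance, a hypothetical sketch of that verify step in Python (the labels and tokens here are invented for illustration):

# Label <-> token tables, as would be loaded from the database.
label_to_token = {"contains": "F01", "exact_match": "F02",
                  "puppy": "S02", "sick": "S01"}
token_to_label = {token: label for label, token in label_to_token.items()}

def verify(expression_labels):
    # Translate the user's labels to tokens and back again;
    # a lossless round trip means the expression is unambiguous.
    tokens = [label_to_token[label] for label in expression_labels]
    round_trip = [token_to_label[token] for token in tokens]
    return round_trip == expression_labels, tokens

ok, tokens = verify(["contains", "puppy"])
print(ok, tokens)  # True ['F01', 'S02']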
If you heavily rely on set-based logic for the underlying foundation of the language and your tables, you should be able to produce a coherent DSL that works. Many DSL creation problems are ones of integrity, where there are underlying assumptions that are contradictory, unintentionally mutually exclusive, or nonsensical. Truth is an unshakeable foundation. Anything else has a lie somewhere -- that you're trying to build on.
Sudoku is illustrative here. When you screw up a Sudoku, you often don't know that you have done so, and you keep building on that false foundation until you reach the end of the puzzle, where one whole string of assumptions disagrees with a different string of assumptions. They can't both be true, but you can't tell where you went wrong, because you're too far away from the mistake and cannot (easily) work backwards. All the steps you took look correct. A DSL, a database schema, and code are all this way. Baby steps that are double- and even triple-checked, and hopefully "correct by inspection", are the best way to "grow" a DSL: slowly, piece by piece. The best way to not have flaws is to not add them in the first place.
You don't want bugs in your DSL. Keep it spartan. KISS - Keep it simple, Sparticus! And I have personally found that keeping it set-based, if not overtly, under the covers, accomplishes this very well.
Finally, to be able to think this way, I've studied languages for a long time, and have cultivated a curiosity about how languages have come to be. Books are a good quality source of information, as they have a higher quality level than the internet, which is nevertheless also an indispensable source. Some of my favorite languages: Forth, Factor, SETL, F#, C#, Visual FoxPro (especially for its embedded SQL), T-SQL, Common LISP, Clojure, and probably my favorite, Dylan, an INFIX Lisp without parentheses that Apple experimented with and abandoned, with a syntax that seems to me reminiscent of Pascal, which I sort of liked. The language list is actually much longer than that (and I haven't written code for many of them -- just studied them or their genesis), but that's enough for now.
One of my favorite books, and immensely interesting for the "people" side of it, is Masterminds of Programming: Conversations with the Creators of Major Programming Languages, by Federico Biancuzzi and Chromatic (O'Reilly, Theory in Practice series).
By the way, don't let them compromise the integrity of your DSL -- require that it is expressible set-based, and things should go well (IMHO). I hope it works out well for you. Add a comment to my answer telling me how it worked out, if you think of it. And don't forget to choose my answer if you think it's the best! We work hard for the money! ;-)

When exactly am I required to set objects to nothing in classic ASP?

On one hand the advice to always close objects is so common that I would feel foolish to ignore it (e.g. VBScript Out Of Memory Error).
However it would be equally foolish to ignore the wisdom of Eric Lippert, who appears to disagree: http://blogs.msdn.com/b/ericlippert/archive/2004/04/28/when-are-you-required-to-set-objects-to-nothing.aspx
I've worked to fix a number of web apps with OOM errors in classic ASP. My first (time-consuming) task is always to search the code for unclosed objects, and objects not set to nothing.
But I've never been 100% convinced that this has helped. (That said, I have found it hard to pinpoint exactly what DOES help...)
This post by Eric is talking about standalone VBScript files, not classic ASP written in VBScript. See the comments, then Eric's own comment:
Re: ASP -- excellent point, and one that I had not considered. In ASP it is sometimes very difficult to know where you are and what scope you're in.
So from this I can say that what he wrote isn't relevant for classic ASP, i.e. you should always Set everything to Nothing.
As for memory issues, I think that assigning objects (or arrays) to global scope like Session or Application is the main reason for such problems. That's the first thing I would look for, and I'd rewrite the code to hold only a single identifier in the Session, then use the database to manage the data.
Basically by setting a COM object to Nothing, you are forcing its terminator to run deterministically, which gives you the opportunity to handle any errors it may raise.
If you don't do it, you can get into a situation like the following:
Your code raises an error
The error isn't handled in your code and therefore ...
other objects instantiated in your code go out of scope, and their terminators run
one of the terminators raises an error
and the error that is propagated is the one raised by the terminator, masking the original error.
I do remember from the dark and distant past that it was specifically recommended to close ADO objects. I'm not sure if this was because of a bug in ADO objects, or simply for the above reason (which applies more generally to any objects that can raise errors in their terminators).
And this recommendation is often repeated, though often without any credible reason. ("While ASP should automatically close and free up all object instantiations, it is always a good idea to explicitly close and free up object references yourself").
It's worth noting that in the article, he's not saying you should never worry about setting objects to nothing - just that it should not be the default behaviour for every object in every script.
Though I do suspect he's a little too quick to dismiss the "I saw this elsewhere" method of acquiring coding behaviour, I'm willing to bet that there's a reason Eric didn't consider, one that has caused this to be passed along as a hard 'n' fast rule: dealing with junior programmers.
When you start looking more closely at the Dreyfus model of skill acquisition, you see that at the beginning levels of acquiring a new skill, learners need simple-to-follow recipes. They do not yet have the knowledge or ability to make the judgement calls Eric qualifies the recommendation with later on.
Think back to when you first started programming. Could you readily judge whether you were "set[ting] expensive objects to Nothing when you are done with them if you are done with them well before they go out of scope"? Did you really know which objects were expensive, or when they truly went out of scope?
Thus, most entry level programmers are simply told "always set every object to Nothing when you are done with it" because it is within their grasp to understand and follow. Unfortunately, not many programmers take the time to self-educate, learn, and grow into the higher-level Dreyfus stages where you can use the more nuanced situational approach.
And then we come back to my earlier statement: even the best of us started out at that earlier stage, where we reflexively closed all objects because that was the best we were capable of. We left large bodies of code that people look at now, projecting our current competence backwards onto that earlier work and assuming we did it for reasons they don't yet understand.
I've got to get going, but I hope to expand this a little further...

Do Lisp apps and webapps need special input sanitizing?

EDIT 3: Quite a few new developments have happened since I asked this question. Basically I wasn't "seeing things": webapps written in Clojure have been found to be vulnerable, which prompted changes in Clojure 1.5 and a very heated discussion on the Clojure Google groups.
Here's a quote from someone on Hacker News about the changes in Clojure 1.5:
Another slightly interesting thing is the sudden enhancement to read-eval and EDN. That's mainly because of the rough weather Ruby/Rubygems was in with the YAML exploits, which caused a heated discussion on how the Clojure reader should act by default.
Holes have been found and it's too late to really fix Clojure, so read-eval will still ship set to true by default (because otherwise it would break too many things). Anyone parsing input in Clojure should not use the default read functions but the EDN ones.
So I certainly wasn't seeing things and it didn't take long (not even 18 months) for people to find ways to attack common Clojure webapp stacks.
EDIT 2: I didn't know it, but my question is a dupe of the following question (which has been described as a 'killer question'): Lisp data security/validation
If anyone's interested in the answer(s) to this question, I'd suggest they open the above question and read the answers made there by Lisp gurus instead of the ones of the type "nothing to see here, move along, it's just like PHP or JavaScript".
EDIT: I'd like to know if, somehow, because it is Lisp, it would be "easier" for an attacker to transform "data" (i.e. "crafted user input with a malicious intent") into "code". For example, do I need to escape/replace all the parentheses in the user input before starting to "evaluate" / parse or whatever the data?
Original question
I'm still reading about Lisp and suddenly I was wondering: with this entire "code is data" / "data is code" thing, do Lisp programs need to perform input sanitizing in order to prevent attacks?
I was thinking specifically of webapps, say when a user does some HTTP POST.
What if the data he's sending contains things like:
This is some malicious (eval '(nasty-stuff (...)) or whatever.
(I'm no Lisp programmer; it's just an example of what I've got in mind, not actual working malicious code)
Is there anything special to keep in mind due to how Lisp works? For example if some dark-side hacker would know that some webserver is running on Clojure, can he exploit that fact and then inject "code between parentheses" that would then be evaluated on the webserver?
Is this a concern at all when receiving/parsing user data (and hence potentially crafted data) from Lisp?
I have written some webapps in Lisp (i.e. Common Lisp) and here are the things I've kept in mind:
if you use read, you should always set *read-eval* to nil for any untrusted data
if you are dealing with code generation (for example HTML, JS, CSS or SQL generation, which is very common in Lisp-land), you shouldn't forget to use the sanitizing facilities provided by the corresponding libraries (rather than using raw input strings)
Basically, that's all. Moreover, since it's Lisp, it usually makes your system less prone to attack, because:
there are no standard attacks (as Lisp's use is relatively rare)
the system is rather secure in terms of defaults; this isn't unique, but many web-oriented languages (PHP first among them) suffer from insecurity by default, although this is mitigated by modern frameworks
You should always assume that injection attacks are possible until proven otherwise. Without knowing more about your specific Lisp environment and what you are comparing it with, it is impossible to answer whether you need "special" sanitization.
We know that machine code attacks are possible.
We know that SQL injection is possible.
We should assume that it is possible to hijack any turing-complete system, whether it is hardware or software.
Note that the soft barrier between "code" and "data" is not unique to Lisp. Perl, once the workhorse of the web world, has eval. So does PHP. It looks like Java bytecode injection may be possible as well.
It really does boil down to: don't use READ and don't use EVAL. You need to know exactly what you are sending to either or both of those functions, as well as the contexts within which they are executed. If you do not call either of these, then you're fine.
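For comparison, the same read-versus-evaluate distinction can be sketched in Python (an analogy of mine, not Lisp code): a data-only reader such as ast.literal_eval plays the role of READ with *read-eval* disabled, or of Clojure's EDN reader, while bare eval is the dangerous path.

import ast

untrusted = '__import__("os").system("echo pwned")'

# eval(untrusted) would treat the input as code and execute the shell
# command; the moral equivalent of READ plus EVAL on user input.

# A data-only reader accepts literals and rejects everything else:
try:
    ast.literal_eval(untrusted)
except ValueError:
    print("rejected: input is not a plain literal")

print(ast.literal_eval('{"answer": 42}'))  # plain data parses fine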

Code obfuscation usage in various languages

I recently learned about code obfuscation. It's a nice thing to do when you have spare time, but I have a different question: why do it?
First, there are languages in which I am sure it's a great thing: interpreted ones, like PHP, JavaScript and many more. There it seems like a good and more secure measure.
Second, there are languages where this seems to have no real effect to me: all the natively compiled languages. Take C, for example: when compiled, all the variable names, function names, and most obfuscation techniques go away. If anything survives into native code, it would be things like recursion instead of for loops and so on, but disassembled code will have disassembler-generated identifiers instead of names anyway, right?
And the last category are languages I am not quite sure about, and that's the main reason I ask. These languages are Java, C# (.NET), and lastly Silverlight as used in WP7. I ask because I read an article stating that on WP7, code obfuscation helps protect apps from being hacked. But I always thought of bytecode as being very similar to standard assembler code, and therefore as not retaining any information about real pre-compilation variable names, function names, etc. So, where is the truth?
Do it if you want, but don't expect any determined person to be scared away by it. There exist de-obfuscators, and people can read obfuscated code as well (just as there are people who can read optimized assembly and reconstruct the original C code). Code obfuscation gives you a false sense of security and might deter a person who is merely curious, instead of deterring those who are serious about stealing your code. Schneier aptly calls this kind of thing "security theater".
Yes, many modern languages that retain more information about the source can be obfuscated better than those that are compiled straight to machine code; for the latter, the compiler already obscures quite a lot through optimization. Your notion of bytecode being akin to traditional assembler is slightly wrong here, though. .NET bytecode in particular retains enough metadata to reconstruct the original source almost exactly (see Reflector). What isn't retained are the names of local variables and arguments to methods, but you still need, and retain, the method and class names.
Another issue you should be aware of: if you give your customers an obfuscated executable and your program crashes, make sure you have a way of getting the real stack trace back instead of the obfuscated one. Saying "Sorry, I cannot determine the root cause of why my program killed hours of your work, since I chose to obfuscate it" isn't going to cut it, I guess :-)
Obfuscation is a common technique for mobile applications where you have hardware restrictions. Obfuscated code tends to have shorter identifiers and therefore smaller binaries.