Do Lisp apps and webapps need special input sanitizing? - security

EDIT 3 Quite some new development have happened since I asked this question. Basically I wasn't "seeing things" and webapps written in Clojure have been found to be vulnerable, which prompted changes in Clojure 1.5 and very heated discussion on the Clojure Google groups.
Here's a quote from someone on Hacker News about the changes in Clojure 1.5:
Another slightly interesting thing is the sudden enhancement to
read-eval and EDN[2]. That's mainly because of the rough weather
Ruby/Rubygems was in with the YAML-exploits, which caused a heated
discussion on how the Clojure reader should act by default.
Holes have been found and it's too late to really fix Clojure, so read-eval shall still ship by default set to true (because otherwise it would break too many things). And anyone parsing inputs in Clojure should not use the default read functions but the EDN ones.
So I certainly wasn't seeing things and it didn't take long (not even 18 months) for people to find ways to attack common Clojure webapp stacks.
EDIT 2 I didn't know it but my question is a dupe of the following question (which has been described as a 'killer question'): Lisp data security/validation
If anyone's interested in the answer(s) to this question, I'd suggest they open the above question and read the answers made there by Lisp gurus instead of the ones of the type "nothing to see here, move along, it's just like PHP or JavaScript".
EDIT: I'd like to know if, somehow, because it is Lisp, it would be "easier" for an attacker to transform "data" (i.e. "crafted user input with a malicious intent") into "code". For example, do I need to escape/replace all the parentheses in the user input before starting to "evaluate" / parse or whatever the data?
Original question
I'm still reading about Lisp and suddenly I was wondering, with this entire "code is data" / "data is code" thing, do Lisp need to perform input sanitizing in order to prevent attacks?
I was thinking specifically of webapps, say when a user does some HTTP POST.
What if the data he's sending contains things like:
This is some malicious (eval '(nasty-stuff (...)) or whatever.
(I'm no Lisp programmer, it's just an example of what I've got in mind, it's not meant to be actually mean code)
Is there anything special to keep in mind due to how Lisp works? For example if some dark-side hacker would know that some webserver is running on Clojure, can he exploit that fact and then inject "code between parentheses" that would then be evaluated on the webserver?
Is this a concern at all when receiving/parsing user data (and hence potentially crafted data) from Lisp?

I have written some webapps in Lisp (i.e. Common Lisp) and here are the things I've kept in mind:
if you use read, you should always set *read-eval* to nil for any untrusted data
if you are dealing with code generation - for example, HTML, JS, CSS or SQL generation - which is very common in Lisp-land, you shouldn't forget to use the sanitizing facilities provided by the corresponding libraries (not use raw input strings)
Basically, that's all. Moreover, since it's Lisp, it usually makes your system less prone to attack, because:
there are no standard attacks (as Lisp's use is relatively rare)
the system is rather secure in terms of defaults - this isn't unique, but many web-oriented languages (like PHP in the first place) suffer from insecurity by default, although it is mitigated by modern frameworks

You should always assume that injection attacks are possible until proven otherwise. Without knowing more about your specific Lisp environment and what you are comparing it with, it is impossible to answer whether you need "special" sanitization.
We know that machine code attacks are possible.
We know that SQL injection is possible.
We should assume that it is possible to hijack any turing-complete system, whether it is hardware or software.
Note that the soft barrier between "code" and "data" is not unique to Lisp. perl, once the workhorse of the web world has eval. So does PHP. It looks like Java bytecode injection may be possible as well.

It really does boil down to: don't use READ and don't use EVAL. You need to know exactly what you are sending to either or both of those functions, as well as the contexts within which they are executed. If you do not call either of these, then you're fine.

Related

Tools for Domain Specific Language/Functions

Our users can enter questions that get answered by students. Our users need a extensible, flexible way to define the correct answers to these questions (which are stored as a simple string).
I would like to expose a library of domain specific functions that users can call on to describe the correct answer. Eg:
exact_match("puppy") // means the correct answer is the string 'puppy'
or
contains("yesterday") // means any answer with the word 'yesterday' is correct
The naive implementation would involve eval'ing user supplied strings in a sandboxed runtime (like a javascript vm or ruby vm). But I'd like to go further and only allow specific functions to be called. Any other scripting would be discarded. Such that:
puts("foo"); contains("yesterday")
would be illegal. Since we don't expose or allow puts().
How can I constrain the execution environment to only run a whitelist of functions? Or is there a different approach to build this kind of external-facing DSL instead of trying to constrain an existing language to a subset of functions?
I would check out MPS by JetBrains if I were you, its an open source DSL creation tool. I have never used it myself, but from everything I have seen on it, it's very intuitive; and all of their other products are incredibly powerful.
Just because you're creating a DSL, that doesn't necessarily mean that you have to give the user the ability to enter the code in text.
The key to this is providing a list of method names and your special keyword for them, the "FunCode" tag in the code example below:
Create a mapping from keyword to code, and letting them define everything they need, and then use it. And I would actually build my own XML parser so that it's not hackable, at least not on a list of zero-day-exploits hackable.
<strDefs>
<strDef><strNam>sickStr</strNam>
<strText>sick</strText><strNum>01</strNum><strDef>
<strDef><strNam>pupStr</strNam>
<strText>puppy</strText><strNum>02</strNum><strDef>
</strDefs>
<funDefs>
<funDef><funCode>pfContainsStr</funCode><funLabel>contains</funLabel>
<funNum>01</funNum></funDef>
<funDef><funCode>pfXact</funCode><funLabel>exact_match</funLabel>
<funNum>02</funNum></funDef>
</funDefs>
<queries>
<query><fun>01</fun><str>02</str>
</query>
</queries>
The above XML more represents the idea and the structure of what to do, but rather in a user interface, so the user is constrained. The user interface code that allows the data-entry of the above data should be running on your server, and they only interact with it. Any code that runs on their browser is hackable, because they can just save the page, edit the HTML (and/or JavaScript), and run that, which is their code now, not yours anymore.
You can't really open the door (pandora's box) and allow just anyone to write just any code and have it evaluated / interpreted by the language parser, because some hacker is going to exploit it. You must lock down the strings, probably by having them enter them into your database in an earlier step, and each string gets its own token that YOU generate (a SQL Server primary key is very simple, usable, and secure), but give them a display representation so it's readable to them.
Then give them a list of methods / functions they can use, along with a token (a primary key can also serve here, perhaps with a kind of table prefix) and also a display representation (label).
If you have them put all of their labels into yet another table, you can have SQL make sure that all of their labels are unique to each other in the whole "language", and then you can allow them to try to define their expressions in the language they want to use. This has the advantage that foreign languages can be used, but you don't have to do anything terribly special.
An important piece would be the verify button, that would translate their expression into unique tokens and back again, checking that the round-trip was successful. If it wasn't successful, there's some kind of ambiguity, and you might be able to allow them an option to use the list of tokens as the source in that case.
If you heavily rely on set-based logic for the underlying foundation of the language and your tables, you should be able to produce a coherent DSL that works. Many DSL creation problems are ones of integrity, where there are underlying assumptions that are contradictory, unintentionally mutually exclusive, or nonsensical. Truth is an unshakeable foundation. Anything else has a lie somewhere -- that you're trying to build on.
Sudoku is illustrative here. When you screw up a Sudoku, you often don't know that you have done so, and you keep building on that false foundation, until you get to the completion of the puzzle, and one whole string of assumptions disagrees with a different string of assumptions. They can't both be true. But you can't tell where you went wrong because you're too far away from the mistake and can not work backwards (easily). All steps taken look correct. A DSL, a database schema, and code, are all this way. Baby steps, that are double- and even triple-checked, and hopefully "correct by inspection", are the best way to "grow" a DSL, slowly, piece-by-piece. The best way to not have flaws is to not add them in the first place.
You don't want bugs in your DSL. Keep it spartan. KISS - Keep it simple, Sparticus! And I have personally found that keeping it set-based, if not overtly, under the covers, accomplishes this very well.
Finally, to be able to think this way, I've studied languages for a long time, and have cultivated a curiosity about how languages have come to be. Books are a good quality source of information, as they have a higher quality level than the internet, which is nevertheless also an indispensable source. Some of my favorite languages: Forth, Factor, SETL, F#, C#, Visual FoxPro (especially for its embedded SQL), T-SQL, Common LISP, Clojure, and probably my favorite, Dylan, an INFIX Lisp without parentheses that Apple experimented with and abandoned, with a syntax that seems to me reminiscent of Pascal, which I sort of liked. The language list is actually much longer than that (and I haven't written code for many of them -- just studied them or their genesis), but that's enough for now.
One of my favorite books, and immensely interesting for the "people" side of it, is "Masterminds of Programming: Conversations with the Creators of Major Programming Languages" (Theory in Practice (O'Reilly)) 1st Edition, Kindle Edition
by Federico Biancuzzi (Author), Chromatic (Author)
By the way, don't let them compromise the integrity of your DSL -- require that it is expressible set-based, and things should go well (IMHO). I hope it works out well for you. Add a comment to my answer telling me how it worked out, if you think of it. And don't forget to choose my answer if you think it's the best! We work hard for the money! ;-)

Can you disallow a common lisp script called from common lisp to call specific functions?

Common Lisp allows to execute/compile code at runtime. But I thought for some (scripting-like) purposes it would be good if one could disallow a user-script to call some functions (especially for application extensions). One could still ask the user if he will allow an extension to access files/... I'm thinking of something like the Android permission system for Common Lisp. Is this possible without rewriting the evaluation code?
The problem I see is, that in Common Lisp you would probably want a script to be able to use reader macros and normal macros and for the latter operators like intern, but those would allow you to get arbitrary symbols (by string manipulation & interning), so simply scanning the code before evaluation won't suffice to ensure that specific functions aren't called.
So, is there something like a lock for functions? I thought of using fmakunbound / makunbound (and keeping the values in a local variable), but would that be possible in a multi-threaded environment?
Thanks in advance.
This is not part of the Common Lisp specification and there is no Common Lisp implementation that is extended to make this kind of restriction easy.
It seems to me like it would be easier to use operating system restrictions (e.g. rlimit, capabilities, etc) to enforce what you want on the Common Lisp process.
This is not an unusual desire, i.e. to run untrusted 3rd party code in a sandbox.
You can hand craft a sandbox by creating a custom parser and interpreter for your scripting language. It is pedantic, but true, than any program with an API is providing such a service. API designers and implementors needs to worry about the vile users.
You can still call eval or the compiler to run your sandbox scripts. It just means you need to assure that your reader, parser and language decline to provide access to any risky functionality.
You can use a lisp package to create a good sandbox. You can still use s-expressions for your scripting language's syntax, but you must cripple the standard reader so the user can't escape package-sandbox. You can still use the evaluator and the compiler, but you need to be sure the package you have boxed the user into contains no functionality that he can use to do inappropriate things.
Successful sandbox design and construction is easier when you start with an empty sandbox and slowly add functionality. Common Lisp is a big language and that creates a huge surface for attacker to poke at. So if you create a sandbox out of a package it's best to start with an empty package and add functions one at a time. Thinking thru what risks they create. The same approach is good when creating your crippled reader. Don't start with the full reader and throw things away, start with a useless reader and add things. Sadly taking that advice creates a pretty significant cost to getting started. But, if you look around I suspect you can find an existing safe reader.
Xach's suggestion is another way to go and in many case more straight forward.

Can anyone explain the design decisions behind Autolisp/visual lisp to me?

I wonder can anyone explain the design rationale behind the following features of autolisp / visual lisp? To me they seem to fly in the face of accepted software practice ... am I missing something?
All variables are global by default (ie unless placed after a / in the function arguments)
Reading/writing data from autocad requires putting stuff into an association list with lots of magic numbers. 10 means x/y coordinates, 90 means length of the coordinate list, 63 means colour, etc. Ok you could store these in some constants but that would mean yet more globals, and the documentation encourages you to use the magic numbers directly.
Lisp is a functional-style language, which encourages programming by recursion over iteration, but tail recursion is afaik not optimised in visual lisp leading to horrendous call stacks - unless, of course you iterate. But loop syntax is very restrictive; e.g. you can't break out of or return a value from a loop unless you put some kind of flag in the termination condition. Result, ugly code.
Generally you are forced to declare variables all over the place which flies in the face of functional programming - so why use a functional(-ish) language?
Lisp isn't a language, it's a group of sometimes surprisingly different languages. Scheme and Clojure are the functional members of the family. Common Lisp, and the more specialized breeds like Elisp aren't particularly functional and don't inherently encourage functional programming or recursion. CL in fact includes a very flexible object system, an extremely flexible iteration DSL, and doesn't guarantee optimized tail calls (Scheme dialects do, but not Lisps in general; that's the pitfall in thinking of "Lisp" as a single language).
Now that we have that cleared up, AutoLisp is an implementation from 1986 based on an early version of XLISP (the earliest of which was published in 1983).
The reason that it might fly in the face of currently accepted programming practice is that it predates currently accepted programming practice. Another thing to keep in mind is that the cheapest netbook available today is several hundred times more powerful than what a programmer could expect to have access to back in the mid 80s. Meaning that even if a given feature was accepted to be excellent, CPU or memory constraints may have prevented its implementation in a commercial language.
I've never programmed in Autolisp/Visual Lisp specifically, and the stuff you cite sounds bloody annoying, but it may have had some performance/memory advantage that justified it at the time.
If I remember correctly, AutoLisp is a fork from an early version of XLisp (some sources claim it was XLisp 1.0 (see this C2 article).
XLisp 1.0 is a 1-cell lisp (functions and variables share the same name-space) with some rather odd oddities to it.
You can add dynamic scoping into the mix btw, and if you don't know what it is consider yourself lucky. But actually not all your four points are that big of a deal IMO:
"Undeclared vars are created automatically as global." Same as in CL is it not (via setq)? The other option is to fail, and that's not a very attractive one for the language which is supposed to be used for quick-n-dirty scripting.
"magic numbers" are DXF-codes, which you're right are major inconvenience as they tend to change with the changing ACAD versions sometimes (thankfully, rarely). That's just how it is. Fixing it would require a major overhaul, introducing some "schemas" and what not, and why would "they" bother? AutoLISP was left in its state as of 1992 approximately, and never bothered with since. Visual LISP itself is entirely different and much more capable system, but it is all locked out for the regular user, and only made to serve one goal - to emulate the old AutoLISP as faithfully as possible (except where it added new VBA-related features in the later half of the 1990s, and was locked since then too).
(while (not done) ...) is not that ugly. There's no tail optimization guarantee, yes, just as there isn't one in CL and Haskell (that last one really stumbles me - there's no guaranteed way to encode a loop in Haskell in constant space without monads - how about that?).
"you're forced to declare vars all over the place" here I do not follow you. You declare them were you supposed to declare them - in the function's internal arguments list. What other places do you mean? I don't know of any.
In reality the biggest stumbling block of AutoLISP is its dynamic name resolution IMO, but that's how it was in Xlisp, only few years after Scheme first came out. Then also it's its immutable data store, but that was done mainly for simplicity of implementation, and to prevent too much confusion and hence questions, from the user base, I guess.

How to Protect an Exe File from Decompilation

What are the methods for protecting an Exe file from Reverse Engineering.Many Packers are available to pack an exe file.Such an approach is mentioned in http://c-madeeasy.blogspot.com/2011/07/protecting-your-c-programexe-files-from.html
Is this method efficient?
The only good way to prevent a program from being reverse-engineered ("understood") is to revise its structure to essentially force the opponent into understanding Turing Machines. Essentially what you do is:
take some problem which generally proven to be computationally difficult
synthesize a version of that whose outcome you know; this is generally pretty easy compared to solving a version
make the correct program execution dependent on the correct answer
make the program compute nonsense if the answer is not correct
Now an opponent staring at your code has to figure what the "correct" computation is, by solving algorithmically hard problems. There's tons of NP-hard problems that nobody has solved efficiently in the literature in 40 years; its a pretty good bet if your program depends on one of these, that J. Random Reverse-Engineer won't suddenly be able to solve them.
One generally does this by transforming the original program to obscure its control flow, and/or its dataflow. Some techniques scramble the control flow by converting some control flow into essentially data flow ("jump indirect through this pointer array"), and then implementing data flow algorithms that require precise points-to analysis, which is both provably hard and has proven difficult in practice.
Here's a paper that describes a variety of techniques rather shallowly but its an easy read:
http://www.cs.sjsu.edu/faculty/stamp/students/kundu_deepti.pdf
Here's another that focuses on how to ensure that the obfuscating transformations lead to results that are gauranteed to be computationally hard:
http://www.springerlink.com/content/41135jkqxv9l3xme/
Here's one that surveys a wide variety of control flow transformation methods,
including those that provide levels of gaurantees about security:
http://www.springerlink.com/content/g157gxr14m149l13/
This paper obfuscates control flows in binary programs with low overhead:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.167.3773&rank=2
Now, one could go through a lot of trouble to prevent a program from being decompiled. But if the decompiled one was impossible to understand, you simply might not bother; that's the approach I'd take.
If you insist on preventing decompilation, you can attack that by considering what decompilation is intended to accomplish. Decompilation essentially proposes that you can convert each byte of the target program into some piece of code. One way to make that fail, is to ensure that the application can apparently use each byte
as both computer instructions, and as data, even if if does not actually do so, and that the decision to do so is obfuscated by the above kinds of methods. One variation on this is to have lots of conditional branches in the code that are in fact unconditional (using control flow obfuscation methods); the other side of the branch falls into nonsense code that looks valid but branches to crazy places in the existing code. Another variant on this idea is to implement your program as an obfuscated interpreter, and implement the actual functionality as a set of interpreted data.
A fun way to make this fail is to generate code at run time and execute it on the fly; most conventional languages such as C have pretty much no way to represent this.
A program built like this would be difficult to decompile, let alone understand after the fact.
Tools that are claimed to a good job at protecting binary code are listed at:
https://security.stackexchange.com/questions/1069/any-comprehensive-solutions-for-binary-code-protection-and-anti-reverse-engineeri
Packing, compressing and any other methods of binary protection will only every serve to hinder or slow reversal of your code, they have never been and never will be 100% secure solutions (though the marketing of some would have you believe that). You basically need to evaluate what sort of level of hacker you are up against, if they are script kids, then any packer that require real effort and skill (ie:those that lack unpacking scripts/programs/tutorials) will deter them. If your facing people with skills and resources, then you can forget about keeping your code safe (as many of the comments say: if the OS can read it to execute it, so can you, it'll just take a while longer). If your concern is not so much your IP but rather the security of something your program does, then you might be better served in redesigning in a manner where it cannot be attack even with the original source (chrome takes this approach).
Decompilation is always possible. The statement
This threat can be eliminated to extend by packing/compressing the
executable(.exe).
on your linked site is a plain lie.
Currently many solutions can be used to protect your application from being anti-compiled. Such as compressing, Obfuscation, Code snippet, etc.
You can looking for a company to help you achieve this.
Such as Nelpeiron, the website is:https://www.nalpeiron.com/
Which can cover many platforms, Windows, Linux, ARM-Linux, Android.
What is more Virbox is also can be taken into consideration:
The website is: https://lm-global.virbox.com/index.html
I recommend is because they have more options to protect your source code, such as import table protection, memory check.

Code obfuscation usage in various languages

I recently learned about code obfuscation. Its nice thing to do, when you have spare time, but I have different question. Why to do it?
First, there are languages in which I am sure its great thing - interpreted ones, like php, JavaScript and much more. There it seems like a good and more secure thing.
Second, there are languages where this seems to have no real effect for me - all the native code compiled languages. Take C for example. when compiled, all the variable names, function names, most of obfuscation techniques go away. If some can make it into native code, it would be things like recursion instead of for cycles and so, but disassembled code will anyway have instead of names some disassembler-generated identifiers, right?
And last category are languages I am not quite sure about. And that's the main reason I ask. These languages would be Java, C# (.NET),and the last Silverlight used in WP7. I ask because I read some article that state that on WP7 apps, code obfuscation helps preventing code from hacking. But I always thought of byte-code as being very similar to standard assembler codes, therefore again not having any information about real pre-compilation variable names, function names, etc. So, where is the truth?
Do it if you want, but don't expect any determined person to be scared away by it. There exist de-obfuscators, people can read obfuscated code as well (just as there are people who can read optimized assembly and reconstruct the original C code). Code obfuscation just gives you a false sense of security and might deter a person who is just curious (instead of deterring those who are serious about stealing your code). All it gives you is a false sense of security but no real one. Schneier aptly names this "security theater".
Yes, many modern languages that retain more information about the source can be obfuscated better than those that are compiled right to machine code. For the latter the compiler already does quite a good job with optimization. Your notion of bytecode being akin to traditional assembler is slightly wrong here, though. Especially .NET bytecode retains enough metadata to reconstruct the original source almost exactly (see Reflector). What isn't retained there are the names of local variables and arguments to methods. But you still need and retain the method and class names.
Another issue you should be aware of: If you give your customers an obfuscated executable and your program crashes, make sure you have a way of getting the real stacktrace back instead of the obfuscated one. Saying "Sorry, I cannot determine the root cause of why my program killed hours of your work since I chose to obfuscate it" isn't going to cut it, I guess :-)
Obfuscation is a common technique for mobile applications where you have hardware restrictions. Obfuscated code tends to have shorter identifiers and therefore smaller binaries.

Resources