Does Lisp's treatment of code as data make it more vulnerable to security exploits than a language that doesn't treat code as data? [closed] - security

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
Improve this question
I know that this might be a stupid question but I was curious.
Since Lisp treats code and data the same, does this mean that it's easier to write a payload and pass it as "innocent" data that can be used to exploit programs? In comparison to languages that don't do so?
For e.g. In python you can do something like this.
malicious_str = "print('this is a malicious string')"
user_in = eval(malicious_str)
>>> this is a malicious string
P.S I have just started learning Lisp.

No, I don't think it does. In fact because of what is normally meant by 'code is data' in Lisp, it is potentially less vulnerable than some other languages.
[Note: this answer is really about Common Lisp: see the end for a note about that.]
There are two senses in which 'code can be data' in a language.
Turning objects into executable code: eval & friends
This is the first sense. What this means is that you can, say, take a string or some other object (not all types of object, obviously) and say 'turn this into something I can execute, and do that'.
Any language that can do this has either
to be extremely careful about doing this on unconstrained data;
or to be able to be certain that a given program does not actually do this.
Plenty of languages have equivalents of eval and its relations, so plenty of languages have this problem. You give an example of Python for instance, which is a good one, and there are probably other examples in Python (I've written programs even in Python 2 which supported dynamic loading of modules at runtime, which executes potentially arbitrary code, and I think this stuff is much better integrated in Python 3).
This is also not just a property of a language: it's a property of a system. C can't do this, right? Well, yes it can if you're on any kind of reasonable *nixy platform. Not only can you use an exec-family function, but you can probably dynamically load a shared library and execute code in it.
So one solution to this problem is to, somehow, be able to be certain that a given program doesn't do this. One thing that helps is if there are a finite, known number of ways of doing it. In Common Lisp I think those are probably
eval of course;
unconstrained read (because of *read-eval*);
load;
compile;
compile-file;
and probably some others that I have forgotten.
Well, can you detect calls to those, statically, in a program? Not really: consider this:
(funcall (symbol-function (find-symbol s)) ...)
and now you're in trouble unless you have very good control over what s is: it might be "EVAL" for instance.
So that's frightening, but I don't think it's more frightening than what Python can do, for instance (almost certainly you can poke around in the namespace to find eval or something?). And something like that in a program ought to be a really big hint that bad things might happen.
I think there are probably two approaches to this, neither of which CL adopts but which implementations could (and perhaps even programs written in CL could).
One would be to be able to run programs in such a way that the finite set of bad functions above simply are disallowed: they'd signal errors if you tried to call them. An implementation could clearly do that (see below).
The other would be to have something like Perl's 'tainting' where data which came from a user needs to be explicitly looked-at by the program somehow before it's used. That doesn't guarantee safety of course, but it does make it harder to make silly mistakes: if s above came from user input and was thus tainted you'd have to explicitly say 'it's OK to use it' and, well, then it's up to you.
So this is a problem, but I don't think it's worse than the problems that very many other languages (and language-families) have.
An example of an implementation that can address the first approach is LispWorks: if you're building an application with LW, you typically create the binary with a function called deliver, which has options which allow you to remove the definitions of functions from the resulting binary whether or not the delivery process would otherwise leave them there. So, for instance
(deliver 'foo "x" 5
:functions-to-remove '(eval load compile compile-file read))
would result in an executable x which, whatever else it did, couldn't call those functions, because they're not present, at all.
Other implementations probably have similar features: I just don't know them as well.
But there's another sense in which 'code is data' in Lisp.
Program source code is available as structured data
This is the sense that people probably really mean when they say 'code is data' in Lisp, even if they don't know that. It's worth looking at your Python example again:
>>> eval("exit('exploded')")
exploded
$
So what eval eats is a string: a completely unstructured vector of characters. If you want to know whether that string contains something nasty, well, you've got a lot of work ahead of you (disclaimer: see below).
Compare this with CL:
> (let ((trying-to-be-bad "(end-the-world :now t)"))
(eval trying-to-be-bad))
"(end-the-world :now t)"
OK, so that clearly didn't end the world. And it didn't end the world because eval evaluates a bit of Lisp source code, and the value of a string, as source code, is the string.
If I want to do something nasty I have to hand it an actual interesting structure:
> (let ((actually-bad '(eval (progn
(format *query-io* "? ")
(finish-output *query-io*)
(read *query-io*)))))
(eval actually-bad))
? (defun foo () (foo))
foo
Now that's potentially quite nasty in at least several ways. But wait: in order to do this nasty thing, I had to hand eval a chunk of source code represented as an s-expression. And the structure of that s-expression is completely open to inspection by me. I can write a program which inspects this s-expression in any arbitrary way I like, and decides whether or not it is acceptable to me. That's just hugely easier than 'given this string, interpret it as a piece of source text for the language and tell me if it is dangerous':
the process of turning the sequence of characters into an s-expression has happened already;
the structure of s-expressions is both simple and standard.
So in this sense of 'code is data', Lisp is potentially much safer than other languages which have versions of eval which eat strings, like Python, say, because code is structured, standard, simple data. Lisp has an answer to the terrible 'language in a string' problem.
I am fairly sure that Python does in fact have some approach to making the parse tree available in a standard way which can be inspected. But eval still happily eats strings.
As I said above, this answer is about Common Lisp. But there are many other Lisps of course, which will have varying versions of this problem. Racket for instance probably can really fairly tightly constrain things, using sandboxed execution and modules, although I haven't explored this.

Any language can be exploited if you are not careful.
A well-known attack against Lisp is via the #. reader macro:
(read-from-string "#.(start-the-war)")
will start the war if *read-eval* is non-nil - this is why one should always bind it when reading from an un-trusted stream.
However, this is not directly related to "code is data" doctrine...

Related

How does flexibility affect a language's syntax?

I am currently working on writing my own language(shameless plug), which is centered around flexibility. I am trying to make almost any part of the language syntax exchangeable through things like extensions/plugins. While writing the whole thing, it has got me thinking. I am wondering how that sort of flexibility could affect the language.
I know that Lisp is often referred to as one of the most extensible languages due to its extensive macro system. I do understand that concept of macros, but I am yet to find a language that allows someone to change the way it is parsed. To my knowledge, almost every language has an extremely concrete syntax as defined by some long specification.
My question is how could having a flexible syntax affect the intuitiveness and usability of the language? I know the basic "people might be confused when the syntax changes" and "semantic analysis will be hard". Those are things that I am already starting to compensate for. I am looking for a more conceptual answer on the pros and cons of having a flexible syntax.
The topic of language design is still quite foreign to me, so I apologize if I am asking an obvious or otherwise stupid question!
Edit:
I was just wanting to clarify the question I was asking. Where exactly does flexibility in a language's syntax stand, in terms of language theory? I don't really need examples or projects/languages with flexibility, I want to understand how it can affect the language's readability, functionality, and other things like that.
Perl is the most flexible language I know. That a look at Moose, a postmodern object system for Perl 5. It's syntax is very different than Perl's but it is still very Perl-ish.
IMO, the biggest problem with flexibility is precedence in infix notation. But none I know of allow a datatype to have its own infix syntax. For example, take sets. It would be nice to use ⊂ and ⊇ in their syntax. But not only would a compiler have to recognize these symbols, it would have to be told their order of precedence.
Common Lisp allows to change the way it's parsed - see reader macros. Racket allows to modify its parser, see racket languages.
And of course you can have a flexible, dynamically extensible parsing alongside with powerful macros if you use the right parsing techniques (e.g., PEG). Have a look at an example here - mostly a C syntax, but extensible with both syntax and semantic macros.
As for precedence, PEG goes really well together with Pratt.
To answer your updated question - there is surprisingly little research done on programming languages readability anyway. You may want to have a look at what Dr. Blackwell group was up to, but it's still far from conclusive.
So I can only share my hand-wavy anecdotes - flexible syntax languages facilitates eDSL construction, and, in my opinion, eDSLs is the only way to eliminate unnecessary complexity from code, to make code actually maintainable in a long term. I believe that non-flexible languages are one of the biggest mistakes made by this industry, and it must be corrected at all costs, ASAP.
Flexibility allows you to manipulate the syntax of the language. For example, Lisp Macros can enable you to write programs that write programs and manipulate your syntax at compile-time to valid Lisp expressions. For example the Loop Macro:
(loop for x from 1 to 5
do(format t "~A~%" x))
1
2
3
4
5
NIL
And we can see how the code was translated with macroexpand-1:
(pprint(macroexpand-1 '(loop for x from 1 to 5
do (format t "~a~%" x))))
We can then see how a call to that macro is translated:
(LET ((X 1))
(DECLARE (TYPE (AND REAL NUMBER) X))
(TAGBODY
SB-LOOP::NEXT-LOOP
(WHEN (> X '5) (GO SB-LOOP::END-LOOP))
(FORMAT T "~a~%" X)
(SB-LOOP::LOOP-DESETQ X (1+ X))
(GO SB-LOOP::NEXT-LOOP)
SB-LOOP::END-LOOP)))
Language Flexibility just allows you to create your own embedded language within a language and reduce the length of your program in terms of characters used. So in theory, this can make a language very unreadable since we can manipulate the syntax. For example we can create invalid code that's translated to valid code:
(defmacro backwards (expr)
(reverse expr))
BACKWARDS
CL-USER> (backwards ("hello world" nil format))
"hello world"
CL-USER>
Clearly the above code can become complex since:
("hello world" nil format)
is not a valid Lisp expression.
Thanks to SK-logic's answer for pointing me in the direction of Alan Blackwell. I sent him an email asking his stance on the matter, and he responded with an absolutely wonderful explanation. Here it is:
So the person who responded to your StackOverflow question, saying
that flexible syntax could be useful for DSLs, is certainly correct.
It actually used to be fairly common to use the C preprocessor to
create alternative syntax (that would be turned into regular syntax in
an initial compile phase). A lot of the early esolangs were built this
way.
In practice, I think we would have to say that a lot of DSLs are
implemented as libraries within regular programming languages, and
that the library design is far more significant than the syntax. There
may be more purpose for having variety in visual languages, but making
customisable general purpose compilers for arbitrary graphical syntax
is really hard - much worse than changing text syntax features.
There may well be interesting things that your design could enable, so
I wouldn’t discourage experimentation. However, I think there is one reason why
customisable syntax is not so common. This is related to the famous
programmer’s editor EMACS. In EMACS, everything is customisable - all
key bindings, and all editor functions. It’s fun to play with, and
back in the day, many of us made our own personalised version that
only we knew how to operate. But it turned out that it was a real
hassle that everyone’s editor worked completely differently. You could
never lean over and make suggestions on another person’s session, and
teams always had to know who was logged in order to know whether the
editor would work. So it turned out that, over the years, we all just
started to use the default distribution and key bindings, which made
things easier for everyone.
At this point in time, that is just about enough of an explanation that I was looking for. If anyone feels as though they have a better explanation or something to add, feel free to contact me.

What's so bad about Template Haskell?

It seems that Template Haskell is often viewed by the Haskell community as an unfortunate convenience. It's hard to put into words exactly what I have observed in this regard, but consider these few examples
Template Haskell listed under "The Ugly (but necessary)" in response to the question Which Haskell (GHC) extensions should users use/avoid?
Template Haskell considered a temporary/inferior solution in Unboxed Vectors of newtype'd values thread (libraries mailing list)
Yesod is often criticized for relying too much on Template Haskell (see the blog post in response to this sentiment)
I've seen various blog posts where people do pretty neat stuff with Template Haskell, enabling prettier syntax that simply wouldn't be possible in regular Haskell, as well as tremendous boilerplate reduction. So why is it that Template Haskell is looked down upon in this way? What makes it undesirable? Under what circumstances should Template Haskell be avoided, and why?
One reason for avoiding Template Haskell is that it as a whole isn't type-safe, at all, thus going against much of "the spirit of Haskell." Here are some examples of this:
You have no control over what kind of Haskell AST a piece of TH code will generate, beyond where it will appear; you can have a value of type Exp, but you don't know if it is an expression that represents a [Char] or a (a -> (forall b . b -> c)) or whatever. TH would be more reliable if one could express that a function may only generate expressions of a certain type, or only function declarations, or only data-constructor-matching patterns, etc.
You can generate expressions that don't compile. You generated an expression that references a free variable foo that doesn't exist? Tough luck, you'll only see that when actually using your code generator, and only under the circumstances that trigger the generation of that particular code. It is very difficult to unit test, too.
TH is also outright dangerous:
Code that runs at compile-time can do arbitrary IO, including launching missiles or stealing your credit card. You don't want to have to look through every cabal package you ever download in search for TH exploits.
TH can access "module-private" functions and definitions, completely breaking encapsulation in some cases.
Then there are some problems that make TH functions less fun to use as a library developer:
TH code isn't always composable. Let's say someone makes a generator for lenses, and more often than not, that generator will be structured in such a way that it can only be called directly by the "end-user," and not by other TH code, by for example taking a list of type constructors to generate lenses for as the parameter. It is tricky to generate that list in code, while the user only has to write generateLenses [''Foo, ''Bar].
Developers don't even know that TH code can be composed. Did you know that you can write forM_ [''Foo, ''Bar] generateLens? Q is just a monad, so you can use all of the usual functions on it. Some people don't know this, and because of that, they create multiple overloaded versions of essentially the same functions with the same functionality, and these functions lead to a certain bloat effect. Also, most people write their generators in the Q monad even when they don't have to, which is like writing bla :: IO Int; bla = return 3; you are giving a function more "environment" than it needs, and clients of the function are required to provide that environment as an effect of that.
Finally, there are some things that make TH functions less fun to use as an end-user:
Opacity. When a TH function has type Q Dec, it can generate absolutely anything at the top-level of a module, and you have absolutely no control over what will be generated.
Monolithism. You can't control how much a TH function generates unless the developer allows it; if you find a function that generates a database interface and a JSON serialization interface, you can't say "No, I only want the database interface, thanks; I'll roll my own JSON interface"
Run time. TH code takes a relatively long time to run. The code is interpreted anew every time a file is compiled, and often, a ton of packages are required by the running TH code, that have to be loaded. This slows down compile time considerably.
This is solely my own opinion.
It's ugly to use. $(fooBar ''Asdf) just does not look nice. Superficial, sure, but it contributes.
It's even uglier to write. Quoting works sometimes, but a lot of the time you have to do manual AST grafting and plumbing. The API is big and unwieldy, there's always a lot of cases you don't care about but still need to dispatch, and the cases you do care about tend to be present in multiple similar but not identical forms (data vs. newtype, record-style vs. normal constructors, and so on). It's boring and repetitive to write and complicated enough to not be mechanical. The reform proposal addresses some of this (making quotes more widely applicable).
The stage restriction is hell. Not being able to splice functions defined in the same module is the smaller part of it: the other consequence is that if you have a top-level splice, everything after it in the module will be out of scope to anything before it. Other languages with this property (C, C++) make it workable by allowing you to forward declare things, but Haskell doesn't. If you need cyclic references between spliced declarations or their dependencies and dependents, you're usually just screwed.
It's undisciplined. What I mean by this is that most of the time when you express an abstraction, there is some kind of principle or concept behind that abstraction. For many abstractions, the principle behind them can be expressed in their types. For type classes, you can often formulate laws which instances should obey and clients can assume. If you use GHC's new generics feature to abstract the form of an instance declaration over any datatype (within bounds), you get to say "for sum types, it works like this, for product types, it works like that". Template Haskell, on the other hand, is just macros. It's not abstraction at the level of ideas, but abstraction at the level of ASTs, which is better, but only modestly, than abstraction at the level of plain text.*
It ties you to GHC. In theory another compiler could implement it, but in practice I doubt this will ever happen. (This is in contrast to various type system extensions which, though they might only be implemented by GHC at the moment, I could easily imagine being adopted by other compilers down the road and eventually standardized.)
The API isn't stable. When new language features are added to GHC and the template-haskell package is updated to support them, this often involves backwards-incompatible changes to the TH datatypes. If you want your TH code to be compatible with more than just one version of GHC you need to be very careful and possibly use CPP.
There's a general principle that you should use the right tool for the job and the smallest one that will suffice, and in that analogy Template Haskell is something like this. If there's a way to do it that's not Template Haskell, it's generally preferable.
The advantage of Template Haskell is that you can do things with it that you couldn't do any other way, and it's a big one. Most of the time the things TH is used for could otherwise only be done if they were implemented directly as compiler features. TH is extremely beneficial to have both because it lets you do these things, and because it lets you prototype potential compiler extensions in a much more lightweight and reusable way (see the various lens packages, for example).
To summarize why I think there are negative feelings towards Template Haskell: It solves a lot of problems, but for any given problem that it solves, it feels like there should be a better, more elegant, disciplined solution better suited to solving that problem, one which doesn't solve the problem by automatically generating the boilerplate, but by removing the need to have the boilerplate.
* Though I often feel that CPP has a better power-to-weight ratio for those problems that it can solve.
EDIT 23-04-14: What I was frequently trying to get at in the above, and have only recently gotten at exactly, is that there's an important distinction between abstraction and deduplication. Proper abstraction often results in deduplication as a side effect, and duplication is often a telltale sign of inadequate abstraction, but that's not why it's valuable. Proper abstraction is what makes code correct, comprehensible, and maintainable. Deduplication only makes it shorter. Template Haskell, like macros in general, is a tool for deduplication.
I'd like to address a few of the points dflemstr brings up.
I don't find the fact that you can't typecheck TH to be that worrying. Why? Because even if there is an error, it will still be compile time. I'm not sure if this strengthens my argument, but this is similar in spirit to the errors that you receive when using templates in C++. I think these errors are more understandable than C++'s errors though, as you'll get a pretty printed version of the generated code.
If a TH expression / quasi-quoter does something that's so advanced that tricky corners can hide, then perhaps it's ill-advised?
I break this rule quite a bit with quasi-quoters I've been working on lately (using haskell-src-exts / meta) - https://github.com/mgsloan/quasi-extras/tree/master/examples . I know this introduces some bugs such as not being able to splice in the generalized list comprehensions. However, I think that there's a good chance that some of the ideas in http://hackage.haskell.org/trac/ghc/blog/Template%20Haskell%20Proposal will end up in the compiler. Until then, the libraries for parsing Haskell to TH trees are a nearly perfect approximation.
Regarding compilation speed / dependencies, we can use the "zeroth" package to inline the generated code. This is at least nice for the users of a given library, but we can't do much better for the case of editing the library. Can TH dependencies bloat generated binaries? I thought it left out everything that's not referenced by the compiled code.
The staging restriction / splitting of compilation steps of the Haskell module does suck.
RE Opacity: This is the same for any library function you call. You have no control over what Data.List.groupBy will do. You just have a reasonable "guarantee" / convention that the version numbers tell you something about the compatibility. It is somewhat of a different matter of change when.
This is where using zeroth pays off - you're already versioning the generated files - so you'll always know when the form of the generated code has changed. Looking at the diffs might be a bit gnarly, though, for large amounts of generated code, so that's one place where a better developer interface would be handy.
RE Monolithism: You can certainly post-process the results of a TH expression, using your own compile-time code. It wouldn't be very much code to filter on top-level declaration type / name. Heck, you could imagine writing a function that does this generically. For modifying / de-monolithisizing quasiquoters, you can pattern match on "QuasiQuoter" and extract out the transformations used, or make a new one in terms of the old.
This answer is in response to the issues brought up by illissius, point by point:
It's ugly to use. $(fooBar ''Asdf) just does not look nice. Superficial, sure, but it contributes.
I agree. I feel like $( ) was chosen to look like it was part of the language - using the familiar symbol pallet of Haskell. However, that's exactly what you /don't/ want in the symbols used for your macro splicing. They definitely blend in too much, and this cosmetic aspect is quite important. I like the look of {{ }} for splices, because they are quite visually distinct.
It's even uglier to write. Quoting works sometimes, but a lot of the time you have to do manual AST grafting and plumbing. The [API][1] is big and unwieldy, there's always a lot of cases you don't care about but still need to dispatch, and the cases you do care about tend to be present in multiple similar but not identical forms (data vs. newtype, record-style vs. normal constructors, and so on). It's boring and repetitive to write and complicated enough to not be mechanical. The [reform proposal][2] addresses some of this (making quotes more widely applicable).
I also agree with this, however, as some of the comments in "New Directions for TH" observe, the lack of good out-of-the-box AST quoting is not a critical flaw. In this WIP package, I seek to address these problems in library form: https://github.com/mgsloan/quasi-extras . So far I allow splicing in a few more places than usual and can pattern match on ASTs.
The stage restriction is hell. Not being able to splice functions defined in the same module is the smaller part of it: the other consequence is that if you have a top-level splice, everything after it in the module will be out of scope to anything before it. Other languages with this property (C, C++) make it workable by allowing you to forward declare things, but Haskell doesn't. If you need cyclic references between spliced declarations or their dependencies and dependents, you're usually just screwed.
I've run into the issue of cyclic TH definitions being impossible before... It's quite annoying. There is a solution, but it's ugly - wrap the things involved in the cyclic dependency in a TH expression that combines all of the generated declarations. One of these declarations generators could just be a quasi-quoter that accepts Haskell code.
It's unprincipled. What I mean by this is that most of the time when you express an abstraction, there is some kind of principle or concept behind that abstraction. For many abstractions, the principle behind them can be expressed in their types. When you define a type class, you can often formulate laws which instances should obey and clients can assume. If you use GHC's [new generics feature][3] to abstract the form of an instance declaration over any datatype (within bounds), you get to say "for sum types, it works like this, for product types, it works like that". But Template Haskell is just dumb macros. It's not abstraction at the level of ideas, but abstraction at the level of ASTs, which is better, but only modestly, than abstraction at the level of plain text.
It's only unprincipled if you do unprincipled things with it. The only difference is that with the compiler implemented mechanisms for abstraction, you have more confidence that the abstraction isn't leaky. Perhaps democratizing language design does sound a bit scary! Creators of TH libraries need to document well and clearly define the meaning and results of the tools they provide. A good example of principled TH is the derive package: http://hackage.haskell.org/package/derive - it uses a DSL such that the example of many of the derivations /specifies/ the actual derivation.
It ties you to GHC. In theory another compiler could implement it, but in practice I doubt this will ever happen. (This is in contrast to various type system extensions which, though they might only be implemented by GHC at the moment, I could easily imagine being adopted by other compilers down the road and eventually standardized.)
That's a pretty good point - the TH API is pretty big and clunky. Re-implementing it seems like it could be tough. However, there are only really only a few ways to slice the problem of representing Haskell ASTs. I imagine that copying the TH ADTs, and writing a converter to the internal AST representation would get you a good deal of the way there. This would be equivalent to the (not insignificant) effort of creating haskell-src-meta. It could also be simply re-implemented by pretty printing the TH AST and using the compiler's internal parser.
While I could be wrong, I don't see TH as being that complicated of a compiler extension, from an implementation perspective. This is actually one of the benefits of "keeping it simple" and not having the fundamental layer be some theoretically appealing, statically verifiable templating system.
The API isn't stable. When new language features are added to GHC and the template-haskell package is updated to support them, this often involves backwards-incompatible changes to the TH datatypes. If you want your TH code to be compatible with more than just one version of GHC you need to be very careful and possibly use CPP.
This is also a good point, but somewhat dramaticized. While there have been API additions lately, they haven't been extensively breakage inducing. Also, I think that with the superior AST quoting I mentioned earlier, the API that actually needs to be used can be very substantially reduced. If no construction / matching needs distinct functions, and are instead expressed as literals, then most of the API disappears. Moreover, the code you write would port more easily to AST representations for languages similar to Haskell.
In summary, I think that TH is a powerful, semi-neglected tool. Less hate could lead to a more lively eco-system of libraries, encouraging the implementation of more language feature prototypes. It's been observed that TH is an overpowered tool, that can let you /do/ almost anything. Anarchy! Well, it's my opinion that this power can allow you to overcome most of its limitations, and construct systems capable of quite principled meta-programming approaches. It's worth the usage of ugly hacks to simulate the "proper" implementation, as this way the design of the "proper" implementation will gradually become clear.
In my personal ideal version of nirvana, much of the language would actually move out of the compiler, into libraries of these variety. The fact that the features are implemented as libraries does not heavily influence their ability to faithfully abstract.
What's the typical Haskell answer to boilerplate code? Abstraction. What're our favorite abstractions? Functions and typeclasses!
Typeclasses let us define a set of methods, that can then be used in all manner of functions generic on that class. However, other than this, the only way classes help avoid boilerplate is by offering "default definitions". Now here is an example of an unprincipled feature!
Minimal binding sets are not declarable / compiler checkable. This could lead to inadvertent definitions that yield bottom due to mutual recursion.
Despite the great convenience and power this would yield, you cannot specify superclass defaults, due to orphan instances http://lukepalmer.wordpress.com/2009/01/25/a-world-without-orphans/ These would let us fix the numeric hierarchy gracefully!
Going after TH-like capabilities for method defaults led to http://www.haskell.org/haskellwiki/GHC.Generics . While this is cool stuff, my only experience debugging code using these generics was nigh-impossible, due to the size of the type induced for and ADT as complicated as an AST. https://github.com/mgsloan/th-extra/commit/d7784d95d396eb3abdb409a24360beb03731c88c
In other words, this went after the features provided by TH, but it had to lift an entire domain of the language, the construction language, into a type system representation. While I can see it working well for your common problem, for complex ones, it seems prone to yielding a pile of symbols far more terrifying than TH hackery.
TH gives you value-level compile-time computation of the output code, whereas generics forces you to lift the pattern matching / recursion part of the code into the type system. While this does restrict the user in a few fairly useful ways, I don't think the complexity is worth it.
I think that the rejection of TH and lisp-like metaprogramming led to the preference towards things like method-defaults instead of more flexible, macro-expansion like declarations of instances. The discipline of avoiding things that could lead to unforseen results is wise, however, we should not ignore that Haskell's capable type system allows for more reliable metaprogramming than in many other environments (by checking the generated code).
One rather pragmatic problem with Template Haskell is that it only works when GHC's bytecode interpreter is available, which is not the case on all architectures. So if your program uses Template Haskell or relies on libraries that use it, it will not run on machines with an ARM, MIPS, S390 or PowerPC CPU.
This is relevant in practice: git-annex is a tool written in Haskell that makes sense to run on machines worrying about storage, such machines often have non-i386-CPUs. Personally, I run git-annex on a NSLU 2 (32 MB of RAM, 266MHz CPU; did you know Haskell works fine on such hardware?) If it would use Template Haskell, this is not possible.
(The situation about GHC on ARM is improving these days a lot and I think 7.4.2 even works, but the point still stands).
Why is TH bad? For me, it comes down to this:
If you need to produce so much repetitive code that you find yourself trying to use TH to auto-generate it, you're doing it wrong!
Think about it. Half the appeal of Haskell is that its high-level design allows you to avoid huge amounts of useless boilerplate code that you have to write in other languages. If you need compile-time code generation, you're basically saying that either your language or your application design has failed you. And we programmers don't like to fail.
Sometimes, of course, it's necessary. But sometimes you can avoid needing TH by just being a bit more clever with your designs.
(The other thing is that TH is quite low-level. There's no grand high-level design; a lot of GHC's internal implementation details are exposed. And that makes the API prone to change...)

A language in which everything compiles

I'm trying to do some research for a new project, and I need to create objects dynamically from random data.
For this to work, I need a language / compiler that doesn't have problems with weird uncompilable code lying around.
Basically, I need the random code to compile (or be interpreted) as much as possible - Meaning that the uncompilable parts will be ignored, and only the compilable parts will create the objects (which could be ran).
Object Oriented-ness is not a must, but is a very strong advantage.
I thought of ASM, but it's very messy, and I'd probably need a more readable code
Thanks!
It sounds like you're doing something very much like genetic programming; even if you aren't, GP has to solve some of the same problems—using randomness to generate valid programs. The approach to this that is typically used is to work with a syntax tree: rather than storing x + y * 3 - 2, you store something like the following:
Then, instead of randomly changing the syntax, one can randomly change nodes in the tree instead. And if x should randomly change to, say, +, you can statically know that this means you need to insert two children (or not, depending on how you define +).
A good choice for a language to work with for this would be any Lisp dialect. In a Lisp, the above program would be written (- (+ x (* y 3)) 2), which is just a linearization of the syntax tree using parentheses to show depth. And in fact, Lisps expose this feature: you can just as easily work with the object '(- (+ x (* y 3)) 2) (note the leading quote). This is a three-element list, whose first element is -, second element is another list, and third element is 2. And, though you might or might not want it for your particular application, there's an eval function, such that (eval '(- (+ x (* y 3)) 2)) will take in the given list, treat it as a Lisp syntax tree/program, and evaluate it. This is what makes Lisps so attractive for doing this sort of work; Lisp syntax is basically a reification of the syntax-tree, and if you operate at the syntax-tree level, you can work on code as though it was a value. Lisp won't help you read /dev/random as a program directly, but with a little interpretation layered on top, you should be able to get what you want.
I should also mention, though I don't know anything about it (not that I know much about ordinary genetic programming) the existence of linear genetic programming. This is sort of like the assembly model that you mentioned—a linear stream of very, very simple instructions. The advantage here would seem to be that if you are working with /dev/random or something like it, the amount of interpretation needed is very small; the disadvantage would be, as you mentioned, the low-level nature of the code.
I'm not sure if this is what you're looking for, but any programming language can be made to function this way. For any programming language P, define the language Palways as follows:
If p is a valid program in P, then p is a valid program in Palways whose meaning is the same as its meaning in P.
If p is not a valid program in P, then p is a valid program in Palways whose meaning is the same as a program that immediately terminates.
For example, I could make the language C++always so that this program:
#include <iostream>
using namespace std;
int main() {
cout << "Hello, world!" << endl;
}
would compile as "Hello, world!", while this program:
Hahaha! This isn't legal C++ code!
Would be a legal program that just does absolutely nothing.
To solve your original problem, just take any OOP language like Java, Smalltalk, etc. and construct the appropriate Javaalways, Smalltalkalways, etc. language from it. Again, I'm not sure if this is at all what you're looking for, but it could be done very easily.
Alternatively, consider finding a grammar for any OOP language and then using that grammar to produce random syntactically valid programs. You could then filter those programs down by using the Palways programming language for that language to eliminate syntactically but not semantically valid programs.
Divide the ASCII byte values into 9 classes (division modulo 9 would help). Then assign then to Brainfuck codewords (see http://en.wikipedia.org/wiki/Brainfuck). Then interpret as Brainfuck.
There you go, any sequence of ASCII characters is a program. Not that it's going to do anything sensible... This approach has a much better chance, compared to templatetypedef's answer, to get a nontrivial program from a random byte sequence.
Text Editors
You could try feeding random character strings to an editor like Emacs or VI. Many (most?) characters will perform an editing action but some will do nothing (other than beep, perhaps). You would have to ensure that the random code mutator never generates the character sequence that exits the editor. However, this experience would be much like programming a Turing machine -- the code is not too readable.
Mathematica
In Mathematica, undefined symbols and other expressions evaluate to themselves, without error. So, that language might be a viable choice if you can arrange for the random code mutator to always generate well-formed expressions. This would be readily achievable since the basic Mathematica syntax is trivial, making it easy to operate on syntactic units rather than at the character level. It would be even easier if the mutator were written in Mathematica itself since expression-munging is Mathematica's forte. You could define a mini-language of valid operations within a Mathematica package that does not import the system-defined symbols. This would allow you to generate well-formed expressions to your heart's content without fear of generating a dangerous expression, like DeleteFile[FileNames["*.*", "/", Infinity]].
I believe Common Lisp should suit your needs. I always have some code in my SLIME/Emacs session that wouldn't compile. You can always tweak things, redefine functions in run-time. It is actually very good for prototyping.
A few years ago it took me quite a while to learn. But nowadays we have quicklisp and everything is so much easier.
Here I describe my development environment:
Install lisp on my linux machine
PS: I want to give an example, where Common Lisp was useful for me:
Up to maybe 2004 I used to write small programs in C (the keep it simple Unix way).
The last 3 years I had to get lots of different hardware running. Motorized stages, scientific cameras, IO cards.
The cameras turned out to be quite annoying. Usually you have to cool them down to -50 degree celsius or so and (in some SDKs) they don't like it when you close them. But this
is exactly how my C development cycle worked: write (30s), compile (1s), run (0.1s), repeat.
Eventually I decided to just use Common Lisp. Often it is straight forward to define the foreign function interfaces to talk to the SDKs and I can do this without ever leaving the running Lisp image. I start the editor in the morning define the open-device function, to talk to the device and after 3 hours I have enough of the functions implemented to set gain, temperature, region of interest and obtain the video.
Then I can often put the SDK manual away and just use the camera.
I used the same interactive programming approach when I have to parse some webpage or some weird XML.

How does one avoid creating an ad-hoc type system in dynamically typed languages?

In every project I've started in languages without type systems, I eventually begin to invent a runtime type system. Maybe the term "type system" is too strong; at the very least, I create a set of type/value-range validators when I'm working with complex data types, and then I feel the need to be paranoid about where data types can be created and modified.
I hadn't thought twice about it until now. As an independent developer, my methods have been working in practice on a number of small projects, and there's no reason they'd stop working now.
Nonetheless, this must be wrong. I feel as if I'm not using dynamically-typed languages "correctly". If I must invent a type system and enforce it myself, I may as well use a language that has types to begin with.
So, my questions are:
Are there existing programming paradigms (for languages without types) that avoid the necessity of using or inventing type systems?
Are there otherwise common recommendations on how to solve the problems that static typing solves in dynamically-typed languages (without sheepishly reinventing types)?
Here is a concrete example for you to consider. I'm working with datetimes and timezones in erlang (a dynamic, strongly typed language). This is a common datatype I work with:
{{Y,M,D},{tztime, {time, HH,MM,SS}, Flag}}
... where {Y,M,D} is a tuple representing a valid date (all entries are integers), tztime and time are atoms, HH,MM,SS are integers representing a sane 24-hr time, and Flag is one of the atoms u,d,z,s,w.
This datatype is commonly parsed from input, so to ensure valid input and a correct parser, the values need to be checked for type correctness, and for valid ranges. Later on, instances of this datatype are compared to each other, making the type of their values all the more important, since all terms compare. From the erlang reference manual
number < atom < reference < fun < port < pid < tuple < list < bit string
Aside from the confsion of static vs. dynamic and strong vs. weak typing:
What you want to implement in your example isn't really solved by most existing static typing systems. Range checks and complications like February 31th and especially parsed input are usually checked during runtime no matter what type system you have.
Your example being in Erlang I have a few recommendations:
Use records. Besides being usefull and helpfull for a whole bunch of reasons, the give you easy runtime type checking without a lot of effort e.g.:
is_same_day(#datetime{year=Y1, month=M1, day=D1},
#datetime{year=Y2, month=M2, day=D2}) -> ...
Effortless only matches for two datetime records. You could even add guards to check for ranges if the source is untrusted. And it conforms to erlangs let it crash method of error handling: if no match is found you get a badmatch, and can handle this on the level where it is apropriate (usually the supervisor level).
Generally write your code that it crashes when the assumptions are not valid
If this doesn't feel static checked enough: use typer and dialyzer to find the kind of errors that can be found statically, whatever remains will be checkd at runtime.
Don't be too restrictive in your functions what "types" you accept, sometimes the added functionality of just doing someting useful even for different inputs is worth more than checking the types and ranges on every function. If you do it where it matters usually you will catch the error early enough for it to be easy fixable. This is especially true for a functionaly language where you allways know where every value comes from.
A lot of good answers, let me add:
Are there existing programming paradigms (for languages without types) that avoid the necessity of using or inventing type systems?
The most important paradigm, especially in Erlang, is this: Assume the type is right, otherwise let it crash. Don't write excessively checking paranoid code, but assume that the input you get is of the right type or the right pattern. Don't write (there are exceptions to this rule, but in general)
foo({tag, ...}) -> do_something(..);
foo({tag2, ...}) -> do_something_else(..);
foo(Otherwise) ->
report_error(Otherwise),
try to fix problem here...
Kill the last clause and have it crash right away. Let a supervisor and other processes do the cleanup (you can use monitors() for janitorial processes to know when a crash has occurred).
Do be precise however. Write
bar(N) when is_integer(N) -> ...
baz([]) -> ...
baz(L) when is_list(L) -> ...
if the function is known only to work with integers or lists respectively. Yes, it is a runtime check but the goal is to convey information to the programmer. Also, HiPE tend to utilize the hint for optimization and eliminate the type check if possible. Hence, the price may be less than what you think it is.
You choose an untyped/dynamically-typed language so the price you have to pay is that type checking and errors from clashes will happen at runtime. As other posts hint, a statically typed language is not exempt from doing some checks as well - the type system is (usually) an approximation of a proof of correctness. In most static languages you often get input which you can't trust. This input is transformed at the "border" of the application and then converted to an internal format. The conversion serves to mark trust: From now on, the thing has been validated and we can assume certain things about it. The power and correctness of this assumption is directly tied to its type signature and how good the programmer is with juggling the static types of the language.
Are there otherwise common recommendations on how to solve the problems that static typing solves in dynamically-typed languages (without sheepishly reinventing types)?
Erlang has the dialyzer which can be used to statically analyze and infer types of your programs. It will not come up with as many type errors as a type checker in e.g., Ocaml, but it won't "cry wolf" either: An error from the dialyzer is provably an error in the program. And it won't reject a program which may be working ok. A simple example is:
and(true, true) -> true;
and(true, _) -> false;
and(false, _) -> false.
The invocation and(true, greatmistake) will return false, yet a static type system will reject the program because it will infer from the first line that the type signature takes a boolean() value as the 2nd parameter. The dialyzer will accept this function in contrast and give it the signature (boolean(), term()) -> boolean(). It can do this, because there is no need to protect a priori for an error. If there is a mistake, the runtime system has a type check that will capture it.
In order for a statically-typed language to match the flexibility of a dynamically-typed one, I think it would need a lot, perhaps infinitely many, features.
In the Haskell world, one hears a lot of sophisticated, sometimes to the point of being scary, teminology. Type classes. Parametric polymorphism. Generalized algebraic data types. Type families. Functional dependencies. The Ωmega programming language takes it even further, with the website listing "type-level functions" and "level polymorphism", among others.
What are all these? Features added to static typing to make it more flexible. These features can be really cool, and tend to be elegant and mind-blowing, but are often difficult to understand. Learning curve aside, type systems often fail to model real-world problems elegantly. A particularly good example of this is interacting with other languages (a major motivation for C# 4's dynamic feature).
Dynamically-typed languages give you the flexibility to implement your own framework of rules and assumptions about data, rather than be constrained by the ever-limited static type system. However, "your own framework" won't be machine-checked, meaning the onus is on you to ensure your "type system" is safe and your code is well-"typed".
One thing I've found from learning Haskell is that I can carry lessons learned about strong typing and sound reasoning over to weaker-typed languages, such as C and even assembly, and do the "type checking" myself. Namely, I can prove that sections of code are correct in and of themselves, by bearing in mind the rules my functions and values are supposed to follow, and the assumptions I am allowed to make about other functions and values. When debugging, I go through and check things again, and think through whether or not my approach is sound.
The bottom line: dynamic typing puts more flexibility at your fingertips. On the other hand, statically-typed languages tend to be more efficient (by orders of magnitude), and good static type systems drastically cut down on debugging time by letting the computer do much of it for you. If you want the benefits of both, install a static type checker in your brain by learning decent, strongly-typed languages.
Sometimes data need validation. Validating any data received from the network is almost always a good idea — especially data from a public network. Being paranoid here is only good. If something resembling a static type system helps this in the least painful way, so be it. There's a reason why Erlang allows type annotations. Even pattern matching can be seen as just a kind of dynamic type checking; nevertheless, it's a central feature of the language. The very structure of data is its 'type' in Erlang.
The good thing is that you can custom-tailor your 'type system' to your needs, make it flexible and smart, while type systems of OO languages typically have fixed features. When data structures you use are immutable, once you've validated such a structure, you're safe to assume it conforms your restrictions, just like with static typing.
There's no point in being ready to process any kind of data at any point of a program, dynamically-typed or not. A 'dynamic type' is essentially a union of all possible types; limiting it to a useful subset is a valid way to program.
A statically typed language detects type errors at compile time. A dynamically typed language detects them at runtime. There are some modest restrictions on what one can write in a statically typed language such that all type errors can be caught at compile time.
But yes, you still have types even in a dynamically typed language, and that's a good thing. The problem is you wander into lots of runtime checks to ensure that you have the types you think you do, since the compiler hasn't taken care of that for you.
Erlang has a very nice tool for specifying and statically verifying lots of types -- dialyzer: Erlang type system, for references.
So don't reinvent types, use the typing tools that Erlang already provides, to handle the types that already exist in your program (but which you haven't yet specified).
And this on its own won't eliminate range checks, unfortunately. Without lots of special sauce you really have to enforce this on your own by convention (and smart constructors, etc. to help), or fall back to runtime checks, or both.

Lisp data security/validation

This is really just a conceptual question for me at this point.
In Lisp, programs are data and data are programs. The REPL does exactly that - reads and then evaluates.
So how does one go about getting input from the user in a secure way? Obviously it's possible - I mean viaweb - now Yahoo!Stores is pretty secure, so how is it done?
The REPL stands for Read Eval Print Loop.
(loop (print (eval (read))))
Above is only conceptual, the real REPL code is much more complicated (with error handling, debugging, ...).
You can read all kinds of data in Lisp without evaluating it. Evaluation is a separate step - independent from reading data.
There are all kinds of IO functions in Lisp. The most complex of the provided functions is usually READ, which reads s-expressions. There is an option in Common Lisp which allows evaluation during READ, but that can and should be turned off when reading data.
So, data in Lisp is not necessarily a program and even if data is a program, then Lisp can read the program as data - without evaluation. A REPL should only be used by a developer and should not be exposed to arbitrary users. For getting data from users one uses the normal IO functions, including functions like READ, which can read S-expressions, but does not evaluate them.
Here are a few things one should NOT do:
use READ to read arbitrary data. READ for examples allows one to read really large data - there is no limit.
evaluate during READ ('read eval'). This should be turned off.
read symbols from I/O and call their symbol functions
read cyclical data structures with READ, when your functions expect plain lists. Walking down a cyclical list can keep your program busy for a while.
not handle syntax errors during reading from data.
You do it the way everyone else does it. You read a string of data from the stream, you parse it for your commands and parameters, you validate the commands and parameters, and you interpret the commands and parameters.
There's no magic here.
Simply put, what you DON'T do, is you don't expose your Lisp listener to an unvalidated, unsecure data source.
As was mentioned, the REPL is read - eval - print. #The Rook focused on eval (with reason), but do not discount READ. READ is a VERY powerful command in Common Lisp. The reader can evaluate code on its own, before you even GET to "eval".
Do NOT expose READ to anything you don't trust.
With enough work, you could make a custom package, limit scope of functions avaliable to that package, etc. etc. But, I think that's more work than simply writing a simple command parser myself and not worrying about some side effect that I missed.
Create your own readtable and fill with necessary hooks: SET-MACRO-CHARACTER, SET-DISPATCH-MACRO-CHARACTER et al.
Bind READTABLE to your own readtable.
Bind READ-EVAL to nil to prevent #. (may not be necessary if step 1 is done right)
READ
Probably something else.
Also there is a trick in interning symbols in temporary package while reading.
If data in not LL(1)-ish, simply write usual parser.
This is a killer question and I thought this same thing when I was reading about Lisp. Although I haven't done anything meaningful in LISP so my answer is very limited.
What I can tell you is that eval() is nasty. There is a saying that I like "If eval is the answer then you are asking the wrong question." --Unknown.
If the attacker can control data that is then evaluated then you have a very serious remote code execution vulnerability. This can be mitigated, and I'll show you an example with PHP, because that is what I know:
$id=addslashes($_GET['id']);
eval('$test="$id";');
If you weren't doing an add slashes then an attacker could get remote code execution by doing this:
http://localhost?evil_eval.php?id="; phpinfo();/*
But the add slashes will turn the " into a \", thus keeping the attacker from "breaking out" of the "data" and being able to execute code. Which is very similar to sql injection.
I found that question quit controversial. The eval wont eval your input unless you explicitly ask for it.
I mean your input will not be treat it as a LISP code but instead as a string.
Is not because that your language have powerfull concept like the eval that it is not "safe".
I think the confusion come from SQL where your actually treat an input as a [part of] SQL.
(query (concatenate 'string "SELECT * FROM foo WHERE id = " input-id))
Here input-id is being evaluate by the SQL engine.
This is because you have no nice way to write SQL, or whatever, but the point is that your input become part of what is being evaluate.
So eval don't bring you insecurity unless your are using it eyes closed.
EDIT Forgot to tell that this apply to any language.

Resources