How does flexibility affect a language's syntax? - programming-languages

I am currently working on writing my own language (shameless plug), which is centered around flexibility. I am trying to make almost any part of the language's syntax exchangeable through things like extensions/plugins. Writing it has got me wondering how that sort of flexibility could affect the language.
I know that Lisp is often referred to as one of the most extensible languages due to its extensive macro system. I do understand the concept of macros, but I have yet to find a language that allows someone to change the way it is parsed. To my knowledge, almost every language has an extremely concrete syntax defined by some long specification.
My question is: how could having a flexible syntax affect the intuitiveness and usability of the language? I know the basic answers, like "people might be confused when the syntax changes" and "semantic analysis will be hard". Those are things that I am already starting to compensate for. I am looking for a more conceptual answer on the pros and cons of having a flexible syntax.
The topic of language design is still quite foreign to me, so I apologize if I am asking an obvious or otherwise stupid question!
Edit:
I just want to clarify the question I am asking: where exactly does flexibility in a language's syntax stand in terms of language theory? I don't really need examples or projects/languages with flexibility; I want to understand how it can affect the language's readability, functionality, and other things like that.

Perl is the most flexible language I know. Take a look at Moose, a postmodern object system for Perl 5. Its syntax is very different from Perl's, but it is still very Perl-ish.
IMO, the biggest problem with flexibility is precedence in infix notation, and no language I know of allows a datatype to have its own infix syntax. For example, take sets. It would be nice to use ⊂ and ⊇ in their syntax. But not only would a compiler have to recognize these symbols, it would also have to be told their order of precedence.
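To make that concrete, here is a minimal Common Lisp sketch of what "being told the order of precedence" might amount to: a user-extensible precedence table consulted by a tiny precedence-climbing (Pratt-style) parser. Every name here (*PRECEDENCE*, PARSE-INFIX) is invented for illustration, and it assumes the host Lisp reads ⊂ and ⊇ as ordinary symbols.

;; A user-extensible table: adding a set type means adding its operators here.
(defparameter *precedence*
  '((or . 1) (and . 2) (⊂ . 3) (⊇ . 3) (+ . 4) (- . 4) (* . 5) (/ . 5)))

(defun prec (op)
  (or (cdr (assoc op *precedence*)) 0))

(defun parse-infix (tokens &optional (min-prec 1))
  "Precedence climbing: read a primary, then fold in operators whose
precedence is at least MIN-PREC.  Returns the tree and leftover tokens."
  (let ((lhs (pop tokens)))
    (loop while (and tokens (>= (prec (first tokens)) min-prec))
          do (let ((op (pop tokens)))
               (multiple-value-bind (rhs rest)
                   (parse-infix tokens (1+ (prec op)))
                 (setf lhs (list op lhs rhs)
                       tokens rest))))
    (values lhs tokens)))

;; (parse-infix '(a ⊂ b and x + y * 3 ⊇ z))
;; => (AND (⊂ A B) (⊇ (+ X (* Y 3)) Z))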

Common Lisp allows you to change the way it's parsed - see reader macros. Racket allows you to modify its parser - see Racket languages.
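For instance, a minimal reader-macro sketch in standard Common Lisp that makes square brackets read as list construction (no libraries involved):

;; After this, [1 2 (+ 1 2)] reads as (LIST 1 2 (+ 1 2)).
(set-macro-character #\[
  (lambda (stream char)
    (declare (ignore char))
    (cons 'list (read-delimited-list #\] stream t))))

;; Make #\] a terminating character, behaving like a closing parenthesis.
(set-macro-character #\] (get-macro-character #\)))

;; (read-from-string "[1 2 (+ 1 2)]")  =>  (LIST 1 2 (+ 1 2))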
And of course you can have flexible, dynamically extensible parsing alongside powerful macros if you use the right parsing techniques (e.g., PEG). Have a look at an example here - mostly a C syntax, but extensible with both syntax and semantic macros.
As for precedence, PEG goes really well together with Pratt.
To answer your updated question - there is surprisingly little research on programming-language readability at all. You may want to have a look at what Dr. Blackwell's group has been up to, but it is still far from conclusive.
So I can only share my hand-wavy anecdotes: flexible-syntax languages facilitate eDSL construction, and, in my opinion, eDSLs are the only way to eliminate unnecessary complexity from code and to make code actually maintainable in the long term. I believe that inflexible languages are one of the biggest mistakes made by this industry, and it must be corrected at all costs, ASAP.

Flexibility allows you to manipulate the syntax of the language. For example, Lisp macros enable you to write programs that write programs: they manipulate your syntax at compile time, expanding into valid Lisp expressions. Take the loop macro:
(loop for x from 1 to 5
      do (format t "~A~%" x))
1
2
3
4
5
NIL
And we can see how the code was translated with macroexpand-1:
(pprint (macroexpand-1 '(loop for x from 1 to 5
                              do (format t "~a~%" x))))
We can then see how a call to that macro is translated:
(BLOCK NIL
  (LET ((X 1))
    (DECLARE (TYPE (AND REAL NUMBER) X))
    (TAGBODY
     SB-LOOP::NEXT-LOOP
      (WHEN (> X '5) (GO SB-LOOP::END-LOOP))
      (FORMAT T "~a~%" X)
      (SB-LOOP::LOOP-DESETQ X (1+ X))
      (GO SB-LOOP::NEXT-LOOP)
     SB-LOOP::END-LOOP)))
Language flexibility just allows you to create your own embedded language within a language and to shorten your program in terms of characters used. So in theory it can also make a language very unreadable, since we can manipulate the syntax. For example, we can write invalid-looking code that gets translated to valid code:
(defmacro backwards (expr)
  (reverse expr))
BACKWARDS
CL-USER> (backwards ("hello world" nil format))
"hello world"
CL-USER>
Clearly code like the above can become confusing, since:
("hello world" nil format)
is not, on its own, a valid Lisp expression.

Thanks to SK-logic's answer for pointing me in the direction of Alan Blackwell. I sent him an email asking his stance on the matter, and he responded with an absolutely wonderful explanation. Here it is:
So the person who responded to your StackOverflow question, saying that flexible syntax could be useful for DSLs, is certainly correct. It actually used to be fairly common to use the C preprocessor to create alternative syntax (that would be turned into regular syntax in an initial compile phase). A lot of the early esolangs were built this way.
In practice, I think we would have to say that a lot of DSLs are implemented as libraries within regular programming languages, and that the library design is far more significant than the syntax. There may be more purpose for having variety in visual languages, but making customisable general purpose compilers for arbitrary graphical syntax is really hard - much worse than changing text syntax features.
There may well be interesting things that your design could enable, so I wouldn't discourage experimentation. However, I think there is one reason why customisable syntax is not so common. This is related to the famous programmer's editor EMACS. In EMACS, everything is customisable - all key bindings, and all editor functions. It's fun to play with, and back in the day, many of us made our own personalised version that only we knew how to operate. But it turned out that it was a real hassle that everyone's editor worked completely differently. You could never lean over and make suggestions on another person's session, and teams always had to know who was logged in, in order to know whether the editor would work. So it turned out that, over the years, we all just started to use the default distribution and key bindings, which made things easier for everyone.
At this point in time, that is just about enough of an explanation that I was looking for. If anyone feels as though they have a better explanation or something to add, feel free to contact me.

Related

Does Lisp's treatment of code as data make it more vulnerable to security exploits than a language that doesn't treat code as data? [closed]

I know that this might be a stupid question but I was curious.
Since Lisp treats code and data the same, does this mean that it's easier to write a payload and pass it as "innocent" data that can be used to exploit programs? In comparison to languages that don't do so?
For example, in Python you can do something like this:
malicious_str = "print('this is a malicious string')"
user_in = eval(malicious_str)   # prints: this is a malicious string
P.S. I have just started learning Lisp.
No, I don't think it does. In fact because of what is normally meant by 'code is data' in Lisp, it is potentially less vulnerable than some other languages.
[Note: this answer is really about Common Lisp: see the end for a note about that.]
There are two senses in which 'code can be data' in a language.
Turning objects into executable code: eval & friends
This is the first sense. What this means is that you can, say, take a string or some other object (not all types of object, obviously) and say 'turn this into something I can execute, and do that'.
Any language that can do this has either
to be extremely careful about doing this on unconstrained data;
or to be able to be certain that a given program does not actually do this.
Plenty of languages have equivalents of eval and its relations, so plenty of languages have this problem. You give an example of Python for instance, which is a good one, and there are probably other examples in Python (I've written programs even in Python 2 which supported dynamic loading of modules at runtime, which executes potentially arbitrary code, and I think this stuff is much better integrated in Python 3).
This is also not just a property of a language: it's a property of a system. C can't do this, right? Well, yes it can if you're on any kind of reasonable *nixy platform. Not only can you use an exec-family function, but you can probably dynamically load a shared library and execute code in it.
So one solution to this problem is to, somehow, be able to be certain that a given program doesn't do this. One thing that helps is if there are a finite, known number of ways of doing it. In Common Lisp I think those are probably
eval of course;
unconstrained read (because of *read-eval*);
load;
compile;
compile-file;
and probably some others that I have forgotten.
Well, can you detect calls to those, statically, in a program? Not really: consider this:
(funcall (symbol-function (find-symbol s)) ...)
and now you're in trouble unless you have very good control over what s is: it might be "EVAL" for instance.
So that's frightening, but I don't think it's more frightening than what Python can do, for instance (almost certainly you can poke around in the namespace to find eval or something?). And something like that in a program ought to be a really big hint that bad things might happen.
I think there are probably two approaches to this, neither of which CL adopts but which implementations could (and perhaps even programs written in CL could).
One would be to be able to run programs in such a way that the finite set of bad functions above simply are disallowed: they'd signal errors if you tried to call them. An implementation could clearly do that (see below).
The other would be to have something like Perl's 'tainting' where data which came from a user needs to be explicitly looked-at by the program somehow before it's used. That doesn't guarantee safety of course, but it does make it harder to make silly mistakes: if s above came from user input and was thus tainted you'd have to explicitly say 'it's OK to use it' and, well, then it's up to you.
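A very rough sketch of what such a discipline could look like in Common Lisp - nothing here is built in, and TAINTED, USER-INPUT and UNTAINT are invented names:

(defstruct tainted value)                ; wrapper for untrusted data

(defun user-input (string)
  "Everything arriving from the outside world gets wrapped."
  (make-tainted :value string))

(defun untaint (x &key reason)
  "The programmer must explicitly assert that X has been validated."
  (declare (ignore reason))
  (tainted-value x))

;; (find-symbol (untaint s :reason "matched against a whitelist"))
;; Forgetting to call UNTAINT hands FIND-SYMBOL a TAINTED struct instead
;; of a string, so the mistake fails loudly rather than silently.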
So this is a problem, but I don't think it's worse than the problems that very many other languages (and language-families) have.
An example of an implementation that can address the first approach is LispWorks: if you're building an application with LW, you typically create the binary with a function called deliver, which has options which allow you to remove the definitions of functions from the resulting binary whether or not the delivery process would otherwise leave them there. So, for instance
(deliver 'foo "x" 5
         :functions-to-remove '(eval load compile compile-file read))
would result in an executable x which, whatever else it did, couldn't call those functions, because they're not present, at all.
Other implementations probably have similar features: I just don't know them as well.
But there's another sense in which 'code is data' in Lisp.
Program source code is available as structured data
This is the sense that people probably really mean when they say 'code is data' in Lisp, even if they don't know that. It's worth looking at your Python example again:
>>> eval("exit('exploded')")
exploded
$
So what eval eats is a string: a completely unstructured vector of characters. If you want to know whether that string contains something nasty, well, you've got a lot of work ahead of you (disclaimer: see below).
Compare this with CL:
> (let ((trying-to-be-bad "(end-the-world :now t)"))
    (eval trying-to-be-bad))
"(end-the-world :now t)"
OK, so that clearly didn't end the world. And it didn't end the world because eval evaluates a bit of Lisp source code, and the value of a string, as source code, is the string.
If I want to do something nasty I have to hand it an actual interesting structure:
> (let ((actually-bad '(eval (progn
                               (format *query-io* "? ")
                               (finish-output *query-io*)
                               (read *query-io*)))))
    (eval actually-bad))
? (defun foo () (foo))
foo
Now that's potentially quite nasty in at least several ways. But wait: in order to do this nasty thing, I had to hand eval a chunk of source code represented as an s-expression. And the structure of that s-expression is completely open to inspection by me. I can write a program which inspects this s-expression in any arbitrary way I like, and decides whether or not it is acceptable to me. That's just hugely easier than 'given this string, interpret it as a piece of source text for the language and tell me if it is dangerous':
the process of turning the sequence of characters into an s-expression has happened already;
the structure of s-expressions is both simple and standard.
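For example, a minimal whitelist check over an s-expression could look like this (a sketch: *SAFE-OPERATORS* and ACCEPTABLE-P are invented names, and a real checker would also need to treat special forms such as QUOTE and LET specially):

(defparameter *safe-operators* '(+ - * / min max expt))

(defun acceptable-p (form)
  "True if FORM only ever calls operators on the whitelist."
  (if (atom form)
      t                                   ; numbers, strings, plain symbols are fine here
      (and (member (first form) *safe-operators*)
           (every #'acceptable-p (rest form)))))

;; (acceptable-p '(+ 1 (* 2 3)))             => true
;; (acceptable-p '(eval (read *query-io*)))  => NIL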
So in this sense of 'code is data', Lisp is potentially much safer than other languages which have versions of eval which eat strings, like Python, say, because code is structured, standard, simple data. Lisp has an answer to the terrible 'language in a string' problem.
I am fairly sure that Python does in fact have some approach to making the parse tree available in a standard way which can be inspected (the ast module, for instance). But eval still happily eats strings.
As I said above, this answer is about Common Lisp. But there are many other Lisps of course, which will have varying versions of this problem. Racket for instance probably can really fairly tightly constrain things, using sandboxed execution and modules, although I haven't explored this.
Any language can be exploited if you are not careful.
A well-known attack against Lisp is via the #. reader macro:
(read-from-string "#.(start-the-war)")
will start the war if *read-eval* is non-nil - this is why one should always bind it to nil when reading from an untrusted stream.
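A sketch of the usual defence:

;; With *READ-EVAL* bound to NIL, the #. reader macro signals a
;; READER-ERROR instead of executing the form at read time.
(let ((*read-eval* nil))
  (read-from-string "#.(start-the-war)"))
;; => error (the exact message depends on the implementation)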
However, this is not directly related to "code is data" doctrine...

What's so bad about Template Haskell?

It seems that Template Haskell is often viewed by the Haskell community as an unfortunate convenience. It's hard to put into words exactly what I have observed in this regard, but consider these few examples
Template Haskell listed under "The Ugly (but necessary)" in response to the question Which Haskell (GHC) extensions should users use/avoid?
Template Haskell considered a temporary/inferior solution in Unboxed Vectors of newtype'd values thread (libraries mailing list)
Yesod is often criticized for relying too much on Template Haskell (see the blog post in response to this sentiment)
I've seen various blog posts where people do pretty neat stuff with Template Haskell, enabling prettier syntax that simply wouldn't be possible in regular Haskell, as well as tremendous boilerplate reduction. So why is it that Template Haskell is looked down upon in this way? What makes it undesirable? Under what circumstances should Template Haskell be avoided, and why?
One reason for avoiding Template Haskell is that it as a whole isn't type-safe, at all, thus going against much of "the spirit of Haskell." Here are some examples of this:
You have no control over what kind of Haskell AST a piece of TH code will generate, beyond where it will appear; you can have a value of type Exp, but you don't know if it is an expression that represents a [Char] or a (a -> (forall b . b -> c)) or whatever. TH would be more reliable if one could express that a function may only generate expressions of a certain type, or only function declarations, or only data-constructor-matching patterns, etc.
You can generate expressions that don't compile. You generated an expression that references a free variable foo that doesn't exist? Tough luck, you'll only see that when actually using your code generator, and only under the circumstances that trigger the generation of that particular code. It is very difficult to unit test, too.
TH is also outright dangerous:
Code that runs at compile-time can do arbitrary IO, including launching missiles or stealing your credit card. You don't want to have to look through every cabal package you ever download in search for TH exploits.
TH can access "module-private" functions and definitions, completely breaking encapsulation in some cases.
Then there are some problems that make TH functions less fun to use as a library developer:
TH code isn't always composable. Let's say someone makes a generator for lenses, and more often than not, that generator will be structured in such a way that it can only be called directly by the "end-user," and not by other TH code, by for example taking a list of type constructors to generate lenses for as the parameter. It is tricky to generate that list in code, while the user only has to write generateLenses [''Foo, ''Bar].
Developers don't even know that TH code can be composed. Did you know that you can write forM_ [''Foo, ''Bar] generateLens? Q is just a monad, so you can use all of the usual functions on it. Some people don't know this, and because of that, they create multiple overloaded versions of essentially the same functions with the same functionality, and these functions lead to a certain bloat effect. Also, most people write their generators in the Q monad even when they don't have to, which is like writing bla :: IO Int; bla = return 3; you are giving a function more "environment" than it needs, and clients of the function are required to provide that environment as an effect of that.
Finally, there are some things that make TH functions less fun to use as an end-user:
Opacity. When a TH function has type Q Dec, it can generate absolutely anything at the top-level of a module, and you have absolutely no control over what will be generated.
Monolithism. You can't control how much a TH function generates unless the developer allows it; if you find a function that generates a database interface and a JSON serialization interface, you can't say "No, I only want the database interface, thanks; I'll roll my own JSON interface"
Run time. TH code takes a relatively long time to run. The code is interpreted anew every time a file is compiled, and often, a ton of packages are required by the running TH code, that have to be loaded. This slows down compile time considerably.
This is solely my own opinion.
It's ugly to use. $(fooBar ''Asdf) just does not look nice. Superficial, sure, but it contributes.
It's even uglier to write. Quoting works sometimes, but a lot of the time you have to do manual AST grafting and plumbing. The API is big and unwieldy, there's always a lot of cases you don't care about but still need to dispatch, and the cases you do care about tend to be present in multiple similar but not identical forms (data vs. newtype, record-style vs. normal constructors, and so on). It's boring and repetitive to write and complicated enough to not be mechanical. The reform proposal addresses some of this (making quotes more widely applicable).
The stage restriction is hell. Not being able to splice functions defined in the same module is the smaller part of it: the other consequence is that if you have a top-level splice, everything after it in the module will be out of scope to anything before it. Other languages with this property (C, C++) make it workable by allowing you to forward declare things, but Haskell doesn't. If you need cyclic references between spliced declarations or their dependencies and dependents, you're usually just screwed.
It's undisciplined. What I mean by this is that most of the time when you express an abstraction, there is some kind of principle or concept behind that abstraction. For many abstractions, the principle behind them can be expressed in their types. For type classes, you can often formulate laws which instances should obey and clients can assume. If you use GHC's new generics feature to abstract the form of an instance declaration over any datatype (within bounds), you get to say "for sum types, it works like this, for product types, it works like that". Template Haskell, on the other hand, is just macros. It's not abstraction at the level of ideas, but abstraction at the level of ASTs, which is better, but only modestly, than abstraction at the level of plain text.*
It ties you to GHC. In theory another compiler could implement it, but in practice I doubt this will ever happen. (This is in contrast to various type system extensions which, though they might only be implemented by GHC at the moment, I could easily imagine being adopted by other compilers down the road and eventually standardized.)
The API isn't stable. When new language features are added to GHC and the template-haskell package is updated to support them, this often involves backwards-incompatible changes to the TH datatypes. If you want your TH code to be compatible with more than just one version of GHC you need to be very careful and possibly use CPP.
There's a general principle that you should use the right tool for the job and the smallest one that will suffice, and in that analogy Template Haskell is something like this. If there's a way to do it that's not Template Haskell, it's generally preferable.
The advantage of Template Haskell is that you can do things with it that you couldn't do any other way, and it's a big one. Most of the time the things TH is used for could otherwise only be done if they were implemented directly as compiler features. TH is extremely beneficial to have both because it lets you do these things, and because it lets you prototype potential compiler extensions in a much more lightweight and reusable way (see the various lens packages, for example).
To summarize why I think there are negative feelings towards Template Haskell: It solves a lot of problems, but for any given problem that it solves, it feels like there should be a better, more elegant, disciplined solution better suited to solving that problem, one which doesn't solve the problem by automatically generating the boilerplate, but by removing the need to have the boilerplate.
* Though I often feel that CPP has a better power-to-weight ratio for those problems that it can solve.
EDIT 23-04-14: What I was frequently trying to get at in the above, and have only recently gotten at exactly, is that there's an important distinction between abstraction and deduplication. Proper abstraction often results in deduplication as a side effect, and duplication is often a telltale sign of inadequate abstraction, but that's not why it's valuable. Proper abstraction is what makes code correct, comprehensible, and maintainable. Deduplication only makes it shorter. Template Haskell, like macros in general, is a tool for deduplication.
I'd like to address a few of the points dflemstr brings up.
I don't find the fact that you can't typecheck TH to be that worrying. Why? Because even if there is an error, it will still be caught at compile time. I'm not sure if this strengthens my argument, but this is similar in spirit to the errors that you receive when using templates in C++. I think these errors are more understandable than C++'s errors though, as you'll get a pretty-printed version of the generated code.
If a TH expression / quasi-quoter does something that's so advanced that tricky corners can hide, then perhaps it's ill-advised?
I break this rule quite a bit with quasi-quoters I've been working on lately (using haskell-src-exts / meta) - https://github.com/mgsloan/quasi-extras/tree/master/examples . I know this introduces some bugs such as not being able to splice in the generalized list comprehensions. However, I think that there's a good chance that some of the ideas in http://hackage.haskell.org/trac/ghc/blog/Template%20Haskell%20Proposal will end up in the compiler. Until then, the libraries for parsing Haskell to TH trees are a nearly perfect approximation.
Regarding compilation speed / dependencies, we can use the "zeroth" package to inline the generated code. This is at least nice for the users of a given library, but we can't do much better for the case of editing the library. Can TH dependencies bloat generated binaries? I thought it left out everything that's not referenced by the compiled code.
The staging restriction / splitting of compilation steps of the Haskell module does suck.
RE Opacity: This is the same for any library function you call. You have no control over what Data.List.groupBy will do. You just have a reasonable "guarantee" / convention that the version numbers tell you something about the compatibility. The difference is mainly a matter of when the change happens.
This is where using zeroth pays off - you're already versioning the generated files - so you'll always know when the form of the generated code has changed. Looking at the diffs might be a bit gnarly, though, for large amounts of generated code, so that's one place where a better developer interface would be handy.
RE Monolithism: You can certainly post-process the results of a TH expression, using your own compile-time code. It wouldn't be very much code to filter on top-level declaration type / name. Heck, you could imagine writing a function that does this generically. For modifying / de-monolithisizing quasiquoters, you can pattern match on "QuasiQuoter" and extract out the transformations used, or make a new one in terms of the old.
This answer is in response to the issues brought up by illissius, point by point:
It's ugly to use. $(fooBar ''Asdf) just does not look nice. Superficial, sure, but it contributes.
I agree. I feel like $( ) was chosen to look like it was part of the language - using the familiar symbol palette of Haskell. However, that's exactly what you /don't/ want in the symbols used for your macro splicing. They definitely blend in too much, and this cosmetic aspect is quite important. I like the look of {{ }} for splices, because they are quite visually distinct.
It's even uglier to write. Quoting works sometimes, but a lot of the time you have to do manual AST grafting and plumbing. The [API][1] is big and unwieldy, there's always a lot of cases you don't care about but still need to dispatch, and the cases you do care about tend to be present in multiple similar but not identical forms (data vs. newtype, record-style vs. normal constructors, and so on). It's boring and repetitive to write and complicated enough to not be mechanical. The [reform proposal][2] addresses some of this (making quotes more widely applicable).
I also agree with this, however, as some of the comments in "New Directions for TH" observe, the lack of good out-of-the-box AST quoting is not a critical flaw. In this WIP package, I seek to address these problems in library form: https://github.com/mgsloan/quasi-extras . So far I allow splicing in a few more places than usual and can pattern match on ASTs.
The stage restriction is hell. Not being able to splice functions defined in the same module is the smaller part of it: the other consequence is that if you have a top-level splice, everything after it in the module will be out of scope to anything before it. Other languages with this property (C, C++) make it workable by allowing you to forward declare things, but Haskell doesn't. If you need cyclic references between spliced declarations or their dependencies and dependents, you're usually just screwed.
I've run into the issue of cyclic TH definitions being impossible before... It's quite annoying. There is a solution, but it's ugly - wrap the things involved in the cyclic dependency in a TH expression that combines all of the generated declarations. One of these declarations generators could just be a quasi-quoter that accepts Haskell code.
It's unprincipled. What I mean by this is that most of the time when you express an abstraction, there is some kind of principle or concept behind that abstraction. For many abstractions, the principle behind them can be expressed in their types. When you define a type class, you can often formulate laws which instances should obey and clients can assume. If you use GHC's [new generics feature][3] to abstract the form of an instance declaration over any datatype (within bounds), you get to say "for sum types, it works like this, for product types, it works like that". But Template Haskell is just dumb macros. It's not abstraction at the level of ideas, but abstraction at the level of ASTs, which is better, but only modestly, than abstraction at the level of plain text.
It's only unprincipled if you do unprincipled things with it. The only difference is that with the compiler implemented mechanisms for abstraction, you have more confidence that the abstraction isn't leaky. Perhaps democratizing language design does sound a bit scary! Creators of TH libraries need to document well and clearly define the meaning and results of the tools they provide. A good example of principled TH is the derive package: http://hackage.haskell.org/package/derive - it uses a DSL such that the example of many of the derivations /specifies/ the actual derivation.
It ties you to GHC. In theory another compiler could implement it, but in practice I doubt this will ever happen. (This is in contrast to various type system extensions which, though they might only be implemented by GHC at the moment, I could easily imagine being adopted by other compilers down the road and eventually standardized.)
That's a pretty good point - the TH API is pretty big and clunky. Re-implementing it seems like it could be tough. However, there are really only a few ways to slice the problem of representing Haskell ASTs. I imagine that copying the TH ADTs and writing a converter to the internal AST representation would get you a good deal of the way there. This would be equivalent to the (not insignificant) effort of creating haskell-src-meta. It could also be simply re-implemented by pretty-printing the TH AST and using the compiler's internal parser.
While I could be wrong, I don't see TH as being that complicated of a compiler extension, from an implementation perspective. This is actually one of the benefits of "keeping it simple" and not having the fundamental layer be some theoretically appealing, statically verifiable templating system.
The API isn't stable. When new language features are added to GHC and the template-haskell package is updated to support them, this often involves backwards-incompatible changes to the TH datatypes. If you want your TH code to be compatible with more than just one version of GHC you need to be very careful and possibly use CPP.
This is also a good point, but somewhat dramatized. While there have been API additions lately, they haven't caused extensive breakage. Also, I think that with the superior AST quoting I mentioned earlier, the API that actually needs to be used can be very substantially reduced. If construction and matching no longer need distinct functions and are instead expressed as literals, then most of the API disappears. Moreover, the code you write would port more easily to AST representations for languages similar to Haskell.
In summary, I think that TH is a powerful, semi-neglected tool. Less hate could lead to a more lively eco-system of libraries, encouraging the implementation of more language feature prototypes. It's been observed that TH is an overpowered tool, that can let you /do/ almost anything. Anarchy! Well, it's my opinion that this power can allow you to overcome most of its limitations, and construct systems capable of quite principled meta-programming approaches. It's worth the usage of ugly hacks to simulate the "proper" implementation, as this way the design of the "proper" implementation will gradually become clear.
In my personal ideal version of nirvana, much of the language would actually move out of the compiler, into libraries of these variety. The fact that the features are implemented as libraries does not heavily influence their ability to faithfully abstract.
What's the typical Haskell answer to boilerplate code? Abstraction. What're our favorite abstractions? Functions and typeclasses!
Typeclasses let us define a set of methods, that can then be used in all manner of functions generic on that class. However, other than this, the only way classes help avoid boilerplate is by offering "default definitions". Now here is an example of an unprincipled feature!
Minimal binding sets are not declarable / compiler checkable. This could lead to inadvertent definitions that yield bottom due to mutual recursion.
Despite the great convenience and power this would yield, you cannot specify superclass defaults, due to orphan instances http://lukepalmer.wordpress.com/2009/01/25/a-world-without-orphans/ These would let us fix the numeric hierarchy gracefully!
Going after TH-like capabilities for method defaults led to http://www.haskell.org/haskellwiki/GHC.Generics . While this is cool stuff, in my only experience debugging code that used these generics it was nigh-impossible, due to the size of the type induced for an ADT as complicated as an AST. https://github.com/mgsloan/th-extra/commit/d7784d95d396eb3abdb409a24360beb03731c88c
In other words, this went after the features provided by TH, but it had to lift an entire domain of the language, the construction language, into a type system representation. While I can see it working well for your common problem, for complex ones, it seems prone to yielding a pile of symbols far more terrifying than TH hackery.
TH gives you value-level compile-time computation of the output code, whereas generics forces you to lift the pattern matching / recursion part of the code into the type system. While this does restrict the user in a few fairly useful ways, I don't think the complexity is worth it.
I think that the rejection of TH and Lisp-like metaprogramming led to the preference for things like method defaults instead of more flexible, macro-expansion-like declarations of instances. The discipline of avoiding things that could lead to unforeseen results is wise; however, we should not ignore that Haskell's capable type system allows for more reliable metaprogramming than in many other environments (by checking the generated code).
One rather pragmatic problem with Template Haskell is that it only works when GHC's bytecode interpreter is available, which is not the case on all architectures. So if your program uses Template Haskell or relies on libraries that use it, it will not run on machines with an ARM, MIPS, S390 or PowerPC CPU.
This is relevant in practice: git-annex is a tool written in Haskell that makes sense to run on machines that worry about storage, and such machines often have non-i386 CPUs. Personally, I run git-annex on an NSLU2 (32 MB of RAM, 266 MHz CPU; did you know Haskell works fine on such hardware?). If it used Template Haskell, this would not be possible.
(The situation about GHC on ARM is improving these days a lot and I think 7.4.2 even works, but the point still stands).
Why is TH bad? For me, it comes down to this:
If you need to produce so much repetitive code that you find yourself trying to use TH to auto-generate it, you're doing it wrong!
Think about it. Half the appeal of Haskell is that its high-level design allows you to avoid huge amounts of useless boilerplate code that you have to write in other languages. If you need compile-time code generation, you're basically saying that either your language or your application design has failed you. And we programmers don't like to fail.
Sometimes, of course, it's necessary. But sometimes you can avoid needing TH by just being a bit more clever with your designs.
(The other thing is that TH is quite low-level. There's no grand high-level design; a lot of GHC's internal implementation details are exposed. And that makes the API prone to change...)

A language in which everything compiles

I'm trying to do some research for a new project, and I need to create objects dynamically from random data.
For this to work, I need a language / compiler that doesn't have problems with weird uncompilable code lying around.
Basically, I need the random code to compile (or be interpreted) as much as possible - meaning that the uncompilable parts will be ignored, and only the compilable parts will create the objects (which could then be run).
Object Oriented-ness is not a must, but is a very strong advantage.
I thought of ASM, but it's very messy, and I'd probably need a more readable code
Thanks!
It sounds like you're doing something very much like genetic programming; even if you aren't, GP has to solve some of the same problems - using randomness to generate valid programs. The approach typically used is to work with a syntax tree: rather than storing x + y * 3 - 2, you store its syntax tree, with - at the root, whose children are the subtree for x + y * 3 and the leaf 2.
Then, instead of randomly changing the syntax, one can randomly change nodes in the tree instead. And if x should randomly change to, say, +, you can statically know that this means you need to insert two children (or not, depending on how you define +).
A good choice for a language to work with for this would be any Lisp dialect. In a Lisp, the above program would be written (- (+ x (* y 3)) 2), which is just a linearization of the syntax tree using parentheses to show depth. And in fact, Lisps expose this feature: you can just as easily work with the object '(- (+ x (* y 3)) 2) (note the leading quote). This is a three-element list, whose first element is -, second element is another list, and third element is 2. And, though you might or might not want it for your particular application, there's an eval function, such that (eval '(- (+ x (* y 3)) 2)) will take in the given list, treat it as a Lisp syntax tree/program, and evaluate it. This is what makes Lisps so attractive for doing this sort of work; Lisp syntax is basically a reification of the syntax-tree, and if you operate at the syntax-tree level, you can work on code as though it was a value. Lisp won't help you read /dev/random as a program directly, but with a little interpretation layered on top, you should be able to get what you want.
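As a tiny sketch of what "randomly changing nodes" can look like when the program is an s-expression (MUTATE, RANDOM-TERMINAL and *TERMINALS* are invented names):

(defparameter *terminals* '(x y 1 2 3))

(defun random-terminal ()
  (elt *terminals* (random (length *terminals*))))

(defun mutate (tree)
  "Walk down a random argument of TREE and replace the leaf it reaches."
  (if (atom tree)
      (random-terminal)
      (let ((i (1+ (random (length (rest tree))))))   ; never touch the operator
        (append (subseq tree 0 i)
                (list (mutate (nth i tree)))
                (subseq tree (1+ i))))))

;; (mutate '(- (+ x (* y 3)) 2))   might yield   (- (+ x (* y 1)) 2)
;; (eval (subst 4 'y (subst 7 'x (mutate '(- (+ x (* y 3)) 2)))))
;; substitutes numbers for x and y and evaluates the mutated program.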
I should also mention, though I don't know anything about it (not that I know much about ordinary genetic programming) the existence of linear genetic programming. This is sort of like the assembly model that you mentioned—a linear stream of very, very simple instructions. The advantage here would seem to be that if you are working with /dev/random or something like it, the amount of interpretation needed is very small; the disadvantage would be, as you mentioned, the low-level nature of the code.
I'm not sure if this is what you're looking for, but any programming language can be made to function this way. For any programming language P, define the language P_always as follows:
If p is a valid program in P, then p is a valid program in P_always whose meaning is the same as its meaning in P.
If p is not a valid program in P, then p is a valid program in P_always whose meaning is the same as a program that immediately terminates.
For example, I could make the language C++_always so that this program:
#include <iostream>
using namespace std;
int main() {
    cout << "Hello, world!" << endl;
}
would compile and print "Hello, world!", while this program:
Hahaha! This isn't legal C++ code!
would be a legal program that just does absolutely nothing.
To solve your original problem, just take any OOP language like Java, Smalltalk, etc. and construct the appropriate Java_always, Smalltalk_always, etc. language from it. Again, I'm not sure if this is at all what you're looking for, but it could be done very easily.
Alternatively, consider finding a grammar for any OOP language and then using that grammar to produce random syntactically valid programs. You could then filter those programs down by using the P_always version of that language to eliminate programs that are syntactically valid but not semantically valid.
Divide the ASCII byte values into 9 classes (division modulo 9 would help). Then assign them to Brainfuck codewords (see http://en.wikipedia.org/wiki/Brainfuck). Then interpret as Brainfuck.
There you go: any sequence of ASCII characters is a program. Not that it's going to do anything sensible... This approach has a much better chance, compared to templatetypedef's answer, of getting a nontrivial program from a random byte sequence.
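A sketch of that mapping in Common Lisp (BYTES->BRAINFUCK is an invented name; the ninth class becomes a no-op):

(defun bytes->brainfuck (bytes)
  "Map every byte onto one of Brainfuck's 8 commands (or a no-op space) via mod 9."
  (map 'string
       (lambda (b) (char "><+-.,[] " (mod b 9)))
       bytes))

;; (bytes->brainfuck #(72 101 108 108 111))  =>  ">+>>-"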
Text Editors
You could try feeding random character strings to an editor like Emacs or VI. Many (most?) characters will perform an editing action but some will do nothing (other than beep, perhaps). You would have to ensure that the random code mutator never generates the character sequence that exits the editor. However, this experience would be much like programming a Turing machine -- the code is not too readable.
Mathematica
In Mathematica, undefined symbols and other expressions evaluate to themselves, without error. So, that language might be a viable choice if you can arrange for the random code mutator to always generate well-formed expressions. This would be readily achievable since the basic Mathematica syntax is trivial, making it easy to operate on syntactic units rather than at the character level. It would be even easier if the mutator were written in Mathematica itself since expression-munging is Mathematica's forte. You could define a mini-language of valid operations within a Mathematica package that does not import the system-defined symbols. This would allow you to generate well-formed expressions to your heart's content without fear of generating a dangerous expression, like DeleteFile[FileNames["*.*", "/", Infinity]].
I believe Common Lisp should suit your needs. I always have some code in my SLIME/Emacs session that wouldn't compile. You can always tweak things, redefine functions in run-time. It is actually very good for prototyping.
A few years ago it took me quite a while to learn. But nowadays we have quicklisp and everything is so much easier.
Here I describe my development environment:
Install lisp on my linux machine
PS: I want to give an example, where Common Lisp was useful for me:
Up to maybe 2004 I used to write small programs in C (the keep it simple Unix way).
The last 3 years I had to get lots of different hardware running. Motorized stages, scientific cameras, IO cards.
The cameras turned out to be quite annoying. Usually you have to cool them down to -50 degrees Celsius or so and (in some SDKs) they don't like it when you close them. But this is exactly how my C development cycle worked: write (30s), compile (1s), run (0.1s), repeat.
Eventually I decided to just use Common Lisp. Often it is straight forward to define the foreign function interfaces to talk to the SDKs and I can do this without ever leaving the running Lisp image. I start the editor in the morning define the open-device function, to talk to the device and after 3 hours I have enough of the functions implemented to set gain, temperature, region of interest and obtain the video.
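To give a flavour of what those FFI definitions look like, here is a rough sketch using the CFFI library; the library name and both C function names are invented for illustration:

(cffi:define-foreign-library vendorcam
  (t (:default "libvendorcam")))
(cffi:use-foreign-library vendorcam)

;; Wrap two calls from the (hypothetical) camera SDK.
(cffi:defcfun ("cam_open" %cam-open) :pointer)
(cffi:defcfun ("cam_set_temperature" %cam-set-temperature) :int
  (handle :pointer)
  (celsius :double))

(defun open-device (&key (temperature -50.0d0))
  "Open the camera and cool it down, without leaving the running Lisp image."
  (let ((handle (%cam-open)))
    (%cam-set-temperature handle temperature)
    handle))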
Then I can often put the SDK manual away and just use the camera.
I used the same interactive programming approach when I have to parse some webpage or some weird XML.

Is there a language with native pass-by-reference/pass-by-name semantics, which could be used in modern production applications?

This is a reopened question.
I'm looking for a language, and a supporting platform for it, where the language could have pass-by-reference or pass-by-name semantics by default. I know the history a little: there were Algol and Fortran, and there still is C++, which could make it possible; but basically, what I am looking for is something more modern, where the mentioned value-passing methodology is preferred and the default (implicitly assumed).
I ask this question because, to my mind, some of the advantages of pass-by-ref/name seem kind of obvious. For example, when it is used in a standalone agent, copying of values is not necessary (to some extent) and performance wouldn't be degraded much in that case. So I could employ it in e.g. a rich client app, or some game-style or standalone service-kind application.
The main advantage to me is the clear separation between the identity of a symbol and its current value. I mean, when there is no redundant copying, you know that you're working with the exact symbol/path you have queried/received, and intrinsic boxing of values will not interfere with the actual logic of the program.
I know that there is C#'s ref keyword, but it's not so intrinsic, though acceptable. Equally, I realize that pass-by-reference semantics could be simulated in virtually any language (Java as an instant example) and so on... not sure about pass-by-name :)
What would you recommend - create a something like DSL for such needs wherever it be appropriate; or use some languages that I already know? Maybe, there is something that I'm missing?
Thank you!
UPDATE: Currently, I think that Haskell would be appropriate. But I didn't investigate much, so I think I'll update this text later.
Scala provides very flexible parameter passing semantics including real call-by-name:
def whileLoop(cond: => Boolean)(body: => Unit) {
  if (cond) {
    body
    whileLoop(cond)(body)
  }
}
And it really works:
var i = 10
whileLoop (i > 0) {
  println(i)
  i -= 1
}
Technical details:
Though all parameters are passed by value (and these are usually references) much like Java, the notation => Type will make Scala generate the required closures automatically in order to emulate call-by-name.
Note that there is lazy evaluation too.
lazy val future = evalFunc()
The interesting thing is that you have consistent strict call-by-value semantics but can selectively change these where you really need to - with nearly no syntactic overhead.
Haskell has call-by-need as its default (and indeed only) evaluation strategy.
Now, you asked for two things: call-by-name and modern. Well, Haskell is a pure language and in a pure language call-by-name and call-by-need are semantically the same thing, or more precisely they always have the same result, the only difference being that call-by-need is usually faster and at worst only a constant factor slower than call-by-name. And Haskell surely is a modern language: it is merely 23 years old and in many of its features it is actually 10 years ahead of many languages that were created just recently.
The other thing you asked about is call-by-reference. Again, in a pure language, call-by-value and call-by-reference are the same thing, except that the latter is faster. (Which is why, even though most functional languages are usually described as being call-by-value, they actually implement call-by-reference.)
Now, call-by-name (and by extension call-by-need) are not the same thing as call-by-value (and by extension call-by-reference), because call-by-name may return a result in cases where call-by-value doesn't terminate.
However, in all cases where call-by-value or call-by-reference terminates, in a pure language, call-by-value, call-by-reference, call-by-name and call-by-need are the same thing. And in cases where they are not the same thing, call-by-name and call-by-need are in some sense "better", because they give you an answer in cases where call-by-value and call-by-reference would basically have run into an infinite loop.
Ergo, Haskell is your answer. Although probably not the one you were looking for :-)
Pass-by-name is rare nowadays. However, you can simulate it in most functional programming languages by wrapping the argument in a nullary lambda (a thunk):
;; pass by value
(dosomething (random))
;; pass-by-name hack: wrap the argument in a thunk
(dosomething (lambda () (random)))
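In Common Lisp the same hack looks like this (MY-IF is an invented name):

;; Each "name" argument is wrapped in a zero-argument lambda (a thunk)
;; and only evaluated when FUNCALLed - so the untaken branch never runs.
(defun my-if (test then-thunk else-thunk)
  (if test (funcall then-thunk) (funcall else-thunk)))

;; (my-if (> 1 0)
;;        (lambda () 'taken)
;;        (lambda () (error "never evaluated")))
;; => TAKEN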
Other than that: ML and OCaml have a distinction between pass-by-value (the default), pass-by-ref (using ref variables) and, of course, using lambdas. However, I'm not sure either of them qualifies as a "modern" language.
I'm not quite following your reasoning for why C#'s ref and out modifiers aren't "intrinsic." Seems to me that it provides almost exactly what you're looking for: A modern language and environment that supports pass-by-value and pass-by-reference. (As Little Bobby Tables pointed out, pass-by-name is very rare these days, you're better off with a lambda/closure.)
AFAIK, modern Fortran is pass-by-reference (preserving compatibility with ye olde FORTRAN).
Modern Fortran has all the niceties you expect of a modular language, so you can build just fine systems in it. Nobody does, because "Fortran is passé" and everybody wants to code in C# "because it's cool".
In Java, all objects are effectively passed by reference (strictly speaking, object references are passed by value).

Can a language have Lisp's powerful macros without the parentheses?

Can a language have Lisp's powerful macros without the parentheses?
Sure; the question is whether the macros are convenient to use and how powerful they are.
Let's first look at how Lisp is slightly different.
Lisp syntax is based on data, not text
Lisp has a two-stage syntax.
A) first there is the data syntax for s-expressions
examples:
(mary called tim to tell him the price of the book)
(sin ( x ) + cos ( x ))
s-expressions are atoms, lists of atoms or lists.
B) second there is the Lisp language syntax on top of s-expressions.
Not every s-expression is a valid Lisp program.
(3 + 4)
is not a valid Lisp program, because Lisp uses prefix notation.
(+ 3 4)
is a valid Lisp program. The first element is a function - here the function +.
S-expressions are data
The interesting part is now that s-expressions can be read and then Lisp uses the normal data structures (numbers, symbols, lists, strings) to represent them.
Most other programming languages don't have a primitive representation for internalized source - other than strings.
Note that s-expressions here are not representing an AST (Abstract Syntax Tree). It's more like a hierarchical token tree coming out of a lexer phase. A lexer identifies the lexical elements.
The internalized source code now makes it easy to calculate with code, because the usual functions to manipulate lists can be applied.
Simple code manipulation with list functions
Let's look at the invalid Lisp code:
(3 + 4)
The program
(defun convert (code)
  (list (second code) (first code) (third code)))
(convert '(3 + 4)) -> (+ 3 4)
has converted an infix expression into the valid Lisp prefix expression. We can evaluate it then.
(eval (convert '(3 + 4))) -> 7
EVAL evaluates the converted source code. eval takes as input an s-expression, here a list (+ 3 4).
How to calculate with code?
Programming languages now have at least three choices to make source calculations possible:
base the source code transformations on string transformations
use a simple primitive data structure, as Lisp does. A more complex variant of this is a syntax based on XML: one could then transform XML expressions. There are other possible external formats combined with internalized data.
use a real syntax description format and represent the source code internalized as a syntax tree using data structures that represent syntactic categories. -> use an AST.
For all these approaches you will find programming languages. Lisp is more or less in camp 2. The consequence: it is theoretically not really satisfying and makes it impossible to statically parse source code (if the code transformations are based on arbitrary Lisp functions). The Lisp community has struggled with this for decades (see for example the myriad of approaches that the Scheme community has tried). Fortunately it is relatively easy to use, compared to some of the alternatives, and quite powerful. Variant 1 is less elegant. Variant 3 leads to a lot of complexity in simple AND complex transformations. It usually also means that the expression was already parsed with respect to a specific language grammar.
Another problem is HOW to transform the code. One approach would be based on transformation rules (like in some Scheme macro variants). Another approach would be a special transformation language (like a template language which can do arbitrary computations). The Lisp approach is to use Lisp itself. That makes it possible to write arbitrary transformations using the full Lisp language. In Lisp there is not a separate parsing stage, but at any time expressions can be read, transformed and evaluated - because these functions are available to the user.
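A small illustration of that last point, using only standard functions (SUBLIS simply renames the operators here):

;; Read a string into an s-expression, transform it, then evaluate it.
(eval
 (sublis '((plus . +) (times . *))
         (read-from-string "(plus 1 (times 2 3))")))
;; => 7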
Lisp is kind of a local maximum of simplicity for code transformations.
Other frontend syntax
Also note that the function read reads s-expressions to internal data. In Lisp one could either use a different reader for a different external syntax or reuse the Lisp built-in reader and reprogram it using the read macro mechanism - this mechanism makes it possible to extend or change the s-expression syntax. There are examples for both approaches to provide a different external syntax in Lisp.
For example there are Lisp variants which have a more conventional syntax, where code gets parsed into s-expressions.
Why is the s-expression-based syntax popular among Lisp programmers?
The current Lisp syntax is popular among Lisp programmers for two reasons:
1) the code-is-data idea makes it easy to write all kinds of code transformations based on the internalized data. There is also a relatively direct path from reading code, over manipulating code, to printing code. The usual development tools can be used.
2) the text editor can be programmed in a straightforward way to manipulate s-expressions. That makes basic code and data transformations in the editor relatively easy.
Originally Lisp was thought to have a different, more conventional syntax. There were several attempts later to switch to other syntax variants - but for some reasons it either failed or spawned different languages.
Absolutely. It's just a couple of orders of magnitude more complex if you have to deal with a complex grammar. As Peter Norvig noted:
Python does have access to the abstract syntax tree of programs, but this is not for the faint of heart. On the plus side, the modules are easy to understand, and with five minutes and five lines of code I was able to get this:
>>> parse("2 + 2")
['eval_input', ['testlist', ['test', ['and_test', ['not_test', ['comparison',
['expr', ['xor_expr', ['and_expr', ['shift_expr', ['arith_expr', ['term',
['factor', ['power', ['atom', [2, '2']]]]], [14, '+'], ['term', ['factor',
['power', ['atom', [2, '2']]]]]]]]]]]]]]], [4, ''], [0, '']]
This was rather a disappointment to me. The Lisp parse of the equivalent expression is (+ 2 2). It seems that only a real expert would want to manipulate Python parse trees, whereas Lisp parse trees are simple for anyone to use. It is still possible to create something similar to macros in Python by concatenating strings, but it is not integrated with the rest of the language, and so in practice is not done.
Since I'm not a super-genius (or even a Peter Norvig), I'll stick with (+ 2 2).
Here's a shorter version of Rainer's answer:
In order to have lisp-style macros, you need a way of representing source-code in data structures. In most languages, the only "source code data structure" is a string, which doesn't have nearly enough structure to allow you to do real macros on. Some languages offer a real data structure, but it's too complex, like Python, so that writing real macros is stupidly complicated and not really worth it.
Lisp's lists and parentheses hit the sweet spot in the middle. Just enough structure to make it easy to handle, but not too much so you drown in complexity. As a bonus, when you nest lists you get a tree, which happens to be precisely the structure that programming languages naturally adopt (nearly all programming languages are first parsed into an "abstract syntax tree", or AST, before being actually interpreted/compiled).
Basically, programming Lisp is writing an AST directly, rather than writing some other language that then gets turned into an AST by the computer. You could possibly forgo the parens, but you'd just need some other way to group things into a list/tree. You probably wouldn't gain much from doing so.
Parentheses are irrelevant to macros. It's just Lisp's way of doing things.
For example, Prolog has a very powerful macro mechanism called "term expansion". Basically, whenever Prolog reads a term T, it tries a special rule term_expansion(T, R). If it is successful, the content of R is interpreted instead of T.
Not to mention the Dylan language, which has a pretty powerful syntactic macro system, which features (among other things) referential transparency, while being an infix (Algol-style) language.
Yes. Parentheses in Lisp are used in the classic way, as a grouping mechanism. Indentation is an alternative way to express groups. E.g. the following structures are equivalent:
A ((B C) D)
and
A
B
C
D
Have a look at Sweet-expressions. Wheeler makes a very good case that the reason things like infix notation have not worked before is that typical notation also tries to add precedence, which then adds complexity, which causes difficulties in writing macros.
For this reason, he proposes infix syntax like {1 + 2 + 3} and {1 + {2 * 3}} (note the spaces between symbols), which are translated to (+ 1 2 3) and (+ 1 (* 2 3)) respectively. He adds that if someone writes {1 + 2 * 3}, it should become (nfx 1 + 2 * 3), which could be captured, if you really want to provide precedence, but would, as a default, be an error.
He also suggests that indentation should be significant, proposes that functions could be called as fn(A B C) as well as (fn A B C), would like data[A] to translate to (bracketaccess data A), and that the entire system should be compatible with s-expressions.
Overall, it's an interesting set of proposals I'd like to experiment with extensively. (But don't tell anyone at comp.lang.lisp: they'll burn you at the stake for your curiosity :-).
Code rewriting in Tcl in a manner recognizably similar to Lisp macros is a common technique. For example, this is (trivial) code that makes it easier to write procedures that always import a certain set of global variables:
proc gproc {name arguments body} {
    set realbody "global foo bar boo;$body"
    uplevel 1 [list proc $name $arguments $realbody]
}
With that, all procedures declared with gproc xyz rather than proc xyz will have access to the foo, bar and boo globals. The whole key is that uplevel takes a command and evaluates it in the caller's context, and list is (among other things) an ideal constructor for substitution-safe code fragments.
Erlang's parse transforms are similar in power to Lisp macros, though they are much trickier to write and use (they are applied to the entire source file, rather than being invoked on demand).
Lisp itself had a brief dalliance with non-parenthesised syntax in the form of M-expressions. It didn't take with the community, though variants of the idea found their way into modern Lisps, so you get Lisp's powerful macros without the parentheses ... in Lisp!
Yes, you can definitely have Lisp macros without all the parentheses.
Take a look at "sweet-expressions", which provides a set of additional abbreviations for traditional s-expressions. They add indentation, a way to do infix, and traditional function calls like f(x), but in a way that is backwards-compatible (you can freely mix well-formatted s-expressions and sweet-expressions), generic, and homoiconic.
Sweet-expressions were developed on http://readable.sourceforge.net and there is a sample implementation.
For Scheme there is a SRFI for sweet-expressions, SRFI-110: http://srfi.schemers.org/srfi-110/
No, it's not necessary. Anything that gives you some sort of access to a parse tree would be enough to allow you to manipulate the macro body in the same way as is done in Common Lisp. However, as the manipulation of the AST in Lisp is identical to the manipulation of lists (something that is bordering on easy in the Lisp family), it's possibly not nearly as natural without having the "parse tree" and "written form" be essentially the same.
I think this was not mentioned.
C++ templates are Turing-complete and perform processing at compile time.
There is the well-known expression templates mechanism that allows transformations, not from arbitrary code, but at least from the subset of C++ operators.
So imagine you have 3 vectors of 1000 elements and you must perform:
(A + B + C)[0]
You can capture this tree in an expression template and arbitrarily manipulate it at compile time.
With this tree, at compile time, you can transform the expression.
For example, if that expression means A[0] + B[0] + C[0] for your domain, you could avoid the normal C++ processing, which would be:
Add A and B, adding 1000 elements.
Create a temporary for the result, and add with the 1000 elements of C.
Index the result to get the first element.
And replace with another transformed expression template tree that does:
Capture A[0]
Capture B[0]
Capture C[0]
Add all 3 results together in the result to return with += avoiding temporaries.
It is not better than lisp, I think, but it is still very powerful.
Yes, it is certainly possible. Especially if it is still a Lisp under the bonnet:
http://www.meta-alternative.net/pfront.pdf
http://www.meta-alternative.net/pfdoc.pdf
Boo has a nice "quoted" macro syntax that uses [| |] as delimiters, and has certain substitutions which are actually verified syntactically by the compiler pipeline using $variables. While simple and relatively painless to use, it's much more complicated to implement on the compiler side than s-expressions. Boo's solution may have a few limitations that haven't affected my own code. There's also an alternate syntax that reads more like ordinary OO code, but that falls into the "not for the faint of heart" category like dealing with Ruby or Python parse trees.
Javascript's template strings offer yet another approach to this sort of thing. For instance, Mark S. Miller's quasiParserGenerator implements a grammar syntax for parsers.
Go ahead and enter the Elixir programming language.
Elixir is a functional programming language that feels like Lisp with respect to macros, but it comes dressed in Ruby's clothes and runs on top of the Erlang VM.
For those who do not like the parenthesis, but wish their language has powerful macros, Elixir is a great choice.
You can write macros in R (which has a more Algol-like syntax) that have a notion of delayed expressions, as in Lisp macros. You can call substitute() or quote() to avoid evaluating the delayed expression and instead get the actual expression and traverse its source code, as in Lisp. Even the structure of the expression's source code is Lisp-like: operators are the first item in the list. For example, input$foo - getting property foo from list input - is, as an expression, written as ['$', 'input', 'foo'], just like in Lisp.
You can check the ebook Metaprogramming in R, which also shows how to create macros in R (not something you would normally do, but it's possible). It's based on the 2001 article Programmer's Niche: Macros in R, which explains how to write Lisp-style macros in R.
