What do I need to learn to build an interpreter? - programming-languages

For my AQA A2-level Computing project, I've decided to create a basic interpreted programming language, outputting to Console. I don't know how to build an interpreter. I have a copy of the purple dragon book, which is all about compiler design, and user166390 said in an answer to this question that the initial steps of building a compiler are the same as those of building an interpreter. My question is: is this true?
Can I use the techniques described in the dragon book to write an interpreter? And if so, which steps do I need to use and learn how to use?
Do I need to write a lexical analyser, a syntax analyser, a semantic analyser and an intermediate code generator, for example?
Could I get away with writing a basic parser that reads each line of the source code, parses it, and executes the instruction straight away, or is that a notoriously bad idea?

Yes, you can use the techniques described in the dragon book to write an interpreter.
You need a lexical analyzer and a parser regardless.
As others have pointed out, you do need to write the code to do actual execution -- but for a simple interpreter, this can be essentially the same as the syntax-directed translation described in the dragon book.
Everything else is optional.
If you want to skip straight from the parser to execution, you can. That will leave you with a very simple language, which can be both good and bad -- look at Tcl for an example of such a language.
If you want to interpret each line as you parse it, you can do that, too; this is what most command-line interpreters (Unix shells, Microsoft's cmd.exe and PowerShell) do, as well as interactive REPLs (Read-Eval-Print Loops) for languages like Python and Ruby.
"Semantic analyzer" seems vague to me, but sounds like it should include most kinds of load-time consistency checks. This is also optional, but there are advantages in an interpreter that won't take any old garbage and try to execute it as a program...
"Intermediate code" is also kind of vague, but it is arguably optional. If you aren't executing directly from the program string (as in Tcl), you need some kind of internal representation to store your code once you've read it in. One popular option is to execute from an internal tree structure, based more or less closely on your parse tree, which is arguably distinct from producing "intermediate code". On the other hand, if your "intermediate code" could be written out more or less directly from your internal tree structure, you might as well count the internal structure as your "intermediate code".
There are important issues that you haven't addressed; one that stands out is: how do you want to handle names? Presumably you will want the programmer to be able to define and use his own names (e.g., for variables, functions, and so forth), so you will need to implement some kind of mechanism for that.
Exactly how names are handled is a big design decision, with major implications for the usability and implementability of your language. The simplest option for implementation is to use a single, global hash map to implement a single, global namespace -- but note that this choice has well-known usability problems...
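To make the "lexer, parser, execute as you parse" route concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the toy language (assignments and a print statement over integer arithmetic), the class name LineInterpreter and the tokenize helper are all invented for this example and come from neither the question nor the dragon book. Each line is tokenized, parsed by recursive descent, and evaluated during the parse, with names kept in a single global hash map.

```python
import re

# One regex alternative per token kind: number, name, or any other single character.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(line):
    """Lexical analysis: turn one source line into (kind, value) pairs."""
    tokens = []
    for number, name, other in TOKEN.findall(line):
        if number:
            tokens.append(("num", int(number)))
        elif name:
            tokens.append(("name", name))
        elif other.strip():            # skip stray whitespace
            tokens.append(("op", other))
    tokens.append(("end", None))       # sentinel so the parser never runs off the end
    return tokens

class LineInterpreter:
    """Hypothetical toy interpreter: parses and executes one line at a time."""

    def __init__(self):
        self.names = {}                # the single, global namespace

    def execute(self, line):
        self.tokens, self.pos = tokenize(line), 0
        first = self.tokens[0]
        if first[0] == "end":          # blank line
            return None
        if first == ("name", "print"):
            self.pos, target = 1, None
        elif first[0] == "name" and self.tokens[1] == ("op", "="):
            self.pos, target = 2, first[1]
        else:
            raise SyntaxError(f"cannot parse: {line!r}")
        value = self.expr()
        if self.tokens[self.pos][0] != "end":
            raise SyntaxError(f"unexpected trailing input in: {line!r}")
        if target is None:
            print(value)
            return value
        self.names[target] = value
        return None

    # Grammar: expr := term (('+'|'-') term)* ; term := factor ('*' factor)*
    #          factor := NUMBER | NAME | '(' expr ')'
    def expr(self):
        value = self.term()
        while self.tokens[self.pos] in (("op", "+"), ("op", "-")):
            op = self.tokens[self.pos][1]
            self.pos += 1
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.tokens[self.pos] == ("op", "*"):
            self.pos += 1
            value *= self.factor()
        return value

    def factor(self):
        kind, text = self.tokens[self.pos]
        self.pos += 1
        if kind == "num":
            return text
        if kind == "name":
            if text not in self.names:
                raise NameError(f"undefined name: {text}")
            return self.names[text]
        if (kind, text) == ("op", "("):
            value = self.expr()
            if self.tokens[self.pos] != ("op", ")"):
                raise SyntaxError("expected ')'")
            self.pos += 1
            return value
        raise SyntaxError(f"unexpected token: {text!r}")

interp = LineInterpreter()
for source_line in ["x = 2 + 3 * 4", "print (x - 4) * 2"]:
    interp.execute(source_line)
```

Because values are computed inside the parsing functions, there is no tree and no intermediate code. That is also why this shape gets awkward once the language needs loops or function definitions: a loop body has to be parsed again on every iteration unless it is stored in some internal representation.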

Could I get away with writing a basic parser that reads source code and executes the steps straight away?
You could but you'd be doing it the hard way.
Do I need to write a lexical analyser, a syntax analyser, a semantic analyser and an intermediate code generator, for example?
You can skip intermediate code generation unless you want to write a VM-based interpreter. Perl, for example, used to execute its parse graph directly; this is in contrast with Java or Python, which produce intermediate byte code.
The interpreter part of a VM-based language is generally simpler than an interpreter that has to understand a parse graph (so each component in the system is simpler); however, the whole interpreter stack is generally simpler when you don't need to define an intermediate bytecode language. So pick your poison.
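The trade-off can be seen in a small Python sketch. Everything here is an assumption made for illustration: the tuple-based tree format, the two-instruction bytecode (PUSH and APPLY) and the function names are invented, and are not how Perl, Java or Python actually represent code. The first function executes the tree directly; the other two compile the same tree to a flat instruction list and run it on a stack machine.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# A tree node is either a number or a tuple (operator, left, right).

def walk(node):
    """Tree-walking interpreter: recursion follows the shape of the tree."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](walk(left), walk(right))
    return node

def compile_tree(node, code=None):
    """Intermediate code generation: flatten the tree into postfix instructions."""
    code = [] if code is None else code
    if isinstance(node, tuple):
        op, left, right = node
        compile_tree(left, code)
        compile_tree(right, code)
        code.append(("APPLY", op))
    else:
        code.append(("PUSH", node))
    return code

def run_vm(code):
    """Stack VM: a flat loop over instructions, with no knowledge of the tree."""
    stack = []
    for instruction, argument in code:
        if instruction == "PUSH":
            stack.append(argument)
        else:  # "APPLY"
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[argument](left, right))
    return stack.pop()

tree = ("+", 1, ("*", 2, 3))
bytecode = compile_tree(tree)
print(walk(tree), run_vm(bytecode))
```

The VM loop is the simplest single piece, but it only exists together with the compiler and the instruction set definition; the tree walker is one function and needs neither.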


Creating libraries from machine readable specifications in Haskell

I have a specification and I wish to transform it into a library. I can write a program that writes out Haskell source. However, is there a cleaner way that would allow me to compile the specification directly (perhaps using templates)?
References to manuals and tutorials would be greatly appreciated.
Yes, you can use Template Haskell. There are a couple of approaches to using it.
One approach is to use quasiquotation to embed (parts of) the text of the specification in a quasiquotation within a source file. To implement it, you need to write a parser of the machine specification that outputs Haskell AST. This might be useful if the specification is relatively static, it makes sense to have subsets of the specification, or you want to manually map parts of the specification to different modules. This may also be useful, in addition to a different approach perhaps, to provide tools for users of the library to express things in terms of the specification.
Another approach is to execute IO in a normal Template Haskell splice. This would allow you to read the specification from a file (see addDependentFile too in this case), the network (don't do this), or to execute an arbitrary program to produce the Haskell AST needed. This might be more useful if the specification changes more often, or you want to keep a strict separation between the specification and code.
If it's much easier to produce Haskell source than Haskell AST, you can use a library like haskell-src-meta which will parse a string into Template Haskell AST.

How should I make my parser concurrent?

I'm working on implementing a music programming language parser in Clojure. The idea is that you run the parser program with a text file as a command-line argument; the text file contains code in this music language I'm developing; the parser interprets the code and figures out what "instrument instances" have been declared, and for each instrument instance, it parses the code and returns a sequence of musical "events" (notes, chords, rests, etc.) that the instrument does. So before that last step, we have multiple strings of "music code," one string per instrument instance.
I'm somewhat new to Clojure and still learning the nuances of how to use reference types and threads/concurrency. My parser is going to be doing some complex parsing, so I figured it would benefit from using concurrency to boost performance. Here are my questions:
The simplest way to do this, it seems, would be to save the concurrency for after the instruments are "split up" by the initial parse (a single-thread operation), then parse each instrument's code on a different thread at the same time (rather than wait for each instrument to finish parsing before moving onto the next). Am I on the right track, or is there a more efficient and/or logical way to structure my "concurrency plan"?
What options do I have for how to implement this concurrent parsing, and which one might work the best, either from a performance or a code maintenance standpoint? It seems like it could be as simple as: (map #(future (process-music-code %)) instrument-instances), but I'm not sure if there is a better way to do it like with an agent, or manual threads via Java interop, or what. I'm new to concurrent programming, so any input on different ways to do this would be great.
From what I've read, it seems that Clojure's reference types play an important role in concurrent programming, and I can see why, but is it always necessary to use them when working with multiple threads? Should I worry about making some of my data mutable? If so, what in particular should be mutable in the code for the parser I'm writing? and what reference type(s) would be best suited for what I'm doing? The nature of the way my program will work (user runs the program with a text file as an argument -- program processes it and turns it into audio) makes it seem like I don't need anything to be mutable, since the input data never changes, so my gut tells me I won't need to use any reference types, but then again, I might not fully understand the relationship between reference types and concurrency in Clojure.
I would suggest that you might be distracting yourself from more important things (like working out the details of your music language) by premature optimization. It would be better to write the simplest, easiest-to-code parser which you can first, to get up and running. If you find it too slow, then you can look at how to optimize for better performance.
The parser should be fairly self-contained, and will probably not take a whole lot of code anyways, so even if you later throw it out and rewrite it, it will not be a big loss. And the experience of writing the first parser will help if and when you write the second one.
Other points:
You are absolutely right about reference types -- you probably won't need any. Your program is a compiler -- it takes input, transforms it, writes output, then exits. That is the ideal situation for pure functional programming, with nothing mutable and all flow of data going purely through function arguments and return values.
Using a parser generator is usually the quickest way to get a working parser, but I haven't found a really good parser generator for Clojure. Parsley has a really nice API, but it generates LR(0) parsers, which are almost useless for anything which does not have clear, unambiguous markers for the beginning/end of each "section". (Like the way S-expressions open and close with parens.) There are a couple parser combinator libraries out there, like squarepeg, but I don't like their APIs and prefer to write my own hand-coded, recursive-descent parsers using my own implementation of something like parser combinators. (They're not fast, but the code reads really well.)
I can only support Alex D's point that writing parsers is an excellent exercise. You should definitely do it in C one time. From my own experience, it's a lot of debugging training at least.
Aside from that, given that you are in the beautiful world of Clojure notice the following:
Your parser will transform ordinary strings to data structures, like
{:command :declare,
:args {:name "bazooka-violin",
...},
...}
In Clojure you can read such data structures easily from EDN files. Possibly it would be a more valuable approach to play around with finding suitable structures directly before you constrain the syntax of your language too much for it to be flexible for later changes in the way your language works.
Don't ever think about writing for performance. Unless your user describes the collected works of Bach in a file, it's unlikely that it will take more than a second to parse.
If you write your interpreter in a functional, modular and concise way, it should be easy to decompose it into steps that can be parallelized using various techniques from pmap to core.reducers. The same of course goes for all other code and your parser as well (if multi-threading is a necessity there).
Even Clojure is not compiled in parallel. However it supports recompilation (on the JVM) which in contrast is a way more valuable feature to think about.
As an aside, I've been reading The Joy of Clojure, and I just learned that there is a nifty clojure.core function called pmap (parallel map) that provides a nice, easy way to perform an operation in parallel on a sequence of data. Its syntax is just like map, but the difference is that it performs the function on each item of the sequence in parallel and returns a lazy sequence of the results! This can give a performance boost, but only when the work done per item outweighs the cost of coordinating the parallel results, so whether or not pmap helps will depend on the situation.
At this stage in my MPL parser, my plan is to map a function over a sequence of instruments/music data, transforming each instrument's music data from a parse tree into audio. I have no idea how costly this transformation will be, but if it turns out that it takes a while to generate the audio for each instrument individually, I suppose I could try changing my map to pmap and see if that improves performance.
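The plan discussed in this thread (split the instruments in a single-threaded first pass, then run one pure function per instrument, sequentially or in parallel) can be sketched as follows. This is Python rather than Clojure, purely to show the shape; the process_music_code body, the "r means rest" convention and the instrument data are hypothetical stand-ins, not the actual music language. ThreadPoolExecutor.map plays roughly the role that pmap plays in Clojure.

```python
from concurrent.futures import ThreadPoolExecutor

def process_music_code(code):
    """Hypothetical stand-in for the per-instrument parser.

    It is a pure function: the result depends only on the argument, and
    nothing shared is mutated, so no locks or reference types are needed.
    """
    return [
        {"type": "rest"} if token == "r" else {"type": "note", "pitch": token}
        for token in code.split()
    ]

def parse_all(instruments, parallel=False):
    """Map the parser over every instrument's code, optionally in parallel."""
    names = list(instruments)
    sources = [instruments[name] for name in names]
    if parallel:
        with ThreadPoolExecutor() as pool:
            # pool.map returns results in input order, like map itself.
            results = list(pool.map(process_music_code, sources))
    else:
        results = [process_music_code(source) for source in sources]
    return dict(zip(names, results))

instruments = {"piano": "c d r e", "cello": "g r"}
print(parse_all(instruments, parallel=True) == parse_all(instruments))
```

Switching between the two modes is a one-argument change precisely because the worker function is pure, which is the point made above about not needing anything mutable. Note that in standard CPython a thread pool mostly demonstrates the structure; CPU-bound parsing would need a process pool to run truly in parallel.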

APL readability

I have to code in APL. Since the code is going to be maintained for a long time, I am wondering if there are some papers/books which contain heuristics/tips/samples to help in designing clean and readable APL programs.
It is a different experience from coding in other programming languages. Making a function small, for example, will not help: such a function can contain one line of code which is completely incomprehensible.
First, welcome to the wonderful world of APL.
Writing readable and maintainable APL code is not much different than writing readable and maintainable code in any language. Any good book on writing clean code is as applicable to APL as any other language, perhaps even more so. I recommend Clean Code by Robert C. Martin.
Consider the guideline in this book that all code in a function should be at the same level of abstraction. This applies to APL 100 times over. For example, if you have a function named DoThisBigTask it should have very few APL primitive symbols in it, and certainly no long complex one-liners. It should just be series of calls to other, lower level functions. If these higher-level functions are all well-named and well-defined, the general drift should be easily determined by someone who does not even know APL. The lowest level functions will be nothing but primitives and will be inscrutable to the non-APLer. Depending on how they are written they may even initially appear inscrutable to a seasoned APLer. However, these low level functions should be short, have no side effects, and can easily be re-written rather than modified if the maintaining programmer is unable to understand the original coding technique.
In general, keep your functions short, well-named, well-defined, and to the point. And keep the lines of code even shorter. It is much more important to have well-defined and well-documented functions than it is to have well-written or well-documented lines of code.
Since you asked for books and other references, I can suggest:
APL2 in Depth by Norman D. Thomson and Raymond P. Polivka. I worked with Ray Polivka for years and he was one of the best APL teachers I have ever known.
The classic A. P. L.: An Interactive Approach by Leonard Gilman and Allen J. Rose is good for the core language, but is rather outdated and doesn't contain much that is truly relevant on readability.
APL 2 at a Glance by James A. Brown and Sandra Pakin serves in some ways as an update to Gilman and Rose. It covers nested operations and other updates to APL, but has not much specifically directed at readability. Still, if you follow the examples here you will be writing readable code.
APL is Easy by STSC and Jerry R. Turner is an intro directed specifically at the APL*Plus line. Again, not much specifically on readability, but the models are generally well-designed readable code.
Mastering Dyalog APL: A Complete Introduction to Dyalog APL by Bernard Legrand is quite good if you are specifically working in Dyalog APL, not so much if you are working in one of the other versions such as APL*Plus (from APL2000).
It is my view that the reputation of APL as a "write-only language" is much overstated. One does need to get used to the primitives and the symbols used to represent them. But then one needs to get used to the syntax and the various library functions in many other language environments. I have seen convoluted code in C, C++, and Java as hard to follow as any APL. Of course, it isn't good C, C++, or Java, even if it is clever.
Some advice:
Writing 'one-liners' is a way to test one's mastery of the language, but is very poor practice for production code.
Comment to make the algorithm and especially the data structure being used clear. As with any code, comments should add something that cannot be easily read from the code itself, or call attention to complex or obscure code.
If possible avoid obscure code so there is no need to explain it. It is usually possible.
Make each function do one and only one job, with a clear interface.
Avoid global variables for the most part, and document any that are needed.
Document the interface, purpose, and effect of any function at the top. Make utilities black boxes without side-effects if possible. If side-effects are essential, document those as part of the interface.
Develop a standard header comment structure.
Dynamic code built on-the-fly can add flexibility to a solution, but is often much harder to debug if problems occur. Make such code bullet-proof to the extent you can, and build in optional logging to help when it turns out to have problems anyway.
You can use an OOP-like style if you wish. But there is no need to do so. If you do, it should IMO be used fairly pervasively through an application, except perhaps for low-level utilities. But OOP-style code can be at least as convoluted as non-OOP code, and APL doesn't have built-in inheritance or other OOP-supporting syntax.
(I'll use here "A" instead of comment, "'" instead of symbol sign.)
Well, I was developing APL for a year, I have only used Aplusdev.org.
You don't even need more. The trick is to try to think OOP-like. You should have -- if I remember well -- structured fields used as class data, sth like {'attribute1 'attribute2, {value,value2}}, so you can easily pick them out like obj.attribute1 in c++.
(here 'attribute Pick object, use only in class functions :) )
Moreover, use namespaced functions:
namespace_classname.method(this, arg1)
namespace_classname._private_method(this, arg1, arg2)
and lots of simple tool functions instead of nifty, long lines. The performance drop is not substantial, you can optimize later for say arrays once you see something could be faster.
And before anything: think matlab and mathematica without for loops! :) It helps a lot.
My suggestions for robust, maintainable code:
use extensive set of utility functions instead of trickery with those unreadable symbols to make your code always to the point.
try-catch blocks there is a built in exception handling, which can be utilized here,
try_begin();
A tried code, maybe in extra brackets not to forget try_end() at the end.
try_end();
catch(sth, function_here);
can be nicely implemented. (You'll see, catching errors is very important)
crude type checking : implement a standard and use for not-so-many times called functions... (you can put a function with flexible parameters right after a function definition)
Syntax:
function(point2i, ch):
{
typecheck({{'int, [1 2]}, 'char}); A do some assertions in typecheck...
// your function goes here
}
lambda functions can be very effective, you can do some reflections to achieve lambdas.
always declare returns with saying "return"!
Unit tests based on try-catch testing each and every function you write.
I also used a lot of 'apply' and 'map' from mathematica, implementing my own version, they are very-very effective here.
I wrote matlab thinking since you can here have a list of structured fields (=class data) in a variable. You will write lots of those if you wanna keep things for-loop-less (and you wanna, trust me). For that you need to have a standard naming convention say indicate with plurals:
namespace_class.method(objects, arg1, arg2)
To the end: also, I wrote inputBox and messageBox like the ones in Javascript or VisualBasic, they will make very easy hacking together simple tools or checking states. The only catch of messageBox, that it can't put the function-flow on hold,
so you need
AA documentation of f1
f1():
{
A do sth
msgbox.call("Hi there",{'Ok, {'f2}});
}
f2():
{
A continue doing stuff
}
You can write auto-docs in bash with a gawk/sed combination to put it into a webpage.
Also creating HTML formatted code helps in printing. ;)
I hope this was a good outline for a proper build-up. Before writing your own tools, try to dig up the available tools from the legacy codebase... functions have often been implemented as many as four times under different names, because of the mess at the time.

Functional programming languages introspection

I'm sketching a design of something (machine learning of functions) that will preferably want a functional programming language, and also introspection, specifically the ability to examine the program's own code in some nicely tractable format, and preferably also the ability to get machine generated code compiled at runtime, and I'm wondering what's the best language to write it in. Lisp of course has strong introspection capabilities, but the statically typed languages also have advantages; the ones I'm considering are:
F# - the .Net platform has a good story here, you can read byte code at run time and also emit byte code and get it compiled; I assume there's no problem accessing these facilities from F#.
Haskell, Ocaml - do these have similar facilities, either via byte code or parse tree?
Are there other languages I should also be looking at?
Haskell's introspection mechanism is Template Haskell, which supports compile time metaprogramming, and when combined with e.g. llvm, provides runtime metaprogramming facilities.
Ocaml has:
Camlp4 to manipulate Ocaml concrete syntax trees in Ocaml. The maintained implementation of Camlp4 is Camlp5.
MetaOCaml for full-scale multi-stage programming.
Ocamljit to generate native code at run time, but I don't think it's been maintained recently.
Ocaml-Java to compile Ocaml code for the Java virtual machine. I don't know if there are nice reflection capabilities.
Not really an answer, but note also the F# Quotations feature and library, for more homoiconicity stuff.
You might check out the typed variant of Racket (previously known as PLT Scheme). It retains most of the syntactic simplicity of Scheme, but provides a static type system. Since Racket is a Scheme, metaprogramming is par for the course, and the runtime can emit native code by way of a JIT.
The Haskell approach would be more along the lines of parsing the source. The Haskell Platform includes a complete source parser, or you can use the GHC API to get access that way.
I'd also look at Scala or Clojure which come with them all the libraries that have been developed for Java. You'll never need to worry if a library does not exist. But more to the point of your question, these languages give you the same reflection (or more powerful types) that you will find within Java.
I'm sketching a design of something (machine learning of functions) that will preferably want a functional programming language, and also introspection, specifically the ability to examine the program's own code in some nicely tractable format, and preferably also the ability to get machine generated code compiled at runtime, and I'm wondering what's the best language to write it in. Lisp of course has strong introspection capabilities, but the statically typed languages also have advantages; the ones I'm considering are:
Can you not just parse the source code like an ordinary interpreter or compiler? Why do you need introspection?
F# - the .Net platform has a good story here, you can read byte code at run time and also emit byte code and get it compiled; I assume there's no problem accessing these facilities from F#.
F# has a rudimentary quotation mechanism but you can only quote some expressions and not other kinds of code, most notably type definitions. Also, its evaluation mechanism is orders of magnitude slower than genuine compilation so it is basically completely useless. You can use reflection to analyze type definitions but, again, it is quite rudimentary.
You can read byte code but that has been compiled so a lot of information and structure has been lost.
F# also has lexing and parsing technology (most notably fslex, fsyacc and FParsec) but it is not as mature as OCaml's.
Haskell, Ocaml - do these have similar facilities, either via byte code or parse tree?
Haskell has Template Haskell but I've never heard of anyone using it (abandonware?).
OCaml has its Camlp4 macro system and a few people do use it but it is poorly documented.
As for lexing and parsing, Haskell has a few libraries (most notably Parsec) and OCaml has many libraries.
Are there other languages I should also be looking at?
Term rewrite languages like Mathematica would be an obvious choice because they make it trivial to manipulate code. The Pure language might be of interest.
You might also consider MetaOCaml for its run-time compilation capabilities.

How to create a language these days?

I need to get around to writing that programming language I've been meaning to write. How do you kids do it these days? I've been out of the loop for over a decade; are you doing it any differently now than we did back in the pre-internet, pre-windows days? You know, back when "real" coders coded in C, used the command line, and quibbled over which shell was superior?
Just to clarify, I mean, not how do you DESIGN a language (that I can figure out fairly easily) but how do you build the compiler and standard libraries and so forth? What tools do you kids use these days?
One consideration that's new since the punched card era is the existence of virtual machines already bountifully provided with "standard libraries." Targeting the JVM or the .NET CLR instead of ye olde "language walled garden" saves you a lot of bootstrapping. If you're creating a compiled language, you may also find Java byte code or MSIL an easier compile target than machine code (of course, if you're in this for the fun of creating a tight optimising compiler then you'll see this as a bug rather than a feature).
On the negative side, the idioms of the JVM or CLR may not be what you want for your language. So you may still end up building "standard libraries" just to provide idiomatic interfaces over the platform facility. (An example is that every language and its dog seems to provide its own method for writing to the console, rather than leaving users to manually call System.out.println or Console.WriteLine.) Nevertheless, it enables an incremental development of the idiomatic libraries, and means that the more obscure libraries for which you never get round to building idiomatic interfaces are still accessible even if in an ugly way.
If you're considering an interpreted language, .NET also has support for efficient interpretation via the Dynamic Language Runtime (DLR). (I don't know if there's an equivalent for the JVM.) This should help free you up to focus on the language design without having to worry so much about the optimisation of the interpreter.
I've written two compilers now in Haskell for small domain-specific languages, and have found it to be an incredibly productive experience. The parsec library makes playing with syntax easy, and interpreters are very simple to write over a Haskell data structure. There is a description of writing a Lisp interpreter in Haskell that I found helpful.
If you are interested in a high-performance backend, I recommend LLVM. It has a concise and elegant byte-code and the best x86/amd64 generating backend you can find. There is an optional garbage collector, and some experimental backends that target the JVM and CLR.
You can write a compiler in any language that produces LLVM bytecode. If you are adventurous enough to learn Haskell but want LLVM, there is a set of Haskell-LLVM bindings.
What has changed considerably but hasn't been mentioned yet is IDE support and interoperability:
Nowadays we pretty much expect Intellisense, step-by-step execution and state inspection "right in the editor window", new types that tell the debugger how to treat them and rather helpful diagnostic messages. The old "compile .x -> .y" executable is not enough to create a language anymore. The environment is nothing to focus on first, but affects willingness to adopt.
Also, libraries have become much more powerful, and no one wants to implement all that in yet another language. Try to borrow, make it easy to call existing code, and make it easy to be called by other code.
Targeting a VM - as itowlson suggested - is probably a good way to get started. If that turns out a problem, it can still be replaced by native compilers.
I'm pretty sure you do what's always been done.
Write some code, and show your results to the world.
As compared to the olden times, there are some tools to make your job easier though. Might I suggest ANTLR for parsing your language grammar?
Speaking as someone who just built a very simple assembly-like language and interpreter, I'd start out with the .NET framework or similar. Nothing can beat the powerful syntax of C# + the backing of the entire .NET community when attempting to write most things. From here I designed a simple bytecode format and assembly syntax and proceeded to write my interpreter + assembler.
Like I said, it was a very simple language.
You should not accept wimpy solutions like using the latest tools. You should bootstrap the language by writing a minimal compiler in Visual Basic for Applications or a similar language, then write all the compilation tools in your new language and then self-compile it using only the language itself.
Also, what is the proposed name of the language?
I think recently there have not been languages with ALL CAPITAL LETTER names like COBOL and FORTRAN, so I hope you will call it something like MIKELANG with all capital letters.
Not so much an implementation issue as a design decision which affects implementation: if you make every statement of your language have a unique parse tree without context, you'll get something for which it's easy to hand-code a parser, and which doesn't require large amounts of work to provide syntax highlighting for. Similarly, simple things like using a different symbol for module namespaces and object namespaces (unlike Java, which uses . for both package and class namespaces) mean you can parse the code without loading every module that it refers to.
Standard libraries - include the equivalent of everything in C99 standard libraries other than setjmp. Add whatever else you need for your domain. Work out an easy way to do this, either something like SWIG or an in-line FFI such as Ruby's [can't remember module name] and Python's ctypes.
Building as much of the language as possible in the language itself is an option, but projects which start out doing so either give up (Rubinius moved to using C++ for parts of its standard library) or are only for research purposes (Mozilla Narcissus).
I am actually a kid, haha. I've never written an actual compiler before or designed a language, but I have finished The Red Dragon Book, so I suppose I have somewhat of an idea (I hope).
It would depend firstly on the grammar. If it's LR or LALR I suppose tools like Bison/Flex would work well. If it's more LL, I'd use Spirit, which is a component of Boost. It allows you to write the language's grammar in C++ in an EBNF-like syntax, so no muddling around with code generators; the C++ compiler compiles the grammar for you. If any of these fail, I'd write an EBNF grammar on paper, and then proceed to do some heavy recursive descent parsing, which seems to work; if C++ can be parsed pretty well using RDP (as GCC does it), then I suppose with enough unit tests and patience you could write entire compilers using RDP.
Once I have a parser running and some sort of intermediate representation, it then depends on how it runs. If it's some bytecode or native code compiler, I'll use LLVM or libJIT to process it. LLVM is more suited for general compilation, but I like the libJIT API and documentation better. Alternatively, if I'm really lazy, I'll generate C code and let GCC do the actual compilation. Another alternative is to target an existing VM, like Parrot or the JVM or the CLR. Parrot is the VM being designed for Perl. If it's just an interpreter, I'll walk the syntax tree.
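Of the back ends listed above, "generate C and let GCC compile it" is the easiest to sketch. The Python below is a rough illustration under assumptions: the tuple-based tree format and the names to_c and emit_c_program are invented for this example, and the generated program only prints the value of one integer expression. It produces C source text; feeding that to a C compiler is left as a separate step.

```python
# A tree node is either an integer or a tuple (operator, left, right).

C_TEMPLATE = """#include <stdio.h>

int main(void) {{
    printf("%ld\\n", (long)({expr}));
    return 0;
}}
"""

def to_c(node):
    """Translate an expression tree into a fully parenthesized C expression."""
    if isinstance(node, tuple):
        op, left, right = node
        if op not in ("+", "-", "*"):
            raise ValueError(f"unsupported operator: {op!r}")
        return f"({to_c(left)} {op} {to_c(right)})"
    return str(int(node))

def emit_c_program(tree):
    """Wrap the translated expression in a complete C program."""
    return C_TEMPLATE.format(expr=to_c(tree))

print(emit_c_program(("+", 1, ("*", 2, 3))))
```

Parenthesizing every sub-expression means the generator never has to reason about C operator precedence, at the cost of uglier output that nobody is expected to read.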
A radical alternative is to use Prolog, which has syntax features which remarkably simulate EBNF. I have no experience with it though, and if I am not wrong (which I am almost certainly going to be), Prolog would be quite slow if used to parse heavy duty programming languages with a lot of syntactical constructs and quirks (read: C++ and Perl).
All this I'll do in C++, if only because I am more used to writing in it than C. I'd stay away from Java/Python or anything of that sort for the actual production code (writing compilers in C/C++ helps to make them portable), but I could see myself using them as prototyping languages, especially Python, which I am partial towards. Of course, I've never actually done any of this before, so I'm not one to say.
On lambda-the-ultimate there's a link to Create Your Own Programming Language by Marc-André Cournoyer, which appears to describe how to leverage some modern tools for creating little languages.
Just to clarify, I mean, not how do you DESIGN a language (that I can figure out fairly easily)
Just a hint: Look at some quite different languages first, before designing a new language (i.e. languages with a very different evaluation strategy). Haskell and Oz come to mind. Though you should also know Prolog and Scheme. A year ago I also was like "hey, let's design a language that behaves exactly as I want", but fortunately I looked at those other languages first (or you could also say unfortunately, because now I don't know how I want a language to behave anymore...).
Before you start creating a language you should read this:
Hanspeter Moessenboeck, The Art of Niklaus Wirth
ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe00b.pdf
There's a big shortcut to implementing a language that I don't see in the other answers here. If you use one of Lukasiewicz's "unparenthesized" forms (i.e. Forward Polish or Reverse Polish) you don't need a parser at all! With Reverse Polish, every operator comes after its operands, so the dependencies go right-to-left and you simply execute each token as it's scanned. With Forward Polish, it's the reverse of that, so you actually execute the program "backwards", simplifying subexpressions until reaching the starting token.
To understand why this works, you should investigate the 3 primary tree-traversal algorithms: pre-order, in-order, post-order. These three traversals are the inverse of the parsing task that a language reader (i.e. parser) has to perform. Only the in-order notation "requires" a recursive descent to reconstruct the expression tree. With the other two, you can get away with just a stack.
This may require more "thinking" and less "implementing".
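A rough sketch of both stack-based evaluators, in Python. The whitespace-separated token format, the four binary operators, and the names `eval_rpn` and `eval_prefix` are assumptions chosen for the example:

```python
import operator

# Hypothetical operator table; every operator takes exactly two operands.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_rpn(source):
    """Reverse Polish: execute each token as it is scanned, left to right."""
    stack = []
    for token in source.split():
        if token in OPS:
            right = stack.pop()  # the most recent operand is the right-hand one
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

def eval_prefix(source):
    """Forward Polish: scan the tokens backwards with the same stack trick."""
    stack = []
    for token in reversed(source.split()):
        if token in OPS:
            left = stack.pop()  # scanning backwards, the left operand is on top
            right = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

print(eval_rpn("1 2 3 * +"))     # prints 7.0
print(eval_prefix("+ 1 * 2 3"))  # prints 7.0
```

Neither function builds a tree; the stack holds exactly the partial results that a post-order or reversed pre-order traversal would produce.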
BTW, if you've already found an answer (this question is a year old), you can post that and accept it.
Real coders still code in C. Just that it's a little sharper.
Hmmm... language design? or writing a compiler?
If you want to write a compiler, you'd use Flex + Bison. (google)
Not an easy answer, but..
You essentially want to define a set of rules written in text (tokens) and then some parser that checks these rules and assembles them into fragments.
http://www.mactech.com/articles/mactech/Vol.16/16.07/UsingFlexandBison/
People can spend years on this. The above article talks about using two tools (Flex and Bison) that can be used to turn text into code you can feed to a compiler.
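To give a feel for the "rules written in text" part without installing Flex, here is a small sketch of a rule-driven scanner using Python's standard `re` module. The token names and patterns in `TOKEN_SPEC` are made up for illustration; a Flex specification would express the same idea as pattern/action pairs:

```python
import re

# Hypothetical token rules: (name, regular expression), tried in order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(source):
    """Turn source text into a list of (kind, text) pairs."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not one itself
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens

print(tokenize("x = 3 + 4.5"))
```

The resulting token list is what a parser (hand-written or generated by a tool like Bison) would then check against the grammar rules.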
First I spent a year or so actually thinking about what the language should look like. At the same time I helped in developing Ioke (www.ioke.org) to learn language internals.
I have chosen Objective-C as the implementation platform as it's a fast (enough), simple and rich language. It also provides a test framework, so an agile approach is a go. It also has a rich standard library I can build upon.
Since my language is simple on the syntactic level (no keywords, only literals, operators and messages) I could go with Ragel (http://www.complang.org/ragel/) for building the scanner. It's fast as hell and simple to use.
Now I have a working object model, scanner and simple operator shuffling plus standard library bootstrap code. I can even run simple programs - as long as they fit in one file, that is :)
Although older techniques are still common (e.g. using Flex and Bison), many newer language implementations combine the lexing and parsing phases by using a parser based on a parsing expression grammar (PEG). This works for recursive descent parsers created using combinators, or memoizing Packrat parsers. Many compilers are also built using the ANTLR framework.
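As an illustration of the combinator style, here is a minimal sketch in Python. The combinator names (`literal`, `sequence`, `choice`, `many`) and the convention that a parser returns `(value, new_position)` or `None` are assumptions for this example, not the API of any particular library, and the sketch does no Packrat-style memoization:

```python
def literal(text):
    """Match an exact string at the current position."""
    def parse(source, pos):
        if source.startswith(text, pos):
            return text, pos + len(text)
        return None
    return parse

def sequence(*parsers):
    """Match every parser in order; fail if any one of them fails."""
    def parse(source, pos):
        values = []
        for parser in parsers:
            result = parser(source, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def choice(*parsers):
    """PEG ordered choice: the first alternative that succeeds wins."""
    def parse(source, pos):
        for parser in parsers:
            result = parser(source, pos)
            if result is not None:
                return result
        return None
    return parse

def many(parser):
    """Match zero or more repetitions (parser must consume input to terminate)."""
    def parse(source, pos):
        values = []
        while True:
            result = parser(source, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse

# Characters are matched directly, so there is no separate lexing phase.
digit = choice(*[literal(d) for d in "0123456789"])
number = sequence(digit, many(digit))
sum_expr = sequence(number, many(sequence(literal("+"), number)))
```

Calling `sum_expr("12+3", 0)` succeeds and consumes all four characters, while `sum_expr("+3", 0)` returns `None`. Note that ordered choice is what distinguishes a PEG from a context-free grammar: `choice(literal("a"), literal("ab"))` never tries the second alternative on input `"ab"`.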
Use bison/flex, which are the GNU versions of yacc/lex. This book is extremely helpful.
The reason to use bison is that it catches any conflicts in the language. I used it and it made my life many years easier (OK, so I'm on my 2nd year, but the first 6 months was a few years ago, writing it in C++, and the parsing/conflicts/results were terrible! :( ).
If you want to write a compiler obviously you need to read the Dragon Book ;)
Here is another good book that I have just read. It is practical and easier to understand than the Dragon Book:
http://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=language+implementation+patterns&x=0&y=0
Mike --
If you're interested in an efficient native-code-generating compiler for Windows so you can get your bearings -- without wading through all the unnecessary widgets, gadgets, and other nonsense that clutter today's machines -- I recommend the Osmosian Order's Plain English development system. It includes a unique interface, a simplified file manager, a friendly text editor, a handy hexadecimal dumper, the compiler/linker (of course), and a wysiwyg page-layout application for documentation. Written entirely in Plain English, it is a quick download (less than a megabyte), small enough to understand in short order (about 25,000 lines of Plain English code, with just 4,000 in the compiler/linker), yet powerful enough to reproduce itself on a bottom-of-the-line Dell in less than three seconds. Really: three seconds. And it's free to all who write and ask for a copy, including the source code and a rather humorous tongue-in-cheek 100-page manual. See www.osmosian.com for details on how to get a copy, or write to me directly with questions or comments: Gerry.Rzeppa#pobox.com

Resources