Creating source to source translator - programming-languages

I want to know what strategies exist for creating a source-to-source translator, i.e. translating from one high-level language to another. The two approaches that come to mind are:
1- Transforming the syntax tree of one language into the syntax tree of the other language
2- Translating to an intermediate language and then converting that to the other high-level language
Are both strategies workable, and which is more feasible? Can anyone give references to theory, or to a converter implemented with either of the above methods? Also, is there a standard XML-based intermediate language? I know that xmlvm uses XML as its intermediate language, but it does not provide a proper specification of that language.

Any compiler is, roughly, a source-to-source translator. The target language can be an assembly language (or directly a binary machine code language), or C, or whatever high-level language you fancy. So general compiler theory is applicable.
And just as a word of advice - one intermediate language is normally not nearly enough. Use more. Use dozens of intermediate languages, each different from a previous one in just one tiny aspect. This way any language-to-language translation is nothing but trivial.
Another word of advice (anticipating downvotes here) - stay away from XML, especially as a representation for ASTs.
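The "many small intermediate languages" advice can be sketched as a pipeline of tiny passes, each of which eliminates exactly one construct. This is only an illustration: the tuple-based tree shape and the pass names are assumptions made up for the example, not taken from any particular compiler.

```python
# Trees are nested tuples such as ("add", 1, 2); leaves are ints or names.
# Every name here is hypothetical and exists only for this sketch.

def map_tree(fn, node):
    """Rebuild the tree bottom-up, applying fn to every node."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(map_tree(fn, child) for child in node[1:])
    return fn(node)

def remove_neg(node):
    # Pass 1: ("neg", x) becomes ("sub", 0, x); the output language has no "neg".
    if isinstance(node, tuple) and node[0] == "neg":
        return ("sub", 0, node[1])
    return node

def remove_sub(node):
    # Pass 2: ("sub", a, b) becomes ("add", a, ("mul", -1, b)); no "sub" remains.
    if isinstance(node, tuple) and node[0] == "sub":
        return ("add", node[1], ("mul", -1, node[2]))
    return node

PASSES = [remove_neg, remove_sub]

def lower(tree):
    """Run every pass in order; each pass output is its own tiny language."""
    for single_pass in PASSES:
        tree = map_tree(single_pass, tree)
    return tree

print(lower(("neg", ("sub", "x", 1))))
```

Each pass is trivial to write and test on its own, which is the point of splitting the translation into many steps.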

I would look at LLVM, which can do source to source. Although the output isn't pretty, it might provide some good ideas.

Converters are usually based on constructing the semantic tree of the source program and then re-shaping it for the target PL. As an example, take a look at a C# to Java converter.
The second approach is also possible, but the organization of your code may change completely after conversion. So it is better to keep the common intermediate structure (IL, ST, etc.) as high level as possible.
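A minimal sketch of the tree re-shaping approach, assuming Python as the source language and a Lisp-like prefix notation as the target (both picked only for illustration). It uses Python's standard `ast` module to build the tree and handles nothing beyond names, constants and four binary operators.

```python
import ast

# Map Python operator node types to the operator spelling in the target.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_sexpr(node):
    """Walk the source tree and emit target-language text for each node."""
    if isinstance(node, ast.Expression):
        return to_sexpr(node.body)
    if isinstance(node, ast.BinOp):
        op = OPS[type(node.op)]
        return f"({op} {to_sexpr(node.left)} {to_sexpr(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    # Anything else is outside the tiny subset this sketch supports.
    raise NotImplementedError(type(node).__name__)

def translate(source):
    return to_sexpr(ast.parse(source, mode="eval"))

print(translate("a + 2 * b"))
```

Operator precedence is resolved by the parser, so the emitted text needs no precedence logic of its own.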

Try Clang! It is powerful for source-to-source translation. As of now it fully supports C, C++, Objective C and Objective C++.
You may also want to look at ROSE compiler infrastructure.

Related

Is it true that anything that can be coded in one programming language can be done in any other language?

Is it true that anything that can be coded in one programming language can be done in any other language?
For instance, is it possible to code an Android App in c or c++ instead of Java?
Yes, the only difference being that some languages provide built-in libraries for certain tasks, while in others you may have to implement them yourself.
For example, consider A and B to be matrices. In Octave you can multiply them by just writing A*B, while in other languages, like C, you have to write the full code yourself.
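To make the matrix example concrete, here is roughly what "writing the full code yourself" looks like. It is a plain sketch using lists of lists, written in Python rather than C for brevity; the function name is made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    # result[i][j] is the dot product of row i of a and column j of b.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```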
Yes, of course. The only thing is that some things can be done easily in one programming language, while in another they may be tough.
This is my personal opinion; it is difficult to answer with 100% accuracy because I do not know the ins and outs of every language that exists but I will try to answer in two parts.
Short answer:
Yes; it is entirely possible to do anything in one language in another language. However, it may be harder to duplicate specific logic across languages.
Long answer:
Although it can be possible to implement the same program/class/software in one language in another language, this is not always the best thing to do.
For example, some languages have advantages over others in specific areas. FORTRAN, C, and C++ are all languages that can produce fast results for math operations when compared to languages like Java or C#. But C# for example, gives you much more flexibility than say C++ with object orientation.
So for example, you have millions of operations that need to be done, and time is important to you - C/C++ would be a better language to implement compared to C#.
But if you cared less about efficiency and wanted to leverage the benefits of OOD within C#, then you would use C#.
Note:
I tried to be simple with this example, remember a book could be written on this topic.
A programming language can do anything another can, but restrictions are sometimes deliberately put in place so that it cannot. You don't want JavaScript calling the Windows API and bringing up windows outside your browser, or placing .bat files on your computer without your permission.

Tool for automated porting and language that can compile into others

I'm just asking this out of curiosity :
Is there any tool that can automatically convert a source code of reasonable complexity from one language to another ?
Is there any "meta-language" that can compile into several other languages ? For example CoffeeScript compiles into Javascript.
If you know any open-source example, it'd be great !
Thank you for your time.
PS: No idea how to tag this. Feel free to edit.
GCC converts complex C++ code into machine code and thus technically is an answer to your question. In fact, there are lots of compilers like this, but I don't think these are what you intended to ask about.
There are tools that are hardwired to translate just one language to another as source code (another poster suggested "f2C", which is a perfect example). These are just like compilers... but rarer.
There are virtually no tools that will map from one language to many others, out of the box. The problem is that languages have different execution models, data types, and execution schemes, which such a translator has to simulate properly in the target language.
There are "code generators" that claim to do this, but IMHO they are largely specifications of rather simple functions that translate trivially to simple code in the target language.
If you want to translate one language to another in a sort of general way, you need a program transformation system, e.g., a system that can parse arbitrary languages, and for which you can provide translation rules that map to other languages in a sort of straightforward way.
Our DMS Software Reengineering Toolkit is one of these. The SO question "What kinds of patterns could I enforce on the code to make it easier to translate to another programming language?" discusses the issues in more detail.
You can convert Fortran code to C using the f2c tool.
For python, you can convert a subset of the language to C++ using shedskin.
The vala language is converted to C before the real compilation.

What qualifies a programming language as dynamic?

What qualifies a programming language to be called dynamic language? What sort of problems should I use a dynamic programming language to solve? What is the main difference between static programming languages and dynamic programming languages?
I don't think there is black and white here - there is a whole spectrum between dynamic and static.
Let's take two extreme examples for each side of the spectrum, and see where that takes us.
Haskell is an extreme in the static direction.
It has a powerful type system that is checked at compile time: If your program compiles it is free from common and not so common errors.
The compiled form is very different from the Haskell program (it is a binary). Consequently runtime reflection and modification are hard, unless you have foreseen them. In comparison to interpreting the original, the result is potentially more efficient, as the compiler is free to do funky optimizations.
So for static languages I usually think: fairly lengthy compile-time analysis needed, type system will prevent me from making silly mistakes but also from doing some things that are actually valid, and if I want to do any manipulation of a program at runtime, it's going to be somewhat of a pain because the runtime representation of a program (i.e. its compiled form) is different from the actual language itself. Also it could be a pain to modify things later on if I have not foreseen it.
Clojure is an extreme in the dynamic direction.
It too has a type system, but at compile time there is no type checking. Many common errors can only be discovered by running the program.
Clojure programs are essentially just Clojure lists (the data structure) and can be manipulated as such. So when doing runtime reflection, you are actually processing a Clojure program more or less as you would type it - the runtime form is very close to the programming language itself. So you can basically do the same things at runtime as you could at "type time". Consequently, runtime performance may suffer because the compiler can't do many up-front optimizations.
For dynamic languages I usually think: short compilation step (basically just reading syntax), so fast and incremental development, practically no limits to what it will allow me to do, but won't prevent me from silly mistakes.
As other posts have indicated, other languages try to take more of a middle ground - e.g. static languages like F# and C# offer reflection capabilities through a separate API, and of course can offer incremental development by using clever tools like F#'s REPL. Dynamic languages sometimes offer optional typing (like Racket, Strongtalk), and generally, it seems, have more advanced testing frameworks to offset the lack of any sanity checking at compile time. Also type hints, while not checked at compile time, are useful hints to generate more efficient code (e.g. Clojure).
If you are looking to find the right tool for a given problem, then this is certainly one of the dimensions you can look at - but by itself it is not likely to force a decision either way. Have a think about the other properties of the languages you are considering - is it a functional or OO or logic or ... language? Does it have a good framework for the things I need? Do I need stability and binary backwards compatibility, or can I live with some churn in the compiler? Do I need extensive tooling? Etc.
A dynamic language does many tasks at runtime that a static language would do at compile time.
The tasks in question are usually one or more of: type system, method dispatch and code generation.
Which also pretty much answers the questions about their usage.
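A small sketch of those three tasks happening at run time, using Python as the example dynamic language. The class and function names are made up for the illustration.

```python
class Greeter:
    pass

# Method dispatch is resolved at run time, so a method can be attached
# to the class long after the class was defined.
def hello(self, name):
    return f"hello, {name}"

Greeter.hello = hello
greeter = Greeter()

# Code generation at run time: build source text and compile it on the fly.
namespace = {}
exec("def square(x):\n    return x * x", namespace)
square = namespace["square"]

# Type checking happens when the operation executes, not before.
def try_add(a, b):
    try:
        return a + b
    except TypeError as err:
        return f"TypeError: {err}"

print(greeter.hello("world"), square(7), try_add(1, "x"))
```

In a statically typed language the call `try_add(1, "x")` would typically be rejected before the program ever ran.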
There are a lot of different definitions in use, but one possible difference is:
A dynamic language typically uses dynamic typing.
A static language typically uses static typing.
Some languages are difficult to classify as either static or dynamically typed. For example, C# is traditionally regarded as a statically typed language, but C# 4.0 introduced a static type called dynamic which behaves in some ways more like a dynamic type than a static type.
What qualifies a programming language to be called dynamic language.
Dynamic languages are generally considered to be those that offer flexibility at run-time. Note that this does not necessarily conflict with static type systems. For example, F# was recently voted "favorite dynamic language on .NET" at a conference even though it is statically typed. Many people consider F# to be a dynamic language because it offers run-time features like meta-circular evaluation, a Read-Evaluate-Print-Loop (REPL) and dynamic typing (of sorts). Also, type inference means that F# code is not littered with type declarations like most statically typed languages (e.g. C, C++, Java, C# 2, Scala).
What are the problems for which I should go for dynamic language to solve.
In general, provided time and space are not of critical importance you probably always want to use languages with run-time flexibility and capabilities like run-time compilation.
This thread covers the issue pretty well:
Static/Dynamic vs Strong/Weak
The question is asked during Dynamic Languages Wizards Series - Panel on Language Design (at 24m 04s).
Answer from Jonathan Rees:
You know one when you see one
Answer from Guy Steele:
A dynamic language is one that defers as many decisions as possible to run time.
For example about array size, the number of data objects to allocate, decisions like that.
The concept is deferring until runtime, that's what I understand to be dynamic.

is it possible to markup all programming languages under object oriented paradigm using a common markup schema?

I plan to develop a tool that converts a program written in one programming language (e.g. Java) to a common markup language (e.g. XML), and then converts that markup to another language (e.g. C#).
In simple words, it is a programming language converter that converts a program written in one language to another language.
I think it is possible, but I don't know where to start. I would like to know whether this can be done, and about any existing systems that do it.
What you are trying to do is extremely hard, but if you want to know what you are up for I've listed the steps you need to follow below:
First the hard bit:
1- Obtain or derive an operational semantics for your source and target languages.
2- Enhance the semantics to capture your source and target memory models.
3- Unify the two enhanced semantics within a common operational model.
4- Define a mapping from your source language onto the common operational model.
5- Define a mapping from your operational model to your target language.
Step 4, as you pointed out in your question, is trivial.
Step 1 is difficult, as most languages do not have sufficiently formal semantics specified; but I recommend checking out http://lucacardelli.name/TheoryOfObjects.html as this is the best starting point for building a traditional OO semantics.
Step 2 is almost certainly impossible in general, but may be merely obscenely difficult if you are willing to sacrifice some efficiency.
Step 3 will depend on how clean the result of step 1 turned out, but is going to be anything from delicate and tricky to impossible.
Step 5 is not going to be trivial, it is effectively writing a compiler.
Ultimately, what you propose to do is impossible in general, due to the difficulties inherent in steps 1 and 2. However, it should be difficult but doable if you are willing to: severely restrict the source language constructs supported; pretty much forget handling threads correctly; and pick two languages with sufficiently similar semantics (i.e. Java and C# are OK, but C++ and anything else is not).
It depends on what languages you want to support, but in general this is a huge & difficult task unless you plan to only support a very small subset of each language.
The real problem is that each programming language has different features (with some areas that overlap and others that don't) and different ways of solving the same problems -- and it's pretty tricky to detect the problem the programmer is trying to solve and convert that to a new idiom. :) And think about the differences between GUIs created in different languages....
See http://xmlvm.org/ as an example (a project aimed at converting between source code of many different languages, with an XML middle-point) -- the site covers in some depth the challenges they are tackling and the compromises they take, and (if you still have any interest in this kind of project...) ask more specific followup questions.
Notice specifically what the output source code looks like -- it's not at all readable, maintainable, efficient, etc..
It is "technically easy" to produce XML for any single language: build a parser, construct an abstract syntax tree, and dump out that tree as XML. (I build tools that do this off-the-shelf for many languages). By technically easy, I mean that the community knows how to do this (see any compiler textbook, e.g., the Aho & Ullman Dragon Book). I do not mean this is a trivial exercise in terms of effort, because real languages are complicated and messy; there have been many attempts to build C++ parsers and few successes. (I have one of the successes, and it was expensive to get right).
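As a rough illustration of the "parse, build an AST, dump it as XML" recipe, here is a sketch that does it for Python source using only the standard library (`ast` and `xml.etree.ElementTree`). The element and attribute layout is an ad hoc choice for this example, not any standard schema, which is exactly the limitation discussed next.

```python
import ast
import xml.etree.ElementTree as ET

def node_to_xml(node):
    """Convert one AST node into an XML element named after its node type."""
    elem = ET.Element(type(node).__name__)
    for field, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            # A single child node goes inside an element named after the field.
            ET.SubElement(elem, field).append(node_to_xml(value))
        elif isinstance(value, list):
            # A list of child nodes; non-node list items are skipped here.
            child = ET.SubElement(elem, field)
            for item in value:
                if isinstance(item, ast.AST):
                    child.append(node_to_xml(item))
        elif value is not None:
            # Plain values (identifiers, constants) become attributes.
            elem.set(field, str(value))
    return elem

def source_to_xml(source):
    return ET.tostring(node_to_xml(ast.parse(source)), encoding="unicode")

print(source_to_xml("x = 1 + 2"))
```

The output mirrors Python's own tree shape, so a reader of this XML still has to know Python's semantics to do anything useful with it.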
What is really hard (and I don't try to do) is produce XML according to a single schema in which the language semantics are exposed. And without that, it will be essentially impossible to write a translator from a generic XML to an arbitrary target language. This is known as the UNCOL problem and people have been looking since 1958 for the answer. I note that the Wikipedia article seems to indicate the problem is solved, but you can't find many references to UNCOL in the literature since 1961.
The closest attempt I've seen to this is the OMG's "ASTM" model (http://www.omg.org/spec/ASTM/1.0/Beta1/); it exports XMI, which is XML. But the ASTM model has lots of escapes built into it to allow languages that it doesn't model perfectly (AFAIK, that means every language) to extend the XMI in arbitrary ways so that the language-specific information can be encoded. Consequently each language parser produces a custom version of the XMI, and thus each reader pretty much has to know about the extensions, and full generality vanishes.

How to create a language these days?

I need to get around to writing that programming language I've been meaning to write. How do you kids do it these days? I've been out of the loop for over a decade; are you doing it any differently now than we did back in the pre-internet, pre-windows days? You know, back when "real" coders coded in C, used the command line, and quibbled over which shell was superior?
Just to clarify, I mean, not how do you DESIGN a language (that I can figure out fairly easily) but how do you build the compiler and standard libraries and so forth? What tools do you kids use these days?
One consideration that's new since the punched card era is the existence of virtual machines already bountifully provided with "standard libraries." Targeting the JVM or the .NET CLR instead of ye olde "language walled garden" saves you a lot of bootstrapping. If you're creating a compiled language, you may also find Java byte code or MSIL an easier compile target than machine code (of course, if you're in this for the fun of creating a tight optimising compiler then you'll see this as a bug rather than a feature).
On the negative side, the idioms of the JVM or CLR may not be what you want for your language. So you may still end up building "standard libraries" just to provide idiomatic interfaces over the platform facility. (An example is that every language and its dog seems to provide its own method for writing to the console, rather than leaving users to manually call System.out.println or Console.WriteLine.) Nevertheless, it enables an incremental development of the idiomatic libraries, and means that the more obscure libraries for which you never get round to building idiomatic interfaces are still accessible even if in an ugly way.
If you're considering an interpreted language, .NET also has support for efficient interpretation via the Dynamic Language Runtime (DLR). (I don't know if there's an equivalent for the JVM.) This should help free you up to focus on the language design without having to worry so much about the optimisation of the interpreter.
I've written two compilers now in Haskell for small domain-specific languages, and have found it to be an incredibly productive experience. The parsec library makes playing with syntax easy, and interpreters are very simple to write over a Haskell data structure. There is a description of writing a Lisp interpreter in Haskell that I found helpful.
If you are interested in a high-performance backend, I recommend LLVM. It has a concise and elegant byte-code and the best x86/amd64 generating backend you can find. There is an optional garbage collector, and some experimental backends that target the JVM and CLR.
You can write a compiler in any language that produces LLVM bytecode. If you are adventurous enough to learn Haskell but want LLVM, there are a set of Haskell-LLVM bindings.
What has changed considerably but hasn't been mentioned yet is IDE support and interoperability:
Nowadays we pretty much expect Intellisense, step-by-step execution and state inspection "right in the editor window", new types that tell the debugger how to treat them and rather helpful diagnostic messages. The old "compile .x -> .y" executable is not enough to create a language anymore. The environment is nothing to focus on first, but affects willingness to adopt.
Also, libraries have become much more powerful, and no one wants to implement all that in yet another language. Try to borrow, make it easy to call existing code, and make it easy to be called by other code.
Targeting a VM - as itowlson suggested - is probably a good way to get started. If that turns out a problem, it can still be replaced by native compilers.
I'm pretty sure you do what's always been done.
Write some code, and show your results to the world.
As compared to the olden times, there are some tools to make your job easier though. Might I suggest ANTLR for parsing your language grammar?
Speaking as someone who just built a very simple assembly-like language and interpreter, I'd start out with the .NET framework or similar. Nothing can beat the powerful syntax of C# + the backing of the entire .NET community when attempting to write most things. From there I designed a simple bytecode format and assembly syntax and proceeded to write my interpreter + assembler.
Like I said, it was a very simple language.
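For a sense of scale, an assembler plus bytecode interpreter of this kind can be very small. The sketch below is not the poster's design; the five-instruction set, the opcode numbering and the function names are all invented for the example, and it is written in Python rather than C#.

```python
# Hypothetical opcodes for a tiny stack machine.
PUSH, ADD, MUL, PRINT, HALT = range(5)
OPCODES = {"PUSH": PUSH, "ADD": ADD, "MUL": MUL, "PRINT": PRINT, "HALT": HALT}

def assemble(text):
    """Turn lines such as 'PUSH 2' into a flat list of integers."""
    code = []
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        code.append(OPCODES[parts[0]])
        code.extend(int(arg) for arg in parts[1:])
    return code

def run(code):
    """Execute bytecode; returns everything PRINT emitted, in order."""
    stack, output, pc = [], [], 0
    while True:
        op = code[pc]
        pc += 1
        if op == PUSH:
            stack.append(code[pc])  # the operand follows the opcode
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == PRINT:
            output.append(stack.pop())
        elif op == HALT:
            return output
        else:
            raise ValueError(f"unknown opcode {op}")

program = """
PUSH 2
PUSH 3
ADD
PUSH 4
MUL
PRINT
HALT
"""
print(run(assemble(program)))
```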
You should not accept wimpy solutions like using the latest tools. You should bootstrap the language by writing a minimal compiler in Visual Basic for Applications or a similar language, then write all the compilation tools in your new language and then self-compile it using only the language itself.
Also, what is the proposed name of the language?
I think recently there have not been languages with ALL CAPITAL LETTER names like COBOL and FORTRAN, so I hope you will call it something like MIKELANG with all capital letters.
Not so much an implementation but a design decision which affects implementation - if you make every statement of your language have a unique parse tree without context, you'll get something for which it's easy to hand-code a parser, and that doesn't require large amounts of work to provide syntax highlighting for. Similarly, simple things like using a different symbol for module namespaces and object namespaces (unlike Java, which uses . for both package and class namespaces) mean you can parse the code without loading every module that it refers to.
Standard libraries - include the equivalent of everything in C99 standard libraries other than setjmp. Add whatever else you need for your domain. Work out an easy way to do this, either something like SWIG or an in-line FFI such as Ruby's [can't remember module name] and Python's ctypes.
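As an example of what an in-line FFI looks like from the user's side, here is a short sketch with Python's ctypes calling `strlen` from the C standard library. It assumes a POSIX-like system where the C library can be located; the wrapper name `c_strlen` is made up for the example.

```python
import ctypes
import ctypes.util

# Locate and load the C library. On POSIX systems, if find_library returns
# None, CDLL(None) falls back to symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(data: bytes) -> int:
    """Call the C strlen on a bytes object."""
    return libc.strlen(data)

print(c_strlen(b"hello"))
```

The whole binding is three lines of declaration, which is the kind of low friction worth aiming for in a new language's FFI.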
Building as much of the language as possible in the language itself is an option, but projects which start out doing so either give up (Rubinius moved to using C++ for parts of its standard library) or exist only for research purposes (Mozilla Narcissus).
I am actually a kid, haha. I've never written an actual compiler before or designed a language, but I have finished The Red Dragon Book, so I suppose I have somewhat of an idea (I hope).
It would depend firstly on the grammar. If it's LR or LALR I suppose tools like Bison/Flex would work well. If it's more LL, I'd use Spirit, which is a component of Boost. It allows you to write the language's grammar in C++ in an EBNF-like syntax, so no muddling around with code generators; the C++ compiler compiles the grammar for you. If any of these fail, I'd write an EBNF grammar on paper, and then proceed to do some heavy recursive descent parsing, which seems to work; if C++ can be parsed pretty well using RDP (as GCC does it), then I suppose with enough unit tests and patience you could write entire compilers using RDP.
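For reference, a hand-written recursive descent parser is mostly one function per grammar rule. The sketch below is in Python rather than C++ and assumes a toy arithmetic grammar chosen for the example: expr := term (('+'|'-') term)*, term := factor (('*'|'/') factor)*, factor := NUMBER | '(' expr ')'.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            # Looping (instead of recursing) keeps the operators left-associative.
            node = (self.take(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.take(), node, self.factor())
        return node

    def factor(self):
        tok = self.take()
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(tokenize(text))
    node = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return node

print(parse("1 + 2 * (3 - 4)"))
```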
Once I have a parser running and some sort of intermediate representation, it then depends on how it runs. If it's some bytecode or native code compiler, I'll use LLVM or libJIT to process it. LLVM is more suited for general compilation, but I like the libJIT API and documentation better. Alternatively, if I'm really lazy, I'll generate C code and let GCC do the actual compilation. Another alternative, is to target an existing VM, like Parrot or the JVM or the CLR. Parrot is the VM being designed for Perl. If it's just an interpreter, I'll walk the syntax tree.
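Two of the back-end options mentioned here, walking the syntax tree and generating C for another compiler to finish, can be sketched over the same tiny tree. The tuple-based tree shape and all function names are assumptions for the example, and only `+`, `-` and `*` are supported.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, env):
    """Tree-walking interpreter: compute the value of the tree directly."""
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        return env[node]  # variable lookup
    op, left, right = node
    return OPS[op](evaluate(left, env), evaluate(right, env))

def emit_c(node):
    """Code generator: produce C expression text for the same tree."""
    if isinstance(node, (int, str)):
        return str(node)
    op, left, right = node
    return f"({emit_c(left)} {op} {emit_c(right)})"

def emit_c_function(name, params, body):
    """Wrap an expression tree in a C function definition.

    Note: C long can overflow where Python integers cannot.
    """
    args = ", ".join(f"long {p}" for p in params)
    return f"long {name}({args}) {{ return {emit_c(body)}; }}"

tree = ("+", "x", ("*", 2, "y"))
print(evaluate(tree, {"x": 1, "y": 3}))
print(emit_c_function("f", ["x", "y"], tree))
```

Fully parenthesizing the generated expression avoids having to reason about C operator precedence in the generator.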
A radical alternative is to use Prolog, which has syntax features which remarkably simulate EBNF. I have no experience with it though, and if I am not wrong (which I am almost certainly going to be), Prolog would be quite slow if used to parse heavy duty programming languages with a lot of syntactical constructs and quirks (read: C++ and Perl).
All this I'll do in C++, if only because I am more used to writing in it than C. I'd stay away from Java/Python or anything of that sort for the actual production code (writing compilers in C/C++ help to make it portable), but I could see myself using them as a prototyping language, especially Python, which I am partial towards. Of course, I've never actually done any of this before, so I'm not one to say.
On lambda-the-ultimate there's a link to Create Your Own Programming Language by Marc-André Cournoyer, which appears to describe how to leverage some modern tools for creating little languages.
Just to clarify, I mean, not how do you DESIGN a language (that I can figure out fairly easily)
Just a hint: Look at some quite different languages first, before designing a new language (i.e. languages with a very different evaluation strategy). Haskell and Oz come to mind. Though you should also know Prolog and Scheme. A year ago I also was like "hey, let's design a language that behaves exactly as I want", but fortunately I looked at those other languages first (or you could also say unfortunately, because now I don't know how I want a language to behave anymore...).
Before you start creating a language you should read this:
Hanspeter Moessenboeck, The Art of Niklaus Wirth
ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe00b.pdf
There's a big shortcut to implementing a language that I don't see in the other answers here. If you use one of Lukasiewicz's "unparenthesized" forms (ie. Forward Polish or Reverse Polish) you don't need a parser at all! With reverse polish, the dependencies go right-to-left so you simply execute each token as it's scanned. With forward polish, it's the reverse of that, so you actually execute the program "backwards", simplifying subexpressions until reaching the starting token.
To understand why this works, you should investigate the 3 primary tree-traversal algorithms: pre-order, in-order, post-order. These three traversals are the inverse of the parsing task that a language reader (i.e. a parser) has to perform. Only the in-order notation "requires" a recursive descent to re-construct the expression tree. With the other two, you can get away with just a stack.
This may require more "thinking" and less "implementing".
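A minimal sketch of both stack-based evaluators, assuming whitespace-separated tokens, integer operands and binary operators only. Reverse Polish is evaluated in scanning order; forward Polish is evaluated by scanning the tokens backwards.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_rpn(tokens):
    """Reverse Polish: execute each token as it is scanned, left to right."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    (result,) = stack  # exactly one value must remain
    return result

def eval_prefix(tokens):
    """Forward Polish: scan right to left, so operands are seen before operators."""
    stack = []
    for tok in reversed(tokens):
        if tok in OPS:
            left = stack.pop()  # operands come off in the opposite order
            right = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    (result,) = stack
    return result

print(eval_rpn("3 4 + 2 *".split()))
print(eval_prefix("* + 3 4 2".split()))
```

Neither function builds a tree or recurses; the stack alone carries the pending operands.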
BTW, if you've already found an answer (this question is a year old), you can post that and accept it.
Real coders still code in C. Just that it's a little sharper.
Hmmm... language design? or writing a compiler?
If you want to write a compiler, you'd use Flex + Bison. (google)
Not an easy answer, but..
You essentially want to define a set of rules written in text (tokens) and then some parser that checks these rules and assembles them into fragments.
http://www.mactech.com/articles/mactech/Vol.16/16.07/UsingFlexandBison/
People can spend years on this. The above article talks about using two tools (Flex and Bison) that can be used to turn text into code you can feed to a compiler.
First I spent a year or so to actually think how the language should look like. At the same time I helped in developing Ioke (www.ioke.org) to learn language internals.
I have chosen Objective-C as implementation platform as it's fast (enough), simple and rich language. It also provides test framework so agile approach is a go. It also has a rich standard library I can build upon.
Since my language is simple on syntactic level (no keywords, only literals, operators and messages) I could go with Ragel (http://www.complang.org/ragel/) for building scanner. It's fast as hell and simple to use.
Now I have a working object model, scanner and simple operator shuffling plus standard library bootstrap code. I can even run a simple programs - as long as they fit in one file that is :)
Although older techniques are still common (e.g. using Flex and Bison), many newer language implementations combine the lexing and parsing phases by using a parser based on a parsing expression grammar (PEG). This works for recursive descent parsers created using combinators, or memoizing Packrat parsers. Many compilers are also built using the ANTLR framework.
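To show what "combining lexing and parsing with combinators" can look like, here is a small PEG-style sketch. It is not a Packrat parser (there is no memoization), and the combinator names `regex`, `seq`, `choice`, `many` and `fmap` are just conventions chosen for this example rather than any library's API.

```python
import re

# A parser is a function (text, pos) -> (value, new_pos), or None on failure.

def regex(pattern):
    compiled = re.compile(pattern)
    def parse(text, pos):
        match = compiled.match(text, pos)
        return (match.group(), match.end()) if match else None
    return parse

def seq(*parsers):
    """Match every parser in order; fail if any one fails."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def choice(*parsers):
    """PEG ordered choice: the first alternative that matches wins."""
    def parse(text, pos):
        for parser in parsers:
            result = parser(text, pos)
            if result is not None:
                return result
        return None
    return parse

def many(parser):
    """Zero or more repetitions; the inner parser must consume input."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse

def fmap(fn, parser):
    """Transform the value produced by a successful parse."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return fn(value), new_pos
    return parse

# Grammar: sum <- number ('+' number)*, with whitespace handled in the rules.
number = fmap(int, regex(r"\s*\d+"))
plus = regex(r"\s*\+")
sum_expr = fmap(
    lambda v: v[0] + sum(n for _, n in v[1]),
    seq(number, many(seq(plus, number))),
)

print(sum_expr("1 + 2 + 39", 0))
```

There is no separate token stream: whitespace skipping lives inside the `number` and `plus` rules, which is the sense in which lexing and parsing are merged.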
Use bison/flex which is the gnu version of yacc/lex. This book is extremely helpful.
The reason to use bison is it catches any conflicts in the language. I used it and it made my life many years easier (ok so i'm on my 2nd year but the first 6months was a few years ago writing it in C++ and the parsing/conflicts/results were terrible! :(.)
If you want to write a compiler obviously you need to read the Dragon Book ;)
Here is another good book that I have just read. It is practical and easier to understand than the Dragon Book:
http://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=language+implementation+patterns&x=0&y=0
Mike --
If you're interested in an efficient native-code-generating compiler for Windows so you can get your bearings -- without wading through all the unnecessary widgets, gadgets, and other nonsense that clutter today's machines -- I recommend the Osmosian Order's Plain English development system. It includes a unique interface, a simplified file manager, a friendly text editor, a handy hexadecimal dumper, the compiler/linker (of course), and a wysiwyg page-layout application for documentation. Written entirely in Plain English, it is a quick download (less than a megabyte), small enough to understand in short order (about 25,000 lines of Plain English code, with just 4,000 in the compiler/linker), yet powerful enough to reproduce itself on a bottom-of-the-line Dell in less than three seconds. Really: three seconds. And it's free to all who write and ask for a copy, including the source code and a rather humorous tongue-in-cheek 100-page manual. See www.osmosian.com for details on how to get a copy, or write to me directly with questions or comments: Gerry.Rzeppa#pobox.com

Resources