By accident I knew that the compiler of Haskell is written in Haskell. It sounds strange to me. How is this possible, I mean, to compile itself? Who is to compile the compiler then? What is the ultimate code accepted by machine?
Consider the programming language who is the first to have a compiler. What is the language of its compiler? Going back even farther in time, how did people program before the era of compiler?
Broadly speaking, I am often confused about the border between software (e.g. programming written by people) and hardware (e.g. something executable on a physical machine).
P.S.: I have basic knowledge about compiler such as lexical analysis, parsing, and code optimization. However, I know little about hardware (the machine).
It seems that the answer to a related post Implementing a compiler in “itself” does not go deeply into the border between software and hardware.
And I would like to see some concrete examples.
EDIT: Some comments mentioned the term "bootstrapping". It seems that there is some minimum core part of a language (like axioms/basic theorems in mathematics) which must be compiled in a lower-level way (instead of by itself). What are they? Are they basically the same in different languages? Again, I would like to see some concrete examples.
As you can read in A history of Haskell page 28, the first haskell compiler was written in Lazy ML in June 1989. It implemented essentially all of Haskell 1.0.
Now that this compiler existed, it can then be used to compile the Haskell version of GHC. The first beta of GHC written in Haskell was release on 1 April 1991. The full release came in December 1992.
Because the Lazy ML-based compiler wasn't developed further, today you use a previous version of GHC to compile GHC. So if you want to build GHC 7.8, you use GHC 7.6 to build it (in practice, it's a bit more complicated, because there are multiple stages and only the first stage, which doesn't support GHCi or TemplateHaskell is built with GHC 7.6)
That means that if you don't have a working haskell compiler today, you have two options:
Try to install a LML compiler and compile the first version of GHC written in Lazy ML. Then use this compiler to compile the next version which is written in Haskell. Then again use that compiler to build the next version, and repeat until you have a reasonably recent compiler. It may be possible to skip a few versions, but I don't know how many. As you can imagine, this could take a lot of time.
(Much easier) Download pre-built GHC binaries.
Um... I have not tried this, but another route would be simply compiling to c and using a c compiler to compile latest ghc...ghc is itself built in stages, so you don't really even need to convert the whole code base to c, just the first stage, which then can compile the rest. Certainly no need to dig up Lazy ML.
Edit: Note the resulting compiler will not build binaries targeting the new platform, it would simply run on that platform and be a cross-platform compiler for targets that ghc already has backends for. Another note is that i actually intended this in response to bennofs answer, not as stand alone answer to the OP.
Related
I am working on a project which involves doing Program Analysis on Haskell. I thought that it will be good idea to implement Program Analysis on Cminusminus (C--) level.
I know that, using ghc, Haskell is first compiled to core then to stg and then cmm. The generated cmm is then converted to executable.
In order to properly understand and work on C--, I need a cminusminus (C--) compiler. However, according to my information, the compilers which compile to C-- are long discontinued and does not compile to x86_64 (e.g QuickC--). Am I missing something? Is there a C-- compiler which compiles to x64?
On a side note, has there been further work on Cminusminus after 2008-09? Is there a Cminusminus community?
Edit 1: According to this GHC can compile very basic cmm (cmm is a ghc's flavor of c--, which has some differences with c--, however according to my project's needs I should be more concerned with cmm) files. Moreover, the -ddump-cmm contains many items which are not parseable by ghc.
I've recently been looking at the Rust programming language. How does it work? Rust code seems to be compiled into ELF or PE (etc) binaries, but I've not been able to find any information on how that's done? Is it compiled to an intermediate format then compiled the rest of the way with gxx for example? Any help (or links) would be really appreciated.
The code-generation phase of the Rust compiler is mainly done by LLVM. LLVM is a set of tools for building a compiler, most notably used by the C[++] Compiler clang[++].
First, the Rust compiler (just like clang, for example) does all the Rust specific stuff like type and borrow checking; in the end, it generates LLVM-IR. IR stands for intermediate representation and it's... comparable to assembly, but a tiny bit more high level and most importantly: platform independent. Then the Rust compiler just calls ☏ LLVM and says:
Hey buddy, could you please take this IR and generate machine code for the current platform? That would be fantastic ◕ ◡ ◕
To which LLVM responds:
🌈 Sure, no problem, new friend. Here is your highly optimized machine code for [e.g.] x86_64! ♫♪♫
Afterwards they invite a few more friends to wrap it all up in a nice little [e.g.] ELF package and beautifully place it in the users file system. (and the user is like...)
Information like this can be found in the official FAQ which contains a lot of interesting information anyway. For more in-depth details on the Rust compiler, you can read the "Rustc Guide". For this question, the chapter "High Level Overview" is pretty interesting.
I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of the Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this ease, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if it's features aren't very unusual) and write a compiler to bytecode, either in the bytecode, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compile the XML to bytecode using XSLT (or another pattern matching language and tree-based representation). It doesn't remove the dependency on another language, but could mean that more of the bootstrapping work ends up in the final system.
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
ocaml is a language which has both a bytecode interpreter (i.e. a compiler to Ocaml bytecode) and a native compiler (to x86-64 or ARM, etc... assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it is first using its bytecode (of a previous version of the compiler) to compile itself. Later the freshly compiled bytecode is able to compile the native compiler. So Ocaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...
Say I have a Haskell program or library that I'd like to make accessible to non-Haskellers, potentially C programmers. Can I compile it to C using GHC and then distribute this as a C source?
If this is possible, can someone provide a minimal example? (e.g., a Makefile)
Is it possible to use GHC to automatically determine what compiler flags and headers and needed, and then perhaps bundle this into a single folder?
Basically I'm interested in being able to write portions of programs in C and Haskell, and then distributing it as a tarball, but without requiring the target to have GHC and Cabal installed.
I'm interested in being able to write portions of programs in C and Haskell, and then distributing it as a tarball, but without requiring the target to have GHC and Cabal installed.
You're asking for an awful lot of infrastructure that you're unlikely to find. Remember that any Haskell program, even if it is going to be compiled to C, is almost certain to depend on a large, complex run-time system for its correct operation. At a bare minimum, that run-time system has to support garbage collection and lazy evaluation. So you have more than just a translation problem.
I suggest you tackle this problem as a software-distribution problem. Rather than a tarball, provide a package for your favored distribution platform (Debian, Red Hat, InstallShield, whatever). Personally, in order to reuse other people's efforts, I would aim for something that checks for Cabal, installs Cabal if needed, then uses Cabal to install the rest of what your users will need.
You can do this with jhc. It's a full program optimizing compiler that compiles down to C. It doesn't have all the fancy extensions that GHC supports though.
Even if you could, I wouldn't call it "C source". GHC can use C as part of its compilation system, but the generated C code is not even slightly readable. Even if it could be read and understood, it would make no sense to modify it because there is no way (apart from back-porting the changes into Haskell) to incorporate any modifications made by C hackers into future versions of your program.
The term "source" means the code that is written by a human and used to generate the program. In this case that is the Haskell. C generated by a compiler is not "source code", it is an intermediate representation.
You can't get there with GHC. Even when it compiles via C, GHC relies on manipulating the resulting assembly to shuffle segments around, a huge runtime system and a LOT of baggage.
On the other hand, you might have better luck if what you want is supported by somewhat more limited feature set of John Meacham's JHC compiler, however, which generates fairly compact C output.
I know this is an old post, but I still wanted to also mention ajhc. Ajhc forked jhc with the plans of adding new features and later pushing the updates back to jhc.
Say I have a Haskell program or library that I'd like to make accessible to non-Haskellers, potentially C programmers. Can I compile it to C using GHC and then distribute this as a C source
You can compile to C, but the resulting C is not human-readable. You're better off writing header files and using the excellent C FFI alongside it. In any case, distributing the generated C seems like a fool's errand.
Basically I'm interested in being able to write portions of programs in C and Haskell, and then distributing it as a tarball, but without requiring the target to have GHC and Cabal installed.
I do not know of any solutions that do not involve GHC. You'd have to distribute at the very least the Haskell RTS.
Can I compile it to C using GHC and then distribute this as a C source?
No it is not possible but you can easily create interface between haskell and c by using the Foreign Function Interface (FFI) of Haskell.
You can have more example here.
I have a large program written with my own patched version of the GNU Eiffel (SmallEiffel) compiler. While I love the language I'm running into the problem that the compiler is O(n^2) or worse on the compiled system size. So I have to move soon.
ISE Eiffel the only alive Eiffel compiler is not an option for various reasons. Mostly because the compiled code runs way to slow.
I'm looking for a language which is:
imperative and OO
has generics/templates
compiles to native code and does not
require .NET/Java
statically typed (which means fast)
garbage collected
cross platform
not as ugly and braindead as C++
I couldn't come up with anything else then D but this looks a little bit to low level and non stable. Is there really none which satisfies this seven points?
OCaml, perhaps?
You could write in Java and compile to native-ish code with GCJ (it will be native code, but you'll need to link against a fair portion of code that makes up all the things Java needs at run-time. Your users will not need to install a JRE.)
Googling 'object oriented native code compiler' brings up Objective Caml before Eiffel.
If you're willing to take your chances on a research compiler, check out the Diesel language and the native-code Vortex compiler (written for Diesel in Diesel). It is a research project, but it is stable, and Craig Chambers is one of the best people in the business.
What about Python?
It is OO, scripted language, runs fast, has generic templates.