GHC internals: is there a C implementation of the type system? - haskell

I'm looking into the internals of GHC, and I find that all the parsing and the type system are written completely in Haskell. The low-level core of the language is provided by the RTS. The question is: which of the following is true?
The RTS contains a C implementation of the type system and other basic parts of Haskell (I didn't find it; the RTS is mainly GC and threading).
Everything is implemented in Haskell itself. But that seems quite tricky, because building GHC already requires GHC.
Could you explain the development logic of the compiler? For example, Python's internals provide an opaque implementation of everything in C.

As others have noted in the comments, GHC is written almost entirely
in Haskell (plus select GHC extensions) and is intended to be compiled with itself. In fact, the only program in the world that can compile the GHC compiler is the GHC compiler! In particular,
parsing and type inference are implemented in Haskell code, and you
won't find a C implementation hidden in there anywhere.
The best source for understanding the internal structure of the
compiler (and what's implemented how) is the GHC Developer Wiki
and specifically the "GHC Commentary" link. If you have a fair bit of spare time, the video
series from the
Portland 2006 GHC Hackathon is absolutely fascinating.
Note that the idea of a compiler being written in the language it
compiles is not unusual. Many compilers are "self-hosting", meaning
that they are written in the language they compile and are intended to
compile themselves. See, for example, this question on a sister Stack
Exchange site: Why are self-hosting compilers considered a
rite of passage for new languages?, or simply Google for
"self-hosting compiler".
As you say, this is "tricky", because you need a way to get the
process started. Some approaches are:
You can write the first compiler in a different language that
already has a compiler (or write it in assembly language); then,
once you have a running compiler, you can port it to the same
language it compiles. According to this Quora answer, the
first C compiler was written this way: it was written in "NewB",
whose compiler was written in "B", a self-hosting compiler that
had originally been written in assembly and then rewritten in
itself.
If the language is popular enough to have another compiler, write
the compiler in its own language and compile it in phases, first
with the other compiler, then with itself (as compiled by the
other compiler), then again with itself (as compiled by itself).
The last two compiler executables can be compared as a sort of
massive test that the compiler is correct. The GNU C Compiler can
be compiled this way (and this certainly used to be the standard way to install it from source, using the vendor's [inferior!] C compiler to get started).
If an interpreter written in another language already exists or is
easy to write, the compiler can be run by the interpreter to
compile its own source code, and thereafter the compiled compiler
can be used to compile itself. The first LISP compiler is
claimed to be the first compiler to bootstrap itself this way.
The bootstrapping process can often be simplified by writing the compiler (at least initially) in a restricted core of the language, even though the compiler itself is capable of compiling the full language. Then, a sub-par existing compiler or a simplified bootstrapping compiler or interpreter can get the process started.
According to the Wikipedia entry for GHC, the original GHC compiler was written in 1989 in Lazy ML, then rewritten in Haskell later the same year. These days, new versions of GHC with all their shiny new features are compiled on older versions of GHC.
The situation for the Python interpreter is a little different. An
interpreter can be written in the language it interprets, of course,
and there are many examples in the Lisp world of writing Lisp
interpreters in Lisp (for fun, or in developing a new Lisp dialect, or
because you're inventing Lisp), but it can't be interpreters all
the way down, so eventually you'd need either a compiler or an
interpreter implemented in another language. As a result, most
interpreters aren't self-hosting: the mainstream interpreters for
Python, Ruby, and PHP are written in C. (Though PyPy is an alternative
implementation of Python that's written in RPython, a restricted
subset of Python, so...)

Related

sizeof, offsetof, and alignment via TemplateHaskell

I wonder if someone has implemented analogues of the hsc2hs pragmas via Template Haskell? It feels like it should be doable, since TH runs on the target platform at compile time, and GHC always has a C compiler lying around. This could be useful as another way of generating Haskell wrappers for C structs and deriving things for them.
The question is: if there is such a library, please point me to it. Otherwise, tell me if I'm missing something and this is impossible or doesn't make sense.
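No library is named in this thread, but a minimal, hypothetical sketch of the mechanism the question describes might look like the following: using Template Haskell's runIO to write, compile, and run a tiny C probe at compile time, then splice the result back in as a literal. The helper cSizeof and the invocation of cc are illustrative assumptions, not an existing API:

    {-# LANGUAGE TemplateHaskell #-}
    module SizeofTH (cSizeof) where

    import Language.Haskell.TH (Exp, Q, integerL, litE, runIO)
    import System.Process (readProcess)

    -- Hypothetical helper: at compile time, emit a tiny C program that
    -- prints sizeof(ty), build it with the system C compiler, run it,
    -- and splice the result back in as an integer literal.
    cSizeof :: String -> Q Exp
    cSizeof ty = do
      out <- runIO $ do
        writeFile "sizeof_probe.c" $ unlines
          [ "#include <stdio.h>"
          , "int main(void) { printf(\"%zu\", sizeof(" ++ ty ++ ")); return 0; }"
          ]
        _ <- readProcess "cc" ["sizeof_probe.c", "-o", "sizeof_probe"] ""
        readProcess "./sizeof_probe" [] ""
      litE (integerL (read out))

A use site in another module would then read, e.g., $(cSizeof "long") :: Int.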

Generalizing/compiling haskell code into a lambda

I am pretty much 90% sure that the title of this question is wrong; however, I have no idea what the right title would be (I will gladly edit the title if suggestions come along!).
When reading up on Haskell and the core principles of the language, you always find that it is a language "based on lambda expressions". I remember reading somewhere that this means that, in the end, the main function just gets "preprocessed" into one big lambda: everything gets inlined, and basically your entire code becomes one single, huge lambda expression.
My questions are:
Is what I said above true?
If the answer to question 1 is "yes", is there any... decompiler/partial compiler/preprocessor? I know about this, which lets you see the assembly code behind languages like C/C++ and Haskell, but is there anything I could use to explore the generated lambda expression?
This question is asked from a purely educational standpoint and not intended to seek a solution to a particular problem. I simply wish to learn more about a language I find extremely fascinating.
Let's make a distinction between the semantics of Haskell and the implementation of GHC. Mostly because we use different terms for language semantics than for assembly, but also because some other compiler might do things differently than GHC.
Every Haskell program defines main, which is an expression of type IO (). I don't like to call it a "lambda expression" because the type shows that it's not a function. The definition of main is some nested tree of function calls. Even the sequential lines in a do block are defined as calls to the functions (>>) and (>>=).
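As a small illustration of that last point, here is a do block and (roughly) what it stands for once the syntactic sugar is removed:

    main :: IO ()
    main = do
      line <- getLine
      putStrLn "You said:"
      putStrLn line

    -- The same program, written as the nested function calls the do block
    -- abbreviates:
    mainDesugared :: IO ()
    mainDesugared =
      getLine >>= \line ->
        putStrLn "You said:" >>
        putStrLn line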
GHC uses heuristics to decide what to inline, to get the best runtime performance. It will usually inline small expressions that aren't recursive. The runtime system does maintain a stack of what is currently being evaluated, though it holds continuations and update frames rather than the function-call frames a C compiler would produce.
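For what it's worth, those heuristics can be nudged per definition with pragmas; a minimal sketch:

    -- Ask GHC to inline this at every call site:
    {-# INLINE double #-}
    double :: Int -> Int
    double x = x + x

    -- Forbid inlining, e.g. to keep a definition as an optimization barrier:
    {-# NOINLINE opaque #-}
    opaque :: Int -> Int
    opaque x = x * 3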
GHC provides many options for printing intermediate stages of compilation. I'm not sure which you will find interesting. Core is the lowest-level representation that feels like Haskell. Cmm (also called C--) is the highest-level representation that feels like assembly.
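For example (flag names as of reasonably recent GHC versions; check the GHC User's Guide for yours), given a small Main.hs:

    module Main where

    main :: IO ()
    main = print (sum (map (* 2) [1 .. 10 :: Int]))

    -- Compile with one of these flags to see the intermediate stages:
    --   ghc -ddump-ds        Main.hs   -- desugared Haskell
    --   ghc -ddump-simpl     Main.hs   -- Core, after the simplifier
    --   ghc -ddump-stg-final Main.hs   -- final STG (older GHCs: -ddump-stg)
    --   ghc -ddump-cmm       Main.hs   -- Cmm (C--)
    --   ghc -ddump-asm       Main.hs   -- generated assembly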

What would change if a JVM Language Compilation process had an STG phase like Haskell?

I had a friend say:
For me the most interesting thing about Haskell is not the language and the types. It is the Spineless Tagless Graph Machine behind it.
Because Haskell people talk about types all the time, this quote really caught my attention. Now we can look at the Haskell compilation process like this:
Parsing
Type checking
Desugaring + a few bits and bobs
Translation to core
Lion's share of optimization
Translation to STG language
STG language to Cmm (C--)
Cmm to assembly or LLVM
Which we can simplify down to:
.. front end stuff ..
Translate IL to STG language
Compile STG language to C/ASM/LLVM/JavaScript
I.e., there is an intermediate "graph language" that Haskell is compiled to, and various optimisations happen there, prior to it being compiled to LLVM/C etc.
This contrasts to a potential JVM Language compilation process that looks like this:
Convert JVM Language Code to Java bytecode inside a class.
Run the Bytecode on a Java Virtual Machine.
Assuming it were possible to add an intermediate STG compilation step to the Java compilation process, I'm wondering what impact this change would have. What would change about the compiled code?
(I'm aware that you need a pure functional language to get the most use out of the Spineless Tagless G-machine, so if it is helpful in answering the question, assume we're compiling Frege [Haskell for the JVM].)
My question is: What would change if the JVM Language Compilation process had an STG phase like Haskell?
You need to clarify whether you mean Java the language or some language running on the JVM.
My knowledge of Java the language is limited to having read the specification, and I know nothing about the Haskell IR you're talking about. However, Java is, by spec, a dynamic language, and it would be illegal to perform any ahead-of-time transformation that uses information outside of each individual class file.
Of course, a project that doesn't use these dynamic features could break these rules.

Difference between hsc2hs and c2hs?

What is the difference between hsc2hs and c2hs?
I know that hsc2hs is a preprocessor, but what exactly does it do?
And c2hs can make Haskell modules from C code, but do I need hsc2hs for this?
They both have the same purpose: making it easier to write FFI bindings. You don't need to know about hsc2hs if you choose to use c2hs; they are independent. C2hs is more powerful, but also more complicated; Edward Z. Yang illustrates this point with a nice diagram in his c2hs tutorial:
When should I use c2hs? There are many Haskell pre-processors; which one should you use? A short (and somewhat inaccurate) way to characterize the above hierarchy is: the further down you go, the less boilerplate you have to write and the more documentation you have to read; I have thus heard advice that hsc2hs is what you should use for small FFI projects, while c2hs is more appropriate for the larger ones.
Things that c2hs supports that hsc2hs does not:
Automatic generation of foreign import declarations based on the contents of the C header file
Semi-automatic marshalling to and from function calls, and
Translation of pointer types and hierarchies into Haskell types.
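As a taste of the first point, here is a minimal sketch of a c2hs input file (a .chs module), assuming a hypothetical header mylib.h that declares int add(int, int):

    module MyLib where

    #include "mylib.h"

    -- c2hs generates both the foreign import and a marshalling wrapper:
    {#fun pure add as cAdd { `Int', `Int' } -> `Int' #}

The equivalent hsc2hs version would require writing the foreign import declaration by hand.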
Mikhail's answer is good, but there's another side: there are also things that hsc2hs provides that c2hs does not, and it may be necessary to use both in conjunction.
Notably, hsc2hs operates by producing a C executable that is run to generate Haskell code, while c2hs parses the header files directly. Therefore hsc2hs gives you access to #defines, etc. So while I've found c2hs better for generating bindings and wrappers to bindings, as well as "deep" peeks and pokes into complex C structures, it is not good for accessing constants and enumerations, and it only mildly automates the boilerplate for Storable instances. I've found hsc2hs necessary as well, in conjunction with the bindings-dsl package [1], in particular in my case for predefined constants. In one instance, I have one hsc file for an enormous number of constants, and one chs file for wrapping the functions that use those constants.
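To illustrate the contrast, a minimal sketch of an hsc2hs input file (a .hsc module), again assuming a hypothetical header mylib.h that defines a constant MY_FLAG and a struct point { int x; int y; }:

    module Example where

    import Foreign
    import Foreign.C.Types

    #include "mylib.h"

    -- #const reads a #define or enum value straight out of the header:
    myFlag :: CInt
    myFlag = #const MY_FLAG

    data Point = Point { px :: CInt, py :: CInt }

    -- #size / #alignment / #peek / #poke compute sizes and offsets for us,
    -- but the shape of the Storable instance is still written by hand:
    instance Storable Point where
      sizeOf _    = #size struct point
      alignment _ = #alignment struct point
      peek p = Point <$> (#peek struct point, x) p
                     <*> (#peek struct point, y) p
      poke p (Point x y) = do
        (#poke struct point, x) p x
        (#poke struct point, y) p y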
[1] http://hackage.haskell.org/package/bindings-DSL

Why are most S-Expression languages dynamically typed?

How come most Lisps and Schemes are dynamically typed?
Does static typing not mix with some of their common features?
Typing and s-expressions can be made to work together; see Typed Scheme.
Partly it is a historical coincidence that s-expression languages are dynamically typed. These languages tend to rely more heavily on macros, and the ease of parsing and pattern-matching on s-expressions makes macro processing much easier. Most research on sophisticated macros happens in s-expression languages.
Typed Hygienic Macros are hard.
When Lisp was invented in the years from 1958 to 1960, it introduced a lot of features, both as a language and as an implementation (garbage collection, a self-hosting compiler, ...). Some features were inherited (with some improvements) from other languages (list processing, ...). The language implemented computation with functions. The s-expressions were more an implementation detail (at that time) than a language feature. A type system was not part of the language. Using the language in an interactive way was also an early implementation feature.
The useful type systems for functional languages had not yet been invented at that time. Even today it is still relatively difficult to use statically typed languages in an interactive way. There are many implementations of statically typed languages which also provide some interactive interface, but mostly they don't offer the same level of support for interactive use as a typical Lisp system. Programming in an interactive Lisp system means that many things can be changed on the fly, and it could be problematic if type changes had to be propagated through whole programs and their data in such a system. Note that some Schemers have a different view of these things; R6RS is mostly a batch language, generally not that much in the spirit of Lisp...
The functional languages with static type systems that were invented later also got a non-s-expression syntax; they did not offer support for macros or related features. Later, some of these languages/implementations used a preprocessor for syntactic extensions.
Static typing is lexical: it means that all information about types can be inferred from reading the source code, without evaluating any expressions or computing anything (conditionals being the most important case here). A statically typed language is designed so that this can happen; a better term would be 'lexically typed', as in: a compiler can prove from reading the source alone that no type errors will occur.
In the case of Lisp, this is awkwardly different, because Lisp's source code itself is not static. Lisp is homoiconic: it uses data as code and can, to some extent, dynamically edit its own running source.
Lisp was the first dynamically typed language, and probably for this reason program code itself is no longer lexical in Lisp.
Edit: a far more powerful reason: with static typing, you'd have to type lists. You can either have extremely complex types for each list which account for all the elements, or demand that every element has the same type and type the whole thing as a list of that. The former option produces hell with lists of lists. The latter option demands that source code contain only one type per datum, which means that you can't even build expressions, since a list is a different type than an integer.
So I dare say that it is completely and utterly infeasible to realize.
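A minimal Haskell illustration of the dilemma described in that edit:

    -- The second option: a list type forces every element to share one type.
    homogeneous :: [Int]
    homogeneous = [1, 2, 3]              -- accepted

    -- heterogeneous = [1, "two", True]  -- rejected: no single element type

    -- So a Lisp-style "code is a list" expression such as (+ 1 2) cannot be
    -- typed as a plain list; it needs a dedicated recursive syntax type:
    data Expr = Num Int | Sym String | App [Expr]

    example :: Expr
    example = App [Sym "+", Num 1, Num 2]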
