sizeof, offsetof, and alignment via TemplateHaskell - haskell

I wonder if someone has implemented analogues of the hsc2hs pragmas via Template Haskell? It feels like it should be doable, since TH runs on the target platform at compile time, and GHC always has a C compiler lying around. This could be useful as another way of generating Haskell wrappers for C structs and deriving stuff for them.
The question is: if there is such a library, please point me to it. Otherwise, tell me if I'm missing something and this is impossible or doesn't make sense.
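To make it concrete, here is roughly the kind of thing I have in mind (a hand-written sketch, not an existing library; it shells out at compile time to whatever cc is on the PATH and assumes the process, temporary, and filepath packages):

    {-# LANGUAGE TemplateHaskell #-}
    module SizeofTH (cSizeof) where

    import Language.Haskell.TH
    import System.Process (readProcess)
    import System.IO.Temp (withSystemTempDirectory)
    import System.FilePath ((</>))

    -- $(cSizeof "<stddef.h>" "size_t") expands to an Integer literal holding
    -- the size of the C type, as reported by the C compiler on the build machine.
    cSizeof :: String -> String -> Q Exp
    cSizeof header ty = do
      n <- runIO $ withSystemTempDirectory "th-sizeof" $ \dir -> do
        let src = dir </> "probe.c"
            exe = dir </> "probe"
        writeFile src $ unlines
          [ "#include <stdio.h>"
          , "#include " ++ header
          , "int main(void) { printf(\"%zu\", sizeof(" ++ ty ++ ")); return 0; }"
          ]
        _   <- readProcess "cc" ["-o", exe, src] ""
        out <- readProcess exe [] ""
        return (read out :: Integer)
      litE (IntegerL n)

Presumably the same trick extends to offsetof(...) and _Alignof(...), which is why I suspect the whole hsc2hs feature set could be replicated this way.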

Related

GHC internals: is there C implementation of the type system?

I'm looking into the internals of GHC and I find all the parsing and the type system written completely in Haskell. The low-level core of the language is provided by the RTS. The question is: which one of the following is true?
- The RTS contains a C implementation of the type system and other basic parts of Haskell (I didn't find it; the RTS is mainly GC and threading).
- Everything is implemented in Haskell itself. But that seems quite tricky, because building GHC already requires GHC.
Could you explain the development logic of the compiler? For comparison, Python's internals provide an opaque implementation of everything in C.
As others have noted in the comments, GHC is written almost entirely
in Haskell (plus select GHC extensions) and is intended to be compiled with itself. In fact, the only program in the world that can compile the GHC compiler is the GHC compiler! In particular,
parsing and type inference are implemented in Haskell code, and you
won't find a C implementation hidden in there anywhere.
The best source for understanding the internal structure of the
compiler (and what's implemented how) is the GHC Developer Wiki
and specifically the "GHC Commentary" link. If you have a fair bit of spare time, the video
series from the
Portland 2006 GHC Hackathon is absolutely fascinating.
Note that the idea of a compiler being written in the language it
compiles is not unusual. Many compilers are "self-hosting" meaning
that they are written in the language they compile and are intended to
compile themselves. See, for example, this question on another Stack
Exchange sister site: Why are self-hosting compilers considered a
rite of passage for new languages?, or simply Google for
"self-hosting compiler"
As you say, this is "tricky", because you need a way to get the
process started. Some approaches are:
You can write the first compiler in a different language that
already has a compiler (or write it in assembly language); then,
once you have a running compiler, you can port it to the same
language it compiles. According to this Quora answer, the
first C compiler was written this way. It was written in "NewB"
whose compiler was written in "B", a self-hosting compiler that
had originally been written in assembly and then rewritten in
itself.
If the language is popular enough to have another compiler, write
the compiler in its own language and compile it in phases, first
with the other compiler, then with itself (as compiled by the
other compiler), then again with itself (as compiled by itself).
The last two compiler executables can be compared as a sort of
massive test that the compiler is correct. The GNU C Compiler can
be compiled this way (and this certainly used to be the standard way to install it from source, using the vendor's [inferior!] C compiler to get started).
If an interpreter written in another language already exists or is
easy to write, the compiler can be run by the interpreter to
compile its own source code, and thereafter the compiled compiler
can be used to compile itself. The first LISP compiler is
claimed to be the first compiler to bootstrap itself this way.
The bootstrapping process can often be simplified by writing the compiler (at least initially) in a restricted core of the language, even though the compiler itself is capable of compiling the full language. Then, a sub-par existing compiler or a simplified bootstrapping compiler or interpreter can get the process started.
According to the Wikipedia entry for GHC, the original GHC compiler was written in 1989 in Lazy ML, then rewritten in Haskell later the same year. These days, new versions of GHC with all their shiny new features are compiled on older versions of GHC.
The situation for the Python interpreter is a little different. An
interpreter can be written in the language it interprets, of course,
and there are many examples in the Lisp world of writing Lisp
interpreters in Lisp (for fun, or in developing a new Lisp dialect, or
because you're inventing Lisp), but it can't be interpreters all
the way down, so eventually you'd need either a compiler or an
interpreter implemented in another language. As a result, most
interpreters aren't self-hosting: the mainstream interpreters for
Python, Ruby, and PHP are written in C. (Though PyPy is an alternative
implementation of the Python interpreter written in RPython, a restricted
subset of Python, so...)

Haskell compiler magic: what requires a special treatment from the compiler?

When trying to learn Haskell, one of the difficulties that arises is telling when something requires special magic from the compiler. One example that comes to mind is the seq function, which can't be defined in plain Haskell: you can't write a seq2 function that behaves exactly like the built-in seq. Consequently, when teaching someone about seq, you need to mention that it's special, because it's a special symbol for the compiler.
Another example would be the do-notation which only works with instances of the Monad class.
Sometimes it's not so obvious. For instance, continuations: does the compiler know about Control.Monad.Cont, or is it plain old Haskell that you could have invented yourself? In this case, I think nothing special is required from the compiler, even though continuations are a very strange kind of beast.
Language extensions aside, what other compiler magic should Haskell learners be aware of?
Nearly all the GHC primitives that cannot be implemented in userland are in the ghc-prim package (it even has a module called GHC.Magic!), so browsing it will give you a good sense.
Note that you should not use this module in userland code unless you know exactly what you are doing. Most of the usable stuff from it is exported in downstream modules in base, sometimes in modified form. Those downstream locations and APIs are considered more stable, while ghc-prim makes no guarantees as to how it will act from version to version.
The GHC-specific stuff is reexported in GHC.Exts, but plenty of other things go into the Prelude (such as basic data types, as well as seq) or the concurrency libraries, etc.
Polymorphic seq is definitely magic. You can implement seq for any specific type, but only the compiler can implement one function for all possible types [and avoid optimising it away even though it looks like a no-op].
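For instance, a type-specific version is ordinary Haskell: matching on a literal or constructor forces the argument (seqInt is just my own illustrative name):

    -- Forces its first argument to WHNF by pattern-matching on it, then
    -- returns the second. Works only for Int; the polymorphic seq needs
    -- compiler support.
    seqInt :: Int -> b -> b
    seqInt 0 y = y
    seqInt _ y = y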
Obviously the entire IO monad is deeply magic, as is everything to do with concurrency and parallelism (par, forkIO, MVar), mutable storage, exception throwing and catching, querying the garbage collector and run-time stats, etc.
The IO monad can be considered a special case of the ST monad, which is also magic. (It allows truly mutable storage, which requires low-level stuff.)
The State monad, on the other hand, is completely ordinary user-level code that anybody can write. So is the Cont monad. So are the various exception / error monads.
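As a sketch of what "ordinary user-level code" means here, a hand-rolled State monad looks like this (the real Control.Monad.State is a transformer with more instances, but nothing magic):

    newtype State s a = State { runState :: s -> (a, s) }

    instance Functor (State s) where
      fmap f (State g) = State $ \s -> let (a, s') = g s in (f a, s')

    instance Applicative (State s) where
      pure a = State $ \s -> (a, s)
      State f <*> State g = State $ \s ->
        let (h, s')  = f s
            (a, s'') = g s'
        in (h a, s'')

    instance Monad (State s) where
      State g >>= k = State $ \s -> let (a, s') = g s in runState (k a) s'

    get :: State s s
    get = State $ \s -> (s, s)

    put :: s -> State s ()
    put s = State $ \_ -> ((), s)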
Anything to do with syntax (do-blocks, list comprehensions) is hard-wired into the language definition. (Note, though, that some of these respond to LANGUAGE RebindableSyntax, which lets you change what functions it binds to.) Also the deriving stuff; the compiler "knows about" a handful of special classes and how to auto-generate instances for them. Deriving for newtype works for any class though. (It's just copying an instance from one type to another identical copy of that type.)
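For example, with GeneralizedNewtypeDeriving the derived Num instance below is literally Int's instance reused at the new type (Age is a made-up illustration):

    {-# LANGUAGE GeneralizedNewtypeDeriving #-}
    newtype Age = Age Int deriving (Eq, Ord, Num, Show)

    -- Age 3 + Age 4 == Age 7, using Int's arithmetic under the hood.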
Arrays are hard-wired, much as in every other programming language.
All of the foreign function interface is clearly hard-wired.
STM can be implemented in user code (I've done it), but it's currently hard-wired. (I imagine this gives a significant performance benefit. I haven't tried actually measuring it.) But, conceptually, that's just an optimisation; you can implement it using the existing lower-level concurrency primitives.

Which GHC type system extensions should I try to learn first?

GHC has a whole zoo of type system extensions: multiparameter type classes, functional dependencies, rank-n polymorphism, existential types, GADTs, type families, scoped type variables, etc., etc. Which ones are likely to be easiest to learn about first? Also, do these features fit together in some way, or are they all pretty much separate ideas useful for entirely different purposes?
A good one to learn early on is ScopedTypeVariables, because they are very useful for debugging type issues in a function. When I have a baffling type error, I temporarily add type declarations on each of the expressions in the function. (Often you'll need to break up some of the expressions to see what's really going on.) That usually helps me determine which expression has a different type than I expected.
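A minimal example of where the extension actually matters (names are my own): without ScopedTypeVariables and the explicit forall, the local signature on xcons would introduce a fresh type variable and the definition would be rejected.

    {-# LANGUAGE ScopedTypeVariables #-}

    -- The forall brings "a" into scope over the whole body, so the local
    -- signature below refers to the same "a" as the top-level one.
    prefix :: forall a. a -> [[a]] -> [[a]]
    prefix x yss = map xcons yss
      where
        xcons :: [a] -> [a]
        xcons ys = x : ys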
TypeFamilies are more powerful than MultiParamTypeClasses, so you don't really need the latter. When working with type families, you usually need to enable FlexibleContexts and FlexibleInstances as well, so that's three pragmas you'll learn for the price of one. FunctionalDependencies is generally used with MultiParamTypeClasses, so that's one you can ignore for now.
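As a small illustration of the style (a toy class of my own invention), an associated type family lets the container type determine its element type, which is the job a functional dependency does in the MultiParamTypeClasses world:

    {-# LANGUAGE TypeFamilies #-}

    class Collection c where
      type Elem c          -- each instance picks its element type
      empty  :: c
      insert :: Elem c -> c -> c

    instance Collection [a] where
      type Elem [a] = a
      empty  = []
      insert = (:)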
GHC is pretty good at telling you when you need to enable Rank2Types or RankNTypes, so you can postpone learning more about those until a little later.
Those are the ones I'd start with.
EDIT: Removed comment about avoiding StandaloneDeriving. (I was thinking of orphan instances.)
Now that I've worked with Haskell some more, I've developed some of my own opinions on the matter. mhwombat's suggestion of ScopedTypeVariables was a good one. These days it's usually the first thing I type when I start writing a Haskell module. Whenever code gets a bit tricky, I like to have lots of type signatures to help me see what I'm doing, and this extension lets me write ones I otherwise couldn't. It can also improve type errors dramatically, and it seems to be nearly essential when using other type system extensions.
I didn't really appreciate GADTs much until I learned a bit about dependently typed programming languages. Now I think it's awesome how they can serve as proof objects, and how they can be constrained by type indices.
GADTs work very well with DataKinds, which provides fun type indices like lists and Booleans. I can now do things like express that an indexed list is as long as a tree is tall, without driving myself crazy using higher-order nested types.
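The standard toy example of that combination is a length-indexed vector, sketched here from memory: DataKinds lifts Nat to the type level, the GADT constructors fix the index, and vhead on an empty vector becomes a compile-time error.

    {-# LANGUAGE GADTs, DataKinds, KindSignatures #-}

    data Nat = Z | S Nat

    data Vec (n :: Nat) a where
      VNil  :: Vec 'Z a
      VCons :: a -> Vec n a -> Vec ('S n) a

    vhead :: Vec ('S n) a -> a
    vhead (VCons x _) = x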
I still haven't explored multiparameter type classes and functional dependencies much yet. I have, however, come to appreciate Edward Kmett's reflection library, which uses them in its interface.
I have learned a healthy respect for overlapping and incoherent instances, by which I mean I never use them. The overlapping ones feel a bit like macro programming with worse error messages, while the incoherent ones are insane.
RankNTypes is powerful indeed. It's one of those things that's rarely needed, but when needed is really essential.
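A small example of the kind of thing only a higher-rank type can express (a toy function of my own): the argument itself must be polymorphic, because it is used at two different element types within one call.

    {-# LANGUAGE RankNTypes #-}

    both :: (forall a. [a] -> Int) -> ([Bool], [Char]) -> (Int, Int)
    both f (xs, ys) = (f xs, f ys)

    -- both length ([True, False], "abc") == (2, 3)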

How does Haskell compile zipper patterns?

"Learn You a Haskell" shows the following data type and then gives a bunch of algorithms that manipulate the trees using this.
data Crumb a = LeftCrumb a (Tree a) | Right Crumb a (Tree a) deriving (Show)
Unlike imperative languages where something like binary search would be explained in terms of walking down pointers. Here there are no mentions of pointers. But how do algorithms like binary search get compiled down in Haskll? Do they compile down to the same efficient walking down pointers?
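For reference, the surrounding definitions from the book look roughly like this (a sketch following LYAH's naming; its goLeft doesn't actually return a Maybe, but the idea is the same):

    data Tree a  = Empty | Node a (Tree a) (Tree a) deriving (Show)
    data Crumb a = LeftCrumb a (Tree a) | RightCrumb a (Tree a) deriving (Show)

    type Breadcrumbs a = [Crumb a]
    type Zipper a      = (Tree a, Breadcrumbs a)

    -- Moving left just allocates a new crumb pointing at the pieces we
    -- left behind; nothing is mutated.
    goLeft :: Zipper a -> Maybe (Zipper a)
    goLeft (Node x l r, bs) = Just (l, LeftCrumb x r : bs)
    goLeft (Empty, _)       = Nothing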
The Haskell language: Compilers can do whatever they want to the code as long as it makes sense according to the specification. This means there can be pointer walking just like you'd expect in C, or there might not be. The language specification doesn't really care how things are implemented, as long as they work as they're supposed to.
The GHC compiler: If you really want to know how GHC compiles your code in the end, I suggest learning to read C-- (pronounced "C-minus-minus") or assembly. You can get GHC to spit out C-- code with -ddump-cmm and assembly with -ddump-asm. Unless you are planning to start work on optimising the compiler though, I don't think this would be a very useful exercise.
As a general rule, imperative code GHC writes looks very different from what a human would write. So probably no pointers in the sense you're thinking of. (And the cool thing is that it works out efficiently in the end anyway!)

Difference between hsc2hs and c2hs?

What is the difference between hsc2hs and c2hs?
I know that hsc2hs is a preprocessor, but what exactly does it do?
And c2hs can make Haskell modules from C code, but do I need hsc2hs for that?
They both have the same purpose: making it easier to write FFI bindings. You don't need to know about hsc2hs if you choose to use c2hs; they are independent. C2hs is more powerful, but also more complicated; Edward Z. Yang illustrates this point with a nice diagram in his c2hs tutorial:
When should I use c2hs? There are many Haskell pre-processors; which one should you use? A short (and somewhat inaccurate) way to characterize the above hierarchy is that the further down you go, the less boilerplate you have to write and the more documentation you have to read; I have thus heard advice that hsc2hs is what you should use for small FFI projects, while c2hs is more appropriate for the larger ones.
Things that c2hs supports that hsc2hs does not:
- Automatic generation of foreign import declarations based on the contents of the C header file,
- Semi-automatic marshalling to and from function calls, and
- Translation of pointer types and hierarchies into Haskell types.
Mikhail's answer is good, but there's another side. There are also things that hsc2hs provides that c2hs does not, and it may be necessary to use both in conjunction.
Notably, hsc2hs operates by producing a C executable that is run to generate Haskell code, while c2hs parses header files directly. Therefore hsc2hs allows you to access #defines, etc. So while I've found c2hs better for generating bindings and wrappers around bindings, as well as "deep" peeks and pokes into complex C structures, it is not good for accessing constants and enumerations, and it only mildly automates the boilerplate for Storable instances. I've found hsc2hs necessary as well, in conjunction with the bindings-dsl package [1], in particular in my case for predefined constants. In one instance, I have one hsc file for an enormous number of constants, and one chs file for wrapping the functions that use those constants.
[1] http://hackage.haskell.org/package/bindings-DSL
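To make the split concrete, here is the shape of the .hsc side. The directive names (#const, #size, #alignment, #peek, #poke) are real hsc2hs directives (#alignment needs a reasonably recent hsc2hs); the header, struct, and field names are invented for illustration:

    -- MyStruct.hsc
    module MyStruct where

    #include "mylib.h"

    import Foreign.Storable
    import Foreign.C.Types

    -- Constants and enumerations are where hsc2hs shines:
    myFlagFoo :: CInt
    myFlagFoo = #const MYLIB_FLAG_FOO

    data MyStruct = MyStruct { msX :: CInt, msY :: CDouble }

    instance Storable MyStruct where
      sizeOf    _ = #size struct my_struct
      alignment _ = #alignment struct my_struct
      peek p = MyStruct <$> (#peek struct my_struct, x) p
                        <*> (#peek struct my_struct, y) p
      poke p (MyStruct x y) = do
        (#poke struct my_struct, x) p x
        (#poke struct my_struct, y) p y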

Resources