Distributing a Haskell program as C source


Say I have a Haskell program or library that I'd like to make accessible to non-Haskellers, potentially C programmers. Can I compile it to C using GHC and then distribute this as a C source?
If this is possible, can someone provide a minimal example? (e.g., a Makefile)
Is it possible to use GHC to automatically determine what compiler flags and headers are needed, and then perhaps bundle this into a single folder?
Basically I'm interested in being able to write portions of programs in C and Haskell, and then distributing it as a tarball, but without requiring the target to have GHC and Cabal installed.

I'm interested in being able to write portions of programs in C and Haskell, and then distributing it as a tarball, but without requiring the target to have GHC and Cabal installed.
You're asking for an awful lot of infrastructure that you're unlikely to find. Remember that any Haskell program, even if it is going to be compiled to C, is almost certain to depend on a large, complex run-time system for its correct operation. At a bare minimum, that run-time system has to support garbage collection and lazy evaluation. So you have more than just a translation problem.
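To make the lazy-evaluation point concrete, here is a minimal sketch (the names `evens` and `firstEvens` are made up for illustration). The list is conceptually infinite, so generated C cannot simply evaluate it into an array up front. Something at run time has to represent the not-yet-computed tail and reclaim the cells that are no longer reachable.

```haskell
-- An infinite list: only the demanded prefix is ever computed.
evens :: [Integer]
evens = map (* 2) [1 ..]

-- Forces just the first n elements; the rest stays unevaluated.
firstEvens :: Int -> [Integer]
firstEvens n = take n evens
```

Loading this in GHCi and evaluating `firstEvens 5` terminates even though `evens` never ends, which is exactly the behaviour the run-time system has to support.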
I suggest you tackle this problem as a software-distribution problem. Rather than a tarball, provide a package for your favored distribution platform (Debian, Red Hat, InstallShield, whatever). Personally, in order to reuse other people's efforts, I would aim for something that checks for Cabal, installs Cabal if needed, then uses Cabal to install the rest of what your users will need.

You can do this with jhc. It's a full program optimizing compiler that compiles down to C. It doesn't have all the fancy extensions that GHC supports though.

Even if you could, I wouldn't call it "C source". GHC can use C as part of its compilation system, but the generated C code is not even slightly readable. Even if it could be read and understood, it would make no sense to modify it because there is no way (apart from back-porting the changes into Haskell) to incorporate any modifications made by C hackers into future versions of your program.
The term "source" means the code that is written by a human and used to generate the program. In this case that is the Haskell. C generated by a compiler is not "source code", it is an intermediate representation.

You can't get there with GHC. Even when it compiles via C, GHC relies on manipulating the resulting assembly to shuffle segments around, a huge runtime system and a LOT of baggage.
On the other hand, you might have better luck if what you want is supported by the somewhat more limited feature set of John Meacham's JHC compiler, which generates fairly compact C output.

I know this is an old post, but I still wanted to also mention ajhc. Ajhc forked jhc with the plans of adding new features and later pushing the updates back to jhc.

Say I have a Haskell program or library that I'd like to make accessible to non-Haskellers, potentially C programmers. Can I compile it to C using GHC and then distribute this as a C source?
You can compile to C, but the resulting C is not human-readable. You're better off writing header files and using the excellent C FFI alongside it. In any case, distributing the generated C seems like a fool's errand.
Basically I'm interested in being able to write portions of programs in C and Haskell, and then distributing it as a tarball, but without requiring the target to have GHC and Cabal installed.
I do not know of any solutions that do not involve GHC. You'd have to distribute at the very least the Haskell RTS.

Can I compile it to C using GHC and then distribute this as a C source?
No, it is not possible, but you can easily create an interface between Haskell and C by using Haskell's Foreign Function Interface (FFI).
You can find more examples here.
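As a rough sketch of what such an interface looks like on the Haskell side, the block below exports one function to C. The function name `triple` and the file name `Triple.hs` are assumptions made up for this example. It is meant to be compiled with ghc; as far as I know the bytecode interpreter in GHCi does not accept `foreign export` declarations.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
-- In a real library this file would start with a header such as:
--   module Triple where

import Foreign.C.Types (CInt)

-- Hypothetical function we want C callers to see.
triple :: CInt -> CInt
triple x = 3 * x

-- Ask GHC to generate a C-callable entry point named `triple`.
foreign export ccall triple :: CInt -> CInt
```

When a module with a `foreign export` is compiled, GHC also writes a stub header (`Triple_stub.h` for a module named `Triple`) that the C code can include. The C program still has to call `hs_init` before the first call and `hs_exit` at the end, and it has to be linked against the Haskell run-time system, so this does not remove the RTS dependency discussed in the other answers.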

Related

Customising Cabal libraries (I think?)

Perhaps it's just better to describe my problem.
I'm developing a Haskell library. But part of the library is written in C, and another part actually in raw LLVM. To actually get GHC to spit out the code I want I have to follow this process:
Run ghc -emit-llvm on both the code that uses the Haskell module and the "Main" module.
Run clang -emit-llvm on the C file
Now I've got three .ll files from above. I add the part of the library I've handwritten in raw LLVM and llvm-link these into one .ll file.
I then run LLVM's opt on the linked file.
Lastly, I feed the LLVM bitcode file back into GHC (which pleasantly accepts it), and it produces an executable.
This process (with appropriate optimisation settings of course) seems to be the only way I can inline code from C, removing the function call overhead. Since many of these C functions are very small this is significant.
Anyway, I want to be able to distribute the library and for users to be able to use it as painlessly as possible, whilst still gaining the optimisations from the process above. I understand it's going to be a bit more of a pain than an ordinary library (for example, you're forced to compile via LLVM) but as painlessly as possible is what I'm looking for advice for.
Any guidance will be appreciated, I don't expect a step by step answer because I think it will be complex, but just some ideas would be helpful.

Confusions arising from a programming language whose compiler is written in itself

I learned by accident that the Haskell compiler is written in Haskell. That sounds strange to me. How is it possible for a compiler to compile itself? Who compiles the compiler, then? What is the ultimate code accepted by the machine?
Consider the first programming language ever to have a compiler. What language was its compiler written in? Going back even further in time, how did people program before the era of compilers?
Broadly speaking, I am often confused about the border between software (e.g. programs written by people) and hardware (e.g. something executable on a physical machine).
P.S.: I have basic knowledge of compilers, such as lexical analysis, parsing, and code optimization. However, I know little about hardware (the machine).
It seems that the answer to a related post Implementing a compiler in “itself” does not go deeply into the border between software and hardware.
And I would like to see some concrete examples.
EDIT: Some comments mentioned the term "bootstrapping". It seems that there is some minimum core part of a language (like axioms/basic theorems in mathematics) which must be compiled in a lower-level way (instead of by itself). What are they? Are they basically the same in different languages? Again, I would like to see some concrete examples.

As you can read in A History of Haskell, page 28, the first Haskell compiler was written in Lazy ML in June 1989. It implemented essentially all of Haskell 1.0.
Once this compiler existed, it could be used to compile the Haskell version of GHC. The first beta of GHC written in Haskell was released on 1 April 1991. The full release came in December 1992.
Because the Lazy ML-based compiler wasn't developed further, today you use a previous version of GHC to compile GHC. So if you want to build GHC 7.8, you use GHC 7.6 to build it. (In practice it's a bit more complicated, because there are multiple stages, and only the first stage, which doesn't support GHCi or TemplateHaskell, is built with GHC 7.6.)
That means that if you don't have a working haskell compiler today, you have two options:
Try to install a LML compiler and compile the first version of GHC written in Lazy ML. Then use this compiler to compile the next version which is written in Haskell. Then again use that compiler to build the next version, and repeat until you have a reasonably recent compiler. It may be possible to skip a few versions, but I don't know how many. As you can imagine, this could take a lot of time.
(Much easier) Download pre-built GHC binaries.

Um... I have not tried this, but another route would be simply compiling to C and using a C compiler to compile the latest GHC. GHC is itself built in stages, so you don't really even need to convert the whole code base to C, just the first stage, which can then compile the rest. Certainly no need to dig up Lazy ML.
Edit: Note that the resulting compiler will not build binaries targeting the new platform; it would simply run on that platform and be a cross-compiler for targets that GHC already has backends for. Another note: I actually intended this as a response to bennofs' answer, not as a standalone answer to the OP.

How to inspect Haskell bytecode

I am trying to figure out a bug (a serious performance downgrade). Unfortunately, I wasn't able to figure out why by going back many different versions of my code.
I am suspecting it could be some modifications to libraries that I've updated, not to mention in the meanwhile I've updated to GHC 7.6 from 7.4 (and if anybody knows if some laziness behavior has changed I would greatly appreciate it!).
I have an older executable of this code that does not have this bug and thus I wonder if there are any tools to tell me the library versions I was linking to from before? Like if it can figure out the symbols, etc.

GHC creates executables, which are notoriously hard to understand... On my Linux box I can view the assembly code by typing in
objdump -d <executable filename>
but I get back over 100K lines of code from just a simple "Hello, World!" program written in Haskell.
If you happen to have the GHC .hi files, you can get some information about the executable by typing in
ghc --show-iface <hi filename>
This won't give you the assembly code, but you can get some extra information that may prove useful.
As I mentioned in the comment above, on Linux you can use "ldd" to see what C-system libraries you used in the compile, but that is also probably less than useful.
You can try to use a disassembler, but those are generally written to disassemble to C, not anything higher level and certainly not Haskell. That being said, GHC compiles to C as an intermediary (at least it used to; has that changed?), so you might be able to learn something.
Personally, I often find viewing system calls in action much more interesting than viewing pure assembly. On my Linux box, I can view all system calls by using strace (use Wireshark for the network traffic equivalent):
strace <program executable>
This also will generate a lot of data, so it might only be useful if you know of some specific place where direct real world communication (i.e., changes to a file on the hard disk drive) goes wrong.
In all honesty, you are probably better off just debugging the problem from source, although, depending on the actual problem, some of these techniques may help you pinpoint something.
Most of these tools have Mac and Windows equivalents.

Since much has changed in the last 9 years, and apparently this is still the first result a search engine gives on this question (like for me, again), an updated answer is in order:
First of all, yes, while Haskell does not specify a bytecode format, bytecode is also just a kind of machine code, for a virtual machine. So for the rest of the answer I will treat them as the same thing. The “Core” as well as the LLVM intermediate language, or even WASM, could be considered equivalent too.
Secondly, if your old binary is statically linked, then of course, no matter the format your program is in, no symbols will be available to check out. Because that is what linking does. Even with bytecode, and even with just classic static #include in simple languages. So your old binary will be no good, no matter what. And given the optimisations compilers do, a classic decompiler will very likely never be able to figure out what optimised bits used to be partially what libraries. Especially with stream fusion and such “magic”.
Third, you can do the things you asked with a modern Haskell program. But you need to have your binaries compiled with -dynamic and -rdynamic, so that not only the C-calling-convention libraries (e.g. .so) and the Haskell libraries, but also the runtime itself is dynamically loaded. That way you end up with a very small binary, consisting of only your actual code, dynamic linking instructions, and the exact data about what libraries and runtime were used to build it. And since the runtime is compiler-dependent, you will know the compiler too. So it would give you everything you need, but only if you compiled it right. (I recommend using such dynamic linking by default in any case, as it saves memory.)
The last factor that one might forget, is that even the exact same compiler version might behave vastly differently, depending on what IT was compiled with. (E.g. if somebody put a backdoor in the very first version of GHC, and all GHCs after that were compiled with that first GHC, and nobody ever checked, then that backdoor could still be in the code today, with no traces in any source or libraries whatsoever. … Or for a less extreme case, that version of GHC your old binary was built with might have been compiled with different architecture options, leading to it putting more optimised instructions into the binaries it compiles for unless told to cross-compile.)
Finally, of course, you can profile even compiled binaries, by profiling their system calls. This will give you clues about which part of the code acted differently and how. (E.g. if you notice that your new binary floods the system with some slow system calls where the old one just used a single fast one. A classic OpenGL example would be using fast display lists versus slow direct calls to draw triangles. Or using a different sorting algorithm, or having switched to a different kind of data structure that fits your work load badly and thrashes a lot of memory.)

How do functional language compilers work? [duplicate]

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?

Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.

The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Write compiler C2 for language X in language X, and use the compiler C1 to compile it
Now C2 is a fully self hosting environment.
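The three steps above can be played through in a small toy model. This is only a sketch: the types `Lang` and `Compiler`, the function `build`, and the assumption that an executable compiler for Y already exists are all made up for illustration.

```haskell
-- Toy model of bootstrapping; every name here is hypothetical.
data Lang = MachineCode | LangY | LangX
  deriving (Eq, Show)

data Compiler = Compiler
  { accepts   :: Lang  -- language it compiles
  , emits     :: Lang  -- language it produces
  , writtenIn :: Lang  -- form the compiler itself currently exists in
  } deriving (Eq, Show)

-- Feed the source of compiler `p` to compiler `c`.
-- Works only if `c` is executable and understands the language `p` is written in.
build :: Compiler -> Compiler -> Maybe Compiler
build c p
  | writtenIn c == MachineCode && accepts c == writtenIn p = Just p { writtenIn = emits c }
  | otherwise = Nothing

-- Assumed starting point: an already executable compiler for Y.
yCompiler :: Compiler
yCompiler = Compiler LangY MachineCode MachineCode

-- Step 1: C1 compiles X, but its source is written in Y.
c1Source :: Compiler
c1Source = Compiler LangX MachineCode LangY

-- Step 2: C2 compiles X, and its source is written in X.
c2Source :: Compiler
c2Source = Compiler LangX MachineCode LangX

-- Step 3: chain the builds, then rebuild C2 with itself.
selfHosted :: Maybe Compiler
selfHosted = do
  c1 <- build yCompiler c1Source
  c2 <- build c1 c2Source
  build c2 c2Source
```

In this model `build yCompiler c2Source` is `Nothing`, because nothing executable understands X yet, while `selfHosted` succeeds: once C2 exists as a binary it can rebuild itself from its own source.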

The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.

A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this case, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
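The mechanism in that quote can be reduced to a toy model. This is my own simplification, not Thompson's code: the types `Source` and `Binary` and the function `compile` are hypothetical, and a single `Bool` stands in for "contains the Trojan".

```haskell
-- Toy model of the "trusting trust" attack; all names are hypothetical.
data Source
  = LoginSource          -- clean source of the login program
  | CompilerSource Bool  -- compiler source; True = Trojan present in the source
  deriving (Eq, Show)

data Binary
  = LoginBinary Bool     -- True = accepts the attacker's password
  | CompilerBinary Bool  -- True = binary carries the Trojan
  deriving (Eq, Show)

-- What a binary produces when asked to compile a source.
-- A bugged compiler recognises both patterns and re-inserts the Trojan.
compile :: Binary -> Source -> Maybe Binary
compile (CompilerBinary bugged) LoginSource = Just (LoginBinary bugged)
compile (CompilerBinary bugged) (CompilerSource inSource) =
  Just (CompilerBinary (bugged || inSource))
compile (LoginBinary _) _ = Nothing  -- login cannot compile anything

demo :: Maybe (Binary, Binary)
demo = do
  bugged  <- compile (CompilerBinary False) (CompilerSource True)  -- learning phase
  rebuilt <- compile bugged (CompilerSource False)                 -- source cleaned up
  login   <- compile rebuilt LoginSource
  return (rebuilt, login)
```

Here `demo` evaluates to `Just (CompilerBinary True, LoginBinary True)`: the compiler source is clean again, yet the rebuilt compiler and the login program it produces are both still bugged.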

Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.

Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.

As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.

Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.

Another alternative is to create a bytecode machine for your language (or use an existing one if its features aren't very unusual) and write a compiler to bytecode, either in the bytecode, or in your desired language using another intermediate, such as a parser toolkit which outputs the AST as XML, then compile the XML to bytecode using XSLT (or another pattern matching language and tree-based representation). It doesn't remove the dependency on another language, but could mean that more of the bootstrapping work ends up in the final system.
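For a feel of how small such a bytecode machine can be, here is a minimal sketch for an arithmetic-only language. The expression type, the three instructions, and the names `compile` and `run` are all assumptions made up for this example; a real language would need far more.

```haskell
-- Tiny expression language, its bytecode, and a stack machine; all hypothetical.
data Expr = Lit Int | Add Expr Expr | Mul Expr Expr

data Instr = Push Int | AddI | MulI
  deriving (Eq, Show)

-- Compiler: a post-order traversal emits stack code.
compile :: Expr -> [Instr]
compile (Lit n)   = [Push n]
compile (Add a b) = compile a ++ compile b ++ [AddI]
compile (Mul a b) = compile a ++ compile b ++ [MulI]

-- Bytecode machine: Nothing on stack underflow or leftover values.
run :: [Instr] -> Maybe Int
run = go []
  where
    go [x] []                    = Just x
    go stack (Push n : rest)     = go (n : stack) rest
    go (y : x : s) (AddI : rest) = go (x + y : s) rest
    go (y : x : s) (MulI : rest) = go (x * y : s) rest
    go _ _                       = Nothing
```

For example, `compile (Add (Lit 2) (Mul (Lit 3) (Lit 4)))` yields `[Push 2, Push 3, Push 4, MulI, AddI]`, and running that gives `Just 14`. Only `run` has to exist in some other language on a new platform; the compiler can be shipped as bytecode.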

It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should think Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.

Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
OCaml is a language which has both a bytecode implementation (i.e. a compiler to OCaml bytecode, run by a bytecode interpreter) and a native compiler (to x86-64 or ARM, etc... assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it, it first uses its bytecode (from a previous version of the compiler) to compile itself. Later the freshly compiled bytecode is able to compile the native compiler.
The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...

What would be involved in calling ARPACK++ (a C++ library) from Haskell?

I've spent a couple of days developing a program in Haskell, while learning the language. Now I realize that I'll need to call Arpack (a Fortran library) or Arpack++ (a C++ wrapper to Arpack) -- I can't find a good implementation of the Lanczos method with Haskell bindings. Do any more experienced Haskell programmers have an opinion of how difficult this would be?
I've been able to get ".so" ("shared object") versions of libarpack and libarpack++ installed through Ubuntu's repository, but I'm not sure that will suffice. I suspect I'm going to ultimately need to build Arpack++ from source code, which is possible, but I'm getting a lot of build errors, so it will take time. Is there any way to use just the ".so" files, without knowing exactly which version of the header files were used to generate them?
I'm considering using GreenCard, because it looks like the most well maintained Haskell/C bridge. I can't find much documentation though, so I'm wondering whether it will support C++ too.
I'm also starting to wonder whether I should rewrite my program in Python, and use scipy to call Arpack, but I've already sunk a couple of days into writing Haskell. I really like Haskell too, so I'm hoping I can make this work. I guess my overall question is this: What would be involved in making this work with Haskell?
Thanks much.

ELF is the standard format of executables and shared libraries, so accessing the code in these compiled modules is only a matter of knowing function names. If I understand correctly, Fortran is interoperable with C. As a consequence, Fortran should be interoperable with any language which can use C bindings, including Haskell. FYI, you can find all names exported by a module (executable, shared object or simple object archive) using the nm tool (it is usually available in all Linux distros by default). This of course only works if the binary file was not "stripped", but AFAIK stripping is not common practice.
However, Haskell cannot use C++ bindings in a sane way, since C++ polymorphic features require name mangling, and the method of this name transformation is highly compiler-dependent. It is a well-known problem which is not specific to Haskell. Of course, you could try to get a list of exported symbols from the C++ shared object and then bind them using FFI, but... It isn't worth it.
As dsign said, you can use GHC's Foreign Function Interface to create bindings to foreign code. All you need are the library headers (and the library itself, of course). For a C library that would be the header files (*.h), but since your library is written in Fortran, you have to find the equivalent of header files in the library sources, refer to this page to match Fortran and C types, and then use this information to write FFI bindings. It would help to write C bindings first, i.e. write a C header. Then you can even use automatic FFI binding tools like c2hs.
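To show the shape of such a binding without needing Arpack installed, here is a sketch against `modf` from the C math library. I picked it only because it returns one of its results through a pointer, which is how Fortran routines usually receive their arguments when called from C (an assumption about the Fortran compiler's calling convention that you should check for your build). The wrapper name `splitDouble` is made up.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek)

-- C prototype from math.h: double modf(double x, double *iptr);
foreign import ccall unsafe "math.h modf"
  c_modf :: CDouble -> Ptr CDouble -> IO CDouble

-- Hypothetical wrapper: returns (integral part, fractional part).
splitDouble :: Double -> IO (Double, Double)
splitDouble x = alloca $ \out -> do
  frac  <- c_modf (realToFrac x) out  -- C writes the integral part to `out`
  whole <- peek out
  return (realToFrac whole, realToFrac frac)
```

A binding to a real Arpack routine would follow the same pattern, with one `Ptr` argument per parameter and the symbol name taken from the output of nm.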
It may also be helpful to look through the C++ bindings. It is possible that they include the header file I've described above. If they do, then writing FFI bindings will be no more difficult than writing them for any other library.
So, it is not entirely impossible, but it may require some thorough work. Writing bindings to scientific/pure computational libraries is way easier than writing them for some system library which does a lot of IO and keeps its own internal state, but since this library is not written in C... Well, it may be advisable to invest your time in easier alternatives. I cannot say anything about scipy, as I've never used it, but since Python as a language is much simpler than Haskell, it may be a good alternative.

I can tell you that using a C/Fortran library from Haskell, with the help of the Foreign Function Interface would be certainly possible and not terribly complicated. Here is an introduction. In my understanding, you should be able to call anything with a C calling convention, and perhaps even Fortran, without need of recompiling the code. The only exception is with things that look like function calls but are indeed macros, in which case you will have to figure out what the macros do and reproduce them in Haskell.
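As a minimal sketch of what a binding to an already compiled C function looks like, the block below imports `cos` from the system math library. The wrapper name `cosine` is made up for the example; the only thing needed from the header is the function's prototype.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble)

-- Bind the C function `double cos(double)` from the system math library.
foreign import ccall unsafe "math.h cos"
  c_cos :: CDouble -> CDouble

-- Hypothetical Haskell-facing wrapper that hides the C types.
cosine :: Double -> Double
cosine = realToFrac . c_cos . realToFrac
```

No C code is recompiled here: GHC only needs the symbol to be present at link time. The import is declared pure because `cos` has no side effects; anything that touches state or memory should be given an `IO` result type instead.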
As for GreenCard, I have never used it, so I cannot vouch for it.
Your second idea of using Python could potentially save you more than a couple of days. Sad as it is, I have never managed to get Haskell code to adapt easily to my changing requirements, while I find that trivial in Python. Of course, that could be a limitation of my skills with Haskell or my thinking process rather than something to blame on the language.
