Why is "cabal build" so slow compared with "make"? - haskell

If I have a package with several executables, which I initially build using cabal build. Now I change one file that impacts just one executable, cabal seems to take about a second or two to examine each executable to see if it's impacted or not. On the other hand, make, given an equivalent number of executables and source files, will determine in a fraction of a second what needs to be recompiled. Why the huge difference? Is there a reason, cabal can't just build its own version of a makefile and go from there?

Disclaimer: I'm not familiar enough with Haskell or make internals to give technical specifics, but some web searching does offer some insight that lines up with my proposal (trying to avoid eliciting opinions by providing references). Also, I'm assuming your makefile is calling ghc, as cabal apparently would.
Proposal: I believe there could be several key reasons, but the main one is that make is written in C, whereas cabal is written in Haskell. This would be coupled with superior dependency checking from make (although I'm not sure how to prove this without looking at the source code). Other supporting reasons, as found on the web:
cabal tries to do a lot more than simply compiling, e.g. appears to take steps with regard to packaging (https://www.haskell.org/cabal/)
cabal is written in haskell, although the run time is written in C (https://en.wikipedia.org/wiki/Glasgow_Haskell_Compiler)
Again, not being overly familiar with make internals, make may simply have a faster dependency checking mechanism, thereby better tracking these changes. I point this out because from the OP it sounds like there is a significant enough difference to where cabal may be doing a blanket check against all dependencies. I suspect this would be the primary reason for the speed difference, if true.
At any rate, these are open source and can be downloaded from their respective sites (haskell.org/cabal/ and savannah.gnu.org/projects/make/) allowing anyone to examine specifics of the implementations.
It is also likely one could see a lot of variance in speed based upon the switches passed to the compilers in use.
HTH at least point you in the right direction.

Related

Customising Cabal libraries (I think?)

Perhaps it's just better to describe my problem.
I'm developing a Haskell library. But part of the library is written in C, and another part actually in raw LLVM. To actually get GHC to spit out the code I want I have to follow this process:
Run ghc -emit-llvm on both the code that uses the Haskell module and the "Main" module.
Run clang -emit-llvm on the C file
Now I've got three .ll files from above. I add the part of the library I've handwritten in raw LLVM and llvm-link these into one .ll file.
I then run LLVM's opt on the linked file.
Lastly, I feed the LLVM bitcode fileback into GHC (which pleasantly accepts it) and produces an executable.
This process (with appropriate optimisation settings of course) seems to be the only way I can inline code from C, removing the function call overhead. Since many of these C functions are very small this is significant.
Anyway, I want to be able to distribute the library and for users to be able to use it as painlessly as possible, whilst still gaining the optimisations from the process above. I understand it's going to be a bit more of a pain than an ordinary library (for example, you're forced to compile via LLVM) but as painlessly as possible is what I'm looking for advice for.
Any guidance will be appreciated, I don't expect a step by step answer because I think it will be complex, but just some ideas would be helpful.

cabal hell with dependencies of ghc-baked in packages

I have the following instance of cabal hell:
(with ghc-7.8.3 built from source on x86_64 GNU/Linux,
and user-install: True in .cabal/config)
1) at some time, transformers-0.4.0.0 was installed (in user space, shadowing (?) transformers-0.3 from the global space)
2) later, several libraries pick transformers-0.4
3) then, I install hint, which depends on ghc, which depends on transformers-0.3, and which cannot be changed, since ghc is hard-wired.
result: I cannot use libraries from 2) and hint in one project.
As a work-around, I am putting constraint: transformers installed in .cabal/config, and rebuild. Is there a better way to handle this situation - or to avoid it in the first place?
Is there a better way to handle this situation.
No, your approach is sensible.
or to avoid it in the first place?
Tricky. Most people do not build stuff depending on ghc, so for them it makes sense to upgrade transformers etc. Therefore, your constraint is not a suitable default.
As Zeta writes: Sandboxes can help. If you had used sandboxes for your installations in (2), and used another sandbox for whatever tries to use both hint and (2), then it would simply build these dependencies dedicated for whatever you are building.
This comes at the expense of not sharing any space or build-time between the various things you are doing.

One multimode Haskell executable vs separate executables sharing a library

I'm working on a project now in which I configure the cabal file to build several executables which share the library built by the same cabal file. The cabal project is structured much like this one, with one library section followed by several executable sections that include this library in their build-depends sections.
I'm using this approach so I can make common functions available to any number of executables, and create more executables easily as needed.
Yet in his Monad Reader article on Hoogle p.33, Neil Mitchell advocates bundling up Haskell projects into a single executable with multiple modes (e.g. by using Neil Mitchell's CmdArgs library.) So there might be one mode to start a web server, another mode to query the database from the command line, etc. Quote:
Provide one executable
Version 3 had four executable programs – one to generate ranking
information, one to do command line searching, one to do web
searching, and one to do regression testing. Version 4 has one
executable, which does all the above and more, controlled by flags.
There are many advantages to providing only one end program – it
reduces the chance of code breaking without noticing it, it makes the
total file size smaller by not duplicating the Haskell run-time system,
it decreases the number of commands users need to learn. The move to
one multipurpose executable seems to be a common theme, which tools
such as darcs and hpc both being based on one command with multiple
modes.
Is a single multimode executable really the better way to go? Are there countervailing reasons to stick with separate executables sharing the same library?
Personally, I'm more of a fan of the Unix philosophy "write programs that do one thing and do it well". However there are reasons for doing either way, so the only reasonable answer here is: it depends.
One example where it makes senses to bundle everything into same executable, is when you're targeting a platform that is very limited on resources (e.g, embedded system). This is the approach taken by BusyBox.
On the other hand if you divide into multiple executables, you give your clients the option of just using those that matter to them. With a single executable, even if your client really just wanted one functionality, he'll have no way to get rid of the extra baggage.
I'm sure there are a lot of more reasons for going either way, but this just goes to show that there's no definitive answer. It depends on the use case.

How to inspect Haskell bytecode

I am trying to figure out a bug (a serious performance downgrade). Unfortunately, I wasn't able to figure out why by going back many different versions of my code.
I am suspecting it could be some modifications to libraries that I've updated, not to mention in the meanwhile I've updated to GHC 7.6 from 7.4 (and if anybody knows if some laziness behavior has changed I would greatly appreciate it!).
I have an older executable of this code that does not have this bug and thus I wonder if there are any tools to tell me the library versions I was linking to from before? Like if it can figure out the symbols, etc.
GHC creates executables, which are notoriously hard to understand... On my Linux box I can view the assembly code by typing in
objdump -d <executable filename>
but I get back over 100K lines of code from just a simple "Hello, World!" program written in Haskell.
If you happen to have the GHC .hi files, you can get some information about the executable by typing in
ghc --show-iface <hi filename>
This won't give you the assembly code, but you can get some extra information that may prove useful.
As I mentioned in the comment above, on Linux you can use "ldd" to see what C-system libraries you used in the compile, but that is also probably less than useful.
You can try to use a disassembler, but those are generally written to disassemble to C, not anything higher level and certainly not Haskell. That being said, GHC compiles to C as an intermediary (at least it used to; has that changed?), so you might be able to learn something.
Personally I often find view system calls in action much more interesting than viewing pure assembly. On my Linux box, I can view all system calls by running using strace (use Wireshark for the network traffic equivalent):
strace <program executable>
This also will generate a lot of data, so it might only be useful if you know of some specific place where direct real world communication (i.e., changes to a file on the hard disk drive) goes wrong.
In all honesty, you are probably better off just debugging the problem from source, although, depending on the actual problem, some of these techniques may help you pinpoint something.
Most of these tools have Mac and Windows equivalents.
Since much has changed in the last 9 years, and apparently this is still the first result a search engine gives on this question (like for me, again), an updated answer is in order:
First of all, yes, while Haskell does not specify a bytecode format, bytecode is also just a kind of machine code, for a virtual machine. So for the rest of the answer I will treat them as the same thing. The “Core“ as well as the LLVM intermediate language, or even WASM could be considered equivalent too.
Secondly, if your old binary is statically linked, then of course, no matter the format your program is in, no symbols will be available to check out. Because that is what linking does. Even with bytecode, and even with just classic static #include in simple languages. So your old binary will be no good, no matter what. And given the optimisations compilers do, a classic decompiler will very likely never be able to figure out what optimised bits used to be partially what libraries. Especially with stream fusion and such “magic”.
Third, you can do the things you asked with a modern Haskell program. But you need to have your binaries compiled with -dynamic and -rdynamic, So not only the C-calling-convention libraries (e.g. .so), and the Haskell libraries, but also the runtime itself is dynamically loaded. That way you end up with a very small binary, consisting of only your actual code, dynamic linking instructions, and the exact data about what libraries and runtime were used to build it. And since the runtime is compiler-dependent, you will know the compiler too. So it would give you everything you need, but only if you compiled it right. (I recommend using such dynamic linking by default in any case as it saves memory.)
The last factor that one might forget, is that even the exact same compiler version might behave vastly differently, depending on what IT was compiled with. (E.g. if somebody put a backdoor in the very first version of GHC, and all GHCs after that were compiled with that first GHC, and nobody ever checked, then that backdoor could still be in the code today, with no traces in any source or libraries whatsoever. … Or for a less extreme case, that version of GHC your old binary was built with might have been compiled with different architecture options, leading to it putting more optimised instructions into the binaries it compiles for unless told to cross-compile.)
Finally, of course, you can profile even compiled binaries, by profiling their system calls. This will give you clues about which part of the code acted differently and how. (E.g. if you notice that your new binary floods the system with some slow system calls where the old one just used a single fast one. A classic OpenGL example would be using fast display lists versus slow direct calls to draw triangles. Or using a different sorting algorithm, or having switched to a different kind of data structure that fits your work load badly and thrashes a lot of memory.)

Why use build tools like Autotools when we can just write our own makefiles?

Recently, I switched my development environment from Windows to Linux. So far, I have only used Visual Studio for C++ development, so many concepts, like make and Autotools, are new to me. I have read the GNU makefile documentation and got almost an idea about it. But I am kind of confused about Autotools.
As far as I know, makefiles are used to make the build process easier.
Why do we need tools like Autotools just for creating the makefiles? Since all knows how to create a makefile, I am not getting the real use of Autotools.
What is the standard? Do we need to use tools like this or would just handwritten makefiles do?
You are talking about two separate but intertwined things here:
Autotools
GNU coding standards
Within Autotools, you have several projects:
Autoconf
Automake
Libtool
Let's look at each one individually.
Autoconf
Autoconf easily scans an existing tree to find its dependencies and create a configure script that will run under almost any kind of shell. The configure script allows the user to control the build behavior (i.e. --with-foo, --without-foo, --prefix, --sysconfdir, etc..) as well as doing checks to ensure that the system can compile the program.
Configure generates a config.h file (from a template) which programs can include to work around portability issues. For example, if HAVE_LIBPTHREAD is not defined, use forks instead.
I personally use Autoconf on many projects. It usually takes people some time to get used to m4. However, it does save time.
You can have makefiles inherit some of the values that configure finds without using automake.
Automake
By providing a short template that describes what programs will be built and what objects need to be linked to build them, Makefiles that adhere to GNU coding standards can automatically be created. This includes dependency handling and all of the required GNU targets.
Some people find this easier. I prefer to write my own makefiles.
Libtool
Libtool is a very cool tool for simplifying the building and installation of shared libraries on any Unix-like system. Sometimes I use it; other times (especially when just building static link objects) I do it by hand.
There are other options too, see StackOverflow question Alternatives to Autoconf and Autotools?.
Build automation & GNU coding standards
In short, you really should use some kind of portable build configuration system if you release your code to the masses. What you use is up to you. GNU software is known to build and run on almost anything. However, you might not need to adhere to such (and sometimes extremely pedantic) standards.
If anything, I'd recommend giving Autoconf a try if you're writing software for POSIX systems. Just because Autotools produce part of a build environment that's compatible with GNU standards doesn't mean you have to follow those standards (many don't!) :) There are plenty of other options, too.
Edit
Don't fear m4 :) There is always the Autoconf macro archive. Plenty of examples, or drop in checks. Write your own or use what's tested. Autoconf is far too often confused with Automake. They are two separate things.
First of all, the Autotools are not an opaque build system but a loosely coupled tool-chain, as tinkertim already pointed out. Let me just add some thoughts on Autoconf and Automake:
Autoconf is the configuration system that creates the configure script based on feature checks that are supposed to work on all kinds of platforms. A lot of system knowledge has gone into its m4 macro database during the 15 years of its existence. On the one hand, I think the latter is the main reason Autotools have not been replaced by something else yet. On the other hand, Autoconf used to be far more important when the target platforms were more heterogeneous and Linux, AIX, HP-UX, SunOS, ..., and a large variety of different processor architecture had to be supported. I don't really see its point if you only want to support recent Linux distributions and Intel-compatible processors.
Automake is an abstraction layer for GNU Make and acts as a Makefile generator from simpler templates. A number of projects eventually got rid of the Automake abstraction and reverted to writing Makefiles manually because you lose control over your Makefiles and you might not need all the canned build targets that obfuscate your Makefile.
Now to the alternatives (and I strongly suggest an alternative to Autotools based on your requirements):
CMake's most notable achievement is replacing AutoTools in KDE. It's probably the closest you can get if you want to have Autoconf-like functionality without m4 idiosyncrasies. It brings Windows support to the table and has proven to be applicable in large projects. My beef with CMake is that it is still a Makefile-generator (at least on Linux) with all its immanent problems (e.g. Makefile debugging, timestamp signatures, implicit dependency order).
SCons is a Make replacement written in Python. It uses Python scripts as build control files allowing very sophisticated techniques. Unfortunately, its configuration system is not on par with Autoconf. SCons is often used for in-house development when adaptation to specific requirements is more important than following conventions.
If you really want to stick with Autotools, I strongly suggest to read Recursive Make Considered Harmful (archived) and write your own GNU Makefile configured through Autoconf.
The answers already provided here are good, but I'd strongly recommend not taking the advice to write your own makefile if you have anything resembling a standard C/C++ project. We need the autotools instead of handwritten makefiles because a standard-compliant makefile generated by automake offers a lot of useful targets under well-known names, and providing all these targets by hand is tedious and error-prone.
Firstly, writing a Makefile by hand seems a great idea at first, but most people will not bother to write more than the rules for all, install and maybe clean. automake generates dist, distcheck, clean, distclean, uninstall and all these little helpers. These additional targets are a great boon to the sysadmin that will eventually install your software.
Secondly, providing all these targets in a portable and flexible way is quite error-prone. I've done a lot of cross-compilation to Windows targets recently, and the autotools performed just great. In contrast to most hand-written files, which were mostly a pain in the ass to compile. Mind you, it is possible to create a good Makefile by hand. But don't overestimate yourself, it takes a lot of experience and knowledge about a bunch of different systems, and automake creates great Makefiles for you right out of the box.
Edit: And don't be tempted to use the "alternatives". CMake and friends are a horror to the deployer because they aren't interface-compatible to configure and friends. Every half-way competent sysadmin or developer can do great things like cross-compilation or simple things like setting a prefix out of his head or with a simple --help with a configure script. But you are damned to spend an hour or three when you have to do such things with BJam. Don't get me wrong, BJam is probably a great system under the hood, but it's a pain in the ass to use because there are almost no projects using it and very little and incomplete documentation. autoconf and automake have a huge lead here in terms of established knowledge.
So, even though I'm a bit late with this advice for this question: Do yourself a favor and use the autotools and automake. The syntax might be a bit strange, but they do a way better job than 99% of the developers do on their own.
For small projects or even for large projects that only run on one platform, handwritten makefiles are the way to go.
Where autotools really shine is when you are compiling for different platforms that require different options. Autotools is frequently the brains behind the typical
./configure
make
make install
compilation and install steps for Linux libraries and applications.
That said, I find autotools to be a pain and I've been looking for a better system. Lately I've been using bjam, but that also has its drawbacks. Good luck finding what works for you.
Autotools are needed because Makefiles are not guaranteed to work the same across different platforms. If you handwrite a Makefile, and it works on your machine, there is a good chance that it won't on mine.
Do you know what unix your users will be using? Or even which distribution of Linux? Do you know where they want software installed? Do you know what tools they have, what architecture they want to compile on, how many CPUs they have, how much RAM and disk might be available to them?
The *nix world is a cross-platform landscape, and your build and install tools need to deal with that.
Mind you, the auto* tools date from an earlier epoch, and there are many valid complaints about them, but the several projects to replace them with more modern alternatives are having trouble developing a lot of momentum.
Lots of things are like that in the *nix world.
Autotools is a disaster.
The generated ./configure script checks for features that have not been present on any Unix system for last 20 years or so. To do this, it spends a huge amount of time.
Running ./configure takes for ages. Although modern server CPUs can have even dozens of cores, and there may be several such CPUs per server, the ./configure is single-threaded. We still have enough years of Moore's law left that the number of CPU cores will go way up as a function of time. So, the time ./configure takes will stay approximately constant whereas parallel build times reduce by a factor of 2 every 2 years due to Moore's law. Or actually, I would say the time ./configure takes might even increase due to increasing software complexity taking advantage of improved hardware.
The mere act of adding just one file to your project requires you to run automake, autoconf and ./configure which will take ages, and then you'll probably find that since some important files have changed, everything will be recompiled. So add just one file, and make -j${CPUCOUNT} recompiles everything.
And about make -j${CPUCOUNT}. The generated build system is a recursive one. Recursive make has for a long amount of time been considered harmful.
Then when you install the software that has been compiled, you'll find that it doesn't work. (Want proof? Clone protobuf repository from Github, check out commit 9f80df026933901883da1d556b38292e14836612, install it to a Debian or Ubuntu system, and hey presto: protoc: error while loading shared libraries: libprotoc.so.15: cannot open shared object file: No such file or directory -- since it's in /usr/local/lib and not /usr/lib; workaround is to do export LD_RUN_PATH=/usr/local/lib before typing make).
The theory is that by using autotools, you could create a software package that can be compiled on Linux, FreeBSD, NetBSD, OpenBSD, DragonflyBSD and other operating systems. The fact? Every non-Linux system to build packages from source has numerous patch files in their repository to work around autotools bugs. Just take a look at e.g. FreeBSD /usr/ports: it's full of patches. So, it would have been as easy to create a small patch for a non-autotools build system on a per project basis than to create a small patch for an autotools build system on a per project basis. Or perhaps even easier, as standard make is much easier to use than autotools.
The fact is, if you create your own build system based on standard make (and make it inclusive and not recursive, following the recommendations of the "Recursive make considered harmful" paper), things work in a much better manner. Also, your build time goes down by an order of magnitude, perhaps even two orders of magnitude if your project is very small project of 10-100 C language files and you have dozens of cores per CPU and multiple CPUs. It's also much easier to interface custom automatic code generation tools with a custom build system based on standard make instead of dealing with the m4 mess of autotools. With standard make, you can at least type a shell command into the Makefile.
So, to answer your question: why use autotools? Answer: there is no reason to do so. Autotools has been obsolete since when commercial Unix has become obsolete. And the advent of multi-core CPUs has made autotools even more obsolete. Why programmers haven't realized that yet, is a mystery. I'll happily use standard make on my build systems, thank you. Yes, it takes some amount of work to generate the dependency files for C language header inclusion, but the amount of work is saved by not having to fight with autotools.
I dont feel I am an expert to answer this but still give you a bit analogy with my experience.
Because upto some extent it is similar to why we should write Embedded Codes in C language(High Level language) rather then writing in Assembly Language.
Both solves the same purpose but latter is more lenghty, tedious ,time consuming and more error prone(unless you know ISA of the processor very well) .
Same is the case with Automake tool and writing your own makefile.
Writing Makefile.am and configure.ac is pretty simple than writing individual project Makefile.

Resources