Tools for analyzing performance of a Haskell program


While solving some Project Euler problems to learn Haskell (so I'm currently a complete beginner), I came across Problem 12. I wrote this (naive) solution:
-- Get the number of divisors of n
numDivs :: Integer -> Integer
numDivs n = toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2
-- Generate a list of triangular values
triaList :: [Integer]
triaList = [foldr (+) 0 [1..n] | n <- [1..]]
-- The same, recursive
triaList2 = go 0 1
  where go cs n = (cs+n):go (cs+n) (n+1)
-- Finds the first triangular value with more than n divisors
sol :: Integer -> Integer
sol n = head $ filter (\x -> numDivs(x)>n) triaList2
This solution for n=500 (sol 500) is extremely slow (it has been running for more than 2 hours now), so I wondered how to find out why it is so slow. Are there any commands that tell me where most of the computation time is spent, so I know which part of my Haskell program is slow? Something like a simple profiler.
To make it clear, I'm not asking for a faster solution but for a way to find this solution. How would you start if you had no Haskell knowledge?
I tried to write two triaList functions but found no way to test which one is faster, so this is where my problems start.
Thanks

how to find out why this solution is so slow. Are there any commands that tell me where most of the computation-time is spent so I know which part of my haskell-program is slow?
Precisely! GHC provides many excellent tools, including:
runtime statistics
time profiling
heap profiling
thread analysis
core analysis
comparative benchmarking
GC tuning
A tutorial on using time and space profiling is part of Real World Haskell.
GC Statistics
Firstly, ensure you're compiling with ghc -O2. And you might make sure it is a modern GHC (e.g. GHC 6.12.x)
The first thing we can do is check that garbage collection isn't the problem.
Run your program with +RTS -s
$ time ./A +RTS -s
./A +RTS -s
749700
9,961,432,992 bytes allocated in the heap
2,463,072 bytes copied during GC
29,200 bytes maximum residency (1 sample(s))
187,336 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 19002 collections, 0 parallel, 0.11s, 0.15s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 13.15s ( 13.32s elapsed)
GC time 0.11s ( 0.15s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.26s ( 13.47s elapsed)
%GC time 0.8% (1.1% elapsed)
Alloc rate 757,764,753 bytes per MUT second
Productivity 99.2% of total user, 97.6% of total elapsed
./A +RTS -s 13.26s user 0.05s system 98% cpu 13.479 total
Which already gives us a lot of information: you only have a 2M heap, and GC takes up 0.8% of time. So no need to worry that allocation is the problem.
Time Profiles
Getting a time profile for your program is straightforward: compile with -prof -auto-all
$ ghc -O2 --make A.hs -prof -auto-all
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
And, for N=200:
$ time ./A +RTS -p
749700
./A +RTS -p 13.23s user 0.06s system 98% cpu 13.547 total
which creates a file, A.prof, containing:
Sun Jul 18 10:08 2010 Time and Allocation Profiling Report (Final)
A +RTS -p -RTS
total time = 13.18 secs (659 ticks @ 20 ms)
total alloc = 4,904,116,696 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
numDivs Main 100.0 100.0
Indicating that all your time is spent in numDivs, and it is also the source of all your allocations.
Heap Profiles
You can also get a breakdown of those allocations by running with +RTS -p -hy, which creates A.hp, which you can view by converting it to a PostScript file (hp2ps -c A.hp). The resulting graph (not reproduced here) tells us there's nothing wrong with your memory use: it is allocating in constant space.
So your problem is algorithmic complexity of numDivs:
toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2
Fix that, which is 100% of your running time, and everything else is easy.
Optimizations
This expression is a good candidate for the stream fusion optimization, so I'll rewrite it
to use Data.Vector, like so:
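import qualified Data.Vector.Unboxed as U
-- enumFromN takes a start and a length: n `div` 2 elements cover [2 .. n `div` 2 + 1]
numDivs n = fromIntegral $
    2 + (U.length $
        U.filter (\x -> fromIntegral n `rem` x == 0) $
        (U.enumFromN 2 (fromIntegral n `div` 2) :: U.Vector Int))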
Which should fuse into a single loop with no unnecessary heap allocations. That is, it will be faster than the list version by a constant factor, although the asymptotic complexity is unchanged. You can use the ghc-core tool (for advanced users) to inspect the intermediate code after optimization.
Testing this, ghc -O2 --make Z.hs
$ time ./Z
749700
./Z 3.73s user 0.01s system 99% cpu 3.753 total
So it reduced running time for N=150 by 3.5x, without changing the algorithm itself.
Conclusion
Your problem is numDivs. It is 100% of your running time, and has terrible complexity. Think about numDivs, and how, for example, for each N you are generating [2 .. n `div` 2 + 1] N times.
Try memoizing that, since the values don't change.
To measure which of your functions is faster, consider using criterion, which will provide statistically robust information about sub-microsecond improvements in running time.
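As a rough sketch of what such a comparison could look like (assuming the criterion package is installed; the names trianglesFold, trianglesScan and benchmarks are made up for this illustration and are not from the question), the two list definitions can be turned into functions of a length and handed to criterion. Taking the length as an argument matters: a top-level infinite list would be computed once and then shared between benchmark iterations, so only the first iteration would measure real work.

```haskell
import Criterion.Main (Benchmark, bench, bgroup, nf)

-- Quadratic variant from the question: re-sums 1..n for every element.
trianglesFold :: Integer -> [Integer]
trianglesFold k = [foldr (+) 0 [1 .. n] | n <- [1 .. k]]

-- Linear variant from the question: carries a running sum.
trianglesScan :: Integer -> [Integer]
trianglesScan k = go 0 1
  where
    go cs n
      | n > k     = []
      | otherwise = (cs + n) : go (cs + n) (n + 1)

-- The prefix length 1000 is an arbitrary choice for illustration.
benchmarks :: [Benchmark]
benchmarks =
  [ bgroup "first 1000 triangular numbers"
      [ bench "foldr per element" $ nf trianglesFold 1000
      , bench "running sum"       $ nf trianglesScan 1000
      ]
  ]
```

To run it, define main = defaultMain benchmarks (defaultMain also comes from Criterion.Main) and compile with ghc -O2. nf forces the whole result list on each iteration, so laziness does not hide the work.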
Addenda
Since numDivs is 100% of your running time, touching other parts of the program won't make much difference,
however, for pedagogical purposes, we can also rewrite those using stream fusion.
We can also rewrite triaList, and rely on fusion to turn it into the loop you write by hand in triaList2,
which is a "prefix scan" function (aka scanl):
triaList = U.scanl (+) 0 (U.enumFromN 1 top)
    where
       top = 10^6
Similarly for sol:
sol :: Int -> Int
sol n = U.head $ U.filter (\x -> numDivs x > n) triaList
With the same overall running time, but a bit cleaner code.

Dons' answer is great without being a spoiler by giving a direct solution to the problem.
Here I want to suggest a little tool that I wrote recently. It saves you the time to write SCC annotations by hand when you want a more detailed profile than the default ghc -prof -auto-all. Besides that it's colorful!
Here's an example with the code you gave(*), green is OK, red is slow:
All the time goes in creating the list of divisors. This suggests a few things you can do:
1. Make the filtering n `rem` x == 0 faster, but since it's a built-in function it's probably already fast.
2. Create a shorter list. You've already done something in that direction by checking only up to n `quot` 2.
3. Throw away the list generation completely and use some math to get a faster solution. This is the usual way for project Euler problems.
(*) I got this by putting your code in a file called eu13.hs, adding a main function main = print $ sol 90. Then running visual-prof -px eu13.hs eu13 and the result is in eu13.hs.html.
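For comparison, the hand-written SCC annotations that such a tool saves you from typing look roughly like the sketch below. The cost-centre names "candidates" and "divisors" are arbitrary labels chosen for this example; the function body is the one from the question, split into two named pieces.

```haskell
-- numDivs from the question, with manual cost centres on its two parts.
-- The SCC pragmas only have an effect when compiling with -prof.
numDivs :: Integer -> Integer
numDivs n = toInteger (length divisors) + 2
  where
    -- Cost of enumerating the candidate divisors.
    candidates = {-# SCC "candidates" #-} [2 .. (n `quot` 2) + 1]
    -- Cost of the divisibility test on each candidate.
    divisors   = {-# SCC "divisors" #-} [x | x <- candidates, n `rem` x == 0]
```

Compiled with -prof and run with +RTS -p, the two labels should show up as separate rows in the .prof file, which splits the numDivs cost between generating the list and filtering it.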

Haskell related note: triaList2 is of course faster than triaList because the latter performs a lot of unnecessary computations. It will take quadratic time to compute the first n elements of triaList, but linear time for triaList2. There is another elegant (and efficient) way to define an infinite lazy list of triangle numbers:
triaList = 1 : zipWith (+) triaList [2..]
Math related note: there is no need to check all divisors up to n / 2, it's enough to check up to sqrt(n).
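A minimal sketch of the sqrt(n) idea, assuming Int is wide enough for the numbers involved (the names numDivsSqrt, triangles and firstWithMoreDivisors are invented for this example): every divisor d with d * d < n pairs up with the distinct divisor n `quot` d, so it counts twice, and an exact square root counts once.

```haskell
-- Count the divisors of n by trial division up to sqrt n.
numDivsSqrt :: Int -> Int
numDivsSqrt n =
  sum [ if d * d == n then 1 else 2      -- a square root has no distinct partner
      | d <- takeWhile (\d -> d * d <= n) [1 ..]
      , n `rem` d == 0 ]

-- Infinite list of triangular numbers as a running sum: 1, 3, 6, 10, ...
triangles :: [Int]
triangles = scanl1 (+) [1 ..]

-- First triangular number with more than k divisors.
firstWithMoreDivisors :: Int -> Int
firstWithMoreDivisors k = head (filter (\t -> numDivsSqrt t > k) triangles)
```

For example, firstWithMoreDivisors 5 gives 28, whose divisors are 1, 2, 4, 7, 14 and 28. The work per number drops from about n / 2 trial divisions to about sqrt(n).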

You can run your program with flags to enable time profiling. Something like this:
./program +RTS -P -sprogram.stats -RTS
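That should run the program and produce two files: program.prof, which shows how much time was spent in each cost centre (roughly, each function), and program.stats, which contains the runtime and GC statistics. The program has to be compiled with profiling enabled (-prof) for -P to work. You can find more information about profiling with GHC in the GHC user guide. For benchmarking, there is the Criterion library. I've found this blog post has a useful introduction.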

Related

Making sense from GHC profiler

I'm trying to make sense of the GHC profiler. There is a rather simple app, which uses the wreq and lens-aeson libraries, and while learning about GHC profiling, I decided to play with it a bit.
Using different options (the time tool, +RTS -p -RTS and +RTS -p -h) I got entirely different numbers for my memory usage. Having all those numbers, I'm now completely lost trying to understand what is going on, and how much memory the app actually uses.
This situation reminds me of the phrase by Arthur Bloch: "A man with a watch knows what time it is. A man with two watches is never sure."
Can you please suggest how I should read all those numbers, and what each of them means?
Here are the numbers:
time -l reports around 19M
#/usr/bin/time -l ./simple-wreq
...
3.02 real 0.39 user 0.17 sys
19070976 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21040 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
71 messages sent
71 messages received
2991 signals received
43 voluntary context switches
6490 involuntary context switches
Using +RTS -p -RTS flag reports around 92M. Although it says "total alloc" it seems strange to me, that a simple app like this one can allocate and release 91M
# ./simple-wreq +RTS -p -RTS
# cat simple-wreq.prof
Fri Oct 14 15:08 2016 Time and Allocation Profiling Report (Final)
simple-wreq +RTS -N -p -RTS
total time = 0.07 secs (69 ticks @ 1000 us, 1 processor)
total alloc = 91,905,888 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
main.g Main 60.9 88.8
MAIN MAIN 24.6 2.5
decodeLenient/look Data.ByteString.Base64.Internal 5.8 2.6
decodeLenientWithTable/fill Data.ByteString.Base64.Internal 2.9 0.1
decodeLenientWithTable.\.\.fill Data.ByteString.Base64.Internal 1.4 0.0
decodeLenientWithTable.\.\.fill.\ Data.ByteString.Base64.Internal 1.4 0.1
decodeLenientWithTable.\.\.fill.\.\.\.\ Data.ByteString.Base64.Internal 1.4 3.3
decodeLenient Data.ByteString.Base64.Lazy 1.4 1.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 443 0 24.6 2.5 100.0 100.0
main Main 887 0 0.0 0.0 75.4 97.4
main.g Main 889 0 60.9 88.8 75.4 97.4
object_ Data.Aeson.Parser.Internal 925 0 0.0 0.0 0.0 0.2
jstring_ Data.Aeson.Parser.Internal 927 50 0.0 0.2 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 923 600 0.0 0.3 0.0 0.3
decodeLenient Data.ByteString.Base64.Lazy 891 0 1.4 1.4 14.5 8.1
decodeLenient Data.ByteString.Base64 897 500 0.0 0.0 13.0 6.7
....
+RTS -p -h and hp2ps show me the following picture and two numbers: 114K in the header and something around 1.8Mb on the graph.
And, just in case, here is the app:
module Main where
import Network.Wreq
import Control.Lens
import Data.Aeson.Lens
import Control.Monad
main :: IO ()
main = replicateM_ 10 g
  where
    g = do
        r <- get "http://httpbin.org/get"
        print $ r ^. responseBody
                   . key "headers"
                   . key "User-Agent"
                   . _String
UPDATE 1: Thanks, everyone, for the incredibly good responses. As was suggested, I am adding the +RTS -s output, so the entire picture builds up for everyone who reads it.
#./simple-wreq +RTS -s
...
128,875,432 bytes allocated in the heap
32,414,616 bytes copied during GC
2,394,888 bytes maximum residency (16 sample(s))
355,192 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 194 colls, 0 par 0.018s 0.022s 0.0001s 0.0022s
Gen 1 16 colls, 0 par 0.027s 0.031s 0.0019s 0.0042s
UPDATE 2: The size of the executable:
#du -h simple-wreq
63M simple-wreq

A man with a watch knows what time it is. A man with two watches is never sure.
Ah, but what do the two watches show? Are both meant to show the current time in UTC? Or is one of them supposed to show the time in UTC, and the other one the time at a certain point on Mars? As long as they are in sync, the second scenario wouldn't be a problem, right?
And that is exactly what is happening here. You compare different memory measurements:
the maximum residency
the total amount of allocated memory
The maximum residency is the highest amount of memory your program ever uses at a given time. That's 19MB. However, the total amount of allocated memory is a lot more, since that's how GHC works: it "allocates" memory for objects that are garbage collected, which is almost everything that's not unpacked.
Let us inspect a C example for this:
#include <stdlib.h>
int main() {
    int i;
    char * mem;
    for(i = 0; i < 5; ++i) {
        mem = malloc(19 * 1000 * 1000);
        free(mem);
    }
    return 0;
}
Whenever we use malloc, we will allocate 19 megabytes of memory. However, we free the memory immediately after. The highest amount of memory we ever have at one point is therefore 19 megabytes (and a little bit more for the stack and the program itself).
However, in total, we allocate 5 * 19M, 95M total. Still, we could run our little program with just 20 megs of RAM fine. That's the difference between total allocated memory and maximum residency. Note that the residency reported by time also counts whatever part of the executable itself is loaded into memory.
That being said, the easiest way to generate statistics is -s, which will show what the maximum residency was from the Haskell program's point of view. In your case, it will be the 1.9M, the number in your heap profile (or double the amount due to profiling). And yeah, Haskell executables tend to get extremely large, since libraries are statically linked.
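A Haskell counterpart to the C loop above might look like the following sketch (sumTo is a made-up name for illustration). Each list cell is allocated, consumed once by the strict fold and then becomes garbage, so the allocation total keeps growing while very little is live at any one moment.

```haskell
import Data.List (foldl')

-- Sum 1..n with a strict left fold.
-- The list is produced lazily and consumed immediately, so its cells are
-- short-lived: they add to "bytes allocated" but hardly to residency.
sumTo :: Int -> Int
sumTo n = foldl' (+) 0 [1 .. n]
```

Compiled without optimisation and run as main = print (sumTo 10000000) with +RTS -s, this should report a "bytes allocated in the heap" figure far above the "maximum residency" figure. The exact numbers depend on the GHC version and flags, and with -O2 the intermediate list may be fused away altogether.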

time -l is displaying the (resident, i.e. not swapped out) size of the process as seen by the operating system (obviously). This includes twice the maximum size of the Haskell heap (due to the way that GHC's GC works), plus anything else allocated by the RTS or other C libraries, plus the code of your executable itself plus the libraries it depends on, etc. I'm guessing in this case the primary contributor to the 19M is the size of your executable.
total alloc is the total amount allocated onto the Haskell heap. It is not at all a measure of maximum heap size (which is what people usually mean by "how much memory is my program using"). Allocation is very cheap and allocation rates of around 1GB/s are typical for a Haskell program.
The number in the header of the hp2ps output "114,272 bytes x seconds" is something completely different again: it is the integral of the graph, and is measured in bytes * seconds, not in bytes. For example if your program holds onto a 10 MB structure for 4 seconds then that will cause this number to increase by 40 MB*s.
The number around 1.8 MB shown in the graph is the actual maximum size of the Haskell heap, which is probably the number you're most interested in.
You've omitted the most useful source of numbers about your program's execution, which is running it with +RTS -s (this doesn't even require it to have been built with profiling).

No heap profiling data for module Data.ByteString

I was trying to generate heap memory profile for following naive Haskell code that copies a file:
import System.Environment
import System.IO
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB
naiveCopy :: String -> String -> IO ()
naiveCopy from to = do
  putStrLn $ "From: " ++ from
  putStrLn $ "To: " ++ to
  s <- B.readFile from
  B.writeFile to s
main = do
  args <- getArgs
  mapM (\ x-> putStrLn x) args
  naiveCopy (head args) ((head.tail) args)
Command that build the code with ghc 8.0.1:
ghc -o t -rtsopts -prof -fprof-auto t.hs
Command that collect the profiling data:
./t +RTS -p -h -RTS in/data out/data && hp2ps -e8in -c t.hp
where in/data is a quite big file (approx 500MB) which will take the program about 2 seconds to copy.
The problem is that I couldn't get heap profiling data if I use the strict Data.ByteString. There's only a small t.hp file without any sample data, which looks like this:
JOB "t in/data out/data +RTS -p -h"
DATE "Thu Aug 4 20:19 2016"
SAMPLE_UNIT "seconds"
VALUE_UNIT "bytes"
BEGIN_SAMPLE 0.000000
END_SAMPLE 0.000000
BEGIN_SAMPLE 0.943188
END_SAMPLE 0.943188
and corresponding profile chart like this:
However I could get heap profiling data if I switch to the lazy version Data.ByteString.Lazy, profile chart like this:
Update: Thanks @ryachza, I added a -i0 parameter to set the sampling interval and tried again. This time I got sample data for strict ByteString and it looked reasonable (I was copying a 500M file and the memory allocation peak in the following profiling chart is about 500M)
./t +RTS -p -h -i0 -RTS in/data out/data && hp2ps -e8in -c t.hp

It appears as though the runtime isn't "getting the chance to measure" the heap. If you add -s to your RTS options, it should print some time and allocation information. When I run this, I see the bytes allocated and total memory use is very high (size of the file), but the maximum residency (and the number of samples) is very low, and while the elapsed time is high the actual "work" time is practically 0.
Adding the RTS option -i0 allowed me to reproducibly visualize the bytestring allocation as PINNED (this is the classification because the byte arrays that bytestring uses internally are allocated in an area in which the GC can't move things). You could experiment with different -h options which associate allocations to different cost centers (for example, -hy should show ARR_WORDS) but it probably wouldn't have much value in this case as the bytestrings are really just "big chunks of raw memory".
The references I used to find the RTS options were (clearly I wasn't particular about the GHC version - I can't imagine these flags change frequently):
https://downloads.haskell.org/~ghc/7.0.1/docs/html/users_guide/runtime-control.html
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/profiling.html

Haskell threads heap overflow despite only 22Mb total memory usage?

I am trying to parallelize a ray-tracer. This means I have a very long list of small computations. The vanilla program runs on a specific scene in 67.98 seconds and 13 MB of total memory use and 99.2% productivity.
In my first attempt I used the parallel strategy parBuffer with a buffer size of 50. I chose parBuffer because it walks through the list only as fast as sparks are consumed, and does not force the spine of the list like parList, which would use a lot of memory since the list is very long. With -N2, it ran in a time of 100.46 seconds and 14 MB of total memory use and 97.8% productivity. The spark information is: SPARKS: 480000 (476469 converted, 0 overflowed, 0 dud, 161 GC'd, 3370 fizzled)
The large proportion of fizzled sparks indicates that the granularity of sparks was too small, so next I tried using the strategy parListChunk, which splits the list into chunks and creates a spark for each chunk. I got the best results with a chunk size of 0.25 * imageWidth. The program ran in 93.43 seconds and 236 MB of total memory use and 97.3% productivity. The spark information is: SPARKS: 2400 (2400 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled). I believe the much greater memory use is because parListChunk forces the spine of the list.
Then I tried to write my own strategy that lazily divided the list into chunks and then passed the chunks to parBuffer and concatenated the results.
concat $ withStrategy (parBuffer 40 rdeepseq) (chunksOf 100 (map colorPixel pixels))
This ran in 95.99 seconds and 22MB of total memory use and 98.8% productivity. This was successful in the sense that all the sparks are being converted and the memory usage is much lower; however, the speed is not improved. Here is an image of part of the eventlog profile.
As you can see the threads are being stopped due to heap overflows. I tried adding +RTS -M1G which increases the default heap size all the way up to 1Gb. The results did not change. I read that Haskell main thread will use memory from the heap if its stack overflows, so I also tried increasing the default stack size too with +RTS -M1G -K1G but this also had no impact.
Is there anything else I can try? I can post more detailed profiling info for memory usage or eventlog if needed, I did not include it all because it is a lot of information and I did not think all of it was necessary to include.
EDIT: I was reading about the Haskell RTS multicore support, and it talks about there being a HEC (Haskell Execution Context) for each core. Each HEC contains, among other things, an Allocation Area (which is a part of a single shared heap). Whenever any HEC's Allocation Area is exhausted, a garbage collection must be performed. There appears to be an RTS option to control its size, -A. I tried -A32M but saw no difference.
EDIT2:
Here is a link to a github repo dedicated to this question. I have included the profiling results in the profiling folder.
EDIT3: Here is the relevant bit of code:
render :: [([(Float,Float)],[(Float,Float)])] -> World -> [Color]
render grids world = cs where
  ps = [ (i,j) | j <- reverse [0..wImgHt world - 1] , i <- [0..wImgWd world - 1] ]
  cs = map (colorPixel world) (zip ps grids)
  --cs = withStrategy (parListChunk (round (wImgWd world)) rdeepseq) (map (colorPixel world) (zip ps grids))
  --cs = withStrategy (parBuffer 16 rdeepseq) (map (colorPixel world) (zip ps grids))
  --cs = concat $ withStrategy (parBuffer 40 rdeepseq) (chunksOf 100 (map (colorPixel world) (zip ps grids)))
The grids are random floats that are precomputed and used by colorPixel. The type of colorPixel is:
colorPixel :: World -> ((Float,Float),([(Float,Float)],[(Float,Float)])) -> Color

Not the solution to your problem, but a hint to the cause:
GHC seems to be very conservative in memory reuse, and when the runtime sees the potential to reclaim a memory block, it goes for it. Your problem description fits the minor GC behavior described at the bottom of this page:
https://wiki.haskell.org/GHC/Memory_Management.
New data are allocated in 512kb "nursery". Once it's exhausted, "minor
GC" occurs - it scans the nursery and frees unused values.
So if you chop the data into smaller chunks, you enable the engine to do the cleanup earlier - GC kicks in.

Profiling multithreading performance in a Haskell program — no speedups using parallel strategies

After attempting to add multithreading functionality in a Haskell program, I noticed that performance didn't improve at all. Chasing it down, I got the following data from threadscope:
Green indicates running, and orange is garbage collection.
Here vertical green bars indicate spark creation, blue bars are parallel GC requests, and light blue bars indicate thread creation.
The labels are: spark created, requesting parallel GC, creating thread n, and stealing spark from cap 2.
On average, I'm only getting about 25% activity over 4 cores, which is no improvement at all over the single-threaded program.
Of course, the question would be void without a description of the actual program. Essentially, I create a traversable data structure (e.g. a tree), and then fmap a function over it, before then feeding it into an image writing routine (explaining the unambiguously single-threaded segment at the end of the program run, past 15s). Both the construction and the fmapping of the function take a significant amount of time to run, although the second slightly more so.
The above graphs were made by adding a parTraversable strategy for that data structure before it is consumed by the image writing. I have also tried using toList on the data structure and then using various parallel list strategies (parList, parListChunk, parBuffer), but the results were similar each time for a wide range of parameters (even using large chunks).
I also tried to fully evaluate the traversable data structure before fmapping the function over it, but the exact same problem occurred.
Here are some additional statistics (for a different run of the same program):
5,702,829,756 bytes allocated in the heap
385,998,024 bytes copied during GC
55,819,120 bytes maximum residency (8 sample(s))
1,392,044 bytes maximum slop
133 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10379 colls, 10378 par 5.20s 1.40s 0.0001s 0.0327s
Gen 1 8 colls, 8 par 1.01s 0.25s 0.0319s 0.0509s
Parallel GC work balance: 1.24 (96361163 / 77659897, ideal 4)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 15.92s) 0.02s ( 0.02s)
Task 1 (worker) : 0.27s ( 14.00s) 1.86s ( 1.94s)
Task 2 (bound) : 14.24s ( 14.30s) 1.61s ( 1.64s)
Task 3 (worker) : 0.00s ( 15.94s) 0.00s ( 0.00s)
Task 4 (worker) : 0.25s ( 14.00s) 1.66s ( 1.93s)
Task 5 (worker) : 0.27s ( 14.09s) 1.69s ( 1.84s)
SPARKS: 595854 (595854 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 15.67s ( 14.28s elapsed)
GC time 6.22s ( 1.66s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 21.89s ( 15.94s elapsed)
Alloc rate 363,769,460 bytes per MUT second
Productivity 71.6% of total user, 98.4% of total elapsed
I'm not sure what other useful information I can give to assist answering. Profiling doesn't reveal anything interesting: it's the same as the single core statistics, except with an added IDLE taking up 75% of the time, as expected from the above.
What's happening that's preventing useful parallelisation?

Sorry that I couldn't provide code in a timely manner to assist respondents. It took me a while to untangle the exact location of the issue.
The problem was as follows: I was fmapping a function
f :: a -> S b
over the traversable data structure
structure :: T a
where S and T are two traversable functors.
Then, when using parTraversable, I was mistakenly writing
Compose (fmap f structure) `using` parTraversable rdeepseq
instead of
Compose $ fmap f structure `using` parTraversable rdeepseq
so I was wrongly using the Traversable instance for Compose T S to do the multithreading (using Data.Functor.Compose).
(This looks like it should've been easy to catch, but it took me a while to extract the above mistake from the code!)
This now looks much better!
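The difference between the two expressions comes down to which functor's elements get visited. A small base-only sketch, with the list functor standing in for T and Maybe standing in for S (both stand-ins, and the names structure, outerCount and innerCount, are assumptions made for this illustration), shows that going through Compose reaches every innermost value, while working on the outer structure treats each S b as one unit:

```haskell
import Data.Functor.Compose (Compose (..))

-- Hypothetical stand-ins: T is the list functor, S is Maybe.
structure :: [Maybe Int]
structure = [Just 1, Nothing, Just 3]

-- Folding or traversing the outer functor sees each `S b` as one element.
outerCount :: Int
outerCount = length structure

-- Folding or traversing through Compose sees every innermost `b` instead.
innerCount :: Int
innerCount = length (Compose structure)
```

Here outerCount is 3 and innerCount is 2. In the mistaken version, function application binds tighter than `using`, so the strategy was applied to the Compose value and presumably created one spark per innermost element; with $ the strategy is applied to fmap f structure first, giving one spark per element of T.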

Basic I/O performance in Haskell

Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for-cycle in plain C with write(2) non-buffered syscall? I am trying to gather Haskell gotchas.
Even this slow C solution is much faster than Haskell
int i;
char buf[16];
for (i=0; i<=100000000; i++) {
    sprintf(buf, "%d\n", i);
    write(1, buf, strlen(buf));
}

Okay, on my box the C code, compiled with gcc -O3, takes about 21.5 seconds to run, the original Haskell code about 56 seconds. So not a factor of 5, a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers, that's a bit slower because it involves a check upfront, and then works with boxed Ints, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.

On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes you read this right. Haskell is several times faster than C. Or rather, normally buffered Haskell is faster than C with write(2) non-buffered syscall.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc4.6.1, ghc7.6.1.

The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO
main = hPutBuilder stdout $ build [0..100000000::Int]
build = foldr add_line mempty
  where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to the Haskell approach you used
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's surprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html I won't compare it with your C since my results were much slower than Daniel's and n.m.'s, so I figure something was going wrong.
Edit: Made the imports consistent with all versions of bytestring-0.10.x It occurred to me the following might be clearer -- the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
  where unlines_ = mconcat . map (<> charUtf8 '\n')
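With a more recent bytestring, the same idea can be written against the Data.ByteString.Builder module that the documentation link above points to. This is a minimal sketch, assuming a bytestring version that exports that module and a base where <> is in the Prelude; numberLines and render are names invented for this example.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, charUtf8, intDec, toLazyByteString)

-- One decimal number per line, assembled as a single Builder.
numberLines :: [Int] -> Builder
numberLines = foldMap (\n -> intDec n <> charUtf8 '\n')

-- Materialise the Builder as a lazy ByteString, handy for checking small inputs.
render :: [Int] -> L.ByteString
render = toLazyByteString . numberLines
```

For the real run you would not call render, but write the Builder straight to the handle with hPutBuilder stdout (numberLines [0 .. 100000000]), where hPutBuilder is exported by Data.ByteString.Builder and stdout by System.IO.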
