Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for loop in plain C that issues an unbuffered write(2) syscall per line? I am trying to gather Haskell gotchas.
Even this slow C solution is much faster than Haskell
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void) {
    int i;
    char buf[16];
    for (i = 0; i <= 100000000; i++) {
        sprintf(buf, "%d\n", i);
        write(1, buf, strlen(buf));
    }
}
Okay, on my box the C code, compiled with gcc -O3, takes about 21.5 seconds to run, and the original Haskell code about 56 seconds. So not a factor of 5, but a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers. That is a bit slower, because showing an Integer involves a check up front and then works with boxed Ints, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.
On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes, you read this right. Haskell is several times faster than C. Or rather, normally buffered Haskell is faster than C making an unbuffered write(2) syscall per line.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc 4.6.1, GHC 7.6.1.
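For reference, the buffering that makes the difference here can also be requested explicitly on the Haskell side. This is only a minimal sketch, not the program that was timed above: renderLines is a name made up for illustration, and the range is kept small so it finishes quickly.

```haskell
import System.IO (BufferMode (BlockBuffering), hSetBuffering, stdout)

-- Hypothetical helper: one decimal number per line, as a String.
renderLines :: [Int] -> String
renderLines = unlines . map show

main :: IO ()
main = do
  -- Ask for block buffering even when stdout is a terminal, so that output
  -- is handed to the OS in large chunks rather than line by line.
  hSetBuffering stdout (BlockBuffering Nothing)
  putStr (renderLines [0 .. 100000])
```

With hSetBuffering stdout NoBuffering instead, each write goes out much more eagerly, which is closer in spirit to the C loop calling write(2) for every line.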
The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO
main = hPutBuilder stdout $ build [0..100000000::Int]
build = foldr add_line mempty
  where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to the Haskell approach you used:
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's surprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html I won't compare it with your C, since my results were much slower than Daniel's and n.m.'s, so I figure something was going wrong.
Edit: Made the imports consistent with all versions of bytestring-0.10.x.
It occurred to me that the following might be clearer; it is the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
  where unlines_ = mconcat . map (<> charUtf8 '\n')
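On current versions of bytestring the same functions are exported from Data.ByteString.Builder, so the imports differ from those shown above. Here is a minimal sketch under that assumption; numberLines and renderAscii are names made up for illustration, and char7 is used instead of charUtf8 only because a newline is plain ASCII.

```haskell
import Data.ByteString.Builder (Builder, char7, hPutBuilder, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as BLC
import System.IO (stdout)

-- One decimal number per line, built without an intermediate String.
numberLines :: [Int] -> Builder
numberLines = foldMap (\n -> intDec n <> char7 '\n')

-- Hypothetical helper for inspecting small outputs, e.g. in GHCi.
renderAscii :: [Int] -> String
renderAscii = BLC.unpack . toLazyByteString . numberLines

main :: IO ()
main = hPutBuilder stdout (numberLines [0 .. 100000000])
```

I have not re-timed this variant, so the numbers quoted above apply only to the original program.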
I am writing a small snake game in Haskell as a sort of guided tutorial for beginners. The "rendering" just takes a Board and produces a Data.ByteString.Builder, which is printed to the terminal. (The HTML profiles are pushed to the repo, so you can inspect them without compiling the program.)
The problem
The problem I have is that the heap profile looks weird: there are many spikes, and suddenly Builder, PAP and BuildStep take as much memory as the rest of the program. Considering that rendering happens 10 times a second (i.e. every second we produce 10 builders), it seems inconsistent that every once in a while the builder takes that much memory. I don't know whether this counts as a space leak, since there are no thunks in the profile, but the PAP doesn't look right (I don't know...)
Implementation
The board is represented as an immutable array of builders indexed by coordinates (tuples), type Board = Array (Int, Int) Builder (essentially, what should be printed at each coordinate). The function that converts the board into a builder is the expected strict fold, which handles newlines using the height and width of the board.
toBuilder :: RenderState -> Builder
--                     |- The Array (Int, Int) Builder
toBuilder (RenderState b binf@(BoardInfo h w) gOver s) =
--                                       ^^^ height and width
  if gOver
    then ppScore s <> fst (boardToString $ emptyGrid binf) -- Not interesting. On game over, render an empty grid
    else ppScore s <> fst (boardToString b)                -- render the current board
  where
    -- Concatenate the builders and count them, so that a newline is added
    -- once `w` (width) builders have been concatenated.
    boardToString = foldl' fprint (mempty, 0)
    fprint (!s, !i) cell =
      if ((i + 1) `mod` w) == 0
        then (s <> cell <> B.charUtf8 '\n', i + 1)
        else (s <> cell, i + 1)
According to the .prof file, this function takes most of the time and space (92%, which is expected). Moreover, this is the only part of the code that produces a big builder, so the problem should be here.
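For anyone who wants to reproduce the fold without the rest of the game, here is a minimal self-contained sketch of the same idea. renderBoard, demoBoard and renderToString are names made up for this sketch, and the 2x3 board of dots stands in for the real RenderState; only the strict fold with a newline after every width cells is taken from the code above.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Array (Array, listArray)
import Data.ByteString.Builder (Builder, char7, hPutBuilder, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.List (foldl')
import System.IO (stdout)

-- Stand-in for the Board type described above.
type Board = Array (Int, Int) Builder

-- Concatenate the cells in index order, appending a newline after every
-- `width` cells. Both accumulator fields are forced with bang patterns.
renderBoard :: Int -> Board -> Builder
renderBoard width = fst . foldl' step (mempty, 0 :: Int)
  where
    step (!acc, !i) cell
      | (i + 1) `mod` width == 0 = (acc <> cell <> char7 '\n', i + 1)
      | otherwise                = (acc <> cell, i + 1)

-- A 2x3 demo board filled with dots (illustration only).
demoBoard :: Board
demoBoard = listArray ((0, 0), (1, 2)) (replicate 6 (char7 '.'))

-- Helper for inspecting small boards as a String.
renderToString :: Int -> Board -> String
renderToString width = BLC.unpack . toLazyByteString . renderBoard width

main :: IO ()
main = hPutBuilder stdout (renderBoard 3 demoBoard)
```

Note that forcing a Builder to weak head normal form does not run it: a Builder is essentially a function, which may be part of why the profile shows PAP and BuildStep closures rather than thunks. That last point is a guess, not something I have verified against the profiles.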
The buffering mode
The above profile is produced when the BufferMode is set to LineBuffering (the default). Interestingly, if I change it to NoBuffering, the profile looks the same, but a thunk appears and the builder disappears...
The questions
I have reached a point where I don't know what's going on, hence my questions are a little vague:
Is my code with line buffering (first profile) actually leaking? No thunk appears, but the PAP eating so much memory looks like a warning sign.
The second profile clearly(?) leaks. Is there a standard way to inspect which part of the code is producing the thunk?
Am I completely missing something, and the profile actually looks fine?
In case anyone is interested, I think I've found the problem: it is the terminal speed. If I use a smaller board or a lower rendering rate (the picture is for a 50x70 board with 10 renders a second), the memory usage is completely normal.
What I think is happening is that the board is written to the console using B.hPutBuilder stdout, and this action finishes faster than the console can actually display the output. The Haskell thread therefore continues and creates another board, which has to wait to be printed because the console is busy. I guess this somehow leads to two boards living in memory for a short time.
Other guesses are welcome!
Context: I have a function defined in a library called toXlsx :: ByteString -> Xlsx (that ByteString is from Data.ByteString.Lazy)
Now to do certain operations I've defined certain functions that operate on the same file, thus I would like to open, read and convert to Xlsx the file once and keep it in memory to operate with it.
Right now I'm reading the file as bs <- Data.ByteString.Lazy.readFile file and at the end doing Data.ByteString.Lazy.length bs `seq` return value.
Is there any way to use this function and keep the file in memory as a whole to reuse it?
Note that the way a lazy bytestring works, the contents of the file won't be read until they are "used", but once they are read, they will remain in memory for any subsequent operations. The only way they will be removed from memory is if they are garbage collected because your program no longer has any way to access them.
For example, if you run the following program on a large file:
import qualified Data.ByteString.Lazy as BL
main = do
  bigFile <- BL.readFile "ubuntu-14.04-desktop-amd64.iso"
  print $ BL.length $ BL.filter (==0) bigFile -- takes a while
  print $ BL.length $ BL.filter (==255) bigFile -- runs fast
the first computation will actually read the entire file into memory and it will be kept there for the second computation.
I guess this by itself isn't too convincing, since the operating system will also cache the file into memory, and it ends up being hard to tell the difference in timing between Haskell reading the file from the operating system cache for each computation and keeping it in memory across all computations. But, if you ran some heap profiling on this code, you'd discover that the first operation loads up the entire file into "pinned" bytestrings and that allocation stays constant through subsequent operations.
If your concern is that you want the complete file to be read at the start, even if the first operation doesn't need to read it all, so that there are no subsequent delays as additional parts of the file are read, then your seq-based solution is probably fine. Alternatively, you can read the entire file as a strict bytestring and then convert it using fromStrict -- this operation is instantaneous and doesn't copy any data. (In contrast to toStrict, which is expensive and does copy data.) So this will work:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
main = do
  -- read strict
  bigFile <- BS.readFile "whatever.mov"
  -- do strict and lazy operations
  print $ strictOp bigFile
  print $ lazyOp (BL.fromStrict bigFile)
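As a self-contained variant of the same idea, here is a minimal sketch that reads a file strictly and exposes it as a lazy ByteString. readWholeFile and chunkCount are names made up for this sketch; the resulting value is what you would then pass to a function expecting a lazy ByteString, such as the toXlsx from the question.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- Read the whole file into memory up front, then wrap it as a lazy
-- ByteString without copying the data.
readWholeFile :: FilePath -> IO BL.ByteString
readWholeFile path = BL.fromStrict <$> BS.readFile path

-- Number of strict chunks backing a lazy ByteString.
chunkCount :: BL.ByteString -> Int
chunkCount = length . BL.toChunks

main :: IO ()
main = do
  [path] <- getArgs          -- expects exactly one file name
  contents <- readWholeFile path
  print (BL.length contents, chunkCount contents)
```

For a non-empty file the chunk count should be 1, which is one way to see that the contents sit in a single strict buffer rather than being streamed in on demand.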
I've written a function that up-samples a file from 48kHz to 192kHz by means of a filter:
upsample :: Coefficients -> FilePath -> IO ()
It takes the filter coefficients and the path of the file (that has to be upsampled), and writes the result to a new file.
I have to up-sample many files so I've written a function to up-sample a full directory in parallel, using forConcurrently_ from Control.Concurrent.Async:
upsampleDirectory :: Directory -> FilePath -> IO ()
upsampleDirectory dir coefPath = do
  files <- getAllFilesFromDirectory dir
  coefs <- loadCoefficients coefPath
  forConcurrently_ files $ upsample coefs
I'm compiling with the -threaded option and running using +RTS -N2. What I see is that up-sampling 2 files sequentially is faster than up-sampling both files in parallel.
Upsampling file1.wav takes 18.863s.
Upsampling file2.wav takes 18.707s.
Upsampling a directory with file1.wav and file2.wav takes 66.250s.
What am I doing wrong?
I've tried to keep this post concise, so ask me if you need more details on some of the functions.
Here are a couple of possibilities. First, make yourself 100% sure you're actually running your program with +RTS -N2 -RTS. I can't tell you how many times I've been benchmarking a parallel program and written:
stack exec myprogram +RTS -N2 -RTS
in place of:
stack exec myprogram -- +RTS -N2 -RTS
and gotten myself hopelessly confused. (The first version runs the stack executable on two processors but the target executable on one!) Maybe add a print =<< getNumCapabilities at the beginning of your main program to be sure.
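A minimal sketch of such a check, to be compiled with -threaded. The describe helper is made up for this sketch; getNumCapabilities comes from Control.Concurrent and getNumProcessors from GHC.Conc.

```haskell
import Control.Concurrent (getNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Format the two numbers in one line (illustration only).
describe :: Int -> Int -> String
describe caps procs =
  "capabilities: " ++ show caps ++ ", processors: " ++ show procs

main :: IO ()
main = do
  caps  <- getNumCapabilities  -- what +RTS -N actually gave the program
  procs <- getNumProcessors    -- what the machine reports
  putStrLn (describe caps procs)
```

If the first number is 1 even though you passed -N2, the RTS flags most likely went to the wrong executable.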
After confirming you're running on two processors, then the next most likely issue is that your implementation is not running in constant space and is blowing up the heap. Here's a simple test program I used to try to duplicate your problem. (Feel free to use my awesome upsampling filter yourself!)
module Main where
import Control.Concurrent.Async
import System.Environment
import qualified Data.ByteString as B
upsample :: FilePath -> IO ()
upsample fp = do c <- B.readFile fp
                 let c' = B.pack $ concatMap (replicate 4) $ B.unpack c
                 B.writeFile (fp ++ ".out") c'
upsampleFiles :: [FilePath] -> IO ()
upsampleFiles files = do
  forConcurrently_ files $ upsample
main :: IO ()
main = upsampleFiles =<< getArgs -- upsample all files on the command line
When I ran this on a single 70meg test file, it ran in 14 secs. When I ran it on two copies in parallel, it ran for more than a minute before it started swapping like mad, and I had to kill it. After switching to:
import qualified Data.ByteString.Lazy as B
it ran in 3.7 secs on a single file, 7.8 secs on two copies on a single processor, and 4.0 secs on two copies on two processors with +RTS -N2.
Make sure you're compiling with optimizations on, profile your program, and make sure it's running in a constant (or at least reasonable) heap space. The above program runs in a constant 100k bytes of heap. A similar version that uses a strict ByteString for reading and a lazy ByteString for writing reads the whole file into memory, but the heap almost immediately grows to 70megs (the size of the file) within a fraction of a second and then stays constant while the file is processed.
No matter how complicated your filter is, if your program is growing gigabytes of heap, the implementation is broken, and you'll need to fix it before you worry about performance, parallel or otherwise.
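For completeness, here is a minimal sketch of the lazy-ByteString variant of the test program. To keep it free of dependencies outside bytestring it processes the files one after another with mapM_ instead of forConcurrently_, so it only illustrates the streaming behaviour, not the parallel speed-up. repeat4 is a name made up for this sketch.

```haskell
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- The toy "filter": repeat every byte four times.
repeat4 :: BL.ByteString -> BL.ByteString
repeat4 = BL.pack . concatMap (replicate 4) . BL.unpack

-- Read lazily and write lazily, so the file is streamed chunk by chunk.
upsample :: FilePath -> IO ()
upsample fp = do
  c <- BL.readFile fp
  BL.writeFile (fp ++ ".out") (repeat4 c)

main :: IO ()
main = mapM_ upsample =<< getArgs -- upsample all files on the command line
```

Running it with +RTS -s is a quick way to check the maximum residency before reaching for a full heap profile.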
Here's an old question from 7 months ago, when stack overflowers agreed that Haskell's inefficiency in computing the Ackermann function was due to a compiler error.
Ackermann very inefficient with Haskell/GHC
7 months later, this appears to be fixed. It seems like ack runs with linear memory, but it runs pretty freakin' slow.
main = print (ack 4 1)
-- Ackermann function
ack 0 n = n + 1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n - 1))
$ time ./ack
65533
real 8m53.274s
user 8m47.313s
sys 0m4.868s
Processor 2.8 GHz Intel Core i7
Memory 8 GB 1333 MHz DDR3
Software Mac OS X Lion 10.7.5 (11G63)
I am just asking for any insights into this. The more detailed ones will get upvoted. Keep in mind I am new to functional programming and even simple remarks about tail recursion vs regular recursion would be appreciated and upvoted.
I don't know how you're running it, but I suspect the complete list is:
Your program with no changes and compiling with no optimizations. Initial time: 7m29.755s
It appears you didn't use optimization. Be sure to use -O2 and try -fllvm when compiling. New time: 1m2.412s
Use explicit type signatures and use Int (vs the default of Integer) when you can. New time: 0m15.486s
So we received almost 8x speed-up by using optimizations (why does every other benchmark question not use optimization flags?!?!?) and an additional ~4x by using Int instead of Integer.
Add a type signature to ack:
ack :: Int -> Int -> Int
This should solve two problems with your code:
Overly general types
Without the signature, the compiler derives the following type:
ack :: (Eq a, Eq b, Num a, Num b) => a -> b -> b
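ack ends up generalized to all number types, instead of just integers. Unless GHC manages to specialize it, every call has to pass type class dictionaries around and look up the arithmetic and comparison operations in them. This additional layer of indirection makes the code slow.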
Giving ack a concrete type (like Int) removes this indirection.
Type defaulting
In addition, I'm guessing your main action is written like this:
main = print (ack 4 1)
Your ack works on any number type, but you don't specify exactly which one. This means GHC chooses one automatically, in a process called type defaulting.
In this case, it chooses Integer, a variable length type. Because Integer can handle numbers of arbitrary size, it is much slower than the machine sized Int.
Conclusion
To summarize:
Always write type signatures for top-level definitions.
Always compile with -Wall.
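Putting both points together, a minimal sketch of the monomorphic version looks like this. The input in main is deliberately smaller than the ack 4 1 from the question so that it finishes quickly.

```haskell
-- Ackermann function, fixed to machine-sized Int
ack :: Int -> Int -> Int
ack 0 n = n + 1
ack m 0 = ack (m - 1) 1
ack m n = ack (m - 1) (ack m (n - 1))

main :: IO ()
main = print (ack 3 9)
```

Compiling with -Wall and without the signatures makes GHC warn about the missing top-level signature and about the constraint being defaulted to Integer, which is how you would catch this in the first place.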
Just out of curiosity, I made a simple script to check speed and memory efficiency of constructing a list in Haskell:
wasteMem :: Int -> [Int]
wasteMem 0 = [199]
wasteMem x = (12432483483467856487256348746328761:wasteMem (x-1))
main = do
  putStrLn("hello")
  putStrLn(show (wasteMem 10000000000000000000000000000000000))
The strange thing is, when I tried this, it didn't run out of memory or stack space, it only prints [199], the same as running wasteMem 0. It doesn't even print an error message... why? Entering this large number in ghci just prints the number, so I don't think it's a rounding or reading error.
Your program is using a number greater than maxBound :: Int, and how such a literal wraps around depends on the width of Int. This means it will behave differently on different platforms. With GHC on x86_64, Int is 64 bits (32 bits on 32-bit platforms; the Haskell report only guarantees the range [-2^29, 2^29 - 1]). This means your absurdly large value (1x10^34) is represented as 4003012203950112768 for me and zero for you 32-bit folks:
GHCI> 10000000000000000000000000000000000 :: Int
4003012203950112768
GHCI> 10000000000000000000000000000000000 :: Data.Int.Int32
0
This could be made platform independent by either using a fixed-size type (ex: from Data.Word or Data.Int) or using Integer.
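The wraparound can be made explicit in a small program. This is only a sketch: big, asInt32, asInt64 and fitsInInt are names made up for illustration.

```haskell
import Data.Int (Int32, Int64)

-- The value from the question, as an arbitrary-precision Integer.
big :: Integer
big = 10 ^ (34 :: Int)

-- fromInteger on a fixed-width type keeps only the low bits.
asInt32 :: Int32
asInt32 = fromInteger big

asInt64 :: Int64
asInt64 = fromInteger big

-- Check whether an Integer is representable as Int on this platform.
fitsInInt :: Integer -> Bool
fitsInInt n =
  n >= toInteger (minBound :: Int) && n <= toInteger (maxBound :: Int)

main :: IO ()
main = do
  print asInt32
  print asInt64
  print (fitsInInt big)
```

Since 10^34 is divisible by 2^32, the Int32 value is exactly zero, which is why wasteMem immediately hits its base case on a 32-bit platform.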
All that said, this is a poorly conceived test to begin with. Haskell is lazy, so the amount of memory consumed by wasteMem n for any value n is minimal - it's just a thunk. Once you try to show this result it will grab elements off the list one at a time - first generating "[12432483483467856487256348746328761, and leaving the rest of the list as a thunk. The first value can be garbage collected before the second value is even considered (a constant-space program).
Adding to Thomas' answer, if you really want to waste space, you have to perform an operation on the list, which needs the whole list in memory at once. One such operation is sorting:
print . sort . wasteMem $ (2^16)
Also note that it's almost impossible to estimate the run-time memory usage of your list. If you want a more predictable memory benchmark, create an unboxed array instead of a list. This also doesn't require any complicated operation to ensure that everything stays in memory. Indexing a single element in an array already makes sure that the array is in memory at least once.
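A minimal sketch of the unboxed-array variant, assuming the array package that ships with GHC. filled is a name made up for this sketch, and the size is arbitrary.

```haskell
import Data.Array.Unboxed (UArray, listArray, (!))

-- An unboxed array holding n copies of x. The elements are stored as raw
-- machine integers, so the footprint is roughly n machine words.
filled :: Int -> Int -> UArray Int Int
filled n x = listArray (0, n - 1) (replicate n x)

main :: IO ()
main = do
  let arr = filled 1000000 199
  -- Indexing a single element forces the whole array to be built.
  print (arr ! 999999)
```

On a 64-bit GHC this should come to about 8 MB for the array payload, which you can compare against the figures reported by +RTS -s.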