Haskell GPGME much slower at signing than GPG - haskell

My program uses GPG for signing files. I'm using GPGME in Haskell and the issue is that it is about 16 times slower than using GPG from the command line. Here is an example:
Haskell code:
module Main where
import qualified Data.ByteString as B
import qualified Crypto.Gpgme as Gpg
main :: IO ()
main = do
fileContents <- B.readFile "randomfile"
eitherSigned <- sign fileContents
case eitherSigned of
Left err -> print err
Right signed ->
B.writeFile "signedfile" signed
sign :: B.ByteString -> IO (Either [Gpg.InvalidKey] B.ByteString)
sign bs =
let
sign' :: Gpg.Ctx -> IO (Either [Gpg.InvalidKey] Gpg.Plain)
sign' ctx = Gpg.sign ctx [] Gpg.Normal bs
in
Gpg.withCtx "/home/t/.gnupg" "C" Gpg.OpenPGP sign'
I generate a junk 10MB file with
$ dd if=/dev/zero of=randomfile bs=10000000 count=1
I sign it with GPG from the command line:
$ time gpg -s randomfile
and get
gpg -s randomfile 0.14s user 0.00s system 99% cpu 0.136 total
I sign it with my Haskell program with
$ time stack exec hasksign
and get
stack exec hasksign 0.27s user 0.07s system 14% cpu 2.239 total
I tried running the Haskell code again with profiling on and got this result:
hasksign +RTS -p -RTS randomfile
total time = 2.21 secs (2208 ticks # 1000 us, 1 processor)
total alloc = 115,627,312 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
newCtx Crypto.Gpgme.Ctx src/Crypto/Gpgme/Ctx.hs:(21,1)-(51,21) 77.0 0.0
signIntern Crypto.Gpgme.Crypto src/Crypto/Gpgme/Crypto.hs:(310,1)-(354,14) 20.4 8.6
collectResult.go.\ Crypto.Gpgme.Internal src/Crypto/Gpgme/Internal.hs:(29,21)-(34,46) 2.1 81.9
main Main src/Main.hs:(7,1)-(13,43) 0.1 8.7
I have looked in the newCtx function where the time is spent but it is not clear to me what is costing so much.

Related

Haskell/GHC per thread memory costs

I'm trying to understand how expensive a (green) thread in Haskell (GHC 7.10.1 on OS X 10.10.5) really is. I'm aware that its super cheap compared to a real OS thread, both for memory usage and for CPU.
Right, so I started writing a super simple program with forks n (green) threads (using the excellent async library) and then just sleeps each thread for m seconds.
Well, that's easy enough:
$ cat PerTheadMem.hs
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)
main = do
args <- getArgs
let (numThreads, sleep) = case args of
numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
_ -> error "wrong args"
mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]
and first of all, let's compile and run it:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
that should fork 100k threads and wait 10s in each and then print us some information:
$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 648 colls, 0 par 0.373s 0.415s 0.0006s 0.0223s
Gen 1 11 colls, 0 par 0.298s 0.431s 0.0392s 0.1535s
INIT time 0.000s ( 0.000s elapsed)
MUT time 79.062s ( 92.803s elapsed)
GC time 0.670s ( 0.846s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.065s ( 0.091s elapsed)
Total time 79.798s ( 93.740s elapsed)
%GC time 0.8% (0.9% elapsed)
Alloc rate 4,312,344 bytes per MUT second
Productivity 99.2% of total user, 84.4% of total elapsed
real 1m33.757s
user 1m19.799s
sys 0m2.260s
It took quite long (1m33.757s) given that each thread is supposed to only just wait for 10s but we've build it non-threaded so fair enough for now. All in all, we used 350 MB, that's not too bad, that's 3.5 KB per thread. Given that the initial stack size (-ki is 1 KB).
Right, but now let's compile is in threaded mode and see if we can get any faster:
$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6435 colls, 0 par 0.860s 1.022s 0.0002s 0.0028s
Gen 1 20 colls, 0 par 2.206s 2.740s 0.1370s 0.3874s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.879s ( 8.534s elapsed)
GC time 3.066s ( 3.762s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.074s ( 0.247s elapsed)
Total time 4.021s ( 12.545s elapsed)
Alloc rate 4,544,893,364 bytes per MUT second
Productivity 23.7% of total user, 7.6% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 0m12.565s
user 0m4.021s
sys 0m1.154s
Wow, much faster, just 12s now, way better. From Activity Monitor I saw that it roughly used 4 OS threads for the 100k green threads, which makes sense.
However, 3657 MB total memory ! That's 10x more than the non-threaded version used...
Up until now, I didn't do any profiling using -prof or -hy or so. To investigate a bit more I then did some heap profiling (-hy) in separate runs. The memory usage didn't change in either case, the heap profiling graphs look interestingly different (left: non-threaded, right: threaded) but I can't find the reason for the 10x difference.
Diffing the profiling output (.prof files) I can also not find any real difference.
Therefore my question: Where does the 10x difference in memory usage come from?
EDIT: Just to mention it: The same difference applies when the program isn't even compiled with profiling support. So running time ./PerTheadMem 100000 10 +RTS -sstderr with ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs is 3559 MB. And with ghc -rtsopts -fforce-recomp PerTheadMem.hs it's 395 MB.
EDIT 2: On Linux (GHC 7.10.2 on Linux 3.13.0-32-generic #57-Ubuntu SMP, x86_64) the same happens: Non-threaded 460 MB in 1m28.538s and threaded is 3483 MB is 12.604s. /usr/bin/time -v ... reports Maximum resident set size (kbytes): 413684 and Maximum resident set size (kbytes): 1645384 respectively.
EDIT 3: Also changed the program to use forkIO directly:
import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)
main = do
args <- getArgs
let (numThreads, sleep) = case args of
numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
_ -> error "wrong args"
mvar <- newEmptyMVar
mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
[1..numThreads]
mapM_ (\_ -> takeMVar mvar) [1..numThreads]
And it doesn't change anything: non-threaded: 152 MB, threaded: 3308 MB.
IMHO, the culprit is threadDelay. *threadDelay** uses a lot of memory. Here is a program equivalent to yours that behaves better with memory. It ensures that all the threads are running concurrently by having a long-running computation.
uBound = 38
lBound = 34
doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n | n < uBound && n > 0 = let
a = doSomething (n-1)
b = doSomething (n-2)
in a `seq` b `seq` (a + b)
| otherwise = doSomething (n `mod` uBound )
e :: Chan Integer -> Int -> IO ()
e mvar i =
do
let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound) )
y `seq` writeChan mvar y
main =
do
args <- getArgs
let (numThreads, sleep) = case args of
numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
_ -> error "wrong args"
dld = (sleep*1000*1000)
chan <- newChan
mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
putStrLn "All threads created"
mapM_ (\_ -> readChan chan >>= putStrLn . show ) [1..numThreads]
putStrLn "All read"
And here are the timing statistics:
$ ghc -rtsopts -O -threaded test.hs
$ ./test 200 10 +RTS -sstderr -N4
133,541,985,480 bytes allocated in the heap
176,531,576 bytes copied during GC
356,384 bytes maximum residency (16 sample(s))
94,256 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 64246 colls, 64246 par 1.185s 0.901s 0.0000s 0.0274s
Gen 1 16 colls, 15 par 0.004s 0.002s 0.0001s 0.0002s
Parallel GC work balance: 65.96% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.003s elapsed)
MUT time 63.747s ( 16.333s elapsed)
GC time 1.189s ( 0.903s elapsed)
EXIT time 0.001s ( 0.000s elapsed)
Total time 64.938s ( 17.239s elapsed)
Alloc rate 2,094,861,384 bytes per MUT second
Productivity 98.2% of total user, 369.8% of total elapsed
gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2
Maximum residency is at around 1.5 kb per thread. I played a bit with the number of threads and the running length of the computation. Since threads start doing stuff immediately after forkIO, creating 100000 threads actually takes a very long time. But the results held for 1000 threads.
Here is another program where threadDelay has been "factored out", this one doesn't use any CPU and can be executed easily with 100000 threads:
e :: MVar () -> MVar () -> IO ()
e start end =
do
takeMVar start
putMVar end ()
main =
do
args <- getArgs
let (numThreads, sleep) = case args of
numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
_ -> error "wrong args"
starts <- mapM (const newEmptyMVar ) [1..numThreads]
ends <- mapM (const newEmptyMVar ) [1..numThreads]
mapM_ (\ (start,end) -> forkIO $ e start end) (zip starts ends)
mapM_ (\ start -> putMVar start () ) starts
putStrLn "All threads created"
threadDelay (sleep * 1000 * 1000)
mapM_ (\ end -> takeMVar end ) ends
putStrLn "All done"
And the results:
129,270,632 bytes allocated in the heap
404,154,872 bytes copied during GC
77,844,160 bytes maximum residency (10 sample(s))
10,929,688 bytes maximum slop
165 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 128 colls, 128 par 0.178s 0.079s 0.0006s 0.0152s
Gen 1 10 colls, 9 par 0.367s 0.137s 0.0137s 0.0325s
Parallel GC work balance: 50.09% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.189s ( 10.094s elapsed)
GC time 0.545s ( 0.217s elapsed)
EXIT time 0.001s ( 0.002s elapsed)
Total time 0.735s ( 10.313s elapsed)
Alloc rate 685,509,460 bytes per MUT second
Productivity 25.9% of total user, 1.8% of total elapsed
On my i5, it takes less than one second to create the 100000 threads and put the "start" mvar. The peak residency is at around 778 bytes per thread, not bad at all!
Checking threadDelay's implementation, we see that it is effectively different for the threaded and unthreaded case:
https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay
Then here: https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html
which looks innocent enough. But an older version of base has an arcane spelling of (memory) doom for those that invoke threadDelay:
https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121
If there is still an issue or not, it is hard to say. However, one can always hope that a "real life" concurrent program won't need to have too many threads waiting on threadDelay at the same time. I for one will keep an eye on my usage of threadDelay from now on.

Heap overflow in Haskell

I'm getting Heap exhausted message when running the following short Haskell program on a big enough dataset. For example, the program fails (with heap overflow) on 20 Mb input file with around 900k lines. The heap size was set (through -with-rtsopts) to 1 Gb. It runs ok if longestCommonSubstrB is defined as something simpler, e.g. commonPrefix. I need to process files in the order of 100 Mb.
I compiled the program with the following command line (GHC 7.8.3):
ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs
I would appreciate any help in making this thing run in a reasonable amount of space (in the order of the input file size), but I would especially appreciate the thought process of finding where the bottleneck is and where and how to force the strictness.
My guess is that somehow forcing longestCommonSubstrB function to evaluate strictly would solve the problem, but I don't know how to do that.
{-# LANGUAGE BangPatterns #-}
module Main where
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)
-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do
text <- B.readFile filename
return $ map (B.split '\t' . stripR) . B.lines $ text
where
stripR = B.reverse . B.dropWhile isSpace . B.reverse
transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) =
B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a
longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $
[f xs' ys | xs' <- B.tails xs] ++
[f xs ys' | ys' <- tail $ B.tails ys]
where f xs' ys' = scanl g B.empty $ B.zip xs' ys'
g z (x, y) = if x == y
then z `B.snoc` x
else B.empty
main :: IO ()
main = do
(input:output:_) <- getArgs
lexicon <- readLexicon input
let flattened = B.unlines . sort . map transformOne $ lexicon
B.writeFile output flattened
This is the profile ouput for the test dataset (100k lines, heap size set to 1 GB, i.e. generateSample.exe 100000, the resulting file size is 2.38 MB):
Heap profile over time:
Execution statistics:
3,505,737,588 bytes allocated in the heap
785,283,180 bytes copied during GC
62,390,372 bytes maximum residency (44 sample(s))
216,592 bytes maximum slop
96 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6697 colls, 0 par 1.05s 1.03s 0.0002s 0.0013s
Gen 1 44 colls, 0 par 4.14s 3.99s 0.0906s 0.1935s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.80s ( 9.17s elapsed)
GC time 3.75s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 1.44s ( 1.35s elapsed)
EXIT time 0.02s ( 0.00s elapsed)
Total time 13.02s ( 12.85s elapsed)
%GC time 28.8% (28.6% elapsed)
Alloc rate 449,633,678 bytes per MUT second
Productivity 60.1% of total user, 60.9% of total elapsed
Time and Allocation Profiling Report:
SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt
total time = 3.97 secs (3967 ticks # 1000 us, 1 processor)
total alloc = 2,321,595,564 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
longestCommonSubstrB Main 43.3 33.1
longestCommonSubstrB.f Main 21.5 43.6
main.flattened Main 17.5 5.1
main Main 6.6 5.8
longestCommonSubstrB.g Main 5.0 5.8
readLexicon Main 2.5 2.8
transformOne Main 1.8 1.7
readLexicon.stripR Main 1.8 1.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.1 0.0 100.0 100.0
main Main 91 0 6.6 5.8 99.9 100.0
main.flattened Main 93 1 17.5 5.1 89.1 89.4
transformOne Main 95 100000 1.8 1.7 71.6 84.3
longestCommonSubstrB Main 100 100000 43.3 33.1 69.8 82.5
longestCommonSubstrB.f Main 101 1400000 21.5 43.6 26.5 49.5
longestCommonSubstrB.g Main 104 4200000 5.0 5.8 5.0 5.8
readLexicon Main 92 1 2.5 2.8 4.2 4.8
readLexicon.stripR Main 98 0 1.8 1.9 1.8 1.9
CAF GHC.IO.Encoding.CodePage 80 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 74 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 70 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 66 0 0.0 0.0 0.0 0.0
CAF System.Environment 65 0 0.0 0.0 0.0 0.0
CAF Data.ByteString.Lazy.Char8 54 0 0.0 0.0 0.0 0.0
CAF Main 52 0 0.0 0.0 0.0 0.0
transformOne Main 99 0 0.0 0.0 0.0 0.0
readLexicon Main 96 0 0.0 0.0 0.0 0.0
readLexicon.stripR Main 97 1 0.0 0.0 0.0 0.0
main Main 90 1 0.0 0.0 0.0 0.0
UPDATE: The following program can be used to generate sample data. It expects one argument, the number of lines in the generated dataset. The generated data will be saved to the sample.txt file. When I generate 900k lines dataset with it (by running generateSample.exe 900000), the produced dataset makes the above program fail with heap overflow (the heap size was set to 1 GB). The resulting dataset is around 20 MB.
module Main where
import System.Environment (getArgs)
import Data.List (intercalate, permutations)
generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
where
f = cycle . permutations
main :: IO ()
main = do
(n:_) <- getArgs
let flattened = unlines . map f $ generate (read n :: Int)
writeFile "sample.txt" flattened
where
f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]
It seems to me you've implemented a naive longest common substring, with terrible space complexity (at least O(n^2)). Strictness has nothing to do with it.
You'll want to implement a dynamic programming algo. You may find inspiration in the string-similarity package, or in the lcs function in the guts of the Diff package.

Making a histogram computation in Haskell faster

I am quite new to Haskell and I am wanting to create a histogram. I am using Data.Vector.Unboxed to fuse operations on the data; which is blazing fast (when compiled with -O -fllvm) and the bottleneck is my fold application; which aggregates the bucket counts.
How can I make it faster? I read about trying to reduce the number of thunks by keeping things strict so I've made things strict by using seq and foldr' but not seeing much performance increase. Your ideas are strongly encouraged.
import qualified Data.Vector.Unboxed as V
histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
where
n = 10000000
c = 1000000
k = V.generate n (\i -> i `div` c * c)
v = V.generate n (\i -> 1)
agg kv [] = [kv]
agg kv#(k,v) acc#((ck,cv):as)
| k == ck = let a = (ck,cv+v):as in a `seq` a
| otherwise = let a = kv:acc in a `seq` a
main :: IO ()
main = print histogram
Compiled with:
ghc --make -O -fllvm histogram.hs
First, compile the program with -O2 -rtsopts. Then, to get a first idea where you could optimize, run the program with the options +RTS -sstderr:
$ ./question +RTS -sstderr
[(0,1000000),(1000000,1000000),(2000000,1000000),(3000000,1000000),(4000000,1000000),(5000000,1000000),(6000000,1000000),(7000000,1000000),(8000000,1000000),(9000000,1000000)]
1,193,907,224 bytes allocated in the heap
1,078,027,784 bytes copied during GC
282,023,968 bytes maximum residency (7 sample(s))
86,755,184 bytes maximum slop
763 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1964 colls, 0 par 3.99s 4.05s 0.0021s 0.0116s
Gen 1 7 colls, 0 par 1.60s 1.68s 0.2399s 0.6665s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.67s ( 2.68s elapsed)
GC time 5.59s ( 5.73s elapsed)
EXIT time 0.02s ( 0.03s elapsed)
Total time 8.29s ( 8.43s elapsed)
%GC time 67.4% (67.9% elapsed)
Alloc rate 446,869,876 bytes per MUT second
Productivity 32.6% of total user, 32.0% of total elapsed
Notice that 67% of your time is spent in GC! There is clearly something wrong. To find out what is wrong, we can run the program with heap profiling enabled (using +RTS -h), which produces the following figure:
So, you're leaking thunks. How does this happen? Looking at the code, the only time where a thunk is build up (recursively) in agg is when you do the addition. Making cv strict by adding a bang pattern thus fixes the issue:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as V
histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
where
n = 10000000
c = 1000000
k = V.generate n (\i -> i `div` c * c)
v = V.generate n id
agg kv [] = [kv]
agg kv#(k,v) acc#((ck,!cv):as) -- Note the !
| k == ck = (ck,cv+v):as
| otherwise = kv:acc
main :: IO ()
main = print histogram
Output:
$ time ./improved +RTS -sstderr
[(0,499999500000),(1000000,1499999500000),(2000000,2499999500000),(3000000,3499999500000),(4000000,4499999500000),(5000000,5499999500000),(6000000,6499999500000),(7000000,7499999500000),(8000000,8499999500000),(9000000,9499999500000)]
672,063,056 bytes allocated in the heap
94,664 bytes copied during GC
160,028,816 bytes maximum residency (2 sample(s))
1,464,176 bytes maximum slop
155 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 992 colls, 0 par 0.03s 0.03s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.03s 0.03s 0.0161s 0.0319s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.24s ( 1.25s elapsed)
GC time 0.06s ( 0.06s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.34s ( 1.34s elapsed)
%GC time 4.4% (4.5% elapsed)
Alloc rate 540,674,868 bytes per MUT second
Productivity 95.5% of total user, 95.1% of total elapsed
./improved +RTS -sstderr 1,14s user 0,20s system 99% cpu 1,352 total
This is much better.
So now you could ask, why did the issue appear, even though you used seq? The reason for this is the seq only forces the first argument to be WHNF, and for a pair, (_,_) (where _ are unevaluated thunks) is already WHNF! Also, seq a a is the same as a, because it seq a b (informally) means: evaluate a before b is evaluated, so seq a a just means: evaluate a before a is evaluated, and that is the same as just evaluating a!

Why does attoparsec use 100 times more memory than my input file?

I have a 2.5 MB file full of floats separated by spaces (the code below can generate it for you) and want to parse it into an array with attoparsec.
It is surprisingly slow, taking almost a second, and allocating a lot of memory:
time ./Attoparsec-problem +RTS -sstderr < floats.txt
299999.0
956,647,344 bytes allocated in the heap
752,875,520 bytes copied during GC
166,485,416 bytes maximum residency (7 sample(s))
8,874,168 bytes maximum slop
337 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1604 colls, 0 par 0.21s 0.27s 0.0002s 0.0092s
Gen 1 7 colls, 0 par 0.24s 0.36s 0.0520s 0.1783s
...
%GC time 65.5% (75.1% elapsed)
Alloc rate 3,985,781,488 bytes per MUT second
Productivity 34.5% of total user, 28.6% of total elapsed
My parser is incredibly simple: It is essentially double <* skipSpace.
This is the code:
import Control.Applicative
import Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as U
-- Compile with:
-- ghc --make -O2 -prof -auto-all -caf-all -rtsopts -fforce-recomp Attoparsec-problem.hs
-- Run:
-- time ./Attoparsec-problem +RTS -sstderr < floats.txt
main :: IO ()
main = do
-- For creating the test file (2.5 MB)
-- writeFile "floats.txt" (Prelude.unwords $ Prelude.map show [1.0,2.0..300000.0])
r <- parse parser <$> BS.getContents
case r of
Done _ arr -> print $ U.last arr
x -> print x
where
parser = do
U.replicateM (300000-1) (double <* skipSpace)
-- This gives surprisingly bad productivity (70% GC time) and 180 MB max residency
-- for a 2.5 MB file!
Can you explain me what is going on?

Why does the strictness flag make memory usage increase?

The following two programs differ only by the strictness flag on variable st
$ cat testStrictL.hs
module Main (main) where
import qualified Data.Vector as V
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Mutable as MV
len = 5000000
testL = do
mv <- MV.new len
let go i = do
if i >= len then return () else
do let st = show (i+10000000) -- no strictness flag
MV.write mv i st
go (i+1)
go 0
v <- GV.unsafeFreeze mv :: IO (V.Vector String)
return v
main =
do
v <- testL
print (V.length v)
mapM_ print $ V.toList $ V.slice 4000000 5 v
$ cat testStrictS.hs
module Main (main) where
import qualified Data.Vector as V
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Mutable as MV
len = 5000000
testS = do
mv <- MV.new len
let go i = do
if i >= len then return () else
do let !st = show (i+10000000) -- this has the strictness flag
MV.write mv i st
go (i+1)
go 0
v <- GV.unsafeFreeze mv :: IO (V.Vector String)
return v
main =
do
v <- testS
print (V.length v)
mapM_ print $ V.toList $ V.slice 4000000 5 v
Compiling and running these two programs on Ubuntu 10.10 with ghc 7.03
I get the following results
$ ghc --make testStrictL.hs -O3 -rtsopts
[2 of 2] Compiling Main ( testStrictL.hs, testStrictL.o )
Linking testStrictL ...
$ ghc --make testStrictS.hs -O3 -rtsopts
[2 of 2] Compiling Main ( testStrictS.hs, testStrictS.o )
Linking testStrictS ...
$ ./testStrictS +RTS -sstderr
./testStrictS +RTS -sstderr
5000000
"14000000"
"14000001"
"14000002"
"14000003"
"14000004"
824,145,164 bytes allocated in the heap
1,531,590,312 bytes copied during GC
349,989,148 bytes maximum residency (6 sample(s))
1,464,492 bytes maximum slop
656 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 1526 collections, 0 parallel, 5.96s, 6.04s elapsed
Generation 1: 6 collections, 0 parallel, 2.79s, 4.36s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.77s ( 2.64s elapsed)
GC time 8.76s ( 10.40s elapsed)
EXIT time 0.00s ( 0.13s elapsed)
Total time 10.52s ( 13.04s elapsed)
%GC time 83.2% (79.8% elapsed)
Alloc rate 466,113,027 bytes per MUT second
Productivity 16.8% of total user, 13.6% of total elapsed
$ ./testStrictL +RTS -sstderr
./testStrictL +RTS -sstderr
5000000
"14000000"
"14000001"
"14000002"
"14000003"
"14000004"
81,091,372 bytes allocated in the heap
143,799,376 bytes copied during GC
44,653,636 bytes maximum residency (3 sample(s))
1,005,516 bytes maximum slop
79 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 112 collections, 0 parallel, 0.54s, 0.59s elapsed
Generation 1: 3 collections, 0 parallel, 0.41s, 0.45s elapsed
INIT time 0.00s ( 0.03s elapsed)
MUT time 0.12s ( 0.18s elapsed)
GC time 0.95s ( 1.04s elapsed)
EXIT time 0.00s ( 0.06s elapsed)
Total time 1.06s ( 1.24s elapsed)
%GC time 89.1% (83.3% elapsed)
Alloc rate 699,015,343 bytes per MUT second
Productivity 10.9% of total user, 9.3% of total elapsed
Could someone please explain why the strictness flag seems to cause the program to use so much
memory? This simple example came about from my attempts to understand why my programs
use so much memory when reading large files of 5 million lines and creating vectors
of records.
The problem here is mainly that you're using the String (which is an alias for [Char]) type which due to its representation as a non-strict list of single Chars requires 5 words per characters on the memory heap (See also this blog article for some memory footprint comparisons)
In the lazy case, you basically store an unevaluated thunk pointing to the (shared) evaluation function show . (+10000000) and a varying integer, whereas in the strict case the complete strings composed of 8 characters seem to be materialized (usually the bang-pattern would only force the outermost list-constructor :, i.e. the first character of a lazy String, to be evaluated), which requires way more heap space the longer the strings become.
Storing 5000000 String-typed strings of length 8 thus requires 5000000*8*5 = 200000000 words, which on 32-bit correspond to about ~763 MiB. If the Char digits are shared, you only need 3/5 of that, i.e. ~458 MiB, which seems to match your observed memory overhead.
If you replace your String by something more compact such as a Data.ByteString.ByteString you'll notice that the memory overhead will be about one magnitude lower compared to when using a plain String.

Resources