I'm trying to parse large log files in Haskell. I'm using System.IO.Streams, but it seems to use a lot of memory when I fold over the input. Here are two (ugly) examples:
First, load 1M Ints into memory as a list.
let l = foldl (\aux p -> p:aux) [] [1..1000000]
return (sum l)
Memory consumption is beautiful. The Ints take 3 MB and the list needs 6 MB:
[Figure: memory consumption of building a list of 1M Ints]
Then try the same with a Stream of ByteStrings. We need an ugly back-and-forth conversion, but I don't think it makes any difference.
let s = Streams.fromList $ map (B.pack . show) [1..1000000]
l <- s >>=
     Streams.map bsToInt >>=
     Streams.fold (\aux p -> p:aux) []
return (sum l)
[Figure: memory consumption of building a list of Ints from a stream]
Why does it need more memory? It's even worse if I read the input from a file: then it needs 90 MB.
    result <- withFileAsInput file load
    putStrLn $ "loaded " ++ show result
  where
    load is = do
        l <- Streams.lines is >>=
             Streams.map bsToInt >>=
             Streams.fold (\aux p -> p:aux) []
        return (sum l)
My assumption is that Streams.fold has some issues, because the library's built-in countInput function doesn't use it. Any ideas?
EDIT
After some investigation I reduced the question to this: why does this code need an extra 50 MB?
do
    let l = map (Builder.toLazyByteString . intDec ) [1..1000000]
    let l2 = map (fst . fromJust . B.readInt) l
    return (foldl' (\aux p -> p:aux) [] l2)
Without the conversions it only needs 30 MB; with the conversions it needs 90 MB.
In your first example, the foldl (\aux p -> p:aux) [] is redundant. It constructs a list with the same elements as the list it takes as an argument! Without the redundancy, the example is equivalent to sum [1..1000000] or foldl (+) 0 [1..1000000]. Also, it would be better to use the strict left fold foldl' to avoid the accumulation of reducible expressions on the heap. See Foldr Foldl Foldl' on the Haskell wiki.
In your last example, you are using System.IO.Streams.Combinators.fold for building a list of all the integers which are read from the file, and then try to sum the list like you did in your first example.
The problem is that, because of the sequencing of file read operations imposed by the IO monad, all the data in the file has been read before you start summing the list, and is lurking on the heap, possibly still untransformed from the original Strings and taking even more memory.
The solution is to perform the actual sum inside the fold as each new element arrives; that way you don't need to have the full list in memory at any time, only the current element (being able to do this while performing I/O is one of the aims of streaming libraries). And the fold provided by io-streams is strict, analogous to foldl'. So you don't accumulate reducible expressions on the heap, either.
Try something like System.IO.Streams.Combinators.fold (+) 0.
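The same idea can be shown without any streaming library. The sketch below deliberately swaps io-streams for plain base functions (hIsEOF and hGetLine), assumes the file holds one decimal integer per line, and uses the made-up names sumLines and sumFile. Only the running total is kept; no list of all the numbers is ever built.

```haskell
import System.IO (Handle, IOMode (ReadMode), hGetLine, hIsEOF, withFile)

-- Sum one integer per line, keeping only the running total in memory.
-- Assumption: every line parses as an Int (read fails otherwise).
sumLines :: Handle -> IO Int
sumLines h = go 0
  where
    go acc = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          let acc' = acc + read line
          -- Force the new total before looping, so no chain of
          -- unevaluated additions builds up on the heap.
          acc' `seq` go acc'

-- Hypothetical convenience wrapper around sumLines.
sumFile :: FilePath -> IO Int
sumFile path = withFile path ReadMode sumLines
```

With io-streams itself, the equivalent step is the Streams.fold (+) 0 suggested above, applied to the stream of parsed Ints.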
So the problem was the lazy creation of ByteStrings and not with the iterator.
See
Why creating and disposing temporal ByteStrings eats up my memory in Haskell?
Related
Learn You a Haskell demonstrates the DiffList concept:
*Main Control.Monad.Writer> let f = \xs -> "dog" ++ ("meat" ++ xs)
*Main Control.Monad.Writer> f "foo"
"dogmeatfoo"
Is the primary benefit of the DiffList that the list gets constructed from left to right?
The DList package lists some of the asymptotics: https://hackage.haskell.org/package/dlist-0.5/docs/Data-DList.html
You'll note lots of things only take O(1), including cons, snoc, and append. However, note that inspecting the list needs to force lots of operations each time, so if you are doing more inspecting than construction, or interleaving the two, the DList approach won't necessarily be a win.
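To see where those O(1) bounds come from, here is a minimal hand-rolled difference list. It is a sketch for illustration, not the dlist package's actual implementation; the type name DiffList and the functions below are defined here and only mirror names commonly used for this structure.

```haskell
-- A difference list represents a list as "the function that prepends it".
newtype DiffList a = DiffList ([a] -> [a])

fromList :: [a] -> DiffList a
fromList xs = DiffList (xs ++)

-- Inspecting the contents means finally running all the deferred prepends.
toList :: DiffList a -> [a]
toList (DiffList f) = f []

-- Appending is just function composition, hence O(1).
append :: DiffList a -> DiffList a -> DiffList a
append (DiffList f) (DiffList g) = DiffList (f . g)

-- Adding at the end is also a single composition.
snoc :: DiffList a -> a -> DiffList a
snoc (DiffList f) x = DiffList (f . (x :))
```

For example, toList (fromList "dog" `append` fromList "meat" `append` fromList "foo") yields "dogmeatfoo". However the appends are nested, the underlying (++) calls end up right-nested, which is the benefit; the cost is that every toList has to rebuild the whole list.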
In the simple code below, part of the definition of a function that deletes an element from a binary search tree:
deleteB x (Node n l r) | x == n = Node (leastB r) l (deleteB (leastB r) r)
does the compiler optimize the code so that it calls leastB r only once, as if it were:
deleteB x (Node n l r) | x == n = Node k l (deleteB k r)
    where k = leastB r
?
In other words, is the compiler able to understand that since parameter r isn't changed within the body of the function deleteB, the result of the call of the same function (leastB) on it can't give different results, hence it is useless to compute it twice?
More generally, how could I find out for myself whether the compiler performs this optimization, if the amazing Stack Overflow did not exist? Thanks.
If you want to know what GHC "really did", you want to look at the "Core" output.
GHC takes your Haskell source code, which is extremely high-level, and transforms it into a sequence of lower and lower-level languages:
Haskell ⇒ Core ⇒ STG ⇒ C−− ⇒ assembly language ⇒ machine code
Almost all of the high-level optimisations happen in Core. The one you're asking about is basically "common subexpression elimination" (CSE). If you think about it, this is a time / space tradeoff; by saving the previous result, you're using less CPU time, but also using more RAM. If the result you're trying to store is tiny (e.g., an integer), this can be worth it. If the result is huge (e.g., the entire contents of that 17GB text file you just loaded), this is probably a Very Bad Idea.
As I understand it (⇒ not very well!), GHC tends not to do CSE. But if you want to know for sure, in your specific case, you want to look at the Core that your program has actually been compiled into. I believe the switch you want is -ddump-prep (-ddump-simpl is another commonly used one).
http://www.haskell.org/ghc/docs/7.0.2/html/users_guide/options-debugging.html
GHC does not perform this optimization because it is not always an optimization space-wise.
For instance, consider
n = 1000000
x = (length $ map succ [1..n], length $ map pred [1..n])
In a lazy language such as Haskell, one would expect this to run in constant space. Indeed, the list-generating expression [1..n] should lazily produce one element at a time, which would be transformed by succ/pred because of the maps and then counted by length. (Even better, succ and pred are not computed at all, since length does not force the list elements.) After this, the produced element can be garbage collected, and the list generator can produce the next element, and so on. In real implementations, one would not expect every single element to be garbage collected immediately, but if the garbage collector is good, only a constant number of them should be in memory at any time.
By comparison, the "optimized" code
n = 1000000
l = [1..n]
x = (length $ map succ l, length $ map pred l)
does not allow the elements of l to be garbage collected until both components of x are evaluated. So, while it produces the list only once, it uses O(n) words of memory to store the full list. This is likely to lead to lower performance than the unoptimized code.
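Applied to the deleteB question: since GHC cannot be relied on to introduce this sharing, it can be written explicitly with a where binding, where the shared value is just a single key and therefore cheap to keep. The following is a minimal sketch. The question shows only one equation of deleteB, so the Tree type, the Leaf constructor, the remaining cases, and the helpers insertB and inorder are all assumptions made for illustration.

```haskell
-- Hypothetical tree type; the question does not show its definition.
data Tree a = Leaf | Node a (Tree a) (Tree a)

-- Smallest key of a non-empty tree.
leastB :: Tree a -> a
leastB (Node n Leaf _) = n
leastB (Node _ l _)    = leastB l
leastB Leaf            = error "leastB: empty tree"

deleteB :: Ord a => a -> Tree a -> Tree a
deleteB _ Leaf = Leaf
deleteB x (Node n l r)
  | x < n     = Node n (deleteB x l) r
  | x > n     = Node n l (deleteB x r)
  | otherwise = case (l, r) of
      (Leaf, _) -> r
      (_, Leaf) -> l
      _         -> Node k l (deleteB k r)
  where
    -- Bound once and used twice above. Being lazy, it is only
    -- evaluated in the branch that needs it, and then at most once.
    k = leastB r

-- Helpers used only to build and inspect example trees.
insertB :: Ord a => a -> Tree a -> Tree a
insertB x Leaf = Node x Leaf Leaf
insertB x t@(Node n l r)
  | x < n     = Node n (insertB x l) r
  | x > n     = Node n l (insertB x r)
  | otherwise = t

inorder :: Tree a -> [a]
inorder Leaf         = []
inorder (Node n l r) = inorder l ++ [n] ++ inorder r
```

Naming the value yourself makes the sharing part of the program's meaning under lazy evaluation rather than something the optimizer may or may not do.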
The two code snippets below are taken from the concurrency chapter of the RWH book:
force :: [a] -> ()
force xs = go xs `pseq` ()
    where go (_:xs) = go xs
          go [] = 1

randomInts :: Int -> StdGen -> [Int]
randomInts k g = let result = take k (randoms g)
                 in force result `seq` result
randomInts is a function that generates a list of random numbers for testing the performance of a parallel sorting algorithm. The book mentions that the code above avoids a potential problem. This is what it says:
Invisible data dependencies.
When we generate the list of random numbers, simply printing the
length of the list would not perform enough evaluation. This would
evaluate the spine of the list, but not its elements. The actual
random numbers would not be evaluated until the sort compares them.
This can have serious consequences for performance. The value of a
random number depends on the value of the preceding random number in
the list, but we have scattered the list elements randomly among our
processor cores. If we did not evaluate the list elements prior to
sorting, we would suffer a terrible “ping pong” effect: not only would
evaluation bounce from one core to another, performance would suffer.
Try snipping out the application of force from the body of main above:
you should find that the parallel code can easily end up three times
slower than the non-parallel code.
So basically they are saying that by using the force function they have avoided the ping-pong problem. But again during the explanation of the force function, they describe it like this:
Notice that we don't care what's in the list; we walk down its spine
to the end, then use pseq once. There is clearly no magic involved
here: we are just using our usual understanding of Haskell's
evaluation model. And because we will be using force on the left hand
side of par or pseq, we don't need to return a meaningful value.
As seen from the definition of the force function and the explanation above, the individual list elements are not evaluated. So how does the randomInts function actually avoid the ping-pong effect? Is this an error in the book, or am I misunderstanding something?
randomInts doesn't actually seem to suffer from the ping-pong effect. Although force only pattern-matches on the spine of the list, in practice walking the spine can leave the elements evaluated as well, as this example shows:
import Control.Parallel (par, pseq)
force :: [a] -> ()
force xs = go xs `pseq` ()
    where go (_:xs) = go xs
          go [] = 1
In ghci:
ghci > let a = [1..10]
ghci > :sprint a
a = _
ghci > force a
()
ghci > :sprint a
a = [1,2,3,4,5,6,7,8,9,10]
So, at least for this list, force ends up fully evaluating it, which is what would save randomInts from the ping-pong effect.
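One caveat is worth checking for yourself: force never looks at the elements, so whether they end up evaluated depends on how the list is produced, not on force. The sketch below makes that visible by using undefined as the elements. It imports pseq from GHC.Conc in base (an assumption made here so that the parallel package is not needed), and the name spineOnly is made up for the demonstration.

```haskell
import GHC.Conc (pseq)

-- Same shape as the book's force: walk the spine, ignore the elements.
force :: [a] -> ()
force xs = go xs `pseq` ()
  where
    go (_ : ys) = go ys
    go []       = 1 :: Int

-- Every element is undefined, yet this evaluates to () without an
-- exception, because the elements themselves are never demanded.
spineOnly :: ()
spineOnly = force [undefined, undefined, undefined :: Int]
```

In a list like [1..10] each element is already known by the time its cons cell is produced, which would explain the :sprint output above.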
I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.
GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations as it is when compiled with -O2.
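If you do want the list shared no matter how the function is written, one option is to name it at top level yourself. A small sketch: m1 and m2 are the definitions from the question, while odds and m2Shared are names made up here, and the Int annotations are an assumption added to keep every type monomorphic.

```haskell
-- Point-free: the list expression sits outside any lambda.
m1 :: Int -> Int
m1 = (filter odd [1 ..] !!)

-- The list expression sits inside the lambda for n.
m2 :: Int -> Int
m2 n = filter odd [1 ..] !! n

-- Explicit sharing: a top-level list, evaluated lazily and at most once,
-- and kept alive for as long as anything can still refer to it.
odds :: [Int]
odds = filter odd [1 ..]

m2Shared :: Int -> Int
m2Shared n = odds !! n
```

All three functions return the same results; they differ only in how much work repeated calls redo and in how much memory stays occupied between calls.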
m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the Haskell wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form
There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, where m1's is not.
I'm not sure, because I'm quite new to Haskell myself, but it appears that it's because the second function is parametrized and the first one is not. The nature of a function is that its result depends on the input value, and in the functional paradigm especially it depends ONLY on the input. The obvious implication is that a function with no parameters always returns the same value, no matter what.
Apparently there's an optimizing mechanism in the GHC compiler that exploits this fact to compute the value of such a function only once for the whole program run. It does it lazily, to be sure, but does it nonetheless. I noticed it myself when I wrote the following function:
primes = filter isPrime [2..]
  where
    isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
      where f `divides` n = (n `mod` f) == 0
Then, to test it, I entered GHCi and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly, I got the result for take 1000 primes in an instant, because Haskell had already had to compute the first thousand elements of the list in order to return the 1001st element.
Thus if you can write your function such that it takes no parameters, you probably want to. ;)
I have a list of key-value pairs and I want to count how many times each key occurs and what values it occurs with, but when I try, I get a stack overflow. Here's a simplified version of the code I'm running:
import Data.Array (accumArray)
add (n, vals) val = n `seq` vals `seq` (n+1,val:vals)
histo = accumArray add (0,[]) (0,9) [(0, n) | n <- [0..5000000]]
main = print histo
When I compile this with 'ghc -O' and run it, I get "Stack space overflow: current size 8388608 bytes."
I think I know what's going on: accumArray has the same properties as foldl, and so I need a strict version of accumArray. Unfortunately, the only one I've found is in Data.Array.Unboxed, which doesn't work for an array of lists.
The documentation says that when the accumulating function is strict, then accumArray should be too, but I can't get this to work, and the discussion here claims that the documentation is wrong (at least for GHC).
Is there a strict version of accumArray other than the one in Data.Array.Unboxed? Or is there a better way to do what I want?
Well, strict doesn't necessarily mean that no thunks are created, it just means that if an argument is bottom, the result is bottom too. But accumArray is not that strict, it just writes bottoms to the array if they occur. It can't really do anything else, since it must allow for non-strict functions that could produce defined values from intermediate bottoms. And the strictness analyser can't rewrite it so that the accumulation function is evaluated to WHNF on each write if it is strict, because that would change the semantics of the programme in a rather drastic way (an array containing some bottoms vs. bottom).
That said, I agree that there's an unfortunate lack of strict and eager functions in several areas.
For your problem, you can use a larger stack (+RTS -K128M didn't suffice here, but 256M did), or you can use
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import GHC.Arr

strictAccumArray :: Ix i => (e -> a -> e) -> e -> (i,i) -> [(i,a)] -> Array i e
strictAccumArray fun ini (l,u) ies = case iar of
    Array _ _ m barr -> Array l u m barr
  where
    iar = runSTArray $ do
        let n = safeRangeSize (l,u)
            stuff = [(safeIndex (l,u) n i, e) | (i, e) <- ies]
        arr <- newArray (0,n-1) ini
        let go ((i,v):ivs) = do
                old <- unsafeRead arr i
                unsafeWrite arr i $! fun old v
                go ivs
            go [] = return arr
        go stuff
With a strict write, the thunks are kept small, so there's no stack overflow. But beware, the lists take a lot of space, so if your list is too long, you may get a heap exhaustion.
Another option would be to use a Data.Map (or Data.IntMap, if the version of containers is 0.4.1.0 or later) instead of an array, since that comes with insertWith', which forces the result of the combining function on use. The code could for example be
import qualified Data.Map as M -- or Data.IntMap
import Data.List (foldl')

histo :: M.Map Int (Int,[Int]) -- M.IntMap (Int,[Int])
histo = foldl' upd M.empty [(0,n) | n <- [0 .. 15000000]]
  where
    upd mp (i,n) = M.insertWith' add i (1,[n]) mp
    add (j,val:_) (k,vals) = k `seq` vals `seq` (k+j,val:vals)
    add _ pr = pr -- to avoid non-exhaustive pattern warning
Disadvantages of using a Map are
- The combining function must have type a -> a -> a, so it needs to be a bit more complicated in your case.
- An update is O(log size) instead of O(1), so for large histograms it will be considerably slower.
- Maps and IntMaps have some book-keeping overhead, so they use more space than an array. The overhead is k words per key, independent of the size of the values, so in a case like this one, where the list of updates is large compared to the number of indices and the size of the values grows with each update, the difference will be negligible.
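In newer versions of containers, insertWith' no longer exists, and Data.Map.Strict.insertWith plays that role. Here is a minimal sketch of the same histogram under that assumption. Unlike the version above, it takes the list of updates as a parameter, and it forces the count by hand, because the strict map only evaluates the stored pair to weak head normal form.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- For each key: how many times it occurred, and the values it occurred
-- with (most recent first).
histo :: [(Int, Int)] -> M.Map Int (Int, [Int])
histo = foldl' upd M.empty
  where
    upd mp (i, n) = M.insertWith add i (1, [n]) mp
    -- insertWith calls the combining function as: add newValue oldValue.
    -- Forcing c before building the pair keeps the counter from
    -- accumulating a chain of unevaluated additions.
    add (j, new) (k, old) = let c = k + j in c `seq` (c, new ++ old)
```

For example, histo [(0,1),(0,2),(1,5)] maps key 0 to (2,[2,1]) and key 1 to (1,[5]). The memory needed for the value lists themselves is of course unchanged.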