Counting different values using Data.Map leaks memory - haskell

The following program uses 100+ MB RAM when counting different line lengths in a 250 MB file. How do I fix it to use less RAM? I suppose I misused lazy IO, foldr and laziness of Data.Map in values.
import Control.Applicative
import qualified Data.Map as M
import Data.List
main = do
content <- readFile "output.csv"
print $ (foldr count M.empty . map length . lines) content
count a b = M.insertWith (+) a 1 b

The first big mistake in
main = do
content <- readFile "output.csv"
print $ (foldr count M.empty . map length . lines) content
count a b = M.insertWith (+) a 1 b
is using foldr. That constructs an expression of the form
length firstLine `count` length secondLine `count` ... `count` length lastLine `count` M.empty
traversing the entire list of lines constructing the thunk - at that time not even evaluating the length calls due to laziness - before it is then evaluated right to left. So the entire file contents is in memory in addition to the thunk for building the Map.
If you build up a map from a list of things, always use a strict left fold (well, if the list is short, and the things not huge, it doesn't matter) unless the semantics require a right fold (if you're combining values using a non-commutative function, that might be the case, but even then it is often preferable to use a left fold and reverse the list before building the map).
Data.Maps (or Data.IntMaps) are spine-strict, that alone makes it impossible to generate partial output before the entire list has been traversed, so the strengths of foldr cannot be used here.
The next (possible) problem is (again laziness) that you don't evaluate the mapped-to values when putting them in the Map, so if there is a line length that occurs particularly often, that value becomes a huge thunk
((...((1+1)+1)...+1)+1)
Make it
main = do
content <- readFile "output.csv"
print $ (foldl' count M.empty . map length . lines) content
count mp a = M.insertWith' (+) a 1 mp
so that the lines can be garbage collected as soon as they have been read in, and no thunks can build up in the values. That way you never need more than one line of the file in memory at once, and even that need not be in memory entirely, since the length is evaluated before it is recorded in the Map.
If your containers package is recent enough, you could also
import Data.Map.Strict
and leave count using insertWith (without prime, the Data.Map.Strict module always evaluates values put into the map).

One way to get max residency down is to use IntMap instead of Map, which is a specialized version of the Map data structure for Int keys. It's a simple change:
import Control.Applicative
import qualified Data.IntMap as I
import Data.List
main = do
content <- readFile "output.csv"
print $ (foldr count I.empty . map length . lines) content
count a b = I.insertWith (+) a 1 b
Comparing this version against yours using /usr/share/dict/words saw max residency go from about 100MB to 60MB. Note that this was also without any optimization flags. If you crank those, max residency will very likely see further improvement.

Related

Does Haskell's `readFile` read the whole file content into memory?

I want to select the n-th last line from a large text file (~10GB) in a Haskell program.
I found a way how to get the n-th last from an internal string:
myLen = 7
n = 3 -- one-based from the end
myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)
main :: IO ()
main = do
putStrLn theLine
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
So, is readFile the right approach here? Plus how do I get the IO String output from readFile "in a lazy way" into a list of lines so that I can then select the n-th last line?
The question has several parts:
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
Not necessarily. If you only iterate over the contents and produce a result, then the garbage collector should deallocate the contents.
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach because "lazy IO" is a can of worms.
If it's for a quick and dirty script then go ahead, but if not, and if performance is important, then it is probably best to use lower level calls to read strict ByteStrings, and for your problem read directly from the end of the file and process that.
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs
takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as
main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of this as operating in "one pass" of the file, looking at subsequent chunks of n consecutive lines until it finds the last chunk. Because it does not maintain any references to the contents of the file from before the chunk it's currently looking at, all of the file contents up to the current chunk can be garbage collected (and generally is, fairly soon).

Can I exploit lazy evaluation to reference future values without space leaks?

I'm looking to try to run a moderately expensive function on a large list of inputs, using part of the output of that function as one of its inputs. The code runs as expected, unfortunately it consumes a large amount of memory in the process (just under 22GiB on the heap, just over 1GiB maximum residency). Here is a simplified example of what I mean:
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import qualified Data.Text.Lazy.Builder as TB
main :: IO ()
main = TL.putStr $ TB.toLazyText showInts
showInts :: TB.Builder
showInts = foldMap fst shownLines
where
shownLines = map (showInt maxwidth) [0..10^7]
maxwidth = foldl' (\n -> max n . snd) 0 shownLines
showInt :: Int -> Int -> (TB.Builder, Int)
showInt maxwidth n = (builder, len)
where
builder = TB.fromText "This number: "
<> TB.fromText (T.replicate (maxwidth - len) " ") <> thisText
<> TB.singleton '\n'
(thisText, len) = expensiveShow n
expensiveShow :: Int -> (TB.Builder, Int)
expensiveShow n = (TB.fromText text, T.length text)
where text = T.pack (show n)
Note that in the where clause of showInts, showInt takes maxwidth as an argument, where maxwidth itself depends on the output of running showInt maxwidth on the whole list.
If, on the other hand, I do the naìˆve thing and replace the definition of maxwidth with foldl' max 0 $ map (snd . expensiveShow) [0..10^7], then maximum residency falls to just 44KiB. I would hope that performance like this would be achievable without workarounds like precomputing expensiveShow and then zipping it with the list [0..10^7].
I tried consuming the list strictly (using the foldl package), but this did not improve the situation.
I'm trying to have my cake and eat it too: exploiting laziness, while also making things strict enough that we don't build up a mountain of thunks. Is this possible to do? Or is there a better technique for accomplishing this?
You can't do it like this.
The problem is that your showInts has to traverse the list twice, first to find the longest number, second to print the numbers with the necessary format. That means the list has to be held in memory between the first and second passes. This isn't a problem with unevaluated thunks; it is simply that the whole list, completely evaluated, is being traversed twice.
The only solution is to generate the same list twice. In this case it is trivial; just have two [0..10^7] values, one for the maximum length and the second to format them. I suspect in your real application you are reading them from a file or something, in which case you need to read the file twice.

Haskell requires more memory than Python when I read map from file. Why?

I have this simple code in Python:
input = open("baseforms.txt","r",encoding='utf8')
S = {}
for i in input:
words = i.split()
S.update( {j:words[0] for j in words} )
print(S.get("sometext","not found"))
print(len(S))
It requires 300MB for work. "baseforms.txt" size is 123M.
I've wrote the same code in Haskell:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding(decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Control.Monad(liftM)
main = do
text <- B.readFile "baseforms.txt"
let m = (M.fromList . (concatMap (parseLine.decodeUtf8))) (B.lines text)
print (M.lookup "sometext" m)
print (M.size m)
where
parseLine line = let base:forms = T.words line in [(f,base)| f<-forms]
It requires 544 MB and it's slower than Python version. Why? Is it possible to optimise Haskell version?
There is a lot happening in the Haskell version that's not happening in the Python version.
readFile uses lazy IO, which is a bit weird in general. I would generally avoid lazy IO.
The file, as a bytestring, is broken into lines which are then decoded as UTF-8. This seems a little unnecessary, given the existence of Text IO functions.
The Haskell version is using a tree (Data.Map) whereas the Python version is using a hash table.
The strings are all lazy, which is probably not necessary if they're relatively short. Lazy strings have a couple words of overhead per string, which can add up. You could fuse the lazy strings, or you could read the file all at once, or you could use something like conduit.
GHC uses a copying collector, whereas the default Python implementation uses malloc() with reference counting and the occasional GC. This fact alone can account for large differences in memory usage, depending on your program.
Who knows how many thunks are getting created in the Haskell version.
It's unknown whether you've enabled optimizations.
It's unknown how much slower the Haskell version is.
We don't have your data file so we can't really test it ourselves.
It's a bit late, but I studied this a little and think Dietrich Epp's account is right, but can be simplified a little. Notice that there doesn't seem to be any real python programming going on in the python file: it is orchestrating a very simple sequence of calls to C string operations and then to a C hash table implementation. (This is often a problem with really simple python v. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map, which is a fancy tree. So the main points of opposition here are C vs Haskell, and hashtable-with-destructive-update vs persistent map. Since there is little overlap in the input file, the tree you are constructing includes all the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This is I think the source of the alarm you are experiencing, but it can be explained.
Compare these two files, one using ByteString:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
main = do m <- fmap proc (B.readFile "baseforms.txt")
print (M.lookup (B.pack "sometext") m)
print (M.size m)
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
. map B.words . B.lines
and the other a Text-ified equivalent:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import Data.Text.Encoding(decodeUtf8)
import qualified Data.Text as T
main = do
m <- fmap proc (B.readFile "baseforms.txt")
print (M.lookup (T.pack "sometext") m)
print (M.size m)
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
. map T.words . T.lines . decodeUtf8
On my machine, the python/C takes just under 6 seconds, the bytestring file takes 8 seconds, and the text file just over 10.
The bytestring implementation seems to use a bit more memory than the python, the text implementation distinctly more. The text implementation takes more time because, of course, it adds a conversion to text and then uses text operations to break the string and text comparisons to build the map.
Here is a go at analyzing the memory phenomena in the text case. First we have the bytestring in memory (130m). Once the text is constructed (~250m, to judge unscientifically from what's going on in top), the bytestring is garbage collected while we construct the tree. In the end the text tree (~380m it looks like) uses more memory than the bytestring tree (~260m) because the text fragments in the tree are bigger. The program as a whole uses more because the text held in memory during the tree construction is itself bigger. To put it crudely: each bit of white-space is being turned into a tree constructor and two text constructors together with the text version of whatever the first 'word' of the line was and whatever the text representation next word is. The weight of the constructors seems in either case to be about 130m, so at the last moment of the construction of the tree we are using something like 130m + 130m + 130m = 390m in the bytestring case, and 250m + 130m + 250m = 630m in the text case.

Fast counting known elements in a list

I am trying to write a solution for one of the Hackerrank problems. The challenge is to count elements in a list, the elements vary from 0 to 99, so it is possible to count them in linear time. Here is what I got:
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O3 #-}
module Main where
import Data.STRef
import Data.Foldable
import Control.Monad
import Control.Monad.ST
main = do
line1 <- getLine
line2 <- getLine
let
!ns = map read $ words line2 :: [Int]
res = runST $ do
refs <- forM [0..99] $ \i ->
newSTRef (0 :: Int)
traverse_ (\x -> modifySTRef' (refs !! x) (+1) ) ns
mapM (\ref -> readSTRef ref) refs
putStrLn . unwords . map show $ res
This code works but not fast enough to pass the last test case. Can someone recommend an improvement to it? (link to the problem)
This can be done as a one-liner using accumArray from Data.Array. Something like accumArray (+) 0 (0,99) . zip values $ repeat 1 where values is the input.
It appears to still not be fast enough, which is somewhat vexing. accumArray is more or less as efficient as possible for what it does. Testing on my system reveals the time for processing 1,000,000 input values to be about 1 second, even without compiling it, and that time is dominated by generating the random inputs. That's a far cry from the 5 seconds on the test site.. I have to wonder how overloaded that system is.
One problem you have is that you're looking up your STRefs in a list which means that you'll have to traverse O(n) steps for every lookup and modification. This can be alleviated by using something like Data.Map.Map which has O(log(n)) lookup and modification time.
You could also use a mutable Array or Vector for O(1) lookup/modification time in the ST monad. This is probably the fastest method.

Why is Data.Binary's encodeFile not acting lazy?

In GHCI, I run this simple test:
encodeFile "test" [0..10000000]
The line runs really quickly (<10sec), but my memory usage shoots up to ~500MB before it finishes. Shouldn't encodeFile be lazy since it uses ByteString.Lazy?
Edit: Roman's answer below is great! I also want to point out this answer to another question, that explains why Data.Binary does strict encoding on lists and provides a slightly more elegant work around.
Here's how serialization of lists is defined:
instance Binary a => Binary [a] where
put l = put (length l) >> mapM_ put l
That is, first serialize the length of the list, then serialize the list itself.
In order to find out the length of the list, we need to evaluate the whole list.
But we cannot garbage-collect it, because its elements are needed for the second
part, mapM_ put l. So the whole list has to be stored in memory after the
length is evaluated and before the elements serialization starts.
Here's how the heap profile looks like:
Notice how it grows while the list is being built to compute its length, and
then decreases while the elements are serialized and can be collected by the GC.
So, how to fix this? In your example, you already know the length. So you
can write a function which takes the known length, as opposed to computing it:
import Data.Binary
import Data.ByteString.Lazy as L
import qualified Data.ByteString as B
import Data.Binary.Put
main = do
let len = 10000001 :: Int
bs = encodeWithLength len [0..len-1]
L.writeFile "test" bs
putWithLength :: Binary a => Int -> [a] -> Put
putWithLength len list =
put len >> mapM_ put list
encodeWithLength :: Binary a => Int -> [a] -> ByteString
encodeWithLength len list = runPut $ putWithLength len list
This program runs within 53k of heap space.
You can also include a safety feature into putWithLength: compute the length while serializing the list, and check with the first argument in the end. If there's a mismatch, throw an error.
Exercise: why do you still need to pass in the length to putWithLength instead of using the computed value as described above?

Resources