Why is Data.Binary's encodeFile not acting lazy? - haskell

In GHCI, I run this simple test:
encodeFile "test" [0..10000000]
The line runs really quickly (<10sec), but my memory usage shoots up to ~500MB before it finishes. Shouldn't encodeFile be lazy since it uses ByteString.Lazy?
Edit: Roman's answer below is great! I also want to point out this answer to another question, that explains why Data.Binary does strict encoding on lists and provides a slightly more elegant work around.

Here's how serialization of lists is defined:
instance Binary a => Binary [a] where
put l = put (length l) >> mapM_ put l
That is, first serialize the length of the list, then serialize the list itself.
In order to find out the length of the list, we need to evaluate the whole list.
But we cannot garbage-collect it, because its elements are needed for the second
part, mapM_ put l. So the whole list has to be stored in memory after the
length is evaluated and before the elements serialization starts.
Here's how the heap profile looks like:
Notice how it grows while the list is being built to compute its length, and
then decreases while the elements are serialized and can be collected by the GC.
So, how to fix this? In your example, you already know the length. So you
can write a function which takes the known length, as opposed to computing it:
import Data.Binary
import Data.ByteString.Lazy as L
import qualified Data.ByteString as B
import Data.Binary.Put
main = do
let len = 10000001 :: Int
bs = encodeWithLength len [0..len-1]
L.writeFile "test" bs
putWithLength :: Binary a => Int -> [a] -> Put
putWithLength len list =
put len >> mapM_ put list
encodeWithLength :: Binary a => Int -> [a] -> ByteString
encodeWithLength len list = runPut $ putWithLength len list
This program runs within 53k of heap space.
You can also include a safety feature into putWithLength: compute the length while serializing the list, and check with the first argument in the end. If there's a mismatch, throw an error.
Exercise: why do you still need to pass in the length to putWithLength instead of using the computed value as described above?

Related

Can I exploit lazy evaluation to reference future values without space leaks?

I'm looking to try to run a moderately expensive function on a large list of inputs, using part of the output of that function as one of its inputs. The code runs as expected, unfortunately it consumes a large amount of memory in the process (just under 22GiB on the heap, just over 1GiB maximum residency). Here is a simplified example of what I mean:
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import qualified Data.Text.Lazy.Builder as TB
main :: IO ()
main = TL.putStr $ TB.toLazyText showInts
showInts :: TB.Builder
showInts = foldMap fst shownLines
where
shownLines = map (showInt maxwidth) [0..10^7]
maxwidth = foldl' (\n -> max n . snd) 0 shownLines
showInt :: Int -> Int -> (TB.Builder, Int)
showInt maxwidth n = (builder, len)
where
builder = TB.fromText "This number: "
<> TB.fromText (T.replicate (maxwidth - len) " ") <> thisText
<> TB.singleton '\n'
(thisText, len) = expensiveShow n
expensiveShow :: Int -> (TB.Builder, Int)
expensiveShow n = (TB.fromText text, T.length text)
where text = T.pack (show n)
Note that in the where clause of showInts, showInt takes maxwidth as an argument, where maxwidth itself depends on the output of running showInt maxwidth on the whole list.
If, on the other hand, I do the naìˆve thing and replace the definition of maxwidth with foldl' max 0 $ map (snd . expensiveShow) [0..10^7], then maximum residency falls to just 44KiB. I would hope that performance like this would be achievable without workarounds like precomputing expensiveShow and then zipping it with the list [0..10^7].
I tried consuming the list strictly (using the foldl package), but this did not improve the situation.
I'm trying to have my cake and eat it too: exploiting laziness, while also making things strict enough that we don't build up a mountain of thunks. Is this possible to do? Or is there a better technique for accomplishing this?
You can't do it like this.
The problem is that your showInts has to traverse the list twice, first to find the longest number, second to print the numbers with the necessary format. That means the list has to be held in memory between the first and second passes. This isn't a problem with unevaluated thunks; it is simply that the whole list, completely evaluated, is being traversed twice.
The only solution is to generate the same list twice. In this case it is trivial; just have two [0..10^7] values, one for the maximum length and the second to format them. I suspect in your real application you are reading them from a file or something, in which case you need to read the file twice.

Haskell - generate and use the same random list

In Haskell, I'd like to generate a random list of Ints and use it throughout my program. My current solution causes the array to be created randomly each time I access/use it.
How can I overcome this problem?
Here is a simple example (#luqui mentioned) you should be able to generalize to your need:
module Main where
import Control.Monad (replicateM)
import System.Random (randomRIO)
main :: IO ()
main = do
randomList <- randomInts 10 (1,6)
print randomList
let s = myFunUsingRandomList randomList
print s
myFunUsingRandomList :: [Int] -> Int
myFunUsingRandomList list = sum list
randomInts :: Int -> (Int,Int) -> IO [Int]
randomInts len bounds = replicateM len $ randomRIO bounds
remarks
randomInts is an IO-action that will produce you your random list of integers, you see it's use in the very first line of mains do-block. Once it is used there the list will stay the same.
myFunUsingRandomList is a very simple example (just summing it) of using such a list in a pure fashion - indeed it just expects a list of integers and I just happen to call it with the random list from main
Follow this example. Then, pass rs from main to whatever function needs it.
You can't have a global variable for this without using something like unsafePerformIO, which, as the name indicates, should not be used. The reason is that picking random numbers is an inherently non-deterministic operation, dependent on the outside IO state.

Reading a file into an array of strings

I'm pretty new to Haskell, and am trying to simply read a file into a list of strings. I'd like one line of the file per element of the list. But I'm running into a type issue that I don't understand. Here's what I've written for my function:
readAllTheLines hdl = (hGetLine hdl):(readAllTheLines hdl)
That compiles fine. I had thought that the file handle needed to be the same one returned from openFile. I attempted to simply show the list from the above function by doing the following:
displayFile path = show (readAllTheLines (openFile path ReadMode))
But when I try to compile it, I get the following error:
filefun.hs:5:43:
Couldn't match expected type 'Handle' with actual type 'IO Handle'
In the return type of a call of 'openFile'
In the first argument of 'readAllTheLines', namely
'(openFile path ReadMode)'
In the first argument of 'show', namely
'(readAllTheLines (openFile path ReadMode))'
So it seems like openFile returns an IO Handle, but hGetLine needs a plain old Handle. Am I misunderstanding the use of these 2 functions? Are they not intended to be used together? Or is there just a piece I'm missing?
Use readFile and lines for a better alternative.
readLines :: FilePath -> IO [String]
readLines = fmap lines . readFile
Coming back to your solution openFile returns IO Handle so you have to run the action to get the Handle. You also have to check if the Handle is at eof before reading something from that. It is much simpler to just use the above solution.
import System.IO
readAllTheLines :: Handle -> IO [String]
readAllTheLines hndl = do
eof <- hIsEOF hndl
notEnded eof
where notEnded False = do
line <- hGetLine hndl
rest <- readAllTheLines hndl
return (line:rest)
notEnded True = return []
displayFile :: FilePath -> IO [String]
displayFile path = do
hndl <- openFile path ReadMode
readAllTheLines hndl
To add on to Satvik's answer, the example below shows how you can utilize a function to populate an instance of Haskell's STArray typeclass in case you need to perform computations on a truly random access data type.
Code Example
Let's say we have the following problem. We have lines in a text file "test.txt", and we need to load it into an array and then display the line found in the center of that file. This kind of computation is exactly the sort situation where one would want to use a random access array over a sequentially structured list. Granted, in this example, there may not be a huge difference between using a list and an array, but, generally speaking, list accesses will cost O(n) in time whereas array accesses will give you constant time performance.
First, let's create our sample text file:
test.txt
This
is
definitely
a
test.
Given the file above, we can use the following Haskell program (located in the same directory as test.txt) to print out the middle line of text, i.e. the word "definitely."
Main.hs
{-# LANGUAGE BlockArguments #-} -- See footnote 1
import Control.Monad.ST (runST, ST)
import Data.Array.MArray (newArray, readArray, writeArray)
import Data.Array.ST (STArray)
import Data.Foldable (for_)
import Data.Ix (Ix) -- See footnote 2
populateArray :: (Integral i, Ix i) => STArray s i e -> [e] -> ST s () -- See footnote 3
populateArray stArray es = for_ (zip [0..] es) (uncurry (writeArray stArray))
middleWord' :: (Integral i, Ix i) => i -> STArray s i String -> ST s String
middleWord' arrayLength = flip readArray (arrayLength `div` 2)
middleWord :: [String] -> String
middleWord ws = runST do
let len = length ws
array <- newArray (0, len - 1) "" :: ST s (STArray s Int String)
populateArray array ws
middleWord' len array
main :: IO ()
main = do
ws <- words <$> readFile "test.txt"
putStrLn $ middleWord ws
Explanation
Starting with the top of Main.hs, the ST s monad and its associated function runST allow us to extract pure values from imperative-style computations with in-place updates in a referentially transparent manner. The module Data.Array.MArray exports the MArray typeclass as an interface for instantiating mutable array data types and provides helper functions for creating, reading, and writing MArrays. These functions can be used in conjunction with STArrays since there is an instance of MArray defined for STArray.
The populateArray function is the crux of our example. It uses for_ to "applicatively" loop over a list of tuples of indices and list elements to fill the given STArray with those list elements, producing a value of type () in the ST s monad.
The middleWord' helper function uses readArray to produce a String (wrapped in the ST s monad) that corresponds to the middle element of a given STArray of Strings.
The middleWord function instantiates a new STArray, uses populateArray to fill the array with values from a provided list of strings, and calls middleWord' to obtain the middle string in the array. runST is applied to this whole ST s monadic computation to extract the pure String result.
We finally use our middleWord function in main to find the middle word in the text file "test.txt".
Further Reading
Haskell's STArray is not the only way to work with arrays in Haskell. There are in fact Arrays, IOArrays, DiffArrays and even "unboxed" versions of all of these array types that avoid using the indirection of pointers to simply store "raw" values. There is a page on the Haskell Wikibook on this topic that may be worth some study. Before that, however, looking at the Wikibook page on mutable objects may give you some insight as to why the ST s monad allows us to safely compute pure values from functions that use imperative/destructive operations.
Footnotes
1 The BlockArguments language extension is what allows us to pass a do block directly to a function without any parentheses or use of the function application operator $.
2 As suggested by the Hackage documentation, Ix is a typeclass mainly meant to be used to specify types for indexing arrays.
3 The use of the Integral and Ix type constraints may be a bit of overkill, but it's used to make our type signatures as general as possible.

Counting different values using Data.Map leaks memory

The following program uses 100+ MB RAM when counting different line lengths in a 250 MB file. How do I fix it to use less RAM? I suppose I misused lazy IO, foldr and laziness of Data.Map in values.
import Control.Applicative
import qualified Data.Map as M
import Data.List
main = do
content <- readFile "output.csv"
print $ (foldr count M.empty . map length . lines) content
count a b = M.insertWith (+) a 1 b
The first big mistake in
main = do
content <- readFile "output.csv"
print $ (foldr count M.empty . map length . lines) content
count a b = M.insertWith (+) a 1 b
is using foldr. That constructs an expression of the form
length firstLine `count` length secondLine `count` ... `count` length lastLine `count` M.empty
traversing the entire list of lines constructing the thunk - at that time not even evaluating the length calls due to laziness - before it is then evaluated right to left. So the entire file contents is in memory in addition to the thunk for building the Map.
If you build up a map from a list of things, always use a strict left fold (well, if the list is short, and the things not huge, it doesn't matter) unless the semantics require a right fold (if you're combining values using a non-commutative function, that might be the case, but even then it is often preferable to use a left fold and reverse the list before building the map).
Data.Maps (or Data.IntMaps) are spine-strict, that alone makes it impossible to generate partial output before the entire list has been traversed, so the strengths of foldr cannot be used here.
The next (possible) problem is (again laziness) that you don't evaluate the mapped-to values when putting them in the Map, so if there is a line length that occurs particularly often, that value becomes a huge thunk
((...((1+1)+1)...+1)+1)
Make it
main = do
content <- readFile "output.csv"
print $ (foldl' count M.empty . map length . lines) content
count mp a = M.insertWith' (+) a 1 mp
so that the lines can be garbage collected as soon as they have been read in, and no thunks can build up in the values. That way you never need more than one line of the file in memory at once, and even that need not be in memory entirely, since the length is evaluated before it is recorded in the Map.
If your containers package is recent enough, you could also
import Data.Map.Strict
and leave count using insertWith (without prime, the Data.Map.Strict module always evaluates values put into the map).
One way to get max residency down is to use IntMap instead of Map, which is a specialized version of the Map data structure for Int keys. It's a simple change:
import Control.Applicative
import qualified Data.IntMap as I
import Data.List
main = do
content <- readFile "output.csv"
print $ (foldr count I.empty . map length . lines) content
count a b = I.insertWith (+) a 1 b
Comparing this version against yours using /usr/share/dict/words saw max residency go from about 100MB to 60MB. Note that this was also without any optimization flags. If you crank those, max residency will very likely see further improvement.

Dice Game in Haskell

I'm trying to spew out randomly generated dice for every roll that the user plays. The user has 3 rolls per turn and he gets to play 5 turns (I haven't implemented this part yet and I would appreciate suggestions).
I'm also wondering how I can display the colors randomly. I have the list of tuples in place, but I reckon I need some function that uses random and that list to match those colors. I'm struggling as to how.
module Main where
import System.IO
import System.Random
import Data.List
diceColor = [("Black",1),("Green",2),("Purple",3),("Red",4),("White",5),("Yellow",6)]
{-
randomList :: (RandomGen g) -> Int -> g -> [Integer]
random 0 _ = []
randomList n generator = r : randomList (n-1) newGenerator
where (r, newGenerator) = randomR (1, 6) generator
-}
rand :: Int -> [Int] -> IO ()
rand n rlst = do
num <- randomRIO (1::Int, 6)
if n == 0
then doSomething rlst
else rand (n-1) (num:rlst)
doSomething x = putStrLn (show (sort x))
main :: IO ()
main = do
--hSetBuffering stdin LineBuffering
putStrLn "roll, keep, score?"
cmd <- getLine
doYahtzee cmd
--rand (read cmd) []
doYahtzee :: String -> IO ()
doYahtzee cmd = do
if cmd == "roll"
then rand 5 []
else do print "You won"
There's really a lot of errors sprinkled throughout this code, which suggests to me that you tried to build the whole thing at once. This is a recipe for disaster; you should be building very small things and testing them often in ghci.
Lecture aside, you might find the following facts interesting (in order of the associated errors in your code):
List is deprecated; you should use Data.List instead.
No let is needed for top-level definitions.
Variable names must begin with a lower case letter.
Class prerequisites are separated from a type by =>.
The top-level module block should mainly have definitions; you should associate every where clause (especially the one near randomList) with a definition by either indenting it enough not to be a new line in the module block or keeping it on the same line as the definition you want it to be associated with.
do introduces a block; those things in the block should be indented equally and more than their context.
doYahtzee is declared and used as if it has three arguments, but seems to be defined as if it only has one.
The read function is used to parse a String. Unless you know what it does, using read to parse a String from another String is probably not what you want to do -- especially on user input.
putStrLn only takes one argument, not four, and that argument has to be a String. However, making a guess at what you wanted here, you might like the (!!) and print functions.
dieRoll doesn't seem to be defined anywhere.
It's possible that there are other errors, as well. Stylistically, I recommend that you check out replicateM, randomRs, and forever. You can use hoogle to search for their names and read more about them; in the future, you can also use it to search for functions you wish existed by their type.

Resources