Haskell ByteStrings - ending up with large file loaded into memory - haskell

Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
module Main where
import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad
type Record = BS.ByteString
importRecords :: String -> IO [Record]
importRecords filename = do
liftM (map importRecord.BS.lines) (BS.readFile filename)
importRecord :: BS.ByteString -> Record
importRecord txt = r
where
r = getField 26
getField f = BS.copy $ ((BS.split '\t' txt) !! f)
loopInput :: [Record] -> IO ()
loopInput jrs = do
putStrLn $ "Done" ++ (show $ last jrs)
hFlush stdout
x <- getLine
return ()
-- (***)
loopInput jrs
main = do
jrs <- importRecords "c:\\downloads\\lcg1m.txt"
loopInput jrs

Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
deriving (Eq, Ord, Show, Read)
This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
case BS.splitWith (== '-') bs of
(y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
_ -> Rec 0 0 0
rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evalute data when we convert from ByteString to Record, so lets use the parallel package and define an NFData instance from DeepSeq.
instance NFData Record where
rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:
main = do
jrs <- importRecords "./tabLines"
let jrs' = using jrs (evalList rdeepseq)
loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.
FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.

There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.

Related

Parallel Haskell with HXT

I'm trying to get performance increases in a program I have that parses XML. The program can parse multiple XML files so I thought that I could make this run in parallel, but all my attempts have resulted in lower performance!
For XML parsing, I am using HXT.
I have a run function defined like this:
run printTasks xs = pExec xs >>= return . concat >>= doPrint printTasks 1
'pExec' is given a list of file names and is defined as:
pExec xs = do
ex <- mapM exec xs
as <- ex `usingIO` parList rdeepseq
return as
where 'exec' is defined as:
exec = runX . process
threadscope shows only one thread e ver being used (until the very end).
Can anyone explain why I have failed so miserably to parallelise this code?
In case it helps:
exec :: FilePath -> [CV_scene]
pExec :: [FilePath] -> IO [[CV_scene]]
data CV_scene = Scene [CV_layer] Time deriving (Show)
data CV_layer = Layer [DirtyRects] SourceCrop deriving (Show)
data Rect = Rect Int Int Int Int deriving (Show)-- Left Top Width Height
instance NFData CV_scene where
rnf = foldScene reduceScene
where reduceScene l t = rnf (seq t l)
instance NFData CV_layer where
rnf = foldLayer reduceLayer
where reduceLayer d s = rnf (seq s d)
instance NFData Rect where
rnf = foldRect reduceRect
where reduceRect l t w h = rnf [l,t,w,h]
type SourceCrop = Rect
type DirtyRect = Rect
type Time = Int64
Thanks in advance for your help!
First, it looks like you mislabeled the signature of exec, which should probably be:
exec :: FilePath -> IO [CV_scene]
Now for the important part. I've commented inline on what I think you think is going on.
pExec xs = do
-- A. Parse the file found at each location via exec.
ex <- mapM exec xs
-- B. Force the lazy parsing in parallel.
as <- ex `usingIO` parList rdeepseq
return as
Note that line A does not happen in paralell, which you might think is okay since it will just set up the parsing thunks which are forced in parallel in B. This is a fair assumption, and a clever use of laziness, but the results pull that into question for me.
I suspect that the implementation of exec forces most of the parsing before line B is even reached so that the deep seq doesn't do much. That fits pretty well with my experince parsing and the profiling supports that explanation.
Without the ability to test your code, I can only make the following suggestions. First try separating the parsing of the file from the IO and put the parsing in the parallel execution strategy. In that case lines A and B become something like:
ex <- mapM readFile xs
as <- ex `usingIO` parList (rdeepseq . exec')
with exec' the portion of exec after the file is read from disk.
exec' :: FilePath -> [CVScene]
Also, you may not even need rdeepSeq after this change.
As an alternative, you can do the IO and parsing in parallel using Software Transactional Memory. STM approaches are normally used for separate IO threads which act more like services, rather than pure computations. But if for some reason you cant get the strategies based approach to work, this might be worth a try.
import Control.Concurrent.STM.TChan --(from stm package)
import Control.Concurrent(forkIO)
pExec'' :: [FilePath] -> IO [[CVSene]]
pExec'' xs = do
-- A. create [(Filename,TChan [CVScene])]
tcx <- mapM (\x -> (x,) <$> newTChanIO) xs
-- B. do the reading/parsing in separate threads
mapM_ (forkIO . exec'') tcx
-- C. Collect the results
cvs <- mapM (atomically . readTChan . snd) tcx
exec'' :: [(FilePath,TChan [CVScene])] -> IO ()
exec'' (x,tch) = do
--D. The original exec function
cv <- exec x
--E. Put on the channel fifo buffer
atomically $ writeTChan tch cv
Good luck!

Reduce memory usage of a Haskell program

I have a following program in Haskell:
processDate :: String -> IO ()
processDate date = do
...
let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice
propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
let addresses = fmap location properties
let batchAddresses = chunksOf 100 addresses
batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
return (zip properties geocodedLocations)
geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
mapQuestKey <- getEnv "MAP_QUEST_KEY"
geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
return $ geocodeResponseToResults geocodeResponse
geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
latLangs
where
decodedResponse :: Maybe GeocodingResponse
decodedResponse = decodeGeocodingResponse inputResponse
latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse
decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse
It reads a list of properties (homes and apartments) from html files, parses them, geocodes the addresses and saves the results into sqlite db.
Everything works fine except for a very high memory usage (around 800M).
By commenting code out I have pinpointed the problem to be the geocoding step.
I send 100 addresses at a time to MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive so it might be one of the culprits, but 800M? I feel like it holds to all of the results until the end which drives the memory usage so high.
After commenting out the geocoding part of the program memory usage is around 30M which is fine.
You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so
I'm quite a newbie in Haskell, so not sure how I can optimize it.
Any ideas?
Cheers!
It might be worth recording that this turned out to be a simple streaming problem arising from use of mapM and sequence, which with replicateM and traverse and other things that make you "extract a list from IO" always raise accumulation worries. So a little detour by a streaming library was needed. So in the repo it was necessary just to replace
processDate :: String -> IO ()
processDate date = do
allFiles <- listFiles date
allProperties <- mapM fileToProperties allFiles
let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
print geocodedProperties
propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
let batchProperties = chunksOf 100 properties
batchGeocodedLocations <- mapM geocodeAddresses batchProperties
let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
return geocodedLocations
with something like this
import Streaming
import qualified Streaming.Prelude as S
processDate :: String -> IO ()
processDate date = do
allFiles <- listFiles date -- we accept an unstreamed list
S.print $ propertiesWithGeocoding -- this was the main pain point see below
$ S.filter hasPrice
$ S.concat
$ S.mapM fileToProperties -- this mapM doesn't accumulate
$ S.each allFiles -- the list is converted to a stream
propertiesWithGeocoding
:: Stream (Of ParsedProperty) IO r
-> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =
S.concat $ S.concat
$ S.mapM geocodeAddresses -- this mapM doesn't accumulate results from mapquest
$ S.mapped S.toList -- convert segments to haskell lists
$ chunksOf 100 properties -- this is the streaming `chunksOf`
-- concat here flattens a stream of lists of as into a stream of as
-- and a stream of maybe as into a stream of as
Then the memory use looks like so, each peak corresponding to a trip to Mapquest promply followed by a little processing and a print, whereupon ghc forgets all about it and moves on:
Of course this could be done with pipes or conduit. But here we just need a little bit of simple mapM / sequence/ traverse / replicateM avoidance and streaming is perhaps simplest for this sort of quick local refactoring. Note that this list is quite short so the thought 'but short lists are cool with mapM/traverse/etc !" can be quite spectacularly false. Why not just get rid of them? Whenever you are about to write list mapM f it is a good idea to consider S.mapM f . S.each (or conduit or pipes equivalent) . You will now have a stream and can recover a list with S.toList or an equivalent, but it is likely that, as in this case, you will find you don't need a reified accumulated list but can e.g. use some streaming process like printing to file or stdout or writing things to a database, after making whatever list like manipulations are needed (here we use eg. streaming filter and also concat to flatten streamed lists and as a sort of catMaybe).

Haskell: How to use a HashMap in a main function

I beg for your help, speeding up the following program:
main = do
jobsToProcess <- fmap read getLine
forM_ [1..jobsToProcess] $ \_ -> do
[r, k] <- fmap (map read . words) getLine :: IO [Int]
putStrLn $ doSomeReallyLongWorkingJob r k
There could(!) be a lot of identical jobs to do, but it's not up to me modifying the inputs, so I tried to use Data.HashMap for backing up already processed jobs. I already optimized the algorithms in the doSomeReallyLongWorkingJob function, but now it seems, it's quite as fast as C.
But unfortunately it seems, I'm not able to implement a simple cache without producing a lot of errors. I need a simple cache of Type HashMap (Int, Int) Int, but everytime I have too much or too few brackets. And IF I manage to define the cache, I'm stuck in putting data into or retrieving data from the cache cause of lots of errors.
I already Googled for some hours but it seems I'm stuck. BTW: The result of the longrunner is an Int as well.
It's pretty simple to make a stateful action that caches operations. First some boilerplate:
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M
import Debug.Trace
I'll use Data.Map, but of course you can substitute in a hash map or any similar data structure without much trouble. My long-running computation will just add up its arguments. I'll use trace to show when this computation is executed; we'll hope not to see the output of the trace when we enter a duplicate input.
reallyLongRunningComputation :: [Int] -> Int
reallyLongRunningComputation args = traceShow args $ sum args
Now the caching operation will just look up whether we've seen a given input before. If we have, we'll return the precomputed answer; otherwise we'll compute the answer now and store it.
cache :: (MonadState (Map a b) m, Ord a) => (a -> b) -> a -> m b
cache f x = do
mCached <- gets (M.lookup x)
case mCached of
-- depending on your goals, you may wish to force `result` here
Nothing -> modify (M.insert x result) >> return result
Just cached -> return cached
where
result = f x
The main function now just consists of calling cache reallyLongRunningComputation on appropriate inputs.
main = do
iterations <- readLn
flip evalStateT M.empty . replicateM_ iterations
$ liftIO getLine
>>= liftIO . mapM readIO . words
>>= cache reallyLongRunningComputation
>>= liftIO . print
Let's try it in ghci!
> main
5
1 2 3
[1,2,3]
6
4 5
[4,5]
9
1 2
[1,2]
3
1 2
3
1 2 3
6
As you can see by the bracketed outputs, reallyLongRunningComputation was called the first time we entered 1 2 3 and the first time we entered 1 2, but not the second time we entered these inputs.
I hope i'm not too far off base, but first you need a way to carry around the past jobs with you. Easiest would be to use a foldM instead of a forM.
import Control.Monad
import Data.Maybe
main = do
jobsToProcess <- fmap read getLine
foldM doJobAcc acc0 [1..jobsToProcess]
where
acc0 = --initial value of some type of accumulator, i.e. hash map
doJobAcc acc _ = do
[r, k] <- fmap (map read . words) getLine :: IO [Int]
case getFromHash acc (r,k) of
Nothing -> do
i <- doSomeReallyLongWorkingJob r k
return $ insertNew acc (r,k) i
Just i -> do
return acc
Note, I don't actually use the interface for putting and getting the hash table key. It doesn't actually have to be a hash table, Data.Map from containers could work. Or even a list if its going to be a small one.
Another way to carry around the hash table would be to use a State transformer monad.
I am just adding this answer since I feel like the other answers are diverging a bit from the original question, namely using hashtable constructs in Main function (inside IO monad).
Here is a minimal hashtable example using hashtables module. To install the module with cabal, simply use
cabal install hashtables
In this example, we simply put some values in a hashtable and use lookup to print a value retrieved from the table.
import qualified Data.HashTable.IO as H
main :: IO ()
main = do
t <- H.new :: IO (H.CuckooHashTable Int String)
H.insert t 22 "Hello world"
H.insert t 5 "No problem"
msg <- H.lookup t 5
print msg
Notice that we need to use explicit type annotation to specify which implementation of the hashtable we wish to use.

Haskell speed / memory usage

I'm trying to process some Point Cloud data with Haskell, and it seems to use a LOT of memory. The code I'm using is below, it basically parses the data into a format I can work with. The dataset has 440MB with 10M rows. When I run it with runhaskell, it uses up all the ram in a short time (~3-4gb) and then crashes. If I compile it with -O2 and run it, it goes to 100% cpu and takes a long time to finish (~3 minutes). I should mention that I'm using an i7 cpu with 4GB ram and an SSD, so there should be plenty of resources. How can I improve the performance of this?
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)
filename :: FilePath
filename = "sample.txt"
readTextMaybe = readMaybe . unpack
data Classification = Classification
{ id :: Int, description :: Text
} deriving (Show)
data Point = Point
{ x :: Int, y :: Int, z :: Int, classification :: Classification
} deriving (Show)
type PointCloud = [Point]
maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
where toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
toMaybePoint _ _ _ _ _ = Nothing
parse :: [Text] -> Maybe Point
parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
parse _ = Nothing
readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines
main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)
The reason this uses all your memory when compiled without optimization is most likely because sum is defined using foldl. Without the strictness analysis that comes with optimization, that will blow up badly. You can try using this function instead:
sum' :: Num n => [n] -> n
sum' = foldl' (+) 0
The reason this is slow when compiled with optimization seems likely related to the way you parse the input. A cons will be allocated for each character when reading in the input, and again when breaking the input into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using one of the streaming ones like pipes or conduit may or may not be best (I'm not sure).
Another issue, not related to performance: fromJust is rather poor form in general, and is a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.

How do I avoid memory problems when writing to file using the Writer monad?

I am building some moderately large DIMACS files, however with the method used below the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run in to out of memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List
main = printDIMACS "test.cnf" test
test = do
xs <- freshs 100000
forM_ (zip xs (tail xs))
(\(x,y) -> addAll [[negate x, negate y],[x,y]])
type Var = Int
type Clause = [Var]
data DIMACSS = DS{
nextFresh :: Int,
numClauses :: Int
} deriving (Show)
type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a
freshs :: Int -> DIMACSM [Var]
freshs i = do
next <- gets nextFresh
let toRet = [next..next+i-1]
modify (\s -> s{nextFresh = next+i})
return toRet
fresh :: DIMACSM Int
fresh = do
i <- gets nextFresh
modify (\s -> s{nextFresh = i+1})
return i
addAll :: [Clause] -> DIMACSM ()
addAll c = do
tell
(B.concat .
intersperse (B.pack " 0\n") .
map (B.unwords . map BS.show) $ c)
tell (B.pack " 0\n")
modify (\s -> s{numClauses = numClauses s + length c})
add h = addAll [h]
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
(s,b) = runWriter (execStateT f (DS 1 0))
i = nextFresh s - 1
j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them as such, either using lazyness or a more explicit approach such as conduits, enumerators, pipes or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: Run it twice, first throwing away the clauses and just calculating the numbers, print the header line, run the generator again; now printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
Also, in your code, addAll is not lazy enough: The list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then addAll = mapM_ add.
As explained in Joachim Breitner's answer the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads was used and because the number of variables and clauses are needed before the ByteString can be written to the file. The solution is to use the lazy versions of the Monads and execute them twice. It turns out that it is also necessary to have WriterT be the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...
type DIMACSM a = WriterT B.ByteString (State DIMACSS) a
...
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
s = execState (execWriterT f) (DS 1 0)
b = evalState (execWriterT f) (DS 1 0)
i = nextFresh s - 1
j = numClauses s

Resources