Where is the space leak in this code? (Haskell)

I am trying to split a file into two separate files by alternating lines. (i.e. lines 1,3,5,7.. written to file 1 and lines 2,4,6,8... written to file 2).
The file I am working with is ~700MB, so when I saw the memory usage balloon to over 6GB, I knew something was wrong.
main :: IO ()
main = withFile splitFile ReadMode splitData
  where
    splitData h = do
      dataSet <- lines <$> hGetContents h
      let (s1, s2) = foldl' (\(l, r) x -> (x : r, l)) ([], []) dataSet
      writeFile testFile $ unlines s1
      writeFile trainingFile $ unlines s2
I was initially using the lazy version of foldl, but after some research it seemed that the strict version would help. Alas, it made no noticeable difference. I also tried compiling with -O2, but that did nothing either.
I am using GHC 7.10.2
I'm not getting a stack overflow, so what is it using all that memory for?

As mentioned in a comment by @dfeuer, writeFile forces the entire string it is given to be computed, which in turn forces the entire input to be read. The space leak comes from the fact that the whole second file has to be kept in memory while the first one is being written, even though only one line at a time actually needs to be in memory. And indeed the solution is to write line by line:
import Control.Monad
import System.IO

main :: IO ()
main =
  withFile splitFile ReadMode $ \hIn ->
    withFile testFile WriteMode $ \hOdd ->
      withFile trainingFile WriteMode $ \hEven ->
        zipWithM_ hPutStrLn (cycle [hOdd, hEven]) . lines =<< hGetContents hIn
This program runs in constant space.
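As an aside (not part of the original answer), the same line-by-line strategy carries over to lazy ByteStrings, which usually lowers the constant factors for a ~700MB file. A minimal sketch, with the three file names filled in as placeholder assumptions:
import Control.Monad (zipWithM_)
import qualified Data.ByteString.Lazy.Char8 as BS
import System.IO

-- Assumed names; the question leaves these undefined.
splitFile, testFile, trainingFile :: FilePath
splitFile    = "input.txt"
testFile     = "test.txt"
trainingFile = "training.txt"

main :: IO ()
main =
  withFile splitFile ReadMode $ \hIn ->
    withFile testFile WriteMode $ \hOdd ->
      withFile trainingFile WriteMode $ \hEven ->
        -- Same alternating, line-by-line writes, but over lazy ByteStrings.
        zipWithM_ BS.hPutStrLn (cycle [hOdd, hEven]) . BS.lines
          =<< BS.hGetContents hIn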

Related

How to efficiently reuse input from stdin in Haskell

I understand that I should not try to re-read from stdin, because doing so fails with a "handle is closed" error. For example, in the code below:
main = do
  x <- getContents
  putStrLn $ map id x
  x <- getContents -- problem line
  putStrLn x
the second call x <- getContents will cause the error:
test: <stdin>: hGetContents: illegal operation (handle is closed)
Of course, I can omit the second call to getContents:
main = do
  x <- getContents
  putStrLn $ map id x
  putStrLn x
But will this become a performance/memory issue? Will GHC have to keep all of the contents read from stdin in the main memory?
I imagine the first time around when x is consumed, GHC can throw away the portions of x that are already processed. So theoretically, GHC could only use a small amount of constant memory for the processing. But since we are going to use x again (and again), it seems that GHC cannot throw away anything. (Nor can it read again from stdin).
Is my understanding about the memory implications here correct? And if so, is there a fix?
Yes, your understanding is correct: if you reuse x, GHC has to keep it all in memory.
I think a possible fix is to consume it lazily (once).
Let's say you want to output x to several output handles hdls :: [Handle]. The naive approach is:
main :: IO ()
main = do
  x <- getContents
  forM_ hdls $ \hdl -> do
    hPutStr hdl x
This will read stdin into x as the first hPutStr traverses the string (at least for unbuffered handles, hPutStr is simply a loop that calls hPutChar for each character in the string). From then on it'll be kept in memory for all following hdls.
Alternatively:
main :: IO ()
main = do
  x <- getContents
  forM_ x $ \c -> do
    forM_ hdls $ \hdl -> do
      hPutChar hdl c
Here we've transposed the loops: Instead of iterating over the handles (and for each handle iterating over the input characters), we iterate over the input characters, and for each character, we print it to each handle.
I haven't tested it, but this form should guarantee that we don't need a lot of memory because each input character c is used once and then discarded.
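To make that concrete, here is a self-contained sketch of the transposed loop; the two output file names are assumptions for illustration, since the original question leaves hdls abstract:
import Control.Monad (forM_)
import System.IO

main :: IO ()
main =
  -- Hypothetical output targets standing in for hdls.
  withFile "out1.txt" WriteMode $ \h1 ->
    withFile "out2.txt" WriteMode $ \h2 -> do
      let hdls = [h1, h2]
      x <- getContents
      -- Characters in the outer loop, handles in the inner loop,
      -- so each character can be discarded as soon as it is written.
      forM_ x $ \c ->
        forM_ hdls $ \hdl -> hPutChar hdl c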

Using Data.Binary.decodeFile, encountered error "demandInput: not enough bytes"

I'm attempting to use the encodeFile and decodeFile functions in Data.Binary to save a very large data structure so that I don't have to recompute it every time I run this program. The relevant encoding and decoding functions are as follows:
writePlan :: IO ()
writePlan = do (d, _, bs) <- return subjectDomain
               outHandle <- openFile "outputfile" WriteMode
               ((ebsP, aP), cacheData) <- preplanDomain d bs
               putStrLn "Calculated."
               let toWrite = ((map pseudofyOverEBS ebsP, aP),
                              pseudofyOverMap cacheData) :: WrittenData
                 in do encodeFile preplanFilename $ encode toWrite
                       putStrLn "Done."
readPlan :: IO (([EvaluatedBeliefState], [Action]), MVar HeuCache)
readPlan = do (d, _, _) <- return subjectDomain
              inHandle <- openFile "outputfile" ReadMode
              ((ebsP, aP), cacheData) <- decodeFile preplanFilename :: IO WrittenData
              fancyCache <- newMVar (M.empty, depseudofyOverMap cacheData)
              return $! ((map depseudofyOverEBS ebsP, aP), fancyCache)
The program to calculate and write the file (using writePlan) executes without error, outputting a gigantic binary file. However, when I run the program which takes in this file, executing readPlan results in the error (the program name is "Realtime"):
Realtime: demandInput: not enough bytes
I can't make head nor tail of this, and scouring Google has turned up no substantial documentation or discussion of this message. Any insight would be appreciated!
I am very late to the party, but I found this while looking for help with a similar issue. I'm working with the incremental interface for Data.Binary.Get, where the demandInput function is defined (in Data.Binary.Get.Internal). I am guessing here, but your decodeFile call does some sort of parsing, and the error is thrown because the file does not parse completely, i.e. the parser expects more bytes to follow but hits EOF first.
Hope that helps anyone with this/similar issues!
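One detail worth checking in the question's code: encodeFile already serializes its argument, so encodeFile preplanFilename $ encode toWrite serializes toWrite twice, and a later decodeFile at type WrittenData can then run out of bytes or misparse. A minimal sketch of a symmetric round trip, with a toy type standing in for WrittenData:
import Data.Binary (encodeFile, decodeFile)

-- Toy stand-in for the question's WrittenData.
type Toy = ([Int], String)

main :: IO ()
main = do
  let toWrite = ([1, 2, 3], "cache") :: Toy
  -- Pass the value itself; encodeFile does the serializing.
  encodeFile "outputfile.bin" toWrite
  -- Decode at exactly the type it was encoded at.
  readBack <- decodeFile "outputfile.bin" :: IO Toy
  print readBack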

Using mapM f [list] where f is defined with do notation

I currently have this code which will perform the main' function on each of the filenames in the list files.
Ideally I have been trying to combine main and main' but I haven't made much progress. Is there a better way to simplify this or will I need to keep them separate?
{- Start here -}
main :: IO [()]
main = do
  files <- getArgs
  mapM main' files

{- Main's helper function -}
main' :: FilePath -> IO ()
main' file = do
  contents <- readFile file
  case (runParser parser 0 file $ lexer contents) of
    Left err  -> print err
    Right xs  -> putStr xs
Thanks!
Edit: As most of you are suggesting, I was trying a lambda abstraction for this but wasn't getting it right; I should have specified this above. With the examples I see it better now.
The Control.Monad library defines the function forM, which is mapM with its arguments reversed. That makes it easier to use in your situation, i.e.
main :: IO ()
main = do
  files <- getArgs
  forM_ files $ \file -> do
    contents <- readFile file
    case (runParser parser 0 file $ lexer contents) of
      Left err  -> print err
      Right xs  -> putStr xs
The version with the underscore at the end of the name is used when you are not interested in the resulting list (like in this case), so main can simply have the type IO (). (mapM has a similar variant called mapM_).
You can use forM, which equals flip mapM, i.e. mapM with its arguments flipped, like this:
forM_ files $ \file -> do
  contents <- readFile file
  ...
Also notice that I used forM_ instead of forM. This is more efficient when you are not interested in the result of the computation.
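Since main' already has exactly the type mapM_ expects, the two definitions can also be combined without any lambda at all. A minimal sketch, with a trivial stand-in for the question's parser-driven main':
import System.Environment (getArgs)

main :: IO ()
main = getArgs >>= mapM_ main'

-- Stand-in for the question's parser-driven helper.
main' :: FilePath -> IO ()
main' file = readFile file >>= putStr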

Better data stream reading in Haskell

I am trying to parse an input stream where the first line tells me how many lines of data there are. I'm ending up with the following code, and it works, but I think there is a better way. Is there?
main = do
  numCases <- getLine
  proc $ read numCases

proc :: Integer -> IO ()
proc numCases
  | numCases == 0 = return ()
  | otherwise = do
      str <- getLine
      putStrLn $ findNextPalin str
      proc (numCases - 1)
Note: the code solves the Sphere Online Judge problem https://www.spoj.pl/problems/PALIN/ but I didn't think posting the rest of the code would add anything to the discussion of what to do here.
Use replicate and sequence_.
main, proc :: IO ()
main = do numCases <- getLine
          sequence_ $ replicate (read numCases) proc
proc = do str <- getLine
          putStrLn $ findNextPalin str
sequence_ takes a list of actions, and runs them one after the other, in sequence. (Then it throws away the results; if you were interested in the return values from the actions, you'd use sequence.)
replicate n x makes a list of length n, with each element being x. So we use it to build up the list of actions we want to run.
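As an aside, Control.Monad packages this exact combination as replicateM_. A sketch, assuming the question's findNextPalin is in scope:
import Control.Monad (replicateM_)

main :: IO ()
main = do
  numCases <- getLine
  -- replicateM_ n act == sequence_ (replicate n act)
  replicateM_ (read numCases) proc
  where
    proc = getLine >>= putStrLn . findNextPalin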
Dave Hinton's answer is correct, but as an aside here's another way of writing the same code:
import Control.Applicative
main = (sequence_ . proc) =<< (read <$> getLine)
proc x = replicate x (putStrLn =<< (findNextPalin <$> getLine))
Just to remind everyone that do blocks aren't necessary! Note that in the above, both =<< and <$> stand in for plain old function application. If you ignore both operators, the code reads exactly the same as similarly-structured pure functions would. I've added some gratuitous parentheses to make things more explicit.
As for what they do: <$> applies a regular function inside a monad, while =<< does the same but then collapses an extra layer of the monad (e.g., turning IO (IO a) into IO a).
The interesting part of looking at code this way is that you can mostly ignore where the monads and such are; typically there's very few ways to place the "function application" operators to make the types work.
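To make the "function application" reading concrete, here are the two operators with their types specialized to IO, plus a small example:
-- (<$>) applies a pure function under IO:
--   (<$>) :: (a -> b) -> IO a -> IO b
-- (=<<) applies an IO-producing function and flattens the result:
--   (=<<) :: (a -> IO b) -> IO a -> IO b

-- Both of these read a line and print its length:
example1 :: IO ()
example1 = putStrLn =<< (show . length <$> getLine)

example2 :: IO ()
example2 = getLine >>= putStrLn . show . length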
You (and the previous answers) should work harder to divide up the IO from the logic. Make main gather the input and separately (purely, if possible) do the work.
import Control.Monad -- needed for liftM and replicateM

main = do
  numCases <- liftM read getLine
  lines <- replicateM numCases getLine
  let results = map findNextPalin lines
  mapM_ putStrLn results
When solving SPOJ problems in Haskell, try not to use standard strings at all. ByteStrings are much faster, and I've found you can usually ignore the number of tests and just run a map over everything but the first line, like so:
{-# OPTIONS_GHC -O2 -optc-O2 #-}
import qualified Data.ByteString.Lazy.Char8 as BS

main :: IO ()
main = do
  (l:ls) <- BS.lines `fmap` BS.getContents
  mapM_ (BS.putStrLn . findNextPalin) ls
The SPOJ page in the Haskell Wiki gives a lot of good pointers about how to read Ints from ByteStrings, as well as how to deal with a large quantities of input. It'll help you avoid exceeding the time limit.
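For instance, the test count on the first line can be read without going through String at all. A sketch using BS.readInt, which returns Maybe (Int, rest); the body is a stand-in for the real per-line work:
import qualified Data.ByteString.Lazy.Char8 as BS

main :: IO ()
main = do
  (l:ls) <- BS.lines `fmap` BS.getContents
  case BS.readInt l of
    Nothing     -> error "expected a test count on the first line"
    Just (n, _) -> mapM_ BS.putStrLn (take n ls)  -- stand-in for real work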

In Haskell, I want to read a file and then write to it. Do I need strictness annotation?

Still quite new to Haskell...
I want to read the contents of a file, do something with it possibly involving IO (using putStrLn for now) and then write new contents to the same file.
I came up with:
doit :: String -> IO ()
doit file = do
  contents <- withFile file ReadMode $ \h -> hGetContents h
  putStrLn contents
  withFile file WriteMode $ \h -> hPutStrLn h "new content"
However this doesn't work due to laziness. The file contents are not printed. I found this post which explains it well.
The solution proposed there is to include putStrLn within the withFile:
doit :: String -> IO ()
doit file = do
  withFile file ReadMode $ \h -> do
    contents <- hGetContents h
    putStrLn contents
  withFile file WriteMode $ \h -> hPutStrLn h "new content"
This works, but it's not what I want to do. The operation that will eventually replace putStrLn might take long, and I don't want to keep the file open the whole time. In general I just want to get the file contents out and close the file before working with them.
The solution I came up with is the following:
doit :: String -> IO ()
doit file = do
  c <- newIORef ""
  withFile file ReadMode $ \h -> do
    a <- hGetContents h
    writeIORef c $! a
  d <- readIORef c
  putStrLn d
  withFile file WriteMode $ \h -> hPutStrLn h "Test"
However, I find this long and a bit obfuscated. I don't think I should need an IORef just to get a value out, but I needed a "place" to put the file contents. Also, it still didn't work without the strictness annotation $! for writeIORef. I guess IORefs are not strict by nature?
Can anyone recommend a better, shorter way to do this while keeping my desired semantics?
Thanks!
The reason your first program does not work is that withFile closes the file after executing the IO action passed to it. In your case, the IO action is hGetContents which does not read the file right away, but only as its contents are demanded. By the time you try to print the file's contents, withFile has already closed the file, so the read fails (silently).
You can fix this issue by not reinventing the wheel and simply using readFile and writeFile:
doit file = do
  contents <- readFile file
  putStrLn contents
  writeFile file "new content"
But suppose you want the new content to depend on the old content. Then you cannot, generally, simply do
doit file = do
  contents <- readFile file
  writeFile file $ process contents
because the writeFile may affect what the readFile returns (remember, it has not actually read the file yet). Or, depending on your operating system, you might not be able to open the same file for reading and writing on two separate handles. The simple but ugly workaround is
doit file = do
  contents <- readFile file
  length contents `seq` (writeFile file $ process contents)
which will force readFile to read the entire file and close it before the writeFile action can begin.
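Putting that together, here is a runnable sketch with a hypothetical process (map toUpper, purely for illustration) and an assumed file name:
import Data.Char (toUpper)

-- Hypothetical transformation standing in for the real work.
process :: String -> String
process = map toUpper

doit :: FilePath -> IO ()
doit file = do
  contents <- readFile file
  -- Forcing the length makes readFile consume (and close) the file
  -- before writeFile reopens it.
  length contents `seq` writeFile file (process contents)

main :: IO ()
main = doit "tagfile.txt"  -- assumed file name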
I think the easiest way to solve this problem is using strict IO (System.IO.Strict comes from the strict package):
import qualified System.IO.Strict as S

main = do
  file <- S.readFile "filename"
  writeFile "filename" file
You can duplicate the file Handle, do a lazy write with the original one (to the end of the file) and a lazy read with the other. That way no strictness annotation is involved when appending to the file.
import System.IO
import GHC.IO.Handle

main :: IO ()
main = do
  h <- openFile "filename" ReadWriteMode
  h2 <- hDuplicate h
  hSeek h2 AbsoluteSeek 0
  originalFileContents <- hGetContents h2
  putStrLn originalFileContents
  hSeek h SeekFromEnd 0
  hPutStrLn h $ concatMap ("{new_contents}" ++) (lines originalFileContents)
  hClose h2
  hClose h
The hDuplicate function is provided by the GHC.IO.Handle module. From its documentation:
Returns a duplicate of the original handle, with its own buffer. The two Handles will share a file pointer, however. The original handle's buffer is flushed, including discarding any input data, before the handle is duplicated.
With hSeek you can set the position of the handle before reading or writing.
But I'm not sure how reliable it would be to use AbsoluteSeek 0 instead of SeekFromEnd 0 for writing, i.e. for overwriting the contents. Generally I would suggest writing to a temporary file first, for example using openTempFile (from System.IO), and then replacing the original.
It's ugly but you can force the contents to be read by asking for the length of the input and seq'ing it with the next statement in your do-block. But really the solution is to use a strict version of hGetContents. I'm not sure what it's called.
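For reference, strict reads do exist outside of String IO: readFile in Data.ByteString (and its Char8 wrapper) and readFile in Data.Text.IO (from the text package) both read the whole file eagerly and close it before returning, which makes a following write to the same path safe. A minimal sketch:
import qualified Data.ByteString.Char8 as B
import qualified Data.Text.IO as T

main :: IO ()
main = do
  -- Strict ByteString readFile reads the whole file eagerly.
  bytes <- B.readFile "filename"
  B.putStrLn bytes
  -- Data.Text.IO.readFile is likewise strict.
  txt <- T.readFile "filename"
  T.putStrLn txt
  writeFile "filename" "new content"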
