I have probably just spent a day of computation time in vain :)
The problem is that I (naively) wrote about 3.5GB of (compressed) [(Text, HashMap Text Int)] data to a file and at that point my program crashed. Of course there is no final ] at the end of the data and the sheer size of it makes editing it by hand impossible.
The data was formatted via Prelude.show, and only now do I realize that Prelude.read will need to load the whole dataset into memory (impossible) before any data is returned.
Now ... is there a way to recover the data without resorting to writing a parser by hand?
Update 1
import qualified Data.Map as M

main = do
    s <- getContents
    let hs = read s :: [(String, M.Map String Integer)]
    print $ head hs
This I tried ... but it just keeps consuming more memory until it gets killed by the OS.
Sort of. You will still be writing a parser manually... but it is a very short and very easy-to-write parser, because almost all of it will ship out to read. The idea is this: read is strict, but reads, when working on a single element, is lazyish. So we just need to strip out the bits that reads isn't expecting when working on a single element. Here's an example to get you started:
> let s = "[3,4,5," ++ undefined
> reads (drop 1 s) :: [(Int, String)]
[(3,",4,5,*** Exception: Prelude.undefined
I included the undefined at the end as evidence that it is in fact not reading the entire String before producing the parsed 3 at the head of the list.
Daniel's answer can be extended to parse the whole list lazily with the function below. You can then access the result as an ordinary list:
lazyread :: Read a => [Char] -> [a]
lazyread xs = go (tail xs)
  where go xs = a : go (tail b)
          where (a,b) = head $ reads xs
Manually delete the opening '['. After that you might be able to use reads (note the s) to incrementally access getContents.
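A sketch of how the suggestions above might be combined so that a truncated file does not crash the reader. It skips the opening '[', parses one element at a time with reads, and stops quietly at the first element that fails to parse (for example, a half-written last entry). The name lazyReadList is made up for this illustration, and the code assumes the elements are separated by commas the way Prelude.show writes them.

```haskell
import Data.Char (isSpace)

-- Lazily parse the elements of a (possibly truncated) list that was
-- written with 'show'. Parsing stops at the first element that does
-- not parse, so a missing closing ']' is tolerated.
lazyReadList :: Read a => String -> [a]
lazyReadList s = case dropWhile isSpace s of
  '[' : rest -> go rest
  _          -> []
  where
    -- Parse one element, then look for the separating comma.
    go str = case reads str of
      (x, rest) : _ -> x : next (dropWhile isSpace rest)
      []            -> []
    next (',' : rest) = go rest
    next _            = []   -- ']' or end of input
```

With the element type annotated, something like `mapM_ print . lazyReadList =<< getContents` should then print the entries one by one without holding the whole input in memory, as long as nothing else keeps a reference to the input string.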
I want to select the n-th last line from a large text file (~10GB) in a Haskell program.
I found a way to get the n-th last line from an in-memory string:
myLen = 7
n = 3 -- one-based from the end
myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)
main :: IO ()
main = do
    putStrLn theLine
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
So, is readFile the right approach here? Plus how do I get the IO String output from readFile "in a lazy way" into a list of lines so that I can then select the n-th last line?
The question has several parts:
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
Not necessarily. If you only iterate over the contents and produce a result, then the garbage collector should deallocate the contents.
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach because "lazy IO" is a can of worms.
If it's for a quick and dirty script then go ahead, but if not, and if performance is important, then it is probably best to use lower level calls to read strict ByteStrings, and for your problem read directly from the end of the file and process that.
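A rough sketch of the "read directly from the end of the file" idea, using strict ByteStrings and a seek. The names readTail and nthLastLine are invented for this example, and the chunk size is an assumption: the result is only reliable if the last chunk bytes of the file contain more than n complete lines, because the first line of the chunk may be cut in the middle.

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO
  ( IOMode (ReadMode), SeekMode (AbsoluteSeek)
  , hFileSize, hSeek, withBinaryFile )

-- Read (at most) the last 'chunk' bytes of a file, strictly.
readTail :: FilePath -> Integer -> IO B.ByteString
readTail path chunk = withBinaryFile path ReadMode $ \h -> do
  size <- hFileSize h
  let start = max 0 (size - chunk)
  hSeek h AbsoluteSeek start
  B.hGet h (fromIntegral (size - start))

-- Pick the n-th last line (one-based from the end) out of a chunk.
nthLastLine :: Int -> B.ByteString -> Maybe B.ByteString
nthLastLine n bs
  | n < 1     = Nothing
  | otherwise = case drop (n - 1) (reverse (B.lines bs)) of
      (l : _) -> Just l
      []      -> Nothing
```

A caller would pick a chunk size comfortably larger than n times the expected line length, for example `nthLastLine 3 <$> readTail path 65536`, and retry with a larger chunk if the answer is Nothing.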
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs
takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as
main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt" -- "input.txt" is a placeholder path
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of this as operating in "one pass" of the file, looking at subsequent chunks of n consecutive lines until it finds the last chunk. Because it does not maintain any references to the contents of the file from before the chunk it's currently looking at, all of the file contents up to the current chunk can be garbage collected (and generally is, fairly soon).
Consider the two following variations:
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = go []
  where
    go :: [String] -> IO [String]
    go l = do
        inp <- getLine
        if inp == ""
            then return l
            else go (inp:l)

myReadListOrdinary :: IO [String]
myReadListOrdinary = do
    inp <- getLine
    if inp == ""
        then return []
        else do
            moreInps <- myReadListOrdinary
            return (inp:moreInps)
In ordinary programming languages, one would know that the tail recursive variant is a better choice.
However, going through this answer, it is apparent that Haskell does not implement recursion the way strict languages do, by repeatedly pushing frames onto the call stack.
But because in this case the program in question involves actions, and a strict monad, I am not sure if the same reasoning applies. In fact, I think in the IO case, the tail recursive form is indeed better. I am not sure how to correctly reason about this.
EDIT: David Young pointed out that the outermost call here is to (>>=). Even in that case, does one of these styles have an advantage over the other?
FWIW, I'd go for existing monadic combinators and focus on readability/conciseness. Using unfoldM :: Monad m => m (Maybe a) -> m [a]:
import Control.Monad (liftM, mfilter)
import Control.Monad.Loops (unfoldM)
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = unfoldM go
go :: IO (Maybe String)
go = do
    line <- getLine
    return $ case line of
        "" -> Nothing
        s  -> Just s
Or, using the MonadPlus instance of Maybe, with mfilter :: MonadPlus m => (a -> Bool) -> m a -> m a:
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = unfoldM (liftM (mfilter (/= "") . Just) getLine)
Another, more versatile option, might be to use LoopT.
That’s really not how I would write it, but it’s clear enough what you’re doing. (By the way, if you want to be able to efficiently insert arbitrary output from any function in the chain, without using monads, you might try a Data.ByteString.Builder.)
Your first implementation is very similar to a left fold, and your second very similar to a right fold or map. (You might try actually writing them as such!) The second one has several advantages for I/O. One of the most important, for handling input and output, is that it can be interactive.
You’ll notice that the first builds the entire list from the outside in: in order to determine what the first element of the list is, the program needs to compute the entire structure to get to the innermost thunk, which is return l. The program generates the entire data structure first, then starts to process it. That’s useful when you’re reducing a list, because tail-recursive functions and strict left folds are efficient.
With the second, the outermost thunk contains the head and tail of the list, so you can grab the tail, then call the thunk to generate the second list. This can work with infinite lists, and it can produce and return partial results.
Here’s a contrived example: a program that reads in one integer per line and prints the sums so far.
main :: IO ()
main = interact( display . compute 0 . parse . lines )
  where parse :: [String] -> [Int]
        parse [] = []
        parse (x:xs) = (read x):(parse xs)

        compute :: Int -> [Int] -> [Int]
        compute _ [] = []
        compute accum (x:xs) = let accum' = accum + x
                               in accum':(compute accum' xs)

        display = unlines . map show
If you run this interactively, you’ll get something like:
$ 1
1
$ 2
3
$ 3
6
$ 4
10
But you could also write compute tail-recursively, with an accumulating parameter:
main :: IO ()
main = interact( display . compute [] . parse . lines )
  where parse :: [String] -> [Int]
        parse = map read

        compute :: [Int] -> [Int] -> [Int]
        compute xs [] = reverse xs
        compute [] (y:ys) = compute [y] ys
        compute (x:xs) (y:ys) = compute (x+y:x:xs) ys

        display = unlines . map show
This is an artificial example, but strict left folds are a common pattern. If, however, you write either compute or parse with an accumulating parameter, this is what you get when you try to run interactively, and hit EOF (control-D on Unix, control-Z on Windows) after the number 4:
$ 1
$ 2
$ 3
$ 4
1
3
6
10
This left-folded version needs to compute the entire data structure before it can read any of it. That can’t ever work on an infinite list (When would you reach the base case? How would you even reverse an infinite list if you did?) and an application that can’t respond to user input until it quits is a deal-breaker.
On the other hand, the tail-recursive version can be strict in its accumulating parameter, and will run more efficiently, especially when it’s not being consumed immediately. It doesn’t need to keep any thunks or context around other than its parameters, and it can even re-use the same stack frame. A strict accumulating function, such as Data.List.foldl', is a great choice whenever you’re reducing a list to a value, not building an eagerly-evaluated list of output. Functions such as sum, product or any can’t return any useful intermediate value. They inherently have to finish the computation first, then return the final result.
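To make the contrast concrete, here is a small sketch (the names runningSums and total are chosen only for this illustration). The running sums from the first example are what scanl1 (+) computes, and because each output cell depends only on the input seen so far, it can be consumed incrementally, even from an infinite list. A reduction to a single value has nothing to hand back early, which is where foldl' fits.

```haskell
import Data.List (foldl')

-- Lazy producer: each output element is available as soon as the
-- corresponding input element is, so this works on infinite lists.
runningSums :: [Int] -> [Int]
runningSums = scanl1 (+)

-- Strict reduction: there is no useful partial result, so the whole
-- list is consumed with a strictly evaluated accumulator.
total :: [Int] -> Int
total = foldl' (+) 0
```

For instance, `take 4 (runningSums [1 ..])` terminates, while `total [1 ..]` would never return.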
I have a strange whim. Suppose I have something like this:
data Statement = StatementType Stuff Source
Now I want to parse such a statement, parse all the stuff, and after that I want to put all characters that I've processed (for this particular statement) into resulting data structure. For some reason.
Is it possible, and if yes, how to accomplish that?
In general this is not possible. parsec does not expect a lot from its stream type; in particular, there is no way to efficiently split a stream.
But for a concrete stream type (e.g. String, or [a], or ByteString) a hack like this would work:
parseWithSource :: Parsec [c] u a -> Parsec [c] u ([c], a)
parseWithSource p = do
    input  <- getInput
    a      <- p
    input' <- getInput
    return (take (length input - length input') input, a)
This solution relies on the function getInput, which returns the current input. We call it twice, before and after parsing; the difference in length gives the exact number of consumed elements, and we then take that many elements from the original input.
Here you can see it in action:
*Main Text.Parsec> parseTest (between (char 'x') (char 'x') (parseWithSource ((read :: String -> Int) `fmap` many1 digit))) "x1234x"
But you should also look into attoparsec, as it properly supports this functionality with the match function.
I am parsing binary data out of files using Binary.Get and have something like the following:
data FileMessageHeaders = FileMessageHeaders [FileMessageHeader]
data FileMessageHeader = FileMessageHeader ...
instance Binary FileMessageHeaders where
    put = undefined
    get = do
        messages <- untilM get isEmpty
        return (FileMessageHeaders messages)
instance Binary FileMessageHeader where
    put = undefined
    get = ..
The problem I am having is that the untilM from monad-loops on hackage uses sequence so I believe that this is what is causing a massive delay in returning the head of the FileMessageHeader list as the whole file must be read (is this correct?). I am having trouble coming up with a way to rewrite this and avoid sequencing all of the FileMessageHeaders in the file. Any suggestions?
As FUZxxl notes, the problem is untilM; the Get monad is strict and requires that the entire untilM action completes before it returns. IO has nothing to do with it.
The easiest thing to do is probably switch to attoparsec and use that for parsing instead of binary. Attoparsec supports streaming parses and would likely be much easier to use for this case.
If you can't switch to attoparsec, you'll need to use some of the lower-level functions of binary rather than just using the Binary instance. Something like the following (completely untested).
getHeaders :: ByteString -> [FileMessageHeader]
getHeaders b = go b 0
  where
    go bs n
      | B.null bs = []
      | otherwise = let (header, bs', n') = runGetState get bs n
                    in header : go bs' n'
Unfortunately this means you won't be able to use the Binary instance or the get function, you'll have to use getHeaders. It will stream though.
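A variation on the same idea, assuming a version of the binary package that exports runGetOrFail from Data.Binary.Get. The helper name getMany is hypothetical. It runs the decoder repeatedly on the remaining lazy input and returns the results as a lazy list, stopping at the end of the input or at the first decoding failure.

```haskell
import Data.Binary.Get (Get, runGetOrFail)
import qualified Data.ByteString.Lazy as BL

-- Decode values one after another from a lazy ByteString.
-- The result list is produced lazily: the head is available once the
-- first value has been decoded, without reading the rest of the input.
getMany :: Get a -> BL.ByteString -> [a]
getMany g bs
  | BL.null bs = []
  | otherwise  = case runGetOrFail g bs of
      Left _             -> []               -- stop on a decoding failure
      Right (rest, _, x) -> x : getMany g rest
```

For the types in the question, this would be used as something like `getMany (get :: Get FileMessageHeader) contents`, where contents comes from a lazy readFile.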
The problem here is that an IO action has to finish before control flow can continue. Thus, the program has to read in all the messages before any of them can be evaluated. You could try to define your own combinator, sequenceI, that uses the function unsafeInterleaveIO from System.IO.Unsafe. This function allows you, well, to interleave actions. It is used, for instance, by getContents. I would define sequenceI like this:
sequenceI [] = return []
sequenceI (x:xs) = do v <- x
                      vs <- unsafeInterleaveIO $ sequenceI xs
                      return (v:vs)
On top of this combinator, you can define your own untilM that streams. Doing this is left as an exercise to the reader.
Edit (corrected for compilation)
This is a proof-of-concept, untested implementation of untilM:
untilMI f p = do
    f' <- f
    p' <- p
    if p'
        then return [f']
        else do g' <- unsafeInterleaveIO $ untilMI f p
                return (f' : g')
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
module Main where
import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad
type Record = BS.ByteString
importRecords :: String -> IO [Record]
importRecords filename = do
    liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn $ "Done" ++ (show $ last jrs)
    hFlush stdout
    x <- getLine
    return ()
    -- (***)
    loopInput jrs

main = do
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    loopInput jrs
Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
    deriving (Eq, Ord, Show, Read)
This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
        (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
        _          -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance from DeepSeq.
instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:
main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.
FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.
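As an illustration of the incremental approach recommended above, the following sketch folds over the lines strictly and keeps only the wanted field of the most recent line, copied into a strict ByteString so that it holds no reference to the chunks of the input file. The function name lastField is made up for this example, and treating lines with too few columns as "keep the previous value" is an assumption, not something the question specifies.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.List (foldl')

-- Return the given tab-separated column of the last line that has it.
-- Only one small strict ByteString is retained at any time, so the
-- lazily read input can be garbage collected as the fold advances.
lastField :: Int -> BS.ByteString -> Maybe S.ByteString
lastField col = foldl' step Nothing . BS.lines
  where
    step prev line = case drop col (BS.split '\t' line) of
      (f : _) -> let s = BL.toStrict f   -- copy the field out of the input
                 in s `seq` Just s       -- force the copy right away
      []      -> prev                    -- assumption: skip short lines
```

With the file from the question this would be called as something like `lastField 26 <$> BS.readFile filename`.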
There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.