Limiting memory usage when reading files

I'm a Haskell beginner and thought this would be a good exercise. I have an
assignment where I need to read a file in thread A, process the file's lines
in threads B_i, and then output the results in thread C.
I have gotten this far already, but one of the requirements is that we
cannot trust that the entire file fits into memory. I was hoping that lazy
IO and the garbage collector would take care of this for me, but alas the
memory usage keeps rising and rising.
The reader thread (A) reads the file with readFile; the lines are then zipped
with line numbers and wrapped in Just. These zipped lines are then written
to a Control.Concurrent.Chan. Each consumer thread B has its own channel.
Each consumer reads its own channel when it has data and, if the regex
matches, writes the line to its own output channel wrapped in Maybe (the
output channels are lists held in MVars).
The printer checks the output channel of each of the B threads. If none of
the results (lines) is Nothing, the line is printed. Since at this point
there should be no reference to the older lines, I thought that the garbage
collector would be able to release these lines, but alas I seem to be
wrong here.
The .lhs file is in here:
http://gitorious.org/hajautettujen-sovellusten-muodostamistekniikat/hajautettujen-sovellusten-muodostamistekniikat/blobs/master/mgrep.lhs
So the question is: how do I limit the memory usage, or allow the garbage
collector to remove the lines?
Snippets as requested. Hopefully the indentation isn't too badly destroyed :)
data Global = Global {done :: MVar Bool, consumers :: Consumers}
type Done = Bool
type Linenum = Int
type Line = (Linenum, Maybe String)
type Output = MVar [Line]
type Input = Chan Line
type Consumers = MVar (M.Map ThreadId (Done, (Input, Output)))
type State a = ReaderT Global IO a
producer :: [Input] -> FilePath -> State ()
producer c p = do
  liftIO $ Main.log "Starting producer"
  d <- asks done
  f <- liftIO $ readFile p
  mapM_ (\l -> mapM_
          (liftIO . flip writeChan l) c)
    $ zip [1..] $ map Just $ lines f
  liftIO $ modifyMVar_ d (return . not)
printer :: State ()
printer = do
  liftIO $ Main.log "Starting printer"
  c <- (fmap (map (snd . snd) . M.elems)
         (asks consumers >>= liftIO . readMVar))
  uniq' c
  where head' :: Output -> IO Line
        head' ch = fmap head (readMVar ch)
        tail' = mapM_ (liftIO . flip modifyMVar_
                  (return . tail))
        cont ch = tail' ch >> uniq' ch
        printMsg ch = readMVar (head ch) >>=
          liftIO . putStrLn . fromJust . snd . head
        cempty :: [Output] -> IO Bool
        cempty ch = fmap (any id)
          (mapM (fmap ((==) 0 . length) . readMVar ) ch)
        {- Return false unless none are Nothing -}
        uniq :: [Output] -> IO Bool
        uniq ch = fmap (any id . map (isNothing . snd))
          (mapM (liftIO . head') ch)
        uniq' :: [Output] -> State ()
        uniq' ch = do
          d <- consumersDone
          e <- liftIO $ cempty ch
          if not e
            then do
              u <- liftIO $ uniq ch
              if u then cont ch else do
                liftIO $ printMsg ch
                cont ch
            else unless d $ uniq' ch

Concurrent programming offers no defined execution order unless you enforce one yourself with MVars and the like. So it's likely that the producer thread sticks all or most of the lines in the Chan before any consumer reads them off and passes them on. Another architecture that should fit the requirements is to have thread A call the lazy readFile and stick the result in an MVar. Then each consumer thread takes the MVar, reads a line, then replaces the MVar before proceeding to handle the line. Even then, if the output thread can't keep up, the number of matching lines stored on the Chan there can build up arbitrarily.
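As a rough sketch of the shared-MVar idea (the names sharedLines, nextLine and drain are made up for illustration and are not from the question's code), each consumer could pull one line at a time like this. It assumes only base, and it ignores error handling:

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar, newMVar)

-- Hypothetical helper: put the lazily read lines of a file into an MVar.
sharedLines :: FilePath -> IO (MVar [String])
sharedLines path = readFile path >>= newMVar . lines

-- Take the next item from the shared list and leave the rest for the
-- other consumers. Returns Nothing once the list is exhausted.
nextLine :: MVar [a] -> IO (Maybe a)
nextLine var = modifyMVar var $ \xs -> case xs of
  []       -> return ([], Nothing)
  (y : ys) -> return (ys, Just y)

-- A single consumer that keeps pulling until nothing is left.
drain :: MVar [a] -> IO [a]
drain var = do
  next <- nextLine var
  case next of
    Nothing -> return []
    Just x  -> (x :) <$> drain var
```

Because a consumer only asks for the next line when it is ready for it, the lazy list is forced at the pace of the consumers rather than at the pace of the reader.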
What you have is a push architecture. To really make it work in constant space, think in terms of a demand-driven design. Find a mechanism such that the output thread signals to the processing threads that they should do something, and such that the processing threads signal to the reader thread that it should do something.
Another way to do this is to have chans of limited size instead -- so the reader thread blocks when the processor threads haven't caught up, and so the processor threads block when the output thread hasn't caught up.
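One way to get a size-limited channel with nothing but base is to pair a Chan with a QSem that counts free slots. This is only a sketch: BoundedChan, newBoundedChan, writeBounded, readBounded and countThrough are hypothetical names, and the code is not exception-safe.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.QSem (QSem, newQSem, signalQSem, waitQSem)

-- A channel holding at most a fixed number of unread items.
data BoundedChan a = BoundedChan QSem (Chan a)

newBoundedChan :: Int -> IO (BoundedChan a)
newBoundedChan capacity = do
  slots <- newQSem capacity
  ch <- newChan
  return (BoundedChan slots ch)

-- Blocks while the channel already holds 'capacity' unread items.
writeBounded :: BoundedChan a -> a -> IO ()
writeBounded (BoundedChan slots ch) x = waitQSem slots >> writeChan ch x

-- Takes one item and frees its slot for the writer.
readBounded :: BoundedChan a -> IO a
readBounded (BoundedChan slots ch) = do
  x <- readChan ch
  signalQSem slots
  return x

-- Demo: a producer thread pushes lines, with Nothing marking the end;
-- the calling thread consumes them and counts how many arrived.
countThrough :: Int -> [String] -> IO Int
countThrough capacity ls = do
  bc <- newBoundedChan capacity
  _ <- forkIO $ do
    mapM_ (writeBounded bc . Just) ls
    writeBounded bc Nothing
  let loop :: Int -> IO Int
      loop n = do
        item <- readBounded bc
        case item of
          Nothing -> return n
          Just _  -> loop $! n + 1
  loop 0
```

With a capacity of, say, 4, the producer can never run more than four lines ahead of the consumer, which is what keeps the memory use bounded.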
As a whole, the problem in fact reminds me of Tim Bray's widefinder benchmark, although the requirements are somewhat different. In any case, it led to a widespread discussion on the best way to implement multicore grep. The big punchline was that the problem is IO bound, and you want multiple reader threads over mmapped files.
See here for more than you'll ever want to know: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder

Related

How to get this function to be evaluated lazily

I have the following function:
main = do xs <- getContents
          edLines <- ed $ lines xs
          putStr $ unlines edLines
At first I used the working version main = interact (unlines . ed . lines), but I have since changed the signature of ed. It now returns IO [String] instead of just [String], so I can't use this convenient definition any more.
The problem is that my function ed is still evaluated incrementally, but nothing is displayed until I close stdin with CTRL + D.
Definition of ed:
ed :: Bool -> [EdCmdLine] -> IO EdLines
ed xs = concatM $ map toLinesExt $ scanl (flip $ edLine defHs) (return [Leaf ""]) xs where
  toLinesExt :: IO [EdState] -> IO EdLines
  toLinesExt rsIO = do
    rs@(r:_) <- rsIO -- todo add fallback pattern with (error)
    return $ fromEd r ++ [" "]
The scanl is definitely evaluated lazily, because edLine is being evaluated for sure (observable through its side effects).
I think it could have to do with concatM:
concatM :: (Foldable t, Monad m) => t (m [a]) -> m [a]
concatM xsIO = foldr (\accIO xIO -> do {x <- xIO; acc <- accIO; return $ acc ++ x}) (return []) xsIO

All I/O in Haskell is explicitly ordered. The last two lines of your main function desugar into something like
ed (lines xs) >>= (\edLines -> putStr $ unlines edLines)
>>= sequences all of the I/O effects on the left before all of those on the right. You're constructing an I/O action of the form generate line 1 >> ... >> generate line n >> output line 1 >> ... >> output line n.
This isn't really an evaluation order issue, it's a correctness issue. An implementation is free to evaluate in any order it wants, but it can't change the ordering of I/O actions that you specified, any more than it can reorder the elements of a list.
Here's a toy example showing what you need to do:
lineProducingActions :: [IO String]
lineProducingActions = replicate 10 getLine
wrongOrder, correctOrder :: IO ()
wrongOrder = do
  xs <- sequence lineProducingActions
  mapM_ putStrLn xs
correctOrder = do
  let xs = [x >>= putStrLn | x <- lineProducingActions]
  sequence_ xs
Note that you can decouple the producer and consumer while getting the ordering you want. You just need to avoid combining the I/O actions in the producer. I/O actions are pure values that can be manipulated just like any other values. They aren't side-effectful expressions that happen immediately as they're written. They happen, rather, in whatever order you glue them together in.
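To make the ordering visible without a terminal, here is a small sketch that records events in an IORef instead of doing real I/O. The names logEvent, producers, consume, batched and interleaved are invented for this illustration.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Record an event in a log so the order of effects can be inspected.
logEvent :: IORef [String] -> String -> IO ()
logEvent ref e = modifyIORef ref (e :)

-- The producer hands back a list of actions and does not combine them.
producers :: IORef [String] -> [IO Int]
producers ref = [logEvent ref ("produce " ++ show n) >> return n | n <- [1, 2, 3]]

consume :: IORef [String] -> Int -> IO ()
consume ref n = logEvent ref ("consume " ++ show n)

-- All producers run first, then all consumers.
batched :: IO [String]
batched = do
  ref <- newIORef []
  xs <- sequence (producers ref)
  mapM_ (consume ref) xs
  reverse <$> readIORef ref

-- Each producer is followed immediately by its consumer.
interleaved :: IO [String]
interleaved = do
  ref <- newIORef []
  mapM_ (>>= consume ref) (producers ref)
  reverse <$> readIORef ref
```

Here batched logs the three "produce" events before any "consume" event, while interleaved alternates them. The producer is the same in both cases; only the way the consumer glues the actions together differs.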

You would need to use unsafeInterleaveIO to schedule some of your IO actions for later. Beware that the IO actions may then be executed in a different order than you might first expect!
However, I strongly recommend not doing that. Change your IO [String] action to print each line as it's produced instead.
Alternately, if you really want to maintain the computation-as-pipeline view, check out one of the many streaming libraries available on Hackage (streamly, pipes, iteratees, conduit, machines, and probably half a dozen others).

Thanks to @benrg's answer I was able to solve the issue with the following code:
ed :: [EdCmdLine] -> [IO EdLines]
ed cmds = map (>>= return . toLines . head) $ edHistIO where
  toLines :: EdState -> EdLines
  toLines r = fromEd r ++ [" "]
  edHistIO = edRec defHs cmds (return [initState])
edRec :: [HandleHandler] -> [EdCmdLine] -> IO EdHistory -> [IO EdHistory]
edRec _ [] hist = [hist] -- if CTRL + D
edRec defHs (cmd:cmds) hist = let next = edLine defHs cmd hist in next : edRec defHs cmds next
main = getContents >>= mapM_ (>>= (putStr . unlines)) . ed . lines

Use two monads without a transformer

In order to understand how to use monad transformers, I wrote the following code without one. It reads standard input line by line and displays each line reversed until an empty line is encountered. It also counts the lines using State and in the end displays the total number.
import Control.Monad.State
main = print =<< fmap (`evalState` 0) go where
  go :: IO (State Int Int)
  go = do
    l <- getLine
    if null l
      then return get
      else do
        putStrLn (reverse l)
        -- another possibility: fmap (modify (+1) >>) go
        rest <- go
        return $ do
          modify (+1)
          rest
I wanted to add the current line number before each line. I was able to do it with StateT:
import Control.Monad.State
main = print =<< evalStateT go 0 where
  go :: StateT Int IO Int
  go = do
    l <- lift getLine
    if null l
      then get
      else do
        n <- get
        lift (putStrLn (show n ++ ' ' : reverse l))
        modify (+1)
        go
My question is: how to do the same in the version without monad transformers?

The problem you're having is that the hand-unrolling of StateT s IO a is s -> IO (s, a), not IO (s -> (s, a))! Once you have this insight, it's pretty easy to see how to do it:
go :: Int -> IO (Int, Int)
go s = do
  l <- getLine
  if null l
    then return (s, s)
    else do
      putStrLn (show s ++ ' ' : reverse l)
      go (s+1)

You'd just need to run the accumulated state computation on every line. This is O(n²) time, but since your first program is already using O(n) space, that's not too terrible. Of course, the StateT approach is superior in pretty much every way! If you really want to do it "by hand" and not pay an efficiency price, just manage the state by hand instead of building a state transformer at all. You're really not getting any benefit by using State instead of Int in the first program.

Maybe this is what you are looking for?
main = print =<< fmap (`evalState` 0) (go get) where
  go :: State Int Int -> IO (State Int Int)
  go st = do
    l <- getLine
    if null l
      then return (st >>= \_ -> get)
      else do
        let ln = evalState st 0
        putStrLn (show ln ++ ' ' : reverse l)
        go (st >>= \_ -> modify (+1) >>= \_ -> get)
The idea here is to make go tail recursive, building up your state computation, which you can then evaluate at each step.
EDIT
This version will bound the size of the state computation to a constant size, although under lazy evaluation, when the previous state computation is forced, we should be able to reuse it without re-evaluating it, so I'm guessing that these are essentially the same...
main = print =<< fmap (`evalState` 0) (go get) where
  go :: State Int Int -> IO (State Int Int)
  go st = do
    l <- getLine
    if null l
      then return st
      else do
        let ln = evalState st 0
        putStrLn (show ln ++ ' ' : reverse l)
        go (modify (\s -> s+ln+1) >>= \_ -> get)

Pipeline-like operation using TChan

I want to implement a pipeline between two threads. Thread A takes the data, processes it, and sends it to thread B. I have an MVar that is used to check whether the data has been completely processed.
However, I'm getting an exception: *** Exception: thread blocked indefinitely in an STM transaction
Why are my threads blocked? I thought that once the first thread writes to the channel and there is data on it, the second one can read it.
fstPipe :: (a -> b) -> TChan b -> MVar () -> [a] -> IO ()
fstPipe f chIn m xs = do
  ( mapM_(\x-> atomically $ writeTChan chIn $ f x) xs) >> putMVar m ()
pipelineDone channel mIn = do
  isDone <- fmap isJust $ tryTakeMVar mIn
  isEmpty <- atomically $ isEmptyTChan channel
  return $ isDone && isEmpty
lastPipe f chIn mIn = iter
  where iter = do
          atomically $ fmap f $ readTChan chIn
          isDone <- pipelineDone chIn mIn
          unless isDone $ iter
pipeline = do
  chIn <- atomically newTChan
  m <- newEmptyMVar
  first <- async $ fstPipe reverse chIn m $ replicate 10 [1..500]
  last <- async $ lastPipe print chIn m
  wait first
  wait last

It seems odd to me to be using STM and semaphores in the same code block... Why not do the entire thing in STM?
In particular, why not a TChan (Maybe x), with Nothing indicating the end of the sequence?
Also, notice that your fstPipe likely just generates a bunch of unevaluated thunks and immediately chucks them into the TChan, without actually computing anything. You probably want a seq or similar in there to force some actual work to happen on that thread.
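A minimal sketch of that suggestion, assuming the stm package and using forkIO instead of async to keep the dependencies small. The names produce, consume and runPipeline are hypothetical. Note that seq only forces the result to weak head normal form, so for lists it evaluates just the outermost constructor.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
  ( TChan, atomically, modifyTVar', newTChanIO, newTVarIO
  , readTChan, readTVarIO, writeTChan )

-- Producer: apply f, force the result on this thread, then send it.
-- A final Nothing marks the end of the stream.
produce :: (a -> b) -> TChan (Maybe b) -> [a] -> IO ()
produce f ch xs = do
  mapM_ (\x -> let y = f x in y `seq` atomically (writeTChan ch (Just y))) xs
  atomically (writeTChan ch Nothing)

-- Consumer: run the given action on each item until the end marker arrives.
consume :: (b -> IO ()) -> TChan (Maybe b) -> IO ()
consume act ch = loop
  where
    loop = do
      item <- atomically (readTChan ch)
      case item of
        Nothing -> return ()
        Just y  -> act y >> loop

-- Run both ends and collect what the consumer saw, in arrival order.
runPipeline :: (a -> b) -> [a] -> IO [b]
runPipeline f xs = do
  ch <- newTChanIO
  seen <- newTVarIO []
  _ <- forkIO (produce f ch xs)
  consume (\y -> atomically (modifyTVar' seen (y :))) ch
  reverse <$> readTVarIO seen
```

Since the end of the stream travels through the same channel as the data, the consumer never has to combine two separate checks, and there is no window in which it can block on an empty channel after the producer has finished.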

I think there's a race condition:
1. Stop fstPipe just before the putMVar.
2. Advance lastPipe to read everything, and then call pipelineDone.
3. pipelineDone returns False, since putMVar was not yet done.
4. lastPipe will try to read from the channel.
5. putMVar executes, but it's too late.
Now lastPipe is stuck reading on an empty channel.

Your problem is in the logic of pipelineDone. Currently, you have:
pipelineDone channel mIn = do
  isDone <- fmap isJust $ tryTakeMVar mIn
  isEmpty <- atomically $ isEmptyTChan channel
  return $ isDone && isEmpty
tryTakeMVar is going to take the contents of the MVar assuming there is something in there. Assuming your producer finishes first, it is going to write () into the MVar. Your consumer is then going to try and take the contents of it. If it succeeds, then the MVar goes empty. Any subsequent tryTakeMVar will always return Nothing, thus isDone && isEmpty will always return false and you will keep trying to read from the TChan. Once the TChan goes empty, GHC can tell you that it has encountered a deadlock.
You should instead change your pipelineDone implementation to:
pipelineDone channel mIn = do
  stillRunning <- isEmptyMVar mIn
  isEmpty <- atomically $ isEmptyTChan channel
  return $ (not stillRunning) && isEmpty
This simply polls the MVar instead of actually emptying it.

Haskell: Reading from /proc. Issues with strictness and laziness. Process statistics

I have really strange behaviour while reading files from /proc
If I read /proc/pid/stat lazily with prelude's readFile - it works but not the way I want.
Switching to strict reading with Data.ByteString.readFile gives me an empty string.
I need strict reading here to be able to compare the results of two reads within short interval.
So using System.IO.readFile to read /proc/pid/stat simply does not work. It gives me the same result within 0.5 sec interval. I figure this is due to laziness and half closed handle or something ...
Opening and closing the file handle explicitly works.
h <- openFile "/proc/pid/stat" ReadMode
st1 <- hGetLine h; hClose h
But why do the above if we have the bytestring strict reading. Right?
This is where I got stuck.
import qualified Data.ByteString as B
B.readFile "/proc/pid/stat" >>= print
This always returns an empty string. Also tested in GHCI.
Any suggestions. Thanks.
--- UPDATE ---
Thank you Daniel for suggestions.
This is what I actually need to do. This might help to show my dilemma in full and bring more general suggestions.
I need to calculate process statistics. Here is part of the code (just the CPU usage) as an example.
cpuUsage pid = do
  st1 <- readProc $ "/proc" </> pid </> "stat"
  threadDelay 500000 -- 0.5 sec
  st2 <- readProc $ "/proc" </> pid </> "stat"
  let sum1 = (read $ words st1 !! 13) +
             (read $ words st1 !! 14)
      sum2 = (read $ words st2 !! 13) +
             (read $ words st2 !! 14)
  return $ round $ fromIntegral (sum2 - sum1) * jiffy / delay * 100
  where
    jiffy = 0.01
    delay = 0.5
    readProc f = do
      h <- openFile f ReadMode
      c <- hGetLine h
      hClose h
      return c
Prelude.readFile does not work due to laziness.
Strict functions from ByteString don't work. Thank you Daniel for the explanation.
withFile would work (it closes the handle properly) if I stuffed the whole computation in it, but then the interval would not be strictly 0.5 sec, as the computations take time.
Opening and closing handles explicitly and using hGetContents does not work, for the same reason readFile doesn't.
The only thing that works in this situation is explicitly opening and closing handles with hGetLine, as in the code snippet above. But this is not good enough, as some proc files have more than one line, like /proc/meminfo.
So I need a function that would read the whole file strictly. Something like hGetContents but strict.
I was trying to do this:
readProc f = do
  h <- openFile f ReadMode
  c <- hGetContents h
  let c' = lines c
  hClose h
  return c'
Hoping that lines would trigger it to read the file in full. No luck. Still get an empty list.
Any help or suggestion is very much appreciated.

The ByteString code is
readFile :: FilePath -> IO ByteString
readFile f = bracket (openBinaryFile f ReadMode) hClose
    (\h -> hFileSize h >>= hGet h . fromIntegral)
But /proc/whatever isn't a real file; it's generated on demand, and when you stat it to get the file size, you get 0. So ByteString's readFile successfully reads 0 bytes.
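For the "hGetContents but strict" that the question asks for, one possible sketch using only base is to force the whole string while the handle is still open. The name readFileStrict is made up here. It reads until end of file rather than trusting the reported size, so it should not depend on what stat says about the file.

```haskell
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- Read a whole file and force its contents before the handle is closed.
readFileStrict :: FilePath -> IO String
readFileStrict path = withFile path ReadMode $ \h -> do
  contents <- hGetContents h
  -- Computing the length walks the entire string, so every read
  -- happens here, inside withFile, rather than lazily later on.
  _ <- evaluate (length contents)
  return contents
```

In the cpuUsage snippet from the question, this could stand in for readProc, with lines applied to the result for multi-line files such as /proc/meminfo.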

Before coding this type of thing, it's usually a good idea to check if something already exists on Hackage. In this case, I found the procstat package, which seems to work nicely:
import System.Linux.ProcStat
cpuUsage pid = do
  Just before <- fmap procTotalTime <$> procStat pid
  threadDelay 500000 -- 0.5 sec
  Just after <- fmap procTotalTime <$> procStat pid
  return . round $ fromIntegral (after - before) * jiffy / delay * 100
  where
    procTotalTime info = procUTime info + procSTime info
    jiffy = 0.01
    delay = 0.5

Having trouble finishing off this enumeratee

At one point I wrote a packet capture program in haskell and it used lazy IO to catch all the tcp packets. The problem was that sometimes packets are out of order, so I had to insert all of them into a list until I got a fin flag to be sure that I had all the packets necessary to do anything with them, and if I was sniffing something really big, like a video, I had to hold all that in memory. To do it any other way would require some difficult imperative code.
So later I learned about iteratees, and I decided to implement my own. How it would work is, there is an enumeratee. You supply it with the number of packets you want it to hold. As it pulls in packets, it sorts them, and then once it gets up to the number you specify, it starts flushing, but leaves a few in there so that new chunks are sorted into that list before more packets are flushed. The idea is that chunks will be almost in order before they hit this enumeratee, and it will fix most small order problems. When it gets an EOF, it should send all remaining packets back out.
So it almost works. I realize some of these could be replaced by standard enumerator functions, but I wanted to write them myself to understand how it works better. Here's some code:
Readlines just gets lines from a file one line at a time and feeds it.
PrintLines just prints each chunk.
numbers.txt is a line delimited set of numbers that are slightly out of order, some numbers are several spaces before or after they should be.
Reorder is the function that holds n numbers and sorts new ones into its accumulator list, and then shoves out all but the last n of those numbers.
import Prelude as P
import Data.Enumerator as E
import Data.Enumerator.List as EL
import Data.List (sort, insert)
import IO
import Control.Monad.Trans (lift)
import Control.Monad (liftM)
import Control.Exception as Exc
import Debug.Trace
test = run_ (readLines "numbers.txt" $$ EL.map (read ::String -> Int) =$ reorder 10 =$ printLines)
reorder :: (Show a, Ord a) => (Monad m) => Int -> Enumeratee a a m b
reorder n step = reorder' [] n step
  where
    reorder' acc n (Continue k) =
      let
        len = P.length
        loop buf n' (Chunks xs)
          | (n' - len xs >= 0) = continue (loop (foldr insert buf xs) (n' - len xs))
          | otherwise =
              let allchunx = foldr insert buf xs
                  (excess,store)= P.splitAt (negate (n' - len xs)) allchunx
              in k (Chunks excess) >>== reorder' store 0
        loop buf n' (EOF) = k (Chunks (trace ("buf:" ++ show buf) buf)) >>== undefined
      in continue (loop acc n)
printLines :: (Show a) => Iteratee a IO ()
printLines = continue loop
  where
    loop (Chunks []) = printLines
    loop (Chunks (x:xs)) = do
      lift $ print x
      printLines
    loop (EOF) = yield () EOF
readLines :: FilePath -> Enumerator String IO ()
readLines filename s = do
  h <- tryIO $ openFile filename ReadMode
  Iteratee (Exc.finally (runIteratee $ checkContinue0 (blah h) s) (hClose h))
  where
    blah h loop k = do
      x <- lift $ myGetLine h
      case x of
        Nothing -> continue k
        Just line -> k (Chunks [line]) >>== loop
    myGetLine h = Exc.catch (liftM Just (hGetLine h)) checkError
    checkError :: IOException -> IO (Maybe String)
    checkError e = return Nothing
My problem is at the undefined in reorder. What happens is reorder has 10 items stuck in it, and then it receives an EOF from up the stack. So it goes k (Chunks those10items) and then there is an undefined because I don't know what to put here to make it work.
What happens is that the last 10 items get chopped out of the output of the program. You can see the trace, that variable buf has all the remaining items in it. I have tried yielding, but I'm not sure what to yield or if I should yield at all. I'm not sure what to put there to make this work.
Edit: Turns out the reorder was fixed by changing the undefined part of the loop to:
loop buf n' EOF = k (Chunks buf) >>== (\s -> yield s EOF)
which I almost definitely had at one point, but I didn't get the right answer so I assumed it was wrong.
The problem was with printLines. Since reorder was sending out chunks one at a time until it got to the very end, I never noticed the problem with printLines which was that it was discarding chunks other than the first one per loop. In my head I thought that the chunks would carry over or something, which was stupid.
Anyways I changed printLines to this:
printLines :: (Show a) => Iteratee a IO ()
printLines = continue loop
  where
    loop (Chunks []) = printLines
    loop (Chunks xs) = do
      lift $ mapM_ print xs
      printLines
    loop (EOF) = yield () EOF
And now it works. Thanks a lot, I was afraid I wouldn't get an answer.

How about
loop buf n' (EOF) = k (Chunks buf) >>== (\s -> yield s EOF)
(idea taken from EB.isolate).
Depending on what exactly you're trying to do, your printLines may also need fixing; the case for Chunks (x:xs) throws away xs. Something like
loop (Chunks (x:xs)) = do
  lift $ print x
  loop (Chunks xs)
may (or may not) have been what you intended.
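For checking the expected output independently of the enumerator plumbing, the buffering idea behind reorder can be modelled on plain lists. This is only a simplified sketch and not the enumeratee itself: reorderList is a made-up name, and it releases one element at a time instead of flushing in batches.

```haskell
import Data.List (insert)

-- Pure model of the reordering buffer: keep at most n items sorted, and
-- release the smallest one whenever a new item would overflow the buffer.
reorderList :: Ord a => Int -> [a] -> [a]
reorderList n = go []
  where
    -- End of input (the EOF case): flush everything still buffered.
    go buf [] = buf
    go buf (x : xs) =
      case insert x buf of
        smallest : rest | length rest >= n -> smallest : go rest xs
        buf'                               -> go buf' xs
```

The first equation of go plays the role of the EOF case in reorder: whatever is still held in the buffer has to be emitted, otherwise the last n items are lost.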
