Having trouble finishing off this enumeratee - haskell

At one point I wrote a packet capture program in haskell and it used lazy IO to catch all the tcp packets. The problem was that sometimes packets are out of order, so I had to insert all of them into a list until I got a fin flag to be sure that I had all the packets necessary to do anything with them, and if I was sniffing something really big, like a video, I had to hold all that in memory. To do it any other way would require some difficult imperative code.
So later I learned about iteratees, and I decided to implement my own. How it would work is, there is an enumeratee. You supply it with the number of packets you want it to hold. As it pulls in packets, it sorts them, and then once it gets up to the number you specify, it starts flushing, but leaves a few in there so that new chunks are sorted into that list before more packets are flushed. The idea is that chunks will be almost in order before they hit this enumeratee, and it will fix most small order problems. When it gets an EOF, it should send all remaining packets back out.
So it almost works. I realize some of these could be replaced by standard enumerator functions, but I wanted to write them myself to understand how it works better. Here's some code:
readLines reads a file one line at a time and feeds each line downstream.
printLines just prints each chunk.
numbers.txt is a line-delimited set of numbers that are slightly out of order; some numbers are several places before or after where they should be.
reorder is the function that holds n numbers, sorts new ones into its accumulator list, and then shoves out all but the last n of those numbers.
import Prelude as P
import Data.Enumerator as E
import Data.Enumerator.List as EL
import Data.List (sort, insert)
import System.IO
import Control.Monad.Trans (lift)
import Control.Monad (liftM)
import Control.Exception as Exc
import Debug.Trace
test = run_ (readLines "numbers.txt" $$ EL.map (read :: String -> Int) =$ reorder 10 =$ printLines)
reorder :: (Show a, Ord a, Monad m) => Int -> Enumeratee a a m b
reorder n step = reorder' [] n step
  where
    reorder' acc n (Continue k) =
      let
        len = P.length
        loop buf n' (Chunks xs)
          | (n' - len xs >= 0) = continue (loop (foldr insert buf xs) (n' - len xs))
          | otherwise =
              let allchunx = foldr insert buf xs
                  (excess, store) = P.splitAt (negate (n' - len xs)) allchunx
              in k (Chunks excess) >>== reorder' store 0
        loop buf n' (EOF) = k (Chunks (trace ("buf:" ++ show buf) buf)) >>== undefined
      in continue (loop acc n)
printLines :: (Show a) => Iteratee a IO ()
printLines = continue loop
  where
    loop (Chunks []) = printLines
    loop (Chunks (x:xs)) = do
      lift $ print x
      printLines
    loop (EOF) = yield () EOF
readLines :: FilePath -> Enumerator String IO ()
readLines filename s = do
  h <- tryIO $ openFile filename ReadMode
  Iteratee (Exc.finally (runIteratee $ checkContinue0 (blah h) s) (hClose h))
  where
    blah h loop k = do
      x <- lift $ myGetLine h
      case x of
        Nothing -> continue k
        Just line -> k (Chunks [line]) >>== loop
    myGetLine h = Exc.catch (liftM Just (hGetLine h)) checkError
    checkError :: IOException -> IO (Maybe String)
    checkError e = return Nothing
My problem is at the undefined in reorder. What happens is reorder has 10 items stuck in it, and then it receives an EOF from up the stack. So it goes k (Chunks those10items) and then there is an undefined because I don't know what to put here to make it work.
What happens is that the last 10 items get chopped out of the output of the program. You can see the trace, that variable buf has all the remaining items in it. I have tried yielding, but I'm not sure what to yield or if I should yield at all. I'm not sure what to put there to make this work.
Edit: Turns out the reorder was fixed by changing the undefined part of the loop to:
loop buf n' EOF = k (Chunks buf) >>== (\s -> yield s EOF)
which I almost definitely had at one point, but I didn't get the right answer so I assumed it was wrong.
The problem was with printLines. Since reorder was sending out chunks one at a time until it got to the very end, I never noticed the problem with printLines which was that it was discarding chunks other than the first one per loop. In my head I thought that the chunks would carry over or something, which was stupid.
Anyways I changed printLines to this:
printLines :: (Show a) => Iteratee a IO ()
printLines = continue loop
  where
    loop (Chunks []) = printLines
    loop (Chunks xs) = do
      lift $ mapM_ print xs
      printLines
    loop (EOF) = yield () EOF
And now it works. Thanks a lot, I was afraid I wouldn't get an answer.
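The buffering idea behind reorder can be tried out on plain lists, away from the iteratee machinery. The following is only a minimal sketch: reorderWindow is a hypothetical helper, not part of the enumerator package, and it emits one element at a time rather than in chunks. It keeps at most n elements in a sorted buffer and flushes the buffer when the input ends, which corresponds to the EOF case above.

```haskell
import Data.List (insert)

-- | Emit elements in nearly sorted order while holding back at most
-- @n@ of them in a sorted buffer (hypothetical helper for illustration).
reorderWindow :: Ord a => Int -> [a] -> [a]
reorderWindow n = go []
  where
    -- End of input: flush whatever is still buffered (the EOF case).
    go buf [] = buf
    go buf (x : xs) =
      case insert x buf of
        -- Buffer would exceed n elements: release the smallest one.
        smallest : rest | length rest >= n -> smallest : go rest xs
        -- Otherwise keep filling the buffer.
        buf' -> go buf' xs
```

An element that is displaced by more than n positions still comes out of order, so n has to be chosen to match how scrambled the input can get.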

How about
loop buf n' (EOF) = k (Chunks buf) >>== (\s -> yield s EOF)
(idea taken from EB.isolate).
Depending on what exactly you're trying to do, your printLines may also need fixing; the case for Chunks (x:xs) throws away xs. Something like
loop (Chunks (x:xs)) = do
  lift $ print x
  loop (Chunks xs)
may (or may not) have been what you intended.

Related

Use two monads without a transformer

In order to understand how to use monad transformers, I wrote the following code without one. It reads standard input line by line and displays each line reversed until an empty line is encountered. It also counts the lines using State and in the end displays the total number.
import Control.Monad.State
main = print =<< fmap (`evalState` 0) go where
  go :: IO (State Int Int)
  go = do
    l <- getLine
    if null l
      then return get
      else do
        putStrLn (reverse l)
        -- another possibility: fmap (modify (+1) >>) go
        rest <- go
        return $ do
          modify (+1)
          rest
I wanted to add the current line number before each line. I was able to do it with StateT:
import Control.Monad.State
main = print =<< evalStateT go 0 where
  go :: StateT Int IO Int
  go = do
    l <- lift getLine
    if null l
      then get
      else do
        n <- get
        lift (putStrLn (show n ++ ' ' : reverse l))
        modify (+1)
        go
My question is: how to do the same in the version without monad transformers?
The problem you're having is that the hand-unrolling of StateT s IO a is s -> IO (s, a), not IO (s -> (s, a))! (The library's own definition orders the pair as (a, s), but that makes no difference here.) Once you have this insight, it's pretty easy to see how to do it:
go :: Int -> IO (Int, Int)
go s = do
  l <- getLine
  if null l
    then return (s, s)
    else do
      putStrLn (show s ++ ' ' : reverse l)
      go (s+1)
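To make the unrolling concrete, here is a rough sketch of a hand-rolled state-over-IO type. The names StateIO, getS, modifyS and numberLines are made up for this illustration; the pair is ordered (a, s) as in the transformers library. It numbers a list of lines instead of reading stdin so that it is easy to check.

```haskell
-- | A hand-rolled state-over-IO type (hypothetical name); it has the
-- same shape as StateT s IO a, i.e. a function from a state to an IO action.
newtype StateIO s a = StateIO { runStateIO :: s -> IO (a, s) }

instance Functor (StateIO s) where
  fmap f (StateIO g) = StateIO $ \s -> do
    (a, s') <- g s
    return (f a, s')

instance Applicative (StateIO s) where
  pure a = StateIO $ \s -> return (a, s)
  StateIO f <*> StateIO g = StateIO $ \s -> do
    (h, s')  <- f s
    (a, s'') <- g s'
    return (h a, s'')

instance Monad (StateIO s) where
  -- Thread the state through the first action, then through the second.
  StateIO g >>= k = StateIO $ \s -> do
    (a, s') <- g s
    runStateIO (k a) s'

-- | Read the current state.
getS :: StateIO s s
getS = StateIO $ \s -> return (s, s)

-- | Update the current state.
modifyS :: (s -> s) -> StateIO s ()
modifyS f = StateIO $ \s -> return ((), f s)

-- | Prefix each reversed line with its number, counting from the state.
numberLines :: [String] -> StateIO Int [String]
numberLines = mapM $ \l -> do
  n <- getS
  modifyS (+ 1)
  return (show n ++ ' ' : reverse l)
```

Running runStateIO (numberLines ls) 0 yields the numbered lines together with the final count, which is the same information the StateT version produces.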
You'd just need to run the accumulated state computation on every line. This is O(n²) time, but since your first program is already using O(n) space, that's not too terrible. Of course, the StateT approach is superior in pretty much every way! If you really want to do it "by hand" and not pay an efficiency price, just manage the state by hand instead of building a state transformer at all. You're really not getting any benefit by using State instead of Int in the first program.
Maybe this is what you are looking for?
main = print =<< fmap (`evalState` 0) (go get) where
  go :: State Int Int -> IO (State Int Int)
  go st = do
    l <- getLine
    if null l
      then return (st >>= \_ -> get)
      else do
        let ln = evalState st 0
        putStrLn (show ln ++ ' ' : reverse l)
        go (st >>= \_ -> modify (+1) >>= \_ -> get)
The idea here is to make go tail recursive, building up your state computation, which you can then evaluate at each step.
EDIT
This version will bound the size of the state computation to a constant size, although under lazy evaluation, when the previous state computation is forced, we should be able to reuse it without re-evaluating it, so I'm guessing that these are essentially the same...
main = print =<< fmap (`evalState` 0) (go get) where
  go :: State Int Int -> IO (State Int Int)
  go st = do
    l <- getLine
    if null l
      then return st
      else do
        let ln = evalState st 0
        putStrLn (show ln ++ ' ' : reverse l)
        go (modify (\s -> s+ln+1) >>= \_ -> get)

Reading multiline user's input

I want to lazily read user input and do something with it line by line. But if the user ends a line with , (comma) followed by any number of spaces (including zero), I want to give them the opportunity to finish their input on the next line.
And here is what I've got:
import System.IO
import Data.Char
chop :: String -> [String]
chop = f . map (++ "\n") . lines
  where f [] = []
        f [x] = [x]
        f (x : y : xs) = if (p . tr) x
                           then f ((x ++ y) : xs)
                           else x : f (y : xs)
        p x = (not . null) x && ((== ',') . last) x
        tr xs | all isSpace xs = ""
        tr (x : xs) = x : tr xs
main :: IO ()
main =
  do putStrLn "Welcome to hell, version 0.1.3!"
     putPrompt
     mapM_ process . takeWhile (/= "quit\n") . chop =<< getContents
  where process str = putStr str >> putPrompt
        putPrompt = putStr ">>> " >> hFlush stdout
Sorry, it doesn't work at all. Bloody mess.
P.S. I want to preserve \n characters on end of every chunk. Currently I add them manually with map (++ "\n") after lines.
How about changing the type of chop a little:
readMultiLine :: IO [String]
readMultiLine = do
  ln <- getLine
  if (endswith (rstrip ln) ",") then
    liftM (ln:) readMultiLine
  else
    return [ln]
Now you know that if the last list is not empty, then the user didn't finish typing (the last input ended with ',').
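Of course, either import Data.String.Utils or write your own helpers. Note that Data.String.Utils.endswith is an alias for isSuffixOf and takes the suffix as its first argument, so with that import the call above would need its arguments swapped. Your own version could be as simple as: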
endswith xs ys = (length xs >= length ys)
              && (and $ zipWith (==) (reverse xs) (reverse ys))
rstrip = reverse . dropWhile isSpace . reverse
But I missed the point at first. Here's the actual thing.
unfoldM :: (Monad m) => (a -> Maybe (m b, m a)) -> a -> m [b]
unfoldM f z = case f z of
  Nothing -> return []
  Just (x, y) -> liftM2 (:) x $ y >>= unfoldM f
main = unfoldM (\x -> if (x == ["quit"]) then Nothing
                      else Just (print x, readMultiLine)) =<< readMultiLine
The reason is, you need to be able to insert the "action" to be done on input between reading one multi-line input and the next. Here print x is the action inserted between two readMultiLine
Since you have questions about getContents, let me add. Even though getContents provides a lazy String, its effectful changes to the world are ordered with the subsequent effects of processing the list. But the processing of the list attempts to insert effects between effects of reading particular list items. To do that, you need a function that exposes the chain of effects, so you can insert your own effects between them.
You can do this using pipes, preserving the laziness of the user's input
import Data.Char (isSpace)
import Pipes
import qualified Pipes.Prelude as Pipes
endsWithComma :: String -> Bool
endsWithComma str =
  case (dropWhile isSpace $ reverse str) of
    ',':_ -> True
    _ -> False
finish :: Monad m => Pipe String String m ()
finish = do
  str <- await
  yield str
  if endsWithComma str
    then do
      str' <- await
      yield str'
    else finish
user :: Producer String IO ()
user = Pipes.stdinLn >-> finish
You can then hook up the user Producer to any downstream Consumer. For example, to echo the stream back out you can write:
main = runEffect (user >-> Pipes.stdoutLn)
To learn more about pipes you can read the tutorial.
Sorry, I wrote something wrong in a comment and I thought that now that I understood what you were trying to do, I'd give an answer with a little more substance. The core idea is that you're going to need a state buffer while you loop through the string, as far as I can tell. You have f :: [String] -> [String] but you'll need an extra string of buffer before you can solve this puzzle.
So let me assume an answer which looks like:
chop = joinCommas "" . map (++ "\n") . lines
Then the structure of joinCommas is going to look like:
import Data.List (isSuffixOf)
-- override with however you want to handle the ",\n" between lines.
joinLines = (++)
incomplete = isSuffixOf ",\n"
joinCommas :: String -> [String] -> [String]
joinCommas prefix (line : rest)
  | incomplete prefix = joinCommas (joinLines prefix line) rest
  | otherwise = prefix : joinCommas line rest
joinCommas prefix []
  | incomplete prefix = error "Incomplete input"
  | otherwise = [prefix]
The prefix stores up lines until it doesn't end with ",\n" at which point it emits the prefix and continues with the rest of the lines. On EOF we process the last line unless that line is incomplete.
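A variant of the same buffering idea that also tolerates spaces after the comma, as the question asks, might look like the sketch below. joinContinued is a hypothetical name, and each line is assumed to keep its trailing "\n". Instead of raising an error, it emits an unfinished last line as it is.

```haskell
import Data.Char (isSpace)

-- | True when a line's last non-space character is a comma.
endsWithComma :: String -> Bool
endsWithComma str =
  case dropWhile isSpace (reverse str) of
    ',' : _ -> True
    _       -> False

-- | Merge every run of comma-terminated lines with the line that follows
-- (hypothetical helper; an unfinished final line is emitted unchanged).
joinContinued :: [String] -> [String]
joinContinued [] = []
joinContinued (l : ls)
  | endsWithComma l, (next : rest) <- ls = joinContinued ((l ++ next) : rest)
  | otherwise = l : joinContinued ls
```

With this, chop could be written as joinContinued . map (++ "\n") . lines. A complete line is emitted without looking at the rest of the input, so the function still works on a lazily read stream.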

Why is putStrLn not atomic?

To practice concurrent programming, I wrote the following (suboptimal) program, which repeatedly calculates the first prime bigger than whatever the user inputs:
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Monad (forever)
primeAtLeast n = -- Some pure code that looks up the first prime at least as big as n
outputPrimeAtLeast n = putStrLn $ show $ (n, primeAtLeast n)
main = do
  chan <- newChan
  worker <- forkIO $ forever $ readChan chan >>= outputPrimeAtLeast
  forever $ (readLn :: (IO Int)) >>= (writeChan chan)
  killThread worker
I want to have a worker thread in the background that does the actual calculation and outputs (n, primeAtLeast n) as soon as it's finished.
What it's doing now: As soon as I enter a number n, it immediately outputs the first half "(n,", returns control to the main thread, calculates primeAtLeast n in the background, and outputs the second half "primeAtLeast n)" as soon as it's finished.
So is putStrLn not atomic? Or where is the problem?
Try this:
outputPrimeAtLeast n = let p = primeAtLeast n in p `seq` putStrLn $ show (n, p)
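The above forces the computation of the prime before the putStrLn is run. Without it, show (n, p) is a lazy string: putStrLn writes the characters "(n," as soon as they are available and only then demands p, which is when the expensive computation actually runs.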
Further, you may use print instead of putStrLn . show:
outputPrimeAtLeast n = let p = primeAtLeast n in p `seq` print (n, p)
Alternatively, you may use a putStrLn function which forces every single character before starting printing anything.
strictPutStrLn :: Show a => a -> IO ()
strictPutStrLn x = let str = show x in str `listSeq` putStrLn str
listSeq :: [a] -> b -> b
listSeq [] w = w
listSeq (x:xs) w = x `seq` listSeq xs w
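Another way to get the same effect with only base is Control.Exception.evaluate: forcing the length of the rendered string walks its whole spine, and because show produces each character as it builds that spine, the expensive value inside gets computed before anything is printed. This is only a sketch; renderStrict and putStrLnStrict are made-up names.

```haskell
import Control.Exception (evaluate)

-- | Render a value and force the whole string before returning it
-- (hypothetical helper name).
renderStrict :: Show a => a -> IO String
renderStrict x = do
  let str = show x
  -- Computing the length traverses the entire string, which makes show
  -- run to completion before any output happens.
  _ <- evaluate (length str)
  return str

-- | Print a value only after its textual form is fully computed.
putStrLnStrict :: Show a => a -> IO ()
putStrLnStrict x = renderStrict x >>= putStrLn
```

In the program above, outputPrimeAtLeast n = putStrLnStrict (n, primeAtLeast n) would be the corresponding use.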

Why doesn't this code operate in constant memory?

I'm using Data.Text.Lazy to process some text files. I read in 2 files and distribute their text to 3 files according to some criteria. The loop which does the processing is go'. I've designed it in a way in which it should process the files incrementally and keep nothing huge in memory. However, as soon as the execution reaches the go' part the memory keeps on increasing till it reaches around 90MB at the end, starting from 2MB.
Can someone explain why this memory increase happens and how to avoid it?
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import System.IO
import System.Environment
import Control.Monad
main = do
  [in_en, in_ar] <- getArgs
  [h_en, h_ar] <- mapM (`openFile` ReadMode) [in_en, in_ar]
  hSetEncoding h_en utf8
  en_txt <- TI.hGetContents h_en
  let len = length $ T.lines en_txt
  len `seq` hClose h_en
  h_en <- openFile in_en ReadMode
  hs@[hO_lm, hO_en, hO_ar] <- mapM (`openFile` WriteMode) ["lm.txt", "tun_"++in_en, "tun_"++in_ar]
  mapM_ (`hSetEncoding` utf8) [h_en, h_ar, hO_lm, hO_en, hO_ar]
  [en_txt, ar_txt] <- mapM TI.hGetContents [h_en, h_ar]
  let txts@[_, _, _] = map T.unlines $ go len en_txt ar_txt
  zipWithM_ TI.hPutStr hs txts
  mapM_ (liftM2 (>>) hFlush hClose) hs
  print "success"
  where
    go len en_txt ar_txt = go' (T.lines en_txt) (T.lines ar_txt)
      where (q,r) = len `quotRem` 3000
            go' [] [] = [[],[],[]]
            go' en ar = let (h:bef, aft) = splitAt q en
                            (hA:befA, aftA) = splitAt q ar
                            ~[lm,en',ar'] = go' aft aftA
                        in [bef ++ lm, h:en', hA:ar']
EDIT
As per @kosmikus's suggestion I've tried replacing zipWithM_ TI.hPutStr hs txts with a loop which prints line by line as shown below. The memory consumption is now 2GB+!
fix (\loop lm en ar -> do
  case (en,ar,lm) of
    ([],_,lm) -> TI.hPutStr hO_lm $ T.unlines lm
    (h:t,~(h':t'),~(lh:lt)) -> do
      TI.hPutStrLn hO_en h
      TI.hPutStrLn hO_ar h'
      TI.hPutStrLn hO_lm lh
      loop lt t t')
  lm en ar
What's going on here?
The function go' builds a list with three elements, each of which is itself a list of lines. The list is built lazily: in each step of go', each of the three lists becomes known to a certain extent. However, you consume this structure by printing each element to a file in order, using the line:
zipWithM_ TI.hPutStr hs txts
So the way you consume the data does not match the way you produce the data. While printing the first of the three list elements to a file, the other two are built and kept in memory. Hence the space leak.
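One way to make consumption match production while keeping the splitting logic pure is to produce a single stream of tagged lines and write each line to the handle its tag names, in order. This is a different technique from building three lists, and the sketch below is only an illustration: Target and route are hypothetical names, and q is assumed to be at least 1.

```haskell
-- | Which output file a line belongs to (hypothetical type).
data Target = LM | EN | AR
  deriving (Eq, Show)

-- | Split both inputs into blocks of q lines. The first line of each block
-- goes to EN or AR respectively; the rest of the first input's block goes
-- to LM. Lines are emitted in the order they become available.
route :: Int -> [a] -> [a] -> [(Target, a)]
route q = go
  where
    go [] _ = []
    go _ [] = []
    go en ar =
      case (splitAt q en, splitAt q ar) of
        ((h : bef, aft), (hA : _, aftA)) ->
          (EN, h) : (AR, hA) : [(LM, l) | l <- bef] ++ go aft aftA
        -- Only reachable when q < 1: stop instead of looping forever.
        _ -> []
```

A consumer can then walk this list once with mapM_, choosing the handle from the tag, so no output is held back waiting for another file to be finished.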
Update
I think that for the current example, the easiest fix would be to write to the target files during the loop, i.e., in the go' loop. I'd modify go' as follows:
go' :: [T.Text] -> [T.Text] -> IO ()
go' [] [] = return ()
go' en ar = let (h:bef, aft) = splitAt q en
                (hA:befA, aftA) = splitAt q ar
            in do
              TI.hPutStrLn hO_en h
              TI.hPutStrLn hO_ar hA
              mapM_ (TI.hPutStrLn hO_lm) bef
              go' aft aftA
And then replace the call to go and the subsequent zipWithM_ call with a plain call to:
go hs len en_txt ar_txt

How to monitor computation process in Haskell

I have a function in my main block
map anyHeavyFunction [list]
I'd like to show a progress bar during the computation process or add additional actions (pause, stop process etc.), but because map is a pure function I can't do it directly. I can guess I have to use monads, but what monad is appropriate? IO, State?
I know there is at least one library on Hackage that has some pre-made monad transformers for this task, but I normally turn to the pipes package to roll my own when I need one. I am using pipes-4.0.0, which is going to be on Hackage this weekend, but you can grab it from the GitHub repo before that.
I also used terminal-progress-bar package so that it makes a nice terminal animation as well.
{-# language BangPatterns #-}
import Pipes
import qualified Pipes.Prelude as P
import Control.Monad.IO.Class
import System.ProgressBar
import System.IO ( hSetBuffering, BufferMode(NoBuffering), stdout )
-- | Takes the total size of the stream to be processed as l and passes
-- every element through unchanged, redrawing the bar before each one
progress l = loop 0
  where
    loop n = do
      liftIO $ progressBar (msg "Working") percentage 40 n l
      !x <- await -- bang pattern to make strict
      yield x
      loop (n+1)
main = do
  -- Force progress bar to print immediately
  hSetBuffering stdout NoBuffering
  let n = 10^6
  let heavy x = last . replicate n $ x -- time wasting function
  r <- P.toListM $ each [1..100] >-> P.map heavy >-> progress 100
  putStrLn ""
  return r
This animates:
> Working [=>.......................] 7%
> Working [=====>...................] 20%
Every update erases the last bar, so it only takes up one line on the terminal. Then it finishes like so:
> main
Working [=========================] 100%
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100]
Here's a (kind of) simple answer that I'm not satisfied with. It is based on the fact that @shellenberg wanted to apply a heavy function on each element of a (supposedly long) list. If it suffices to move the "progress bar" once for every element of the list, then the following can be turned into a general solution.
First of all, you need to pick the monad in which you'll work. This depends on what exactly your "progress bar" is. For this discussion, let's say that the IO monad is enough and that we want to alternately display the characters -, /, | and \. You'll also (most probably) need some kind of state S (here it is only the number of elements processed so far, therefore S is Int), so the real monad used will be StateT S IO.
Suppose your original program is:
m = 100000 -- how many elements the list has
-- Your (pure) function
anyHeavyFunction :: Int -> Bool
anyHeavyFunction n =
  length [1..n] + length [n+1..4217] == 4217
-- Your list
list :: [Int]
list = take m $ repeat 4217
-- The main program
main :: IO ()
main = do
  let l = map anyHeavyFunction list
  if and l
    then putStrLn "OK"
    else putStrLn "WRONG"
(Notice that, very conveniently, the heavy function takes the same time for each element of the list.)
This is how you could convert it to display the crude "progress bar":
import Control.Monad.State
import System.IO (hFlush, stdout)
m = 100000 -- how many elements the list has
k = 5000 -- how often you want to "tick"
tick :: a -> StateT Int IO a
tick x = do
  s <- get
  put $ s+1
  when (s `mod` k == 0) $ liftIO $ do
    let r = (s `div` k) `mod` 4
    putChar $ "-/|\\" !! r
    putChar '\b'
    hFlush stdout
  x `seq` return x
-- Your (pure) function
anyHeavyFunction :: Int -> Bool
anyHeavyFunction n =
  length [1..n] + length [n+1..4217] == 4217
-- Your list
list :: [Int]
list = take m $ repeat 4217
-- The main program
main :: IO ()
main = do
  l <- flip evalStateT 0 $ mapM (tick . anyHeavyFunction) list
  if and l
    then putStrLn "OK"
    else putStrLn "WRONG"
An interesting point: The seq in tick forces evaluation of the result for each element of the list. This is enough, if the result has a basic type (Bool here). Otherwise, it's not clear what you would want to do -- remember Haskell is lazy!
If one wants a finer progress bar or if one is not satisfied with the assumption that one "tick" will be counted for each element of the list, then I believe it's necessary to incorporate the ticking in the logic of the heavy function. This makes it ugly... I'd like to see what kind of general solutions can be suggested to that. I'm all in for Haskell, but I think it just sucks for such things as progress bars... There's no free lunch; you can't be pure and lazy and have your progress bars made easy!
EDIT: A version which uses the ProgressBar module suggested by @Davorak. It certainly looks nicer than my rotating bar.
import Control.Monad.State
import System.ProgressBar
import System.IO (hSetBuffering, BufferMode(NoBuffering), stdout)
m = 100000 -- how many elements the list has
k = 5000 -- how often you want to "tick"
tick :: a -> StateT Int IO a
tick x = do
  s <- get
  put $ s+1
  when (s `mod` k == 0) $ liftIO $ do
    progressBar (msg "Working") percentage 40 (toInteger s) (toInteger m)
  x `seq` return x
-- Your (pure) function
anyHeavyFunction :: Int -> Bool
anyHeavyFunction n =
  length [1..n] + length [n+1..4217] == 4217
-- Your list
list :: [Int]
list = take m $ repeat 4217
-- The main program
main :: IO ()
main = do
  hSetBuffering stdout NoBuffering
  l <- flip evalStateT 0 $ mapM (tick . anyHeavyFunction) list
  if and l
    then putStrLn "OK"
    else putStrLn "WRONG"
The idea is the same, the drawbacks too.
You could use parMap to apply the expensive function in parallel (if the dependencies permit) and a list of TVars corresponding to each list (or chunk of) element(s) and set them once the respective function application has completed. A separate thread could check on the values and update the display (obviously some IO action would happen here).
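A stripped-down version of this idea can be sketched with base alone. Note the substitutions: the sketch below uses one forkIO worker, an IORef counter and an MVar for the result instead of parMap and TVars, so it shows the shape of the approach rather than the parallel version described above. startWork is a hypothetical name.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- | Start a background worker that applies f to every element.
-- Returns a counter of finished elements and a slot for the final result
-- (hypothetical helper; results are only forced to weak head normal form).
startWork :: (a -> b) -> [a] -> IO (IORef Int, MVar [b])
startWork f xs = do
  done   <- newIORef 0
  result <- newEmptyMVar
  _ <- forkIO $ do
    ys <- mapM (step done) xs
    putMVar result ys
  return (done, result)
  where
    step done x = do
      -- Force the result here so the work happens in the worker thread.
      y <- evaluate (f x)
      atomicModifyIORef' done (\n -> (n + 1, ()))
      return y
```

The main thread can poll the counter with readIORef to redraw a progress display and call takeMVar on the result slot once it wants the finished list.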
