Attoparsec Iteratee - haskell

I wanted, just to learn a bit about Iteratees, reimplement a simple parser I made, using Data.Iteratee and Data.Attoparsec.Iteratee. I'm pretty much stumped though. Below I have a simple example that is able to parse one line from a file. My parser reads one line at a time, so I need a way of feeding lines to the iteratee until it's done. I've read all I've found googling this, but a lot of the material on iteratee/enumerators is pretty advanced. This is the part of the code that matters:
-- There are more imports above.
import Data.Attoparsec.Iteratee
import Data.Iteratee (joinI, run)
import Data.Iteratee.IO (defaultBufSize, enumFile)
line :: Parser ByteString -- left the implementation out (it doesn't check for
new line)
iter = parserToIteratee line
main = do
p <- liftM head getArgs
i <- enumFile defaultBufSize p $ iter
i' <- run i
print i'
This example will parse and print one line from a file with multiple lines. The original script mapped the parser over a list of ByteStrings. So I would like to do the same thing here. I found enumLinesin Iteratee, but I can't for the life of me figure out how to use it. Maybe I misunderstand its purpose?

Since your parser works on a line at a time, you don't even need to use attoparsec-iteratee. I would write this as:
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Attoparsec as A
parser :: Parser ParseOutput
type POut = Either String ParseOutput
processLines :: Iteratee ByteString IO [POut]
processLines = joinI $ (enumLinesBS ><> I.mapStream (A.parseOnly parser)) stream2list
The key to understanding this is the "enumeratee", which is just the iteratee term for a stream converter. It takes a stream processor (iteratee) of one stream type and converts it to work with another stream. Both enumLinesBS and mapStream are enumeratees.
To map your parser over multiple lines, mapStream is sufficient:
i1 :: Iteratee [ByteString] IO (Iteratee [POut] IO [POut]
i1 = mapStream (A.parseOnly parser) stream2list
The nested iteratees just mean that this converts a stream of [ByteString] to a stream of [POut], and when the final iteratee (stream2list) is run it returns that stream as [POut]. So now you just need the iteratee equivalent of lines to create that stream of [ByteString], which is what enumLinesBS does:
i2 :: Iteratee ByteString IO (Iteratee [ByteString] IO (Iteratee [POut] m [POut])))
i2 = enumLinesBS $ mapStream f stream2list
But this function is pretty unwieldy to use because of all the nesting. What we really want is a way to pipe output directly between stream converters, and at the end simplify everything to a single iteratee. To do this we use joinI, (><>), and (><>):
e1 :: Iteratee [POut] IO a -> Iteratee ByteString IO (Iteratee [POut] IO a)
e1 = enumLinesBS ><> mapStream (A.parseOnly parser)
i' :: Iteratee ByteString IO [POut]
i' = joinI $ e1 stream2list
which is equivalent to how I wrote it above, with e1 inlined.
There's still important element remaining though. This function simply returns the parse results in a list. Typically you would want to do something else, such as combine the results with a fold.
edit: Data.Iteratee.ListLike.mapM_ is often useful to create consumers. At that point each element of the stream is a parse result, so if you want to print them you can use
consumeParse :: Iteratee [POut] IO ()
consumeParse = I.mapM_ (either (\e -> return ()) print)
processLines2 :: Iteratee ByteString IO ()
processLines2 = joinI $ (enumLinesBS ><> I.mapStream (A.parseOnly parser)) consumeParse
This will print just the successful parses. You could easily report errors to STDERR, or handle them in other ways, as well.

Related

Monad transformers with IO and Maybe

I am trying to stack up IO and Maybe monads but either I don't understand monad transformers well enough or this is not possible using transformers. Can some one help me understand this?
f :: String -> Maybe String
main :: IO ()
main = do
input <- getLine -- IO String
output <- f input -- Maybe String (Can't extract because it is IO do block)
writeFile "out.txt" output -- gives error because writeFile expects output :: String
In the above simplified example, I have a function f that returns a Maybe String and I would like to have a neat way of extracting this in the IO do block. I tried
f :: String -> MaybeT IO String
main :: IO ()
main = do
input <- getLine -- IO String
output <- runMaybeT (f input) -- Extracts output :: Maybe String instead of String
writeFile "out.txt" output -- gives error because writeFile expects output :: String
which lets me extract the Maybe String out in the second line of do block but I need to extract the string out of that. Is there a way to do this without using case?
Let's stick for a moment with your first snippet. If f input is a Maybe String, and you want to pass its result to writeFile "out.txt", which takes a String, you need to deal with the possibility of f input being Nothing. You don't have to literally use a case-statement. For instance:
maybe from the Prelude is case analysis packaged as a function;
fromMaybe from Data.Maybe lets you easily supply a default value, if that makes sense for your use case;
traverse_ and for_ from Data.Foldable could be used to silently ignore Nothing-ness:
for_ (f input) (writeFile "out.txt") -- Does nothing if `f input` is `Nothing`.
Still, no matter what you choose to do, it will involve handling Nothing somehow.
As for MaybeT, you don't really want monad transformers here. MaybeT IO is for when you want something like a Maybe computation but in which you can also include IO computations. If f :: String -> Maybe String already does what you want, you don't need to add an underlying IO layer to it.

Infinite stream of effectful actions

I would like to parse an infinite stream of bytes into an infinite stream of Haskell data. Each byte is read from the network, thus they are wrapped into IO monad.
More concretely I have an infinite stream of type [IO(ByteString)]. On the other hand I have a pure parsing function parse :: [ByteString] -> [Object] (where Object is a Haskell data type)
Is there a way to plug my infinite stream of monad into my parsing function ?
For instance, is it possible to write a function of type [IO(ByteString)] -> IO [ByteString] in order for me to use my function parse in a monad?
The Problem
Generally speaking, in order for IO actions to be properly ordered and behave predictably, each action needs to complete fully before the next action is run. In a do-block, this means that this works:
main = do
sequence (map putStrLn ["This","action","will","complete"])
putStrLn "before we get here"
but unfortunately this won't work, if that final IO action was important:
dontRunMe = do
putStrLn "This is a problem when an action is"
sequence (repeat (putStrLn "infinite"))
putStrLn "<not printed>"
So, even though sequence can be specialized to the right type signature:
sequence :: [IO a] -> IO [a]
it doesn't work as expected on an infinite list of IO actions. You'll have no problem defining such a sequence:
badSeq :: IO [Char]
badSeq = sequence (repeat (return '+'))
but any attempt to execute the IO action (e.g., by trying to print the head of the resulting list) will hang:
main = (head <$> badSeq) >>= print
It doesn't matter if you only need a part of the result. You won't get anything out of the IO monad until the entire sequence is done (so "never" if the list is infinite).
The "Lazy IO" Solution
If you want to get data from a partially completed IO action, you need to be explicit about it and make use of a scary-sounding Haskell escape hatch, unsafeInterleaveIO. This function takes an IO action and "defers" it so that it won't actually execute until the value is demanded.
The reason this is unsafe in general is that an IO action that makes sense now, might mean something different if actually executed at a later time point. As a simple example, an IO action that truncates/removes a file has a very different effect if it's executed before versus after updated file contents are written!
Anyway, what you'd want to do here is write a lazy version of sequence:
import System.IO.Unsafe (unsafeInterleaveIO)
lazySequence :: [IO a] -> IO [a]
lazySequence [] = return [] -- oops, not infinite after all
lazySequence (m:ms) = do
x <- m
xs <- unsafeInterleaveIO (lazySequence ms)
return (x:xs)
The key point here is that, when a lazySequence infstream action is executed, it will actually execute only the first action; the remaining actions will be wrapped up in a deferred IO action that won't truly execute until the second and subsequent elements of the returned list are demanded.
This works for fake IO actions:
> take 5 <$> lazySequence (repeat (return ('+'))
"+++++"
>
(where if you replaced lazySequence with sequence, it would hang). It also works for real IO actions:
> lns <- lazySequence (repeat getLine)
<waits for first line of input, then returns to prompt>
> print (head lns)
<prints whatever you entered>
> length (head (tail lns)) -- force next element
<waits for second line of input>
<then shows length of your second line before prompt>
>
Anyway, with this definition of lazySequence and types:
parse :: [ByteString] -> [Object]
input :: [IO ByteString]
you should have no trouble writing:
outputs :: IO [Object]
outputs = parse <$> lazySequence inputs
and then using it lazily however you want:
main = do
objs <- outputs
mapM_ doSomethingWithObj objs
Using Conduit
Even though the above lazy IO mechanism is pretty simple and straightforward, lazy IO has fallen out of favor for production code due to issues with resource management, fragility with respect to space leaks (where a small change to your code blows up the memory footprint), and problems with exception handling.
One solution is the conduit library. Another is pipes. Both are carefully designed streaming libraries that can support infinite streams.
For conduit, if you had a parse function that created one object per byte string, like:
parse1 :: ByteString -> Object
parse1 = ...
then given:
inputs :: [IO ByteString]
inputs = ...
useObject :: Object -> IO ()
useObject = ...
the conduit would look something like:
import Conduit
main :: IO ()
main = runConduit $ mapM_ yieldM inputs
.| mapC parse1
.| mapM_C useObject
Given that your parse function has signature:
parse :: [ByteString] -> [Object]
I'm pretty sure you can't integrate this with conduit directly (or at least not in any way that wouldn't toss out all the benefits of using conduit). You'd need to rewrite it to be conduit friendly in how it consumed byte strings and produced objects.

How to track progress through a streaming ByteString?

I'm using the streaming-utils streaming-utils to stream a HTTP response body. I want to track the progress similar to how bytestring-progress allows with lazy ByteStrings. I suspect something like toChunks would be necessary, then reducing some cumulative bytes read and returning the original stream unmodified. But I cannot figure it out, and the streaming documentation is very unhelpful, mostly full of grandiose comparisons to alternative libraries.
Here's some code with my best effort so far. It doesn't include the counting yet, and just tries to print the size of chunks as they stream past (and doesn't compile).
download :: ByteString -> FilePath -> IO ()
download i file = do
req <- parseRequest . C.unpack $ i
m <- newHttpClientManager
runResourceT $ do
resp <- http req m
lift . traceIO $ "downloading " <> file
let body = SBS.fromChunks $ mapsM step $ SBS.toChunks $ responseBody resp
SBS.writeFile file body
step bs = do
traceIO $ "got " <> show (C.length bs) <> " bytes"
return bs
What we want is to traverse the Stream (Of ByteString) IO () in two ways:
One that accumulates the incoming lengths of the ByteStrings and prints updates to console.
One that writes the stream to a file.
We can do that with the help of the copy function, which has type:
copy :: Monad m => Stream (Of a) m r -> Stream (Of a) (Stream (Of a) m) r
copy takes a stream and duplicates it into two different monadic layers, where each element of the original stream is emitted by both layers of the new dissociated stream.
(Notice that we are changing the base monad, not the functor. What changing the functor to another Stream does is to delimit groups in a single stream, and we aren't interested in that here.)
The following function takes a stream, copies it, accumulates the length of incoming strings with S.scan, prints them, and returns another stream that you can still work with, for example writing it to a file:
{-# LANGUAGE OverloadedStrings #-}
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString as B
track :: Stream (Of B.ByteString) IO r -> Stream (Of B.ByteString) IO r
track stream =
S.mapM_ (liftIO . print) -- brings us back to the base monad, here another stream
. S.scan (\s b -> s + B.length b) (0::Int) id
$ S.copy stream
This will print the ByteStrings along with the accumulated lengths:
main :: IO ()
main = S.mapM_ B.putStr . track $ S.each ["aa","bb","c"]

Haskell iteratee: simple worked example of stripping trailing whitespace

I'm trying to understand how to use the iteratee library with Haskell. All of the articles I've seen so far seem to focus on building an intuition for how iteratees could be built, which is helpful, but now that I want to get down and actually use them, I feel a bit at sea. Looking at the source code for iteratees has been of limited value for me.
Let's say I have this function which trims trailing whitespace from a line:
import Data.ByteString.Char8
rstrip :: ByteString -> ByteString
rstrip = fst . spanEnd isSpace
What I'd like to do is: make this into an iteratee, read a file and write it out somewhere else with the trailing whitespace stripped from each line. How would I go about structuring that with iteratees? I see there's an enumLinesBS function in Data.Iteratee.Char which I could plumb into this, but I don't know if I should use mapChunks or convStream or how to repackage the function above into an iteratee.
If you just want code, it's this:
procFile' iFile oFile = fileDriver (joinI $
enumLinesBS ><>
mapChunks (map rstrip) $
I.mapM_ (B.appendFile oFile))
iFile
Commentary:
This is a three-stage process: first you transform the raw stream into a stream of lines, then you apply your function to convert that stream of lines, and finally you consume the stream. Since rstrip is in the middle stage, it will be creating a stream transformer (Enumeratee).
You can use either mapChunks or convStream, but mapChunks is simpler. The difference is that mapChunks doesn't allow for you to cross chunk boundaries, whereas convStream is more general. I prefer convStream because it doesn't expose any of the underlying implementation, but if mapChunks is sufficient the resulting code is usually shorter.
rstripE :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE = mapChunks (map rstrip)
Note the extra map in rstripE. The outer stream (which is the input to rstrip) has type [ByteString], so we need to map rstrip onto it.
For comparison, this is what it would look like if implemented with convStream:
rstripE' :: Enumeratee [ByteString] [ByteString] m a
rstripE' = convStream $ do
mLine <- I.peek
maybe (return B.empty) (\line -> I.drop 1 >> return (rstrip line)) mLine
This is longer, and it's less efficient because it will only apply the rstrip function to one line at a time, even though more lines may be available. It's possible to work on all of the currently available chunk, which is closer to the mapChunks version:
rstripE'2 :: Enumeratee [ByteString] [ByteString] m a
rstripE'2 = convStream (liftM (map rstrip) getChunk)
Anyway, with the stripping enumeratee available, it's easily composed with the enumLinesBS enumeratee:
enumStripLines :: Monad m => Enumeratee ByteString [ByteString] m a
enumStripLines = enumLinesBS ><> rstripE
The composition operator ><> follows the same order as the arrow operator >>>. enumLinesBS splits the stream into lines, then rstripE strips them. Now you just need to add a consumer (which is a normal iteratee), and you're done:
writer :: FilePath -> Iteratee [ByteString] IO ()
writer fp = I.mapM_ (B.appendFile fp)
processFile iFile oFile =
enumFile defaultBufSize iFile (joinI $ enumStripLines $ writer oFile) >>= run
The fileDriver functions are shortcuts for simply enumerating over a file and running the resulting iteratee (unfortunately the argument order is switched from enumFile):
procFile2 iFile oFile = fileDriver (joinI $ enumStripLines $ writer oFile) iFile
Addendum: here's a situation where you would need the extra power of convStream. Suppose you want to concatenate every 2 lines into one. You can't use mapChunks. Consider when the chunk is a singleton element, [bytestring]. mapChunks doesn't provide any way to access the next chunk, so there's nothing else to concatenate with this. With convStream however, it's simple:
concatPairs = convStream $ do
line1 <- I.head
line2 <- I.head
return $ line1 `B.append` line2
this looks even nicer in applicative style,
convStream $ B.append <$> I.head <*> I.head
You can think of convStream as continually consuming a portion of the stream with the provided iteratee, then sending the transformed version to the inner consumer. Sometimes even this isn't general enough, since the same iteratee is called at each step. In that case, you can use unfoldConvStream to pass state between successive iterations.
convStream and unfoldConvStream also allow for monadic actions, since the stream processing iteratee is a monad transformer.

Haskell enumerator: analog to iteratees `enumWith` operator?

Earlier today I wrote a small test app for iteratees that composed an iteratee for writing progress with an iteratee for actually copying data. I wound up with values like these:
-- NOTE: this snippet is with iteratees-0.8.5.0
-- side effect: display progress on stdout
displayProgress :: Iteratee ByteString IO ()
-- side effect: copy the bytestrings of Iteratee to Handle
fileSink :: Handle -> Iteratee ByteString IO ()
writeAndDisplayProgress :: Handle -> Iteratee ByteString IO ()
writeAndDisplayProgress handle = sequence_ [fileSink handle, displayProgress]
In looking at the enumerator library, I don't see an analog of sequence_ or enumWith. All I want to do is compose two iteratees so they act as one. I could discard the result (it's going to be () anyway) or keep it, I don't care. (&&&) from Control.Arrow is what I want, only for iteratees rather than arrows.
I tried these two options:
-- NOTE: this snippet is with enumerator-0.4.10
run_ $ enumFile source $$ sequence_ [iterHandle handle, displayProgress]
run_ $ enumFile source $$ sequence_ [displayProgress, iterHandle handle]
The first one copies the file, but doesn't show progress; the second one shows progress, but doesn't copy the file, so obviously the effect of the built-in sequence_ on enumerator's iteratees is to run the first iteratee until it terminates and then run the other, which is not what I want. I want to be running the iteratees in parallel rather than serially. I feel like I'm missing something obvious, but in reading the wc example for the enumerator library, I see this curious comment:
-- Exactly matching wc's output is too annoying, so this example
-- will just print one line per file, and support counting at most
-- one statistic per run
I wonder if this remark indicates that combining or composing iteratees within the enumerations framework isn't possible out of the box. What's the generally-accepted right way to do this?
Edit:
It seems as though there is no built-in way to do this. There's discussion on the Haskell mailing list about adding combinators like enumSequence and manyToOne but so far, there doesn't seem to be anything actually in the enumerator package that furnishes this capability.
It seems to me like rather than trying to have two Iteratees consume the sequence in parallel, it would be better to feed the stream through an identity Enumeratee that simply counts the bytes passing it.
Here's a simple example that copies a file and prints the number of bytes copied after each chunk.
import System.Environment
import System.IO
import Data.Enumerator
import Data.Enumerator.Binary (enumFile, iterHandle)
import Data.Enumerator.List (mapAccumM)
import qualified Data.ByteString as B
printBytes :: Enumeratee B.ByteString B.ByteString IO ()
printBytes = flip mapAccumM 0 $ \total bytes -> do
let total' = total + B.length bytes
print total'
return (total', bytes)
copyFile s t = withBinaryFile t WriteMode $ \h -> do
run_ $ (enumFile s $= printBytes) $$ iterHandle h
main = do
[source, target] <- getArgs
copyFile source target

Resources