Forking the streaming flow in haskell-pipes - haskell

I'm having trouble directing flow though a pipeline with haskell-pipes. Basically, I analyze a bunch of files and then I have to either
print results to the terminal in a human-friendly way
encode results to JSON
The chosen path depends upon a command line option.
In the second case, I have to output an opening bracket, then every incoming value followed by a comma and then a closing bracket. Currently insertCommas never terminates, so the closing bracket is never outputted.
import Pipes
import Data.ByteString.Lazy as B
import Data.Aeson (encode)
insertCommas :: Consumer B.ByteString IO ()
insertCommas = do
first <- await
lift $ B.putStr first
for cat $ \obj -> lift $ do
putStr ","
B.putStr obj
jsonExporter :: Consumer (FilePath, AnalysisResult) IO ()
jsonExporter = do
lift $ putStr "["
P.map encode >-> insertCommas
lift $ putStr "]"
exportStream :: Config -> Consumer (FilePath, AnalysisResult) IO ()
exportStream conf =
case outputMode conf of
JSON -> jsonExporter
_ -> P.map (export conf) >-> P.stdoutLn
main :: IO ()
main = do
-- The first two lines are Docopt stuff, not relevant
args <- parseArgsOrExit patterns =<< getArgs
ins <- allFiles $ args `getAllArgs` argument "paths"
let conf = readConfig args
runEffect $ each ins
>-> P.mapM analyze
>-> P.map (filterResults conf)
>-> P.filter filterNulls
>-> exportStream conf

AFAIK a Consumer cannot detect the end of a stream. In order to do that you need to use a Pipes.Parser and invert the control.
Here is a Parser which inserts commas between String elements:
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Parse (draw, evalStateT)
commify = do
lift $ putStrLn "["
m1 <- draw
case m1 of
Nothing -> lift $ putStrLn "]"
Just x1 -> do
lift $ putStrLn x1
let loop = do mx <- draw
case mx of
Nothing -> lift $ putStrLn "]"
Just x -> lift (putStr "," >> putStrLn x) >> loop
loop
test1 = evalStateT commify ( mapM_ yield (words "this is a test") )
test2 = evalStateT commify P.stdinLn
To handle the different output formats I would probably make both formats a Parser:
exportParser = do
mx <- draw
case mx of
Nothing -> return ()
Just x -> (lift $ putStrLn $ export x) >> exportParser
and then:
let parser = case outputMode of
JSON -> commify
_ -> exportParser
evalStateT parser (P.mapM analyze
>-> P.map (filterResults conf)
>-> P.filter filterNulls)
There is probably a slicker way to write exportParser in terms of foldAllM. You can also use the MaybeT transformer to more succinctly write the commify parser. I've written both out explicitly to make them easier to understand.

I think you should 'commify' with pipes-group. It has an intercalates, but not an intersperse, but it's not a big deal to write. You should stay away from the Consumer end, I think, for this sort of problem.
{-#LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy.Char8 as B
import Pipes.Group
import Lens.Simple -- or Control.Lens or Lens.Micro or anything with view/^.
import System.Environment
intersperse_ :: Monad m => a -> Producer a m r -> Producer a m r
intersperse_ a producer = intercalates (yield a) (producer ^. chunksOf 1)
main = do
args <- getArgs
let op prod = case args of
"json":_ -> yield "[" *> intersperse_ "," prod <* yield "]"
_ -> intersperse_ " " prod
runEffect $ op producer >-> P.mapM_ B.putStr
putStrLn ""
where
producer = mapM_ yield (B.words "this is a test")
which give me this
>>> :main json
[this,is,a,test]
>>> :main ---
this is a test

Related

How can I poll a process for it's stdout / stderrr output? Blocked by isEOF

The following example requires the packages of:
- text
- string-conversions
- process
Code:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE LambdaCase #-}
module Example where
import qualified Data.Text as T
import Data.Text (Text)
import Data.Monoid
import Control.Monad.Identity
import System.Process
import GHC.IO.Handle
import Debug.Trace
import Data.String.Conversions
runGhci :: Text -> IO Text
runGhci _ = do
let expr = "print \"test\""
let inputLines = (<> "\n") <$> T.lines expr :: [Text]
print inputLines
createProcess ((proc "ghci" ["-v0", "-ignore-dot-ghci"]) {std_in=CreatePipe, std_out=CreatePipe, std_err=CreatePipe}) >>= \case
(Just pin, Just pout, Just perr, ph) -> do
output <-
forM inputLines (\i -> do
let script = i <> "\n"
do
hPutStr pin $ cs $ script
hFlush pin
x <- hIsEOF pout >>= \case
True -> return ""
False -> hGetLine pout
y <- hIsEOF perr >>= \case
True -> return ""
False -> hGetLine perr
let output = cs $! x ++ y
return $ trace "OUTPUT" $ output
)
let f i o = "ghci>" <> i <> o
let final = T.concat ( zipWith f (inputLines :: [Text]) (output :: [Text]) :: [Text])
print final
terminateProcess ph
pure $ T.strip $ final
_ -> error "Invaild GHCI process"
If I attempt to run the above:
stack ghci src/Example.hs
ghci> :set -XOverloadedStrings
ghci> runGhci ""
["print \"test\"\n"]
It appears to be blocking on hIsEOF perr, according to https://stackoverflow.com/a/26510673/1663462 it sounds like I shouldn't call this function unless there is 'some output' ready to be flushed / read... However how do I handle the case where it does not have any output at that stage? I don't mind periodically 'checking' or having a timeout.
How can I prevent the above from hanging? I've tried various approaches involving hGetContents, hGetLine however they all seem to end up blocking (or closing the handle) in this situation...
I had to use additional threads, MVars, as well as timeouts:
runGhci :: Text -> IO Text
runGhci _ = do
let expr = "123 <$> 123"
let inputLines = filter (/= "") (T.lines expr)
print inputLines
createProcess ((proc "ghci" ["-v0", "-ignore-dot-ghci"]) {std_in=CreatePipe, std_out=CreatePipe, std_err=CreatePipe}) >>= \case
(Just pin, Just pout, Just perr, ph) -> do
output <- do
forM inputLines
(\i -> do
let script = "putStrLn " ++ show magic ++ "\n"
++ cs i ++ "\n"
++ "putStrLn " ++ show magic ++ "\n"
do
stdoutMVar <- newEmptyMVar
stderrMVar <- newMVar ""
hPutStr pin script
hFlush pin
tOutId <- forkIO $ extract' pout >>= putMVar stdoutMVar
tErrId <- forkIO $ do
let f' = hGetLine perr >>= (\l -> modifyMVar_ stderrMVar (return . (++ (l ++ "\n"))))
forever f'
x <- timeout (1 * (10^6)) (takeMVar stdoutMVar) >>= return . fromMaybe "***ghci timed out"
y <- timeout (1 * (10^6)) (takeMVar stderrMVar) >>= return . fromMaybe "***ghci timed out"
killThread tOutId
killThread tErrId
return $ trace "OUTPUT" $ cs $! x ++ y
)
let final = T.concat ( zipWith f (inputLines :: [Text]) (output :: [Text]) :: [Text])
print final
terminateProcess ph
pure $ T.strip $ cs $ final
_ -> error "Invaild GHCI process"

Pipe that maintains state

I'm trying to calculate rolling hash values (buzzhash) for a big file using pipes.
Currently I have this. But don't know how to write a pipe that maintains a state.
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Bits(xor,rotate)
import Data.Array
import Pipes
import Control.Monad.State.Strict
import Control.Monad(forever)
produceFromList (x:xs) = do
yield x
produceFromList xs
buzzHash = do
x <- await
h <- lift $ get -- pull out previous value
let h' = rotate h 1 `xor` (hashArrW8!x) -- calculate new value
lift $ put h' -- save new value
yield h'
stdoutLn :: Consumer Word64 IO ()
stdoutLn = do
a <- await
lift $ print a
main = do
bs <- L.unpack `fmap` L.getContents
runEffect $ produceFromList bs >-> buzzHash >-> stdoutLn
hashArrW8 :: Array Word8 Word64
How do I make buzzHash save previous value and use it for the calculation of next value? Initial state value should be 0.
You were almost there; you just need to run the state.
main = do
bs <- L.unpack `fmap` L.getContents
flip execStateT 0 $ runEffect $ produceList bs >-> buzzHash >-> hoist lift stdoutLn
I assume you don't want to recover the state, so I use execStateT rather than runStateT.
The only curiosity here is that stdoutLn was marked as Consumer Word64 IO () . So I use hoist lift to make it Consumer Word64 (StateT Word64 IO) () Everything in the series a >-> b >-> c must agree in the underlying monad and return type.
Here are a few further comments that might save you time. First produceFromList is each.
Moreover, you could have avoided the hoist lift by relabeling your stdoutLn:
stdoutLn :: MonadIO m => Consumer Word64 m ()
stdoutLn = do
a <- await
liftIO $ print a
But here there is some trouble: you are not repeating the action. This should pretty clearly be a loop:
stdoutLn :: MonadIO m => Consumer Word64 m ()
stdoutLn = do
a <- await
liftIO $ print a
stdoutLn
in fact this is already available as P.print, so we can write
import qualified Pipes.Prelude as P
main = do
bs <- L.unpack `fmap` L.getContents
flip execStateT 0 $ runEffect $ each bs >-> buzzHash >-> P.print
If I understand you, buzzHash is meant to be repeated indefinitely too:
buzzHash = do
x <- await
h <- lift $ get -- pull out previous value
let h' = rotate h 1 `xor` (hashArrW8!x) -- calculate new value
lift $ put h' -- save new value
yield h'
buzzHash
(this is forever buzzHash, where we use your buzzHash)
Finally, if you
import qualified Pipes.ByteString as PB
import Control.Lens (view) -- (or Lens.Micro.MTL or Lens.Simple)
we see we don't need the lazy bytestring IO, which doesn't stream properly anyway.
Pipes.ByteString already has the unpack we want, packaged as a lens, so that we use view PB.unpack where elsewhere we would use B.unpack. So in the end we can write
main = flip evalStateT 0 $ runEffect $ view PB.unpack PB.stdin >-> buzzHash >-> P.print
Once it is in this form we see we aren't using the underlying state of the pipeline except in buzzHash, so we can localize this
import Pipes.Lift (evalStateP)
main = runEffect $ view PB.unpack PB.stdin >-> evalStateP 0 buzzHash >-> P.print
or, if you like you can rewrite
buzzHash' :: Monad m => Word64 -> Pipe Word8 Word64 m r
buzzHash' n = evalStateP n $ forever $ do
x <- await
h <- lift $ get -- pull out previous value
let h' = rotate h 1 `xor` (hashArrW8!x) -- calculate new value
lift $ put h' -- save new value
yield h'
Then you would write
main = runEffect $ view PB.unpack PB.stdin >-> buzzHash' 0 >-> P.print

Using Pipes to read and write binary data in Haskell

I am trying to read and write very many ints in constant memory. I have figured out how to write the ints to memory but have not figured out how to read them back.
import Control.Lens (zoom)
import System.IO (IOMode(..), withFile)
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Parse as P
import qualified Pipes.Binary as P
intStream :: Monad m => Proxy x' x () Int m b
intStream = go (0 :: Int) where
go i = yield i >> go (i + 1)
decoder :: Monad m => Int -> P.Parser P.ByteString m [Int]
decoder n = zoom (P.decoded . P.splitAt n) P.drawAll
main :: IO ()
main = do
withFile "ints" WriteMode $ \h -> do
runEffect $ for intStream P.encode >-> P.take 10000 >-> PB.toHandle h
withFile "ints" ReadMode $ \h -> do
xs <- P.evalStateT (decoder 10000000) (PB.fromHandle h)
print xs
I got the decoder function from the documentation for Pipes.Binary. However it uses drawAll which according to the documentation drawAll is not idiomatic use of Pipes and is provided for testing purposes.
My question is how to modify decoder so that it doesn't use drawAll and thus does not load all the values of xs into memory. So instead of printing the list of xs I could P.map print over a stream of decoded ints being read from the file.
The docs say that decoded is a lens from a stream of bytes to a stream of decoded values. We can get the latter out of the former using view from lens:
decoder :: Monad m => Int -> Producer P.ByteString m a -> Producer Int m ()
decoder n p = void (view P.decoded p) >-> P.take n
main :: IO ()
main = do
withFile "ints" WriteMode $ \h -> do
runEffect $ for intStream P.encode >-> P.take 10000 >-> PB.toHandle h
withFile "ints" ReadMode $ \h -> do
runEffect $ decoder 10000 (PB.fromHandle h) >-> P.print
I don't have much experience with pipes, I just followed the types here. The program seems to function as intended though.

Skipping first line in pipes-attoparsec

My types:
data Test = Test {
a :: Int,
b :: Int
} deriving (Show)
My parser:
testParser :: Parser Test
testParser = do
a <- decimal
tab
b <- decimal
return $ Test a b
tab = char '\t'
Now in order to skip the first line, I do something like this:
import qualified System.IO as IO
parser :: Parser Test
parser = manyTill anyChar endOfLine *> testParser
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
for (parsed (parser <* endOfLine) (fromHandle testHandle)) (lift . print)
But the above parser function makes every alternate link skip (which is obvious). How to only skip the first line in such a way that it works with Pipes ecosystem (Producer should produce a single Test value.) This is one obvious solution which I don't want (the below code will only work if I modify testParser to read newlines) because it returns the entire [Test] instead of a single value:
tests :: Parser [Test]
tests = manyTill anyChar endOfLine *>
many1 testParser
Any ideas to tackle this problem ?
You can drop the first line efficiently in constant space like this:
import Lens.Family (over)
import Pipes.Group (drops)
import Pipes.ByteString (lines)
import Prelude hiding (lines)
dropLine :: Monad m => Producer ByteString m r -> Producer ByteString m r
dropLine = over lines (drops 1)
You can apply dropLine to your Producer before you parse the Producer, like this:
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
let p = dropLine (fromHandle testHandle)
for (parsed (parser <* endOfLine) p) (lift . print)
If the first line doesn't contain any valid Test, you can use Either () Test in order to handle it:
parserEither :: Parser (Either () Test)
parserEither = Right <$> testParser <* endOfLine
<|> Left <$> (manyTill anyChar endOfLine *> pure ())
After this you can use the functions provided by Pipes.Prelude to get rid of the first result (and additionally of all non-parseable lines):
producer p = parsed parserEither p
>-> P.drop 1
>-> P.filter (either (const False) (const True))
>-> P.map (\(Right x) -> x)
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
for (producer (fromHandle testHandle)) (lift . print)

conduit: producing memory leak

Working on some observations on a previous question (haskell-data-hashset-from-unordered-container-performance-for-large-sets) I stumbled upon a strange memory leak
module Main where
import System.Environment (getArgs)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Attoparsec.ByteString (sepBy, Parser)
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import Data.Conduit
import qualified Data.Conduit.Attoparsec as CA
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
main :: IO ()
main = do (args:_) <- getArgs
writeFile "input.txt" $ unlines $ map show [1..4 :: Int]
case args of "list" -> m1
"fail" -> m2
"listlist" -> m3
"memoryleak" -> m4
--UPDATE
"bs-lines":_ -> m5
"bs":_ -> m6
_ -> putStr $ unlines ["Usage: conduit list"
," fail"
," listlist"
," memoryleak"
--UPDATE
," bs-lines"
," bs"
]
m1,m2,m3,m4 :: IO ()
m1 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CB.lines
=$= CA.conduitParser (decimal :: Parser Int)
=$= CL.map snd
=$= CL.consume
print hs
m2 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CA.conduitParser (decimal :: Parser Int)
=$= CL.map snd
=$= CL.consume
print hs
m3 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CB.lines
=$= CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
=$= CL.map snd
=$= CL.consume
print hs
m4 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
=$= CL.map snd
=$= CL.consume
print hs
-- UPDATE
m5 = do inpt <- BS.lines <$> BS.readFile "input.txt"
let Right hs = mapM (parseOnly (decimal :: Parser Int)) inpt
print hs
m6 = do inpt <- BS.readFile "input.txt"
let Right hs = (parseOnly (decimal `sepBy` (char '\n') :: Parser [Int])) inpt
print hs
Here is some example output:
$ > stack exec -- example list
[1234]
$ > stack exec -- example listlist
[[1234]]
$ > stack exec -- conduit fail
conduit: ParseError {errorContexts = [], errorMessage = "Failed reading: takeWhile1", errorPosition = 1:2}
$ > stack exec -- example memoryleak
(Ctrl+C)
-- UPDATE
$ > stack exec -- example bs-lines
[1,2,3,4]
$ > stack exec -- example bs
[1,2,3,4]
Now the questions I have is:
Why is m1 not producing [1,2,3,4]?
Why is m2 failing?
Why is m4 behaving totally different compared to all other versions and producing a space leak?
Why is m2 failing?
The input file as a character stream is:
1\n2\n3\n4\n
Since the decimal parser do not expect a newline character, after consuming the first number the remaining stream is:
\n2\n3\n4\n
As the input stream is not exhausted, conduitParser will run the parser on the stream again, this time it cannot even consume the first character so it failed.
Why is m4 behaving totally different compared to all other versions and producing a space leak?
decimal `sepBy` (char '\n') will only consume \n between two integers, after successfully parsed four numbers, the input stream has only one character in it:
\n
and decimal `sepBy` (char '\n') cannot consume it, even worse it will not fail: sepBy can consume nothing and return empty list. Therefore it parse nothing infinitely and never terminate.
Why is m1 not producing [1,2,3,4]?
I want to know it too! I guess it has something to do with fusing, maybe you should contact the author of conduit package, who just commented your question.
To answer the question about m1: when you use CB.lines, you're turning input that looks like:
["1\n2\n3\n4\n"]
into:
["1", "2", "3", "4"]
Then, attoparsec parses the "1", waits for more input, sees the "2", and so on.

Resources