conduit: producing memory leak - haskell

Working on some observations on a previous question (haskell-data-hashset-from-unordered-container-performance-for-large-sets) I stumbled upon a strange memory leak
module Main where
import System.Environment (getArgs)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Attoparsec.ByteString (sepBy, Parser)
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import Data.Conduit
import qualified Data.Conduit.Attoparsec as CA
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
main :: IO ()
main = do (args:_) <- getArgs
writeFile "input.txt" $ unlines $ map show [1..4 :: Int]
case args of "list" -> m1
"fail" -> m2
"listlist" -> m3
"memoryleak" -> m4
--UPDATE
"bs-lines":_ -> m5
"bs":_ -> m6
_ -> putStr $ unlines ["Usage: conduit list"
," fail"
," listlist"
," memoryleak"
--UPDATE
," bs-lines"
," bs"
]
m1,m2,m3,m4 :: IO ()
m1 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CB.lines
=$= CA.conduitParser (decimal :: Parser Int)
=$= CL.map snd
=$= CL.consume
print hs
m2 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CA.conduitParser (decimal :: Parser Int)
=$= CL.map snd
=$= CL.consume
print hs
m3 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CB.lines
=$= CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
=$= CL.map snd
=$= CL.consume
print hs
m4 = do hs <- runResourceT
$ CB.sourceFile "input.txt"
$$ CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
=$= CL.map snd
=$= CL.consume
print hs
-- UPDATE
m5 = do inpt <- BS.lines <$> BS.readFile "input.txt"
let Right hs = mapM (parseOnly (decimal :: Parser Int)) inpt
print hs
m6 = do inpt <- BS.readFile "input.txt"
let Right hs = (parseOnly (decimal `sepBy` (char '\n') :: Parser [Int])) inpt
print hs
Here is some example output:
$ > stack exec -- example list
[1234]
$ > stack exec -- example listlist
[[1234]]
$ > stack exec -- conduit fail
conduit: ParseError {errorContexts = [], errorMessage = "Failed reading: takeWhile1", errorPosition = 1:2}
$ > stack exec -- example memoryleak
(Ctrl+C)
-- UPDATE
$ > stack exec -- example bs-lines
[1,2,3,4]
$ > stack exec -- example bs
[1,2,3,4]
Now the questions I have is:
Why is m1 not producing [1,2,3,4]?
Why is m2 failing?
Why is m4 behaving totally different compared to all other versions and producing a space leak?

Why is m2 failing?
The input file as a character stream is:
1\n2\n3\n4\n
Since the decimal parser do not expect a newline character, after consuming the first number the remaining stream is:
\n2\n3\n4\n
As the input stream is not exhausted, conduitParser will run the parser on the stream again, this time it cannot even consume the first character so it failed.
Why is m4 behaving totally different compared to all other versions and producing a space leak?
decimal `sepBy` (char '\n') will only consume \n between two integers, after successfully parsed four numbers, the input stream has only one character in it:
\n
and decimal `sepBy` (char '\n') cannot consume it, even worse it will not fail: sepBy can consume nothing and return empty list. Therefore it parse nothing infinitely and never terminate.
Why is m1 not producing [1,2,3,4]?
I want to know it too! I guess it has something to do with fusing, maybe you should contact the author of conduit package, who just commented your question.

To answer the question about m1: when you use CB.lines, you're turning input that looks like:
["1\n2\n3\n4\n"]
into:
["1", "2", "3", "4"]
Then, attoparsec parses the "1", waits for more input, sees the "2", and so on.

Related

How can I poll a process for it's stdout / stderrr output? Blocked by isEOF

The following example requires the packages of:
- text
- string-conversions
- process
Code:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE LambdaCase #-}
module Example where
import qualified Data.Text as T
import Data.Text (Text)
import Data.Monoid
import Control.Monad.Identity
import System.Process
import GHC.IO.Handle
import Debug.Trace
import Data.String.Conversions
runGhci :: Text -> IO Text
runGhci _ = do
let expr = "print \"test\""
let inputLines = (<> "\n") <$> T.lines expr :: [Text]
print inputLines
createProcess ((proc "ghci" ["-v0", "-ignore-dot-ghci"]) {std_in=CreatePipe, std_out=CreatePipe, std_err=CreatePipe}) >>= \case
(Just pin, Just pout, Just perr, ph) -> do
output <-
forM inputLines (\i -> do
let script = i <> "\n"
do
hPutStr pin $ cs $ script
hFlush pin
x <- hIsEOF pout >>= \case
True -> return ""
False -> hGetLine pout
y <- hIsEOF perr >>= \case
True -> return ""
False -> hGetLine perr
let output = cs $! x ++ y
return $ trace "OUTPUT" $ output
)
let f i o = "ghci>" <> i <> o
let final = T.concat ( zipWith f (inputLines :: [Text]) (output :: [Text]) :: [Text])
print final
terminateProcess ph
pure $ T.strip $ final
_ -> error "Invaild GHCI process"
If I attempt to run the above:
stack ghci src/Example.hs
ghci> :set -XOverloadedStrings
ghci> runGhci ""
["print \"test\"\n"]
It appears to be blocking on hIsEOF perr, according to https://stackoverflow.com/a/26510673/1663462 it sounds like I shouldn't call this function unless there is 'some output' ready to be flushed / read... However how do I handle the case where it does not have any output at that stage? I don't mind periodically 'checking' or having a timeout.
How can I prevent the above from hanging? I've tried various approaches involving hGetContents, hGetLine however they all seem to end up blocking (or closing the handle) in this situation...
I had to use additional threads, MVars, as well as timeouts:
runGhci :: Text -> IO Text
runGhci _ = do
let expr = "123 <$> 123"
let inputLines = filter (/= "") (T.lines expr)
print inputLines
createProcess ((proc "ghci" ["-v0", "-ignore-dot-ghci"]) {std_in=CreatePipe, std_out=CreatePipe, std_err=CreatePipe}) >>= \case
(Just pin, Just pout, Just perr, ph) -> do
output <- do
forM inputLines
(\i -> do
let script = "putStrLn " ++ show magic ++ "\n"
++ cs i ++ "\n"
++ "putStrLn " ++ show magic ++ "\n"
do
stdoutMVar <- newEmptyMVar
stderrMVar <- newMVar ""
hPutStr pin script
hFlush pin
tOutId <- forkIO $ extract' pout >>= putMVar stdoutMVar
tErrId <- forkIO $ do
let f' = hGetLine perr >>= (\l -> modifyMVar_ stderrMVar (return . (++ (l ++ "\n"))))
forever f'
x <- timeout (1 * (10^6)) (takeMVar stdoutMVar) >>= return . fromMaybe "***ghci timed out"
y <- timeout (1 * (10^6)) (takeMVar stderrMVar) >>= return . fromMaybe "***ghci timed out"
killThread tOutId
killThread tErrId
return $ trace "OUTPUT" $ cs $! x ++ y
)
let final = T.concat ( zipWith f (inputLines :: [Text]) (output :: [Text]) :: [Text])
print final
terminateProcess ph
pure $ T.strip $ cs $ final
_ -> error "Invaild GHCI process"

Forking the streaming flow in haskell-pipes

I'm having trouble directing flow though a pipeline with haskell-pipes. Basically, I analyze a bunch of files and then I have to either
print results to the terminal in a human-friendly way
encode results to JSON
The chosen path depends upon a command line option.
In the second case, I have to output an opening bracket, then every incoming value followed by a comma and then a closing bracket. Currently insertCommas never terminates, so the closing bracket is never outputted.
import Pipes
import Data.ByteString.Lazy as B
import Data.Aeson (encode)
insertCommas :: Consumer B.ByteString IO ()
insertCommas = do
first <- await
lift $ B.putStr first
for cat $ \obj -> lift $ do
putStr ","
B.putStr obj
jsonExporter :: Consumer (FilePath, AnalysisResult) IO ()
jsonExporter = do
lift $ putStr "["
P.map encode >-> insertCommas
lift $ putStr "]"
exportStream :: Config -> Consumer (FilePath, AnalysisResult) IO ()
exportStream conf =
case outputMode conf of
JSON -> jsonExporter
_ -> P.map (export conf) >-> P.stdoutLn
main :: IO ()
main = do
-- The first two lines are Docopt stuff, not relevant
args <- parseArgsOrExit patterns =<< getArgs
ins <- allFiles $ args `getAllArgs` argument "paths"
let conf = readConfig args
runEffect $ each ins
>-> P.mapM analyze
>-> P.map (filterResults conf)
>-> P.filter filterNulls
>-> exportStream conf
AFAIK a Consumer cannot detect the end of a stream. In order to do that you need to use a Pipes.Parser and invert the control.
Here is a Parser which inserts commas between String elements:
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Parse (draw, evalStateT)
commify = do
lift $ putStrLn "["
m1 <- draw
case m1 of
Nothing -> lift $ putStrLn "]"
Just x1 -> do
lift $ putStrLn x1
let loop = do mx <- draw
case mx of
Nothing -> lift $ putStrLn "]"
Just x -> lift (putStr "," >> putStrLn x) >> loop
loop
test1 = evalStateT commify ( mapM_ yield (words "this is a test") )
test2 = evalStateT commify P.stdinLn
To handle the different output formats I would probably make both formats a Parser:
exportParser = do
mx <- draw
case mx of
Nothing -> return ()
Just x -> (lift $ putStrLn $ export x) >> exportParser
and then:
let parser = case outputMode of
JSON -> commify
_ -> exportParser
evalStateT parser (P.mapM analyze
>-> P.map (filterResults conf)
>-> P.filter filterNulls)
There is probably a slicker way to write exportParser in terms of foldAllM. You can also use the MaybeT transformer to more succinctly write the commify parser. I've written both out explicitly to make them easier to understand.
I think you should 'commify' with pipes-group. It has an intercalates, but not an intersperse, but it's not a big deal to write. You should stay away from the Consumer end, I think, for this sort of problem.
{-#LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy.Char8 as B
import Pipes.Group
import Lens.Simple -- or Control.Lens or Lens.Micro or anything with view/^.
import System.Environment
intersperse_ :: Monad m => a -> Producer a m r -> Producer a m r
intersperse_ a producer = intercalates (yield a) (producer ^. chunksOf 1)
main = do
args <- getArgs
let op prod = case args of
"json":_ -> yield "[" *> intersperse_ "," prod <* yield "]"
_ -> intersperse_ " " prod
runEffect $ op producer >-> P.mapM_ B.putStr
putStrLn ""
where
producer = mapM_ yield (B.words "this is a test")
which give me this
>>> :main json
[this,is,a,test]
>>> :main ---
this is a test

Skipping first line in pipes-attoparsec

My types:
data Test = Test {
a :: Int,
b :: Int
} deriving (Show)
My parser:
testParser :: Parser Test
testParser = do
a <- decimal
tab
b <- decimal
return $ Test a b
tab = char '\t'
Now in order to skip the first line, I do something like this:
import qualified System.IO as IO
parser :: Parser Test
parser = manyTill anyChar endOfLine *> testParser
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
for (parsed (parser <* endOfLine) (fromHandle testHandle)) (lift . print)
But the above parser function makes every alternate link skip (which is obvious). How to only skip the first line in such a way that it works with Pipes ecosystem (Producer should produce a single Test value.) This is one obvious solution which I don't want (the below code will only work if I modify testParser to read newlines) because it returns the entire [Test] instead of a single value:
tests :: Parser [Test]
tests = manyTill anyChar endOfLine *>
many1 testParser
Any ideas to tackle this problem ?
You can drop the first line efficiently in constant space like this:
import Lens.Family (over)
import Pipes.Group (drops)
import Pipes.ByteString (lines)
import Prelude hiding (lines)
dropLine :: Monad m => Producer ByteString m r -> Producer ByteString m r
dropLine = over lines (drops 1)
You can apply dropLine to your Producer before you parse the Producer, like this:
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
let p = dropLine (fromHandle testHandle)
for (parsed (parser <* endOfLine) p) (lift . print)
If the first line doesn't contain any valid Test, you can use Either () Test in order to handle it:
parserEither :: Parser (Either () Test)
parserEither = Right <$> testParser <* endOfLine
<|> Left <$> (manyTill anyChar endOfLine *> pure ())
After this you can use the functions provided by Pipes.Prelude to get rid of the first result (and additionally of all non-parseable lines):
producer p = parsed parserEither p
>-> P.drop 1
>-> P.filter (either (const False) (const True))
>-> P.map (\(Right x) -> x)
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
for (producer (fromHandle testHandle)) (lift . print)

How to properly add IO to attoparsec Parser?

I want to do some tracing/debugging in my attoparsec parser. Here's minimal [not] working example:
import Data.Text as T
import Data.Attoparsec.Text
import Data.Attoparsec.Combinator
import Control.Applicative ((<*), (*>))
parseSentences :: Parser [T.Text]
parseSentences = many1 $ takeWhile1 (/= '.') <* char '.' <* skipSpace
parser :: Parser [T.Text]
parser = do
stuff <- parseSentences
-- putStrLn $ "Got stuff: " ++ show stuff
tail <- takeText
-- putStrLn $ "Got tail: " ++ show tail
return $ stuff ++ [tail, T.pack "more stuff"]
main = do
let input = T.pack "sample. example. bang"
print $ parseOnly parser input
What i have to do in order to use IO actions in my parser?
If you had used the Parsec library, you would have had the possibility of using the Parsec monad transformer for mixing IO and parser commands in your code.
Attoparsec, however, is a pure parser, so you will have to use the Debug.Trace.trace function to output messages to the terminal for debugging purposes.
parser = do
stuff <- parseSentences
tail <- takeText
return .
trace ("Got stuff: " + show stuff) .
trace ("Got tail: " + show tail) $
stuff ++ [tail, T.pack "more stuff"]
The messages will be printed when the associated value (here the result of the expression stuff ++ ...) is evaluated.

Read until end of stream in haskell

I'm fairly new to Haskell, and I'd like to keep reading lines from the console until the end of the stream, and outputting everything I get in upper case. So far, I've got
import Data.Char
main = myLoop
myLoop = do inp <- getLine
if (inp == "x")
then putStrLn "Bye!"
else do putStrLn(map toUpper inp)
myLoop
However, I can't seem to figure out how to avoid the if (inp == "x") condition and replace it with an end of stream condition.
In short, I'm looking for Haskell's equivalent to while (cin >> line) in C++
Use isEOF from System.IO.
import System.IO (isEOF)
import Data.Char
main = myLoop
myLoop = do done <- isEOF
if done
then putStrLn "Bye!"
else do inp <- getLine
putStrLn (map toUpper inp)
myLoop
This goes way beyond what you're really asking for, but I use this pattern a lot: interact:
myFun :: String -> String
myFun = ...
main = interact (unlines . myFun . lines)
Your function myFun will be executed against every line of standard input and the result sent to standard output.
You could also just rely on lazy IO.
import Data.Char
main :: IO ()
main = do
inp <- getContents
let ls = lines inp
upcased = map (map toUpper) ls
mapM_ putStrLn upcased
Another option is to use when whose documentation can be found here:
import System.IO (isEOF)
import Data.Char (toUpper)
import Control.Monad (forever, when)
import System.Exit (exitSuccess)
main = myLoop
myLoop = forever $ do
done <- isEOF
when done $ putStrLn "Bye!" >> exitSuccess
inp <- getLine
putStrLn (map toUpper inp)
myLoop
Otherwise, the answer is identical to the one by #dave4420.

Resources