How to properly add IO to attoparsec Parser? - haskell

I want to do some tracing/debugging in my attoparsec parser. Here's a minimal [not] working example:
import qualified Data.Text as T
import Data.Attoparsec.Text
import Data.Attoparsec.Combinator
import Control.Applicative ((<*), (*>))

parseSentences :: Parser [T.Text]
parseSentences = many1 $ takeWhile1 (/= '.') <* char '.' <* skipSpace

parser :: Parser [T.Text]
parser = do
  stuff <- parseSentences
  -- putStrLn $ "Got stuff: " ++ show stuff
  tail <- takeText
  -- putStrLn $ "Got tail: " ++ show tail
  return $ stuff ++ [tail, T.pack "more stuff"]

main :: IO ()
main = do
  let input = T.pack "sample. example. bang"
  print $ parseOnly parser input
What do I have to do in order to use IO actions in my parser?

If you had used the Parsec library, you could have used its monad transformer, ParsecT, to mix IO actions with parsing code.
Attoparsec, however, is a pure parser, so you will have to fall back on the Debug.Trace.trace function to print debugging messages to the terminal.
import Debug.Trace (trace)

parser :: Parser [T.Text]
parser = do
  stuff <- parseSentences
  tail <- takeText
  return .
    trace ("Got stuff: " ++ show stuff) .
    trace ("Got tail: " ++ show tail) $
    stuff ++ [tail, T.pack "more stuff"]
The messages will be printed when the associated value (here the result of the expression stuff ++ ...) is evaluated.
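For completeness, here is a rough sketch of what the ParsecT route mentioned above could look like. This is my own illustration, assuming the parsec package; the names IOParser, sentenceP, parserIO and mainIO are made up for this example:
import Text.Parsec
import Control.Monad.IO.Class (liftIO)

-- A parser that can run IO actions, because the underlying monad is IO.
type IOParser = ParsecT String () IO

sentenceP :: IOParser String
sentenceP = many1 (noneOf ".") <* char '.' <* spaces

parserIO :: IOParser [String]
parserIO = do
  stuff <- many1 (try sentenceP)
  liftIO $ putStrLn ("Got stuff: " ++ show stuff)  -- real IO, no trace needed
  rest  <- many anyChar
  liftIO $ putStrLn ("Got tail: " ++ show rest)
  return (stuff ++ [rest, "more stuff"])

mainIO :: IO ()
mainIO = print =<< runParserT parserIO () "<input>" "sample. example. bang"
The IO messages are then printed in parse order rather than whenever the result happens to be forced, which is the main practical difference from the trace approach.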

Related

Forking the streaming flow in haskell-pipes

I'm having trouble directing flow through a pipeline with haskell-pipes. Basically, I analyze a bunch of files and then I have to either
print results to the terminal in a human-friendly way, or
encode results to JSON.
The chosen path depends upon a command line option.
In the second case, I have to output an opening bracket, then every incoming value followed by a comma, and then a closing bracket. Currently insertCommas never terminates, so the closing bracket is never output.
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy as B
import Data.Aeson (encode)

insertCommas :: Consumer B.ByteString IO ()
insertCommas = do
  first <- await
  lift $ B.putStr first
  for cat $ \obj -> lift $ do
    putStr ","
    B.putStr obj

jsonExporter :: Consumer (FilePath, AnalysisResult) IO ()
jsonExporter = do
  lift $ putStr "["
  P.map encode >-> insertCommas
  lift $ putStr "]"

exportStream :: Config -> Consumer (FilePath, AnalysisResult) IO ()
exportStream conf =
  case outputMode conf of
    JSON -> jsonExporter
    _    -> P.map (export conf) >-> P.stdoutLn

main :: IO ()
main = do
  -- The first two lines are Docopt stuff, not relevant
  args <- parseArgsOrExit patterns =<< getArgs
  ins  <- allFiles $ args `getAllArgs` argument "paths"
  let conf = readConfig args
  runEffect $ each ins
          >-> P.mapM analyze
          >-> P.map (filterResults conf)
          >-> P.filter filterNulls
          >-> exportStream conf
AFAIK a Consumer cannot detect the end of a stream. In order to do that you need to use a Pipes.Parser and invert the control.
Here is a Parser which inserts commas between String elements:
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Parse (draw, evalStateT)

commify = do
  lift $ putStrLn "["
  m1 <- draw
  case m1 of
    Nothing -> lift $ putStrLn "]"
    Just x1 -> do
      lift $ putStrLn x1
      let loop = do
            mx <- draw
            case mx of
              Nothing -> lift $ putStrLn "]"
              Just x  -> lift (putStr "," >> putStrLn x) >> loop
      loop

test1 = evalStateT commify (mapM_ yield (words "this is a test"))
test2 = evalStateT commify P.stdinLn
To handle the different output formats I would probably make both formats a Parser:
exportParser = do
  mx <- draw
  case mx of
    Nothing -> return ()
    Just x  -> lift (putStrLn (export x)) >> exportParser
and then:
let parser = case outputMode conf of
      JSON -> commify
      _    -> exportParser
evalStateT parser (each ins >-> P.mapM analyze
                            >-> P.map (filterResults conf)
                            >-> P.filter filterNulls)
There is probably a slicker way to write exportParser in terms of foldAllM. You can also use the MaybeT transformer to more succinctly write the commify parser. I've written both out explicitly to make them easier to understand.
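Not shown above, but a foldAllM version could look roughly like this. This is only a sketch, assuming Pipes.Parse's foldAllM and an export function that renders each element to a String; I have not measured it against the explicit recursion:
{-# LANGUAGE RankNTypes #-}
import Pipes.Parse (Parser, foldAllM)

-- Drain the whole stream, printing each element as it is drawn.
exportParser' :: Parser String IO ()
exportParser' = foldAllM (\() x -> putStrLn (export x)) (return ()) return
  where
    export = id  -- stand-in for the real rendering function
The accumulator here is just (), so the fold degenerates into "run an IO action per element", which is exactly what the hand-written exportParser does.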
I think you should 'commify' with pipes-group. It has an intercalates but not an intersperse; that's not a big deal to write, though. You should stay away from the Consumer end, I think, for this sort of problem.
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy.Char8 as B
import Pipes.Group
import Lens.Simple  -- or Control.Lens or Lens.Micro or anything with view/(^.)
import System.Environment

intersperse_ :: Monad m => a -> Producer a m r -> Producer a m r
intersperse_ a producer = intercalates (yield a) (producer ^. chunksOf 1)

main = do
  args <- getArgs
  let op prod = case args of
        "json":_ -> yield "[" *> intersperse_ "," prod <* yield "]"
        _        -> intersperse_ " " prod
  runEffect $ op producer >-> P.mapM_ B.putStr
  putStrLn ""
  where
    producer = mapM_ yield (B.words "this is a test")
which gives me this:
>>> :main json
[this,is,a,test]
>>> :main ---
this is a test

Reading numbers inline

Imagine I read an input block via stdin that looks like this:
3
12
16
19
The first number is the number of following rows. I have to process these numbers via a function and report the results separated by a space.
So I wrote this main function:
main = do
  num <- readLn
  putStrLn $ intercalate " " [ show $ myFunc $ read getLine | c <- [1..num] ]
Of course that function doesn't compile because of the read getLine.
But what is the correct (read: the Haskell way) way to do this properly? Is it even possible to write this function as a one-liner?
Is it even possible to write this function as a one-liner?
Well, it is, and it's kind of concise, but see for yourself:
main = interact $ unwords . map (show . myFunc . read) . drop 1 . lines
So, how does this work?
interact :: (String -> String) -> IO () takes all contents from STDIN, passes it through the given function, and prints the output.
We use unwords . map (show . myFunc . read) . drop 1 . lines :: String -> String:
lines :: String -> [String] breaks a string at line ends.
drop 1 removes the first line, as we don't actually need the number of lines.
map (show . myFunc . read) converts each String to the correct type, applies myFunc, and converts the result back to a String.
unwords is basically the same as intercalate " ".
However, keep in mind that interact isn't very GHCi friendly.
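If you want to run the one-liner outside GHCi, a minimal complete program could look like this; myFunc here is just a doubling placeholder I made up for illustration:
main :: IO ()
main = interact $ unwords . map (show . myFunc . read) . drop 1 . lines
  where
    myFunc :: Int -> Int
    myFunc = (* 2)  -- placeholder; substitute your real function
Feeding it the question's sample input (3, then 12, 16, 19, one per line) should print 24 32 38.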
You can build a list of monadic actions with <$> (or fmap) and execute them all with sequence.
λ intercalate " " <$> sequence [show . (2*) . read <$> getLine | _ <- [1..4]]
1
2
3
4
"2 4 6 8"
Is it even possible to write this function as a one-liner?
Sure, but there is a problem with the last line of your main function, because you're trying to apply intercalate " " to
[ show $ myFunc $ read getLine | c <- [1..num]]
I'm guessing you expect the latter to have type [String], but it is in fact not a well-typed expression. How can that be fixed? Let's first define
getOneInt :: IO Int
getOneInt = read <$> getLine
for convenience (we'll be using it multiple times in our code). Now, what you meant is probably something like
[ show . myFunc <$> getOneInt | c <- [1..num]]
which, if the type of myFunc aligns with the rest, has type [IO String]. You can then pass that to sequence in order to get a value of type IO [String] instead. Finally, you can "pass" that (using =<<) to
putStrLn . intercalate " "
in order to get the desired one-liner:
import Control.Monad ( replicateM )
import Data.List ( intercalate )
main :: IO ()
main = do
  num <- getOneInt
  putStrLn . intercalate " " =<< sequence [ show . myFunc <$> getOneInt | c <- [1..num] ]
  where
    myFunc = (* 3) -- for example

getOneInt :: IO Int
getOneInt = read <$> getLine
In GHCi:
λ> main
3
45
23
1
135 69 3
Is the code idiomatic and readable, though? Not so much, in my opinion...
[...] what is the correct (read: the Haskell way) way to do this properly?
There is no "correct" way of doing it, but the following just feels more natural and readable to me:
import Control.Monad ( replicateM )
import Data.List ( intercalate )
main :: IO ()
main = do
  n  <- getOneInt
  ns <- replicateM n getOneInt
  putStrLn $ intercalate " " $ map (show . myFunc) ns
  where
    myFunc = (* 3) -- replace by your own function

getOneInt :: IO Int
getOneInt = read <$> getLine
Alternatively, if you want to eschew the do notation:
main =
  getOneInt >>=
  flip replicateM getOneInt >>=
  putStrLn . intercalate " " . map (show . myFunc)
  where
    myFunc = (* 3) -- replace by your own function

Skipping first line in pipes-attoparsec

My types:
data Test = Test {
  a :: Int,
  b :: Int
} deriving (Show)
My parser:
testParser :: Parser Test
testParser = do
  a <- decimal
  tab
  b <- decimal
  return $ Test a b
tab = char '\t'
Now in order to skip the first line, I do something like this:
import qualified System.IO as IO

parser :: Parser Test
parser = manyTill anyChar endOfLine *> testParser

main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
  for (parsed (parser <* endOfLine) (fromHandle testHandle)) (lift . print)
But the above parser makes every alternate line get skipped (which is obvious). How do I skip only the first line in a way that works with the Pipes ecosystem (the Producer should still yield a single Test value at a time)? Here is one obvious solution which I don't want (the code below only works if I modify testParser to read newlines), because it returns the entire [Test] instead of a single value:
tests :: Parser [Test]
tests = manyTill anyChar endOfLine *>
        many1 testParser
Any ideas on how to tackle this problem?
You can drop the first line efficiently in constant space like this:
import Pipes (Producer)
import Lens.Family (over)
import Pipes.Group (drops)
import Pipes.ByteString (lines)
import Data.ByteString (ByteString)
import Prelude hiding (lines)

dropLine :: Monad m => Producer ByteString m r -> Producer ByteString m r
dropLine = over lines (drops 1)
You can apply dropLine to your Producer before you parse the Producer, like this:
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
  let p = dropLine (fromHandle testHandle)
  in  for (parsed (parser <* endOfLine) p) (lift . print)
If the first line doesn't contain any valid Test, you can use Either () Test in order to handle it:
parserEither :: Parser (Either () Test)
parserEither = Right <$> testParser <* endOfLine
           <|> Left  <$> (manyTill anyChar endOfLine *> pure ())
After this you can use the functions provided by Pipes.Prelude to get rid of the first result (and additionally of all non-parseable lines):
producer p = parsed parserEither p
         >-> P.drop 1
         >-> P.filter (either (const False) (const True))
         >-> P.map (\(Right x) -> x)
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
  for (producer (fromHandle testHandle)) (lift . print)

Can I be sure of order of IO actions in this example?

At the moment, I have this code in and around main:
import Control.Monad
import Control.Applicative
binSearch :: Ord a => [a] -> a -> Maybe Int
main = do
  xs <- lines <$> readFile "Cars1.txt"
  x  <- getLine <* putStr "Registration: " -- Right?
  putStrLn $ case binSearch xs x of
    Just n  -> "Found at position " ++ show n
    Nothing -> "Not found"
My hope is for “Registration: ” to be printed, then for the program to wait for the input to x. Does what I've written imply that that will be the case? Do I need the <*, or will putting the putStr expression on the line above make things work as well?
PS: I know I have to convert binSearch to work with arrays rather than lists (otherwise it's probably not worth doing a binary search), but that's a problem for another day.
The line
x <- getLine <* putStr "Registration: "
orders the IO actions left-to-right: first a line is taken as input, then the message is printed, and finally variable x is bound to the result of getLine.
Do I need the <*, or will putting the putStr expression on the line above make things work as well?
If you want the message to precede the input, you have to put the putStr on the line above, as follows:
main :: IO ()
main = do
  xs <- lines <$> readFile "Cars1.txt"
  putStr "Registration: "
  x <- getLine
  putStrLn $ case binSearch xs x of
    Just n  -> "Found at position " ++ show n
    Nothing -> "Not found"
Alternatively,
x <- putStr "Registration: " *> getLine
or
x <- putStr "Registration: " >> getLine
would work, but they are less readable.
Finally, since you added the lazy-evaluation tag, let me add that your question is actually not about laziness, but about how the operator <* is defined, and in particular about the order in which it sequences the IO actions.
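To make that ordering concrete, here is a small sketch (mine, not part of the original answer) of how <* sequences two IO actions: both effects run, left before right, and only the left result is kept.
-- For IO (and any Monad), u <* v runs u first, then v, and keeps u's result:
firstOf :: IO a -> IO b -> IO a
firstOf u v = do
  x <- u        -- left action runs first ...
  _ <- v        -- ... then the right one; its result is discarded
  return x

-- So this reads the line before printing the prompt,
-- exactly like  getLine <* putStr "Registration: "
demo :: IO String
demo = getLine `firstOf` putStr "Registration: "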

conduit: producing memory leak

While working on some observations for a previous question (haskell-data-hashset-from-unordered-container-performance-for-large-sets), I stumbled upon a strange memory leak:
module Main where

import System.Environment (getArgs)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Attoparsec.ByteString (sepBy, parseOnly, Parser)
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import qualified Data.ByteString.Char8 as BS
import Data.Conduit
import qualified Data.Conduit.Attoparsec as CA
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
main :: IO ()
main = do
  (args:_) <- getArgs
  writeFile "input.txt" $ unlines $ map show [1..4 :: Int]
  case args of
    "list"       -> m1
    "fail"       -> m2
    "listlist"   -> m3
    "memoryleak" -> m4
    -- UPDATE
    "bs-lines"   -> m5
    "bs"         -> m6
    _ -> putStr $ unlines [ "Usage: conduit list"
                          , "               fail"
                          , "               listlist"
                          , "               memoryleak"
                          -- UPDATE
                          , "               bs-lines"
                          , "               bs"
                          ]
m1, m2, m3, m4 :: IO ()

m1 = do
  hs <- runResourceT
      $ CB.sourceFile "input.txt"
     $$ CB.lines
    =$= CA.conduitParser (decimal :: Parser Int)
    =$= CL.map snd
    =$= CL.consume
  print hs

m2 = do
  hs <- runResourceT
      $ CB.sourceFile "input.txt"
     $$ CA.conduitParser (decimal :: Parser Int)
    =$= CL.map snd
    =$= CL.consume
  print hs

m3 = do
  hs <- runResourceT
      $ CB.sourceFile "input.txt"
     $$ CB.lines
    =$= CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
    =$= CL.map snd
    =$= CL.consume
  print hs

m4 = do
  hs <- runResourceT
      $ CB.sourceFile "input.txt"
     $$ CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
    =$= CL.map snd
    =$= CL.consume
  print hs
-- UPDATE
m5, m6 :: IO ()

m5 = do
  inpt <- BS.lines <$> BS.readFile "input.txt"
  let Right hs = mapM (parseOnly (decimal :: Parser Int)) inpt
  print hs

m6 = do
  inpt <- BS.readFile "input.txt"
  let Right hs = parseOnly (decimal `sepBy` (char '\n') :: Parser [Int]) inpt
  print hs
Here is some example output:
$ > stack exec -- example list
[1234]
$ > stack exec -- example listlist
[[1234]]
$ > stack exec -- conduit fail
conduit: ParseError {errorContexts = [], errorMessage = "Failed reading: takeWhile1", errorPosition = 1:2}
$ > stack exec -- example memoryleak
(Ctrl+C)
-- UPDATE
$ > stack exec -- example bs-lines
[1,2,3,4]
$ > stack exec -- example bs
[1,2,3,4]
Now the questions I have are:
Why is m1 not producing [1,2,3,4]?
Why is m2 failing?
Why is m4 behaving totally differently from all the other versions and producing a space leak?
Why is m2 failing?
The input file as a character stream is:
1\n2\n3\n4\n
Since the decimal parser does not expect a newline character, after consuming the first number the remaining stream is:
\n2\n3\n4\n
As the input stream is not exhausted, conduitParser will run the parser on the stream again; this time it cannot even consume the first character, so it fails.
Why is m4 behaving totally differently from all the other versions and producing a space leak?
decimal `sepBy` (char '\n') will only consume a \n between two integers, so after successfully parsing four numbers the input stream has just one character left in it:
\n
and decimal `sepBy` (char '\n') cannot consume it. Even worse, it will not fail: sepBy may consume nothing and return an empty list, so it parses nothing over and over again and never terminates.
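Not part of the original answers, but one way to repair both m2 and m4 would be to make the item parser consume its trailing newline itself, so it neither leaves a stray \n behind nor can succeed without consuming input. A sketch, assuming it is added to the module above (lineDecimal and m2' are my own names):
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString (Parser, endOfInput)
import Data.Attoparsec.ByteString.Char8 (decimal, endOfLine)

-- One Int per line; it always consumes at least one byte,
-- so conduitParser can neither choke on a leftover '\n' nor loop forever.
lineDecimal :: Parser Int
lineDecimal = decimal <* (endOfLine <|> endOfInput)

m2' :: IO ()
m2' = do
  hs <- runResourceT
      $ CB.sourceFile "input.txt"
     $$ CA.conduitParser lineDecimal
    =$= CL.map snd
    =$= CL.consume
  print hs  -- expected: [1,2,3,4]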
Why is m1 not producing [1,2,3,4]?
I want to know that too! I guess it has something to do with fusing; maybe you should contact the author of the conduit package, who just commented on your question.
To answer the question about m1: when you use CB.lines, you're turning input that looks like:
["1\n2\n3\n4\n"]
into:
["1", "2", "3", "4"]
Then, attoparsec parses the "1", waits for more input, sees the "2", and so on, so the digits all run together and the final result is the single number 1234.
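You can watch this happen with attoparsec's incremental interface; the following standalone demo (my illustration, not from the answer) feeds the same chunks that CB.lines would deliver and ends up with a single number:
import Data.Attoparsec.ByteString (Parser, parse, feed)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.ByteString.Char8 as BS

demoIncremental :: IO ()
demoIncremental = print result
  where
    -- parse the first chunk, feed the rest one chunk at a time,
    -- then an empty chunk to signal end of input; decimal keeps
    -- extending the same number across chunk boundaries
    start  = parse (decimal :: Parser Int) (BS.pack "1")
    result = feed (foldl feed start (map BS.pack ["2", "3", "4"])) BS.empty
-- prints: Done "" 1234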
