Skipping first line in pipes-attoparsec - haskell

My types:
data Test = Test {
  a :: Int,
  b :: Int
} deriving (Show)
My parser:
testParser :: Parser Test
testParser = do
  a <- decimal
  tab
  b <- decimal
  return $ Test a b
tab = char '\t'
Now in order to skip the first line, I do something like this:
import qualified System.IO as IO
parser :: Parser Test
parser = manyTill anyChar endOfLine *> testParser
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
    for (parsed (parser <* endOfLine) (fromHandle testHandle)) (lift . print)
But the parser above skips every other line, which is expected, since it discards one line before each Test. How can I skip only the first line in a way that works with the Pipes ecosystem, so that the Producer yields individual Test values? One obvious solution, which I don't want, is shown below. It only works if I modify testParser to consume newlines, and it returns the entire [Test] instead of a single value:
tests :: Parser [Test]
tests = manyTill anyChar endOfLine *>
        many1 testParser
Any ideas on how to tackle this problem?

You can drop the first line efficiently in constant space like this:
import Data.ByteString (ByteString)
import Lens.Family (over)
import Pipes (Producer)
import Pipes.Group (drops)
import Pipes.ByteString (lines)
import Prelude hiding (lines)
dropLine :: Monad m => Producer ByteString m r -> Producer ByteString m r
dropLine = over lines (drops 1)
You can apply dropLine to your Producer before parsing it. Since the first line is then already gone, parse with testParser directly rather than with the line-skipping parser:
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
    let p = dropLine (fromHandle testHandle)
    in  for (parsed (testParser <* endOfLine) p) (lift . print)

If the first line doesn't contain any valid Test, you can use Either () Test in order to handle it:
parserEither :: Parser (Either () Test)
parserEither = Right <$> testParser <* endOfLine
           <|> Left <$> (manyTill anyChar endOfLine *> pure ())
After this you can use the functions provided by Pipes.Prelude to get rid of the first result (and additionally of all non-parseable lines):
producer p = parsed parserEither p
    >-> P.drop 1
    >-> P.filter (either (const False) (const True))
    >-> P.map (\(Right x) -> x)
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
    for (producer (fromHandle testHandle)) (lift . print)
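As a variation on the pipeline above, the filter and the partial lambda can be collapsed into one stage. The sketch below is an assumption, not part of the original answer: it uses a pair of Ints as a hypothetical stand-in for Test so that it runs without attoparsec, and it relies on P.concat from Pipes.Prelude, which flattens any Foldable and therefore keeps only the Right values of an Either.

```haskell
import Pipes
import qualified Pipes.Prelude as P

-- Hypothetical stand-in for the parsed stream: Left () marks a line that
-- did not parse, Right carries a parsed value.
type Parsed = Either () (Int, Int)

-- Drop the first result, then keep only the successful parses.
-- P.concat flattens any Foldable; for Either it yields the Right payloads.
keepParsed :: Monad m => Producer Parsed m r -> Producer (Int, Int) m r
keepParsed p = p >-> P.drop 1 >-> P.concat

-- Pure demonstration on an in-memory stream.
demo :: [(Int, Int)]
demo = P.toList (keepParsed (each [Left (), Right (1, 2), Left (), Right (3, 4)]))
```

With the real parser, keepParsed would presumably be applied to parsed parserEither p, replacing the three stages that follow it.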

Related

Handling complicated input (with nested values)

There is the following task:
First line is the number of cases
For each case there is a line with the number of numbers to add
For each case there is also a line with the numbers
For each case I have to print summed numbers
Example:
Input:
2
5
1 2 3 4 5
2
-100 100
Output:
15
0
This is my implementation
import Control.Monad
main = do
  linesCount <- readLn :: IO Int
  numbers <- replicateM linesCount getCase
  mapM_ putStrLn $ map (show.sum) numbers
getCase :: IO [Int]
getCase = do
  numbersCount <- readLn :: IO Int -- actually I don't need this variable
  numbersString <- getLine
  let numbers = map read $ words numbersString
  return numbers
It looks like a lot of code for parsing input. Are there any tricks to "compress" it? :)
If you merely want to make code shorter then check out the Stack Exchange community for code golfing. That is primarily for fun and games.
If we are thinking there is too much code it may not be that we need to make it shorter but rather that we need to make it clearer. Achieving this is a matter of experience and good practice. What we want to do is isolate simple concepts which are obviously correct and then combine them in obviously correct ways. Methodologies include top-down design (break the solution into smaller pieces) and bottom-up design (from smaller pieces build up to the solution) and mixes thereof.
A bottom-up piece that hits me straight away is the task of summing a list of numbers. This has a definition in Haskell's Prelude called sum :: (Num a, Foldable t) => t a -> a. Somewhere in the final solution we are going to use this.
Another method is to simplify the problem. We can be led astray by the way a problem is phrased. Upon closer inspection we might find an equivalent and simpler phrasing.
What information do we actually need from the input? Just the lists of numbers. What is the simplest way to obtain the lists of numbers? The number of lists seems irrelevant because there is no need to have this information before we start looking at the lists. Drop the first line and we are left with:
5
1 2 3 4 5
2
-100 100
Then, the length of each list is also irrelevant because we do not need that information before summing the list. Therefore let's also drop every other line from this point:
1 2 3 4 5
-100 100
Now we just have the lists of numbers separated by line returns where each number is separated by a space.
At this point we have a clear way to break apart the solution in a top-down manner. First we simplify the input. Secondly we parse the lists of numbers. Thirdly we sum the lists. Fourthly we print the sums. This is therefore the skeleton of our solution:
simplifyInput :: String -> [String]
parseNumberList :: String -> [Integer]
-- Note we can use `sum` from Prelude to sum the number lists.
printSums :: [Integer] -> IO ()
main :: IO ()
main = getContents >>= printSums . fmap (sum . parseNumberList) . simplifyInput
Now it is just a matter of implementing each obvious piece of the solution.
simplifyInput :: String -> [String]
simplifyInput = dropEveryOther . drop 1 . lines
  where
    dropEveryOther :: [a] -> [a]
In writing simplifyInput I discovered that dropping every other line requires some more work. That is okay, we can just break the solution apart again.
dropEveryOther :: [a] -> [a]
dropEveryOther [] = []
dropEveryOther [_] = [] -- a trailing unpaired line is dropped
dropEveryOther (_:y:xs) = y : dropEveryOther xs
Then continuing...
parseNumberList :: String -> [Integer]
parseNumberList = fmap read . words
printSums :: [Integer] -> IO ()
printSums = putStr . unlines . fmap show
Therefore, in totality:
simplifyInput :: String -> [String]
simplifyInput = dropEveryOther . drop 1 . lines
  where
    dropEveryOther :: [a] -> [a]
    dropEveryOther [] = []
    dropEveryOther [_] = [] -- a trailing unpaired line is dropped
    dropEveryOther (_:y:xs) = y : dropEveryOther xs
parseNumberList :: String -> [Integer]
parseNumberList = fmap read . words
printSums :: [Integer] -> IO ()
printSums = putStr . unlines . fmap show
main :: IO ()
main = getContents >>= printSums . fmap (sum . parseNumberList) . simplifyInput
The amount of code we have has gone up (compared to the first solution) but in exchange the code is made obvious. Now you should add some documentation comments so we do not forget our explanation for the solution.
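One benefit of this decomposition is that everything except the I/O can be checked against the sample input. The following is a minimal sketch under that assumption; solve and sampleInput are names introduced here for illustration and do not appear in the answer above.

```haskell
-- Keep every second element, starting with the second one.
dropEveryOther :: [a] -> [a]
dropEveryOther (_:y:rest) = y : dropEveryOther rest
dropEveryOther _          = []

-- Pure core of the program: input text in, one sum per case out.
solve :: String -> [Integer]
solve = fmap (sum . fmap read . words) . dropEveryOther . drop 1 . lines

-- The example input from the question.
sampleInput :: String
sampleInput = unlines ["2", "5", "1 2 3 4 5", "2", "-100 100"]
```

In GHCi, solve sampleInput evaluates to [15,0], matching the expected output of the task.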
Alec posted a super compressed version of my original code in one of the comments. I decided to post a small breakdown, in case someone gets lost and has no idea what's going on in there :)
Snippets below need to be preceded with valid imports:
import Control.Monad
import Control.Applicative
So we start with Alec's version:
main = readLn >>= flip replicateM_ (getLine >> sum . map read . words <$> getLine >>= print)
He used the flip function to save one set of parentheses. Without flip, the same thing can be written with replicateM_ as an infix section:
main = readLn >>= (`replicateM_` (getLine >> (print =<< sum . map read . words <$> getLine)))
The infix section partially applies the second parameter of replicateM_. We can replace it with a lambda:
main = readLn >>= \n -> replicateM_ n (getLine >> (print =<< sum . map read . words <$> getLine))
Now let's start extracting some pieces of code into separate meaningful functions:
printBatchResult = print =<< sum . map read . words <$> getLine
main = readLn >>= \n -> replicateM_ n (getLine >> printBatchResult)
We can flip the print =<< for more readability:
printBatchResult = sum . map read . words <$> getLine >>= print
main = readLn >>= \n -> replicateM_ n (getLine >> printBatchResult)
And so on:
printBatchResult = sum . map read . words <$> getLine >>= print
handleBatch = getLine >> printBatchResult
main = readLn >>= \n -> replicateM_ n handleBatch
And again:
sumLine = sum . map read . words
printBatchResult = sumLine <$> getLine >>= print
handleBatch = getLine >> printBatchResult
main = readLn >>= \n -> replicateM_ n handleBatch
And one more time:
sumLine = sum . map read . words
handleNumbersLine = sumLine <$> getLine
printBatchResult = handleNumbersLine >>= print
handleBatch = getLine >> printBatchResult
main = readLn >>= (\n -> replicateM_ n handleBatch)
And finally the last time:
sumLine = sum . map read . words
handleNumbersLine = sumLine <$> getLine
printBatchResult = handleNumbersLine >>= print
handleBatch = getLine >> printBatchResult
handleAllBatches n = replicateM_ n handleBatch
main = readLn >>= handleAllBatches
We can replace <$> with fmap:
sumLine = sum . map read . words
handleNumbersLine = fmap sumLine getLine
printBatchResult = handleNumbersLine >>= print
handleBatch = getLine >> printBatchResult
handleAllBatches n = replicateM_ n handleBatch
main = readLn >>= handleAllBatches
We can also remove every partial application:
sumLine line = (sum . map read . words) line
handleNumbersLine = fmap sumLine getLine
printBatchResult = handleNumbersLine >>= \sum -> print sum
handleBatch = getLine >> printBatchResult
handleAllBatches n = replicateM_ n handleBatch
main = readLn >>= \numberOfBatches -> handleAllBatches numberOfBatches
And finally, add signatures:
sumLine :: String -> Int
sumLine line = (sum . map read . words) line
handleNumbersLine :: IO Int
handleNumbersLine = fmap sumLine getLine
printBatchResult :: IO ()
printBatchResult = handleNumbersLine >>= \sum -> print sum
handleBatch :: IO ()
handleBatch = getLine >> printBatchResult
handleAllBatches :: Int -> IO ()
handleAllBatches n = replicateM_ n handleBatch
main = readLn >>= \numberOfBatches -> handleAllBatches numberOfBatches
Some final comments:
>>= - the bind operator from Monad. It runs the action on its left and passes that action's result to the function on its right, which returns the next action to run. In main it takes readLn :: IO Int and a function of type Int -> IO (), and the result is an IO () that prints the sums as a side effect.
>> - (used in handleBatch) runs the action on its left, discards its result (how many numbers there are in a line is arguably unnecessary), and then runs the action on its right, which handles a line of numbers.
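To see the two operators without any I/O involved, here is a small sketch in the Maybe monad, where the results can be inspected directly. The names bindExample, thenExample and failExample are made up for this illustration.

```haskell
-- (>>=) feeds the result of the left action to the function on the right.
bindExample :: Maybe Int
bindExample = Just 20 >>= \n -> Just (n + 1)

-- (>>) runs the left action, discards its result, then runs the right one.
thenExample :: Maybe Int
thenExample = Just 20 >> Just 1

-- If the left action fails, the right-hand side is never reached.
failExample :: Maybe Int
failExample = Nothing >> Just 1
```

Here bindExample is Just 21, thenExample is Just 1, and failExample is Nothing. In IO the same operators sequence side effects instead of possible failure.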

Forking the streaming flow in haskell-pipes

I'm having trouble directing flow through a pipeline with haskell-pipes. Basically, I analyze a bunch of files and then I have to either
print results to the terminal in a human-friendly way
encode results to JSON
The chosen path depends upon a command line option.
In the second case, I have to output an opening bracket, then every incoming value followed by a comma and then a closing bracket. Currently insertCommas never terminates, so the closing bracket is never outputted.
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy as B
import Data.Aeson (encode)
insertCommas :: Consumer B.ByteString IO ()
insertCommas = do
  first <- await
  lift $ B.putStr first
  for cat $ \obj -> lift $ do
    putStr ","
    B.putStr obj
jsonExporter :: Consumer (FilePath, AnalysisResult) IO ()
jsonExporter = do
  lift $ putStr "["
  P.map encode >-> insertCommas
  lift $ putStr "]"
exportStream :: Config -> Consumer (FilePath, AnalysisResult) IO ()
exportStream conf =
  case outputMode conf of
    JSON -> jsonExporter
    _ -> P.map (export conf) >-> P.stdoutLn
main :: IO ()
main = do
  -- The first two lines are Docopt stuff, not relevant
  args <- parseArgsOrExit patterns =<< getArgs
  ins <- allFiles $ args `getAllArgs` argument "paths"
  let conf = readConfig args
  runEffect $ each ins
    >-> P.mapM analyze
    >-> P.map (filterResults conf)
    >-> P.filter filterNulls
    >-> exportStream conf
AFAIK a Consumer cannot detect the end of a stream. To do that you need to use a Parser from Pipes.Parse and invert the control.
Here is a Parser which inserts commas between String elements:
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Parse (draw, evalStateT)
commify = do
  lift $ putStrLn "["
  m1 <- draw
  case m1 of
    Nothing -> lift $ putStrLn "]"
    Just x1 -> do
      lift $ putStrLn x1
      let loop = do mx <- draw
                    case mx of
                      Nothing -> lift $ putStrLn "]"
                      Just x -> lift (putStr "," >> putStrLn x) >> loop
      loop
test1 = evalStateT commify ( mapM_ yield (words "this is a test") )
test2 = evalStateT commify P.stdinLn
To handle the different output formats I would probably make both formats a Parser:
exportParser = do
  mx <- draw
  case mx of
    Nothing -> return ()
    Just x -> (lift $ putStrLn $ export x) >> exportParser
and then, inside main:
let parser = case outputMode conf of
      JSON -> commify
      _ -> exportParser
evalStateT parser (each ins
  >-> P.mapM analyze
  >-> P.map (filterResults conf)
  >-> P.filter filterNulls)
There is probably a slicker way to write exportParser in terms of foldAllM. You can also use the MaybeT transformer to more succinctly write the commify parser. I've written both out explicitly to make them easier to understand.
I think you should 'commify' with pipes-group. It has an intercalates, but not an intersperse, but it's not a big deal to write. You should stay away from the Consumer end, I think, for this sort of problem.
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy.Char8 as B
import Pipes.Group
import Lens.Simple -- or Control.Lens or Lens.Micro or anything with view/^.
import System.Environment
intersperse_ :: Monad m => a -> Producer a m r -> Producer a m r
intersperse_ a producer = intercalates (yield a) (producer ^. chunksOf 1)
main = do
  args <- getArgs
  let op prod = case args of
        "json":_ -> yield "[" *> intersperse_ "," prod <* yield "]"
        _ -> intersperse_ " " prod
  runEffect $ op producer >-> P.mapM_ B.putStr
  putStrLn ""
  where
    producer = mapM_ yield (B.words "this is a test")
which gives me this:
>>> :main json
[this,is,a,test]
>>> :main ---
this is a test
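If depending on pipes-group is not wanted, an intersperse can also be written by hand with next from the core pipes package, which steps a Producer once and so makes the end of the stream visible. This is only a sketch under that assumption; intersperseP, bracketed and demo are hypothetical names, and the elements are plain Strings rather than encoded JSON values.

```haskell
import Pipes
import qualified Pipes.Prelude as P

-- Yield `sep` between consecutive elements of a producer.
-- `next` steps the producer once, so the end of the stream is visible here.
intersperseP :: Monad m => a -> Producer a m r -> Producer a m r
intersperseP sep p0 = do
  step <- lift (next p0)
  case step of
    Left r        -> return r
    Right (x, p1) -> yield x >> go p1
  where
    -- Every element after the first is preceded by the separator.
    go p = do
      step <- lift (next p)
      case step of
        Left r        -> return r
        Right (x, p') -> yield sep >> yield x >> go p'

-- Wrap the separated elements in brackets, as a JSON-style array would be.
bracketed :: Monad m => Producer String m r -> Producer String m r
bracketed p = yield "[" *> intersperseP "," p <* yield "]"

-- Pure demonstration: collect everything the producer yields.
demo :: String
demo = concat (P.toList (bracketed (each (words "this is a test"))))
```

Here demo evaluates to "[this,is,a,test]". Because the closing bracket is yielded by the producer itself, no downstream Consumer has to notice that the input ended.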

Reading numbers inline

Imagine I read an input block via stdin that looks like this:
3
12
16
19
The first number is the number of following rows. I have to process these numbers via a function and report the results separated by a space.
So I wrote this main function:
main = do
  num <- readLn
  putStrLn $ intercalate " " [ show $ myFunc $ read getLine | c <- [1..num]]
Of course that function doesn't compile because of the read getLine.
But what is the correct (read: the Haskell way) way to do this properly? Is it even possible to write this function as a one-liner?
Is it even possible to write this function as a one-liner?
Well, it is, and it's kind of concise, but see for yourself:
main = interact $ unwords . map (show . myFunc . read) . drop 1 . lines
So, how does this work?
interact :: (String -> String) -> IO () takes all contents from STDIN, passes it through the given function, and prints the output.
We use unwords . map (show . myFunc . read) . drop 1 . lines :: String -> String:
lines :: String -> [String] breaks a string at line ends.
drop 1 removes the first line, as we don't actually need the number of lines.
map (show . myFunc . read) converts each String to the correct type, uses myFunc, and then converts it back to a String.
unwords is basically the same as intercalate " ".
However, keep in mind that interact isn't very GHCi friendly.
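One way around the GHCi limitation is to name the pure function and try it on a literal string before handing it to interact. A minimal sketch, assuming myFunc multiplies by 3 (a hypothetical stand-in for the asker's function) and using process as an illustrative name:

```haskell
-- Hypothetical stand-in for the asker's myFunc.
myFunc :: Int -> Int
myFunc = (* 3)

-- The pure String -> String function that would be handed to interact.
process :: String -> String
process = unwords . map (show . myFunc . read) . drop 1 . lines
```

With the sample input from the question, process "3\n12\n16\n19\n" gives "36 48 57", and main would then simply be interact process.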
You can build a list of monadic actions with <$> (or fmap) and execute them all with sequence.
λ intercalate " " <$> sequence [show . (2*) . read <$> getLine | _ <- [1..4]]
1
2
3
4
"2 4 6 8"
Is it even possible to write this function as a one-liner?
Sure, but there is a problem with the last line of your main function: you're trying to apply intercalate " " to
[ show $ myFunc $ read getLine | c <- [1..num]]
I'm guessing you expect the latter to have type [String], but it is in fact not a well-typed expression. How can that be fixed? Let's first define
getOneInt :: IO Int
getOneInt = read <$> getLine
for convenience (we'll be using it multiple times in our code). Now, what you meant is probably something like
[ show . myFunc <$> getOneInt | c <- [1..num]]
which, if the type of myFunc aligns with the rest, has type [IO String]. You can then pass that to sequence in order to get a value of type IO [String] instead. Finally, you can "pass" that (using =<<) to
putStrLn . intercalate " "
in order to get the desired one-liner:
import Control.Monad ( replicateM )
import Data.List ( intercalate )
main :: IO ()
main = do
  num <- getOneInt
  putStrLn . intercalate " " =<< sequence [ show . myFunc <$> getOneInt | c <- [1..num]]
  where
    myFunc = (* 3) -- for example
getOneInt :: IO Int
getOneInt = read <$> getLine
In GHCi:
λ> main
3
45
23
1
135 69 3
Is the code idiomatic and readable, though? Not so much, in my opinion...
[...] what is the correct (read: the Haskell way) way to do this properly?
There is no "correct" way of doing it, but the following just feels more natural and readable to me:
import Control.Monad ( replicateM )
import Data.List ( intercalate )
main :: IO ()
main = do
  n <- getOneInt
  ns <- replicateM n getOneInt
  putStrLn $ intercalate " " $ map (show . myFunc) ns
  where
    myFunc = (* 3) -- replace by your own function
getOneInt :: IO Int
getOneInt = read <$> getLine
Alternatively, if you want to eschew the do notation:
main =
  getOneInt >>=
  flip replicateM getOneInt >>=
  putStrLn . intercalate " " . map (show . myFunc)
  where
    myFunc = (* 3) -- replace by your own function

Using Pipes to read and write binary data in Haskell

I am trying to read and write very many ints in constant memory. I have figured out how to write the ints to a file but have not figured out how to read them back.
import Control.Lens (zoom)
import System.IO (IOMode(..), withFile)
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Parse as P
import qualified Pipes.Binary as P
intStream :: Monad m => Proxy x' x () Int m b
intStream = go (0 :: Int) where
  go i = yield i >> go (i + 1)
decoder :: Monad m => Int -> P.Parser P.ByteString m [Int]
decoder n = zoom (P.decoded . P.splitAt n) P.drawAll
main :: IO ()
main = do
  withFile "ints" WriteMode $ \h -> do
    runEffect $ for intStream P.encode >-> P.take 10000 >-> PB.toHandle h
  withFile "ints" ReadMode $ \h -> do
    xs <- P.evalStateT (decoder 10000000) (PB.fromHandle h)
    print xs
I got the decoder function from the documentation for Pipes.Binary. However, it uses drawAll, which according to the documentation is not an idiomatic use of Pipes and is provided for testing purposes.
My question is how to modify decoder so that it doesn't use drawAll and thus does not load all the values of xs into memory. So instead of printing the list of xs I could P.map print over a stream of decoded ints being read from the file.
The docs say that decoded is a lens from a stream of bytes to a stream of decoded values. We can get the latter out of the former using view from lens:
import Control.Lens (view)
import Control.Monad (void)
decoder :: Monad m => Int -> Producer P.ByteString m a -> Producer Int m ()
decoder n p = void (view P.decoded p) >-> P.take n
main :: IO ()
main = do
  withFile "ints" WriteMode $ \h -> do
    runEffect $ for intStream P.encode >-> P.take 10000 >-> PB.toHandle h
  withFile "ints" ReadMode $ \h -> do
    runEffect $ decoder 10000 (PB.fromHandle h) >-> P.print
I don't have much experience with pipes, I just followed the types here. The program seems to function as intended though.

conduit: producing memory leak

Working on some observations on a previous question (haskell-data-hashset-from-unordered-container-performance-for-large-sets) I stumbled upon a strange memory leak
module Main where
import System.Environment (getArgs)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Attoparsec.ByteString (sepBy, Parser, parseOnly)
import Data.Attoparsec.ByteString.Char8 (decimal, char)
import qualified Data.ByteString.Char8 as BS
import Data.Conduit
import qualified Data.Conduit.Attoparsec as CA
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
main :: IO ()
main = do (args:_) <- getArgs
          writeFile "input.txt" $ unlines $ map show [1..4 :: Int]
          case args of "list" -> m1
                       "fail" -> m2
                       "listlist" -> m3
                       "memoryleak" -> m4
                       --UPDATE
                       "bs-lines" -> m5
                       "bs" -> m6
                       _ -> putStr $ unlines ["Usage: conduit list"
                                             ," fail"
                                             ," listlist"
                                             ," memoryleak"
                                             --UPDATE
                                             ," bs-lines"
                                             ," bs"
                                             ]
m1,m2,m3,m4,m5,m6 :: IO ()
m1 = do hs <- runResourceT
              $ CB.sourceFile "input.txt"
              $$ CB.lines
              =$= CA.conduitParser (decimal :: Parser Int)
              =$= CL.map snd
              =$= CL.consume
        print hs
m2 = do hs <- runResourceT
              $ CB.sourceFile "input.txt"
              $$ CA.conduitParser (decimal :: Parser Int)
              =$= CL.map snd
              =$= CL.consume
        print hs
m3 = do hs <- runResourceT
              $ CB.sourceFile "input.txt"
              $$ CB.lines
              =$= CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
              =$= CL.map snd
              =$= CL.consume
        print hs
m4 = do hs <- runResourceT
              $ CB.sourceFile "input.txt"
              $$ CA.conduitParser (decimal `sepBy` (char '\n') :: Parser [Int])
              =$= CL.map snd
              =$= CL.consume
        print hs
-- UPDATE
m5 = do inpt <- BS.lines <$> BS.readFile "input.txt"
        let Right hs = mapM (parseOnly (decimal :: Parser Int)) inpt
        print hs
m6 = do inpt <- BS.readFile "input.txt"
        let Right hs = (parseOnly (decimal `sepBy` (char '\n') :: Parser [Int])) inpt
        print hs
Here is some example output:
$ > stack exec -- example list
[1234]
$ > stack exec -- example listlist
[[1234]]
$ > stack exec -- conduit fail
conduit: ParseError {errorContexts = [], errorMessage = "Failed reading: takeWhile1", errorPosition = 1:2}
$ > stack exec -- example memoryleak
(Ctrl+C)
-- UPDATE
$ > stack exec -- example bs-lines
[1,2,3,4]
$ > stack exec -- example bs
[1,2,3,4]
The questions I have are:
Why is m1 not producing [1,2,3,4]?
Why is m2 failing?
Why is m4 behaving totally different compared to all other versions and producing a space leak?
Why is m2 failing?
The input file as a character stream is:
1\n2\n3\n4\n
Since the decimal parser does not expect a newline character, after consuming the first number the remaining stream is:
\n2\n3\n4\n
As the input stream is not exhausted, conduitParser runs the parser on the stream again. This time it cannot consume even the first character, so it fails.
Why is m4 behaving totally different compared to all other versions and producing a space leak?
decimal `sepBy` (char '\n') only consumes a \n between two integers. After successfully parsing four numbers, the input stream has only one character left in it:
\n
and decimal `sepBy` (char '\n') cannot consume it. Even worse, it does not fail: sepBy can consume nothing and return an empty list. Therefore it keeps parsing nothing and never terminates.
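Both behaviours can be reproduced without conduit by running the parsers with attoparsec's parseOnly and looking at the input they leave behind. The following is a sketch for illustration only; withRest and numbers are hypothetical helper names, not part of the question's code.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)

-- The list parser used in m3 and m4.
numbers :: Parser [Int]
numbers = decimal `sepBy` char '\n'

-- Run a parser and also report whatever input it left unconsumed.
withRest :: Parser a -> ByteString -> Either String (a, ByteString)
withRest p = parseOnly ((,) <$> p <*> takeByteString)
```

Under these definitions, withRest numbers "1\n2\n3\n4\n" is Right ([1,2,3,4],"\n"), so the trailing newline stays in the stream. withRest numbers "\n" is Right ([],"\n"): the parser succeeds without consuming anything, which is the situation conduitParser keeps hitting in m4. By contrast, withRest (decimal :: Parser Int) "\n2" is a Left, which corresponds to the failure in m2.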
Why is m1 not producing [1,2,3,4]?
I want to know too! I guess it has something to do with fusing. Maybe you should contact the author of the conduit package, who just commented on your question.
To answer the question about m1: when you use CB.lines, you're turning input that looks like:
["1\n2\n3\n4\n"]
into:
["1", "2", "3", "4"]
Then, attoparsec parses the "1", waits for more input, sees the "2", and so on, so the digits of all four chunks end up in the single number 1234.
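The chunk-by-chunk behaviour can be imitated directly with attoparsec's incremental interface. This is a sketch that assumes conduitParser feeds chunks to the parser in roughly this way; parseChunks is a hypothetical helper introduced here.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)
import Data.List (foldl')

-- Feed the chunks to `decimal` one at a time, then signal end of input
-- with an empty chunk and return the number that was parsed, if any.
parseChunks :: [ByteString] -> Maybe Int
parseChunks []       = Nothing
parseChunks (c : cs) =
  case feed (foldl' feed (parse decimal c) cs) "" of
    Done _ n -> Just n
    _        -> Nothing
```

With the newline-free chunks that CB.lines produces, parseChunks ["1", "2", "3", "4"] is Just 1234: nothing separates the chunks any more, so decimal treats them as one run of digits.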
