I am trying to read and write a very large number of Ints in constant memory. I have figured out how to write the Ints to a file but have not figured out how to read them back.
import Control.Lens (zoom)
import System.IO (IOMode(..), withFile)
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Parse as P
import qualified Pipes.Binary as P
intStream :: Monad m => Proxy x' x () Int m b
intStream = go (0 :: Int) where
  go i = yield i >> go (i + 1)

decoder :: Monad m => Int -> P.Parser P.ByteString m [Int]
decoder n = zoom (P.decoded . P.splitAt n) P.drawAll

main :: IO ()
main = do
  withFile "ints" WriteMode $ \h -> do
    runEffect $ for intStream P.encode >-> P.take 10000 >-> PB.toHandle h
  withFile "ints" ReadMode $ \h -> do
    xs <- P.evalStateT (decoder 10000000) (PB.fromHandle h)
    print xs
I got the decoder function from the documentation for Pipes.Binary. However, it uses drawAll, which according to the documentation is not idiomatic use of Pipes and is only provided for testing purposes.
My question is how to modify decoder so that it doesn't use drawAll and therefore doesn't load all the values of xs into memory. Then, instead of printing the list xs, I could map print over a stream of decoded Ints being read from the file.
The docs say that decoded is a lens from a stream of bytes to a stream of decoded values. We can get the latter out of the former using view from lens:
-- needs: import Control.Lens (view) and import Control.Monad (void)
decoder :: Monad m => Int -> Producer P.ByteString m a -> Producer Int m ()
decoder n p = void (view P.decoded p) >-> P.take n
main :: IO ()
main = do
withFile "ints" WriteMode $ \h -> do
runEffect $ for intStream P.encode >-> P.take 10000 >-> PB.toHandle h
withFile "ints" ReadMode $ \h -> do
runEffect $ decoder 10000 (PB.fromHandle h) >-> P.print
I don't have much experience with pipes; I just followed the types here. The program seems to function as intended, though.
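For instance, instead of printing every value, the same streaming producer can be folded in constant memory. A sketch using Pipes.Prelude.sum and the question's imports (this part is not from the original answer):

-- Sum the decoded Ints without ever holding the whole list in memory.
sumInts :: IO Int
sumInts = withFile "ints" ReadMode $ \h ->
  P.sum (decoder 10000 (PB.fromHandle h))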
My goal is to have the last value produced equal to 80 (40 + 40) (see code below)...
{-# LANGUAGE LambdaCase #-}
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Lift
import Control.Monad.State.Strict

data Input = A Integer | B Integer | C Integer

main :: IO ()
main = runEffect $ each [A 10,B 2,C 3,A 40,A 40] >-> pipeline >-> P.print
pipeline :: Pipe Input Integer IO ()
pipeline = for cat $ \case
  A x -> yield x >-> accumulate
  B x -> yield x
  C x -> yield x

accumulate :: Pipe Integer Integer IO ()
accumulate = evalStateP 0 accumulate'

accumulate' :: Pipe Integer Integer (StateT Integer IO) ()
accumulate' = go
  where
    go = do
      x <- await
      lift $ modify (+x)
      r <- lift get
      yield r
      go
With this example the Input As are not accumulated: yield x >-> accumulate on an A doesn't do what I expected; a fresh stream (and a fresh state) is created each time.
Composing pipes that run over a state monad sequentially works well, but here I want to nest them inside the case pattern (as a kind of substream).
The problem is that you call evalStateP too early, discarding state you want to preserve across calls to accumulate. Try something like this:
pipeline :: Pipe Input Integer IO ()
pipeline = evalStateP 0 $ for cat $ \case
  A x -> yield x >-> accumulate
  B x -> yield x
  C x -> yield x

accumulate :: Pipe Integer Integer (StateT Integer IO) ()
accumulate = for cat $ \x -> do
  modify (+x)
  r <- get
  yield r
Note that Proxy has a MonadState instance, so you don't need to lift state operations manually if you use mtl.
I've been learning Haskell for two years now and I'm still confused about the best (fastest) way to read tons of numbers from a single input line.
For practice I registered on hackerearth.com and have been trying to solve every challenge in Haskell. But now I'm stuck on a challenge because I run into timeout issues; my program is just too slow to be accepted by the site.
Using the profiler I found that 80%+ of the time is spent parsing a line with lots of integers, and the percentage gets even higher as the number of values in the line increases.
This is how I'm currently reading numbers from an input line:
import Data.Maybe (fromJust)
import qualified Data.ByteString.Char8 as C8

main = do
  scores <- fmap (map (fst . fromJust . C8.readInt) . C8.words) C8.getLine :: IO [Int]
  -- ... the rest of the program consumes scores
Is there any way to get the data into the variable faster?
BTW: the biggest test case consists of a line with 200,000 9-digit values. Parsing it takes incredibly long (> 60 s).
It's always difficult to declare a particular approach "the fastest", since there's almost always some way to squeeze out more performance. However, an approach using Data.ByteString.Char8 and the general method you suggest should be among the fastest methods for reading numbers. If you encounter a case where performance is poor, the problem likely lies elsewhere.
To give some concrete results, I generated a 191 MB file of 20 million 9-digit numbers, space-separated on a single line. I then tried several general methods of reading a line of numbers and printing their sum (which, for the record, was 10999281565534666). The obvious approach using String:
reader :: IO [Int]
reader = map read . words <$> getLine
sum' xs = sum xs -- work around GHC ticket 10992
main = print =<< sum' <$> reader
took 52secs; a similar approach using Text:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n
ran in 2.4secs (but note that it would need to be modified to handle negative numbers!); and the same approach using Char8:
import qualified Data.ByteString.Char8 as C
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n
ran in 1.4secs. All examples were compiled with -O2 on GHC 8.0.2.
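As an aside, the negative-number caveat for the Text version could be addressed with T.signed. A sketch (not benchmarked here), assuming the same imports as above:

readTextSigned :: IO [Int]
readTextSigned = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.signed T.decimal s in n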
As a comparison benchmark, a scanf-based C implementation:
/* GCC 5.4.0 w/ -O3 */
#include <stdio.h>
int main()
{
    long x, acc = 0;
    while (scanf(" %ld", &x) == 1) {
        acc += x;
    }
    printf("%ld\n", acc);
    return 0;
}
ran in about 2.5secs, on par with the Text implementation.
You can squeeze a bit more performance out of the Char8 implementation. Using a hand-rolled parser:
readChar8' :: IO [Int]
readChar8' = parse <$> C.getLine
  where
    parse = unfoldr go
    go s = do
      (n, s1) <- C.readInt s
      let s2 = C.dropWhile isSpace s1   -- isSpace from Data.Char, unfoldr from Data.List
      return (n, s2)
runs in about 0.9secs -- I haven't tried to determine why there's a difference, but the compiler must be missing an opportunity to perform some optimization of the words-to-readInt pipeline.
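Going one step further, a hypothetical variant (not part of the benchmarks above): if all you need is the sum, you can fold the accumulator along while parsing and never build the [Int] at all:

import Data.Char (isSpace)
import qualified Data.ByteString.Char8 as C

-- Hypothetical: parse and sum in one pass, with a strictly evaluated accumulator.
sumLine :: C.ByteString -> Int
sumLine = go 0
  where
    go acc s = case C.readInt (C.dropWhile isSpace s) of
      Just (n, rest) -> let acc' = acc + n in acc' `seq` go acc' rest
      Nothing        -> acc

main :: IO ()
main = print . sumLine =<< C.getLine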
Haskell Code for Reference
Make some numbers with Numbers.hs:
-- |Generate 20M 9-digit numbers:
-- ./Numbers 20000000 100000000 999999999 > data1.txt
import qualified Data.ByteString.Char8 as C
import Control.Monad
import System.Environment
import System.Random
main :: IO ()
main = do
  [n, a, b] <- map read <$> getArgs
  nums <- replicateM n (randomRIO (a,b))
  let _ = nums :: [Int]
  C.putStrLn (C.unwords (map (C.pack . show) nums))
Find their sum with Sum.hs:
import Data.List
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
import Data.Char (isSpace)
import qualified Data.ByteString.Char8 as C
import System.Environment
-- work around https://ghc.haskell.org/trac/ghc/ticket/10992
sum' xs = sum xs
readString :: IO [Int]
readString = map read . words <$> getLine

readText :: IO [Int]
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n

readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n

readHand :: IO [Int]
readHand = parse <$> C.getLine
  where
    parse = unfoldr go
    go s = do
      (n, s1) <- C.readInt s
      let s2 = C.dropWhile isSpace s1
      return (n, s2)
main = do
  [method] <- getArgs
  let reader = case method of
        "string" -> readString
        "text"   -> readText
        "char8"  -> readChar8
        "hand"   -> readHand
  print =<< sum' <$> reader
Timings:
./Sum string <data1.txt # 54.3 secs
./Sum text <data1.txt # 2.29 secs
./Sum char8 <data1.txt # 1.34 secs
./Sum hand <data1.txt # 0.91 secs
I'm trying to calculate rolling hash values (buzzhash) for a big file using pipes.
Currently I have this, but I don't know how to write a pipe that maintains state.
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Bits(xor,rotate)
import Data.Array
import Pipes
import Control.Monad.State.Strict
import Control.Monad(forever)
produceFromList (x:xs) = do
  yield x
  produceFromList xs

buzzHash = do
  x <- await
  h <- lift $ get                          -- pull out previous value
  let h' = rotate h 1 `xor` (hashArrW8!x)  -- calculate new value
  lift $ put h'                            -- save new value
  yield h'

stdoutLn :: Consumer Word64 IO ()
stdoutLn = do
  a <- await
  lift $ print a

main = do
  bs <- L.unpack `fmap` L.getContents
  runEffect $ produceFromList bs >-> buzzHash >-> stdoutLn
hashArrW8 :: Array Word8 Word64
How do I make buzzHash save the previous value and use it in the calculation of the next value? The initial state value should be 0.
You were almost there; you just need to run the state.
main = do
  bs <- L.unpack `fmap` L.getContents
  flip execStateT 0 $ runEffect $ produceFromList bs >-> buzzHash >-> hoist lift stdoutLn
I assume you don't want to recover the state, so I use execStateT rather than runStateT.
The only curiosity here is that stdoutLn was declared as Consumer Word64 IO (), so I use hoist lift to make it a Consumer Word64 (StateT Word64 IO) (). Everything in the series a >-> b >-> c must agree in its underlying monad and return type.
Here are a few further comments that might save you time. First, produceFromList is just each.
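Spelled out (a one-line equivalence, assuming the Pipes import from your code; note that each also handles the empty list, which produceFromList above does not):

produceFromList :: Monad m => [a] -> Producer a m ()
produceFromList = each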
Moreover, you could have avoided the hoist lift by relabeling your stdoutLn:
stdoutLn :: MonadIO m => Consumer Word64 m ()
stdoutLn = do
  a <- await
  liftIO $ print a
But here there is some trouble: you are not repeating the action. This should pretty clearly be a loop:
stdoutLn :: MonadIO m => Consumer Word64 m ()
stdoutLn = do
  a <- await
  liftIO $ print a
  stdoutLn
In fact, this is already available as P.print, so we can write:
import qualified Pipes.Prelude as P
main = do
  bs <- L.unpack `fmap` L.getContents
  flip execStateT 0 $ runEffect $ each bs >-> buzzHash >-> P.print
If I understand you, buzzHash is meant to be repeated indefinitely too:
buzzHash = do
  x <- await
  h <- lift $ get                          -- pull out previous value
  let h' = rotate h 1 `xor` (hashArrW8!x)  -- calculate new value
  lift $ put h'                            -- save new value
  yield h'
  buzzHash
(This is forever buzzHash, reusing your original buzzHash.)
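Spelled with forever directly (Control.Monad.forever is already imported in your code):

buzzHashLoop = forever buzzHash   -- identical to writing the recursive call by hand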
Finally, if you
import qualified Pipes.ByteString as PB
import Control.Lens (view) -- (or Lens.Micro.MTL or Lens.Simple)
we see we don't need the lazy bytestring IO, which doesn't stream properly anyway.
Pipes.ByteString already has the unpack we want, packaged as a lens, so that we use view PB.unpack where elsewhere we would use B.unpack. So in the end we can write
main = flip evalStateT 0 $ runEffect $ view PB.unpack PB.stdin >-> buzzHash >-> P.print
Once it is in this form we see we aren't using the underlying state of the pipeline except in buzzHash, so we can localize this
import Pipes.Lift (evalStateP)
main = runEffect $ view PB.unpack PB.stdin >-> evalStateP 0 buzzHash >-> P.print
or, if you like, you can rewrite it as
buzzHash' :: Monad m => Word64 -> Pipe Word8 Word64 m r
buzzHash' n = evalStateP n $ forever $ do
  x <- await
  h <- lift $ get                          -- pull out previous value
  let h' = rotate h 1 `xor` (hashArrW8!x)  -- calculate new value
  lift $ put h'                            -- save new value
  yield h'
Then you would write
main = runEffect $ view PB.unpack PB.stdin >-> buzzHash' 0 >-> P.print
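For completeness, here is a self-contained sketch that assembles the pieces above into a runnable program. The hashArrW8 table below is only a stand-in (the real buzzhash table of 256 fixed Word64 values isn't shown in the question), so treat this as a demo skeleton rather than a faithful buzzhash:

import Data.Word (Word8, Word64)
import Data.Bits (xor, rotate)
import Data.Array (Array, listArray, (!))
import Control.Monad (forever)
import Control.Monad.State.Strict (get, put)
import Control.Lens (view)              -- or Lens.Micro.MTL / Lens.Simple
import Pipes
import Pipes.Lift (evalStateP)
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB

-- Placeholder table: a real buzzhash would use 256 fixed pseudo-random Word64s.
hashArrW8 :: Array Word8 Word64
hashArrW8 = listArray (0, 255) (iterate (* 2654435761) 1)

buzzHash' :: Monad m => Word64 -> Pipe Word8 Word64 m r
buzzHash' n = evalStateP n $ forever $ do
  x <- await
  h <- lift get
  let h' = rotate h 1 `xor` (hashArrW8 ! x)
  lift $ put h'
  yield h'

main :: IO ()
main = runEffect $ view PB.unpack PB.stdin >-> buzzHash' 0 >-> P.print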
I'm having trouble directing flow through a pipeline with haskell-pipes. Basically, I analyze a bunch of files and then I have to either
print results to the terminal in a human-friendly way
encode results to JSON
The chosen path depends upon a command line option.
In the second case, I have to output an opening bracket, then every incoming value separated by commas, and then a closing bracket. Currently insertCommas never terminates, so the closing bracket is never printed.
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy as B
import Data.Aeson (encode)

insertCommas :: Consumer B.ByteString IO ()
insertCommas = do
  first <- await
  lift $ B.putStr first
  for cat $ \obj -> lift $ do
    putStr ","
    B.putStr obj

jsonExporter :: Consumer (FilePath, AnalysisResult) IO ()
jsonExporter = do
  lift $ putStr "["
  P.map encode >-> insertCommas
  lift $ putStr "]"
exportStream :: Config -> Consumer (FilePath, AnalysisResult) IO ()
exportStream conf =
  case outputMode conf of
    JSON -> jsonExporter
    _    -> P.map (export conf) >-> P.stdoutLn

main :: IO ()
main = do
  -- The first two lines are Docopt stuff, not relevant
  args <- parseArgsOrExit patterns =<< getArgs
  ins <- allFiles $ args `getAllArgs` argument "paths"
  let conf = readConfig args
  runEffect $ each ins
          >-> P.mapM analyze
          >-> P.map (filterResults conf)
          >-> P.filter filterNulls
          >-> exportStream conf
AFAIK a Consumer cannot detect the end of a stream. In order to do that you need to use a Pipes.Parser and invert the control.
Here is a Parser which inserts commas between String elements:
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Parse (draw, evalStateT)
commify = do
  lift $ putStrLn "["
  m1 <- draw
  case m1 of
    Nothing -> lift $ putStrLn "]"
    Just x1 -> do
      lift $ putStrLn x1
      let loop = do mx <- draw
                    case mx of
                      Nothing -> lift $ putStrLn "]"
                      Just x  -> lift (putStr "," >> putStrLn x) >> loop
      loop
test1 = evalStateT commify ( mapM_ yield (words "this is a test") )
test2 = evalStateT commify P.stdinLn
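For reference, running test1 in GHCi should print something along these lines:

[
this
,is
,a
,test
]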
To handle the different output formats I would probably make both formats a Parser:
exportParser = do
  mx <- draw
  case mx of
    Nothing -> return ()
    Just x  -> (lift $ putStrLn $ export x) >> exportParser
and then:
let parser = case outputMode conf of
      JSON -> commify
      _    -> exportParser
evalStateT parser (each ins
                   >-> P.mapM analyze
                   >-> P.map (filterResults conf)
                   >-> P.filter filterNulls)
There is probably a slicker way to write exportParser in terms of foldAllM. You can also use the MaybeT transformer to more succinctly write the commify parser. I've written both out explicitly to make them easier to understand.
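For what it's worth, here is one possible shape for the foldAllM version. This is an untested sketch; it assumes foldAllM and Parser are also imported from Pipes.Parse and that export has the type used in the snippet above:

exportParser' :: Parser (FilePath, AnalysisResult) IO ()
exportParser' = foldAllM step (return ()) return
  where step () x = putStrLn (export x)  -- runs once per drawn element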
I think you should 'commify' with pipes-group. It has intercalates but not intersperse, though that's not a big deal to write. You should stay away from the Consumer end, I think, for this sort of problem.
{-#LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy.Char8 as B
import Pipes.Group
import Lens.Simple -- or Control.Lens or Lens.Micro or anything with view/^.
import System.Environment
intersperse_ :: Monad m => a -> Producer a m r -> Producer a m r
intersperse_ a producer = intercalates (yield a) (producer ^. chunksOf 1)

main = do
    args <- getArgs
    let op prod = case args of
          "json":_ -> yield "[" *> intersperse_ "," prod <* yield "]"
          _        -> intersperse_ " " prod
    runEffect $ op producer >-> P.mapM_ B.putStr
    putStrLn ""
  where
    producer = mapM_ yield (B.words "this is a test")
which gives me this:
>>> :main json
[this,is,a,test]
>>> :main ---
this is a test
My types:
data Test = Test {
a :: Int,
b :: Int
} deriving (Show)
My parser:
testParser :: Parser Test
testParser = do
  a <- decimal
  tab
  b <- decimal
  return $ Test a b

tab = char '\t'
Now in order to skip the first line, I do something like this:
import qualified System.IO as IO
parser :: Parser Test
parser = manyTill anyChar endOfLine *> testParser
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
  for (parsed (parser <* endOfLine) (fromHandle testHandle)) (lift . print)
But the above parser makes every alternate line skip (which is obvious). How can I skip only the first line in a way that fits the Pipes ecosystem (the Producer should still produce one Test value at a time)? This is one obvious solution, which I don't want (the code below will only work if I modify testParser to consume newlines), because it returns the entire [Test] instead of single values:
tests :: Parser [Test]
tests = manyTill anyChar endOfLine *>
        many1 testParser
Any ideas on how to tackle this problem?
You can drop the first line efficiently in constant space like this:
import Lens.Family (over)
import Pipes.Group (drops)
import Pipes.ByteString (lines)
import Prelude hiding (lines)
dropLine :: Monad m => Producer ByteString m r -> Producer ByteString m r
dropLine = over lines (drops 1)
You can apply dropLine to your Producer before you parse the Producer, like this:
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
  let p = dropLine (fromHandle testHandle)
  in  for (parsed (parser <* endOfLine) p) (lift . print)
If the first line doesn't contain any valid Test, you can use Either () Test in order to handle it:
parserEither :: Parser (Either () Test)
parserEither = Right <$> testParser <* endOfLine
           <|> Left  <$> (manyTill anyChar endOfLine *> pure ())
After this you can use the functions provided by Pipes.Prelude to get rid of the first result (and additionally of all non-parseable lines):
producer p = parsed parserEither p
         >-> P.drop 1
         >-> P.filter (either (const False) (const True))
         >-> P.map (\(Right x) -> x)
main = IO.withFile testFile IO.ReadMode $ \testHandle -> runEffect $
  for (producer (fromHandle testHandle)) (lift . print)
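As an aside (not from the original answer): since Either () Test is Foldable, the filter-and-map pair can probably be collapsed into Pipes.Prelude.concat, which yields only the values wrapped in Right:

producer' p = parsed parserEither p
          >-> P.drop 1   -- discard the result for the skipped first line
          >-> P.concat   -- keep only the Rights, unwrapped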