Lazy binary get

Lazy binary get - haskell

Why is Data.Binary.Get isn't lazy as it says? Or am I doing something wrong here?
import Data.ByteString.Lazy (pack)
import Data.Binary.Get (runGet, isEmpty, getWord8)
getWords = do
empty <- isEmpty
if empty
then return []
else do
w <- getWord8
ws <- getWords
return $ w:ws
main = print $ take 10 $ runGet getWords $ pack $ repeat 1
This main function just hangs instead of printing 10 words.

The documentation you linked provides several examples. The first one needs to read all the input before it can return and looks a lot like what you have written. The second one is a left-fold and processes the input in a streaming fashion. Here's your code rewritten in this style:
module Main where
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGetState, getWord8)
getWords :: BL.ByteString -> [Word8]
getWords input
| BL.null input = []
| otherwise =
let (w, rest, _) = runGetState getWord8 input 0
in w : getWords rest
main :: IO ()
main = print . take 10 . getWords . BL.pack . repeat $ 1
Testing:
*Main> :main
[1,1,1,1,1,1,1,1,1,1]

Related

Where to find chunks in haskell?

I'm trying to follow this tutorial: https://wiki.haskell.org/Tutorials/Programming_Haskell/String_IO.
In the last part 7 Extension: using SMP parallelism I copy the code but it fails to compile with this error message
/home/dhilst/parallelspell.hs:13:20: error:
Variable not in scope: chunk :: Int -> [String] -> t
I searched for chunks at Hoogle and got Data.Text.Internal.Lazy, but this seems to be an internal module. And I couldn't import it anyway.
Here is the code:
import Data.Set hiding (map)
import Data.Maybe
import Data.Char
import Text.Printf
import System.IO
import System.Environment
import Control.Concurrent
import Control.Monad
main = do
(f,g,n) <- readFiles
let dict = fromList (lines f)
work = chunk n (words g)
run n dict work
run n dict work = do
chan <- newChan
errs <- getChanContents chan
mapM_ (forkIO . thread chan dict) (zip [1..n] work)
wait n errs 0
wait n xs i = when (i < n) $ case xs of
Nothing : ys -> wait n ys $! i+1
Just s : ys -> putStrLn s >> wait n ys i
thread chan dict (me,xs) = do
mapM_ spellit xs
writeChan chan Nothing
where spellit w = when (spell dict w) $
writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w
spell d w = w `notMember` d
readFiles = do
[s,n] <- getArgs
f <- readFile "/usr/share/dict/words"
g <- readFile s
return (f,g, read n)
And here is the compilation line:
ghc -O --make -threaded parallelspell.hs
--
Update: I write my own version of chunk based on this quest:How to partition a list in Haskell?
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = (take n xs) : (chunk n (drop n xs))
Still, does this means that the tutorial that I'm following is very old and out of date!? Can anyone confirm if that function already existed some day or if I'm missing something?
Regards,

Looks like the tutorial just forgot to define chunk. I encourage you to update the wiki to include a suitable definition.

What is the fastest way to parse line with lots of Ints?

I'm learning Haskell for two years now and I'm still confused, whats the best (fastest) way to read tons of numbers from a single input line.
For learning I registered into hackerearth.com trying to solve every challenge in Haskell. But now I'm stuck with a challenge because I run into timeout issues. My program is just too slow for beeing accepted by the site.
Using the profiler I found out it takes 80%+ of the time for parsing a line with lots of integers. The percentage gets even higher when the number of values in the line increases.
Now this is the way, I'm reading numbers from an input line:
import qualified Data.ByteString.Char8 as C8
main = do
scores <- fmap (map (fst . fromJust . C8.readInt) . C8.words) C8.getLine :: IO [Int]
Is there any way to get the data faster into the variable?
BTW: The biggest testcase consist of a line with 200.000 9-digits values. Parsing takes incredible long (> 60s).

It's always difficult to declare a particular approach "the fastest", since there's almost always some way to squeeze out more performance. However, an approach using Data.ByteString.Char8 and the general method you suggest should be among the fastest methods for reading numbers. If you encounter a case where performance is poor, the problem likely lies elsewhere.
To give some concrete results, I generated a 191Meg file of 20 million 9-digit numbers, space-separate on a single line. I then tried several general methods of reading a line of numbers and printing their sum (which, for the record, was 10999281565534666). The obvious approach using String:
reader :: IO [Int]
reader = map read . words <$> getLine
sum' xs = sum xs -- work around GHC ticket 10992
main = print =<< sum' <$> reader
took 52secs; a similar approach using Text:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
readText = map parse . T.words <$> T.getLine
where parse s = let Right (n, _) = T.decimal s in n
ran in 2.4secs (but note that it would need to be modified to handle negative numbers!); and the same approach using Char8:
import qualified Data.ByteString.Char8 as C
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
where parse s = let Just (n, _) = C.readInt s in n
ran in 1.4secs. All examples were compiled with -O2 on GHC 8.0.2.
As a comparison benchmark, a scanf-based C implementation:
/* GCC 5.4.0 w/ -O3 */
#include <stdio.h>
int main()
{
long x, acc = 0;
while (scanf(" %ld", &x) == 1) {
acc += x;
}
printf("%ld\n", acc);
return 0;
}
ran in about 2.5secs, on par with the Text implementation.
You can squeeze a bit more performance out of the Char8 implementation. Using a hand-rolled parser:
readChar8' :: IO [Int]
readChar8' = parse <$> C.getLine
where parse = unfoldr go
go s = do (n, s1) <- C.readInt s
let s2 = C.dropWhile C.isSpace s1
return (n, s2)
runs in about 0.9secs -- I haven't tried to determine why there's a difference, but the compiler must be missing an opportunity to perform some optimization of the words-to-readInt pipeline.
Haskell Code for Reference
Make some numbers with Numbers.hs:
-- |Generate 20M 9-digit numbers:
-- ./Numbers 20000000 100000000 999999999 > data1.txt
import qualified Data.ByteString.Char8 as C
import Control.Monad
import System.Environment
import System.Random
main :: IO ()
main = do [n, a, b] <- map read <$> getArgs
nums <- replicateM n (randomRIO (a,b))
let _ = nums :: [Int]
C.putStrLn (C.unwords (map (C.pack . show) nums))
Find their sum with Sum.hs:
import Data.List
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
import qualified Data.Char8 as C
import qualified Data.ByteString.Char8 as C
import System.Environment
-- work around https://ghc.haskell.org/trac/ghc/ticket/10992
sum' xs = sum xs
readString :: IO [Int]
readString = map read . words <$> getLine
readText :: IO [Int]
readText = map parse . T.words <$> T.getLine
where parse s = let Right (n, _) = T.decimal s in n
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
where parse s = let Just (n, _) = C.readInt s in n
readHand :: IO [Int]
readHand = parse <$> C.getLine
where parse = unfoldr go
go s = do (n, s1) <- C.readInt s
let s2 = C.dropWhile C.isSpace s1
return (n, s2)
main = do [method] <- getArgs
let reader = case method of
"string" -> readString
"text" -> readText
"char8" -> readChar8
"hand" -> readHand
print =<< sum' <$> reader
where:
./Sum string <data1.txt # 54.3 secs
./Sum text <data1.txt # 2.29 secs
./Sum char8 <data1.txt # 1.34 secs
./Sum hand <data1.txt # 0.91 secs

Forking the streaming flow in haskell-pipes

I'm having trouble directing flow though a pipeline with haskell-pipes. Basically, I analyze a bunch of files and then I have to either
print results to the terminal in a human-friendly way
encode results to JSON
The chosen path depends upon a command line option.
In the second case, I have to output an opening bracket, then every incoming value followed by a comma and then a closing bracket. Currently insertCommas never terminates, so the closing bracket is never outputted.
import Pipes
import Data.ByteString.Lazy as B
import Data.Aeson (encode)
insertCommas :: Consumer B.ByteString IO ()
insertCommas = do
first <- await
lift $ B.putStr first
for cat $ \obj -> lift $ do
putStr ","
B.putStr obj
jsonExporter :: Consumer (FilePath, AnalysisResult) IO ()
jsonExporter = do
lift $ putStr "["
P.map encode >-> insertCommas
lift $ putStr "]"
exportStream :: Config -> Consumer (FilePath, AnalysisResult) IO ()
exportStream conf =
case outputMode conf of
JSON -> jsonExporter
_ -> P.map (export conf) >-> P.stdoutLn
main :: IO ()
main = do
-- The first two lines are Docopt stuff, not relevant
args <- parseArgsOrExit patterns =<< getArgs
ins <- allFiles $ args `getAllArgs` argument "paths"
let conf = readConfig args
runEffect $ each ins
>-> P.mapM analyze
>-> P.map (filterResults conf)
>-> P.filter filterNulls
>-> exportStream conf

AFAIK a Consumer cannot detect the end of a stream. In order to do that you need to use a Pipes.Parser and invert the control.
Here is a Parser which inserts commas between String elements:
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Parse (draw, evalStateT)
commify = do
lift $ putStrLn "["
m1 <- draw
case m1 of
Nothing -> lift $ putStrLn "]"
Just x1 -> do
lift $ putStrLn x1
let loop = do mx <- draw
case mx of
Nothing -> lift $ putStrLn "]"
Just x -> lift (putStr "," >> putStrLn x) >> loop
loop
test1 = evalStateT commify ( mapM_ yield (words "this is a test") )
test2 = evalStateT commify P.stdinLn
To handle the different output formats I would probably make both formats a Parser:
exportParser = do
mx <- draw
case mx of
Nothing -> return ()
Just x -> (lift $ putStrLn $ export x) >> exportParser
and then:
let parser = case outputMode of
JSON -> commify
_ -> exportParser
evalStateT parser (P.mapM analyze
>-> P.map (filterResults conf)
>-> P.filter filterNulls)
There is probably a slicker way to write exportParser in terms of foldAllM. You can also use the MaybeT transformer to more succinctly write the commify parser. I've written both out explicitly to make them easier to understand.

I think you should 'commify' with pipes-group. It has an intercalates, but not an intersperse, but it's not a big deal to write. You should stay away from the Consumer end, I think, for this sort of problem.
{-#LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Lazy.Char8 as B
import Pipes.Group
import Lens.Simple -- or Control.Lens or Lens.Micro or anything with view/^.
import System.Environment
intersperse_ :: Monad m => a -> Producer a m r -> Producer a m r
intersperse_ a producer = intercalates (yield a) (producer ^. chunksOf 1)
main = do
args <- getArgs
let op prod = case args of
"json":_ -> yield "[" *> intersperse_ "," prod <* yield "]"
_ -> intersperse_ " " prod
runEffect $ op producer >-> P.mapM_ B.putStr
putStrLn ""
where
producer = mapM_ yield (B.words "this is a test")
which give me this
>>> :main json
[this,is,a,test]
>>> :main ---
this is a test

Read a list of integers lazily as a bytestring

I'm trying to find the sum of integers in a file. The code using the normal string is:
main = do
contents <- getContents
L.putStrLn (sumFile contents)
where sumFile = sum . map read. words
I tried to change it to use the Data.ByteString.Lazy module like this:
import Data.ByteString.Lazy as L
main = do
contents <- L.getContents
L.putStrLn (sumFile contents)
where sumFile = sum . L.map read. words
But this refused as words was returning a string. Then I tried using Data.ByteString.Char8 but it used a strict ByteString.
How can I make this function completely lazy?

I found a slightly length workaround to reading the file as a ByteString and then as a list of integers. Thanks to #melpomene
import Data.ByteString.Lazy.Char8 as L
main = do
contents <- L.getContents
print (sumFile contents)
where sumFile x = sum $ Prelude.map tups $ Prelude.map L.readInt (L.words x)
where read' = tups.(L.readInt)
tups :: (Num a) => (Maybe (a, b)) -> a
tups (Just (a,b)) = a
tups Nothing = 0

Converting a bytestring to a list of 7 bits bytes

I have to convert a ByteString into a list of 7 bits bytes. For example, bytes with a, b, c, d etc. bits:
abcdefgh ijklmnop qrstuvwx yz...
should be converted to:
abcdefg hijklmn opqrstu vwxyz...
I use the Binary-Bits package in order to do it. My convert8to7 function is recursive but the Binary-Bits does not provide any mean to check for the lack of bits whereas the Get monad does have isEmpty or remaining functions.
Here’s my code:
import Data.Word
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import Data.ByteString.Lazy.Char8
convert8to7 :: BitGet [Word8]
convert8to7 = do
bits <- getWord8 7
rest <- convert8to7
return (bits : rest)
main :: IO ()
main = do
let datas = pack "Hello world!"
print $ runGet (runBitGet convert8to7) datas
When I run this code, it logically says:
Data.Binary.Get.runGet at position 12: demandInput: not enough bytes
Can I do this conversion with Binary-Bits or should I look for an other package ?
Update
Here’s my code based on user5402 answer:
import Data.Word
import Data.Bits
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import qualified Data.ByteString.Lazy.Char8 as BS
convert87 :: Int -> BitGet [Word8]
convert87 n
| n == 0 = return []
| n < 7 = do bits <- getWord8 n
return [shiftL bits (7 - n)]
| otherwise = do bits <- getWord8 7
rest <- convert87 (n-7)
return $ bits : rest
to87 :: BS.ByteString -> [Word8]
to87 datas = runGet (runBitGet (convert87 len)) datas
where len = fromIntegral $ BS.length datas * 8
main :: IO ()
main = do
let datas = BS.pack "Hello world!"
print $ to87 datas

The problem is that you need to keep track of the number of bits to decode - the BitGet monad doesn't know when the end of input has been reached.
Try this:
import Data.Word
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import Data.ByteString.Lazy.Char8
import qualified Data.ByteString.Lazy.Char8 as BS
convert87 :: Int -> BitGet [Word8]
convert87 n
| n < 7 = do bits <- getWord8 n
return [bits]
| otherwise = do bits <- getWord8 7
rest <- convert87 (n-7)
return $ bits : rest
main :: IO ()
main = do
let datas = pack "Hello world!"
len = fromIntegral $ BS.length datas * 8
print $ runGet (runBitGet (convert87 len)) datas
Update: Here is the way to detect end of input in the Get monad (on top of which the BitGet monad is implemented). It relies on the Alternative class for Get. The function chunks7 breaks up a byte string into chunks of 7 with any remainder going into the last chunk.
As far as I can tell, BitGet does not implement the Alternative class - although I'm sure it could.
import Data.Word (Word8)
import Data.Binary.Get
import Data.ByteString.Lazy.Char8
import qualified Data.ByteString as BSW
import qualified Data.ByteString.Lazy as BSL
import Control.Applicative -- used for (<|>)
chunks7 :: Get [[Word8]]
chunks7 = do
b <- isEmpty
if b
then return []
else do chunk <- fmap BSW.unpack (getByteString 7)
<|> fmap BSL.unpack getRemainingLazyByteString
rest <- chunks7
return $ chunk : rest
main :: IO ()
main = do
let datas = pack "Hello world! This is a test"
print $ runGet chunks7 datas

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Lazy binary get - haskell

Related

Where to find chunks in haskell?

What is the fastest way to parse line with lots of Ints?

Forking the streaming flow in haskell-pipes

Read a list of integers lazily as a bytestring

Converting a bytestring to a list of 7 bits bytes

Categories

Resources