Converting a ByteString to a list of 7-bit bytes - Haskell

I have to convert a ByteString into a list of 7-bit bytes. For example, bytes with bits a, b, c, d, etc.:
abcdefgh ijklmnop qrstuvwx yz...
should be converted to:
abcdefg hijklmn opqrstu vwxyz...
I use the binary-bits package to do it. My convert8to7 function is recursive, but binary-bits does not provide any means to check for the lack of bits, whereas the Get monad does have isEmpty and remaining functions.
Here’s my code:
import Data.Word
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import Data.ByteString.Lazy.Char8

convert8to7 :: BitGet [Word8]
convert8to7 = do
  bits <- getWord8 7
  rest <- convert8to7
  return (bits : rest)

main :: IO ()
main = do
  let datas = pack "Hello world!"
  print $ runGet (runBitGet convert8to7) datas
When I run this code, it logically says:
Data.Binary.Get.runGet at position 12: demandInput: not enough bytes
Can I do this conversion with binary-bits or should I look for another package?
Update
Here’s my code based on user5402's answer:
import Data.Word
import Data.Bits
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import qualified Data.ByteString.Lazy.Char8 as BS

convert87 :: Int -> BitGet [Word8]
convert87 n
  | n == 0    = return []
  | n < 7     = do bits <- getWord8 n
                   return [shiftL bits (7 - n)]
  | otherwise = do bits <- getWord8 7
                   rest <- convert87 (n-7)
                   return $ bits : rest

to87 :: BS.ByteString -> [Word8]
to87 datas = runGet (runBitGet (convert87 len)) datas
  where len = fromIntegral $ BS.length datas * 8

main :: IO ()
main = do
  let datas = BS.pack "Hello world!"
  print $ to87 datas

The problem is that you need to keep track of the number of bits to decode - the BitGet monad doesn't know when the end of input has been reached.
Try this:
import Data.Word
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import Data.ByteString.Lazy.Char8
import qualified Data.ByteString.Lazy.Char8 as BS

convert87 :: Int -> BitGet [Word8]
convert87 n
  | n < 7     = do bits <- getWord8 n
                   return [bits]
  | otherwise = do bits <- getWord8 7
                   rest <- convert87 (n-7)
                   return $ bits : rest

main :: IO ()
main = do
  let datas = pack "Hello world!"
      len   = fromIntegral $ BS.length datas * 8
  print $ runGet (runBitGet (convert87 len)) datas
Update: Here is the way to detect end of input in the Get monad (on top of which the BitGet monad is implemented). It relies on the Alternative class for Get. The function chunks7 breaks up a byte string into chunks of 7 with any remainder going into the last chunk.
As far as I can tell, BitGet does not implement the Alternative class - although I'm sure it could.
import Data.Word (Word8)
import Data.Binary.Get
import Data.ByteString.Lazy.Char8
import qualified Data.ByteString as BSW
import qualified Data.ByteString.Lazy as BSL
import Control.Applicative -- used for (<|>)

chunks7 :: Get [[Word8]]
chunks7 = do
  b <- isEmpty
  if b
    then return []
    else do chunk <- fmap BSW.unpack (getByteString 7)
                       <|> fmap BSL.unpack getRemainingLazyByteString
            rest <- chunks7
            return $ chunk : rest

main :: IO ()
main = do
  let datas = pack "Hello world! This is a test"
  print $ runGet chunks7 datas
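As a follow-up sketch (not part of the original answer), the 7-byte chunks could then be pushed back through a bit-level parser such as the convert87 defined earlier; this assumes convert87, runBitGet and the imports from both snippets above are in scope.

-- Hedged sketch: convert each chunk from chunks7 into its 7-bit values by
-- re-running the bit parser on just that chunk. A short final chunk is
-- handled by convert87's (n < 7) guard.
chunksTo7Bit :: BSL.ByteString -> [Word8]
chunksTo7Bit datas = concatMap perChunk (runGet chunks7 datas)
  where
    perChunk ws =
      let bs  = BSL.pack ws                       -- raw bytes of one chunk
          len = fromIntegral (BSL.length bs) * 8  -- number of bits in the chunk
      in runGet (runBitGet (convert87 len)) bs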

Related

Where to find chunks in Haskell?

I'm trying to follow this tutorial: https://wiki.haskell.org/Tutorials/Programming_Haskell/String_IO.
In the last part, 7 Extension: using SMP parallelism, I copied the code, but it fails to compile with this error message:
/home/dhilst/parallelspell.hs:13:20: error:
Variable not in scope: chunk :: Int -> [String] -> t
I searched for chunks at Hoogle and got Data.Text.Internal.Lazy, but this seems to be an internal module. And I couldn't import it anyway.
Here is the code:
import Data.Set hiding (map)
import Data.Maybe
import Data.Char
import Text.Printf
import System.IO
import System.Environment
import Control.Concurrent
import Control.Monad

main = do
  (f,g,n) <- readFiles
  let dict = fromList (lines f)
      work = chunk n (words g)
  run n dict work

run n dict work = do
  chan <- newChan
  errs <- getChanContents chan
  mapM_ (forkIO . thread chan dict) (zip [1..n] work)
  wait n errs 0

wait n xs i = when (i < n) $ case xs of
  Nothing : ys -> wait n ys $! i+1
  Just s  : ys -> putStrLn s >> wait n ys i

thread chan dict (me,xs) = do
  mapM_ spellit xs
  writeChan chan Nothing
  where spellit w = when (spell dict w) $
                      writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w

spell d w = w `notMember` d

readFiles = do
  [s,n] <- getArgs
  f <- readFile "/usr/share/dict/words"
  g <- readFile s
  return (f,g, read n)
And here is the compilation line:
ghc -O --make -threaded parallelspell.hs
Update: I wrote my own version of chunk based on this question: How to partition a list in Haskell?
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = (take n xs) : (chunk n (drop n xs))
Still, does this mean that the tutorial I'm following is very old and out of date? Can anyone confirm whether that function ever existed or whether I'm missing something?
Looks like the tutorial just forgot to define chunk. I encourage you to update the wiki to include a suitable definition.
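If you would rather not define it yourself, one hedged alternative (not mentioned in the tutorial) is chunksOf from the split package, which behaves the same way as the chunk above:

import Data.List.Split (chunksOf)

-- chunksOf n xs breaks xs into pieces of length n, with any remainder going
-- into the last piece, so it can stand in for the hand-written chunk:
work :: Int -> String -> [[String]]
work n g = chunksOf n (words g)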

What is the fastest way to parse a line with lots of Ints?

I've been learning Haskell for two years now and I'm still confused about the best (fastest) way to read tons of numbers from a single input line.
For practice I registered on hackerearth.com, trying to solve every challenge in Haskell. But now I'm stuck on a challenge because I run into timeout issues. My program is just too slow to be accepted by the site.
Using the profiler I found that 80%+ of the time is spent parsing a line with lots of integers, and the percentage gets even higher as the number of values in the line increases.
This is how I currently read numbers from an input line:
import Data.Maybe (fromJust)
import qualified Data.ByteString.Char8 as C8

main = do
  scores <- fmap (map (fst . fromJust . C8.readInt) . C8.words) C8.getLine :: IO [Int]
Is there any way to get the data faster into the variable?
BTW: The biggest test case consists of a line with 200,000 9-digit values. Parsing takes incredibly long (> 60s).
It's always difficult to declare a particular approach "the fastest", since there's almost always some way to squeeze out more performance. However, an approach using Data.ByteString.Char8 and the general method you suggest should be among the fastest methods for reading numbers. If you encounter a case where performance is poor, the problem likely lies elsewhere.
To give some concrete results, I generated a 191MB file of 20 million 9-digit numbers, space-separated on a single line. I then tried several general methods of reading a line of numbers and printing their sum (which, for the record, was 10999281565534666). The obvious approach using String:
reader :: IO [Int]
reader = map read . words <$> getLine
sum' xs = sum xs -- work around GHC ticket 10992
main = print =<< sum' <$> reader
took 52secs; a similar approach using Text:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n
ran in 2.4secs (but note that it would need to be modified to handle negative numbers!); and the same approach using Char8:
import qualified Data.ByteString.Char8 as C
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n
ran in 1.4secs. All examples were compiled with -O2 on GHC 8.0.2.
As a comparison benchmark, a scanf-based C implementation:
/* GCC 5.4.0 w/ -O3 */
#include <stdio.h>

int main()
{
    long x, acc = 0;
    while (scanf(" %ld", &x) == 1) {
        acc += x;
    }
    printf("%ld\n", acc);
    return 0;
}
ran in about 2.5secs, on par with the Text implementation.
You can squeeze a bit more performance out of the Char8 implementation. Using a hand-rolled parser:
readChar8' :: IO [Int]
readChar8' = parse <$> C.getLine
  where parse = unfoldr go
        go s  = do (n, s1) <- C.readInt s
                   let s2 = C.dropWhile C.isSpace s1
                   return (n, s2)
runs in about 0.9secs -- I haven't tried to determine why there's a difference, but the compiler must be missing an opportunity to perform some optimization of the words-to-readInt pipeline.
Haskell Code for Reference
Make some numbers with Numbers.hs:
-- |Generate 20M 9-digit numbers:
-- ./Numbers 20000000 100000000 999999999 > data1.txt
import qualified Data.ByteString.Char8 as C
import Control.Monad
import System.Environment
import System.Random

main :: IO ()
main = do [n, a, b] <- map read <$> getArgs
          nums <- replicateM n (randomRIO (a,b))
          let _ = nums :: [Int]
          C.putStrLn (C.unwords (map (C.pack . show) nums))
Find their sum with Sum.hs:
import Data.List
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
import qualified Data.Char8 as C
import qualified Data.ByteString.Char8 as C
import System.Environment

-- work around https://ghc.haskell.org/trac/ghc/ticket/10992
sum' xs = sum xs

readString :: IO [Int]
readString = map read . words <$> getLine

readText :: IO [Int]
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n

readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n

readHand :: IO [Int]
readHand = parse <$> C.getLine
  where parse = unfoldr go
        go s  = do (n, s1) <- C.readInt s
                   let s2 = C.dropWhile C.isSpace s1
                   return (n, s2)

main = do [method] <- getArgs
          let reader = case method of
                         "string" -> readString
                         "text"   -> readText
                         "char8"  -> readChar8
                         "hand"   -> readHand
          print =<< sum' <$> reader
With timings:
./Sum string <data1.txt # 54.3 secs
./Sum text <data1.txt # 2.29 secs
./Sum char8 <data1.txt # 1.34 secs
./Sum hand <data1.txt # 0.91 secs
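For reference, a possible way to reproduce the timings (assumed; the answer only states that everything was compiled with -O2 on GHC 8.0.2):

ghc -O2 Numbers.hs
ghc -O2 Sum.hs
./Numbers 20000000 100000000 999999999 > data1.txt
./Sum char8 < data1.txt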

Why does decodeFile throw "not enough bytes" error?

Why does decodeFile from Data.Binary throw the error *** Exception: demandInput: not enough bytes, when using decode and readFile from Data.ByteString.Lazy works as expected?
A minimal example
module Testing where

import Data.Binary
import qualified Data.Map.Strict as Map
import qualified Data.ByteString.Lazy as B

x :: Map.Map Char Int
x = Map.fromList $ zip ['a'..'d'] [1,2,3,4]

main :: IO (Map.Map Char Int)
main = do
  encodeFile "testing" (encode x)
  r <- decodeFile "testing"
  print $ r == x
  return r

main2 :: IO (Map.Map Char Int)
main2 = do
  B.writeFile "testing2" (encode x)
  r <- fmap decode $ B.readFile "testing2"
  print $ r == x
  return r
I found this potentially relevant question: Using Data.Binary.decodeFile, encountered error "demandInput: not enough bytes" but it does not really answer my question, which is why using decodeFile doesn't work but readFile does.
The output I get:
[1 of 1] Compiling Testing ( testing.hs, interpreted )
Ok, modules loaded: Testing.
*Testing> main
*** Exception: demandInput: not enough bytes
*Testing> main2
True
fromList [('a',1),('b',2),('c',3),('d',4)]
You have
encodeFile "testing" (encode x)
which means you're encoding x twice. Changing that to
encodeFile "testing" x
works as expected.
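Spelled out, a minimal sketch of the corrected main for the module above:

-- encodeFile already serialises its argument, so pass x itself; decodeFile
-- then undoes exactly one layer of encoding.
main :: IO (Map.Map Char Int)
main = do
  encodeFile "testing" x
  r <- decodeFile "testing"
  print $ r == x
  return r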

How to improve BitGet performance

I'm developing a binary parsing program in Haskell.
I've found that both the strict and the lazy BitGet seem to be very slow and, surprisingly, allocate a lot of memory.
I tested the code below (built with -O2), which parses every bit of the input file, and looked at the profiling results.
For this example, I used a 1,819,173-byte binary file.
Strict version:
import Prelude as P
import System.Environment (getArgs)
import Data.ByteString as B
import Data.Binary.Strict.BitGet

coreFunc :: Int -> BitGet Int
coreFunc len = f len 0
  where
    f 0 r = return r
    f l _ = do
      b <- getBit
      f (l - 1) $ if b then 1 else 0

mainFunc :: B.ByteString -> IO ()
mainFunc bs =
  case runBitGet bs (coreFunc ((B.length bs) * 8)) of
    Left emsg -> error emsg
    Right r   -> print $ show r

main :: IO ()
main = do
  args <- getArgs
  case args of
    []    -> return ()
    (x:_) -> do
      bs <- B.readFile x
      mainFunc bs
      return ()

-- profiling result --
total time  = 1.74 secs (1741 ticks @ 1000 us, 1 processor)
total alloc = 7,948,043,192 bytes (excludes profiling overheads)
Lazy version:
import Prelude as P
import System.Environment (getArgs)
import Data.ByteString.Lazy as B
import Data.Binary.Bits.Get
import Data.Binary.Get
import Data.Int (Int64)

coreFunc :: Int64 -> BitGet Int
coreFunc len = f len 0
  where
    f 0 r = return r
    f l _ = do
      b <- getBool
      f (l - 1) $ if b then 1 else 0

mainFunc :: B.ByteString -> IO ()
mainFunc bs = do
  let r = runGet (runBitGet (coreFunc ((B.length bs) * 8))) bs
  print $ show r

main :: IO ()
main = do
  args <- getArgs
  case args of
    []    -> return ()
    (x:_) -> do
      bs <- B.readFile x
      mainFunc bs
      return ()

-- profiling result --
total time  = 2.21 secs (2207 ticks @ 1000 us, 1 processor)
total alloc = 6,405,531,680 bytes (excludes profiling overheads)
I want to ask:
How can I improve this performance?
Can I profile what the BitGet library is doing internally?
Is there another way to parse binary bits?
It seems like your coreFunc is supposed to skip forward some (len - 1) number of bits, then read a single bit as a 0 or 1 and return it in the BitGet monad. If that's the intent, something like this will be much more efficient.
I'm using the binary-bits package:
import Control.Applicative
import Data.Binary.Bits.Get
import Data.Binary.Get

coreFunc :: Int -> Get Int
coreFunc len =
  fromEnum <$> runBitGet (block (skip (len - 1) *> bool))

skip :: Int -> Block ()
skip n = byteString bytes *> word64be bits *> pure ()
  where (bytes, bits) = quotRem n 8 -- 8 bits per Word8
Unfortunately the package does not have a skip function to let us skip n bits, which the binary package it's based on does include, so I've had to write my own. It's possible a more efficient version could be written with access to the Block internals, but the library might already be optimizing it well enough that there's no benefit.
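For contrast, this is the byte-level skip that binary itself provides (a tiny sketch, just to show the function being referred to):

import Data.Word (Word8)
import Data.Binary.Get (Get, skip, getWord8)

-- Discard the next 3 bytes, then read one byte. binary-bits exposes no
-- bit-granularity equivalent, hence the hand-rolled skip above.
skipThreeBytes :: Get Word8
skipThreeBytes = skip 3 >> getWord8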
I'd like to benchmark this version against yours to get an accurate comparison; can you provide the binary file you use for testing?

Lazy binary get

Why isn't Data.Binary.Get lazy, as it says? Or am I doing something wrong here?
import Data.ByteString.Lazy (pack)
import Data.Binary.Get (runGet, isEmpty, getWord8)

getWords = do
  empty <- isEmpty
  if empty
    then return []
    else do
      w <- getWord8
      ws <- getWords
      return $ w:ws

main = print $ take 10 $ runGet getWords $ pack $ repeat 1
This main function just hangs instead of printing 10 words.
The documentation you linked provides several examples. The first one needs to read all the input before it can return and looks a lot like what you have written. The second one is a left-fold and processes the input in a streaming fashion. Here's your code rewritten in this style:
module Main where

import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGetState, getWord8)

getWords :: BL.ByteString -> [Word8]
getWords input
  | BL.null input = []
  | otherwise =
      let (w, rest, _) = runGetState getWord8 input 0
      in w : getWords rest

main :: IO ()
main = print . take 10 . getWords . BL.pack . repeat $ 1
Testing:
*Main> :main
[1,1,1,1,1,1,1,1,1,1]
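A hedged aside, not from the original answer: newer versions of the binary package deprecate runGetState, and a similar streaming loop can be written with runGetOrFail, which also hands back the unconsumed input.

import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGetOrFail, getWord8)

-- Decode one Word8 at a time, threading the leftover input through; combined
-- with take 10 this should stream over an infinite input just like the
-- version above (sketch only, assuming binary >= 0.7).
getWords' :: BL.ByteString -> [Word8]
getWords' input
  | BL.null input = []
  | otherwise =
      case runGetOrFail getWord8 input of
        Left _             -> []
        Right (rest, _, w) -> w : getWords' rest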
