Why does decodeFile throw "not enough bytes" error? - haskell

Why does decodeFile from Data.Binary throw the error *** Exception: demandInput: not enough bytes, when using decode and readFile from Data.ByteString.Lazy works as expected?
A minimal example
module Testing where
import Data.Binary
import qualified Data.Map.Strict as Map
import qualified Data.ByteString.Lazy as B
x :: Map.Map Char Int
x = Map.fromList $ zip ['a'..'d'] [1,2,3,4]
main :: IO (Map.Map Char Int)
main = do
  encodeFile "testing" (encode x)
  r <- decodeFile "testing"
  print $ r == x
  return r
main2 :: IO (Map.Map Char Int)
main2 = do
  B.writeFile "testing2" (encode x)
  r <- fmap decode $ B.readFile "testing2"
  print $ r == x
  return r
I found this potentially relevant question: Using Data.Binary.decodeFile, encountered error "demandInput: not enough bytes" but it does not really answer my question, which is why using decodeFile doesn't work but readFile does.
The output I get:
[1 of 1] Compiling Testing ( testing.hs, interpreted )
Ok, modules loaded: Testing.
*Testing> main
*** Exception: demandInput: not enough bytes
*Testing> main2
True
fromList [('a',1),('b',2),('c',3),('d',4)]

You have
encodeFile "testing" (encode x)
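which means you're encoding x twice: encodeFile already serializes its argument, so the file ends up holding an encoded ByteString rather than an encoded Map. Changing that to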
encodeFile "testing" x
works as expected.
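To make the double encoding visible without touching the file system, here is a minimal in-memory sketch. The names doubleEncoded, inner, and recovered are made up for illustration. As far as I can tell from the instances in binary, a lazy ByteString is written as a length prefix followed by its bytes, so a Map decoder would read that prefix as an entry count and then run out of input.

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as B
import qualified Data.Map.Strict as Map

x :: Map.Map Char Int
x = Map.fromList (zip ['a'..'d'] [1,2,3,4])

-- What `encodeFile "testing" (encode x)` effectively writes:
-- the encoding of a ByteString, not the encoding of the map.
doubleEncoded :: B.ByteString
doubleEncoded = encode (encode x)

-- Decoding once at type ByteString peels off the outer layer ...
inner :: B.ByteString
inner = decode doubleEncoded

-- ... and decoding again at the map type recovers the original value.
recovered :: Map.Map Char Int
recovered = decode inner
```

In GHCi, recovered == x evaluates to True. Evaluating decode doubleEncoded :: Map.Map Char Int is the in-memory counterpart of the failing decodeFile call and should fail in the same way.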

Related

Where to find chunks in haskell?

I'm trying to follow this tutorial: https://wiki.haskell.org/Tutorials/Programming_Haskell/String_IO.
In the last part, "7 Extension: using SMP parallelism", I copied the code, but it fails to compile with this error message:
/home/dhilst/parallelspell.hs:13:20: error:
    Variable not in scope: chunk :: Int -> [String] -> t
I searched for chunks on Hoogle and found Data.Text.Internal.Lazy, but this seems to be an internal module, and I couldn't import it anyway.
Here is the code:
import Data.Set hiding (map)
import Data.Maybe
import Data.Char
import Text.Printf
import System.IO
import System.Environment
import Control.Concurrent
import Control.Monad
main = do
    (f,g,n) <- readFiles
    let dict = fromList (lines f)
        work = chunk n (words g)
    run n dict work
run n dict work = do
    chan <- newChan
    errs <- getChanContents chan
    mapM_ (forkIO . thread chan dict) (zip [1..n] work)
    wait n errs 0
wait n xs i = when (i < n) $ case xs of
    Nothing : ys -> wait n ys $! i+1
    Just s : ys -> putStrLn s >> wait n ys i
thread chan dict (me,xs) = do
    mapM_ spellit xs
    writeChan chan Nothing
  where spellit w = when (spell dict w) $
            writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w
spell d w = w `notMember` d
readFiles = do
    [s,n] <- getArgs
    f <- readFile "/usr/share/dict/words"
    g <- readFile s
    return (f,g, read n)
And here is the compilation line:
ghc -O --make -threaded parallelspell.hs
--
Update: I wrote my own version of chunk based on this question: How to partition a list in Haskell?
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = (take n xs) : (chunk n (drop n xs))
Still, does this mean that the tutorial I'm following is very old and out of date? Can anyone confirm whether that function existed at some point, or am I missing something?
Regards,
Looks like the tutorial just forgot to define chunk. I encourage you to update the wiki to include a suitable definition.
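One caveat about the definition in the update: it produces pieces of length n, whereas the program zips the work list with [1..n] and wait expects exactly n threads to report back. That suggests the tutorial wants n pieces, one per thread. A minimal sketch under that assumption (chunkInto is a hypothetical name, not something taken from the wiki):

```haskell
-- Hypothetical replacement for the missing `chunk`: split a list into
-- exactly n contiguous pieces whose lengths differ by at most one.
chunkInto :: Int -> [a] -> [[a]]
chunkInto n xs = go n (length xs) xs
  where
    go k len ys
      | k <= 0    = []
      | otherwise =
          let size          = (len + k - 1) `div` k  -- ceiling of len / k
              (piece, rest) = splitAt size ys
          in piece : go (k - 1) (len - size) rest
```

Because it calls length, this version forces the whole word list before any thread starts, which seems acceptable for a spell checker that reads the full input anyway. When there are fewer words than threads, the trailing pieces are empty, so every thread still starts and reports completion.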

What is the fastest way to parse line with lots of Ints?

I've been learning Haskell for two years now and I'm still confused about the best (fastest) way to read tons of numbers from a single input line.
For practice I registered at hackerearth.com and am trying to solve every challenge in Haskell. But now I'm stuck on a challenge because I run into timeout issues: my program is just too slow to be accepted by the site.
Using the profiler I found out it takes 80%+ of the time for parsing a line with lots of integers. The percentage gets even higher when the number of values in the line increases.
Now this is the way, I'm reading numbers from an input line:
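import qualified Data.ByteString.Char8 as C8
import Data.Maybe (fromJust)
main = do
  scores <- fmap (map (fst . fromJust . C8.readInt) . C8.words) C8.getLine :: IO [Int]
  print (length scores) -- placeholder: the real solution processes scores here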
Is there any way to get the data into the variable faster?
BTW: The biggest test case consists of a line with 200,000 9-digit values. Parsing takes incredibly long (> 60s).
It's always difficult to declare a particular approach "the fastest", since there's almost always some way to squeeze out more performance. However, an approach using Data.ByteString.Char8 and the general method you suggest should be among the fastest methods for reading numbers. If you encounter a case where performance is poor, the problem likely lies elsewhere.
To give some concrete results, I generated a 191 MB file of 20 million 9-digit numbers, space-separated on a single line. I then tried several general methods of reading a line of numbers and printing their sum (which, for the record, was 10999281565534666). The obvious approach using String:
reader :: IO [Int]
reader = map read . words <$> getLine
sum' xs = sum xs -- work around GHC ticket 10992
main = print =<< sum' <$> reader
took 52secs; a similar approach using Text:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n
ran in 2.4secs (but note that it would need to be modified to handle negative numbers!); and the same approach using Char8:
import qualified Data.ByteString.Char8 as C
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n
ran in 1.4secs. All examples were compiled with -O2 on GHC 8.0.2.
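On the remark about negative numbers in the Text version: Data.Text.Read provides signed, which wraps a reader such as decimal so that it accepts an optional leading sign. A minimal sketch follows; parseSigned is an illustrative name, it works on an in-memory Text rather than on stdin, and it has not been benchmarked against the timings quoted here.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Read as T

-- Parse whitespace-separated integers, accepting an optional leading sign.
parseSigned :: T.Text -> [Int]
parseSigned = map parse . T.words
  where
    parse s = case T.signed T.decimal s of
      Right (n, _) -> n
      Left err     -> error ("not an integer: " ++ err)
```

Replacing the irrefutable let Right (n, _) = ... pattern with an explicit case also gives a readable error message on malformed input.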
As a comparison benchmark, a scanf-based C implementation:
/* GCC 5.4.0 w/ -O3 */
#include <stdio.h>
int main()
{
    long x, acc = 0;
    while (scanf(" %ld", &x) == 1) {
        acc += x;
    }
    printf("%ld\n", acc);
    return 0;
}
ran in about 2.5secs, on par with the Text implementation.
You can squeeze a bit more performance out of the Char8 implementation. Using a hand-rolled parser:
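import Data.List (unfoldr)
import qualified Data.Char as C -- for C.isSpace; shares the C alias with Data.ByteString.Char8
readChar8' :: IO [Int]
readChar8' = parse <$> C.getLine
  where parse = unfoldr go
        go s = do (n, s1) <- C.readInt s
                  let s2 = C.dropWhile C.isSpace s1
                  return (n, s2)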
runs in about 0.9secs -- I haven't tried to determine why there's a difference, but the compiler must be missing an opportunity to perform some optimization of the words-to-readInt pipeline.
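For experimenting in GHCi without a data file, the same unfoldr idea can be written as a pure function over a ByteString. This is a sketch rather than the benchmarked code: parseInts is an illustrative name, and it skips whitespace before each number instead of after it, so leading spaces are tolerated as well.

```haskell
import qualified Data.ByteString.Char8 as C
import Data.Char (isSpace)
import Data.List (unfoldr)

-- Repeatedly drop whitespace and read one Int; C.readInt already returns
-- Maybe (Int, rest), which is exactly the step shape unfoldr expects.
-- Parsing stops at the end of input or at the first token that is not a number.
parseInts :: C.ByteString -> [Int]
parseInts = unfoldr step
  where
    step s = C.readInt (C.dropWhile isSpace s)
```

C.readInt accepts a leading minus sign, so negative values need no extra handling.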
Haskell Code for Reference
Make some numbers with Numbers.hs:
-- |Generate 20M 9-digit numbers:
--   ./Numbers 20000000 100000000 999999999 > data1.txt
import qualified Data.ByteString.Char8 as C
import Control.Monad
import System.Environment
import System.Random
main :: IO ()
main = do [n, a, b] <- map read <$> getArgs
          nums <- replicateM n (randomRIO (a,b))
          let _ = nums :: [Int]
          C.putStrLn (C.unwords (map (C.pack . show) nums))
Find their sum with Sum.hs:
import Data.List
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
import qualified Data.Char as C -- provides C.isSpace
import qualified Data.ByteString.Char8 as C
import System.Environment
-- work around https://ghc.haskell.org/trac/ghc/ticket/10992
sum' xs = sum xs
readString :: IO [Int]
readString = map read . words <$> getLine
readText :: IO [Int]
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n
readHand :: IO [Int]
readHand = parse <$> C.getLine
  where parse = unfoldr go
        go s = do (n, s1) <- C.readInt s
                  let s2 = C.dropWhile C.isSpace s1
                  return (n, s2)
main = do [method] <- getArgs
          let reader = case method of
                         "string" -> readString
                         "text" -> readText
                         "char8" -> readChar8
                         "hand" -> readHand
          print =<< sum' <$> reader
where:
./Sum string <data1.txt # 54.3 secs
./Sum text <data1.txt # 2.29 secs
./Sum char8 <data1.txt # 1.34 secs
./Sum hand <data1.txt # 0.91 secs

Converting a bytestring to a list of 7 bits bytes

I have to convert a ByteString into a list of 7-bit groups. For example, bytes with bits a, b, c, d, etc.:
abcdefgh ijklmnop qrstuvwx yz...
should be converted to:
abcdefg hijklmn opqrstu vwxyz...
I use the Binary-Bits package to do this. My convert8to7 function is recursive, but Binary-Bits does not provide any means to check whether bits remain, whereas the Get monad does have the isEmpty and remaining functions.
Here’s my code:
import Data.Word
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import Data.ByteString.Lazy.Char8
convert8to7 :: BitGet [Word8]
convert8to7 = do
  bits <- getWord8 7
  rest <- convert8to7
  return (bits : rest)
main :: IO ()
main = do
  let datas = pack "Hello world!"
  print $ runGet (runBitGet convert8to7) datas
When I run this code, it unsurprisingly fails with:
Data.Binary.Get.runGet at position 12: demandInput: not enough bytes
Can I do this conversion with Binary-Bits, or should I look for another package?
Update
Here’s my code based on user5402's answer:
import Data.Word
import Data.Bits
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import qualified Data.ByteString.Lazy.Char8 as BS
convert87 :: Int -> BitGet [Word8]
convert87 n
  | n == 0 = return []
  | n < 7 = do bits <- getWord8 n
               return [shiftL bits (7 - n)]
  | otherwise = do bits <- getWord8 7
                   rest <- convert87 (n-7)
                   return $ bits : rest
to87 :: BS.ByteString -> [Word8]
to87 datas = runGet (runBitGet (convert87 len)) datas
  where len = fromIntegral $ BS.length datas * 8
main :: IO ()
main = do
  let datas = BS.pack "Hello world!"
  print $ to87 datas
The problem is that you need to keep track of the number of bits to decode - the BitGet monad doesn't know when the end of input has been reached.
Try this:
import Data.Word
import Data.Binary.Bits.Get
import Data.Binary.Get (runGet)
import Data.ByteString.Lazy.Char8
import qualified Data.ByteString.Lazy.Char8 as BS
convert87 :: Int -> BitGet [Word8]
convert87 n
  | n < 7 = do bits <- getWord8 n
               return [bits]
  | otherwise = do bits <- getWord8 7
                   rest <- convert87 (n-7)
                   return $ bits : rest
main :: IO ()
main = do
  let datas = pack "Hello world!"
      len = fromIntegral $ BS.length datas * 8
  print $ runGet (runBitGet (convert87 len)) datas
Update: Here is a way to detect the end of input in the Get monad (on top of which the BitGet monad is implemented). It relies on the Alternative instance for Get. The function chunks7 breaks up a byte string into chunks of 7 bytes, with any remainder going into the last chunk.
As far as I can tell, BitGet does not implement the Alternative class, although I'm sure it could.
import Data.Word (Word8)
import Data.Binary.Get
import Data.ByteString.Lazy.Char8
import qualified Data.ByteString as BSW
import qualified Data.ByteString.Lazy as BSL
import Control.Applicative -- used for (<|>)
chunks7 :: Get [[Word8]]
chunks7 = do
  b <- isEmpty
  if b
    then return []
    else do chunk <- fmap BSW.unpack (getByteString 7)
                     <|> fmap BSL.unpack getRemainingLazyByteString
            rest <- chunks7
            return $ chunk : rest
main :: IO ()
main = do
  let datas = pack "Hello world! This is a test"
  print $ runGet chunks7 datas
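If pulling in Binary-Bits is not a requirement, the 8-to-7 regrouping from the question can also be sketched with plain lists of bits, using only base and bytestring. This is an alternative technique, not the approach from the answer. The helper names (byteBits, groupsOf, packBits, to7) are made up for illustration, and a short final group is left-aligned and zero-padded, following the shiftL convention in the asker's update.

```haskell
import Data.Bits (shiftL, testBit, (.|.))
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL

-- The eight bits of a byte, most significant bit first.
byteBits :: Word8 -> [Bool]
byteBits w = [testBit w i | i <- [7, 6 .. 0]]

-- Split a list into consecutive groups of n elements (the last may be shorter).
groupsOf :: Int -> [a] -> [[a]]
groupsOf _ [] = []
groupsOf n xs = let (g, rest) = splitAt n xs in g : groupsOf n rest

-- Pack up to 7 bits into a Word8, padding a short group with zero bits on the right.
packBits :: [Bool] -> Word8
packBits bs = foldl step 0 (take 7 (bs ++ repeat False))
  where step acc b = (acc `shiftL` 1) .|. (if b then 1 else 0)

-- Regroup the bit stream of a lazy ByteString into 7-bit values.
to7 :: BL.ByteString -> [Word8]
to7 = map packBits . groupsOf 7 . concatMap byteBits . BL.unpack
```

Going through [Bool] is simple but allocation-heavy, so this is meant for clarity and for checking results from a BitGet-based version, not for throughput.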

Using the Reader monad with QuickCheck / monadicIO

I'd like to pass an integer as a CLI argument to a Haskell program that makes use of QuickCheck / monadicIO. That integer is going to be used inside the assert to make the tests customizable.
The problem is that once I parse the integer value in main, I don't know how to pass it into the monadicIO call without using something as ugly as an IORef. I would think that an elegant solution might be the Reader monad, but I couldn't find a way to make it work, seeing as quickCheck is rigid in the arguments it accepts.
Any ideas?
Later Edit 1: As requested, I'm attaching the actual code I'm trying this on, and failing. The commented-out lines represent my failed attempt. Background: the test suite is intended to exercise a very simple remote endpoint that computes the SHA512 of the randomized input generated by QuickCheck. The remote endpoint is Python/Flask based.
Later Edit 2, in response to @user2407038: I could make propHasExpectedLengthCeiling take an additional argument of type Int, but quickCheck would generate random values for it, and that's not what I want. My goal is to take maxSegmentLengthCeiling from the command-line arguments and use it in let testPassed = actualMaxSegmentLength <= maxSegmentLengthCeiling inside the monadicIO block. Right now maxSegmentLengthCeiling is specified as a top-level value, which means I have to recompile the code every time I change the value. I don't yet have any code that involves IORef because that's a last resort, and the essence of my question is how to avoid going the IORef route.
import qualified Data.ByteString.Lazy.Char8 as LC
import Control.Applicative ( (<$>) )
import Data.Function ( on )
import Data.List ( groupBy )
import Data.Char ( isDigit )
--import Safe ( headMay
--            , readMay
--            )
--import System.Environment ( getArgs )
import Network.HTTP.Conduit ( simpleHttp )
import Test.QuickCheck ( Arbitrary
                       , Property
                       , arbitrary
                       , choose
                       , frequency
                       , quickCheckWith
                       , stdArgs
                       , vectorOf
                       )
import Test.QuickCheck.Test ( Args
                            , maxSuccess
                            )
import Test.QuickCheck.Monadic ( assert
                               , monadicIO
                               , run
                               )
newtype CustomInput = MkCustomInput String deriving Show
instance Arbitrary CustomInput where
  arbitrary =
    let
      genCustomInput = vectorOf 20
                       $ frequency [ (26, choose ('0','9'))
                                   , (10, choose ('a','z'))
                                   ]
    in
      MkCustomInput <$> genCustomInput
maxSegmentLengthCeiling :: Int
maxSegmentLengthCeiling = 22
urlPrefix :: String
urlPrefix = "http://192.168.2.3:5000/sha512sum/"
propHasExpectedLengthCeiling :: CustomInput -> Property
propHasExpectedLengthCeiling (MkCustomInput input) = monadicIO $ do
  testPassed <- run $ do
    response <- simpleHttp $ urlPrefix ++ input
    let stringResponse = LC.unpack response
    let brokenDownStringResponse = groupBy ( (==) `on` isDigit ) stringResponse
    let actualMaxSegmentLength = maximum $ map length brokenDownStringResponse
    let testPassed = actualMaxSegmentLength <= maxSegmentLengthCeiling
    putStrLn ""
    putStrLn ""
    putStrLn $ "Input: " ++ input
    putStrLn $ "Control sum: " ++ stringResponse
    putStrLn $ "Breakdown: " ++ show brokenDownStringResponse
    putStrLn $ "Max. length: " ++ show actualMaxSegmentLength
    putStrLn $ "Ceiling: " ++ show maxSegmentLengthCeiling
    putStrLn $ "Test result: " ++ if testPassed then "Pass" else "Fail"
    putStrLn ""
    putStrLn ""
    return testPassed
  assert $ testPassed
customArgs :: Args
customArgs = stdArgs { maxSuccess = 1000000 }
--readMayAsInt :: String -> Maybe Int
--readMayAsInt = readMay
main :: IO ()
main =
--main = do
--  cliArgs <- getArgs
--  let ceilingInputMay = headMay cliArgs >>= readMayAsInt
--  maxSegmentLengthCeiling <- case ceilingInputMay of
--    (Just lengthCeiling) -> return lengthCeiling
--    Nothing -> error "No valid number given"
  quickCheckWith
    customArgs
    propHasExpectedLengthCeiling
Make maxSegmentLengthCeiling a parameter of propHasExpectedLengthCeiling:
propHasExpectedLengthCeiling :: Int -> CustomInput -> Property
and apply it to the parsed value in main, so that quickCheck only has to generate the CustomInput:
main = do
  [n] <- getArgs
  quickCheckWith customArgs (propHasExpectedLengthCeiling (read n))
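The pattern [n] <- getArgs and read n both fail with an unhelpful error on missing or malformed input. Below is a small sketch of a safer front end, kept free of QuickCheck so that it only needs base. parseCeiling and withinCeiling are hypothetical names; withinCeiling is a pure stand-in for the check performed inside monadicIO, reusing the groupBy logic from the question.

```haskell
import Data.Char (isDigit)
import Data.Function (on)
import Data.List (groupBy)
import Text.Read (readMaybe)

-- Parse the ceiling from the argument list; Nothing if it is missing or malformed.
parseCeiling :: [String] -> Maybe Int
parseCeiling (arg : _) = readMaybe arg
parseCeiling []        = Nothing

-- Pure stand-in for the property body: the longest run of digits or
-- non-digits in the response must not exceed the ceiling.
withinCeiling :: Int -> String -> Bool
withinCeiling ceiling' response =
  maximum (0 : map length segments) <= ceiling'
  where
    segments = groupBy ((==) `on` isDigit) response
```

In main, one could match on parseCeiling applied to the result of getArgs and, in the Just n case, call quickCheckWith customArgs (propHasExpectedLengthCeiling n). Partial application fixes the first argument, so QuickCheck never generates a value for it.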

Lazy binary get

Why isn't Data.Binary.Get lazy, as it says it is? Or am I doing something wrong here?
import Data.ByteString.Lazy (pack)
import Data.Binary.Get (runGet, isEmpty, getWord8)
getWords = do
  empty <- isEmpty
  if empty
    then return []
    else do
      w <- getWord8
      ws <- getWords
      return $ w:ws
main = print $ take 10 $ runGet getWords $ pack $ repeat 1
This main function just hangs instead of printing 10 words.
The documentation you linked provides several examples. The first one needs to read all the input before it can return and looks a lot like what you have written. The second one is a left-fold and processes the input in a streaming fashion. Here's your code rewritten in this style:
module Main where
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGetState, getWord8)
getWords :: BL.ByteString -> [Word8]
getWords input
  | BL.null input = []
  | otherwise =
      let (w, rest, _) = runGetState getWord8 input 0
      in w : getWords rest
main :: IO ()
main = print . take 10 . getWords . BL.pack . repeat $ 1
Testing:
*Main> :main
[1,1,1,1,1,1,1,1,1,1]
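As far as I know, runGetState is marked as deprecated in recent versions of binary. The same streaming pattern can be sketched with runGetOrFail, which also returns the unconsumed input and additionally reports decoding failures instead of throwing. The name streamWords is made up for illustration.

```haskell
import Data.Binary.Get (getWord8, runGetOrFail)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Run a small decoder once per element and recurse on the leftover input,
-- so each element is produced before the rest of the input is examined.
streamWords :: BL.ByteString -> [Word8]
streamWords input
  | BL.null input = []
  | otherwise =
      case runGetOrFail getWord8 input of
        Left _             -> []  -- decoding error: stop producing elements
        Right (rest, _, w) -> w : streamWords rest
```

In GHCi, take 10 (streamWords (BL.pack (repeat 1))) returns ten ones without hanging, because the result list is built lazily outside the Get monad.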

Resources