Reading in a binary file in haskell

Reading in a binary file in haskell - haskell

How could I write a function with a definition something like...
readBinaryFile :: Filename -> IO Data.ByteString
I've got the functional parts of Haskell down, but the type system and monads still make my head hurt. Can someone write and explain how that function works to me?

import Data.ByteString.Lazy
readFile fp
easy as pie man. Knock off the lazy if you don't want the string to be lazy.
import Data.ByteString.Lazy as BS
import Data.Word
import Data.Bits
fileToWordList :: String -> IO [Word8]
fileToWordList fp = do
contents <- BS.readFile fp
return $ unpack contents

readBinaryFile :: Filename -> IO Data.ByteString
This is simply the Data.ByteString.readFile function, which you should never have to write, since it is in the bytestring package.

Related

Efficient Conversion of Bytestring to [Word16]

I am attempting to do a plain conversion from a bytestring to a list of Word16s. The implementation below using Data.Binary.Get works, though it is a performance bottleneck in the code. This is understandable as IO is always going to be slow, but I was wondering if there isn't a more efficient way of doing this.
getImageData' = do
e <- isEmpty
if e then return []
else do
w <- getWord16be
ws <- getImageData'
return $ w : ws

I suspect the big problem you're encountering with Data.Binary.Get is that the decoders are inherently much too strict for your purpose. This also appears to be the case for serialise, and probably also the other serialization libraries. I think the fundamental trouble is that while you know that the operation will succeed as long as the ByteString has an even number of bytes, the library does not know that. So it has to read in the whole ByteString before it can conclude "Ah yes, all is well" and construct the list you've requested. Indeed, the way you're building the result, it's going to build a slew of closures (proportional in number to the length) before actually doing anything useful.
How can you fix this? Just use the bytestring library directly. The easiest thing is to use unpack, but I think you'll probably get slightly better performance like this:
{-# language BangPatterns #-}
module Wordy where
import qualified Data.ByteString as BS
import Data.ByteString (ByteString)
import Data.Word (Word16)
import Data.Maybe (fromMaybe)
import Data.Bits (unsafeShiftL)
toDBs :: ByteString -> Maybe [Word16]
toDBs bs0
| odd (BS.length bs0) = Nothing
| otherwise = Just (go bs0)
where
go bs = fromMaybe [] $ do
(b1, bs') <- BS.uncons bs
(b2, bs'') <- BS.uncons bs'
let !res = (fromIntegral b1 `unsafeShiftL` 8) + fromIntegral b2
Just (res : go bs'')

Efficient chunking of conduit for strict bytestring

This is a followup to this earlier question. I have a conduit source (from Network.HTTP.Conduit) which is strict ByteString. I will like to recombine them into larger chunks (to send over network to another client, after another encoding and conversion to lazy bytestring). I wrote chunksOfAtLeast conduit, derived from the answer in above question which seems to work pretty well. I am wondering if there is any further scope for improving it performance-wise.
import Data.Conduit as C
import Control.Monad.IO.Class
import Control.Monad
import Data.Conduit.Combinators as CC
import Data.Conduit.List as CL
import Data.ByteString.Lazy as LBS hiding (putStrLn)
import Data.ByteString as BS hiding (putStrLn)
chunksOfAtLeast :: Monad m => Int -> Conduit BS.ByteString m BS.ByteString
chunksOfAtLeast chunkSize =
loop
where
loop = do
bs <- takeE chunkSize =$= ((BS.concat . ($ [])) <$> CL.fold (\front next -> front . (next:)) id)
unless (BS.null bs) $ do
yield bs
loop
main = do
yieldMany ["hello", "there", "world!"] $$ chunksOfAtLeast 8 =$ CL.mapM_ Prelude.print

Getting optimal performance is always a case of trying something and benchmarking it, so I can't tell you with certainty that I'm offering you something more efficient. That said, combining smaller chunks of data into larger chunks is a primary goal of builders, so leveraging them may be more efficient. Here's an example:
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.ByteString (ByteString)
import Data.ByteString.Builder (byteString)
import Data.Conduit.ByteString.Builder
bufferChunks :: Conduit ByteString IO ByteString
bufferChunks = mapC byteString =$= builderToByteString
main :: IO ()
main = yieldMany ["hello", "there", "world!"] $$ bufferChunks =$ mapM_C print

String -> ByteString and reverse

In my Haskell Program I need to work with Strings and ByteStrings:
import Data.ByteString.Lazy as BS (ByteString)
import Data.ByteString.Char8 as C8 (pack)
import Data.Char (chr)
stringToBS :: String -> ByteString
stringToBS str = C8.pack str
bsToString :: BS.ByteString -> String
bsToString bs = map (chr . fromEnum) . BS.unpack $ bs
bsToString works fine, but stringToBS results with following error at compiling:
Couldn't match expected type ‘ByteString’
with actual type ‘Data.ByteString.Internal.ByteString’
NB: ‘ByteString’ is defined in ‘Data.ByteString.Lazy.Internal’
‘Data.ByteString.Internal.ByteString’
is defined in ‘Data.ByteString.Internal’
In the expression: pack str
In an equation for ‘stringToBS’: stringToBS str = pack str
But I need to let it be ByteString from Data.ByteString.Lazy as BS (ByteString) for further working functions in my code.
Any idea how to solve my problem?

You are working with both strict ByteStrings and lazy ByteStrings which are two different types.
This import:
import Data.ByteString.Lazy as BS (ByteString)
makes ByteString refer the lazy ByteStrings, so the type signature of your stringToBS doesn't match it's definition:
stringToBS :: String -> ByteString -- refers to lazy ByteStrings
stringToBS str = C8.pack str -- refers to strict ByteStrings
I think it would be a better idea to use import qualified like this:
import qualified Data.ByteString.Lazy as LBS
import qualified Data.ByteString.Char8 as BS
and use BS.ByteString and LBS.ByteString to refer to strict / lazy ByteStrings.

You can convert between lazy and non-lazy versions using fromStrict, and toStrict (both functions are in the lazy bytestring module).

Haskell Conduit: having a Sink return a value based on the values from upstream

I've been trying to use the Conduit library to do some simple I/O involving files, but I'm having a hard time.
I have a text file containing nothing but a few digits such as 1234. I have a function that reads the file using readFile (no conduits), and returns Maybe Int (Nothing is returned when the file actually doesn't exist). I'm trying to write a version of this function that uses conduits, and I just can't figure it out.
Here is what I have:
import Control.Monad.Trans.Resource
import Data.Conduit
import Data.Functor
import System.Directory
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
myFile :: FilePath
myFile = "numberFile"
withoutConduit :: IO (Maybe Int)
withoutConduit = do
doesExist <- doesFileExist myFile
if doesExist
then Just . read <$> readFile myFile
else return Nothing
withConduit :: IO (Maybe Int)
withConduit = do
doesExist <- doesFileExist myFile
if doesExist
then runResourceT $ source $$ conduit =$ sink
else return Nothing
where
source :: Source (ResourceT IO) B.ByteString
source = CB.sourceFile myFile
conduit :: Conduit B.ByteString (ResourceT IO) T.Text
conduit = CT.decodeUtf8
sink :: Sink T.Text (ResourceT IO) (Maybe Int)
sink = awaitForever $ \txt -> let num = read . T.unpack $ txt :: Int
in -- I don't know what to do here...
Could someone please help me complete the sink function?
Thanks!

This isn't really a good example for where conduit actually provides a lot of value, at least not the way you're looking at it right now. Specifically, you're trying to use the read function, which requires that the entire value be in memory. Additionally, your current error handling behavior is a bit loose. Essentially, you're just going to get an read: no parse error if there's anything unexpected in the content.
However, there is a way we can play with this in conduit and be meaningful: by parsing the ByteString byte-by-byte ourselves and avoiding the read function. Fortunately, this pattern falls into a standard left fold, which the conduit-combinators package provides a perfect function for (element-wise left fold in a conduit, aka foldlCE):
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Word8
import qualified Data.ByteString as S
sinkInt :: Monad m => Consumer S.ByteString m Int
sinkInt =
foldlCE go 0
where
go total w
| _0 <= w && w <= _9 =
total * 10 + (fromIntegral $ w - _0)
| otherwise = error $ "Invalid byte: " ++ show w
main :: IO ()
main = do
x <- yieldMany ["1234", "5678"] $$ sinkInt
print x
There are plenty of caveats that go along with this: it will simply throw an exception if there are unexpected bytes, and it doesn't handle integer overflow at all (though fixing that is just a matter of replacing Int with Integer). It's important to note that, since the in-memory string representation of a valid 32- or 64-bit int is always going to be tiny, conduit is overkill for this problem, though I hope that this code gives some guidance on how to generally write conduit code.

Idiomatic io-streams directory traversal

I was discussing some code on Reddit, and it made me curious about how this would be implemented in io-streams. Consider the following code which traverses a directory structure and prints out all of the filenames:
import Control.Exception (bracket)
import qualified Data.Foldable as F
import Data.Streaming.Filesystem (closeDirStream, openDirStream,
readDirStream)
import System.Environment (getArgs)
import System.FilePath ((</>))
printFiles :: FilePath -> IO ()
printFiles dir = bracket
(openDirStream dir)
closeDirStream
loop
where
loop ds = do
mfp <- readDirStream ds
F.forM_ mfp $ \fp' -> do
let fp = dir </> fp'
ftype <- getFileType fp
case ftype of
FTFile -> putStrLn fp
FTFileSym -> putStrLn fp
FTDirectory -> printFiles fp
_ -> return ()
loop ds
main :: IO ()
main = getArgs >>= mapM_ printFiles
Instead of simply printing the files, suppose we wanted to create some kind of streaming filepath representation. I know how this would work in enumerator, conduit, and pipes. However, since the intermediate steps require acquisition of a scarce resource (the DirStream), I'm not sure what the implementation would be for io-streams. Can someone provide an example of how that would be done?
For comparison, here's the conduit implementation, which is made possible via bracketP and MonadResource. And here's how the conduit code would be used to implemented the same file printing program as above:
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Conduit (($$))
import Data.Conduit.Filesystem (sourceDirectoryDeep)
import qualified Data.Conduit.List as CL
import System.Environment (getArgs)
main :: IO ()
main =
getArgs >>= runResourceT . mapM_ eachRoot
where
-- False means don't traverse dir symlinks
eachRoot root = sourceDirectoryDeep False root
$$ CL.mapM_ (liftIO . putStrLn)

Typical style would be to do something like this:
traverseDirectory :: RawFilePath -> (InputStream RawFilePath -> IO a) -> IO a
i.e. a standard "with-" function, with the obvious implementation.
Edit: added a working example implementation: https://gist.github.com/gregorycollins/00c51e7e33cf1f9c8cc0
It's not exactly complicated but it's also not as trivial as I had first suggested.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Reading in a binary file in haskell - haskell

How could I write a function with a definition something like... readBinaryFile :: Filename -> IO Data.ByteString I've got the functional parts of Haskell down, but the type system and monads still make my head hurt. Can someone write and explain how that function works to me?

import Data.ByteString.Lazy readFile fp easy as pie man. Knock off the lazy if you don't want the string to be lazy. import Data.ByteString.Lazy as BS import Data.Word import Data.Bits fileToWordList :: String -> IO [Word8] fileToWordList fp = do contents <- BS.readFile fp return $ unpack contents

readBinaryFile :: Filename -> IO Data.ByteString This is simply the Data.ByteString.readFile function, which you should never have to write, since it is in the bytestring package.

Related

Efficient Conversion of Bytestring to [Word16]

Efficient chunking of conduit for strict bytestring

String -> ByteString and reverse

Haskell Conduit: having a Sink return a value based on the values from upstream

Idiomatic io-streams directory traversal

Categories

Resources