I/O Monad and ByteString to Char conversion?

I/O Monad and ByteString to Char conversion? - haskell

I'm testing some HTTP requests in haskell and have the below methods:
import qualified Data.ByteString.Lazy as LAZ
import Language.Haskell.TH.Ppr
import System.IO
import Data.Word (Word8)
request :: IO LAZ.ByteString
request = do
response <- simpleHttp "https://www.url.com"
return (response)
exampleFunctionOne:: IO LAZ.ByteString -> IO LAZ.ByteString
exampleFunctionOne bytes = do
html <- bytes
let bytesToChars = bytesToString $ LAZ.unpack html
let x = exampleFunctionTwo bytesToChars
bytes
exampleFunctionTwo :: [Char] -> [Char]
exampleFunctionTwo chars = --Do stuff...
main = do
exampleFunctionOe $ request
My questions are:
Is there a more straight forward way to convert the ByteString to [Char]? Currently I've having to convert to perform (ByteString -> Word8) and then (Word8 -> Char)
Am I correct in saying the 'return ()' statement in my request function is simply re-applying the monad context (in this case IO) to the value I've extracted (response <- simpleHttp)? Or does it have an additional purpose?

To answer your first question, note that there's a different "unpack" in Data.ByteString.Lazy.Char8 with the signature you want:
unpack :: ByteString -> String
It's not unusual for people to import both modules:
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Lazy.Char8 as C
and mix and match functions from each.
To answer your second question, yes that's more or less it. For example:
redund = do x <- getLine
y <- return x
z <- return y
u <- return z
return u
is all equivalent to redund = getLine with a bunch of re-wrapping and extracting of pure values into an out of an IO monad.

Related

Efficient streaming and manipulation of a byte stream in Haskell

While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
G.runGet G.getWord32be $ BL.fromStrict bs
-- should read BytesString and return Word32
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
mbs <- await
case mbs of
Just bs -> do
case C.null bs of
False -> do
yield $ getWord32 bs
leftover $ BS.drop 4 bs
transform
True -> return ()
Nothing -> return ()
main :: IO ()
main = do
filename <- fmap (!!0) getArgs -- should check length getArgs
result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't want to go to the second chunk of the ByteStream, so is there another function that pulls the next chunk? In examples I've seen (e.g. Conduit 101) this is not how it's done
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using Systems.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
filename : _ <- getArgs
h <- openFile filename ReadMode
s <- S.handleToInputStream h
i <- binaryInputStream s :: IO (S.InputStream Word32)
r <- outputToList $ S.connect i
print $ last r
'Bad' means: Very demanding in time and space, does not handle Decode exception.

Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * truncate (C.length bs / 4)) bs on the input bytestring.

With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
print n
This will fail if we have less than 4 bytes or otherwise fail to parse but we can as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
Left r -> Nothing
Right (_, off, w32) -> Just w32
The following program will then print the parses for the valid 4 byte sequences
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
>-> P.map getMaybeWord32
>-> P.concat -- here `concat` eliminates maybes
>-> P.print
There are other ways of dealing with failed parses, of course.
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((.^))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )
splitLazy :: (Monad m, Integral n) =>
n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
(bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
return (BL.fromChunks bss, rest)
measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
(lbs, rest) <- lift $ splitLazy 4 bs
if BL.length lbs /= 4
then rest >-> P.drain -- in fact it will be empty
else do
let w32 = G.runGet G.getWord32be lbs
(lbs', rest') <- lift $ splitLazy w32 bs
yield lbs
measureChunks rest
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet not runGetOrFail, but this is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.

Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where
import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)
-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString
getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
let (w, rest) = BL.splitAt 4 bs
case BL.length w of
4 -> (Just w', rest) where
w' = G.runGet G.getWord32be w
_ -> (Nothing, BL.empty)
getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs
getBlob :: Get (Maybe BL.ByteString)
getBlob = do
ml <- getWord32be
case ml of
Nothing -> return Nothing
Just l -> do
blob <- getLazyByteString (fromIntegral l :: Int64)
return $ Just blob
runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs
main :: IO ()
main = do
fname <- head <$> getArgs
bs <- BL.readFile fname
let ls = runGet loop bs where
loop = whileJust getBlob return
print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The python script that creates some random data for consumption by the above is here).

Monad escaping inside a StateT context

I am trying to get back a value that is in a json feed (via Aeson) directly inside a StateT stacked on IO:
{-# LANGUAGE DeriveGeneric #-}
module MyFeed where
import Data.Aeson
import Network.URI (parseURI, URI(..))
import Data.Maybe (fromJust)
import Data.Text (Text, unpack)
import Control.Monad.State
import Network.HTTP
import GHC.Generics
import Control.Applicative
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy as B
type Feed a = StateT MyIndex IO a
data MyIndex = MyIndex {
index :: Int
}
data FooBar = Foo | Bar
data MyFeed = MyFeed {
idx :: !Text,
key :: !Text
} deriving (Show,Generic)
instance FromJSON MyFeed
instance ToJSON MyFeed
getJSON :: String -> IO B.ByteString
getJSON url = simpleHttp url
getFeed :: String -> IO (Maybe MyFeed)
getFeed url = (decode <$> getJSON url) :: IO (Maybe MyFeed)
getIndex :: FooBar -> Feed MyIndex
getIndex fb = do
cursor <- get
let newCursor = case fb of
Foo -> do myFeed <- liftIO $ getFeed "http://echo.jsontest.com/key/value/idx/1"
let i = read $ unpack $ idx $ fromJust myFeed
return $ cursor { index = i }
Bar -> return cursor
put newCursor
return newCursor
In the Foo case I fetch the feed as expected but when the required value is returned I get:
src/MyFeed.hs:47:10:
Couldn't match expected type ‘MyIndex’
with actual type ‘m0 MyIndex’
Relevant bindings include
newCursor :: m0 MyIndex (bound at src/MyFeed.hs:40:7)
In the first argument of ‘return’, namely ‘newCursor’
In a stmt of a 'do' block: return newCursor
The Actual Type looks still in a Monad context (do {...}). Is there a way to take it out or I am using a wrong approach?

The error was due to the fact I used:
let newCursor = case fb of
instead of
newCursor <- case fb of
For this reason the final value never get "unwrapped" from its monad context.

Haskell Conduit: having a Sink return a value based on the values from upstream

I've been trying to use the Conduit library to do some simple I/O involving files, but I'm having a hard time.
I have a text file containing nothing but a few digits such as 1234. I have a function that reads the file using readFile (no conduits), and returns Maybe Int (Nothing is returned when the file actually doesn't exist). I'm trying to write a version of this function that uses conduits, and I just can't figure it out.
Here is what I have:
import Control.Monad.Trans.Resource
import Data.Conduit
import Data.Functor
import System.Directory
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
myFile :: FilePath
myFile = "numberFile"
withoutConduit :: IO (Maybe Int)
withoutConduit = do
doesExist <- doesFileExist myFile
if doesExist
then Just . read <$> readFile myFile
else return Nothing
withConduit :: IO (Maybe Int)
withConduit = do
doesExist <- doesFileExist myFile
if doesExist
then runResourceT $ source $$ conduit =$ sink
else return Nothing
where
source :: Source (ResourceT IO) B.ByteString
source = CB.sourceFile myFile
conduit :: Conduit B.ByteString (ResourceT IO) T.Text
conduit = CT.decodeUtf8
sink :: Sink T.Text (ResourceT IO) (Maybe Int)
sink = awaitForever $ \txt -> let num = read . T.unpack $ txt :: Int
in -- I don't know what to do here...
Could someone please help me complete the sink function?
Thanks!

This isn't really a good example for where conduit actually provides a lot of value, at least not the way you're looking at it right now. Specifically, you're trying to use the read function, which requires that the entire value be in memory. Additionally, your current error handling behavior is a bit loose. Essentially, you're just going to get an read: no parse error if there's anything unexpected in the content.
However, there is a way we can play with this in conduit and be meaningful: by parsing the ByteString byte-by-byte ourselves and avoiding the read function. Fortunately, this pattern falls into a standard left fold, which the conduit-combinators package provides a perfect function for (element-wise left fold in a conduit, aka foldlCE):
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Word8
import qualified Data.ByteString as S
sinkInt :: Monad m => Consumer S.ByteString m Int
sinkInt =
foldlCE go 0
where
go total w
| _0 <= w && w <= _9 =
total * 10 + (fromIntegral $ w - _0)
| otherwise = error $ "Invalid byte: " ++ show w
main :: IO ()
main = do
x <- yieldMany ["1234", "5678"] $$ sinkInt
print x
There are plenty of caveats that go along with this: it will simply throw an exception if there are unexpected bytes, and it doesn't handle integer overflow at all (though fixing that is just a matter of replacing Int with Integer). It's important to note that, since the in-memory string representation of a valid 32- or 64-bit int is always going to be tiny, conduit is overkill for this problem, though I hope that this code gives some guidance on how to generally write conduit code.

haskell [char] to bytestring

main :: IO ()
main = do
let a = ("teeeeeeeeeeeeest","teeeeeeeeeeeest")
b <- app a
print b
app expects (bytestring,bytestring) not ([char],[char])
how can I convert it?

You can convert Strings to ByteStrings with Data.ByteString.Char8.pack (or the lazy ByteString version thereof) if your String contains only ASCII values or you are interested only in the last eight bits of each Char,
import qualified Data.ByteString.Char8 as C
main :: IO ()
main = do
let a = ("teeeeeeeeeeeeest","teeeeeeeeeeeest")
b <- app $ (\(x,y) -> (C.pack x, C.pack y)) a
print b
If your String contains non-ASCII Chars and you are interested in more than only the last eight bits, you will need some other encoding, like Data.ByteString.UTF8.fromString.

You could try:
import qualified Data.ByteString.Char8 as B --to prevent name clash with Prelude
B.pack "Hello, world"
A lot of useful functions can be found here:
http://www.haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Char8.html
you could also use Data.ByteString.Lazy.Char8
for lazy bytestrings
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/Data-ByteString-Lazy-Char8.html#v:pack

Why Haskell alloc huge amounts of memory when working with strings?

I've written a program in Haskell which had to load and parse big text file in UTF8. The file represents a dictionary with key:value pairs on each line. In my program I want to have a Data.Map container for fast dictionary search. My file is about 40MB, but after loading it to my program 1.5 GB of RAM is used, and never freed. What did I do wrong? Is the memory usage expected?
Here is a code sample from my program:
module Main where
import Engine
import Codec.Archive.Zip
import Data.IORef
import System.IO
import System.Directory
import qualified System.IO.UTF8 as UTF8
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.UTF8 as BsUtf
import qualified Data.Map as Map
import Graphics.UI.Gtk
import Graphics.UI.Gtk.Glade
maybeRead :: Read a => BsUtf.ByteString -> Maybe a
maybeRead s = case reads $ BsUtf.toString s of
[(x, "")] -> Just x
_ -> Nothing
parseToEntries :: [BsUtf.ByteString] -> [(BsUtf.ByteString, Int)]
parseToEntries [] = []
parseToEntries (x:xs) = let (key, svalue) = BsUtf.break (==':') x
value = maybeRead svalue
in case value of
Just x -> [(key, x)] ++ parseToEntries xs
Nothing -> parseToEntries xs
createDict :: BsUtf.ByteString -> IO (Map.Map BsUtf.ByteString Int)
createDict str = do
let entries = parseToEntries $ BsUtf.lines str
dict = Map.fromList entries
return (dict)
main :: IO ()
main = do
currFileName <- newIORef ""
dictZipFile <- B.readFile "data.db"
extractFilesFromArchive [] $ toArchive dictZipFile
dictFile <- UTF8.readFile "dict.txt"
dict <- createDict $ BsUtf.fromString dictFile
...
searchAccent :: Map.Map BsUtf.ByteString Int -> String -> Int
searchAccent dict word = let sword = BsUtf.fromString $ map toLower word
entry = Map.lookup sword dict
in case entry of
Nothing -> -1
Just match -> 0

Quick answer.
Main problem is that System.IO.UTF8.readFile reads file into String.
Supposed bottleneck is here:
dictFile <- UTF8.readFile "dict.txt"
dict <- createDict $ BsUtf.fromString dictFile
When dealing with UTF-8 text it is better to use Data.Text instead of ByteString.
Try something like this:
import qualified Data.Text.Lazy as LT
import qualified Data.Text.Lazy.Encoding as LT
...
dictFile <- B.readFile "dict.txt"
dict <- createDict $ LT.decodeUtf8 dictFile
Another bottleneck is parsing numbers: you are converting ByteString to String and then read it.
It's better to use Data.Text.Lazy.Read:
import qualified Data.Text.Lazy.Read as LT
maybeRead :: LT.Text -> Maybe Int
maybeRead s = case LT.decimal s of
Left _ -> Nothing
Right i -> Just i

The Haskell String type is an indirect (because of laziness) linked list of characters; it is extremely wasteful space-wise. You may wish to try Data.Text (from http://hackage.haskell.org/package/text) instead, for large amounts of text.
(edit now that source is up I see the strings are lazy ByteString instead of String, so this is not relevant. Profiling is the next step.)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

I/O Monad and ByteString to Char conversion? - haskell

Related

Efficient streaming and manipulation of a byte stream in Haskell

Monad escaping inside a StateT context

Haskell Conduit: having a Sink return a value based on the values from upstream

haskell [char] to bytestring

Why Haskell alloc huge amounts of memory when working with strings?

Categories

Resources