Constructing RequestBodyStream from Lazy ByteString when length is known

Constructing RequestBodyStream from Lazy ByteString when length is known - haskell

I am trying to adapt this AWS S3 upload code to handle Lazy ByteString where length is already known (so that it is not forced to be read in its entirety in memory - it comes over the network where length is sent beforehand). It seems I have to define a GivesPopper function over Lazy ByteString to convert it to RequestBodyStream. Because of the convoluted way GivesPopper is defined, I am not sure how to write it for Lazy ByteString. Will appreciate pointers on how to write it. Here is how it is written for reading from the file:
let file ="test"
-- streams large file content, without buffering more than 10k in memory
let streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
streamer in the code above is of type GivesPopper () if I understand it correctly.
Given a Lazy ByteString with known length len, what would be a good way to write GivesPopper function over it? We can read one chunk at a time.

Is this what you're looking for?
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO
file = "test"
-- original streamer for feeding a sink from a file
streamer :: (IO S.ByteString -> IO r) -> IO r
streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
-- feed a lazy ByteString to sink
lstreamer :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
lstreamer lbs sink = sink (return (L.toStrict lbs))
lstreamer type checks but probably doesn't do exactly what you want it to do. It simply returns the same data every time the sink calls it. On the other hand S.hGet h ... will eventually return the empty string.
Here is a solution which uses an IORef to keep track of if we should start returning the empty string:
import Data.IORef
mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
ref <- newIORef False
let fetch :: IO S.ByteString
fetch = do sent <- readIORef ref
writeIORef ref True
if sent
then return S.empty
else return (L.toStrict lbs)
sink fetch
Here fetch is the action which gets the next chunk. The first time you call it you will get the original lazy Bytestring (strict-ified). Subsequent calls will always return the empty string.
Update
Here's how to give out a small amount at a time:
mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
ref <- newIORef (L.toChunks lbs)
let fetch :: IO S.ByteString
fetch = do chunks <- readIORef ref
case chunks of
[] -> return S.empty
(c:cs) -> do writeIORef ref cs
return c
sink fetch

Related

How to track progress through a streaming ByteString?

I'm using the streaming-utils streaming-utils to stream a HTTP response body. I want to track the progress similar to how bytestring-progress allows with lazy ByteStrings. I suspect something like toChunks would be necessary, then reducing some cumulative bytes read and returning the original stream unmodified. But I cannot figure it out, and the streaming documentation is very unhelpful, mostly full of grandiose comparisons to alternative libraries.
Here's some code with my best effort so far. It doesn't include the counting yet, and just tries to print the size of chunks as they stream past (and doesn't compile).
download :: ByteString -> FilePath -> IO ()
download i file = do
req <- parseRequest . C.unpack $ i
m <- newHttpClientManager
runResourceT $ do
resp <- http req m
lift . traceIO $ "downloading " <> file
let body = SBS.fromChunks $ mapsM step $ SBS.toChunks $ responseBody resp
SBS.writeFile file body
step bs = do
traceIO $ "got " <> show (C.length bs) <> " bytes"
return bs

What we want is to traverse the Stream (Of ByteString) IO () in two ways:
One that accumulates the incoming lengths of the ByteStrings and prints updates to console.
One that writes the stream to a file.
We can do that with the help of the copy function, which has type:
copy :: Monad m => Stream (Of a) m r -> Stream (Of a) (Stream (Of a) m) r
copy takes a stream and duplicates it into two different monadic layers, where each element of the original stream is emitted by both layers of the new dissociated stream.
(Notice that we are changing the base monad, not the functor. What changing the functor to another Stream does is to delimit groups in a single stream, and we aren't interested in that here.)
The following function takes a stream, copies it, accumulates the length of incoming strings with S.scan, prints them, and returns another stream that you can still work with, for example writing it to a file:
{-# LANGUAGE OverloadedStrings #-}
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString as B
track :: Stream (Of B.ByteString) IO r -> Stream (Of B.ByteString) IO r
track stream =
S.mapM_ (liftIO . print) -- brings us back to the base monad, here another stream
. S.scan (\s b -> s + B.length b) (0::Int) id
$ S.copy stream
This will print the ByteStrings along with the accumulated lengths:
main :: IO ()
main = S.mapM_ B.putStr . track $ S.each ["aa","bb","c"]

Efficient streaming and manipulation of a byte stream in Haskell

While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
G.runGet G.getWord32be $ BL.fromStrict bs
-- should read BytesString and return Word32
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
mbs <- await
case mbs of
Just bs -> do
case C.null bs of
False -> do
yield $ getWord32 bs
leftover $ BS.drop 4 bs
transform
True -> return ()
Nothing -> return ()
main :: IO ()
main = do
filename <- fmap (!!0) getArgs -- should check length getArgs
result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't want to go to the second chunk of the ByteStream, so is there another function that pulls the next chunk? In examples I've seen (e.g. Conduit 101) this is not how it's done
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using Systems.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
filename : _ <- getArgs
h <- openFile filename ReadMode
s <- S.handleToInputStream h
i <- binaryInputStream s :: IO (S.InputStream Word32)
r <- outputToList $ S.connect i
print $ last r
'Bad' means: Very demanding in time and space, does not handle Decode exception.

Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * truncate (C.length bs / 4)) bs on the input bytestring.

With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
print n
This will fail if we have less than 4 bytes or otherwise fail to parse but we can as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
Left r -> Nothing
Right (_, off, w32) -> Just w32
The following program will then print the parses for the valid 4 byte sequences
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
>-> P.map getMaybeWord32
>-> P.concat -- here `concat` eliminates maybes
>-> P.print
There are other ways of dealing with failed parses, of course.
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((.^))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )
splitLazy :: (Monad m, Integral n) =>
n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
(bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
return (BL.fromChunks bss, rest)
measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
(lbs, rest) <- lift $ splitLazy 4 bs
if BL.length lbs /= 4
then rest >-> P.drain -- in fact it will be empty
else do
let w32 = G.runGet G.getWord32be lbs
(lbs', rest') <- lift $ splitLazy w32 bs
yield lbs
measureChunks rest
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet not runGetOrFail, but this is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.

Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where
import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)
-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString
getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
let (w, rest) = BL.splitAt 4 bs
case BL.length w of
4 -> (Just w', rest) where
w' = G.runGet G.getWord32be w
_ -> (Nothing, BL.empty)
getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs
getBlob :: Get (Maybe BL.ByteString)
getBlob = do
ml <- getWord32be
case ml of
Nothing -> return Nothing
Just l -> do
blob <- getLazyByteString (fromIntegral l :: Int64)
return $ Just blob
runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs
main :: IO ()
main = do
fname <- head <$> getArgs
bs <- BL.readFile fname
let ls = runGet loop bs where
loop = whileJust getBlob return
print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The python script that creates some random data for consumption by the above is here).

How to generalize reads from url and file in Haskell

I develop an application that borrows data from the Internet by chunks with the given offset. For testing purposes I have a dump file that contains lines where each line corresponds to the separate chunk. I want to generalize read operations from url and dump file. Currently, I have the following functions:
getChunk :: DataSourceMode -> Config -> Int -> Int -> IO FetchResult
getChunk DSNormal config ownerId' offset' = do ...
getChunk DSFromFile config ownerId' offset' = do ...
The problem with the current implementation is that it reads dump file on each getChunk call and it's, obviously, ineffective. The first idea is to save the lines from the dump file into list, but then it wouldn't be easy to generalize it with readings from url. I suppose, conduits or pipes could be used to construct source of chunks, but I'm not familiar with these libraries; should I use one of them, or, maybe, there's a better solution?

I ended up with conduits. Used generalized function processFeed as a sink and then pushed into it data from postUrlSource or Data.Conduit.Binary.sourceFile, depending on mode.
import Data.Conduit.Binary as CB(sourceFile, conduitFile, lines)
processFeed :: MonadIO m => Config -> OwnerId -> (OwnerId -> [Post] -> IO ()) -> Sink BS.ByteString m FetchResult
processFeed config ownerId' processFn = do ...
postUrlSource :: MonadIO m => Config -> OwnerId -> Source (StateT FetchState (m)) BS.ByteString
postUrlSource config ownerId' = do ...
...
_ <- case (dsMode config) of
DSFromFile -> do
runResourceT $ CB.sourceFile dumpFile $= CB.lines $$ (processFeed config publicId' saveResult)
DSNormal -> do
let postsFromUrlConduit = (postUrlSource config publicId') $$ (processFeed config publicId' saveResult)
fetchedPosts <- runStateT postsFromUrlConduit (FetchState 0 "")
return $ fst fetchedPosts
...
StateT is used for the case when we fetch data from the url, so, each chunk is fetched with a new offset.
For reading from file it's IO monad, it just read lines sequentially from the dump.

Lazy computation inside IO monad

I'm trying to generate a infinite lazy stream of values from IO wrapped by WriterT. I'm using conduits to consume this stream and write it to a file. I'm well aware of the strictness of IO in its bind operator, so how could I produce this stream lazily having IO there?
If it is impossible, should I try to change to lazy ST?
import Data.Conduit
import Control.Monad.Writer
import Data.DList as DL
type Stream = WriterT (DL.DList String) IO ()
generator :: Stream
generator = do
tell $ DL.singleton "something"
generator
runStream :: Stream -> IO ()
runStream s = runResourceT $ stream s
where stream s = sourceStream s $$ sinkStream -- sinkStream just writes to a file
sourceStream s = do w <- liftIO $ execWriterT s
CL.sourceList (DL.toList w)

Seeing that nobody gave a full answer, I'll convert my comment into one. One of the biggest advantages of conduit is that it saves us from using lazy IO! Using a complicated WriterT goes against the idea. Instead, we should make generator a Source and then just plug it with a file Sink:
import Control.Monad
import Data.Conduit
import Data.Conduit.Binary
import qualified Data.ByteString.Char8 as BS
generator :: Monad m => Source m String
generator = replicateM_ 3 (yield "something\n")
-- or `forever (...)` if you want an infinite loop
-- Reads Strings, converts them to ByteStrings and writes
-- to a file.
sinkStream :: MonadResource m => FilePath -> Sink String m ()
sinkStream file = mapInput BS.pack (const Nothing) (sinkFile file)
main :: IO ()
main = runResourceT (generator $$ sinkStream "/tmp/output.txt")

Sequential Binary Data Decoding Using Conduits

The goal is to have a conduit with the following type signature
protobufConduit :: MonadResource m => (ByteString -> a) -> Conduit ByteString m a
The conduit should repeatedly parse protocol buffers (using the ByteString -> a function) received via TCP/IP (using the network-conduit package).
The wire message format is
{length (32 bits big endian)}{protobuf 1}{length}{protobuf 2}...
(The curly braces are not party of the protocol, only used here to separate the entities).
The first idea was to use sequenceSink to repeatedly apply a Sink that is able to parse one ProtoBuf:
[...]
import qualified Data.Binary as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Util as CU
protobufConduit :: MonadResource m => (ByteString -> a) -> Conduit ByteString m a
protobufConduit protobufDecode =
CU.sequenceSink () $ \() ->
do lenBytes <- CB.take 4 -- read protobuf length
let len :: Word32
len = B.decode lengthBytes -- decode ProtoBuf length
intLen = fromIntegral len
protobufBytes <- CB.take intLen -- read the ProtoBuf bytes
return $ CU.Emit () [ protobufDecode protobufBytes ] -- emit decoded ProtoBuf
It doens't work (only works for the first protocol buffer) because there seems to be a number of "leftover" bytes already read from the source but not consumed via CB.take that get discarded.
And I found no way of pushing "the rest back into the source".
Did I get the concept entirely wrong?
PS: Even if I use protocol buffers here, the problem is not related to protocol buffers. To debug the problem I always use {length}{UTF8 encoded string}{length}{UTF8 encoded string}... and a conduit similar to the above one (utf8StringConduit :: MonadResource m => Conduit ByteString m Text).
Update:
I just tried to replace the state (no state () in the sample above) by the remaining bytes and replaced the CB.take calls by calls to a function that first consumes the already read bytes (from the state) and calls await only as needed (when the state is not large enough). Unfortunately, that doesn't work either because as soon as the Source has no bytes left, sequenceSink does not execute the code but the state still contains the remaining bytes :-(.
If you should be interested in the code (which isn't optimized or very good but should be enough to test):
utf8StringConduit :: forall m. MonadResource m => Conduit ByteString m Text
utf8StringConduit =
CU.sequenceSink [] $ \st ->
do (lengthBytes, st') <- takeWithState BS.empty st 4
let len :: Word32
len = B.decode $ BSL.fromChunks [lengthBytes]
intLength = fromIntegral len
(textBytes, st'') <- takeWithState BS.empty st' intLength
return $ CU.Emit st'' [ TE.decodeUtf8 $ textBytes ]
takeWithState :: Monad m
=> ByteString
-> [ByteString]
-> Int
-> Pipe l ByteString o u m (ByteString, [ByteString])
takeWithState acc state 0 = return (acc, state)
takeWithState acc state neededLen =
let stateLenSum = foldl' (+) 0 $ map BS.length state
in if stateLenSum >= neededLen
then do let (firstChunk:state') = state
(neededChunk, pushBack) = BS.splitAt neededLen firstChunk
acc' = acc `BS.append` neededChunk
neededLen' = neededLen - BS.length neededChunk
state'' = if BS.null pushBack
then state'
else pushBack:state'
takeWithState acc' state'' neededLen'
else do aM <- await
case aM of
Just a -> takeWithState acc (state ++ [a]) neededLen
Nothing -> error "to be fixed later"

For protocol buffer parsing and serializing we use messageWithLengthPutM and messageWithLengthGetM (see below) but I assume it uses a varint encoding for the length, which is not what you need. I'd probably try to adapt our implementation below by replacing the messageWithLength Get/Put with something like
myMessageWithLengthGetM =
do size <- getWord32be
getMessageWithSize size
but I have no idea how to implement the getMessageWithSize using the available functions from the protocol buffer package. On the other hand you could just getByteString and then "reparse" the bytestring.
Regarding conduits: Have you tried implementing the conduit without Data.Conduit.Util? Something like
protobufConduit protobufDecode = loop
where
loop =
do len <- liftM convertLen (CB.take 4)
bs <- CB.take len
yield (protobufDecode bs)
loop
Here's the code we use:
pbufSerialize :: (ReflectDescriptor w, Wire w) => Conduit w IO ByteString
pbufSerialize = awaitForever f
where f pb = M.mapM_ yield $ BSL.toChunks $ runPut (messageWithLengthPutM pb)
pbufParse :: (ReflectDescriptor w, Wire w, Show w) => Conduit ByteString IO w
pbufParse = new
where
new = read (runGet messageWithLengthGetM . BSL.fromChunks . (:[]))
read parse =
do mbs <- await
case mbs of
Just bs -> checkResult (parse bs)
Nothing -> return ()
checkResult result =
case result of
Failed _ errmsg -> fail errmsg
Partial cont -> read (cont . Just . BSL.fromChunks . (:[]))
Finished rest _ msg ->
do yield msg
checkResult (runGet messageWithLengthGetM rest)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Constructing RequestBodyStream from Lazy ByteString when length is known - haskell

Related

How to track progress through a streaming ByteString?

Efficient streaming and manipulation of a byte stream in Haskell

How to generalize reads from url and file in Haskell

Lazy computation inside IO monad

Sequential Binary Data Decoding Using Conduits

Categories

Resources