How to track progress through a streaming ByteString? - haskell

I'm using streaming-utils to stream an HTTP response body. I want to track the progress, similar to what bytestring-progress allows with lazy ByteStrings. I suspect something like toChunks would be necessary, then accumulating a cumulative byte count while returning the original stream unmodified. But I cannot figure it out, and the streaming documentation hasn't helped me much, being mostly comparisons to alternative libraries.
Here's some code with my best effort so far. It doesn't include the counting yet, and just tries to print the size of chunks as they stream past (and doesn't compile).
download :: ByteString -> FilePath -> IO ()
download i file = do
  req <- parseRequest . C.unpack $ i
  m <- newHttpClientManager
  runResourceT $ do
    resp <- http req m
    lift . traceIO $ "downloading " <> file
    let body = SBS.fromChunks $ mapsM step $ SBS.toChunks $ responseBody resp
    SBS.writeFile file body

step bs = do
  traceIO $ "got " <> show (C.length bs) <> " bytes"
  return bs

What we want is to traverse the Stream (Of ByteString) IO () in two ways:
One that accumulates the incoming lengths of the ByteStrings and prints updates to console.
One that writes the stream to a file.
We can do that with the help of the copy function, which has type:
copy :: Monad m => Stream (Of a) m r -> Stream (Of a) (Stream (Of a) m) r
copy takes a stream and duplicates it into two different monadic layers, where each element of the original stream is emitted by both layers of the new dissociated stream.
(Notice that we are changing the base monad, not the functor. What changing the functor to another Stream does is to delimit groups in a single stream, and we aren't interested in that here.)
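As a quick, self-contained illustration of copy (a sketch unrelated to the HTTP code above): one layer of the duplicated stream is printed element by element while the other layer is summed.

import Streaming
import qualified Streaming.Prelude as S

copyDemo :: IO ()
copyDemo = do
  -- the outer layer is drained by S.print; the inner layer is then summed
  total <- S.sum_ . S.print . S.copy $ S.each [1, 2, 3 :: Int]
  print total  -- prints 1, 2 and 3, followed by the sum 6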
The following function takes a stream, copies it, accumulates the length of incoming strings with S.scan, prints them, and returns another stream that you can still work with, for example writing it to a file:
{-# LANGUAGE OverloadedStrings #-}
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString as B

track :: Stream (Of B.ByteString) IO r -> Stream (Of B.ByteString) IO r
track stream =
    S.mapM_ (liftIO . print)                      -- brings us back to the base monad, here another stream
  . S.scan (\s b -> s + B.length b) (0 :: Int) id
  $ S.copy stream
This will print the ByteStrings along with the accumulated lengths:
main :: IO ()
main = S.mapM_ B.putStr . track $ S.each ["aa","bb","c"]

Related

Where is the memory leak in using StateT s IO a?

Intention: Small application to learn Haskell: Downloads a wikipedia-article, then downloads all articles linked from it, then downloads all articles linked from them, and so on... until a specified recursion depth is reached. The result is saved to a file.
Approach: Use a StateT to keep track of the download queue, to download an article and to update the queue. I build a list IO [WArticle] recursively and then print it.
Problem: While profiling I find that total memory in use is proportional to number of articles downloaded.
Analysis: From the literature I'm led to believe this is a laziness and/or strictness issue. BangPatterns reduced the memory consumed but didn't remove the proportionality. Furthermore, I know all articles are downloaded before the file output starts.
Possible solutions:
1) The function getNextNode :: StateT CrawlState IO WArticle (below) already has IO. One solution would be to just do the file writing in it and only return the state. It would mean the file is written to in very small chunks, though. Doesn't feel very Haskell-like.
2) Have the function buildHelper :: CrawlState -> IO [WArticle] (below) return [IO WArticle]. Though I wouldn't know how to rewrite that code and have been advised against it in the comments.
Are any of these proposed solutions better than I think they are or are there better alternatives?
import GetArticle (WArticle, getArticle, wa_links, wiki2File) -- my own

type URL = Text

data CrawlState =
    CrawlState ![URL] ![(URL, Int)]
               -- [Completed]  [(Queue, depth)]

-- Called by user
buildDB :: URL -> Int -> IO [WArticle]
buildDB startURL recursionDepth = buildHelper cs
  where cs = CrawlState [] [(startURL, recursionDepth)]

-- Builds list recursively
buildHelper :: CrawlState -> IO [WArticle]
buildHelper !cs@(CrawlState _ queue) = {-# SCC "buildHelper" #-}
  if null queue
    then return []
    else do
      (!article, !cs') <- runStateT getNextNode cs
      rest <- buildHelper cs'
      return (article : rest)

-- State manipulation
getNextNode :: StateT CrawlState IO WArticle
getNextNode = {-# SCC "getNextNode" #-} do
  CrawlState !parsed !queue@((url, depth):queueTail) <- get
  article <- liftIO $ getArticle url
  put $ CrawlState (url:parsed)
                   (queueTail ++ (if depth > 1
                                    then let !newUrls  = wa_links article \\ parsed
                                             !newUrls' = newUrls \\ map fst queue
                                         in  zip newUrls' (repeat (depth - 1))
                                    else []))
  return article

startUrl = pack "https://en.wikipedia.org/wiki/Haskell_(programming_language)"
recursionDepth = 3

main :: IO ()
main = {-# SCC "DbMain" #-}
  buildDB startUrl recursionDepth
    >>= return . wiki2File
    >>= writeFile "savedArticles.txt"
Full code at https://gitlab.com/mattias.br/sillyWikipediaSpider. The current version is limited to downloading only the first eight links from each page, to save time. Without that change it downloads 55 pages at ~600 MB of heap usage.
Thanks for any help!
2) Is [IO WArticle] what I want in this case?
Not quite. The problem is that some of the IO WArticle actions depend on the result of a previous action: the links to future pages reside in previously obtained pages. [IO WArticle] can't provide that: it is pure in the sense that you can always find an action in the list without executing the previous actions.
What we need is a kind of "effectful list" that lets us extract articles one by one, progressively performing the necessary effects, but not forcing us to completely generate the list in one go.
There are several libraries that provide these kinds of "effectful lists": streaming, pipes, conduit. They define monad transformers that extend a base monad with the ability to yield intermediate values before returning a final result. Usually the final result is of a type different from the values that are yielded; it might be simply unit ().
Note: The Functor, Applicative and Monad instances for these libraries differ from the corresponding instances for pure lists. The Functor instances map over the resulting final value, not over the intermediate values which are yielded; to map over the yielded values, they provide separate functions. And the Monad instances sequence effectful lists, instead of trying all combinations; to try all combinations, they provide separate functions.
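As a tiny illustration of the idea (a sketch, not the crawler code): the stream below yields two Ints with an IO action in between, and its final result is ().

import Streaming
import qualified Streaming.Prelude as S

effectfulList :: Stream (Of Int) IO ()
effectfulList = do
  S.yield 1                                      -- an intermediate value
  liftIO (putStrLn "an effect between elements") -- performed when this point is reached
  S.yield 2                                      -- another intermediate value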
Using the streaming library, we could modify buildHelper to something like this:
import Streaming
import qualified Streaming.Prelude as S

buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper !cs@(CrawlState _ queue) =
  if null queue
    then return ()
    else do (article, cs') <- liftIO (runStateT getNextNode cs)
            S.yield article
            buildHelper cs'
And then we could use functions like mapM_ (from Streaming.Prelude, not the one from Control.Monad!) to process the articles one by one, as they are generated.
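For example, each article could be printed as soon as it is generated (a sketch, assuming WArticle has a Show instance and reusing startUrl and recursionDepth from the question):

main :: IO ()
main = S.mapM_ print (buildHelper (CrawlState [] [(startUrl, recursionDepth)]))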
Adding a further explanation and code building on danidiaz's answer. Here's the final code:
import Streaming
import qualified Streaming.Prelude as S
import System.IO (IOMode (WriteMode), hClose, openFile)

buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper cs@(CrawlState _ queue) =
  if null queue
    then return ()
    else do
      (article, cs') <- liftIO (runStateT getNextNode cs)
      S.yield article
      buildHelper cs'

main :: IO ()
main = do
  outFileHandle <- openFile filename WriteMode
  S.toHandle outFileHandle . S.show . buildHelper $
    CrawlState [] [(startUrl, recursionDepth)]
  hClose outFileHandle
outFileHandle is an ordinary file handle opened for writing.
S.toHandle takes a stream of String and writes them to the specified handle.
S.show maps show :: WArticle -> String over the stream.
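As a tiny, self-contained illustration of those two functions (a sketch unrelated to the crawler):

import System.IO (stdout)
import qualified Streaming.Prelude as S

demoWrite :: IO ()
demoWrite = S.toHandle stdout . S.show $ S.each [1, 2, 3 :: Int]
-- writes "1", "2" and "3" to stdout, one per line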
An elegant solution that creates a lazy stream even though it is produced by a series of IO actions (namely downloading websites) and writes it to a file as results become available. On my machine it still uses a lot of memory (relative to the task) during execution but never exceeds 450 MB.

Reduce memory usage of a Haskell program

I have the following program in Haskell:
processDate :: String -> IO ()
processDate date = do
  ...
  let newFlattenedPropertiesWithPrice = filter (notYetInserted date existingProperties) flattenedPropertiesWithPrice
  geocodedProperties <- propertiesWithGeocoding newFlattenedPropertiesWithPrice

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
  let addresses = fmap location properties
  let batchAddresses = chunksOf 100 addresses
  batchGeocodedLocations <- mapM geocodeAddresses batchAddresses
  let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
  return (zip properties geocodedLocations)

geocodeAddresses :: [String] -> IO (Maybe [Maybe LatLng])
geocodeAddresses addresses = do
  mapQuestKey <- getEnv "MAP_QUEST_KEY"
  geocodeResponse <- openURL $ mapQuestUrl mapQuestKey addresses
  return $ geocodeResponseToResults geocodeResponse

geocodeResponseToResults :: String -> Maybe [Maybe LatLng]
geocodeResponseToResults inputResponse =
  latLangs
  where
    decodedResponse :: Maybe GeocodingResponse
    decodedResponse = decodeGeocodingResponse inputResponse
    latLangs = fmap (fmap geocodingResultToLatLng . results) decodedResponse

decodeGeocodingResponse :: String -> Maybe GeocodingResponse
decodeGeocodingResponse inputResponse = Data.Aeson.decode (fromString inputResponse) :: Maybe GeocodingResponse
It reads a list of properties (homes and apartments) from html files, parses them, geocodes the addresses and saves the results into sqlite db.
Everything works fine except for a very high memory usage (around 800M).
By commenting code out I have pinpointed the problem to be the geocoding step.
I send 100 addresses at a time to MapQuest api (https://developer.mapquest.com/documentation/geocoding-api/batch/get/).
The response for 100 addresses is quite massive, so it might be one of the culprits, but 800M? I feel like it holds on to all of the results until the end, which drives the memory usage so high.
After commenting out the geocoding part of the program memory usage is around 30M which is fine.
You can get the full version which reproduces the issue here: https://github.com/Leonti/haskell-memory-so
I'm quite a newbie in Haskell, so not sure how I can optimize it.
Any ideas?
Cheers!
It might be worth recording that this turned out to be a simple streaming problem arising from the use of mapM and sequence, which (along with replicateM, traverse and other things that make you "extract a list from IO") always raise accumulation worries. A little detour through a streaming library was needed, so in the repo it was only necessary to replace
processDate :: String -> IO ()
processDate date = do
  allFiles <- listFiles date
  allProperties <- mapM fileToProperties allFiles
  let flattenedPropertiesWithPrice = filter hasPrice $ concat allProperties
  geocodedProperties <- propertiesWithGeocoding flattenedPropertiesWithPrice
  print geocodedProperties

propertiesWithGeocoding :: [ParsedProperty] -> IO [(ParsedProperty, Maybe LatLng)]
propertiesWithGeocoding properties = do
  let batchProperties = chunksOf 100 properties
  batchGeocodedLocations <- mapM geocodeAddresses batchProperties
  let geocodedLocations = fromJust $ concat <$> sequence batchGeocodedLocations
  return geocodedLocations
with something like this
import Streaming
import qualified Streaming.Prelude as S

processDate :: String -> IO ()
processDate date = do
  allFiles <- listFiles date          -- we accept an unstreamed list
  S.print $ propertiesWithGeocoding   -- this was the main pain point, see below
          $ S.filter hasPrice
          $ S.concat
          $ S.mapM fileToProperties   -- this mapM doesn't accumulate
          $ S.each allFiles           -- the list is converted to a stream

propertiesWithGeocoding
  :: Stream (Of ParsedProperty) IO r
  -> Stream (Of (ParsedProperty, Maybe LatLng)) IO r
propertiesWithGeocoding properties =
  S.concat $ S.concat
           $ S.mapM geocodeAddresses  -- this mapM doesn't accumulate results from mapquest
           $ S.mapped S.toList        -- convert segments to haskell lists
           $ chunksOf 100 properties  -- this is the streaming `chunksOf`
  -- concat here flattens a stream of lists of as into a stream of as
  -- and a stream of maybe as into a stream of as
The memory use then shows a series of peaks, each corresponding to a trip to MapQuest, promptly followed by a little processing and a print, whereupon ghc forgets all about it and moves on.
Of course this could be done with pipes or conduit. But here we just need a little bit of simple mapM / sequence / traverse / replicateM avoidance, and streaming is perhaps simplest for this sort of quick local refactoring. Note that the list here is quite short, so the thought "but short lists are fine with mapM/traverse/etc.!" can be quite spectacularly false. Why not just get rid of them? Whenever you are about to write mapM f over a list, it is a good idea to consider S.mapM f . S.each (or the conduit or pipes equivalent). You will then have a stream, and can recover a list with S.toList or an equivalent, but it is likely that, as in this case, you will find you don't need a reified accumulated list and can instead use some streaming process such as printing to a file or stdout, or writing to a database, after making whatever list-like manipulations are needed (here we use e.g. the streaming filter, and also concat both to flatten streamed lists and as a sort of catMaybes).
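A minimal sketch of that "S.mapM f . S.each" pattern, with a hypothetical effectful step fetch (not a function from the repo): results are consumed as they arrive instead of being accumulated in a list first.

import qualified Streaming.Prelude as S

streamedInsteadOfMapM :: (Int -> IO String) -> [Int] -> IO ()
streamedInsteadOfMapM fetch xs =
    S.mapM_ putStrLn  -- consume each result immediately
  . S.mapM fetch      -- run the effect per element, without building a list
  $ S.each xs         -- turn the plain list into a stream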

Constructing RequestBodyStream from Lazy ByteString when length is known

I am trying to adapt this AWS S3 upload code to handle a lazy ByteString whose length is already known, so that it is not forced to be read in its entirety in memory (it comes over the network, where the length is sent beforehand). It seems I have to define a GivesPopper function over the lazy ByteString to convert it to a RequestBodyStream. Because of the convoluted way GivesPopper is defined, I am not sure how to write it for a lazy ByteString. I would appreciate pointers on how to write it. Here is how it is written for reading from a file:
let file ="test"
-- streams large file content, without buffering more than 10k in memory
let streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
streamer in the code above is of type GivesPopper () if I understand it correctly.
Given a Lazy ByteString with known length len, what would be a good way to write GivesPopper function over it? We can read one chunk at a time.
Is this what you're looking for?
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO

file = "test"

-- original streamer for feeding a sink from a file
streamer :: (IO S.ByteString -> IO r) -> IO r
streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240

-- feed a lazy ByteString to a sink
lstreamer :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
lstreamer lbs sink = sink (return (L.toStrict lbs))
lstreamer type checks but probably doesn't do exactly what you want it to do. It simply returns the same data every time the sink calls it. On the other hand S.hGet h ... will eventually return the empty string.
Here is a solution which uses an IORef to keep track of if we should start returning the empty string:
import Data.IORef

mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
  ref <- newIORef False
  let fetch :: IO S.ByteString
      fetch = do
        sent <- readIORef ref
        writeIORef ref True
        if sent
          then return S.empty
          else return (L.toStrict lbs)
  sink fetch
Here fetch is the action which gets the next chunk. The first time you call it you will get the original lazy ByteString (strictified). Subsequent calls will always return the empty string.
Update
Here's how to give out a small amount at a time:
mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
  ref <- newIORef (L.toChunks lbs)
  let fetch :: IO S.ByteString
      fetch = do
        chunks <- readIORef ref
        case chunks of
          []     -> return S.empty
          (c:cs) -> do writeIORef ref cs
                       return c
  sink fetch
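If the target is http-client's RequestBodyStream (as the question suggests), the known length and mklstream slot together roughly like this (a sketch; RequestBodyStream is the constructor from Network.HTTP.Client, and len is the length you already know):

import Data.Int (Int64)
import Network.HTTP.Client (RequestBody (RequestBodyStream))

lazyBodyWithLength :: Int64 -> L.ByteString -> RequestBody
lazyBodyWithLength len lbs = RequestBodyStream len (mklstream lbs)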

how to add a new source inside a conduit haskell

I have a problem with the following code using network-conduit:
import Data.Conduit.List as CL
import Data.Conduit.Text as CT
import qualified Data.ByteString.Char8 as S8
import qualified Data.Text as TT

mySource :: ResourceT m => Integer -> Source m Int
mySource i = {- function -} undefined

myApp :: Application
myApp src snk =
  src $= CT.decode CT.ascii
      $= CL.map decimal
      $= CL.map {-problem here-}
      $$ src
In the problem place I want to write something like
\t -> case t of
        Left err      = S8.pack $ "Error:" ++ e
        Right (i, xs) = (>>>=) mySource
        {- or better:
           do
             (>>>=) mySource
             (<<<=) T.pack xs
        -}
where the (>>>=) function pushes mySource's output to the next level and (<<<=) sends values back to the previous level.
The network chops up the byte stream into arbitrary ByteString chunks. With the code above, those ByteString chunks will be mapped to chunks of Text, and each chunk of Text will be parsed as a decimal. However, a string of decimal digits representing a single decimal may be split across two (or more) Text chunks. Also, as you realize, using decimal gives you back the remainder of the Text chunk that didn't get parsed as part of the decimal, which you are trying to shove back into the input stream.
Both of these problems can be solved by using conduitParserEither from Data.Conduit.Attoparsec together with Data.Attoparsec.Text.decimal. Note that it is not sufficient to just parse decimal; you will also need to handle some kind of separator between decimals.
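For instance, a hedged sketch of a parser that consumes a decimal and any trailing whitespace, which could then be fed to conduitParserEither (names from attoparsec's Data.Attoparsec.Text):

import Control.Applicative ((<*))
import qualified Data.Attoparsec.Text as AT

decimalToken :: AT.Parser Int
decimalToken = AT.decimal <* AT.skipSpace  -- the whitespace acts as the separator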
It is also not possible to splice a Source from CL.map, since CL.map's type signature is
map :: Monad m => (a -> b) -> Conduit a m b
The function you pass to map gets an opportunity to transform each input a into a single output b, not a stream of b's. To do that, you can use awaitForever, but you'll need to transform your Source into a general Producer with toProducer in order for the types to match.
However, in your code, you are trying to send parse errors downstream as ByteString's, but the output of mySource as Int's, which is a type error. You must provide a stream of ByteString in both cases; the successful parse case can return a Conduit made by fusing other Conduit's as long as it ends up with an output of ByteString:
...
...
$= (let f (Left err)     = yield $ S8.pack $ "Error: " ++ show err
        f (Right (_, i)) = toProducer (mySource i) $= someOtherConduit
    in awaitForever f)
where someOtherConduit sinks the Int's from mySource, and sources ByteString's.
someOtherConduit :: Monad m => Conduit Int m ByteString
Finally, I believe you meant to connect the snk at the end of the pipe instead of the src.
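For concreteness, a hypothetical someOtherConduit might simply render each Int as a line of ByteString output (a sketch using the question's CL and S8 imports; any Int-to-ByteString conduit would do):

someOtherConduit :: Monad m => Conduit Int m S8.ByteString
someOtherConduit = CL.map (\i -> S8.pack (show i ++ "\n"))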

Attoparsec Iteratee

Just to learn a bit about iteratees, I wanted to reimplement a simple parser I had made, using Data.Iteratee and Data.Attoparsec.Iteratee. I'm pretty much stumped, though. Below I have a simple example that is able to parse one line from a file. My parser reads one line at a time, so I need a way of feeding lines to the iteratee until it's done. I've read everything I found googling this, but a lot of the material on iteratees/enumerators is pretty advanced. This is the part of the code that matters:
-- There are more imports above.
import Data.Attoparsec.Iteratee
import Data.Iteratee (joinI, run)
import Data.Iteratee.IO (defaultBufSize, enumFile)

line :: Parser ByteString -- implementation left out (it doesn't check for newline)

iter = parserToIteratee line

main = do
  p <- liftM head getArgs
  i <- enumFile defaultBufSize p $ iter
  i' <- run i
  print i'
This example will parse and print one line from a file with multiple lines. The original script mapped the parser over a list of ByteStrings, so I would like to do the same thing here. I found enumLines in Data.Iteratee, but I can't for the life of me figure out how to use it. Maybe I misunderstand its purpose?
Since your parser works on a line at a time, you don't even need to use attoparsec-iteratee. I would write this as:
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Attoparsec as A

parser :: Parser ParseOutput

type POut = Either String ParseOutput

processLines :: Iteratee ByteString IO [POut]
processLines = joinI $ (enumLinesBS ><> I.mapStream (A.parseOnly parser)) stream2list
The key to understanding this is the "enumeratee", which is just the iteratee term for a stream converter. It takes a stream processor (iteratee) of one stream type and converts it to work with another stream. Both enumLinesBS and mapStream are enumeratees.
To map your parser over multiple lines, mapStream is sufficient:
i1 :: Iteratee [ByteString] IO (Iteratee [POut] IO [POut])
i1 = mapStream (A.parseOnly parser) stream2list
The nested iteratees just mean that this converts a stream of [ByteString] to a stream of [POut], and when the final iteratee (stream2list) is run it returns that stream as [POut]. So now you just need the iteratee equivalent of lines to create that stream of [ByteString], which is what enumLinesBS does:
i2 :: Iteratee ByteString IO (Iteratee [ByteString] IO (Iteratee [POut] IO [POut]))
i2 = enumLinesBS $ mapStream (A.parseOnly parser) stream2list
But this function is pretty unwieldy to use because of all the nesting. What we really want is a way to pipe output directly between stream converters, and at the end simplify everything to a single iteratee. To do this we use joinI and (><>):
e1 :: Iteratee [POut] IO a -> Iteratee ByteString IO (Iteratee [POut] IO a)
e1 = enumLinesBS ><> mapStream (A.parseOnly parser)
i' :: Iteratee ByteString IO [POut]
i' = joinI $ e1 stream2list
which is equivalent to how I wrote it above, with e1 inlined.
There's still an important element remaining, though. This function simply returns the parse results in a list. Typically you would want to do something else, such as combining the results with a fold.
edit: Data.Iteratee.ListLike.mapM_ is often useful to create consumers. At that point each element of the stream is a parse result, so if you want to print them you can use
consumeParse :: Iteratee [POut] IO ()
consumeParse = I.mapM_ (either (\e -> return ()) print)
processLines2 :: Iteratee ByteString IO ()
processLines2 = joinI $ (enumLinesBS ><> I.mapStream (A.parseOnly parser)) consumeParse
This will print just the successful parses. You could easily report errors to STDERR, or handle them in other ways, as well.
