How to generalize reads from url and file in Haskell

I'm developing an application that fetches data from the Internet in chunks at a given offset. For testing purposes I have a dump file in which each line corresponds to a separate chunk. I want to generalize the read operations over the url and the dump file. Currently, I have the following functions:
getChunk :: DataSourceMode -> Config -> Int -> Int -> IO FetchResult
getChunk DSNormal config ownerId' offset' = do ...
getChunk DSFromFile config ownerId' offset' = do ...
The problem with the current implementation is that it re-reads the dump file on every getChunk call, which is obviously inefficient. The first idea is to load the lines from the dump file into a list, but then it wouldn't be easy to generalize it with reading from the url. I suppose conduits or pipes could be used to construct a source of chunks, but I'm not familiar with these libraries; should I use one of them, or is there a better solution?

I ended up with conduits. I used the generalized function processFeed as a sink and then pushed data into it from postUrlSource or Data.Conduit.Binary.sourceFile, depending on the mode.
import Data.Conduit.Binary as CB(sourceFile, conduitFile, lines)
processFeed :: MonadIO m => Config -> OwnerId -> (OwnerId -> [Post] -> IO ()) -> Sink BS.ByteString m FetchResult
processFeed config ownerId' processFn = do ...
postUrlSource :: MonadIO m => Config -> OwnerId -> Source (StateT FetchState (m)) BS.ByteString
postUrlSource config ownerId' = do ...
...
_ <- case (dsMode config) of
  DSFromFile -> do
    runResourceT $ CB.sourceFile dumpFile $= CB.lines $$ (processFeed config publicId' saveResult)
  DSNormal -> do
    let postsFromUrlConduit = (postUrlSource config publicId') $$ (processFeed config publicId' saveResult)
    fetchedPosts <- runStateT postsFromUrlConduit (FetchState 0 "")
    return $ fst fetchedPosts
...
StateT is used when fetching data from the url, so each chunk is fetched with a new offset.
For reading from the file a plain IO monad is enough; it just reads the lines sequentially from the dump.
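For reference, here is a minimal, self-contained sketch of the same pattern against the newer conduit API (runConduit and .| instead of $= and $$). countChunks, fileChunks, urlChunks and fetchAt are hypothetical names used only for illustration; fetchAt stands in for the real HTTP call and is not part of any library.

import Conduit
import qualified Data.ByteString as BS
import qualified Data.Conduit.Binary as CB

-- Generic consumer shared by both modes; here it just counts the chunks.
countChunks :: Monad m => ConduitT BS.ByteString o m Int
countChunks = lengthC

-- Source 1: one chunk per line of the dump file.
fileChunks :: MonadResource m => FilePath -> ConduitT i BS.ByteString m ()
fileChunks path = CB.sourceFile path .| CB.lines

-- Source 2: chunks fetched from the url, with the offset kept inside the source.
urlChunks :: MonadIO m => (Int -> IO (Maybe BS.ByteString)) -> ConduitT i BS.ByteString m ()
urlChunks fetchAt = go 0
  where
    go offset = do
      mchunk <- liftIO (fetchAt offset)
      case mchunk of
        Nothing    -> return ()
        Just chunk -> yield chunk >> go (offset + 1)

main :: IO ()
main = do
  -- file mode; url mode would be: runConduit $ urlChunks fetch .| countChunks
  n <- runResourceT . runConduit $ fileChunks "dump.txt" .| countChunks
  print n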

Related

How to Pipe Typed Process to wai-conduit's responseSource?

I want to have warp run a process, then respond with that process' output. The output is assumed to be larger than the server's RAM; loading the entire output then responding is not an option. I'd thought that I could accomplish this using something like
withProcessWait_ (setStdout createSource "cat largefile") (pure . responseSource ok200 [] . getStdout)
but responseSource uses ConduitT i (Flush Builder) IO () and createSource uses ConduitT i ByteString m (). I could not figure out how to convert a ByteString conduit to a Flush Builder conduit.
So I devised a solution that seems to work, but it's regrettably less simple:
responseProcess :: Status -> ResponseHeaders -> ProcessConfig stdin stdout stderr -> Response
responseProcess s hs cfg = responseStream s hs $ \send flush ->
  withProcessWait_ (setStdout createPipe cfg) $ \p@(getStdout -> h) ->
    let loop = do
          bs <- hGetSome h defaultChunkSize
          unless (BS.null bs) (send (byteString bs) *> flush *> loop)
    in loop *> hClose h
Is this necessary, even if I might pretty it up by wrapping it in mkStreamSpec or something? Or is there a simpler method I'm missing?
Edit: comments on the solution:
intersperseC lets me use Chunk and Flush together. That solves the Flush Builder/ByteString conversion problem. I haven't tested it, but it looks right and I trust it's been used.
However, I found that
withProcessWait_ (setStdout createSource "cat largefile") $ \p ->
  responseSource ok200 [] (getStdout p .| mapC (Chunk . byteString) .| intersperseC Flush)
closes the process handle too early. Thus I need to manage the pipe myself: using createPipe instead of createSource. But this means that I need to call hClose at the end, which means that I need a response handler that returns IO (); the only one that does (excepting responseRaw) is responseStream, which uses StreamingBody as an alternative to Conduit. Thus I conclude that my original solution is needed and that Conduit cannot be used for streaming processes. Feel free to correct this if it's incorrect.
responseSource has type
responseSource :: Status -> ResponseHeaders -> Source IO (Flush Builder) -> Response
and the definition of Flush is
data Flush a = Chunk a | Flush
That is, a value of type Flush Builder is either a Builder or a command that instructs warp to flush the output stream.
Builder here is the builder type re-exported by blaze-builder (in current versions it is a thin wrapper over Data.ByteString.Builder from the bytestring package). It's basically a representation of a chunk of bytes, optimized for efficient concatenation, and it can be constructed from a strict ByteString using the fromByteString function (byteString in Data.ByteString.Builder).
Knowing that, and using mapC from conduit, we can define this adapter:
adapter :: Monad m => ConduitT ByteString (Flush Builder) m ()
adapter = mapC (Chunk . fromByteString)
There's a problem though: the adapter never flushes. But we can intersperse flushing commands by means of intersperseC:
adapter :: Monad m => ConduitT ByteString (Flush Builder) m ()
adapter = mapC (Chunk . fromByteString) .| intersperseC Flush
And what if we don't want to flush after every chunk? Perhaps we could use chunksOfCE to group the byte chunks before converting them into Flush values.
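As an illustration of that grouping idea, here is a hedged, untested sketch: regroup the incoming bytes into roughly 64 KiB chunks before emitting them, so a flush is requested once per group rather than after every small read. adapterGrouped is a made-up name, byteString is the builder constructor from Data.ByteString.Builder (the bytestring counterpart of blaze-builder's fromByteString), and 65536 is an arbitrary group size.

import Conduit
import Data.ByteString (ByteString)
import Data.ByteString.Builder (Builder, byteString)

-- Group bytes, wrap each group as a Chunk, and ask for a flush after each group.
adapterGrouped :: Monad m => ConduitT ByteString (Flush Builder) m ()
adapterGrouped = chunksOfCE 65536 .| mapC (Chunk . byteString) .| intersperseC Flush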

How to track progress through a streaming ByteString?

I'm using streaming-utils to stream an HTTP response body. I want to track the progress, similar to what bytestring-progress allows with lazy ByteStrings. I suspect something like toChunks would be necessary, then accumulating the bytes read and returning the original stream unmodified. But I cannot figure it out, and the streaming documentation is very unhelpful, mostly full of grandiose comparisons to alternative libraries.
Here's some code with my best effort so far. It doesn't include the counting yet, and just tries to print the size of chunks as they stream past (and doesn't compile).
download :: ByteString -> FilePath -> IO ()
download i file = do
  req <- parseRequest . C.unpack $ i
  m <- newHttpClientManager
  runResourceT $ do
    resp <- http req m
    lift . traceIO $ "downloading " <> file
    let body = SBS.fromChunks $ mapsM step $ SBS.toChunks $ responseBody resp
    SBS.writeFile file body

step bs = do
  traceIO $ "got " <> show (C.length bs) <> " bytes"
  return bs
What we want is to traverse the Stream (Of ByteString) IO () in two ways:
One that accumulates the incoming lengths of the ByteStrings and prints updates to console.
One that writes the stream to a file.
We can do that with the help of the copy function, which has type:
copy :: Monad m => Stream (Of a) m r -> Stream (Of a) (Stream (Of a) m) r
copy takes a stream and duplicates it into two different monadic layers, where each element of the original stream is emitted by both layers of the new dissociated stream.
(Notice that we are changing the base monad, not the functor. What changing the functor to another Stream does is to delimit groups in a single stream, and we aren't interested in that here.)
The following function takes a stream, copies it, accumulates the length of incoming strings with S.scan, prints them, and returns another stream that you can still work with, for example writing it to a file:
{-# LANGUAGE OverloadedStrings #-}
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString as B

track :: Stream (Of B.ByteString) IO r -> Stream (Of B.ByteString) IO r
track stream =
  S.mapM_ (liftIO . print) -- brings us back to the base monad, here another stream
    . S.scan (\s b -> s + B.length b) (0 :: Int) id
    $ S.copy stream
This will print the ByteStrings along with the accumulated lengths:
main :: IO ()
main = S.mapM_ B.putStr . track $ S.each ["aa","bb","c"]
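If you want to see it attached to real I/O rather than S.each, here is a hedged, self-contained sketch (untested) that copies a file while printing the running byte count. trackR and copyWithProgress are made-up names; trackR generalises track's base monad from IO to any MonadIO so it can run under ResourceT, and Data.ByteString.Streaming is the streaming-bytestring module (named Streaming.ByteString in newer versions).

import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString as B
import qualified Data.ByteString.Streaming as Q
import Control.Monad.Trans.Resource (runResourceT)

-- Same as track above, but polymorphic in the base monad.
trackR :: MonadIO m => Stream (Of B.ByteString) m r -> Stream (Of B.ByteString) m r
trackR stream =
  S.mapM_ (liftIO . print)              -- print the running total per chunk
    . S.scan (\s b -> s + B.length b) (0 :: Int) id
    $ S.copy stream

-- Stream a file to a new location, reporting progress as it goes.
copyWithProgress :: FilePath -> FilePath -> IO ()
copyWithProgress src dst =
  runResourceT $ Q.writeFile dst . Q.fromChunks . trackR . Q.toChunks $ Q.readFile src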

Constructing RequestBodyStream from Lazy ByteString when length is known

I am trying to adapt this AWS S3 upload code to handle Lazy ByteString where length is already known (so that it is not forced to be read in its entirety in memory - it comes over the network where length is sent beforehand). It seems I have to define a GivesPopper function over Lazy ByteString to convert it to RequestBodyStream. Because of the convoluted way GivesPopper is defined, I am not sure how to write it for Lazy ByteString. Will appreciate pointers on how to write it. Here is how it is written for reading from the file:
let file = "test"
-- streams large file content, without buffering more than 10k in memory
let streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
streamer in the code above is of type GivesPopper () if I understand it correctly.
Given a Lazy ByteString with known length len, what would be a good way to write GivesPopper function over it? We can read one chunk at a time.
Is this what you're looking for?
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO
file = "test"
-- original streamer for feeding a sink from a file
streamer :: (IO S.ByteString -> IO r) -> IO r
streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
-- feed a lazy ByteString to sink
lstreamer :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
lstreamer lbs sink = sink (return (L.toStrict lbs))
lstreamer type checks but probably doesn't do exactly what you want it to do. It simply returns the same data every time the sink calls it. On the other hand S.hGet h ... will eventually return the empty string.
Here is a solution which uses an IORef to keep track of if we should start returning the empty string:
import Data.IORef

mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
  ref <- newIORef False
  let fetch :: IO S.ByteString
      fetch = do
        sent <- readIORef ref
        writeIORef ref True
        if sent
          then return S.empty
          else return (L.toStrict lbs)
  sink fetch
Here fetch is the action which gets the next chunk. The first time you call it you will get the original lazy ByteString (strict-ified). Subsequent calls will always return the empty string.
Update
Here's how to give out a small amount at a time:
mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
  ref <- newIORef (L.toChunks lbs)
  let fetch :: IO S.ByteString
      fetch = do
        chunks <- readIORef ref
        case chunks of
          []     -> return S.empty
          (c:cs) -> do
            writeIORef ref cs
            return c
  sink fetch
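Building on the mklstream above, here is a hedged sketch (untested) of handing it to http-client when the length is already known: RequestBodyStream wants the byte count as an Int64 plus a GivesPopper (), which is exactly the shape of mklstream lbs specialised to r ~ (). lazyBodyWithLength is a made-up name.

import Data.Int (Int64)
import Network.HTTP.Client (RequestBody (RequestBodyStream))

-- The caller supplies the length it received over the network, so the lazy
-- ByteString never has to be forced just to measure it.
lazyBodyWithLength :: Int64 -> L.ByteString -> RequestBody
lazyBodyWithLength len lbs = RequestBodyStream len (mklstream lbs)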

Conduit - Dispatch into multiple output files

I'm trying to dispatch the items from a conduit into many output files; the problem is very similar to Conduit - Multiple output file within the pipeline, with a few differences:
In the previous solution, every sink has a filter that decides whether the element belongs to that sink or not. In my case every element coming from upstream goes to exactly one file, and when there is a large number of files it would be better to make only one decision about which file it goes to.
The files are created on demand. A "selector" function decides which sink the next element is going to, and if it doesn't exist yet it creates it using a "create new sink" function.
For example, if the Source yields: 8 4 7 1 5
and the sink selector is modulo 3, then the sequence of actions would be:
Create file 2
Add 8 to file 2
Create file 1
Add 4 to file 1
Add 7 to file 1
Add 1 to file 1
Add 5 to file 2
I'm thinking of a type for this dispatcher like this:
dispatcherSink_ :: (Monad m) =>
  (a -> k) ->            -- sink selector
  (k -> Sink a m ()) ->  -- new sink
  Sink a m ()
I've tried to write the function using evalStateC with an internal StateT holding a Map of Sinks, but I'm not able to tie up the types. I'm not sure if you can even use the same sink twice.
Is what I'm trying to do even possible?
I'm still a newbie in Haskell, so any help will be appreciated.
Edited
I thought I could create a map of ResumableSinks; there is a library on Hackage for that, but it depends on an old and very specific version of Conduit, so cabal couldn't install it.
In the end I didn't find a way to write a function with the previous type that works with any sink, so I came up with a function that works with files directly:
import System.IO (hClose,openFile,IOMode(WriteMode))
import Conduit
import Data.IOData
import qualified Data.Foldable as F
import qualified Data.Map.Strict as M
import Control.Monad.State.Strict
import Data.ByteString.Char8 (pack)
fileDispatcherSink ::
  (MonadIO m, IOData c, Ord k) =>
  (a -> k) ->
  (a -> c) ->
  (k -> FilePath) ->
  Sink a m ()
fileDispatcherSink selector toChunked path =
  evalStateC M.empty $ dispatcher
  where
    dispatcher = do
      next <- await
      m <- get
      case next of
        Nothing -> liftIO $ F.traverse_ hClose m
        Just a -> do
          let k = selector a
          h <- case M.lookup k m of
                 Nothing -> do
                   nh <- liftIO $ openFile (path k) WriteMode
                   put $ M.insert k nh m
                   return nh
                 Just h -> return h
          yield (toChunked a) $$ sinkHandle h
          dispatcher

testSource :: (Monad m) => Source m Int
testSource = yieldMany [8, 4, 7, 1, 5]

main :: IO ()
main = testSource
  $$ fileDispatcherSink (`mod` 3) (pack . show) ((++ ".txt") . show)
Is there a way to write the dispatcherSink_ function?
There is a conceptual problem with implementing
dispatcherSink_ :: (Monad m) =>
  (a -> k) ->            -- sink selector
  (k -> Sink a m ()) ->  -- new sink
  Sink a m ()
In conduit, data is pulled from upstream to downstream rather than pushed, so a Sink decides whether it requests the next input value from its upstream conduit. That means you can't really keep a map of Sinks, read an input value and then feed it to one of them: the Sink you select might not read the value at all, it might decide to finish, and then what would you do with the input value? You could create a new sink for that key, but it can also decline to accept the input.
So instead of a Sink you'll most likely need a different concept: something you can push a value to and that you can also finalize. An idea (untested):
data PushSink m i = PushSink { psPush :: i -> m (PushSink m i)
                             , psFinalize :: m ()
                             }
An implementation for writing files would open a file, keep the handle, and psPush would just write a chunk into the file, returning the same object, while psFinalize would close the file.
And then you can implement a variant like this
dispatcherSink_ :: (Monad m) =>
  (a -> k) ->                -- sink selector
  (k -> m (PushSink m a)) -> -- new sink
  Sink a m ()
which pushes values to PushSinks and finalizes them all when there is no input.
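For illustration, here is a minimal, untested sketch of the file-backed PushSink described above, using the PushSink type defined earlier; filePushSink is a made-up name and error handling is omitted.

import qualified Data.ByteString as BS
import System.IO (IOMode (WriteMode), hClose, openFile)

filePushSink :: FilePath -> IO (PushSink IO BS.ByteString)
filePushSink path = do
  h <- openFile path WriteMode
  let sink = PushSink
        { psPush     = \chunk -> BS.hPut h chunk >> return sink  -- write and keep the same sink
        , psFinalize = hClose h                                   -- close the handle when done
        }
  return sink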

Haskell: Hiding failures in lazy IO

This is a noob question.
I'd like to write a function which provides a lazy stream of images, presumably something like:
imageStream :: [IO Image]
Unfortunately, the function which reads images can fail, so it looks like:
readImage :: IO (Maybe Image)
So, the function I can write looks like:
maybeImageStream :: [IO (Maybe Image)]
How do I implement a function such as the following, while still keeping lazy IO?
flattenImageStream :: [IO (Maybe Image)] -> [IO Image]
Semantically, when you ask flattenImageStream for the next image, it should iterate through the list and attempt to read each image. It does this until it finds an image that loads, and returns it.
EDIT: There seems to be some disagreement in the answers.
Some have suggested solutions that use sequence, but I'm pretty sure I tested that and found it destroys laziness.
(I'll test it again to be sure when I get back to my computer.)
Someone also suggested using unsafeInterleaveIO.
From the documentation for that function, it seems it would work, but obviously I want to respect the type system as much as possible.
You can use ListT from pipes, which provides a safer alternative to lazy IO that does the right thing in this case.
The way you model your lazy stream of potentially failing images is:
imageStream :: ListT IO (Maybe Image)
Assuming that you had some image loading function of type:
loadImage :: FileName -> IO (Maybe Image)
... then the way you build such a stream would be something like:
imageStream = do
  fileName <- Select $ each ["file1.jpg", "file2.jpg", "file3.jpg"]
  lift $ loadImage fileName
If you use the dirstream library, then you can even lazily stream over the directory contents, too.
The function that filters out only the successful results would have this type:
flattenImageStream :: (Monad m) => ListT m (Maybe a) -> ListT m a
flattenImageStream stream = do
  ma <- stream
  case ma of
    Just a  -> return a
    Nothing -> mzero
Notice that this function works for any base monad, m. There is nothing IO-specific about it. It also preserves laziness!
Applying flattenImageStream to imageStream gives us something of type:
finalStream :: ListT IO Image
finalStream = flattenImageStream imageStream
Now let's say that you have some function that consumes these images, of type:
useImage :: Image -> IO ()
If you want to process the final ListT using the useImage function, you just write:
main = runEffect $
  for (every finalStream) $ \image -> do
    lift $ useImage image
That will then lazily consume the image stream.
Of course, you could also play code golf and combine all of that into the following much shorter version:
main = runEffect $ for (every image) (lift . useImage)
  where
    image = do
      fileName <- Select $ each ["file1.jpg", "file2.jpg", "file3.jpg"]
      maybeImage <- lift $ loadImage fileName
      case maybeImage of
        Just img -> return img
        Nothing  -> mzero
I'm also thinking of adding a fail definition for ListT so that you could just write:
main = runEffect $ for (every image) (lift . useImage)
  where
    image = do
      fileName <- Select $ each ["file1.jpg", "file2.jpg", "file3.jpg"]
      Just img <- lift $ loadImage fileName
      return img
As suggested, you can turn [m a] into m [a] using sequence, so you get:
imageStream :: IO [Image]
Then you can use catMaybes from Data.Maybe to keep just the Just values:
catMaybes `liftM` imageStream
Implementing this as requested seems like it would require knowing outside of the IO monad whether a value inside IO was Nothing, and as IO is designed to prevent its values from "leaking out" into the outside purely functional world (unsafePerformIO notwithstanding), this would be impossible. Instead, I recommend producing an IO [Image]: use sequence to convert the [IO (Maybe Image)] to IO [Maybe Image], and then use Data.Maybe.catMaybes within the IO monad (e.g., with fmap or liftM) to convert to IO [Image], e.g.:
flattenImageStream = fmap catMaybes $ sequence maybeImageStream
I don't think any of the other answers do exactly what you want, because I'm pretty sure catMaybes will just skip over the image and not try to reload it. If you want to keep retrying until an image loads, try this:
flattenImageStream :: [IO (Maybe Image)] -> IO [Image]
flattenImageStream xs = mapM untilSuc xs

untilSuc :: IO (Maybe a) -> IO a
untilSuc f = do
  res <- f
  case res of
    Nothing -> untilSuc f
    Just i  -> return i
But what you are doing is kind of strange. What if you have the wrong file path? What if the image simply can't be loaded? You'll just keep trying to load it forever. You should probably cap the number of attempts before giving up, for example with something like the sketch below.
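Here is a hedged variant of untilSuc (untilSucN is my own name, not from the answer) that gives up after a fixed number of attempts instead of retrying forever:

untilSucN :: Int -> IO (Maybe a) -> IO (Maybe a)
untilSucN n f
  | n <= 0    = return Nothing        -- out of attempts, give up
  | otherwise = do
      res <- f
      case res of
        Nothing -> untilSucN (n - 1) f
        Just i  -> return (Just i)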
