I have a problem with the following code using network-conduit:
import Data.Conduit.List as CL
import Data.Conduit.Text as CT
import qualified Data.ByteString.Char8 as S8
import qualified Data.Text as TT

mySource :: ResourceT m => Integer -> Source m Int
mySource i = {- function -} undefined

myApp :: Application
myApp src snk =
    src $= CT.decode CT.ascii
        $= CL.map decimal
        $= CL.map {-problem here-}
        $$ src
In the problem place I want to write something like:
\t -> case t of
    Left err     -> S8.pack $ "Error: " ++ err
    Right (i,xs) -> (>>>=) mySource
{- or better:
    do
        (>>>=) mySource
        (<<<=) T.pack xs
-}
where the (>>>=) function pushes the output of mySource to the next level and (<<<=) sends the unparsed rest of the input back to the previous level.
The network chops up the byte stream into arbitrary ByteString chunks. With the code above, those ByteString chunks will be mapped to chunks of Text, and each chunk of Text will be parsed as a decimal. However, a string of decimal digits representing a single decimal may be split across two (or more) Text chunks. Also, as you realize, using decimal gives you back the remainder of the Text chunk that didn't get parsed as part of the decimal, which you are trying to shove back into the input stream.
Both of these problems can be solved by using conduitParserEither from Data.Conduit.Attoparsec with Data.Attoparsec.Text.decimal. Note that it is not sufficient to just parse decimal; you will also need to handle some kind of separator between decimals.
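For instance, a minimal sketch of such a parser, assuming the decimals are separated by whitespace (the name decimalAndSeparator is mine):
import qualified Data.Attoparsec.Text as AT

-- one decimal followed by its separator (assumed here to be whitespace)
decimalAndSeparator :: AT.Parser Integer
decimalAndSeparator = AT.decimal <* AT.skipSpace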
It is also not possible to splice a Source from CL.map, since CL.map's type signature is
map :: Monad m => (a -> b) -> Conduit a m b
The function you pass to map gets an opportunity to transform each input a into a single output b, not a stream of b's. To do that, you can use awaitForever, but you'll need to transform your Source into a general Producer with toProducer in order for the types to match.
However, in your code you are trying to send parse errors downstream as ByteString's but the output of mySource as Int's, which is a type error. You must provide a stream of ByteString in both cases; the successful parse case can return a Conduit made by fusing other Conduit's, as long as it ends up with an output of ByteString:
...
$= (let f (Left err)     = yield $ S8.pack $ "Error: " ++ show err
        f (Right (_, i)) = toProducer (mySource i) $= someOtherConduit
    in awaitForever f)
where someOtherConduit sinks the Int's from mySource, and sources ByteString's.
someOtherConduit :: Monad m => Conduit Int m ByteString
Finally, I believe you meant to connect the snk at the end of the pipe instead of the src.
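Putting these pieces together, the assembled pipeline could look roughly like this. This is only a sketch under the assumptions above: decimalAndSeparator is the hypothetical parser from earlier, conduitParserEither comes from Data.Conduit.Attoparsec, and someOtherConduit is still to be written:
myApp :: Application
myApp src snk =
    src $= CT.decode CT.ascii
        $= conduitParserEither decimalAndSeparator
        $= (let f (Left err)     = yield $ S8.pack $ "Error: " ++ show err
                f (Right (_, i)) = toProducer (mySource i) $= someOtherConduit
            in awaitForever f)
        $$ snk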
I'm using the streaming-utils package to stream an HTTP response body. I want to track the progress, similar to what bytestring-progress allows with lazy ByteStrings. I suspect something like toChunks would be necessary, then keeping a running total of bytes read while returning the original stream unmodified. But I cannot figure it out, and the streaming documentation is very unhelpful, mostly full of grandiose comparisons to alternative libraries.
Here's some code with my best effort so far. It doesn't include the counting yet, and just tries to print the size of chunks as they stream past (and doesn't compile).
download :: ByteString -> FilePath -> IO ()
download i file = do
    req <- parseRequest . C.unpack $ i
    m <- newHttpClientManager
    runResourceT $ do
        resp <- http req m
        lift . traceIO $ "downloading " <> file
        let body = SBS.fromChunks $ mapsM step $ SBS.toChunks $ responseBody resp
        SBS.writeFile file body

step bs = do
    traceIO $ "got " <> show (C.length bs) <> " bytes"
    return bs
What we want is to traverse the Stream (Of ByteString) IO () in two ways:
One that accumulates the incoming lengths of the ByteStrings and prints updates to console.
One that writes the stream to a file.
We can do that with the help of the copy function, which has type:
copy :: Monad m => Stream (Of a) m r -> Stream (Of a) (Stream (Of a) m) r
copy takes a stream and duplicates it into two different monadic layers, where each element of the original stream is emitted by both layers of the new dissociated stream.
(Notice that we are changing the base monad, not the functor. What changing the functor to another Stream does is to delimit groups in a single stream, and we aren't interested in that here.)
The following function takes a stream, copies it, accumulates the length of incoming strings with S.scan, prints them, and returns another stream that you can still work with, for example writing it to a file:
{-# LANGUAGE OverloadedStrings #-}
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString as B

track :: Stream (Of B.ByteString) IO r -> Stream (Of B.ByteString) IO r
track stream =
      S.mapM_ (liftIO . print) -- brings us back to the base monad, here another stream
    . S.scan (\s b -> s + B.length b) (0 :: Int) id
    $ S.copy stream
This will print the ByteStrings along with the accumulated lengths:
main :: IO ()
main = S.mapM_ B.putStr . track $ S.each ["aa","bb","c"]
I've been trying to use the Conduit library to do some simple I/O involving files, but I'm having a hard time.
I have a text file containing nothing but a few digits such as 1234. I have a function that reads the file using readFile (no conduits), and returns Maybe Int (Nothing is returned when the file actually doesn't exist). I'm trying to write a version of this function that uses conduits, and I just can't figure it out.
Here is what I have:
import Control.Monad.Trans.Resource
import Data.Conduit
import Data.Functor
import System.Directory
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
myFile :: FilePath
myFile = "numberFile"
withoutConduit :: IO (Maybe Int)
withoutConduit = do
    doesExist <- doesFileExist myFile
    if doesExist
        then Just . read <$> readFile myFile
        else return Nothing

withConduit :: IO (Maybe Int)
withConduit = do
    doesExist <- doesFileExist myFile
    if doesExist
        then runResourceT $ source $$ conduit =$ sink
        else return Nothing
  where
    source :: Source (ResourceT IO) B.ByteString
    source = CB.sourceFile myFile

    conduit :: Conduit B.ByteString (ResourceT IO) T.Text
    conduit = CT.decodeUtf8

    sink :: Sink T.Text (ResourceT IO) (Maybe Int)
    sink = awaitForever $ \txt -> let num = read . T.unpack $ txt :: Int
                                  in -- I don't know what to do here...
Could someone please help me complete the sink function?
Thanks!
This isn't really a good example of where conduit actually provides a lot of value, at least not the way you're looking at it right now. Specifically, you're trying to use the read function, which requires that the entire value be in memory. Additionally, your current error handling behavior is a bit loose: essentially, you're just going to get a "read: no parse" error if there's anything unexpected in the content.
However, there is a way we can play with this in conduit and have it be meaningful: by parsing the ByteString byte by byte ourselves and avoiding the read function. Fortunately, this pattern falls into a standard left fold, which the conduit-combinators package provides a perfect function for (an element-wise left fold in a conduit, aka foldlCE):
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Word8
import qualified Data.ByteString as S

sinkInt :: Monad m => Consumer S.ByteString m Int
sinkInt =
    foldlCE go 0
  where
    go total w
        | _0 <= w && w <= _9 =
            total * 10 + (fromIntegral $ w - _0)
        | otherwise = error $ "Invalid byte: " ++ show w

main :: IO ()
main = do
    x <- yieldMany ["1234", "5678"] $$ sinkInt
    print x
There are plenty of caveats that go along with this: it will simply throw an exception if there are unexpected bytes, and it doesn't handle integer overflow at all (though fixing that is just a matter of replacing Int with Integer). It's important to note that, since the in-memory string representation of a valid 32- or 64-bit int is always going to be tiny, conduit is overkill for this problem, though I hope that this code gives some guidance on how to generally write conduit code.
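If you still want the Maybe Int interface from the question, a sketch of wrapping sinkInt in the original existence check might look like this (sourceFile and runResourceT are re-exported by the Conduit module; myFile and doesFileExist are as in the question):
withConduit :: IO (Maybe Int)
withConduit = do
    doesExist <- doesFileExist myFile
    if doesExist
        then Just <$> runResourceT (sourceFile myFile $$ sinkInt)
        else return Nothing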
I am building some moderately large DIMACS files; however, with the method used below the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run into out-of-memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List

main = printDIMACS "test.cnf" test

test = do
    xs <- freshs 100000
    forM_ (zip xs (tail xs))
          (\(x,y) -> addAll [[negate x, negate y],[x,y]])

type Var = Int
type Clause = [Var]

data DIMACSS = DS{
    nextFresh :: Int,
    numClauses :: Int
} deriving (Show)

type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a

freshs :: Int -> DIMACSM [Var]
freshs i = do
    next <- gets nextFresh
    let toRet = [next..next+i-1]
    modify (\s -> s{nextFresh = next+i})
    return toRet

fresh :: DIMACSM Int
fresh = do
    i <- gets nextFresh
    modify (\s -> s{nextFresh = i+1})
    return i

addAll :: [Clause] -> DIMACSM ()
addAll c = do
    tell
        (B.concat .
         intersperse (B.pack " 0\n") .
         map (B.unwords . map BS.show) $ c)
    tell (B.pack " 0\n")
    modify (\s -> s{numClauses = numClauses s + length c})

add h = addAll [h]

printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
    writeFile file ""
    appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
    B.appendFile file b
  where
    (s,b) = runWriter (execStateT f (DS 1 0))
    i = nextFresh s - 1
    j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them as such, either using laziness or a more explicit approach such as conduits, enumerators, pipes or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: Run it twice, first throwing away the clauses and just calculating the numbers, print the header line, run the generator again; now printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
Also, in your code, addAll is not lazy enough: The list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then addAll = mapM_ add.
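A sketch of that refactoring, reusing the types from the question; each clause is emitted as soon as it is built, so no list of clauses is retained:
-- the primitive operation: write one clause and bump the counter
add :: Clause -> DIMACSM ()
add c = do
    tell (B.unwords (map BS.show c))
    tell (B.pack " 0\n")
    modify (\s -> s{numClauses = numClauses s + 1})

addAll :: [Clause] -> DIMACSM ()
addAll = mapM_ add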
As explained in Joachim Breitner's answer, the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads were used and because the numbers of variables and clauses are needed before the ByteString can be written to the file. The solution is to use the lazy versions of the monads and execute them twice. It turns out that it is also necessary to have WriterT be the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...
type DIMACSM a = WriterT B.ByteString (State DIMACSS) a
...
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
    writeFile file ""
    appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
    B.appendFile file b
  where
    s = execState (execWriterT f) (DS 1 0)
    b = evalState (execWriterT f) (DS 1 0)
    i = nextFresh s - 1
    j = numClauses s
I have the following monad transformer:
newtype Pdf' m a = Pdf' {
    unPdf' :: StateT St (Iteratee ByteString m) a
}

type Pdf m = ErrorT String (Pdf' m)
Basically, it uses an underlying Iteratee that reads and processes a pdf document (it requires a random-access source, so that it will not keep the whole document in memory all the time).
I need to implement a function that saves the pdf document, and I want it to be lazy: it should be possible to save the document in constant memory.
I can produce a lazy ByteString:
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as BS

save :: Monad m => Pdf m ByteString
save = do
    -- actually it is a loop
    str1 <- serializeTheFirstObject
    storeOffsetForTheFirstObject (BS.length str1)
    str2 <- serializeTheSecondObject
    storeOffsetForTheSecondObject (BS.length str2)
    ...
    strn <- serializeTheNthObject
    storeOffsetForTheNthObject (BS.length strn)
    table <- dumpRefTable
    return $ mconcat [str1, str2, ..., strn] `mappend` table
But the actual output can depend on previous output. (Details: a pdf document contains a so-called "reference table" with the absolute offset in bytes of every object inside the document. It definitely depends on the length of the ByteString each pdf object is serialized to.)
How do I ensure that the save function will not force the entire ByteString before returning it to the caller?
Is it better to take a callback as an argument and call it every time I have something to output?
import Data.ByteString (ByteString)
save :: Monad m => (ByteString -> Pdf m ()) -> Pdf m ()
Is there a better solution?
To build this in one pass you will need to store (perhaps in the state) where your indirect objects have been written. So the save needs to keep track of the absolute byte position as it works -- I have not considered whether your Pdf monad is suitable for this task. When you get to the end you can use the addresses stored in the state to create the xref section.
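A sketch of what that bookkeeping might look like; all the names here are hypothetical, and the point is only that every emitted chunk advances a running absolute offset kept in the state:
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

data SaveState = SaveState
    { currentOffset :: Int64          -- absolute byte position so far
    , objectOffsets :: [(Int, Int64)] -- (object number, offset) pairs for the xref section
    }

-- record where an object starts, then advance the offset past its bytes
recordObject :: Int -> BL.ByteString -> SaveState -> SaveState
recordObject objNum str st = st
    { currentOffset = currentOffset st + BL.length str
    , objectOffsets = (objNum, currentOffset st) : objectOffsets st
    }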
I do not think a two-pass algorithm will help.
Edit June 6th: Perhaps I understand your desire better now. For very fast generation of documents, e.g. HTML, there are several libraries on hackage with "blaze" in the name. The technique is to avoid using 'mconcat' on the ByteString and use it on an intermediate 'builder' type. The core library for this seems to be 'blaze-builder', which is used in 'blaze-html' and 'blaze-textual'.
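As a rough illustration of the idea (a sketch against the blaze-builder API; render is a hypothetical name):
import Blaze.ByteString.Builder (fromByteString, toLazyByteString)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- Builders concatenate cheaply; the bytes are copied only once,
-- when the final lazy ByteString is rendered.
render :: [S.ByteString] -> L.ByteString
render = toLazyByteString . mconcat . map fromByteString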
The solution I found so far is Coroutine
Example:
import Control.Monad.Coroutine
import Control.Monad.Coroutine.SuspensionFunctors

proc :: Int -> Coroutine (Yield String) IO ()
proc 0 = return ()
proc i = suspend $ Yield "Hello World\n" (proc $ i - 1)

main :: IO ()
main = go (proc 10)
  where
    go cr = do
        r <- resume cr
        case r of
            Right () -> return ()
            Left (Yield str cont) -> do
                putStr str
                go cont
It does the same work as a callback, but the caller has full control over output generation.
Just to learn a bit about Iteratees, I wanted to reimplement a simple parser I made, using Data.Iteratee and Data.Attoparsec.Iteratee. I'm pretty much stumped, though. Below I have a simple example that is able to parse one line from a file. My parser reads one line at a time, so I need a way of feeding lines to the iteratee until it's done. I've read all I've found googling this, but a lot of the material on iteratees/enumerators is pretty advanced. This is the part of the code that matters:
-- There are more imports above.
import Data.Attoparsec.Iteratee
import Data.Iteratee (joinI, run)
import Data.Iteratee.IO (defaultBufSize, enumFile)

line :: Parser ByteString -- implementation left out (it doesn't check for newline)

iter = parserToIteratee line

main = do
    p <- liftM head getArgs
    i <- enumFile defaultBufSize p $ iter
    i' <- run i
    print i'
This example will parse and print one line from a file with multiple lines. The original script mapped the parser over a list of ByteStrings, so I would like to do the same thing here. I found enumLines in Iteratee, but I can't for the life of me figure out how to use it. Maybe I misunderstand its purpose?
Since your parser works on a line at a time, you don't even need to use attoparsec-iteratee. I would write this as:
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Attoparsec as A

parser :: Parser ParseOutput

type POut = Either String ParseOutput

processLines :: Iteratee ByteString IO [POut]
processLines = joinI $ (enumLinesBS ><> I.mapStream (A.parseOnly parser)) stream2list
The key to understanding this is the "enumeratee", which is just the iteratee term for a stream converter. It takes a stream processor (iteratee) of one stream type and converts it to work with another stream. Both enumLinesBS and mapStream are enumeratees.
To map your parser over multiple lines, mapStream is sufficient:
i1 :: Iteratee [ByteString] IO (Iteratee [POut] IO [POut])
i1 = mapStream (A.parseOnly parser) stream2list
The nested iteratees just mean that this converts a stream of [ByteString] to a stream of [POut], and when the final iteratee (stream2list) is run it returns that stream as [POut]. So now you just need the iteratee equivalent of lines to create that stream of [ByteString], which is what enumLinesBS does:
i2 :: Iteratee ByteString IO (Iteratee [ByteString] IO (Iteratee [POut] IO [POut]))
i2 = enumLinesBS $ mapStream (A.parseOnly parser) stream2list
But this function is pretty unwieldy to use because of all the nesting. What we really want is a way to pipe output directly between stream converters, and at the end simplify everything to a single iteratee. To do this we use joinI and (><>):
e1 :: Iteratee [POut] IO a -> Iteratee ByteString IO (Iteratee [POut] IO a)
e1 = enumLinesBS ><> mapStream (A.parseOnly parser)
i' :: Iteratee ByteString IO [POut]
i' = joinI $ e1 stream2list
which is equivalent to how I wrote it above, with e1 inlined.
There's still an important element remaining, though. This function simply returns the parse results in a list. Typically you would want to do something else, such as combining the results with a fold.
Edit: Data.Iteratee.ListLike.mapM_ is often useful to create consumers. At that point each element of the stream is a parse result, so if you want to print them you can use:
consumeParse :: Iteratee [POut] IO ()
consumeParse = I.mapM_ (either (\e -> return ()) print)
processLines2 :: Iteratee ByteString IO ()
processLines2 = joinI $ (enumLinesBS ><> I.mapStream (A.parseOnly parser)) consumeParse
This will print just the successful parses. You could easily report errors to STDERR, or handle them in other ways, as well.
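For instance, a small variant that reports failures on stderr instead of dropping them (a sketch; it assumes the parse errors are the plain Strings from the POut type above):
import System.IO (hPutStrLn, stderr)

consumeParse' :: Iteratee [POut] IO ()
consumeParse' = I.mapM_ (either (hPutStrLn stderr) print)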