I am a Haskell beginner and still learning about monad transformers.
I am trying to use the streaming-bytestring library to read a binary file, process chunks of bytes, and print the result as each chunk is processed. I believe this is the popular streaming library that provides an alternative to lazy bytestrings. It appears the authors copy-pasted the lazy bytestring documentation and added some arbitrary examples.
The examples mention runResourceT without any discussion of what it is or how to use it. It appears that I should wrap any streaming-bytestring action in runResourceT. That's fine, but what if I'm reading an infinite stream, processing chunks and printing them as I go? Should I call runResourceT every time I want to process a chunk?
My code is something like this:
import qualified Data.ByteString.Streaming as BSS
import System.TimeIt
main = timeIt $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
and I'm unsure of how to organize processByteChunks as a recursive function that iterates through the binary file.
If I call runResourceT only once, it would read the infinite file BEFORE printing, right? That seems bad.
main = timeIt $ runResourceT $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
The ResourceT monad just cleans up resources in a timely fashion when you're finished with them. In this case, it will ensure the file handle opened by BSS.readFile is closed when the stream is consumed. (Unless the stream truly is infinite, in which case I guess it won't.)
In your application, you only want to call it once, since you don't want the file closed until you've read all the chunks. Don't worry -- it has nothing to do with the timing of output or anything like that.
Here's an example with a recursive processByteChunks that should work. It reads the file incrementally and produces output as each chunk is read:
import Control.Monad.IO.Class
import Control.Monad.Trans.Resource
import qualified Data.ByteString.Streaming as BSS
import qualified Data.ByteString as BS
import System.TimeIt
main :: IO ()
main = timeIt $ runResourceT $
    processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"

processByteChunks :: MonadIO m => BSS.ByteString m () -> m ()
processByteChunks = go 0 0
  where
    go len nulls stream = do
      m <- BSS.unconsChunk stream
      case m of
        Just (bs, stream') -> do
          let len' = len + BS.length bs
              nulls' = nulls + BS.length (BS.filter (== 0) bs)
          liftIO $ print $ "cumulative length=" ++ show len'
                        ++ ", nulls=" ++ show nulls'
          go len' nulls' stream'
        Nothing -> return ()
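If you want to convince yourself that output really appears as chunks are read, one easy variant (a sketch; it only assumes BSS.take from the same module, which caps the stream at a byte count) is to limit how much of the file you consume:
-- Hypothetical test variant of main: cap the stream at 10 MB so that even a
-- huge (or effectively infinite) input terminates while you watch the output.
main :: IO ()
main = timeIt $ runResourceT $
    processByteChunks $ BSS.take (10 * 1024 * 1024) $ BSS.drop 100 $ BSS.readFile "filename"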
Related
In essence, I want to know how to implement the functionality of the Linux command tail -F in Haskell. My goal is to follow a log file, such as a web server log, and compute various real-time statistics by parsing the input as it arrives, ideally without interruption if the log file is rotated by logrotate or a similar service.
I'm somewhat at a loss as to how to even approach the problem, and what I should take into consideration performance-wise in the presence of lazy I/O. Would any of the streaming libraries be relevant here?
This is a partial answer, as it doesn't handle file truncation by logrotate. It avoids lazy I/O and uses the bytestring, streaming, streaming-bytestring and hinotify packages.
Some preliminary imports:
{-# language OverloadedStrings #-}
module Main where
import qualified Data.ByteString
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import qualified Data.ByteString.Streaming as B
import Streaming
import qualified Streaming.Prelude as S
import Control.Concurrent.QSem
import System.INotify
import System.IO (withFile,IOMode(ReadMode))
import System.Environment (getArgs)
Here's the "tailing" function:
tailing :: FilePath -> (B.ByteString IO () -> IO r) -> IO r
tailing filepath continuation = withINotify $ \i -> do
    sem <- newQSem 1
    addWatch i [Modify] filepath (\_ -> signalQSem sem)
    withFile filepath ReadMode (\h -> continuation (handleToStream sem h))
  where
    handleToStream sem h = B.concat . Streaming.repeats $ do
        lift (waitQSem sem)
        readWithoutClosing h
    -- Can't use B.fromHandle here because annoyingly it closes the handle on EOF
    -- instead of just returning, and this causes problems on new appends.
    readWithoutClosing h = do
        c <- lift (Data.ByteString.hGetSome h defaultChunkSize)
        if Data.ByteString.null c
            then return ()
            else do B.chunk c
                    readWithoutClosing h
It takes a file path and a callback that consumes a streaming bytestring.
The idea is that, before each pass of reading from the handle up to EOF, we decrement a semaphore that is only incremented by the inotify callback, which fires whenever the file is modified.
We can test the function like this:
main :: IO ()
main = do
    filepath : _ <- getArgs
    tailing filepath B.stdout
I wrote the code below to simulate an upload to S3 from a lazy ByteString (in production the data will arrive over a network socket; here we simulate it by reading a file of about 100MB). The problem is that the code seems to force the entire file into memory instead of chunking it (cbytes) - I'd appreciate pointers on why the chunking is not working:
{-# LANGUAGE OverloadedStrings #-}
import Control.Lens
import Network.AWS
import Network.AWS.S3
import Network.AWS.Data.Body
import System.IO
import Data.Conduit (($$+-))
import Data.Conduit.Binary (sinkLbs,sourceLbs)
import qualified Data.Conduit.List as CL (mapM_)
import Network.HTTP.Conduit (responseBody,RequestBody(..),newManager,tlsManagerSettings)
import qualified Data.ByteString.Lazy as LBS
example :: IO PutObjectResponse
example = do
    -- To specify configuration preferences, newEnv is used to create a new Env. The Region denotes the AWS
    -- region requests will be performed against, and Credentials is used to specify the desired mechanism
    -- for supplying or retrieving AuthN/AuthZ information. In this case, Discover will cause the library to
    -- try a number of options such as default environment variables, or an instance's IAM Profile:
    e <- newEnv NorthVirginia Discover
    -- A new Logger to replace the default noop logger is created, with the logger set to print debug
    -- information and errors to stdout:
    l <- newLogger Debug stdout
    -- The payload for the S3 object is read from a file that simulates a lazy bytestring received over the
    -- network:
    inb <- LBS.readFile "out"
    lenb <- System.IO.withFile "out" ReadMode hFileSize  -- evaluates to 104857600 (100MB)
    let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (sourceLbs inb)
    -- We now run the AWS computation with the overridden logger, performing the PutObject request:
    runResourceT . runAWS (e & envLogger .~ l) $
        send ((putObject "yourtestenv-change-it-please" "testbucket/test" cbytes)
                & poContentType .~ Just "text; charset=UTF-8")

main = example >> return ()
Running the executable with the +RTS -s option shows that the entire file is read into memory (~113MB maximum residency - I did see ~87MB once). On the other hand, if I use chunkedFile, it is chunked correctly (~10MB maximum residency).
It's clear this bit
inb <- LBS.readFile "out"
lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (sourceLbs inb)
should be rewritten as
lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (C.sourceFile "out")
As you wrote it, the purpose of conduits is defeated. The entire file would need to be accumulated by LBS.readFile and then broken apart chunk by chunk when fed to sourceLbs. (If lazy IO is working right, this might not happen.) sourceFile, by contrast, reads the file incrementally, chunk by chunk. It may be that, e.g., toBody accumulates the whole file, in which case the point of conduits would be defeated at a different point; glancing at the source for send and so on, though, I can't see anything that would do this.
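To make that concrete, here is a minimal sketch of the whole example with that change applied (it assumes a qualified import of Data.Conduit.Binary as C from conduit-extra; everything else is unchanged from the question):
import qualified Data.Conduit.Binary as C  -- for C.sourceFile

example :: IO PutObjectResponse
example = do
    e <- newEnv NorthVirginia Discover
    l <- newLogger Debug stdout
    -- Only the length is read eagerly; the payload itself is streamed on demand.
    lenb <- System.IO.withFile "out" ReadMode hFileSize
    let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (C.sourceFile "out")
    runResourceT . runAWS (e & envLogger .~ l) $
        send ((putObject "yourtestenv-change-it-please" "testbucket/test" cbytes)
                & poContentType .~ Just "text; charset=UTF-8")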
I am not sure, but I think the culprit is LBS.readFile; its documentation says:
readFile :: FilePath -> IO ByteString
Read an entire file lazily into a ByteString.
The Handle will be held open until EOF is encountered.
chunkedFile works the conduit way - alternatively, you could use
sourceFile :: MonadResource m => FilePath -> Producer m ByteString
from Data.Conduit.Binary (in the conduit-extra package) instead of LBS.readFile, but I am not an expert.
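Since you report that chunkedFile already gives the right residency, the simplest change may be to build the request body straight from the file path. A rough sketch (I'm assuming chunkedFile from Network.AWS takes a chunk size and a path; please check the haddocks of your amazonka version for the exact signature):
    -- Hypothetical replacement for the LBS.readFile / hFileSize / ChunkedBody lines:
    cbytes <- chunkedFile (1024 * 128) "out"
    runResourceT . runAWS (e & envLogger .~ l) $
        send ((putObject "yourtestenv-change-it-please" "testbucket/test" cbytes)
                & poContentType .~ Just "text; charset=UTF-8")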
How does one write their own streaming code? I was generating about 1,000,000,000 random pairs of war decks, and I wanted them to be lazily streamed into a foldl', but I got a space leak! Here is the relevant section of code:
main = do
    games <- replicateM 1000000000 $ deal <$> sDeck  -- Would be a trillion, but Int only goes so high
    let res = experiment Ace games  -- experiment is a foldl'
    print res  -- res is tiny
When I run it with -O2, it first starts freezing up my computer, and then the program dies and the computer comes back to life (and Google Chrome then has the resources it needs to yell at me for using up all its resources.)
Note: I tried unsafeInterleaveIO, and it didn't work.
Full code is at: http://lpaste.net/109977
replicateM doesn't do lazy streaming. If you need to stream results from monadic actions, you should use a library such as conduit or pipes.
Your example code could be written to support streaming with conduits like this:
import Data.Conduit
import qualified Data.Conduit.Combinators as C
main = do
    let games = C.replicateM 1000000 $ deal <$> sDeck
    res <- games $$ C.foldl step Ace
    -- where step is the function you want to fold with
    print res
The Data.Conduit.Combinators module is from the conduit-combinators package.
As a quick-and-dirty solution you could implement a streaming version of replicateM using lazy IO.
import System.IO.Unsafe
lazyReplicateIO :: Integer -> IO a -> IO [a] --Using Integer so I can make a trillion copies
lazyReplicateIO 0 _ = return []
lazyReplicateIO n act = do
    a <- act
    rest <- unsafeInterleaveIO $ lazyReplicateIO (n-1) act
    return $ a : rest
But I recommend using a proper streaming library.
The equivalent pipes solution is:
import Pipes
import qualified Pipes.Prelude as Pipes
-- Assuming the following types
action :: IO A
acc :: S
step :: S -> A -> S
done :: S -> B
main = do
    b <- Pipes.fold step acc done (Pipes.replicateM 1000000 action)
    print (b :: B)
Supposing I have a module like this:
module Explosion where
import Pipes.Parse (foldAll, Parser, Producer)
import Pipes.ByteString (ByteString, fromLazy)
import Pipes.Aeson (DecodingError)
import Pipes.Aeson.Unchecked (decoded)
import Data.List (intercalate)
import Data.ByteString.Lazy.Char8 (pack)
import Lens.Family (view)
import Lens.Family.State.Strict (zoom)
produceString :: Producer ByteString IO ()
produceString = fromLazy $ pack $ intercalate " " $ map show [1..1000000]
produceInts :: Producer Int IO (Either (DecodingError, Producer ByteString IO ()) ())
produceInts = view decoded produceString
produceInts' :: Producer Int IO ()
produceInts' = produceInts >> return ()
parseBiggest :: Parser ByteString IO Int
parseBiggest = zoom decoded (foldAll max 0 id)
The 'produceString' function is a bytestring producer, and I am concerned with folding a parse over it to produce some kind of result.
The following two programs show different ways of tackling the problem of finding the maximum value in the bytestring by parsing it as a series of JSON ints.
Program 1:
module Main where
import Explosion (produceInts')
import Pipes.Prelude (fold)
main :: IO ()
main = do
    biggest <- fold max 0 id produceInts'
    print $ show biggest
Program 2:
module Main where
import Explosion (parseBiggest, produceString)
import Pipes.Parse (evalStateT)
main :: IO ()
main = do
    biggest <- evalStateT parseBiggest produceString
    print $ show biggest
Unfortunately, both programs eat about 200MB of memory total when I profile them, a problem I'd hoped the use of streaming parsers would solve. The first program spends most of its time and memory (> 70%) in (^.) from Lens.Family, while the second spends it in fmap, called by zoom from Lens.Family.State.Strict. The usage graphs are below. Both programs spend about 70% of their time doing garbage collection.
Am I doing something wrong? Is the Prelude function max not strict enough? I can't tell if the library functions are bad, or if I'm using the library wrong! (It's probably the latter.)
For completeness, here's a git repo that you can clone and run cabal install in if you'd like to see what I'm talking about first-hand, and here's the memory usage of the two programs:
Wrapping a strict bytestring in a single yield doesn't make it lazy. You have to yield smaller chunks to get any streaming behavior.
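As a toy illustration (this is a made-up producer, not code from your module): yielding one small chunk per element lets downstream consume and discard each piece, whereas a single yield of the fully built bytestring hands the consumer everything at once:
import Pipes
import qualified Data.ByteString.Char8 as BC

-- Sketch: a producer of space-separated numbers that emits one small strict
-- chunk per number instead of one big packed ByteString.
numbersProducer :: Monad m => Int -> Producer BC.ByteString m ()
numbersProducer n = mapM_ (\i -> yield (BC.pack (show i ++ " "))) [1 .. n]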
Edit: I found the error. pipes-aeson internally uses a consecutively function defined like this:
consecutively parser = step where
    step p0 = do
      (mr, p1) <- lift $
        S.runStateT atEndOfBytes (p0 >-> PB.dropWhile B.isSpaceWord8)
      case mr of
        Just r  -> return (Right r)
        Nothing -> do
          (ea, p2) <- lift (S.runStateT parser p1)
          case ea of
            Left  e -> return (Left (e, p2))
            Right a -> yield a >> step p2
The problematic line is the one with PB.dropWhile. It adds a blow-up that is quadratic in the number of parsed elements.
What happens is that the pipe that is threaded through this computation accumulates a new cat pipe downstream of it after each parse. So after N parses you get N cat pipes, which adds O(N) overhead to each parsed element.
I've created a GitHub issue to fix this. pipes-aeson is maintained by Renzo, and he has fixed this kind of issue before.
Edit: I've submitted a pull request to fix a second problem (you needed to use intercalate for lazy bytestrings). With both fixes, the program runs in 5 KB of constant space for both versions.
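I haven't reproduced the pull request here, but the spirit of that second fix is roughly the following sketch: build produceString with the lazy-bytestring intercalate, so fromLazy yields many small chunks rather than going through one huge intermediate String:
import qualified Data.ByteString.Lazy.Char8 as BL

-- Sketch: each number becomes its own small lazy chunk, which fromLazy then
-- yields downstream one at a time.
produceString :: Producer ByteString IO ()
produceString = fromLazy $ BL.intercalate (BL.pack " ") $ map (BL.pack . show) [1..1000000]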
Earlier today I wrote a small test app for iteratees that composed an iteratee for writing progress with an iteratee for actually copying data. I wound up with values like these:
-- NOTE: this snippet is with iteratees-0.8.5.0
-- side effect: display progress on stdout
displayProgress :: Iteratee ByteString IO ()
-- side effect: copy the bytestrings of Iteratee to Handle
fileSink :: Handle -> Iteratee ByteString IO ()
writeAndDisplayProgress :: Handle -> Iteratee ByteString IO ()
writeAndDisplayProgress handle = sequence_ [fileSink handle, displayProgress]
In looking at the enumerator library, I don't see an analog of sequence_ or enumWith. All I want to do is compose two iteratees so they act as one. I could discard the result (it's going to be () anyway) or keep it, I don't care. (&&&) from Control.Arrow is what I want, only for iteratees rather than arrows.
I tried these two options:
-- NOTE: this snippet is with enumerator-0.4.10
run_ $ enumFile source $$ sequence_ [iterHandle handle, displayProgress]
run_ $ enumFile source $$ sequence_ [displayProgress, iterHandle handle]
The first one copies the file but doesn't show progress; the second shows progress but doesn't copy the file. Evidently the built-in sequence_ runs the first iteratee until it terminates and then runs the other, which is not what I want: I want the iteratees to run in parallel rather than serially. I feel like I'm missing something obvious, but reading the wc example for the enumerator library, I see this curious comment:
-- Exactly matching wc's output is too annoying, so this example
-- will just print one line per file, and support counting at most
-- one statistic per run
I wonder if this remark indicates that combining or composing iteratees within the enumerator framework isn't possible out of the box. What's the generally accepted right way to do this?
Edit:
It seems as though there is no built-in way to do this. There's discussion on the Haskell mailing list about adding combinators like enumSequence and manyToOne but so far, there doesn't seem to be anything actually in the enumerator package that furnishes this capability.
It seems to me that, rather than trying to have two Iteratees consume the stream in parallel, it would be better to feed the stream through an identity Enumeratee that simply counts the bytes passing through it.
Here's a simple example that copies a file and prints the number of bytes copied after each chunk.
import System.Environment
import System.IO
import Data.Enumerator
import Data.Enumerator.Binary (enumFile, iterHandle)
import Data.Enumerator.List (mapAccumM)
import qualified Data.ByteString as B
printBytes :: Enumeratee B.ByteString B.ByteString IO ()
printBytes = flip mapAccumM 0 $ \total bytes -> do
    let total' = total + B.length bytes
    print total'
    return (total', bytes)

copyFile s t = withBinaryFile t WriteMode $ \h -> do
    run_ $ (enumFile s $= printBytes) $$ iterHandle h

main = do
    [source, target] <- getArgs
    copyFile source target