How to make chunking work with amazonka, conduit and lazy bytestring - haskell

I wrote the code below to simulate uploading to S3 from a lazy ByteString (which will be received over a network socket; here, we simulate it by reading from a file of size ~100MB). The problem with the code below is that it seems to force the entire file into memory instead of chunking it (cbytes). I would appreciate pointers on why the chunking is not working:
import Control.Lens
import Network.AWS
import Network.AWS.S3
import Network.AWS.Data.Body
import System.IO
import Data.Conduit (($$+-))
import Data.Conduit.Binary (sinkLbs,sourceLbs)
import qualified Data.Conduit.List as CL (mapM_)
import Network.HTTP.Conduit (responseBody,RequestBody(..),newManager,tlsManagerSettings)
import qualified Data.ByteString.Lazy as LBS
example :: IO PutObjectResponse
example = do
    -- To specify configuration preferences, newEnv is used to create a new Env. The Region denotes the AWS region requests will be
    -- performed against, and Credentials is used to specify the desired mechanism for supplying or retrieving AuthN/AuthZ information.
    -- In this case, Discover will cause the library to try a number of options such as default environment variables, or an instance's IAM Profile:
    e <- newEnv NorthVirginia Discover
    -- A new Logger to replace the default noop logger is created, with the logger set to print debug information and errors to stdout:
    l <- newLogger Debug stdout
    -- The payload for the S3 object is retrieved from a file that simulates a lazy bytestring received over the network:
    inb <- LBS.readFile "out"
    lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
    let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (sourceLbs inb)
    -- We now run the AWS computation with the overridden logger, performing the PutObject request:
    runResourceT . runAWS (e & envLogger .~ l) $
        send ((putObject "yourtestenv-change-it-please" "testbucket/test" cbytes) & poContentType .~ Just "text; charset=UTF-8")

main = example >> return ()
Running the executable with the +RTS -s option shows that the entire file is read into memory (~113MB maximum residency; I did see ~87MB once). On the other hand, if I use chunkedFile, it is chunked correctly (~10MB maximum residency).

It's clear this bit
inb <- LBS.readFile "out"
lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (sourceLbs inb)
should be rewritten as
lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (C.sourceFile "out")
As you wrote it, the purpose of conduits is defeated. The entire file would need to be accumulated by LBS.readFile, then broken apart chunk by chunk when fed to sourceLbs. (If lazy IO is working right, this might not happen.) sourceFile reads the file incrementally, chunk by chunk. It may be that, e.g., toBody accumulates the whole file, in which case the point of conduits is defeated at a different point; glancing at the source for send and so on, though, I can't see anything that would do this.
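For concreteness, here is a minimal sketch of the whole corrected upload (assuming import qualified Data.Conduit.Binary as C, and otherwise the same imports, bucket and key as the question):
example :: IO PutObjectResponse
example = do
    e <- newEnv NorthVirginia Discover
    l <- newLogger Debug stdout
    lenb <- System.IO.withFile "out" ReadMode hFileSize
    -- C.sourceFile reads incrementally, so ChunkedBody streams the file in 128KB chunks:
    let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (C.sourceFile "out")
    runResourceT . runAWS (e & envLogger .~ l) $
        send (putObject "yourtestenv-change-it-please" "testbucket/test" cbytes
                & poContentType .~ Just "text; charset=UTF-8")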

I am not sure, but I think the culprit is LBS.readFile; its documentation says:
readFile :: FilePath -> IO ByteString
Read an entire file lazily into a ByteString.
The Handle will be held open until EOF is encountered.
chunkedFile works the conduit way; alternatively you could use
sourceFile :: MonadResource m => FilePath -> Producer m ByteString
from conduit-extra's Data.Conduit.Binary instead of LBS.readFile, but I am not an expert.
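As a rough sketch of the chunkedFile route (hedged: this assumes amazonka 1.x, where chunkedFile and defaultChunkSize are exported via Network.AWS.Data.Body, and reuses the bucket and key from the question):
-- Let amazonka stream the file itself; defaultChunkSize is 128 KB.
example :: IO PutObjectResponse
example = do
    e <- newEnv NorthVirginia Discover
    b <- chunkedFile defaultChunkSize "out"
    runResourceT . runAWS e $
        send (putObject "yourtestenv-change-it-please" "testbucket/test" b)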

Related

When to call runResourceT on streaming-bytestring?

I am a Haskell beginner and still learning about monad transformers.
I am trying to use the streaming-bytestring library to read a binary file, process chunks of bytes, and print the result as each chunk is processed. I believe this is the popular streaming library that provides an alternative to lazy bytestrings. It appears the authors copy-pasted the lazy bytestring documentation and added some arbitrary examples.
The examples mention runResourceT without going into any discussion of what it is or how to use it. It appears that I should use runResourceT on any streaming-bytestring function that performs an action. That's fine, but what if I'm reading an infinite stream that processes chunks and prints them? Should I call runResourceT every time I want to process a chunk?
My code is something like this:
import qualified Data.ByteString.Streaming as BSS
import System.TimeIt
main = timeIt $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
and I'm unsure of how to organize processByteChunks as a recursive function that iterates through the binary file.
If I call runResourceT only once, it would read the infinite file BEFORE printing, right? That seems bad.
main = timeIt $ runResourceT $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
The ResourceT monad just cleans up resources in a timely fashion when you're finished with them. In this case, it will ensure the file handle opened by BSS.readFile is closed when the stream is consumed. (Unless the stream truly is infinite, in which case I guess it won't.)
In your application, you only want to call it once, since you don't want the file closed until you've read all the chunks. Don't worry -- it has nothing to do with the timing of output or anything like that.
Here's an example with a recursive processByteChunks that should work. It will read lazily and generate output as chunks are lazily read:
import Control.Monad.IO.Class
import Control.Monad.Trans.Resource
import qualified Data.ByteString.Streaming as BSS
import qualified Data.ByteString as BS
import System.TimeIt
main :: IO ()
main = timeIt $ runResourceT $
    processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"

processByteChunks :: MonadIO m => BSS.ByteString m () -> m ()
processByteChunks = go 0 0
  where
    go len nulls stream = do
        m <- BSS.unconsChunk stream
        case m of
            Just (bs, stream') -> do
                let len' = len + BS.length bs
                    nulls' = nulls + BS.length (BS.filter (== 0) bs)
                liftIO $ print $ "cumulative length=" ++ show len'
                              ++ ", nulls=" ++ show nulls'
                go len' nulls' stream'
            Nothing -> return ()

How to (efficiently) follow / tail a file with Haskell, including detecting file rotation? (tail -F)

In essence I wish to know how to approach implementing the functionality of the Linux command tail -F in Haskell. My goal is to follow a log file, such as a web server log file, and compute various real-time statistics by parsing the input as it comes in, ideally with no interruptions if the log file is rotated by logrotate or a similar service.
I'm somewhat at a loss on how to even approach the problem, and what I should take into consideration in terms of performance in the presence of lazy I/O. Would any of the streaming libraries be relevant here?
This is a partial answer, as it doesn't handle file truncation by logrotate. It avoids lazy I/O and uses the bytestring, streaming, streaming-bytestring and hinotify packages.
Some preliminary imports:
{-# language OverloadedStrings #-}
module Main where
import qualified Data.ByteString
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import qualified Data.ByteString.Streaming as B
import Streaming
import qualified Streaming.Prelude as S
import Control.Concurrent.QSem
import System.INotify
import System.IO (withFile,IOMode(ReadMode))
import System.Environment (getArgs)
Here's the "tailing" function:
tailing :: FilePath -> (B.ByteString IO () -> IO r) -> IO r
tailing filepath continuation = withINotify $ \i -> do
    sem <- newQSem 1
    addWatch i [Modify] filepath (\_ -> signalQSem sem)
    withFile filepath ReadMode (\h -> continuation (handleToStream sem h))
  where
    handleToStream sem h = B.concat . Streaming.repeats $ do
        lift (waitQSem sem)
        readWithoutClosing h
    -- Can't use B.fromHandle here because annoyingly it closes handle on EOF
    -- instead of just returning, and this causes problems on new appends.
    readWithoutClosing h = do
        c <- lift (Data.ByteString.hGetSome h defaultChunkSize)
        if Data.ByteString.null c
            then return ()
            else do B.chunk c
                    readWithoutClosing h
It takes a file path and a callback that consumes a streaming bytestring.
The idea is that, each time before reading from the handle up to EOF, we wait on a semaphore, which is only signaled by the callback that is invoked whenever the file is modified.
We can test the function like this:
main :: IO ()
main = do
    filepath : _ <- getArgs
    tailing filepath B.stdout
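As a hedged, illustrative extension toward full tail -F semantics (mine, not part of the original answer; it additionally assumes import Data.IORef plus openFile/hClose from System.IO, and glosses over error handling and the race between rotation and the final reads of the old file): watch for MoveSelf and DeleteSelf as well, and reopen the path once the old handle is drained:
followRotating :: FilePath -> (Data.ByteString.ByteString -> IO ()) -> IO ()
followRotating path consume = withINotify $ \i -> do
    sem     <- newQSem 1
    rotated <- newIORef False
    _ <- addWatch i [Modify, MoveSelf, DeleteSelf] path $ \ev -> do
            case ev of
                MovedSelf {}   -> writeIORef rotated True
                DeletedSelf {} -> writeIORef rotated True
                _              -> return ()
            signalQSem sem
    let loop h = do
            waitQSem sem
            drain h                     -- read until EOF, handing chunks to consume
            wasRotated <- readIORef rotated
            if wasRotated
                then do writeIORef rotated False
                        hClose h
                        -- a real implementation would re-add the watch here,
                        -- since the old watch follows the rotated-away inode
                        openFile path ReadMode >>= loop
                else loop h
        drain h = do
            c <- Data.ByteString.hGetSome h defaultChunkSize
            if Data.ByteString.null c
                then return ()
                else consume c >> drain h
    openFile path ReadMode >>= loop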

Haskell read/write binary files complete working example

I wish someone would give complete working code that allows one to do the following in Haskell:
Read a very large sequence (more than 1 billion elements) of 32-bit int values from a binary file into an appropriate container (e.g. certainly not a list, for performance reasons), double each number if it's less than 1000 (decimal), and then write the resulting 32-bit int values to another binary file. I may not want to read the entire contents of the binary file into memory at once; I want to read one chunk after the previous.
I am confused because I could find very little documentation about this. Data.Binary, ByteString, Word8 and what not; it just adds to the confusion. There is a pretty straightforward solution to such problems in C/C++: take an array (e.g. of unsigned int) of the desired size, use the read/write library calls, and be done with it. In Haskell it didn't seem so easy, at least to me.
I'd appreciate if your solution uses the best possible standard packages that are available with mainstream Haskell (> GHC 7.10) and not some obscure/obsolete ones.
I read from these pages
https://wiki.haskell.org/Binary_IO
https://wiki.haskell.org/Dealing_with_binary_data
If you're doing binary I/O, you almost certainly want ByteString for the actual input/output part. Have a look at the hGet and hPut functions it provides. (Or, if you only need strictly linear access, you can try using lazy I/O, but it's easy to get that wrong.)
Of course, a byte string is just an array of bytes; your next problem is interpreting those bytes as characters / integers / doubles / whatever else they're supposed to be. There are a couple of packages for that, but Data.Binary seems to be the most mainstream one.
The documentation for binary seems to want to steer you towards using the Binary class, where you write code to serialise and deserialise whole objects. But you can use the functions in Data.Binary.Get and Data.Binary.Put to deal with individual items. There you will find functions such as getWord32be (get Word32 big-endian) and so forth.
I don't have time to write a working code example right now, but basically look at the functions I mention above and ignore everything else, and you should get some idea.
Now with working code:
module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import Data.Binary.Put
import Control.Monad
import System.IO

main = do
    h_in  <- openFile "Foo.bin" ReadMode
    h_out <- openFile "Bar.bin" WriteMode
    replicateM_ 1000 (process_chunk h_in h_out)
    hClose h_in
    hClose h_out

chunk_size = 1000
int_size = 4

process_chunk h_in h_out = do
    bin1 <- BIN.hGet h_in chunk_size
    let ints1 = runGet (replicateM (chunk_size `div` int_size) getWord32le) bin1
    let ints2 = map (\x -> if x < 1000 then 2*x else x) ints1
    let bin2 = runPut (mapM_ putWord32le ints2)
    BIN.hPut h_out bin2
This, I believe, does what you asked for. It reads 1000 chunks of chunk_size bytes, converts each one into a list of Word32 (so it only ever has chunk_size / 4 integers in memory at once), does the calculation you specified, and writes the result back out again.
Obviously if you did this "for real" you'd want EOF checking and such.
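For example, a minimal sketch of such an EOF check (my variant, not the answerer's): loop until the input runs out instead of assuming exactly 1000 chunks, and size the final decode by the bytes actually read:
process_until_eof h_in h_out = do
    eof <- hIsEOF h_in
    if eof
        then return ()
        else do
            bin1 <- BIN.hGet h_in chunk_size
            -- the last chunk may be short, so derive the count from what was read
            let n = fromIntegral (BIN.length bin1) `div` int_size
                ints1 = runGet (replicateM n getWord32le) bin1
                ints2 = map (\x -> if x < 1000 then 2*x else x) ints1
            BIN.hPut h_out (runPut (mapM_ putWord32le ints2))
            process_until_eof h_in h_out
main would then call process_until_eof h_in h_out in place of the replicateM_ line.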
The best way to work with binary I/O in Haskell is by using bytestrings. Lazy bytestrings provide buffered I/O, so you don't even need to care about it.
The code below assumes that the chunk size is a multiple of 32 bits (which it is).
module Main where

import Data.Word
import Control.Monad
import Data.Binary.Get
import Data.Binary.Put
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString as BStrict

-- Convert one bytestring chunk to a list of integers
-- and append the result of converting the later chunks.
-- It actually appends only a closure which will evaluate the next
-- block of numbers on demand.
toNumbers :: BStrict.ByteString -> [Word32] -> [Word32]
toNumbers chunk rest = chunkNumbers ++ rest
  where
    getNumberList = replicateM (BStrict.length chunk `div` 4) getWord32le
    chunkNumbers = runGet getNumberList (BS.fromStrict chunk)

main :: IO ()
main = do
    -- every operation below is done lazily, consuming input as necessary
    input <- BS.readFile "in.dat"
    let inNumbers = BS.foldrChunks toNumbers [] input
    let outNumbers = map (\x -> if x < 1000 then 2*x else x) inNumbers
    let output = runPut (mapM_ putWord32le outNumbers)
    -- Here the lazy bytestring output is evaluated and saved chunk
    -- by chunk, pulling data from the input file, decoding, processing
    -- and encoding it back one chunk at a time
    BS.writeFile "out.dat" output
Here is a loop to process one line at a time from stdin:
import System.IO
loop = do
    b <- hIsEOF stdin
    if b
        then return ()
        else do
            str <- hGetLine stdin
            let str' = ...process str...
            hPutStrLn stdout str'
Now just replace hGetLine with something that reads 4 bytes, etc.
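A hedged sketch of that replacement (my illustration, reusing the doubling rule from the question): read one 32-bit little-endian value at a time with Data.ByteString.hGet:
import System.IO
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)

loop32 :: IO ()
loop32 = do
    eof <- hIsEOF stdin
    if eof
        then return ()
        else do
            bs <- BS.hGet stdin 4   -- one 32-bit value
            -- (a short final read would make runGet fail; glossed over here)
            let w  = runGet getWord32le (BL.fromStrict bs)
                w' = if w < 1000 then 2 * w else w
            BS.hPut stdout (BL.toStrict (runPut (putWord32le w')))
            loop32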
Here is the I/O section for Data.ByteString:
https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#g:29

Space explosion when folding over Producers/Parsers in Haskell

Supposing I have a module like this:
module Explosion where

import Pipes.Parse (foldAll, Parser, Producer)
import Pipes.ByteString (ByteString, fromLazy)
import Pipes.Aeson (DecodingError)
import Pipes.Aeson.Unchecked (decoded)
import Data.List (intercalate)
import Data.ByteString.Lazy.Char8 (pack)
import Lens.Family (view)
import Lens.Family.State.Strict (zoom)

produceString :: Producer ByteString IO ()
produceString = fromLazy $ pack $ intercalate " " $ map show [1..1000000]

produceInts :: Producer Int IO (Either (DecodingError, Producer ByteString IO ()) ())
produceInts = view decoded produceString

produceInts' :: Producer Int IO ()
produceInts' = produceInts >> return ()

parseBiggest :: Parser ByteString IO Int
parseBiggest = zoom decoded (foldAll max 0 id)
The 'produceString' function is a bytestring producer, and I am concerned with folding a parse over it to produce some kind of result.
The following two programs show different ways of tackling the problem of finding the maximum value in the bytestring by parsing it as a series of JSON ints.
Program 1:
module Main where

import Explosion (produceInts')
import Pipes.Prelude (fold)

main :: IO ()
main = do
    biggest <- fold max 0 id produceInts'
    print $ show biggest
Program 2:
module Main where

import Explosion (parseBiggest, produceString)
import Pipes.Parse (evalStateT)

main :: IO ()
main = do
    biggest <- evalStateT parseBiggest produceString
    print $ show biggest
Unfortunately, both programs eat about 200MB of memory total when I profile them, a problem I'd hoped the use of streaming parsers would solve. The first program spends most of its time and memory (> 70%) in (^.) from Lens.Family, while the second spends it in fmap, called by zoom from Lens.Family.State.Strict. The usage graphs are below. Both programs spend about 70% of their time doing garbage collection.
Am I doing something wrong? Is the Prelude function max not strict enough? I can't tell if the library functions are bad, or if I'm using the library wrong! (It's probably the latter.)
For completeness, here's a git repo that you can clone and run cabal install in if you'd like to see what I'm talking about first-hand. [Memory-usage graphs for the two programs omitted.]
Wrapping a strict bytestring in a single yield doesn't make it lazy. You have to yield smaller chunks to get any streaming behavior.
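To make that concrete (my sketch, not part of the original answer): a producer only streams if the lazy bytestring it wraps actually consists of many small chunks, since yielding happens per chunk. This is essentially what Pipes.ByteString.fromLazy does, and pack builds a single gigantic chunk, leaving only one thing to yield:
import Pipes
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- essentially what Pipes.ByteString.fromLazy does: one yield per chunk of the lazy input
fromLazyChunked :: Monad m => BL.ByteString -> Producer B.ByteString m ()
fromLazyChunked = mapM_ yield . BL.toChunks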
Edit: I found the error. pipes-aeson internally uses a consecutively function defined like this:
consecutively parser = step
  where
    step p0 = do
        (mr, p1) <- lift $
            S.runStateT atEndOfBytes (p0 >-> PB.dropWhile B.isSpaceWord8)
        case mr of
            Just r  -> return (Right r)
            Nothing -> do
                (ea, p2) <- lift (S.runStateT parser p1)
                case ea of
                    Left  e -> return (Left (e, p2))
                    Right a -> yield a >> step p2
The problematic line is the one with PB.dropWhile: it adds a quadratic blowup in the number of parsed elements.
What happens is that the pipe threaded through this computation accumulates a new cat pipe downstream of it after each parse. So after N parses you get N cat pipes, which adds O(N) overhead to each parsed element and O(N²) work overall.
I've created a Github issue to fix this. pipes-aeson is maintained by Renzo and he has fixed this issue before.
Edit: I've submitted a pull request to fix a second problem (you needed to use intercalate for lazy bytestrings). Now the program runs in 5 KB of constant space for both versions. [Memory-usage graph omitted.]
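A sketch of what that second fix amounts to in the Explosion module above (assuming import qualified Data.ByteString.Lazy.Char8 as BL8 in place of the plain pack import): building the payload with the lazy-bytestring intercalate keeps the input in many small chunks instead of one gigantic strict string:
produceString :: Producer ByteString IO ()
produceString = fromLazy $ BL8.intercalate (BL8.pack " ") $ map (BL8.pack . show) [1..1000000]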

Is there something better than unsafePerformIO for this....?

I've so far avoided ever needing unsafePerformIO, but this might have to change today.... I would like to see if the community agrees, or if someone has a better solution.
I have a library which needs to use some config data stored in a bunch of files. This data is guaranteed static (during the run), but needs to be in files that can (on very rare occasions) be edited by an end user who cannot compile Haskell programs. (The details are unimportant, but think of "/etc/mime.types" as a pretty good approximation: it is a large, almost static data file used throughout many programs.)
If this weren't a library I would just use the IO monad.... But because it is a library which is called throughout my code, it literally forces a bubbling up of the IO monad through pretty much everything I have written in multiple modules! Although I need to do a one-time read of the data files, this low level call is effectively pure, so this is a pretty unacceptable outcome.
FYI, I plan to also wrap the call in unsafeInterleaveIO, so that only files that are needed will be loaded. My code will look something like this....
dataDir = "<path to files>"

datafiles :: [FilePath]
datafiles =
    unsafePerformIO $
        unsafeInterleaveIO $
            map (dataDir </>)
                <$> filter (not . ("." `isPrefixOf`))
                <$> getDirectoryContents dataDir

fileData :: [String]
fileData = unsafePerformIO $ unsafeInterleaveIO $ sequence $ readFile <$> datafiles
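One standard precaution worth adding here (mine, not from the question): top-level unsafePerformIO bindings are usually given NOINLINE pragmas, so GHC cannot duplicate the action by inlining and each read happens at most once:
{-# NOINLINE datafiles #-}
{-# NOINLINE fileData #-}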
Given that the data read is referentially transparent, I am pretty sure that unsafePerformIO is safe (this has been discussed in many places, such as "Use of unsafePerformIO appropriate?"). Still, though, if there is a better way, I would love to hear about it.
UPDATE-
In response to Anupam's comment....
There are two reasons why I can't break up the lib into IO and non-IO parts.
First, the amount of data is large, and I don't want to read it all into memory at once. Remember that IO is always read strictly.... This is the reason I need to put in the unsafeInterleaveIO call, to make it lazy. IMHO, once you use unsafeInterleaveIO, you might as well use unsafePerformIO, as the risk is already there.
Second, breaking out the IO-specific parts just substitutes the bubbling up of the IO read code for the bubbling up of the IO monad, as well as the passing around of the data (I might actually choose to pass the data around using the State monad anyway, so it really isn't an improvement to substitute the State monad for the IO monad everywhere). This wouldn't be so bad if the low level function itself weren't effectively pure (i.e., think of my /etc/mime.types example above, and imagine a Haskell extensionToMimeType function, which is basically pure but needs to get the database data from the file.... Suddenly everything from low to high in the stack needs to call or pass through a readMimeData :: IO String. Why should main even need to care about the library choice of a submodule many levels deep?).
I agree with Anupam Jain: you would be better off reading these data files at a somewhat higher level, in IO, and then passing the data in them through the rest of your program purely.
You could, for example, put the functions that need the results of fileData into Reader [String], so that they can just ask for the results as needed (or some Reader Config, where Config holds these strings and whatever else you need).
A sketch of what I'm suggesting follows (Reader and ask come from mtl's Control.Monad.Reader):
import Control.Monad.Reader

type AppResult = String

fileData :: IO [String]
fileData = undefined -- read the files

myApp :: String -> Reader [String] AppResult
myApp s = do
    files <- ask
    return undefined -- do whatever with s and the config

main = do
    config <- fileData
    return $ runReader (myApp "test") config
I gather that you don't want to read all the data at once, because that would be costly. And maybe you don't really know up-front what files you will need to load, so loading all of them at the start would be wasteful.
Here's an attempt at a solution. It requires you to work inside a free monad and relegate the side-effecting operations to an interpreter. Some preliminary imports:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.ByteString as B
import Data.Monoid
import Data.List
import Data.Functor.Compose
import Control.Applicative
import Control.Monad
import Control.Monad.Free
import System.IO
We define a functor for the free monad. It will offer a value p to the interpreter and continue the computation after receiving a value b:
type LazyLoad p b = Compose ((,) p) ((->) b)
A convenience function to request the loading of a file:
lazyLoad :: FilePath -> Free (LazyLoad FilePath B.ByteString) B.ByteString
lazyLoad path = liftF $ Compose (path,id)
A dummy interpreter function that reads "file contents" from stdin:
interpret :: Free (LazyLoad FilePath B.ByteString) a -> IO a
interpret = iterM $ \(Compose (path, next)) -> do
    putStrLn $ "Enter the contents for file " <> path <> ":"
    B.hGetLine stdin >>= next
Some silly example functions:
someComp :: B.ByteString -> B.ByteString
someComp b = "[" <> b <> "]"
takesAwhile :: Int
takesAwhile = foldl' (+) 0 $ take 400000000 $ intersperse (negate 1) $ repeat 1
An example program:
main :: IO ()
main = do
    r <- interpret $ do
        r1 <- someComp <$> lazyLoad "file1"
        r2 <- return takesAwhile
        if (r2 == 1)
            then return r1
            else someComp <$> lazyLoad "file2"
    putStrLn . show $ r
When executed, this program will request a line, spend some time computing takesAwhile and only then request another line.
If you want to allow different kinds of "requests", this solution could be extended with something like Data types à la carte so that each function only needs to know about the precise effects it requires.
If you are content with allowing only one type of request, you could also use Clients and Servers from Pipes.Core instead of the free monad.
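For completeness, a hedged sketch of that Pipes.Core alternative (mine, not the answerer's; it mirrors the stdin-based dummy interpreter above, using the request/respond primitives from Pipes.Core):
module Main where

import Pipes.Core
import Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString as B
import System.IO (stdin)

-- The logic lives in a Client: it asks for file contents by path.
client :: Client FilePath B.ByteString IO B.ByteString
client = do
    b1 <- request "file1"
    b2 <- request "file2"
    return (B.append b1 b2)

-- The interpreter is a Server: it answers each request, here by reading a line from stdin.
server :: FilePath -> Server FilePath B.ByteString IO r
server path = do
    liftIO $ putStrLn ("Enter the contents for file " ++ path ++ ":")
    bs    <- liftIO (B.hGetLine stdin)
    path' <- respond bs
    server path'

main :: IO ()
main = print =<< runEffect (server +>> client)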
