I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.
import System
import qualified Data.ByteString.Lazy as BL
import Data.Word
import Data.List

main :: IO ()
main =
  do
    args <- System.getArgs
    let filename = head args
    byteString <- BL.readFile filename
    let wordsList = BL.unpack byteString
    let foldFun acc word = doSomeStuff word : acc
    let wordsListCopy = foldl' foldFun [] wordsList
    let byteStringCopy = BL.pack (reverse wordsListCopy)
    BL.writeFile (filename ++ ".cpy") byteStringCopy
  where
    doSomeStuff = id
I name this file TestCopy.hs, then do the following:
$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root 5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs
[1 of 1] Compiling Main ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB
real 0m5.631s
user 0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB
real 3m6.671s
user 0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy
$ time ./TestCopy 10MB +RTS -K500M -RTS
real 2m50.261s
user 0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy
$
My problem: There is a huge difference between a 5 MB and a 10 MB file. I'd like the performance to be linear in the size of the input file. Please, what am I doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else, as long as it works, but it has to be a standard GHC library.
To clarify: it's for a university project, and I'm not trying to copy files. The doSomeStuff function will perform compression/decompression actions that I have to customize.
For chunked input processing I would use the enumerator package.
import Data.Enumerator
import Data.Enumerator.Binary (enumFile)
We use bytestrings
import Data.ByteString as BS
and IO
import Control.Monad.Trans (liftIO)
import Control.Monad (mapM_)
import System (getArgs)
Your main function could look like the following:

main =
    do (filepath:_) <- getArgs
       run_ $ enumFile filepath $$ enumWrite (filepath ++ ".cpy")

enumFile reads 4096 bytes per chunk and passes them to enumWrite, which writes them to the destination file.
enumWrite is defined as:
enumWrite :: FilePath -> Iteratee BS.ByteString IO ()
enumWrite filepath =
  do liftIO (BS.writeFile filepath BS.empty) -- ensure the destination is empty
     continue step
  where
    step (Chunks xs) =
      do liftIO (mapM_ (BS.appendFile filepath) xs)
         continue step
    step EOF = yield () EOF
As you can see, the step function takes chunks of bytestrings and appends them
to the destination file. These chunks have the type Stream BS.ByteString, where
Stream is defined as:
data Stream a = Chunks [a] | EOF
On EOF, step terminates, yielding ().
For a much more elaborate read on this I personally recommend Michael
Snoyman's tutorial.
The numbers
$ time ./TestCopy 5MB
./TestCopy 5MB 2,91s user 0,32s system 96% cpu 3,356 total
$ time ./TestCopy2 5MB
./TestCopy2 5MB 0,04s user 0,03s system 93% cpu 0,075 total
That's quite an improvement. Now in order to implement your fold you probably want to write an Enumeratee, which is used to transform a input stream. Fortunately there is already a map function defined in the enumerator package, which can be modified for your need, i.e. it can be modified to carry over state.
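To make that concrete, here is a rough sketch of my own (not part of the original answer) for a stateless per-byte transformation: it uses Data.Enumerator.List.map to lift the function into an Enumeratee and joinI to splice it in front of the enumWrite iteratee defined above. Carrying state across chunks would mean writing your own enumeratee along the lines of EL.map's source. copyWith is a made-up name.

import Data.Word (Word8)
import qualified Data.Enumerator.List as EL

-- Hedged sketch: copy source to dest, applying f to every byte of every chunk.
-- The question's doSomeStuff would be passed as f.
copyWith :: (Word8 -> Word8) -> FilePath -> FilePath -> IO ()
copyWith f source dest =
    run_ $ enumFile source $$ joinI (EL.map (BS.map f) $$ enumWrite dest)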
On the construction of the intermediate result
You construct wordsList in reverse order and reverse it afterwards. I think difference lists do a better job, because an append takes only O(1) time, since appending is just function composition. I'm not sure whether they take more space, though. Here's a rough sketch of difference lists:
type DList a = [a] -> [a]
emptyList :: DList a
emptyList = id
snoc :: DList a -> a -> DList a
snoc dlist a = dlist . (a:)
toList :: DList a -> [a]
toList dlist = dlist []
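For illustration, here is a hedged sketch of mine (not the answer's) of the question's fold rewritten with the DList above, so no final reverse is needed; processWords is a made-up name and doSomeStuff is passed in as in the question:

import Data.List (foldl')
import Data.Word (Word8)

-- Hedged sketch: accumulate with snoc instead of cons + reverse.
processWords :: (Word8 -> Word8) -> [Word8] -> [Word8]
processWords doSomeStuff wordsList =
    toList (foldl' foldFun emptyList wordsList)
  where
    foldFun acc word = acc `snoc` doSomeStuff word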
This answer is probably not needed anymore, but I added it for completeness.
I take it this is a follow-on to Reading large file in haskell? from yesterday.
Try compiling with "-rtsopts -O2" instead of just "-O".
You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.
With my system I see "ghc -rtsopts --make -O2 b.hs" giving
(! 741)-> time ./b five
real 0m2.789s user 0m2.235s sys 0m0.518s
(! 742)-> time ./b ten
real 0m5.830s user 0m4.623s sys 0m1.027s
Which now looks linear to me.
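For what it's worth, here is a hedged sketch of mine (not the answer's code) of a version that genuinely streams: it keeps the lazy ByteString and maps over it instead of unpacking it to a list of Word8, so input is read and output written chunk by chunk. It uses System.Environment.getArgs in place of the old System module.

import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)
import System.Environment (getArgs)

main :: IO ()
main = do
  (filename:_) <- getArgs
  byteString <- BL.readFile filename
  -- BL.map works byte by byte but preserves the lazy chunk structure,
  -- so the copy runs in constant space instead of building a huge list.
  BL.writeFile (filename ++ ".cpy") (BL.map doSomeStuff byteString)
  where
    doSomeStuff :: Word8 -> Word8
    doSomeStuff = id   -- placeholder, as in the question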
I am a Haskell beginner and still learning about monad transformers.
I am trying to use the streaming-bytestring library to read a binary file, process chunks of bytes, and print the result as each chunk is processed. I believe this is the popular streaming library that provides an alternative to lazy bytestrings. It appears the authors copy-pasted the lazy bytestring documentation and added some arbitrary examples.
The examples mention runResourceT without going into any discussion of what it is or how to use it. It appears that I should use runResourceT on any streaming-bytestring function that performs an action. That's fine, but what if I'm reading an infinite stream that processes chunks and prints them? Should I call runResourceT every time I want to process a chunk?
My code is something like this:
import qualified Data.ByteString.Streaming as BSS
import System.TimeIt
main = timeIt $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
and I'm unsure of how to organize processByteChunks as a recursive function that iterates through the binary file.
If I call runResourceT only once, it would read the infinite file BEFORE printing, right? That seems bad.
main = timeIt $ runResourceT $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
The ResourceT monad just cleans up resources in a timely fashion when you're finished with them. In this case, it will ensure the file handle opened by BSS.readFile is closed when the stream is consumed. (Unless the stream truly is infinite, in which case I guess it won't.)
In your application, you only want to call it once, since you don't want the file closed until you've read all the chunks. Don't worry -- it has nothing to do with the timing of output or anything like that.
Here's an example with a recursive processByteChunks that should work. It will read lazily and generate output as chunks are lazily read:
import Control.Monad.IO.Class
import Control.Monad.Trans.Resource
import qualified Data.ByteString.Streaming as BSS
import qualified Data.ByteString as BS
import System.TimeIt
main :: IO ()
main = timeIt $ runResourceT $
  processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"

processByteChunks :: MonadIO m => BSS.ByteString m () -> m ()
processByteChunks = go 0 0
  where
    go len nulls stream = do
      m <- BSS.unconsChunk stream
      case m of
        Just (bs, stream') -> do
          let len'   = len   + BS.length bs
              nulls' = nulls + BS.length (BS.filter (== 0) bs)
          liftIO $ print $ "cumulative length=" ++ show len'
                        ++ ", nulls=" ++ show nulls'
          go len' nulls' stream'
        Nothing -> return ()
I wrote the code below to simulate an upload to S3 from a lazy ByteString (which will be received over a network socket; here we simulate it by reading from a file of size ~100MB). The problem with the code below is that it seems to force the read of the entire file into memory instead of chunking it (cbytes) - I will appreciate pointers on why the chunking is not working:
import Control.Lens
import Network.AWS
import Network.AWS.S3
import Network.AWS.Data.Body
import System.IO
import Data.Conduit (($$+-))
import Data.Conduit.Binary (sinkLbs,sourceLbs)
import qualified Data.Conduit.List as CL (mapM_)
import Network.HTTP.Conduit (responseBody,RequestBody(..),newManager,tlsManagerSettings)
import qualified Data.ByteString.Lazy as LBS
example :: IO PutObjectResponse
example = do
  -- To specify configuration preferences, newEnv is used to create a new Env. The Region denotes the AWS region requests will be performed against,
  -- and Credentials is used to specify the desired mechanism for supplying or retrieving AuthN/AuthZ information.
  -- In this case, Discover will cause the library to try a number of options such as default environment variables, or an instance's IAM Profile:
  e <- newEnv NorthVirginia Discover

  -- A new Logger to replace the default noop logger is created, with the logger set to print debug information and errors to stdout:
  l <- newLogger Debug stdout

  -- The payload for the S3 object is retrieved from a file that simulates a lazy bytestring received over a network socket
  inb <- LBS.readFile "out"
  lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
  let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (sourceLbs inb)

  -- We now run the AWS computation with the overridden logger, performing the PutObject request:
  runResourceT . runAWS (e & envLogger .~ l) $
    send ((putObject "yourtestenv-change-it-please" "testbucket/test" cbytes) & poContentType .~ Just "text; charset=UTF-8")
main = example >> return ()
Running the executable with the +RTS -s option shows that the entire thing is read into memory (~113MB maximum residency - I did see ~87MB once). On the other hand, if I use chunkedFile, it is chunked correctly (~10MB maximum residency).
It's clear this bit
inb <- LBS.readFile "out"
lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (sourceLbs inb)
should be rewritten as
lenb <- System.IO.withFile "out" ReadMode hFileSize -- evaluates to 104857600 (100MB)
let cbytes = toBody $ ChunkedBody (1024*128) (fromIntegral lenb) (C.sourceFile "out")
As you wrote it, the purpose of conduits is defeated. The entire file would need to be accumulated by LBS.readFile, but then broken apart chunk by chunk when fed to sourceLBS. (If lazy IO is working right, this might not happen.) sourceFile reads the file incrementally, chunk by chunk. It may be that, e.g. toBody accumulates the whole file, in which case the point of conduits is defeated at a different point. Glancing at the source for send and so on I can't see anything that would do this, though.
I am not sure, but I think the culprit is LBS.readFile; its documentation says:
readFile :: FilePath -> IO ByteString
Read an entire file lazily into a ByteString.
The Handle will be held open until EOF is encountered.
chunkedFile works in the conduit way - alternatively you could use
sourceFile :: MonadResource m => FilePath -> Producer m ByteString
from conduit-extra's Data.Conduit.Binary instead of LBS.readFile, but I am not an expert.
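For comparison, here is a rough sketch of mine (not from either answer) of the chunkedFile route the question says already behaves well. The exact signatures are my assumptions - I'm taking chunkedFile and defaultChunkSize from Network.AWS.Data.Body to have roughly the shapes chunkedFile :: MonadIO m => ChunkSize -> FilePath -> m RqBody and defaultChunkSize :: ChunkSize:

-- Hedged sketch: upload the file via chunkedFile, which streams it in
-- fixed-size chunks instead of going through a lazy ByteString first.
exampleChunked :: IO PutObjectResponse
exampleChunked = do
  e <- newEnv NorthVirginia Discover
  l <- newLogger Debug stdout
  body <- chunkedFile defaultChunkSize "out"   -- assumed signature, see above
  runResourceT . runAWS (e & envLogger .~ l) $
    send (putObject "yourtestenv-change-it-please" "testbucket/test" body)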
I wish someone would give complete working code that allows me to do the following in Haskell:

Read a very large sequence (more than 1 billion elements) of 32-bit int values from a binary file into an appropriate container (e.g. certainly not a list, for performance reasons), double each number if it's less than 1000 (decimal), and then write the resulting 32-bit int values to another binary file. I may not want to read the entire contents of the binary file into memory at once. I want to read one chunk after the previous.
I am confused because I could find very little documentation about this. Data.Binary, ByteString, Word8 and whatnot - it just adds to the confusion. There is a pretty straightforward solution to such problems in C/C++: take an array (e.g. of unsigned int) of the desired size, use the read/write library calls and be done with it. In Haskell it didn't seem so easy, at least to me.
I'd appreciate if your solution uses the best possible standard packages that are available with mainstream Haskell (> GHC 7.10) and not some obscure/obsolete ones.
I read from these pages
https://wiki.haskell.org/Binary_IO
https://wiki.haskell.org/Dealing_with_binary_data
If you're doing binary I/O, you almost certainly want ByteString for the actual input/output part. Have a look at the hGet and hPut functions it provides. (Or, if you only need strictly linear access, you can try using lazy I/O, but it's easy to get that wrong.)
Of course, a byte string is just an array of bytes; your next problem is interpreting those bytes as character / integers / doubles / whatever else they're supposed to be. There are a couple of packages for that, but Data.Binary seems to be the most mainstream one.
The documentation for binary seems to want to steer you towards using the Binary class, where you write code to serialise and deserialise whole objects. But you can use the functions in Data.Binary.Get and Data.Binary.Put to deal with individual items. There you will find functions such as getWord32be (get Word32 big-endian) and so forth.
I don't have time to write a working code example right now, but basically look at the functions I mention above and ignore everything else, and you should get some idea.
Now with working code:
module Main where
import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import Data.Binary.Put
import Control.Monad
import System.IO
main = do
  h_in  <- openFile "Foo.bin" ReadMode
  h_out <- openFile "Bar.bin" WriteMode
  replicateM 1000 (process_chunk h_in h_out)
  hClose h_in
  hClose h_out

chunk_size = 1000
int_size = 4

process_chunk h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  let ints1 = runGet (replicateM (chunk_size `div` int_size) getWord32le) bin1
  let ints2 = map (\ x -> if x < 1000 then 2*x else x) ints1
  let bin2 = runPut (mapM_ putWord32le ints2)
  BIN.hPut h_out bin2
This, I believe, does what you asked for. It reads 1000 chunks of chunk_size bytes, converts each one into a list of Word32 (so it only ever has chunk_size / 4 integers in memory at once), does the calculation you specified, and writes the result back out again.
Obviously if you did this "for real" you'd want EOF checking and such.
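For completeness, here is a hedged sketch of mine (not part of the answer) of the same chunk loop with EOF handling in place of the fixed replicateM 1000. It reuses the imports and the chunk_size/int_size constants from the code above; processLoop is a made-up name, and the caller would pass the two handles opened in main.

-- Hedged sketch: keep reading chunks until hGet returns an empty bytestring.
processLoop :: Handle -> Handle -> IO ()
processLoop h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  if BIN.null bin1
    then return ()
    else do
      let count = fromIntegral (BIN.length bin1) `div` int_size
          ints1 = runGet (replicateM count getWord32le) bin1
          ints2 = map (\x -> if x < 1000 then 2*x else x) ints1
      BIN.hPut h_out (runPut (mapM_ putWord32le ints2))
      processLoop h_in h_out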
The best way to work with binary I/O in Haskell is by using bytestrings. Lazy bytestrings provide buffered I/O, so you don't even need to care about it.
The code below assumes that the chunk size is a multiple of 32 bits (which it is).
module Main where
import Data.Word
import Control.Monad
import Data.Binary.Get
import Data.Binary.Put
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString as BStrict
-- Convert one bytestring chunk to the list of integers
-- and append the result of conversion of the later chunks.
-- It actually appends only closure which will evaluate next
-- block of numbers on demand.
toNumbers :: BStrict.ByteString -> [Word32] -> [Word32]
toNumbers chunk rest = chunkNumbers ++ rest
  where
    getNumberList = replicateM (BStrict.length chunk `div` 4) getWord32le
    chunkNumbers  = runGet getNumberList (BS.fromStrict chunk)

main :: IO ()
main = do
  -- every operation below is done lazily, consuming input as necessary
  input <- BS.readFile "in.dat"
  let inNumbers = BS.foldrChunks toNumbers [] input
  let outNumbers = map (\x -> if x < 1000 then 2*x else x) inNumbers
  let output = runPut (mapM_ putWord32le outNumbers)
  -- Here the lazy bytestring output is evaluated and saved chunk
  -- by chunk, pulling data from the input file, decoding, processing
  -- and encoding it back one chunk at a time
  BS.writeFile "out.dat" output
Here is a loop to process one line at a time from stdin:
import System.IO
loop = do b <- hIsEOF stdin
          if b then return ()
               else do str <- hGetLine stdin
                       let str' = ...process str...
                       hPutStrLn stdout str'
                       loop
Now just replace hGetLine with something that reads 4 bytes, etc.
Here is the I/O section for Data.ByteString:
https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#g:29
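To illustrate that replacement, here is a hedged sketch of mine (not from the answer) of the same loop reading 4 bytes at a time with Data.ByteString.hGet and decoding/encoding them via Data.Binary.Get/Put. loop32 is a made-up name, and the transformation is the one from the question.

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import System.IO

main :: IO ()
main = do
  hSetBinaryMode stdin True    -- avoid any newline translation on binary data
  hSetBinaryMode stdout True
  loop32

loop32 :: IO ()
loop32 = do
  b <- hIsEOF stdin
  if b
    then return ()
    else do bytes <- B.hGet stdin 4                      -- read one 32-bit value
            let n  = runGet getWord32le (L.fromStrict bytes)
                n' = if n < 1000 then 2 * n else n       -- transformation from the question
            L.hPut stdout (runPut (putWord32le n'))
            loop32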
Supposing I have a module like this:
module Explosion where
import Pipes.Parse (foldAll, Parser, Producer)
import Pipes.ByteString (ByteString, fromLazy)
import Pipes.Aeson (DecodingError)
import Pipes.Aeson.Unchecked (decoded)
import Data.List (intercalate)
import Data.ByteString.Lazy.Char8 (pack)
import Lens.Family (view)
import Lens.Family.State.Strict (zoom)
produceString :: Producer ByteString IO ()
produceString = fromLazy $ pack $ intercalate " " $ map show [1..1000000]
produceInts ::
  Producer Int IO (Either (DecodingError, Producer ByteString IO ()) ())
produceInts = view decoded produceString
produceInts' :: Producer Int IO ()
produceInts' = produceInts >> return ()
parseBiggest :: Parser ByteString IO Int
parseBiggest = zoom decoded (foldAll max 0 id)
The 'produceString' function is a bytestring producer, and I am concerned with folding a parse over it to produce some kind of result.
The following two programs show different ways of tackling the problem of finding the maximum value in the bytestring by parsing it as a series of JSON ints.
Program 1:
module Main where
import Explosion (produceInts')
import Pipes.Prelude (fold)
main :: IO ()
main = do
  biggest <- fold max 0 id produceInts'
  print $ show biggest
Program 2:
module Main where
import Explosion (parseBiggest, produceString)
import Pipes.Parse (evalStateT)
main :: IO ()
main = do
  biggest <- evalStateT parseBiggest produceString
  print $ show biggest
Unfortunately, both programs eat about 200MB of memory total when I profile them, a problem I'd hoped the use of streaming parsers would solve. The first program spends most of its time and memory (> 70%) in (^.) from Lens.Family, while the second spends it in fmap, called by zoom from Lens.Family.State.Strict. The usage graphs are below. Both programs spend about 70% of their time doing garbage collection.
Am I doing something wrong? Is the Prelude function max not strict enough? I can't tell if the library functions are bad, or if I'm using the library wrong! (It's probably the latter.)
For completeness, here's a git repo that you can clone and run cabal install in if you'd like to see what I'm talking about first-hand, and here's the memory usage of the two programs:
Wrapping a strict bytestring in a single yield doesn't make it lazy. You have to yield smaller chunks to get any streaming behavior.
Edit: I found the error. pipes-aeson internally uses a consecutively function defined like this:
consecutively parser = step where
    step p0 = do
        (mr, p1) <- lift $
            S.runStateT atEndOfBytes (p0 >-> PB.dropWhile B.isSpaceWord8)
        case mr of
            Just r  -> return (Right r)
            Nothing -> do
                (ea, p2) <- lift (S.runStateT parser p1)
                case ea of
                    Left  e -> return (Left (e, p2))
                    Right a -> yield a >> step p2
The problematic line is the one with PB.dropWhile. This adds a quadratic blow up proportional to the number of parsed elements.
What happens is that the pipe that is threaded through this computation accumulates a new cat pipe downstream of it after each parse. So after N parses you get N cat pipes, which adds O(N) overhead to each parsed element.
I've created a Github issue to fix this. pipes-aeson is maintained by Renzo and he has fixed this issue before.
Edit: I've submitted a pull request to fix a second problem (you needed to use the intercalate for lazy bytestrings). Now the program runs in 5 KB constant space for both versions:
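The intercalate fix mentioned above could look roughly like the sketch below; this is my own reconstruction, not the pull request's code, and produceStringLazy is a made-up name. The point is to build the chunks with the lazy bytestring's own intercalate rather than packing one huge String, so fromLazy can yield many small chunks.

import qualified Data.ByteString.Lazy.Char8 as BL
import Pipes (Producer)
import Pipes.ByteString (ByteString, fromLazy)

-- Hedged sketch: a lazily chunked producer of "1 2 3 ... 1000000".
produceStringLazy :: Producer ByteString IO ()
produceStringLazy =
  fromLazy $ BL.intercalate (BL.pack " ") $ map (BL.pack . show) [1..1000000 :: Int]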
How can I read multiple files as a single ByteString lazily with constant memory?
readFiles :: [FilePath] -> IO ByteString
I currently have the following implementation, but from what I have seen from profiling, as well as my understanding, I will end up with n-1 of the files in memory.
readFiles = foldl1 joinIOStrings . map ByteString.readFile
  where joinIOStrings ml mr = do
          l <- ml
          r <- mr
          return $ l `ByteString.append` r
I understand that the flaw here is that I am applying the IO actions and then rewrapping them, so what I think I need is a way to replace the foldl1 joinIOStrings without applying them.
How can I read multiple files as a single ByteString lazily with constant memory?
If you want constant memory usage, you need Data.ByteString.Lazy. A strict ByteString cannot be read lazily, and would require O(sum of filesizes) memory.
For a not too large number of files, simply reading them all (D.B.L.readFile reads lazily) and concatenating the results is good,
import qualified Data.ByteString.Lazy as L
readFiles :: [FilePath] -> IO L.ByteString
readFiles = fmap L.concat . mapM L.readFile
The mapM L.readFile will open the files, but only read the contents of each file when it is demanded.
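As a usage sketch (mine, with made-up file names), the concatenated stream can be written straight back out, and memory stays flat because the lazy writeFile pulls the data through in chunks:

main :: IO ()
main = do
  contents <- readFiles ["a.bin", "b.bin", "c.bin"]  -- hypothetical input files
  L.writeFile "combined.bin" contents                -- consumed chunk by chunk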
If the number of files is large, so that the limit of open file handles allowed by the OS for a single process could be exhausted, you need something more complicated. You can cook up your own lazy version of mapM,
import System.IO.Unsafe (unsafeInterleaveIO)
mapM_lazy :: [IO a] -> IO [a]
mapM_lazy [] = return []
mapM_lazy (x:xs) = do
    r <- x
    rs <- unsafeInterleaveIO (mapM_lazy xs)
    return (r:rs)
so that each file will only be opened when its contents are needed, when previously read files can already be closed. There's a slight possibility that that still runs into resource limits, since the time of closing the handles is not guaranteed.
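Plugging mapM_lazy into the earlier definition might look like this (a sketch of mine; readFilesLazy is a made-up name):

import qualified Data.ByteString.Lazy as L

-- Hedged sketch: files are opened only as their contents are demanded.
readFilesLazy :: [FilePath] -> IO L.ByteString
readFilesLazy = fmap L.concat . mapM_lazy . map L.readFile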
Or you can use your favourite iteratee, enumerator, conduit or whatever package that solves the problem in a systematic way. Each of them has its own advantages and disadvantages with respect to the others and, if coded correctly, eliminates the possibility of accidentally hitting the resource limit.
I assume that you are using lazy byte strings (from Data.ByteString.Lazy). There are probably other ways to do this, but one option is to simply use concat :: [ByteString] -> ByteString:
import Control.Monad
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as ByteString
readFiles :: [FilePath] -> IO ByteString
readFiles = fmap ByteString.concat . mapM ByteString.readFile
(Note: I don't have time to test the code, but from reading the documentation this should work.)