I use haskell for line-based data processing, i.e. tasks where you can apply sed, awk and similar tools. As a trivial example, let's prepend 000 to every line from standard input.
I have three alternative ways to do the task:
Lazy IO with lazy ByteStrings
Line-based conduit.
Chunk-based conduit with pure strict ByteString processing inside.
example.hs:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
import ClassyPrelude.Conduit
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy.Char8 as BL8
import qualified Data.Conduit.Binary as CB
main = do
[arg] <- getArgs
case arg of
"lazy" -> BL8.getContents >>= BL8.putStr . BL8.unlines . map ("000" ++) . BL8.lines
"lines" -> runConduitRes $ stdinC .| CB.lines .|
mapC ("000" ++) .| mapC (`snoc` 10) .| stdoutC
"chunks" -> runConduitRes $ stdinC .| lineChunksC .|
mapC (B8.unlines . (map ("000" ++)) . B8.lines) .| stdoutC
lineChunksC :: Monad m => Conduit ByteString m ByteString
lineChunksC = await >>= maybe (return ()) go
where
go acc = if
| Just (_, 10) <- unsnoc acc -> yield acc >> lineChunksC
| otherwise -> await >>= maybe (yield acc) (go' . breakAfterEOL)
where
go' (this, next) = let acc' = acc ++ this in if null next then go acc' else yield acc' >> go next
breakAfterEOL :: ByteString -> (ByteString, ByteString)
breakAfterEOL = uncurry (\x -> maybe (x, "") (first (snoc x)) . uncons) . break (== 10)
$ stack ghc --package={classy-prelude-conduit,conduit-extra} -- -O2 example.hs -o example
$ for cmd in lazy lines chunks; do echo $cmd; time -p seq 10000000 | ./example $cmd > /dev/null; echo; done
lazy
real 2.99
user 3.06
sys 0.07
lines
real 3.30
user 3.36
sys 0.06
chunks
real 1.83
user 1.95
sys 0.06
(The results are consistent across multiple runs, and also hold for lines with several numbers).
So chunks is 1.6x faster than lines which is a bit faster than lazy. This means that conduits can be faster than plain bytestrings, but the overhead of conduit pipes is too heavy when you split chunks into short lines.
What I don't like about chunks approach is that it mixes both conduit and pure worlds, and it makes it harder to use it for more complex tasks.
The question is, did I miss a simple and elegant solution which would allow me to write efficient code in same fashion as with lines approach?
EDIT1: Per #Michael's suggestion I joined two mapC into one mapC (("000" ++). (snoc10)) in lines solution, to make number of pipes (.|) same between lines and chunks. This made it perform a bit better (down from 3.3s to 2.8s), but still significantly slower than chunks.
Also I tried older Conduit.Binary.lines which Michael suggested in comments, and it also improves performance a bit, by ~0.1s.
EDIT2: Fixed lineChunksC so it works with very small chunks, e.g.
> runConduitPure $ yield ("\nr\n\n"::ByteString) .| concatC .| mapC singleton .| lineChunksC .| sinkList
["\n","r\n","\n"]
My guess is that, for "lines", the mapC ("000" ++) .| mapC (`snoc` 10) part is doing a lot of work.
Concatenating several strict ByteStrings into another strict ByteString is expensive. Concatenating them into a lazy ByteString tends to be more efficient.
To avoid this cost, you can yield each part individually downstream as a strict ByteString (but be aware that then we aren't talking about "lines" anymore).
Alternatively, yield each transformed line as a lazy ByteString downstream.
The question is, did I miss a simple and elegant solution which would
allow me to write efficient code in same fashion as with lines
approach?
Some streaming libraries have an interesting feature: you can delimit lines in the stream, and manipulate them, without the need to materialize entire lines in memory at any point.
Here I'm using the streaming and streaming-bytestring packages, because I'm more familiar with them.
In module Data.ByteString.Streaming.Char8 of streaming-bytestring, we have the lines function:
lines :: Monad m => ByteString m r -> Stream (ByteString m) m r
lines turns a ByteString into a connected stream of ByteStrings at
divide at newline characters. The resulting strings do not contain
newlines. This is the genuinely streaming lines which only breaks
chunks, and thus never increases the use of memory.
The gist of it is that ByteString m r is already a streaming type! So this version of lines transforms a stream into a "stream of streams". And we can only reach the "next stream" (next line) by exhausting the "current stream" (current line).
Your "lines" example could be written as:
{-# language OverloadedStrings #-}
module Main where
import Control.Applicative ((*>))
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString.Streaming.Char8 as Q
main :: IO ()
main = Q.stdout
. Q.unlines
. S.maps (\line -> "000" *> line)
. Q.lines
$ Q.stdin
Related
I'm trying to write code in source -> transform -> sink style, for example:
let (|>) = flip ($)
repeat 1 |> take 5 |> sum |> print
But would like to do that using IO. I have this impression that my source can be an infinite list of IO actions, and each one gets evaluated once it is needed downstream. Something like this:
-- prints the number of lines entered before "quit" is entered
[getLine..] >>= takeWhile (/= "quit") >>= length >>= print
I think this is possible with the streaming libraries, but can it be done along the lines of what I'm proposing?
Using the repeatM, takeWhile and length_ functions from the streaming library:
import Streaming
import qualified Streaming.Prelude as S
count :: IO ()
count = do r <- S.length_ . S.takeWhile (/= "quit") . S.repeatM $ getLine
print r
This seems to be in that spirit:
let (|>) = flip ($)
let (.>) = flip (.)
getContents >>= lines .> takeWhile (/= "quit") .> length .> print
The issue here is that Monad is not the right abstraction for this, and attempting to do something like this results in a situation where referential transparency is broken.
Firstly, we can do a lazy IO read like so:
module Main where
import System.IO.Unsafe (unsafePerformIO)
import Control.Monad(forM_)
lazyIOSequence :: [IO a] -> IO [a]
lazyIOSequence = pure . go where
go :: [IO a] -> [a]
go (l:ls) = (unsafePerformIO l):(go ls)
main :: IO ()
main = do
l <- lazyIOSequence (repeat getLine)
forM_ l putStrLn
This when run will perform cat. It will read lines and output them. Everything works fine.
But consider changing the main function to this:
main :: IO ()
main = do
l <- lazyIOSequence (map (putStrLn . show) [1..])
putStrLn "Hello World"
This outputs Hello World only, as we didn't need to evaluate any of l. But now consider replacing the last line like the following:
main :: IO ()
main = do
x <- lazyIOSequence (map (putStrLn . show) [1..])
seq (head x) putStrLn "Hello World"
Same program, but the output is now:
1
Hello World
This is bad, we've changed the results of a program just by evaluating a value. This is not supposed to happen in Haskell, when you evaluate something it should just evaluate it, not change the outside world.
So if you restrict your IO actions to something like reading from a file nothing else is reading from, then you might be able to sensibly lazily evaluate things, because when you read from it in relation to all the other IO actions your program is taking doesn't matter. But you don't want to allow this for IO in general, because skipping actions or performing them in a different order can matter (and above, certainly does). Even in the reading a file lazily case, if something else in your program writes to the file, then whether you evaluate that list before or after the write action will affect the output of your program, which again, breaks referential transparency (because evaluation order shouldn't matter).
So for a restricted subset of IO actions, you can sensibly define Functor, Applicative and Monad on a stream type to work in a lazy way, but doing so in the IO Monad in general is a minefield and often just plain incorrect. Instead you want a specialised streaming type, and indeed Conduit defines Functor, Applicative and Monad on a lot of it's types so you can still use all your favourite functions.
I am a Haskell beginner and still learning about monad transformers.
I am trying to use the streaming-bytestring library to read a binary file, process chunks of bytes, and print the result as each chunk is processed. I believe this is the popular streaming library that provides an alternative to lazy bytestrings. It appears the authors copy-pasted the lazy bytestring documentation and added some arbitrary examples.
The examples mention runResourceT without going into any discussion of what it is or how to use it. It appears that should use runResourceT on any streaming-bytestring function that performs an action. That's fine, but what if I'm reading an infinite stream that processes chunks and prints them? Should I call runResourceT every time I want to process the chunk?
My code is something like this:
import qualified Data.ByteString.Streaming as BSS
import System.TimeIt
main = timeIt $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
and I'm unsure of how to organize processByteChunks as a recursive function that iterates through the binary file.
If I call runResourceT only once, it would read the infinite file BEFORE printing, right? That seems bad.
main = timeIt $ runResourceT $ processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
The ResourceT monad just cleans up resources in a timely fashion when you're finished with them. In this case, it will ensure the file handle opened by BSS.readFile is closed when the stream is consumed. (Unless the stream truly is infinite, in which case I guess it won't.)
In your application, you only want to call it once, since you don't want the file closed until you've read all the chunks. Don't worry -- it has nothing to do with the timing of output or anything like that.
Here's an example with a recursive processByteChunks that should work. It will read lazily and generate output as chunks are lazily read:
import Control.Monad.IO.Class
import Control.Monad.Trans.Resource
import qualified Data.ByteString.Streaming as BSS
import qualified Data.ByteString as BS
import System.TimeIt
main :: IO ()
main = timeIt $ runResourceT $
processByteChunks $ BSS.drop 100 $ BSS.readFile "filename"
processByteChunks :: MonadIO m => BSS.ByteString m () -> m ()
processByteChunks = go 0 0
where go len nulls stream = do
m <- BSS.unconsChunk stream
case m of
Just (bs, stream') -> do
let len' = len + BS.length bs
nulls' = nulls + BS.length (BS.filter (==0) bs)
liftIO $ print $ "cumulative length=" ++ show len'
++ ", nulls=" ++ show nulls'
go len' nulls' stream'
Nothing -> return ()
I am trying to write a solution for one of the Hackerrank problems. The challenge is to count elements in a list, the elements vary from 0 to 99, so it is possible to count them in linear time. Here is what I got:
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O3 #-}
module Main where
import Data.STRef
import Data.Foldable
import Control.Monad
import Control.Monad.ST
main = do
line1 <- getLine
line2 <- getLine
let
!ns = map read $ words line2 :: [Int]
res = runST $ do
refs <- forM [0..99] $ \i ->
newSTRef (0 :: Int)
traverse_ (\x -> modifySTRef' (refs !! x) (+1) ) ns
mapM (\ref -> readSTRef ref) refs
putStrLn . unwords . map show $ res
This code works but not fast enough to pass the last test case. Can someone recommend an improvement to it? (link to the problem)
This can be done as a one-liner using accumArray from Data.Array. Something like accumArray (+) 0 (0,99) . zip values $ repeat 1 where values is the input.
It appears to still not be fast enough, which is somewhat vexing. accumArray is more or less as efficient as possible for what it does. Testing on my system reveals the time for processing 1,000,000 input values to be about 1 second, even without compiling it, and that time is dominated by generating the random inputs. That's a far cry from the 5 seconds on the test site.. I have to wonder how overloaded that system is.
One problem you have is that you're looking up your STRefs in a list which means that you'll have to traverse O(n) steps for every lookup and modification. This can be alleviated by using something like Data.Map.Map which has O(log(n)) lookup and modification time.
You could also use a mutable Array or Vector for O(1) lookup/modification time in the ST monad. This is probably the fastest method.
I'm trying to make a conduit that's sort of a cross between takeWhile and isolate. That is, it will consume from the input and yield to the output until either the predicate no longer holds or it has reached the byte limit. I know the type signature will be
isolateWhile :: (Monad m) => Int -> (Word8 -> Bool) -> Conduit ByteString m ByteString
As an example of its use:
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import Control.Monad.Trans.Class
charToWord = fromIntegral . fromEnum
example :: Int -> Char -> IO ()
example limit upTo = do
untaken <- CB.sourceLbs "Hello, world!" $= conduit $$ CB.sinkLbs
putStrLn $ "Left " ++ show untaken
where
conduit = do
taken <- toConsumer $ isolateWhile limit (/= charToWord upTo) =$ CB.sinkLbs
lift $ putStrLn $ "Took " ++ show taken
CL.map id -- pass the rest through untouched
I expect that
ghci> example 5 'l'
Took "He"
Left "llo, world!"
ghci> example 5 'w'
Took "Hello"
Left ", world!"
However, the simplest possible definition of isolateWhile:
isolateWhile limit pred = CB.isolate limit =$= CB.takeWhile pred
yields
ghci> example 5 'l'
Took "He"
Left ", world!"
ghci> example 5 'w'
Took "Hello"
Left ", world!"
In other words, isolate will eat up the entire Hello, leaving He to takeWhile and discarding the llo. This data loss is undesirable for my application. However, it is notable that the second case yields the expected result.
If I swap the operands of =$= like so:
isolateWhile limit pred = CB.takeWhile pred =$= CB.isolate limit
Then
ghci> example 5 'l'
Took "He"
Left ", world!"
ghci> example 5 'w'
Took "Hello"
Left ""
Now I've fixed the first test, but broken the second one! This time, takeWhile will take whatever it needs and isolate will take a subset of that; but whatever takeWhile uses that isolate doesn't will be discarded, and this is undesirable.
Lastly, I tried:
isolateWhile limit pred = do
untaken <- CB.isolate limit =$= (CB.takeWhile pred >> CL.consume)
mapM_ leftover $ reverse untaken
This actually works! Whatever isolate accepts and takeWhile doesn't is consumed by the CL.consume and placed back into the stream with leftover. Unfortunately, this seems like a horrible kludge, and undesirably (although not unusably so) it will buffer up to limit bytes in memory only to put it back with leftover. That seems like a waste.
The only solution I can think of is to write it in terms of the primitives await, yield and leftover as takeWhile and isolate are themselves written. While this would solve all the problems without wasting much, it seems like there must be a better way.
Am I missing something, or is there really no better way to write this?
There's a known limitation in the current version of conduit: fusion always discards downstream leftovers, which is exactly what you're running into here. There are some discussions right now about an architecture to resolve this, but for the moment, writing your function in terms of the primitives is likely your best option.
Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
module Main where
import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad
type Record = BS.ByteString
importRecords :: String -> IO [Record]
importRecords filename = do
liftM (map importRecord.BS.lines) (BS.readFile filename)
importRecord :: BS.ByteString -> Record
importRecord txt = r
where
r = getField 26
getField f = BS.copy $ ((BS.split '\t' txt) !! f)
loopInput :: [Record] -> IO ()
loopInput jrs = do
putStrLn $ "Done" ++ (show $ last jrs)
hFlush stdout
x <- getLine
return ()
-- (***)
loopInput jrs
main = do
jrs <- importRecords "c:\\downloads\\lcg1m.txt"
loopInput jrs
Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
deriving (Eq, Ord, Show, Read)
This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
case BS.splitWith (== '-') bs of
(y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
_ -> Rec 0 0 0
rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evalute data when we convert from ByteString to Record, so lets use the parallel package and define an NFData instance from DeepSeq.
instance NFData Record where
rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:
main = do
jrs <- importRecords "./tabLines"
let jrs' = using jrs (evalList rdeepseq)
loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.
FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.
There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.