I'm parsing some rather large XML files with xml-conduit's streaming interface https://hackage.haskell.org/package/xml-conduit-1.8.0/docs/Text-XML-Stream-Parse.html#v:parseBytes but I'm seeing this memory buildup (here on a small test file):
where the top users are:
The actual data shouldn't take up that much heap – if I serialise and re-read, the resident memory use is kilobytes vs the megabytes here.
The minimal example I've managed to reproduce this with:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad
import Control.Monad.IO.Class
import Data.Conduit
import Data.Conduit.Binary (sourceFile)
import qualified Data.Conduit.List as CL
import Data.Text (Text)
import Text.XML.Stream.Parse
type Y = [(Text, Text)]
main :: IO ()
main = do
  res1 <- runConduitRes $
    sourceFile "test.xml"
    .| Text.XML.Stream.Parse.parseBytes def
    .| parseMain
    .| CL.foldM get []
  print res1

get :: (MonadIO m, Show a) => [a] -> [a] -> m [a]
get acc !vals = do
  liftIO $! print vals -- this oughta force it?
  return $! take 1 vals ++ acc

parseMain = void $ tagIgnoreAttrs "Period" parseDetails

parseDetails = many parseParam >>= yield

parseParam = tag' "param" parseParamAttrs $ \idAttr -> do
  value <- content
  return (idAttr, value)

parseParamAttrs = do
  idAttr <- requireAttr "id"
  attr "name"
  return idAttr
If I change get to just return ["hi"] or something, I don't get the buildup. So it seems the returned texts keep a reference to the larger text they were sliced from (zero-copy slicing, cf. the comment at https://hackage.haskell.org/package/text-0.11.2.0/docs/Data-Text.html#g:18), so the rest of the text can't be garbage collected even though we only use small parts of it.
Our fix is to use Data.Text.copy on any attributes we want to yield:
someattr <- requireAttr "n"
yield (T.copy someattr)
which lets us parse with nearly constant memory use.
(And we might consider using https://markkarpov.com/post/short-bs-and-text.html#shorttext if we want to save even more memory.)
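Applied to the minimal example above, that looks roughly like this (a sketch; copying the content value as well as the attribute is our assumption, since it is sliced from the input in the same way):

import qualified Data.Text as T

-- Copy the attribute and content slices so the input chunks they point into
-- can be garbage collected.
parseParam = tag' "param" parseParamAttrs $ \idAttr -> do
  value <- content
  return (T.copy idAttr, T.copy value)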
Related
These two programs do the same thing, but one runs 10x faster.
This takes approx. 10 seconds on my machine:
import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
theValueOne = B.singleton 1
main = replicateM_ 100000000 $ B.putStr theValueOne
The second version uses lazy output I/O. It finishes in about 1 second (roughly as fast as C):
import qualified Data.ByteString.Lazy as BL
main = BL.putStr $ BL.pack $ replicate 100000000 1
Question: Why is the non-lazy version so slow? More importantly, how can I make it fast? (I've tried recursion, forM, modifying the output buffer using hSetBuffering... Nothing has made a difference)
Note- This is more than just an academic question. The non-lazy version is an extremely simplified version of an executable my company uses in production, which is also slow in the same way. It would be nearly impossible to re-architect the larger program around the analogous lazy solution.
Updated: Added possible source of problem and a solution.
I don't think it has anything to do with lazy I/O. If you rewrite the strict I/O version to write two bytes at once:
theValueOne = B.singleton 1
main = replicateM_ 50000000 $ B.putStr (theValueOne <> theValueOne)
that halves the time. Write ten bytes at once:
theValueOne = B.singleton 1
main = replicateM_ 10000000 $ B.putStr (foldMap id (replicate 10 theValueOne))
and it's already faster than the lazy I/O version.
The issue is that there's a bit of overhead in a B.hPutStr call, much more than the overhead of a C fwrite call, and it's just not a particularly efficient way to write a single byte.
A good chunk of the overhead comes from the fact that Haskell I/O buffers have immutable metadata. Even though the buffer content itself is mutable, the pointers to valid data within the buffer are immutable, so writing a single byte requires a heap allocation of a new GHC.IO.Buffer.Buffer structure, which GHC can't optimize away.
One solution is to use a hand-crafted buffering structure with a mutable pointer. The following works, and it's about twice as fast as the lazy I/O version in the original question.
{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Data.IORef
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO
data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !(IORef Int)
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap <$> newIORef 0 <*> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
  n <- readIORef used
  withForeignPtr content $ \p -> hPutBuf handle p n
  writeIORef used 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@(WriteBuffer{..}) = do
  n <- readIORef used
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  writeIORef used n'
  when (n' == capacity) $
    flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b
Somewhat ironically, converting this to a version using an immutable counter and passing the WriteBuffer as state through foldM doubles the speed again, so it's about 4 times as fast as the lazy I/O version in the original question:
{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO
data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !Int
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap 0 <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO WriteBuffer
flushBuffer buf@WriteBuffer{..} = do
  withForeignPtr content $ \p -> hPutBuf handle p used
  return $ buf { used = 0 }

writeByte :: Word8 -> WriteBuffer -> IO WriteBuffer
writeByte w buf@(WriteBuffer{..}) = do
  withForeignPtr content $ \p -> poke (plusPtr p used) w
  let used' = used + 1
      buf' = buf { used = used' }
  if used' == capacity
    then flushBuffer buf'
    else return buf'

main :: IO ()
main = do
  b <- newBuffer stdout
  b' <- foldM (\s _ -> writeByte 1 s) b [(1::Int)..100000000]
  void (flushBuffer b')
The reason this one is so fast seems to be that GHC is able to optimize away the WriteBuffer constructor entirely from the fold and just pass around unboxed pointers and integers in the loop. My guess is that if I modified the mutable version above to avoid boxing and unboxing the integer in the used IORef, it would be similarly fast.
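For what it's worth, here is a rough, untested sketch of that guessed variant (reusing the extensions and imports from the snippet above; the primed names are just hypothetical stand-ins). The counter lives in a single-cell ForeignPtr Int instead of an IORef Int, so each write pokes a raw machine word rather than allocating a boxed Int:

-- Untested sketch of the guessed variant described above.
data WriteBuffer' = WriteBuffer'
  { handle'   :: !Handle
  , capacity' :: !Int
  , used'     :: !(ForeignPtr Int)    -- one-cell counter instead of IORef Int
  , content'  :: !(ForeignPtr Word8)
  }

newBuffer' :: Handle -> IO WriteBuffer'
newBuffer' h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  counter <- mallocForeignPtr
  withForeignPtr counter $ \p -> poke p 0
  WriteBuffer' h 4096 counter <$> mallocForeignPtrBytes 4096

writeByte' :: Word8 -> WriteBuffer' -> IO ()
writeByte' w WriteBuffer'{..} = do
  n <- withForeignPtr used' peek
  withForeignPtr content' $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  if n' == capacity'
    then do
      -- flush inline and reset the counter
      withForeignPtr content' $ \p -> hPutBuf handle' p n'
      withForeignPtr used' $ \p -> poke p 0
    else withForeignPtr used' $ \p -> poke p n'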
While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
  G.runGet G.getWord32be $ BL.fromStrict bs

-- should read a ByteString and return Word32
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
  mbs <- await
  case mbs of
    Just bs -> do
      case C.null bs of
        False -> do
          yield $ getWord32 bs
          leftover $ BS.drop 4 bs
          transform
        True -> return ()
    Nothing -> return ()
main :: IO ()
main = do
  filename <- fmap (!!0) getArgs -- should check length getArgs
  result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
  print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't want to go on to the second chunk of the byte stream, so is there another function that pulls the next chunk? In the examples I've seen (e.g. Conduit 101) this is not how it's done.
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using System.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
  filename : _ <- getArgs
  h <- openFile filename ReadMode
  s <- S.handleToInputStream h
  i <- binaryInputStream s :: IO (S.InputStream Word32)
  r <- outputToList $ S.connect i
  print $ last r
'Bad' means: very demanding in time and space, and it does not handle decode exceptions.
Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * (C.length bs `div` 4)) bs on the input bytestring.
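For reference, a rough sketch of the "manual" incremental-interface route mentioned above might look like the following (my sketch, not binary-conduit's actual source, assuming the question's imports are in scope; it silently drops an incomplete trailing word and only stubs out error handling):

transformManual :: Monad m => Conduit BS.ByteString m Word32
transformManual = loop freshDecoder
  where
    freshDecoder = G.runGetIncremental G.getWord32be

    -- Feed each incoming chunk to the current decoder state.
    loop decoder = do
      mbs <- await
      case mbs of
        Nothing -> return ()                  -- end of input; a partial word is dropped
        Just bs -> handle (G.pushChunk decoder bs)

    -- Emit every completed Word32 and restart the decoder on the leftover bytes.
    handle (G.Done rest _ w) = do
      yield w
      if BS.null rest
        then loop freshDecoder
        else handle (G.pushChunk freshDecoder rest)
    handle partial@(G.Partial _) = loop partial
    handle (G.Fail _ _ err)      = error err  -- replace with real error handling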
With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
    n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
    print n
This will fail if we have fewer than 4 bytes or otherwise fail to parse, but we can just as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
  Left r -> Nothing
  Right (_, off, w32) -> Just w32
The following program will then print the parses for the valid 4-byte sequences:
main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
    runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
      >-> P.map getMaybeWord32
      >-> P.concat -- here `concat` eliminates maybes
      >-> P.print
There are other ways of dealing with failed parses, of course.
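For instance (a sketch of one such alternative, my addition; it assumes the same imports as the program below), stopping the stream at the first failed parse instead of skipping it only needs one extra stage:

main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h ->
    runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
      >-> P.map getMaybeWord32
      >-> P.takeWhile (/= Nothing)   -- stop at the first failed parse
      >-> P.concat
      >-> P.print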
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((^.))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )
splitLazy :: (Monad m, Integral n) =>
  n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
  (bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
  return (BL.fromChunks bss, rest)

measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
  (lbs, rest) <- lift $ splitLazy 4 bs
  if BL.length lbs /= 4
    then rest >-> P.drain -- in fact it will be empty
    else do
      let w32 = G.runGet G.getWord32be lbs
      (lbs', rest') <- lift $ splitLazy w32 rest
      yield lbs'
      measureChunks rest'

main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
    runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet not runGetOrFail, but this is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.
Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where
import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)
-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString
getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
  let (w, rest) = BL.splitAt 4 bs
  case BL.length w of
    4 -> (Just w', rest) where
           w' = G.runGet G.getWord32be w
    _ -> (Nothing, BL.empty)

getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs

getBlob :: Get (Maybe BL.ByteString)
getBlob = do
  ml <- getWord32be
  case ml of
    Nothing -> return Nothing
    Just l -> do
      blob <- getLazyByteString (fromIntegral l :: Int64)
      return $ Just blob

runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs

main :: IO ()
main = do
  fname <- head <$> getArgs
  bs <- BL.readFile fname
  let ls = runGet loop bs where
        loop = whileJust getBlob return
  print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The python script that creates some random data for consumption by the above is here).
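As a sketch of that extension (my addition, not part of the original code), a variant of getBlob could refuse a truncated trailing blob instead of returning it short:

-- Sketch: treat a blob shorter than its announced length as end of input.
getBlobChecked :: Get (Maybe BL.ByteString)
getBlobChecked = do
  ml <- getWord32be
  case ml of
    Nothing -> return Nothing
    Just l -> do
      blob <- getLazyByteString (fromIntegral l :: Int64)
      return $ if BL.length blob == fromIntegral l
                 then Just blob
                 else Nothing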
I've been trying to use the Conduit library to do some simple I/O involving files, but I'm having a hard time.
I have a text file containing nothing but a few digits such as 1234. I have a function that reads the file using readFile (no conduits), and returns Maybe Int (Nothing is returned when the file actually doesn't exist). I'm trying to write a version of this function that uses conduits, and I just can't figure it out.
Here is what I have:
import Control.Monad.Trans.Resource
import Data.Conduit
import Data.Functor
import System.Directory
import qualified Data.ByteString.Char8 as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
myFile :: FilePath
myFile = "numberFile"
withoutConduit :: IO (Maybe Int)
withoutConduit = do
  doesExist <- doesFileExist myFile
  if doesExist
    then Just . read <$> readFile myFile
    else return Nothing

withConduit :: IO (Maybe Int)
withConduit = do
  doesExist <- doesFileExist myFile
  if doesExist
    then runResourceT $ source $$ conduit =$ sink
    else return Nothing
  where
    source :: Source (ResourceT IO) B.ByteString
    source = CB.sourceFile myFile

    conduit :: Conduit B.ByteString (ResourceT IO) T.Text
    conduit = CT.decodeUtf8

    sink :: Sink T.Text (ResourceT IO) (Maybe Int)
    sink = awaitForever $ \txt -> let num = read . T.unpack $ txt :: Int
                                  in -- I don't know what to do here...
Could someone please help me complete the sink function?
Thanks!
This isn't really a good example of where conduit actually provides a lot of value, at least not the way you're looking at it right now. Specifically, you're trying to use the read function, which requires that the entire value be in memory. Additionally, your current error handling behavior is a bit loose: essentially, you're just going to get a read: no parse error if there's anything unexpected in the content.
However, there is a way we can make this meaningful in conduit: by parsing the ByteString byte-by-byte ourselves and avoiding the read function. Fortunately, this pattern falls into a standard left fold, which the conduit-combinators package provides a perfect function for (an element-wise left fold in a conduit, aka foldlCE):
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.Word8
import qualified Data.ByteString as S
sinkInt :: Monad m => Consumer S.ByteString m Int
sinkInt =
  foldlCE go 0
  where
    go total w
      | _0 <= w && w <= _9 =
          total * 10 + (fromIntegral $ w - _0)
      | otherwise = error $ "Invalid byte: " ++ show w

main :: IO ()
main = do
  x <- yieldMany ["1234", "5678"] $$ sinkInt
  print x
There are plenty of caveats that go along with this: it will simply throw an exception if there are unexpected bytes, and it doesn't handle integer overflow at all (though fixing that is just a matter of replacing Int with Integer). It's important to note that, since the in-memory string representation of a valid 32- or 64-bit int is always going to be tiny, conduit is overkill for this problem, though I hope that this code gives some guidance on how to generally write conduit code.
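For completeness, a hedged sketch of how sinkInt might be wired back into the questioner's file-based function (withConduit', the filterCE stage, and the digit predicate are my assumptions; myFile and doesFileExist come from the question's module):

-- Sketch: filterCE drops non-digit bytes (e.g. a trailing newline) so they
-- never reach sinkInt's error branch.
withConduit' :: IO (Maybe Int)
withConduit' = do
  doesExist <- doesFileExist myFile
  if doesExist
    then Just <$> runResourceT (sourceFile myFile $$ filterCE isDigitByte =$ sinkInt)
    else return Nothing
  where
    isDigitByte w = _0 <= w && w <= _9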
I am trying to run a Parsec parser over a whole bunch of small files, and getting an error saying I have too many open files. I understand that I need to use strict IO, but I'm not sure how to do that. This is the problematic code:
files = getDirectoryContents historyFolder
hands :: IO [Either ParseError [Hand]]
hands = join $ sequence <$> parseFromFile (many hand) <<$>> files
Note: my <<$>> function is this:
(<<$>>) :: (Functor f1, Functor f2) => (a -> b) -> f1 (f2 a) -> f1 (f2 b)
a <<$>> b = (a <$>) <$> b
I don't know what your parseFromFile function looks like right now (probably a good idea to include that in the question), but I'm guessing you're using Prelude.readFile, which as @Markus1189 points out includes lazy I/O. To get to strict I/O, you just need a strict readFile, such as Data.Text.IO.readFile.
A streaming data library like pipes or conduit would allow you to avoid reading the entire file into memory at once, though, to my knowledge, parsec doesn't provide a streaming interface to allow this to happen. attoparsec, on the other hand, does include such a streaming interface, and both pipes and conduit have attoparsec adapter libraries (e.g., Data.Conduit.Attoparsec).
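As a rough illustration of that streaming route (a sketch only, assuming a recent conduit; countLines is just a stand-in, since the question's hand parser would have to be rewritten in attoparsec):

import Conduit (runConduitRes, sourceFile, (.|))
import Data.Conduit.Attoparsec (sinkParser)
import qualified Data.Attoparsec.ByteString.Char8 as A

-- Stand-in parser: counts complete lines without holding the file in memory.
countLines :: A.Parser Int
countLines = length <$> A.many' (A.manyTill A.anyChar A.endOfLine)

parseFileStreaming :: FilePath -> IO Int
parseFileStreaming path =
  runConduitRes $ sourceFile path .| sinkParser countLines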
tl;dr: You probably just need the following helper function:
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
readFileStrict :: FilePath -> IO String
readFileStrict = fmap T.unpack . TIO.readFile
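With that helper, a strict drop-in for parseFromFile could look like this (a sketch of my own; it assumes the question's parser is a String-based Parser from parsec):

import Text.Parsec (ParseError, parse)
import Text.Parsec.String (Parser)

-- Read the whole file strictly (closing the handle), then parse in memory.
parseFromFileStrict :: Parser a -> FilePath -> IO (Either ParseError a)
parseFromFileStrict p path = parse p path <$> readFileStrict path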
You can use the BangPatterns language extension to enforce strictness of your IO operations, in this case parseFromFile. For example, the function hands can be changed to:
hands :: [String] -> IO [Either ParseError [Hand]]
hands [] = return []
hands (f:fs) = do
  !res <- parseFromFile hand f
  others <- hands fs
  return (res:others)
This version of hands waits for the results of each call of parseFromFile before moving to the next file in the list. Once you have this, the problem should disappear. A full working toy example is:
{-# LANGUAGE BangPatterns #-}
import Control.Monad
import Control.Applicative hiding (many)
import Data.Char (isDigit)
import System.Directory (getDirectoryContents)
import System.FilePath ((</>))
import Text.ParserCombinators.Parsec
data Hand = Hand Int deriving Show

hand :: GenParser Char st [Hand]
hand = do
  string "I'm file "
  num <- many digit
  newline
  eof
  return [Hand $ read num]

files :: IO [String]
files = map ("manyfiles" </>)
      . filter (all isDigit) <$> getDirectoryContents "manyfiles"

hands :: [String] -> IO [Either ParseError [Hand]]
hands [] = return []
hands (f:fs) = do
  !res <- parseFromFile hand f
  others <- hands fs
  return (res:others)

main :: IO ()
main = do
  results <- files >>= hands
  print results
I am trying to come up with an equivalent of "wc -l" using the Haskell iteratee library. Below is the code for "wc" (which just counts the bytes, similar to the code in the iteratee example on hackage), and it runs very fast:
{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
length1 = liftI (step 0)
  where
    step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs))
    step !i stream = idone i stream
{-# INLINE length1 #-}

main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
{- Time measured on a linux x86 box:
$ time ./test ## above haskell compiled code
4950996
real 0m0.013s
user 0m0.004s
sys 0m0.007s
$ time wc -c /usr/share/dict/words
4950996 /usr/share/dict/words
real 0m0.003s
user 0m0.000s
sys 0m0.002s
-}
Now, how do you extend it to count the number of lines so that it, too, runs fast? I did a version using Prelude.filter to keep only "\n" before taking the length, but it is slower than the Linux wc -l because of excessive memory use and GC (lazy evaluation, I guess). So I wrote another version using Data.ListLike.filter, but it won't compile because it doesn't type check (help here would be appreciated):
{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
import Data.Char
import Data.ByteString.Char8 (pack)
numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
numlines = liftI $ step 0
  where
    step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length $ LL.filter (\x -> x == Data.ByteString.Char8.pack "\n") xs))
    step !i stream = idone i stream
{-# INLINE numlines #-}

main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
So I did some experimenting and I got a "wc -l" that is only about twice as slow as the native wc -l. This is better performance than even the wc -c version shown above.
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.ByteString.Char8 as BS
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import Control.Monad.IO.Class (liftIO)
import Data.Int
numlines :: Int64 -> E.Iteratee BS.ByteString IO ()
numlines n = do
  chunk <- EB.take 1024
  case chunk of
    "" -> do liftIO $ print n
             return ()
    a  -> do let ct = BSL.count '\n' a
             numlines (n + ct)

main = do
  let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines 0
  E.run_ i
Running it vs. native:
Eriks-MacBook-Air:skunk erikhinton$ time wc -l "/usr/share/dict/words"
235886 /usr/share/dict/words
real 0m0.009s
user 0m0.006s
sys 0m0.002s
Eriks-MacBook-Air:skunk erikhinton$ time ./wcl
235886
real 0m0.019s
user 0m0.013s
sys 0m0.005s
[EDIT]
Here's an even faster, smaller footprint and far more concise/expressive way of doing it. These enumerators are starting to get fun.
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.ByteString.Char8 as BS
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import qualified Data.Enumerator.List as EL
import Control.Monad.IO.Class (liftIO)
import Data.Int
numlines :: E.Iteratee BS.ByteString IO ()
numlines = do
  num <- EL.fold (\n b -> (BS.count '\n' b) + n) 0
  liftIO . print $ num

main = do
  let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines
  E.run_ i
And the timing
Eriks-MacBook-Air:skunk erikhinton$ time ./wcl2
235886
real 0m0.015s
user 0m0.010s
sys 0m0.004s
There are a lot of good answers already; I have very little to offer performance-wise but a few style points.
First, I would write it this way:
import Prelude as P
import Data.Iteratee
import qualified Data.Iteratee as I
import qualified Data.Iteratee.IO as I
import qualified Data.ByteString as B
import Data.Char
import System.Environment
-- numLines has a concrete stream type so it's not necessary to provide an
-- annotation later. It could have a more general type.
numLines :: Monad m => I.Iteratee B.ByteString m Int
numLines = I.foldl' step 0
  where
    --step :: Int -> Word8 -> Int
    step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc

main = do
  f:_ <- getArgs
  words <- run =<< I.enumFile 65536 f numLines
  print words
The biggest difference is that this uses Data.Iteratee.ListLike.foldl'. Note that only the individual stream elements matter to the step function, not the stream type. It's exactly the same function as you would use with e.g. Data.ByteString.Lazy.foldl'.
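To illustrate that point (my sketch, not from the answer above), the very same step function drops into a plain lazy ByteString fold:

import qualified Data.ByteString.Lazy as BL
import Data.Char (ord)

-- Same step function as above, no iteratees involved.
countNewlines :: BL.ByteString -> Int
countNewlines = BL.foldl' step 0
  where
    step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc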
Using foldl' also means that you don't need to manually write iteratees with liftI. I would discourage users from doing so unless absolutely necessary. The result is usually longer and harder to maintain with little to no benefit.
Finally, I've increased the buffer size significantly. On my system this is marginally faster than the enumerator default of 4096, which is again marginally faster (with iteratee) than your choice of 1024. YMMV with this setting, of course.
If you're reading ByteString chunks, you can use the count function from Data.ByteString; the relevant step would then be
step !i (Chunk xs) = liftI (step $ i + count 10 xs)
(perhaps with a fromIntegral). Data.ByteString.count is pretty fast, so that shouldn't be too much slower than wc -l.
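Plugged into the question's skeleton, that suggestion might look like this (a sketch; the 65536 buffer size just follows the earlier style advice, and 10 is the byte value of '\n'):

{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.Iteratee.IO
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

numlines :: Monad m => Iteratee ByteString m Int
numlines = liftI (step 0)
  where
    -- B.count 10 xs is the number of newline (0x0A) bytes in the chunk.
    step !i (Chunk xs) = liftI (step (i + B.count 10 xs))
    step !i stream     = idone i stream

main :: IO ()
main = do
  i' <- enumFile 65536 "/usr/share/dict/words" numlines
  run i' >>= print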
I figured out how to fix the type error. The key to fixing the type error is understanding the relationship between Data.ListLike.filter and the ByteString input that is being passed to that filter. Here is the type of Data.ListLike.filter:
Data.ListLike.filter
:: Data.ListLike.Base.ListLike full item =>
(item -> Bool) -> full -> full
full refers to the stream in the context of an enumerator/iteratee, if I understand it correctly. item refers to the element of the stream.
Now, if we want to filter on newline in the input file, we have to know the type of the input file stream and the type of the elements in that stream. In this case, the input file is being read as a ByteString stream. ByteString is documented as a space-efficient representation of a Word8 vector. So, the item type here is Word8.
So, when we write the filter in the step function, we have to make sure that the Bool operation is defined for Word8, since that is the type of the item being passed to the filter (as explained above). We are filtering for newline, so a predicate like the one below, which builds the Word8 representation of newline and checks it for equality against x (of type Word8), should work:
\x -> x == Data.ByteString.Internal.c2w '\n'
There is still one more missing piece: for some reason, the compiler (v7.0.3, Mac) is unable to deduce the type of el in the numlines type signature (if anyone has ideas on why that is, please do discuss). So, telling it explicitly that it is Word8 solves the compilation issue:
numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a
Full code below - it compiles, and runs quite fast.
{-# LANGUAGE BangPatterns,FlexibleContexts #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
import GHC.Word (Word8)
import Data.ByteString.Internal (c2w)
numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a
numlines = liftI $ step 0
  where
    step !i (Chunk xs) =
      let newline = c2w '\n'
      in liftI (step $ i + fromIntegral (LL.length $ LL.filter (\x -> x == newline) xs))
    step !i stream = idone i stream
{-# INLINE numlines #-}

main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
{- Time to run on mac OSX:
$ time ./test ## above compiled program: ghc --make -O2 test.hs
235886
real 0m0.011s
user 0m0.007s
sys 0m0.004s
$ time wc -l /usr/share/dict/words
235886 /usr/share/dict/words
real 0m0.005s
user 0m0.002s
sys 0m0.002s
-}