I wrote a parser for a large CSV file which works on a smaller subset but runs out of memory for ~1.5m lines (the actual file).
After initially parsing all elements into a list (using manyTill), I instead used the parser state to store them in a single binary search tree, and this worked for the large file.
I have since split the "element type" into three separate types and want to store each in its own tree, resulting in three trees of different types.
This version, though, only works for the small test file and runs out of memory on the larger one.
import qualified Data.Tree.AVL as AVL
import qualified Text.ParserCombinators.Parsec as Parsec
----
data ENW = ENW (AVL.AVL Extent) (AVL.AVL Node) (AVL.AVL Way)
---- used to be Element = Extent | Node | Way in a (Tree Element) - this worked
csvParser :: Parsec String ENW ENW
csvParser = do (Parsec.manyTill (parseL) Parsec.eof) >> Parsec.getState
  where parseL = parseLine >> ((Parsec.newline >> return ()) <|> Parsec.eof)
parseLine :: Parsec String ENW ()
parseLine = parseNode <|> parseWay <|> parseExtents
parseNode :: Parsec String ENW ()
parseNode = Parsec.string "node" *> (flip addNode <$> (Node <$> identifier <*> float <*> float)) >>= Parsec.updateState
  where identifier = Parsec.tab *> (read <$> Parsec.many1 Parsec.digit)
        float      = Parsec.tab *> (read <$> parseFloat)
addNode :: ENW -> Node -> ENW
addNode (ENW e n w) node = (ENW e (AVL.push (sndCC node) node n) w)
parseWay and parseExtent follow the same pattern, and the whole thing is started with
Parsec.runParser csvParser (ENW AVL.empty AVL.empty AVL.empty) "" input
I don't understand how using three smaller trees instead of a single large one can cause memory issues.
Do you have a good reason to not use Cassava? It can be used to stream CSV data and is probably more robust than an ad hoc CSV parser. My own experience with it has shown it has excellent performance and can be easily extended to parse your own types.
Edit: It also looks like you're working with tab-separated data, not comma-separated data, but Cassava lets you specify which delimiter to split columns on. It also appears that the data may differ on each line, so you may need to use Cassava's 'raw' format, which returns a Vector ByteString for each line that you can then parse based on the first element.
I've never seen anyone use the AVL tree package before; is there a good reason you aren't using more standard structures? That package is quite old (last updated in 2008) and more recent packages are likely to perform better.
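For illustration only, here is a rough sketch of the raw, tab-delimited decoding described above. The dispatch step is only hinted at in a comment, and this uses the plain decodeWith rather than the streaming interface (Data.Csv.Streaming offers an incremental variant taking the same options):
import Data.Char (ord)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
import Data.Csv (DecodeOptions(..), HasHeader(..), decodeWith, defaultDecodeOptions)

tabOptions :: DecodeOptions
tabOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

-- Each row comes back as a raw Vector ByteString; you would dispatch on the
-- first field ("node", "way", "extent") and build the three trees from there.
decodeRows :: BL.ByteString -> Either String (V.Vector (V.Vector B.ByteString))
decodeRows = decodeWith tabOptions NoHeader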
Related
I have a strange whim. Suppose I have something like this:
data Statement = StatementType Stuff Source
Now I want to parse such a statement, parse all the stuff, and afterwards put all the characters that I've processed (for this particular statement) into the resulting data structure. For some reason.
Is it possible, and if yes, how to accomplish that?
In general this is not possible. parsec does not expect a lot from its stream type; in particular, there is no way to efficiently split a stream.
But for a concrete stream type (e.g. String, or [a], or ByteString) a hack like this would work:
parseWithSource :: Parsec [c] u a -> Parsec [c] u ([c], a)
parseWithSource p = do
  input <- getInput
  a <- p
  input' <- getInput
  return (take (length input - length input') input, a)
This solution relies on the function getInput, which returns the current input. So we get the input twice, before and after parsing; this gives us the exact number of consumed elements, and knowing that, we can take those elements from the original input.
Here you can see it in action:
*Main Text.Parsec> parseTest (between (char 'x') (char 'x') (parseWithSource ((read :: String -> Int) `fmap` many1 digit))) "x1234x"
("1234",1234)
But you should also look into attoparsec, as it properly supports this functionality with the match function.
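For comparison, a minimal sketch of the attoparsec version: match runs a parser and also hands back the exact bytes that parser consumed.
import Data.ByteString (ByteString)
import Data.Attoparsec.ByteString (match)
import Data.Attoparsec.ByteString.Char8 (Parser, parseOnly, decimal, char)

-- match returns the consumed input alongside the parser's result
intWithSource :: Parser (ByteString, Int)
intWithSource = match decimal

-- e.g. parseOnly (char 'x' *> intWithSource <* char 'x') "x1234x"
--        == Right ("1234", 1234)     (with OverloadedStrings for the literal)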
Consider the code below taken from a working example I've built to help me learn Haskell. This code parses a CSV file containing stock quotes downloaded from Yahoo into a nice simple list of bars with which I can then work.
My question: how can I write a function that will take a file name as its parameter and return an OHLCBarList so that the first four lines inside main can be properly encapsulated?
In other words, how can I implement (without getting all sorts of errors about IO stuff) the function whose type would be
getBarsFromFile :: Filename -> OHLCBarList
so that the grunt work that was being done in the first four lines of main can be properly encapsulated?
I've tried to do this myself but with my limited Haskell knowledge, I'm failing miserably.
{-# LANGUAGE ScopedTypeVariables #-}
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import Data.Attoparsec.ByteString.Char8 (Parser, parseOnly, char, double, decimal)
import Data.Time.Clock (UTCTime)
type Filename = String
getContentsOfFile :: Filename -> IO BS.ByteString
barParser :: Parser Bar
barParser = do
  time <- timeParser
  char ','
  open <- double
  char ','
  high <- double
  char ','
  low <- double
  char ','
  close <- double
  char ','
  volume <- decimal
  char ','
  return $ Bar Bar1Day time open high low close volume
type OHLCBar = (UTCTime, Double, Double, Double, Double)
type OHLCBarList = [OHLCBar]
barsToBarList :: [Either String Bar] -> OHLCBarList
main :: IO ()
main = do
  contents :: C.ByteString <- getContentsOfFile "PriceData/Daily/yhoo1.csv" --PriceData/Daily/Yhoo.csv"
  let lineList :: [C.ByteString] = C.lines contents                -- Break the contents into a list of lines
  let bars :: [Either String Bar] = map (parseOnly barParser) lineList -- Using attoparsec
  let ohlcBarList :: OHLCBarList = barsToBarList bars              -- Now I have a nice simple list of tuples with which to work
  -- Now I can do simple operations like
  print $ ohlcBarList !! 0
If you really want your function to have type Filename -> OHLCBarList, it can't be done.* Reading the contents of a file is an IO operation, and Haskell's IO monad is specifically designed so that values in the IO monad can never leave. If this restriction were broken, it would (in general) mess with a lot of things. Instead of doing this, you have two options: make the type of getBarsFromFile be Filename -> IO OHLCBarList — thus essentially copying the first four lines of main — or write a function with type C.ByteString -> OHLCBarList that the output of getContentsOfFile can be piped through to encapsulate lines 2 through 4 of main.
* Technically, it can be done, but you really, really, really shouldn't even try, especially if you're new to Haskell.
Others have explained that the correct type of your function has to be Filename -> IO OHLCBarList; I'd like to try to give you some insight as to why the compiler imposes this draconian measure on you.
Imperative programming is all about managing state: "do certain operations to certain bits of memory in sequence". When they grow large, procedural programs become brittle; we need a way of limiting the scope of state changes. OO programs encapsulate state in classes but the paradigm is not fundamentally different: you can call the same method twice and get different results. The output of the method depends on the (hidden) state of the object.
Functional programming goes all the way and bans mutable state entirely. A Haskell function, when called with certain inputs, will always produce the same output. Simple examples of
pure functions are mathematical operators like + and *, or most of the list-processing functions like map. Pure functions are all about the inputs and outputs, not managing internal state.
This allows the compiler to be very smart in optimising your program (for example, it can safely collapse duplicated code for you), and helps the programmer not to make mistakes: you can't put the system in an invalid state if there is none! We like pure functions.
The exception to the rule is IO. Code that performs IO is impure by definition: you could call getLine a hundred times and never get the same result, because it depends on what the user typed. Haskell handles this using the type system: all impure functions are marked with the IO type. IO can be thought of as a dependency on the state of the real world, sort of like World -> (NewWorld, a).
To summarise: pure functions are good because they are easy to reason about; this is why Haskell makes functions pure by default. Any impure code has to be labelled as such with an IO type signature; this tells the compiler and the reader to be careful with this function. So your function which reads from a file (a fundamentally impure action) but returns a pure value can't exist.
Addendum in response to your comment
You can still write pure functions to operate on data that was obtained impurely. Consider the following straw-man:
main :: IO ()
main = do
  putStrLn "Enter the numbers you want me to process, separated by spaces"
  line <- getLine
  let numberStrings = words line
  let numbers = map read numberStrings
  putStrLn $ "The result of the calculation is " ++ (show $ foldr1 (*) numbers + 10)
Lots of code inside IO here. Let's extract some functions:
import Prelude hiding (product) -- we define our own product below, so hide the Prelude one

main :: IO ()
main = do
  putStrLn "Enter the numbers you want me to process, separated by spaces"
  result <- fmap processLine getLine -- fmap :: (a -> b) -> IO a -> IO b
                                     -- runs an impure result through a pure function
                                     -- without leaving IO
  putStrLn $ "The result of the calculation is " ++ result

processLine :: String -> String -- look ma, no IO!
processLine = show . calculate . readNumbers

readNumbers :: String -> [Int]
readNumbers = map read . words

calculate :: [Int] -> Int
calculate numbers = product numbers + 10

product :: [Int] -> Int
product = foldr1 (*)
I've pulled logic out of main into pure functions which are easier to read, easier for the compiler to optimise, and more reusable (and so more testable). The program as a whole still lives inside IO because the data is obtained impurely (see the last part of this answer for a more thorough treatment of this argument). Impure data can be piped through pure functions using fmap and other combinators; you should try to put as little logic in main as possible.
Your code does seem to be most of the way there; as others have suggested you could extract lines 2-4 of your main into another function.
In other words, how can I implement (without getting all sorts of errors about IO stuff) the function whose type would be
getBarsFromFile :: Filename -> OHLCBarList
so that the grunt work that was being done in the first four lines of main can be properly encapsulated?
You cannot do this without getting all sorts of errors about IO stuff, because this type for getBarsFromFile is missing an IO. That is probably what the errors about IO stuff are trying to tell you. Did you try to understand and fix the errors?
In your situation, I would start by abstracting over the second to fourth line of your main in a function:
parseBars :: ByteString -> OHLCBarList
And then I would combine this function with getContentsOfFile to get:
getBarsFromFile :: FilePath -> IO OHLCBarList
This I would call in main.
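A sketch of that refactoring, reusing the names from the question (barParser, barsToBarList, getContentsOfFile and the C alias for Data.ByteString.Char8 are assumed to be exactly the ones defined there):
parseBars :: C.ByteString -> OHLCBarList
parseBars = barsToBarList . map (parseOnly barParser) . C.lines

getBarsFromFile :: FilePath -> IO OHLCBarList
getBarsFromFile filename = fmap parseBars (getContentsOfFile filename)

main :: IO ()
main = do
  ohlcBarList <- getBarsFromFile "PriceData/Daily/yhoo1.csv"
  print $ ohlcBarList !! 0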
I'm learning Haskell and as an exercise I'm trying to convert the read_from function in the following code to Haskell. It is taken from Peter Norvig's Scheme interpreter.
Is there a straightforward way do this?
def read(s):
    "Read a Scheme expression from a string."
    return read_from(tokenize(s))

parse = read

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(',' ( ').replace(')',' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0) # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)
There is a straightforward way to "transliterate" Python into Haskell. This can be done by clever usage of monad transformers, which sounds scary, but it's really not. You see, due to purity, in Haskell when you want to use effects such as mutable state (e.g. the append and pop operations are performing mutation) or exceptions, you have to make it a little more explicit. Let's start at the top.
parse :: String -> SchemeExpr
parse s = readFrom (tokenize s)
The Python docstring said "Read a Scheme expression from a string", so I just took the liberty of encoding this as the type signature (String -> SchemeExpr). That docstring becomes obsolete because the type conveys the same information. Now... what is a SchemeExpr? According to your code, a scheme expression can be an int, float, symbol, or list of scheme expressions. Let's create a data type that represents these options.
data SchemeExpr
  = SInt Int
  | SFloat Float
  | SSymbol String
  | SList [SchemeExpr]
  deriving (Eq, Show)
In order to tell Haskell that the Int we are dealing with should be treated as a SchemeExpr, we need to tag it with SInt. Likewise with the other possibilities. Let's move on to tokenize.
tokenize :: String -> [Token]
Again, the docstring turns into a type signature: turn a String into a list of Tokens. Well, what's a Token? If you look at the code, you'll notice that the left and right paren characters are apparently special tokens, which signal particular behaviors. Anything else is... unspecial. While we could create a data type to more clearly distinguish parens from other tokens, let's just use Strings, to stick a little closer to the original Python code.
type Token = String
Now let's try writing tokenize. First, let's write a quick little operator for making function chaining look a bit more like Python. In Haskell, you can define your own operators.
(|>) :: a -> (a -> b) -> b
x |> f = f x
tokenize s = s |> replace "(" " ( "
               |> replace ")" " ) "
               |> words
words is Haskell's version of split. However, Haskell has no pre-cooked version of replace that I know of. Here's one that should do the trick:
-- add imports to top of file
import Data.List.Split (splitOn)
import Data.List (intercalate)
replace :: String -> String -> String -> String
replace old new s = s |> splitOn old
|> intercalate new
If you read the docs for splitOn and intercalate, this simple algorithm should make perfect sense. Haskellers would typically write this as replace old new = intercalate new . splitOn old, but I used |> here for easier Python audience understanding.
Note that replace takes three arguments, but above I only invoked it with two. In Haskell you can partially apply any function, which is pretty neat. |> works sort of like the unix pipe, if you couldn't tell, except with more type safety.
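For example, loading these definitions into GHCi, the pipeline behaves like the Python tokenize:
*Main> tokenize "(define (id x) x)"
["(","define","(","id","x",")","x",")"]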
Still with me? Let's skip over to atom. That nested logic is a bit ugly, so let's try a slightly different approach to clean it up. We'll use the Either type for a much nicer presentation.
atom :: Token -> SchemeExpr
atom s = Left s |> tryReadInto SInt
                |> tryReadInto SFloat
                |> orElse (SSymbol s)
Haskell doesn't have the automagical coercion functions int and float, so instead we will build tryReadInto. Here's how it works: we're going to thread Either values around. An Either value is either a Left or a Right. Conventionally, Left is used to signal error or failure, while Right signals success or completion. In Haskell, to simulate the Python-esque function call chaining, you just place the "self" argument as the last one.
import Safe (readMay) -- from the "safe" package: a read that returns Maybe instead of crashing

tryReadInto :: Read a => (a -> b) -> Either String b -> Either String b
tryReadInto f (Right x) = Right x
tryReadInto f (Left s) = case readMay s of
  Just x -> Right (f x)
  Nothing -> Left s

orElse :: a -> Either err a -> a
orElse a (Left _) = a
orElse _ (Right a) = a
tryReadInto relies on type inference in order to determine which type it is trying to parse the string into. If the parse fails, it simply reproduces the same string in the Left position. If it succeeds, then it performs whatever function is desired and places the result in the Right position. orElse allows us to eliminate the Either by supplying a value in case the former computations failed. Can you see how Either acts as a replacement for exceptions here? Since the ValueErrors in the Python code are always caught inside the function itself, we know that atom will never raise an exception. Similarly, in the Haskell code, even though we used Either on the inside of the function, the interface that we expose is pure: Token -> SchemeExpr, no outwardly-visible side effects.
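For example, in GHCi:
*Main> map atom ["42", "3.14", "foo"]
[SInt 42,SFloat 3.14,SSymbol "foo"]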
OK, let's move on to read_from. First, ask yourself the question: what side effects does this function have? It mutates its argument tokens via pop, and it has internal mutation on the list named L. It also raises the SyntaxError exception. At this point, most Haskellers will be throwing up their hands saying "oh noes! side effects! gross!" But the truth is that Haskellers use side effects all the time as well. We just call them "monads" in order to scare people away and avoid success at all costs. Mutation can be accomplished with the State monad, and exceptions with the Either monad (surprise!). We will want to use both at the same time, so we'll in fact use "monad transformers", which I'll explain in a bit. It's not that scary, once you learn to see past the cruft.
First, a few utilities. These are just some simple plumbing operations. raise will let us "raise exceptions" as in Python, and whileM will let us write a while loop as in Python. For the latter, we simply have to make it explicit in what order the effects should happen: first perform the effect to compute the condition, then if it's True, perform the effects of the body and loop again.
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)

raise = lift . Left

whileM :: Monad m => m Bool -> m () -> m ()
whileM mb m = do
  b <- mb
  if b
    then m >> whileM mb m
    else return ()
We again want to expose a pure interface. However, there is a chance that there will be a SyntaxError, so we will indicate in the type signature that the result will be either a SchemeExpr or a SyntaxError. This is reminiscent of how in Java you can annotate which exceptions a method will raise. Note that the type signature of parse has to change as well, since it might raise the SyntaxError.
data SyntaxError = SyntaxError String
  deriving (Show)
parse :: String -> Either SyntaxError SchemeExpr
readFrom :: [Token] -> Either SyntaxError SchemeExpr
readFrom = evalStateT readFrom'
We are going to perform a stateful computation on the token list that is passed in. Unlike the Python, however, we are not going to be rude to the caller and mutate the very list passed to us. Instead, we will establish our own state space and initialize it to the token list we are given. We will use do notation, which provides syntactic sugar to make it look like we're programming imperatively. The StateT monad transformer gives us the get, put, and modify state operations.
readFrom' :: StateT [Token] (Either SyntaxError) SchemeExpr
readFrom' = do
  tokens <- get
  case tokens of
    [] -> raise (SyntaxError "unexpected EOF while reading")
    (token:tokens') -> do
      put tokens' -- here we overwrite the state with the "rest" of the tokens
      case token of
        "(" -> (SList . reverse) `fmap` execStateT readWithList []
        ")" -> raise (SyntaxError "unexpected close paren")
        _   -> return (atom token)
I've broken out the readWithList portion into a separate chunk of code,
because I want you to see the type signature. This portion of code introduces
a new scope, so we simply layer another StateT on top of the monad stack
that we had before. Now, the get, put, and modify operations refer
to the thing called L in the Python code. If we want to perform these operations
on the tokens, then we can simply preface the operation with lift in order
to strip away one layer of the monad stack.
readWithList :: StateT [SchemeExpr] (StateT [Token] (Either SyntaxError)) ()
readWithList = do
  whileM ((\toks -> toks !! 0 /= ")") `fmap` lift get) $ do
    innerExpr <- lift readFrom'
    modify (innerExpr:)
  lift $ modify (drop 1) -- pop off the ")", as tokens.pop(0) does after the Python while loop
In Haskell, appending to the end of a list is inefficient, so I instead prepended, and then reversed the list afterwards. If you are interested in performance, then there are better list-like data structures you can use.
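For example, Data.Sequence gives O(1) appends at the right end, so the accumulator could be a Seq and the final reverse goes away. A sketch of the change (not part of the linked paste):
import qualified Data.Sequence as Seq
import Data.Foldable (toList)

-- readWithList would then run in StateT (Seq.Seq SchemeExpr) ... and use
--   modify (Seq.|> innerExpr)   -- O(1) append on the right
-- while the "(" case above becomes
--   (SList . toList) `fmap` execStateT readWithList Seq.empty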
Here is the complete file: http://hpaste.org/77852
So if you're new to Haskell, then this probably looks terrifying. My advice is to just give it some time. The Monad abstraction is not nearly as scary as people make it out to be. You just have to learn that what most languages have baked in (mutation, exceptions, etc), Haskell instead provides via libraries. In Haskell, you must explicitly specify which effects you want, and controlling those effects is a little less convenient. In exchange, however, Haskell provides more safety so you don't accidentally mix up the wrong effects, and more power, because you are in complete control of how to combine and refactor effects.
In Haskell, you wouldn't use an algorithm that mutates the data it operates on. So no, there is no straightforward way to do that. However, the code can be rewritten using recursion to avoid updating variables. The solution below uses the MissingH package because Haskell annoyingly doesn't have a replace function that works on strings.
import Data.String.Utils (replace)
import Data.Tree
import System.Environment (getArgs)

data Atom = Sym String | NInt Int | NDouble Double | Para deriving (Eq, Show)

type ParserStack = (Tree Atom, Tree Atom)

tokenize = words . replace "(" " ( " . replace ")" " ) "

atom :: String -> Atom
atom tok =
  case reads tok :: [(Int, String)] of
    [(int, _)] -> NInt int
    _ -> case reads tok :: [(Double, String)] of
           [(dbl, _)] -> NDouble dbl
           _ -> Sym tok

empty = Node $ Sym "dummy"
para = Node Para

parseToken (Node _ stack, Node _ out) "(" =
  (empty $ stack ++ [empty out], empty [])
parseToken (Node _ stack, Node _ out) ")" =
  (empty $ init stack, empty $ (subForest (last stack)) ++ [para out])
parseToken (stack, Node _ out) tok =
  (stack, empty $ out ++ [Node (atom tok) []])

main = do
  (file:_) <- getArgs
  contents <- readFile file
  let tokens = tokenize contents
      parseStack = foldl parseToken (empty [], empty []) tokens
      schemeTree = head $ subForest $ snd parseStack
  putStrLn $ drawTree $ fmap show schemeTree
foldl is the Haskeller's basic structured recursion tool and it serves the same purpose as your while loop and recursive call to read_from. I think the code can be improved a lot, but I'm not so used to Haskell. Below is an almost straight transliteration of the above back to Python:
from pprint import pprint
from sys import argv

def atom(tok):
    try:
        return 'int', int(tok)
    except ValueError:
        try:
            return 'float', float(tok)
        except ValueError:
            return 'sym', tok

def tokenize(s):
    return s.replace('(',' ( ').replace(')',' ) ').split()

def handle_tok((stack, out), tok):
    if tok == '(':
        return stack + [out], []
    if tok == ')':
        return stack[:-1], stack[-1] + [out]
    return stack, out + [atom(tok)]

if __name__ == '__main__':
    tokens = tokenize(open(argv[1]).read())
    tree = reduce(handle_tok, tokens, ([], []))[1][0]
    pprint(tree)
Hello Stack Overflow community.
I'm relatively new to Haskell and I have noticed that writing large strings to a file with writeFile or hPutStr is extremely slow.
For a 1.5 MB string my program (compiled with GHC) takes about 2 seconds, while the "same" code in C++ only takes about 0.1 seconds.
The string is generated from a list with about 10000 elements and then dumped with writeFile. I have also tried to traverse the list with mapM_ and hPutStr, with the same result.
Is there a faster way to write a large string?
Update
As @applicative pointed out, the following code finishes with a 2 MB file in no time:
main = readFile "input.txt" >>= writeFile "ouput.txt"
So my problem seems to be somewhere else. Here are my two implementations for writing the list (WordIndex and CoordList are type aliases for a Map and a list):
with hPutStrLn
-- Print to File
indexToFile :: String -> WordIndex -> IO ()
indexToFile filename index =
  let
    indexList = map (\(k, v) -> entryToString k v) (Map.toList index)
  in do
    output <- openFile filename WriteMode
    mapM_ (\v -> hPutStrLn output v) indexList
    hClose output

-- Convert list element to String
entryToString :: String -> CoordList -> String
entryToString key value = (embedString 25 key) ++ (coordListToString value) ++ "\n"
with writeFile
-- Print to File
indexToFile :: String -> WordIndex -> IO ()
indexToFile filename index = writeFile filename (indexToString "" index)
-- Index to String
indexToString :: String -> WordIndex -> String
indexToString lead index = Map.foldrWithKey (\k v r -> lead ++ (entryToString k v) ++ r) "" index
Maybe you guys can help me a little in finding a speed up here.
Thanks in advance
This is a well-known problem. The default Haskell String type is simply [Char], which is slow by definition and dead slow if it is constructed lazily (the usual situation). However, as a list it allows simple and clean processing with list combinators and is useful when performance is not an issue. If it is, you should use the ByteString or Text packages. ByteString is better in that it ships with GHC, but it does not provide Unicode support. ByteString-based UTF-8 packages are available on Hackage.
Yes. You could, for instance, use the Text type from the module Data.Text or Data.Text.Lazy, which internally represents text in a more efficient way (namely UTF-16) than a list of Chars does.
When writing binary data (which may or may not contain text encoded in some form) you can use ByteStrings or their lazy equivalents.
When modifying Text or ByteStrings, some operations are faster on the lazy versions. If you only want to read from such a string after creating it, the non-lazy versions can generally be recommended.
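As a rough illustration (the file name and payload are made up, and this is not the poster's code), one common pattern is to assemble the output as a Builder and write it out as lazy Text in a single call:
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid ((<>))
import qualified Data.Text.Lazy.Builder as B
import qualified Data.Text.Lazy.Builder.Int as B (decimal)
import qualified Data.Text.Lazy.IO as TLIO

main :: IO ()
main = do
  let rows    = [1 .. 10000 :: Int]
      -- build the whole document as one Builder, then render it lazily
      builder = foldMap (\i -> "entry " <> B.decimal i <> "\n") rows
  TLIO.writeFile "output.txt" (B.toLazyText builder)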
I'm trying to understand how to use the iteratee library with Haskell. All of the articles I've seen so far seem to focus on building an intuition for how iteratees could be built, which is helpful, but now that I want to get down and actually use them, I feel a bit at sea. Looking at the source code for iteratees has been of limited value for me.
Let's say I have this function which trims trailing whitespace from a line:
import Data.ByteString.Char8
import Data.Char (isSpace)

rstrip :: ByteString -> ByteString
rstrip = fst . spanEnd isSpace
What I'd like to do is: make this into an iteratee, read a file and write it out somewhere else with the trailing whitespace stripped from each line. How would I go about structuring that with iteratees? I see there's an enumLinesBS function in Data.Iteratee.Char which I could plumb into this, but I don't know if I should use mapChunks or convStream or how to repackage the function above into an iteratee.
If you just want code, it's this:
procFile' iFile oFile = fileDriver (joinI $
    enumLinesBS ><>
    mapChunks (map rstrip) $
    I.mapM_ (B.appendFile oFile))
  iFile
Commentary:
This is a three-stage process: first you transform the raw stream into a stream of lines, then you apply your function to convert that stream of lines, and finally you consume the stream. Since rstrip is in the middle stage, it will be creating a stream transformer (Enumeratee).
You can use either mapChunks or convStream, but mapChunks is simpler. The difference is that mapChunks doesn't allow you to cross chunk boundaries, whereas convStream is more general. I prefer convStream because it doesn't expose any of the underlying implementation, but if mapChunks is sufficient the resulting code is usually shorter.
rstripE :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE = mapChunks (map rstrip)
Note the extra map in rstripE. The outer stream (which is the input to rstrip) has type [ByteString], so we need to map rstrip onto it.
For comparison, this is what it would look like if implemented with convStream:
rstripE' :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE' = convStream $ do
  mLine <- I.peek
  maybe (return []) (\line -> I.drop 1 >> return [rstrip line]) mLine
This is longer, and it's less efficient because it will only apply the rstrip function to one line at a time, even though more lines may be available. It's possible to work on all of the currently available chunk, which is closer to the mapChunks version:
rstripE'2 :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE'2 = convStream (liftM (map rstrip) getChunk)
Anyway, with the stripping enumeratee available, it's easily composed with the enumLinesBS enumeratee:
enumStripLines :: Monad m => Enumeratee ByteString [ByteString] m a
enumStripLines = enumLinesBS ><> rstripE
The composition operator ><> follows the same order as the arrow operator >>>. enumLinesBS splits the stream into lines, then rstripE strips them. Now you just need to add a consumer (which is a normal iteratee), and you're done:
writer :: FilePath -> Iteratee [ByteString] IO ()
writer fp = I.mapM_ (B.appendFile fp)
processFile iFile oFile =
  enumFile defaultBufSize iFile (joinI $ enumStripLines $ writer oFile) >>= run
The fileDriver functions are shortcuts for simply enumerating over a file and running the resulting iteratee (unfortunately the argument order is switched from enumFile):
procFile2 iFile oFile = fileDriver (joinI $ enumStripLines $ writer oFile) iFile
Addendum: here's a situation where you would need the extra power of convStream. Suppose you want to concatenate every 2 lines into one. You can't use mapChunks. Consider when the chunk is a singleton element, [bytestring]. mapChunks doesn't provide any way to access the next chunk, so there's nothing else to concatenate with this. With convStream however, it's simple:
concatPairs = convStream $ do
  line1 <- I.head
  line2 <- I.head
  return $ line1 `B.append` line2
This looks even nicer in applicative style:
convStream $ B.append <$> I.head <*> I.head
You can think of convStream as continually consuming a portion of the stream with the provided iteratee, then sending the transformed version to the inner consumer. Sometimes even this isn't general enough, since the same iteratee is called at each step. In that case, you can use unfoldConvStream to pass state between successive iterations.
convStream and unfoldConvStream also allow for monadic actions, since the stream processing iteratee is a monad transformer.
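For instance, here is a hedged sketch of unfoldConvStream, assuming its step function has the shape acc -> Iteratee s m (acc, s') as in Data.Iteratee.ListLike (numberLines is a made-up name, not from the answer above): prefix every line with a running line number, threading the counter as the accumulator.
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8
import qualified Data.Iteratee as I

numberLines :: Monad m => I.Enumeratee [ByteString] [ByteString] m a
numberLines = I.unfoldConvStream step (1 :: Int)
  where
    step n = do
      line <- I.head
      -- emit the numbered line and carry the incremented counter forward
      return (n + 1, [C8.pack (show n ++ ": ") `C8.append` line])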