Haskell enumerator: analog to iteratees `enumWith` operator? - haskell

Earlier today I wrote a small test app for iteratees that composed an iteratee for writing progress with an iteratee for actually copying data. I wound up with values like these:
-- NOTE: this snippet is with iteratees-0.8.5.0
-- side effect: display progress on stdout
displayProgress :: Iteratee ByteString IO ()
-- side effect: copy the bytestrings of Iteratee to Handle
fileSink :: Handle -> Iteratee ByteString IO ()
writeAndDisplayProgress :: Handle -> Iteratee ByteString IO ()
writeAndDisplayProgress handle = sequence_ [fileSink handle, displayProgress]
In looking at the enumerator library, I don't see an analog of sequence_ or enumWith. All I want to do is compose two iteratees so they act as one. I could discard the result (it's going to be () anyway) or keep it, I don't care. (&&&) from Control.Arrow is what I want, only for iteratees rather than arrows.
I tried these two options:
-- NOTE: this snippet is with enumerator-0.4.10
run_ $ enumFile source $$ sequence_ [iterHandle handle, displayProgress]
run_ $ enumFile source $$ sequence_ [displayProgress, iterHandle handle]
The first one copies the file but doesn't show progress; the second shows progress but doesn't copy the file. So evidently the built-in sequence_ on enumerator's iteratees runs the first iteratee until it terminates and then runs the other, which is not what I want: I want the iteratees to run in parallel rather than serially. I feel like I'm missing something obvious, but in reading the wc example for the enumerator library, I see this curious comment:
-- Exactly matching wc's output is too annoying, so this example
-- will just print one line per file, and support counting at most
-- one statistic per run
I wonder if this remark indicates that combining or composing iteratees within the enumerations framework isn't possible out of the box. What's the generally-accepted right way to do this?
Edit:
It seems as though there is no built-in way to do this. There's discussion on the Haskell mailing list about adding combinators like enumSequence and manyToOne but so far, there doesn't seem to be anything actually in the enumerator package that furnishes this capability.
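For reference, here is a rough sketch of what such a combinator might look like for enumerator-0.4.x, roughly what iteratee's sequence_ does. This is not part of the package, and it silently drops leftovers, results and errors, so treat it as illustrative only:
import Data.Enumerator
-- Feed every chunk to both iteratees until neither is still consuming input.
bothIteratees :: Monad m => Iteratee a m () -> Iteratee a m () -> Iteratee a m ()
bothIteratees i1 i2 = Iteratee $ do
  step1 <- runIteratee i1
  step2 <- runIteratee i2
  runIteratee $ case (step1, step2) of
    (Continue k1, Continue k2) ->
      continue $ \stream -> bothIteratees (k1 stream) (k2 stream)
    (Continue k1, _) ->
      continue $ \stream -> bothIteratees (k1 stream) (returnI step2)
    (_, Continue k2) ->
      continue $ \stream -> bothIteratees (returnI step1) (k2 stream)
    _ -> return ()
With something like run_ $ enumFile source $$ bothIteratees (iterHandle handle) displayProgress, each chunk would be fed to both consumers.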

It seems to me that rather than trying to have two Iteratees consume the sequence in parallel, it would be better to feed the stream through an identity Enumeratee that simply counts the bytes passing through it.
Here's a simple example that copies a file and prints the number of bytes copied after each chunk.
import System.Environment
import System.IO
import Data.Enumerator
import Data.Enumerator.Binary (enumFile, iterHandle)
import Data.Enumerator.List (mapAccumM)
import qualified Data.ByteString as B
printBytes :: Enumeratee B.ByteString B.ByteString IO ()
printBytes = flip mapAccumM 0 $ \total bytes -> do
  let total' = total + B.length bytes
  print total'
  return (total', bytes)
copyFile s t = withBinaryFile t WriteMode $ \h -> do
  run_ $ (enumFile s $= printBytes) $$ iterHandle h
main = do
  [source, target] <- getArgs
  copyFile source target

Related

Where is the memory leak in using StateT s IO a?

Intention: Small application to learn Haskell: Downloads a wikipedia-article, then downloads all articles linked from it, then downloads all articles linked from them, and so on... until a specified recursion depth is reached. The result is saved to a file.
Approach: Use a StateT to keep track of the download queue, to download an article and to update the queue. I build a list IO [WArticle] recursively and then print it.
Problem: While profiling I find that total memory in use is proportional to number of articles downloaded.
Analysis: By the literature I'm led to believe this is a laziness and/or strictness issue. BangPatterns reduced the memory consumed but didn't solve the proportionality. Furthermore, I know all articles are downloaded before the file output is started.
Possible solutions:
1) The function getNextNode :: StateT CrawlState IO WArticle (below) already has IO. One solution would be to just do the file writing in it and only return the state. It would mean the file is written to in very small chunks though. Doesn't feel very Haskell..
2) Have the function buildHelper :: CrawlState -> IO [WArticle] (below) return [IO WArticle]. Though I wouldn't know how to rewrite that code and have been advised against it in the comments.
Are any of these proposed solutions better than I think they are or are there better alternatives?
import GetArticle (WArticle, getArticle, wa_links, wiki2File) -- my own
type URL = Text
data CrawlState =
  CrawlState ![URL] ![(URL, Int)]
  --          [Completed] [(Queue, depth)]
-- Called by user
buildDB :: URL -> Int -> IO [WArticle]
buildDB startURL recursionDepth = buildHelper cs
  where cs = CrawlState [] [(startURL, recursionDepth)]
-- Builds list recursively
buildHelper :: CrawlState -> IO [WArticle]
buildHelper !cs@(CrawlState _ queue) = {-# SCC "buildHelper" #-}
  if null queue
    then return []
    else do
      (!article, !cs') <- runStateT getNextNode cs
      rest <- buildHelper cs'
      return (article:rest)
-- State manipulation
getNextNode :: StateT CrawlState IO WArticle
getNextNode = {-# SCC "getNextNode" #-} do
  CrawlState !parsed !queue@((url, depth):queueTail) <- get
  article <- liftIO $ getArticle url
  put $ CrawlState (url:parsed) (queueTail ++ (if depth > 1
    then let !newUrls  = wa_links article \\ parsed
             !newUrls' = newUrls \\ map fst queue
         in  zip newUrls' (repeat (depth-1))
    else []))
  return article
startUrl = pack "https://en.wikipedia.org/wiki/Haskell_(programming_language)"
recursionDepth = 3
main :: IO ()
main = {-# SCC "DbMain" #-}
  buildDB startUrl recursionDepth
    >>= return . wiki2File
    >>= writeFile "savedArticles.txt"
Full code at https://gitlab.com/mattias.br/sillyWikipediaSpider. The current version is limited to downloading only the first eight links from each page to save time; as it stands it downloads 55 pages at ~600 MB heap usage.
Thanks for any help!
2) Is [IO WArticle] what I want in this case?
Not quite. The problem is that some of the IO WArticle actions depend on the result of a previous action: the links to future pages reside in previously obtained pages. [IO WArticle] can't provide that: it is pure in the sense that you can always find an action in the list without executing the previous actions.
What we need is a kind of "effectful list" that lets us extract articles one by one, progressively performing the necessary effects, but not forcing us to completely generate the list in one go.
There are several libraries that provide these kinds of "effectful lists": streaming, pipes, conduit. They define monad transformers that extend a base monad with the ability to yield intermediate values before returning a final result. Usually the final result is of a type different from the values that are yielded; it might be simply unit ().
Note: The Functor, Applicative and Monad instances for these libraries differ from the corresponding instances for pure lists. The Functor instances map over the final result value, not over the intermediate values that are yielded; to map over the yielded values, they provide separate functions. And the Monad instances sequence effectful lists, instead of trying all combinations; to try all combinations, they provide separate functions.
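As a quick illustration of that difference (a small made-up example, not from the original answer):
import Streaming
import qualified Streaming.Prelude as S
demo :: IO ()
demo = do
  -- fmap changes only the final return value; the yielded 1,2,3 are untouched:
  xs :> r <- S.toList (fmap show (S.each [1, 2, 3 :: Int]))
  print (xs, r)                          -- ([1,2,3],"()")
  -- S.map is the function that transforms the yielded elements:
  ys :> _ <- S.toList (S.map (* 10) (S.each [1, 2, 3 :: Int]))
  print ys                               -- [10,20,30]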
Using the streaming library, we could modify buildHelper to something like this:
import Streaming
import qualified Streaming.Prelude as S
buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper !cs@(CrawlState _ queue) =
  if null queue
    then return ()
    else do (article, cs') <- liftIO (runStateT getNextNode cs)
            S.yield article
            buildHelper cs'
And then we could use functions like mapM_ (from Streaming.Prelude, not the one from Control.Monad!) to process the articles one by one, as they are generated.
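For instance, continuing the module above and assuming a Show instance for WArticle (the next answer's use of S.show relies on one as well), printing each article as it is downloaded could look like:
printArticles :: CrawlState -> IO ()
printArticles cs = S.mapM_ print (buildHelper cs)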
Adding a further explanation and code, building upon the answer by danidiaz. Here's the final code:
import Streaming
import qualified Streaming.Prelude as S
import System.IO (IOMode (WriteMode), hClose, openFile)
buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper cs@(CrawlState _ queue) =
  if null queue
    then return ()
    else do
      (article, cs') <- liftIO (runStateT getNextNode cs)
      S.yield article
      buildHelper cs'
main :: IO ()
main = do outFileHandle <- openFile filename WriteMode
          S.toHandle outFileHandle . S.show . buildHelper $
            CrawlState [] [(startUrl, recursionDepth)]
          hClose outFileHandle
outFileHandle is a usual file output handle.
S.toHandle takes a stream of String and writes them to the specified handle.
S.show maps show :: WArticle -> String over the stream.
An elegant solution that creates a lazy stream even though it is produced by a series of IO actions (namely downloading websites) and writes it to a file as results become available. On my machine it still uses a lot of memory (relative to the task) during execution but never exceeds 450 MB.

Space explosion when folding over Producers/Parsers in Haskell

Supposing I have a module like this:
module Explosion where
import Pipes.Parse (foldAll, Parser, Producer)
import Pipes.ByteString (ByteString, fromLazy)
import Pipes.Aeson (DecodingError)
import Pipes.Aeson.Unchecked (decoded)
import Data.List (intercalate)
import Data.ByteString.Lazy.Char8 (pack)
import Lens.Family (view)
import Lens.Family.State.Strict (zoom)
produceString :: Producer ByteString IO ()
produceString = fromLazy $ pack $ intercalate " " $ map show [1..1000000]
produceInts ::
  Producer Int IO (Either (DecodingError, Producer ByteString IO ()) ())
produceInts = view decoded produceString
produceInts' :: Producer Int IO ()
produceInts' = produceInts >> return ()
parseBiggest :: Parser ByteString IO Int
parseBiggest = zoom decoded (foldAll max 0 id)
The 'produceString' function is a bytestring producer, and I am concerned with folding a parse over it to produce some kind of result.
The following two programs show different ways of tackling the problem of finding the maximum value in the bytestring by parsing it as a series of JSON ints.
Program 1:
module Main where
import Explosion (produceInts')
import Pipes.Prelude (fold)
main :: IO ()
main = do
  biggest <- fold max 0 id produceInts'
  print $ show biggest
Program 2:
module Main where
import Explosion (parseBiggest, produceString)
import Pipes.Parse (evalStateT)
main :: IO ()
main = do
  biggest <- evalStateT parseBiggest produceString
  print $ show biggest
Unfortunately, both programs eat about 200MB of memory total when I profile them, a problem I'd hoped the use of streaming parsers would solve. The first program spends most of its time and memory (> 70%) in (^.) from Lens.Family, while the second spends it in fmap, called by zoom from Lens.Family.State.Strict. The usage graphs are below. Both programs spend about 70% of their time doing garbage collection.
Am I doing something wrong? Is the Prelude function max not strict enough? I can't tell if the library functions are bad, or if I'm using the library wrong! (It's probably the latter.)
For completeness, here's a git repo that you can clone and run cabal install in if you'd like to see what I'm talking about first-hand, and here's the memory usage of the two programs:
Wrapping a strict bytestring in a single yield doesn't make it lazy. You have to yield smaller chunks to get any streaming behavior.
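For illustration only (this is not the actual fix described below, just a made-up sketch of "yield smaller chunks"), a producer that emits one small strict chunk per number would look like:
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.ByteString.Char8 as B8
-- One small chunk per number instead of a single huge chunk.
produceStringChunked :: Producer B8.ByteString IO ()
produceStringChunked =
  each [1 .. 1000000 :: Int] >-> P.map (\n -> B8.pack (show n ++ " "))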
Edit: I found the error. pipes-aeson internally uses a consecutively function defined like this:
consecutively parser = step where
    step p0 = do
      (mr, p1) <- lift $
        S.runStateT atEndOfBytes (p0 >-> PB.dropWhile B.isSpaceWord8)
      case mr of
        Just r  -> return (Right r)
        Nothing -> do
          (ea, p2) <- lift (S.runStateT parser p1)
          case ea of
            Left  e -> return (Left (e, p2))
            Right a -> yield a >> step p2
The problematic line is the one with PB.dropWhile. This adds a quadratic blow-up: the total overhead grows with the square of the number of parsed elements.
What happens is that the pipe that is threaded through this computation accumulates a new cat pipe downstream of it after each parse. So after N parses you get N cat pipes, which adds O(N) overhead to each parsed element.
I've created a Github issue to fix this. pipes-aeson is maintained by Renzo and he has fixed this issue before.
Edit: I've submitted a pull request to fix a second problem (you needed to use the intercalate for lazy bytestrings). Now the program runs in 5 KB constant space for both versions.

Is there something better than unsafePerformIO for this....?

I've so far avoided ever needing unsafePerformIO, but this might have to change today.... I would like to see if the community agrees, or if someone has a better solution.
I have a library which needs to use some config data stored in a bunch of files. This data is guaranteed static (during the run), but needs to be in files that can (on very rare occasions) be edited by an end user who cannot compile Haskell programs. (The details are unimportant, but think of "/etc/mime.types" as a pretty good approximation. It is a large, almost static data file used throughout many programs).
If this weren't a library I would just use the IO monad.... But because it is a library which is called throughout my code, it literally forces a bubbling up of the IO monad through pretty much everything I have written in multiple modules! Although I need to do a one-time read of the data files, this low-level call is effectively pure, so this is a pretty unacceptable outcome.
FYI, I plan to also wrap the call in unsafeInterleaveIO, so that only files that are needed will be loaded. My code will look something like this....
dataDir = "<path to files>"
datafiles :: [FilePath]
datafiles =
  unsafePerformIO $
  unsafeInterleaveIO $
    map (dataDir </>)
      <$> filter (not . ("." `isPrefixOf`))
      <$> getDirectoryContents dataDir
fileData :: [String]
fileData = unsafePerformIO $ unsafeInterleaveIO $ sequence $ readFile <$> datafiles
Given that the data read is referentially transparent, I am pretty sure that unsafePerformIO is safe (this has been discussed in many place, such as "Use of unsafePerformIO appropriate?"). Still, though, if there is a better way, I would love to hear about it.
UPDATE-
In response to Anupam's comment....
There are two reasons why I can't break up the lib into IO and non IO parts.
First, the amount of data is large, and I don't want to read it all into memory at once. Remember that IO is always read strictly.... This is the reason that I need to put in the unsafeInterleaveIO call, to make it lazy. IMHO, once you use unsafeInterleaveIO, you might as well use unsafePerformIO, as the risk is already there.
Second, breaking out the IO-specific parts just substitutes the bubbling up of the IO monad with the bubbling up of the IO read code, as well as the passing around of the data (I might actually choose to pass around the data using the state monad anyway, so it really isn't an improvement to substitute the IO monad for the state monad everywhere). This wouldn't be so bad if the low-level function itself weren't effectively pure (i.e. think of my /etc/mime.types example above, and imagine a Haskell extensionToMimeType function, which is basically pure, but needs to get the database data from the file.... Suddenly everything from low to high in the stack needs to call or pass through a readMimeData :: IO String. Why should each main even need to care about the library choice of a submodule many levels deep?).
I agree with Anupam Jain, you would be better off reading these data files at a somewhat higher level, in IO, and then passing the data in them through the rest of your program purely.
You could, for example, put the functions that need the results of fileData into Reader [String], so that they can just ask for the results as needed (or some Reader Config, where Config holds these strings and whatever else you need).
A sketch of what I'm suggesting follows:
type AppResult = String
fileData :: IO [String]
fileData = undefined -- read the files
myApp :: String -> Reader [String] AppResult
myApp s = do
  files <- ask
  return undefined -- do whatever with s and config
main = do
  config <- fileData
  return $ runReader (myApp "test") config
I gather that you don't want to read all the data at once, because that would be costly. And maybe you don't really know up-front what files you will need to load, so loading all of them at the start would be wasteful.
Here's an attempt at a solution. It requires you to work inside a free monad and relegate the side-effecting operations to an interpreter. Some preliminary imports:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.ByteString as B
import Data.Monoid
import Data.List
import Data.Functor.Compose
import Control.Applicative
import Control.Monad
import Control.Monad.Free
import System.IO
We define a functor for the free monad. It will offer a value p to the interpreter and continue the computation after receiving a value b:
type LazyLoad p b = Compose ((,) p) ((->) b)
A convenience function to request the loading of a file:
lazyLoad :: FilePath -> Free (LazyLoad FilePath B.ByteString) B.ByteString
lazyLoad path = liftF $ Compose (path,id)
A dummy interpreter function that reads "file contents" from stdin:
interpret :: Free (LazyLoad FilePath B.ByteString) a -> IO a
interpret = iterM $ \(Compose (path, next)) -> do
  putStrLn $ "Enter the contents for file " <> path <> ":"
  B.hGetLine stdin >>= next
Some silly example functions:
someComp :: B.ByteString -> B.ByteString
someComp b = "[" <> b <> "]"
takesAwhile :: Int
takesAwhile = foldl' (+) 0 $ take 400000000 $ intersperse (negate 1) $ repeat 1
An example program:
main :: IO ()
main = do
  r <- interpret $ do
    r1 <- someComp <$> lazyLoad "file1"
    r2 <- return takesAwhile
    if (r2 == 1)
      then return r1
      else someComp <$> lazyLoad "file2"
  putStrLn . show $ r
When executed, this program will request a line, spend some time computing takesAwhile and only then request another line.
If you want to allow different kinds of "requests", this solution could be extended with something like Data types à la carte so that each function only needs to know about the precise effects it requires.
If you are content with allowing only one type of request, you could also use Clients and Servers from Pipes.Core instead of the free monad.
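A rough sketch of that alternative (the file names and the exact request/response shape here are made up for illustration): the client asks for paths, and only the server actually touches the file system.
import Pipes.Core
import Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString as B
loadClient :: Client FilePath B.ByteString IO B.ByteString
loadClient = do
  c1 <- request "file1"              -- ask the server for a file's contents
  c2 <- request "file2"
  return (B.append c1 c2)
fileServer :: FilePath -> Server FilePath B.ByteString IO r
fileServer path = do
  contents <- liftIO (B.readFile path)
  nextPath <- respond contents       -- hand back the contents, wait for the next request
  fileServer nextPath
main :: IO ()
main = runEffect (fileServer +>> loadClient) >>= print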

Create lazy IO list from a non-IO list

I have a lazy list of filenames created by find. I'd like to be able to load the metadata of these files lazily too. That means that if I take 10 elements from metadata, it should only read the metadata of those ten files. find happily gives you 10 files if you ask for them, without thrashing your disk, whereas my script reads the metadata of all the files.
main = do
  files <- find always always "/"
  metadata <- loadMetaList files
loadMetaList :: [String] -> IO [Metadata]
loadMetaList (file:files) = do
  first <- loadMeta file
  rest <- loadMetaList files
  return (first:rest)
loadMeta :: String -> IO Metadata
As you can see, loadMetaList is not lazy. For it to be lazy, it should use tail recursion. Something like return (first:loadMetaList rest).
How do I make loadMetaList lazy?
The (>>=) of the IO monad is such that in
loadMetaList :: [String] -> IO [Metadata]
loadMetaList (file:files) = do
  first <- loadMeta file
  rest <- loadMetaList files
  return (first:rest)
the action loadMetaList files has to be run before return (first:rest) can be executed.
You can avoid that by deferring the execution of loadMetaList files,
import System.IO.Unsafe
loadMetaList :: [String] -> IO [Metadata]
loadMetaList (file:files) = do
  first <- loadMeta file
  rest <- unsafeInterleaveIO $ loadMetaList files
  return (first:rest)
with unsafeInterleaveIO (which find also uses). That way, the loadMetaList files is not executed until its result is needed, and if you require only the metadata of 10 files, only that will be loaded.
It's not quite as unsafe as its cousin unsafePerformIO, but should be handled with care too.
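A tiny made-up example of that deferral:
import System.IO.Unsafe (unsafeInterleaveIO)
demo :: IO ()
demo = do
  xs <- unsafeInterleaveIO (putStrLn "reading!" >> return [1, 2, 3 :: Int])
  putStrLn "before forcing"
  print (head xs)   -- "reading!" is printed here, only when xs is first demanded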
Here's how you do it the pipes way. I don't really know how you implement loadMeta and find, so I just made something up:
import Pipes
import qualified Pipes.Prelude as Pipes
find :: Producer FilePath IO ()
find = each ["heavy.mp3", "metal.mp3"]
type MetaData = String
loadMeta :: String -> IO MetaData
loadMeta file = return $ "This song is " ++ takeWhile (/= '.') file
loadMetaList :: Pipe FilePath MetaData IO r
loadMetaList = Pipes.mapM loadMeta
To run it, we just compose processing stages like a pipeline and run the pipeline using runEffect:
>>> runEffect $ find >-> loadMetaList >-> Pipes.stdoutLn
This song is heavy
This song is metal
There are a couple of key things to point out:
You can make find a Producer so that it only searches the directory tree lazily, too. I know you don't need this feature because your file set is small now, but it's very easy to include later when your directory gets larger.
It's lazy, but without unsafeInterleaveIO. It generates each output immediately and doesn't wait to first collect the whole list of results.
For example, it will work even if we use an infinite list of files:
>>> import qualified Pipes.Prelude as Pipes
>>> runEffect $ each (cycle ["heavy.mp3", "metal.mp3"]) >-> loadMetaList >-> Pipes.stdoutLn
This song is heavy
This song is metal
This song is heavy
This song is metal
This song is heavy
This song is metal
...
It will only compute as much as necessary. If we specify that we only want three results, it will do the minimum amount of loading necessary to return three results, even if we provide an infinite list of files.
For example, we can cap the number of results using take:
>>> runEffect $ each (cycle ["heavy.mp3", "metal.mp3"]) >-> loadMetaList >-> Pipes.take 3 >-> Pipes.stdoutLn
This song is heavy
This song is metal
This song is heavy
So you asked what is wrong with unsafeInterleaveIO. The main limitation of unsafeInterleaveIO is that you cannot guarantee when the IO actions actually occur, which leads to the following common pitfalls:
Handles accidentally being closed before the file is read
IO actions occurring late or never
Pure code having side effects and throwing IOExceptions
The biggest advantages of Haskell's IO system over other languages is that Haskell completely decouples the evaluation model from the order of side effects. When you use lazy IO, you lose that decoupling and then the order of side effects becomes tightly integrated with Haskell's evaluation model, which is a huge step backwards.
This is why it is generally not wise to use lazy IO, especially now that there are easy and elegant alternatives.
If you want to learn more about how to use pipes to implement lazy IO safely, then you can read the extensive pipes tutorial.

Stdin as IO Handle

This may be a stupid question, but I couldn't find an answer anywhere. I'm a Haskell newbie and I'm having trouble with I/O.
I have this structure:
data SrcFile = SrcFile (IO Handle) String
srcFileHandle :: SrcFile -> IO Handle
srcFileHandle (SrcFile handle _) = handle
srcFileLine :: SrcFile -> String
srcFileLine (SrcFile _ string) = string
Now the problem is that I have no idea how to assign stdin/stderr/stdout to it, because stdin etc. are Handles, not IO Handles. And if I make the structure have Handle attributes instead of IO Handle, then I won't be able to add any other file handles to it.
Judging from your definition of SrcFile, it seems as though you may be trying to write a C program in Haskell. Language shapes the way we think, and the good news is Haskell is a much more powerful language!
The excellent book Real World Haskell has a section on lazy I/O. Consider an excerpt:
One novel way to approach I/O is the hGetContents function. hGetContents has the type Handle -> IO String. The String it returns represents all of the data in the file given by the Handle.
In a strictly-evaluated language, using such a function is often a bad idea. It may be fine to read the entire contents of a 2KB file, but if you try to read the entire contents of a 500GB file, you are likely to crash due to lack of RAM to store all that data. In these languages, you would traditionally use mechanisms such as loops to process the file's entire data.
Here's the radical part.
But hGetContents is different. The String it returns is evaluated lazily. At the moment you call hGetContents, nothing is actually read. Data is only read from the Handle as the elements (characters) of the list are processed. As elements of the String are no longer used, Haskell's garbage collector automatically frees that memory. All of this happens completely transparently to you. And since you have what looks like—and, really, is—a pure String, you can pass it to pure (non-IO) code.
Further down is a section on readFile and writeFile that shows you how to forget about handles entirely.
For example, say you want to grab all the import lines from a source file:
module Main where
import Control.Monad (liftM, mapM_)
import Data.List (isPrefixOf)
import System.Environment (getArgs, getProgName)
import System.IO (hPutStrLn, stderr)
main :: IO ()
main = getArgs >>= go
  where go [path] = collectImports `liftM` readFile path >>= mapM_ putStrLn
        go _ = getProgName >>=
               hPutStrLn stderr . ("Usage: " ++) . (++ " source-file")
collectImports :: String -> [String]
collectImports = filter ("import" `isPrefixOf`)
               . takeWhile (\l -> null l
                               || "module" `isPrefixOf` l
                               || "import" `isPrefixOf` l)
               . lines
Even though the definition of main uses readFile, the program reads only as much of the named source-file as necessary, not the whole thing! There's nothing magic going on: note that collectImports uses takeWhile to examine only those lines it needs to rather than, say, filter that would have to read all lines.
When fed its own source, the program outputs
import Control.Monad (liftM, mapM_)
import Data.List (isPrefixOf)
import System.Environment (getArgs, getProgName)
import System.IO (hPutStrLn, stderr)
So embrace laziness. Laziness is your friend! Enjoy the rest of the wonderful journey with Haskell.
I'm not sure what you're really attempting to do, but you can convert a Handle to an IO Handle by using the return function. So,
stdin :: Handle
return stdin :: IO Handle
In fact, return is a polymorphic function. Its type is a -> m a, where m can be IO, Maybe, [] and others. Don't confuse it with return in C - it's a normal function, not a keyword that is used to exit prematurely.
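For example, in GHCi:
>>> import System.IO
>>> return stdin :: IO Handle
{handle: <stdin>}
>>> return 5 :: Maybe Int
Just 5
>>> return 5 :: [Int]
[5]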
In your code, you can use record syntax. The following is equivalent and automatically declares srcFileHandle and srcFileLine as functions:
data SrcFile = SrcFile { srcFileHandle :: IO Handle,
                         srcFileLine   :: String }
I don't quite get what you're trying to achieve.
An IO a means: An interaction with the outside world that, when run, will yield an a.
It therefore doesn't make sense to store an IO Handle in a data structure. You just store the handle and you can do IO with the handle, but for storing/loading it, you have no IO interaction involved.
Hence your structure is:
data SrcFile = SrcFile Handle String
If you want to change/add/manipulate the contents, you can use an IORef which you can use like a pointer from IO code.
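A small sketch of that suggestion (the file name and strings below are made up): store plain Handles in the structure, and use an IORef when the current SrcFile needs to change over time.
import Data.IORef
import System.IO
data SrcFile = SrcFile Handle String
main :: IO ()
main = do
  ref <- newIORef (SrcFile stdin "first line")   -- stdin is already a plain Handle
  SrcFile _handle line <- readIORef ref
  hPutStrLn stderr ("current line: " ++ line)
  fileHandle <- openFile "input.txt" ReadMode    -- later, swap in a file handle
  writeIORef ref (SrcFile fileHandle "next line")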
