Reading in an arbitrary number of binary messages - Haskell

I am parsing binary data out of files using Binary.Get and have something like the following:
data FileMessageHeaders = FileMessageHeaders [FileMessageHeader]

data FileMessageHeader = FileMessageHeader ...

instance Binary FileMessageHeaders where
    put = undefined
    get = do
        messages <- untilM get isEmpty
        return (FileMessageHeaders messages)

instance Binary FileMessageHeader where
    put = undefined
    get = ...
The problem I am having is that untilM from the monad-loops package on Hackage uses sequence, so I believe this is what causes a massive delay in returning the head of the FileMessageHeader list, as the whole file must be read first (is this correct?). I am having trouble coming up with a way to rewrite this and avoid sequencing all of the FileMessageHeaders in the file. Any suggestions?
Thanks!

As FUZxxl notes, the problem is untilM; the Get monad is strict and requires that the entire untilM action completes before it returns. IO has nothing to do with it.
The easiest thing to do is probably switch to attoparsec and use that for parsing instead of binary. Attoparsec supports streaming parses and would likely be much easier to use for this case.
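To illustrate (this is my own untested sketch, not part of the original answer): attoparsec can be fed input chunk by chunk, so the file never has to be read in its entirety up front. Here parseHeader is a hypothetical attoparsec parser for a single FileMessageHeader:

import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as BS
import System.IO (Handle, hIsEOF)

-- Hypothetical parser for one header; the real definition depends
-- on the wire format.
parseHeader :: A.Parser FileMessageHeader
parseHeader = undefined

-- Parse headers one at a time, refilling the parser with 32K chunks
-- on demand and carrying leftover bytes into the next parse.
readHeaders :: Handle -> IO [FileMessageHeader]
readHeaders h = go BS.empty
  where
    go leftover = do
        eof <- hIsEOF h
        if eof && BS.null leftover
            then return []
            else do
                r <- A.parseWith (BS.hGetSome h 32768) parseHeader leftover
                case r of
                    A.Done rest hdr -> (hdr :) <$> go rest
                    A.Fail _ _ err  -> fail err
                    A.Partial _     -> fail "ran out of input mid-header"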
If you can't switch to attoparsec, you'll need to use some of the lower-level functions of binary rather than just using the Binary instance. Something like the following (completely untested).
import Data.Binary.Get (runGetState)
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as B

getHeaders :: ByteString -> [FileMessageHeader]
getHeaders b = go b 0
  where
    go bs n
      | B.null bs = []
      | otherwise = let (header, bs', n') = runGetState get bs n
                    in  header : go bs' n'
Unfortunately this means you won't be able to use the Binary instance or the get function; you'll have to use getHeaders. It will stream, though.
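To see the streaming in action, a hypothetical driver (my own sketch, with an assumed input file name): with a lazy ByteString as input, forcing only the head of the result parses only the first header.

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
    contents <- BL.readFile "messages.bin"  -- hypothetical input file
    -- Forcing the head only parses the first header, not the whole
    -- file; the rest of the list stays an unevaluated thunk.
    case getHeaders contents of
        (h:_) -> h `seq` putStrLn "first header parsed"
        []    -> putStrLn "no headers"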

The problem here is that an IO action has to finish before control flow can continue. Thus, the program has to read in all the messages before they get evaluated. You could try to define your own combinator sequenceI that uses the function unsafeInterleaveIO from System.IO.Unsafe. This function allows you, well, to interleave actions. It is used, for instance, by getContents. I would define sequenceI like this:
sequenceI []     = return []
sequenceI (x:xs) = do v  <- x
                      vs <- unsafeInterleaveIO $ sequenceI xs
                      return (v:vs)
On top of this combinator, you can define your own untilM that streams. Doing this is left as an exercise to the reader.
Edit (corrected for compilation)
This is a proof-of-concept, untested implementation of untilM:
untilMI :: IO a -> IO Bool -> IO [a]
untilMI f p = do
    f' <- f
    p' <- p
    if p'
      then return [f']
      else do g' <- unsafeInterleaveIO $ untilMI f p
              return (f' : g')
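A quick way to convince yourself that it streams (my own toy example, untested, assuming at least one line of input): with f = getLine and p = isEOF, each line of the result becomes available as soon as it is read, rather than only after end-of-file.

import System.IO (isEOF)
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = do
    ls <- untilMI getLine isEOF
    -- Each line is echoed immediately, interleaved with the reads.
    mapM_ putStrLn ls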

Related

Parallel Haskell with HXT

I'm trying to get performance increases in a program I have that parses XML. The program can parse multiple XML files, so I thought I could make it run in parallel, but all my attempts have resulted in lower performance!
For XML parsing, I am using HXT.
I have a run function defined like this:
run printTasks xs = pExec xs >>= return . concat >>= doPrint printTasks 1
'pExec' is given a list of file names and is defined as:
pExec xs = do
    ex <- mapM exec xs
    as <- ex `usingIO` parList rdeepseq
    return as
where 'exec' is defined as:
exec = runX . process
Threadscope shows only one thread ever being used (until the very end).
Can anyone explain why I have failed so miserably to parallelise this code?
In case it helps:
exec :: FilePath -> [CV_scene]
pExec :: [FilePath] -> IO [[CV_scene]]

data CV_scene = Scene [CV_layer] Time deriving (Show)
data CV_layer = Layer [DirtyRects] SourceCrop deriving (Show)
data Rect = Rect Int Int Int Int deriving (Show) -- Left Top Width Height

instance NFData CV_scene where
    rnf = foldScene reduceScene
      where reduceScene l t = rnf (seq t l)
instance NFData CV_layer where
    rnf = foldLayer reduceLayer
      where reduceLayer d s = rnf (seq s d)
instance NFData Rect where
    rnf = foldRect reduceRect
      where reduceRect l t w h = rnf [l,t,w,h]

type SourceCrop = Rect
type DirtyRect = Rect
type Time = Int64
Thanks in advance for your help!
First, it looks like you mislabeled the signature of exec, which should probably be:
exec :: FilePath -> IO [CV_scene]
Now for the important part. I've commented inline on what I think you think is going on.
pExec xs = do
    -- A. Parse the file found at each location via exec.
    ex <- mapM exec xs
    -- B. Force the lazy parsing in parallel.
    as <- ex `usingIO` parList rdeepseq
    return as
Note that line A does not happen in parallel, which you might think is okay since it will just set up the parsing thunks that are then forced in parallel in B. This is a fair assumption, and a clever use of laziness, but the results call that into question for me.
I suspect that the implementation of exec forces most of the parsing before line B is even reached, so that the deepseq doesn't do much. That fits pretty well with my experience parsing, and the profiling supports that explanation.
Without the ability to test your code, I can only make the following suggestions. First try separating the parsing of the file from the IO and put the parsing in the parallel execution strategy. In that case lines A and B become something like:
ex <- mapM readFile xs
as <- map exec' ex `usingIO` parList rdeepseq
with exec' being the portion of exec that runs after the file is read from disk, so the parsing itself is what happens in parallel:
exec' :: String -> [CV_scene]
Also, you may not even need rdeepseq after this change.
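As a self-contained illustration of that shape (my own toy example, not from the thread; countWords stands in for the real exec'):

import Control.Parallel.Strategies (parList, rdeepseq, usingIO)

-- Stand-in for exec': some pure work over file contents.
countWords :: String -> Int
countWords = length . words

parseAll :: [FilePath] -> IO [Int]
parseAll xs = do
    contents <- mapM readFile xs
    -- Note: readFile is lazy, so in this sketch the actual disk
    -- reads can also end up happening inside the parallel sparks.
    map countWords contents `usingIO` parList rdeepseq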
As an alternative, you can do the IO and parsing in parallel using Software Transactional Memory. STM approaches are normally used for separate IO threads that act more like services, rather than for pure computations. But if for some reason you can't get the strategies-based approach to work, this might be worth a try.
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TChan -- (from the stm package)
import Control.Concurrent (forkIO)

pExec'' :: [FilePath] -> IO [[CV_scene]]
pExec'' xs = do
    -- A. Create a [(FilePath, TChan [CV_scene])].
    tcx <- mapM (\x -> (,) x <$> newTChanIO) xs
    -- B. Do the reading/parsing in separate threads.
    mapM_ (forkIO . exec'') tcx
    -- C. Collect the results.
    mapM (atomically . readTChan . snd) tcx

exec'' :: (FilePath, TChan [CV_scene]) -> IO ()
exec'' (x, tch) = do
    -- D. The original exec function.
    cv <- exec x
    -- E. Put the result on the channel's FIFO buffer.
    atomically $ writeTChan tch cv
Good luck!

Is it recommended to use recursive IO actions in the tail recursive form?

Consider the two following variations:
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = go []
  where
    go :: [String] -> IO [String]
    go l = do
      inp <- getLine
      if inp == ""
        then return l
        else go (inp:l)
myReadListOrdinary :: IO [String]
myReadListOrdinary = do
    inp <- getLine
    if inp == ""
      then return []
      else do
        moreInps <- myReadListOrdinary
        return (inp : moreInps)
In ordinary programming languages, one would know that the tail-recursive variant is the better choice.
However, going through this answer, it is apparent that Haskell's implementation of recursion is not like repeatedly pushing frames onto a call stack.
But because in this case the program in question involves actions and a strict monad, I am not sure if the same reasoning applies. In fact, I think in the IO case the tail-recursive form is indeed better. I am not sure how to reason about this correctly.
EDIT: David Young pointed out that the outermost call here is to (>>=). Even in that case, does one of these styles have an advantage over the other?
FWIW, I'd go for existing monadic combinators and focus on readability/conciseness. Using unfoldM :: Monad m => m (Maybe a) -> m [a]:
import Control.Monad (liftM, mfilter)
import Control.Monad.Loops (unfoldM)

myReadListTailRecursive :: IO [String]
myReadListTailRecursive = unfoldM go
  where
    go :: IO (Maybe String)
    go = do
      line <- getLine
      return $ case line of
        "" -> Nothing
        s  -> Just s
Or using the MonadPlus instance of Maybe, with mfilter :: MonadPlus m => (a -> Bool) -> m a -> m a:
myReadListTailRecursive :: IO [String]
myReadListTailRecursive = unfoldM (liftM (mfilter (/= "") . Just) getLine)
Another, more versatile option might be to use LoopT.
That’s really not how I would write it, but it’s clear enough what you’re doing. (By the way, if you want to be able to efficiently insert arbitrary output from any function in the chain, without using monads, you might try a Data.ByteString.Builder.)
Your first implementation is very similar to a left fold, and your second very similar to a right fold or map. (You might try actually writing them as such!) The second one has several advantages for I/O. One of the most important, for handling input and output, is that it can be interactive.
You’ll notice that the first builds the entire list from the outside in: in order to determine what the first element of the list is, the program needs to compute the entire structure to get to the innermost thunk, which is return l. The program generates the entire data structure first, then starts to process it. That’s useful when you’re reducing a list, because tail-recursive functions and strict left folds are efficient.
With the second, the outermost thunk contains the head and tail of the list, so you can grab the tail, then call the thunk to generate the second list. This can work with infinite lists, and it can produce and return partial results.
Here’s a contrived example: a program that reads in one integer per line and prints the sums so far.
main :: IO ()
main = interact (display . compute 0 . parse . lines)
  where
    parse :: [String] -> [Int]
    parse []     = []
    parse (x:xs) = read x : parse xs

    compute :: Int -> [Int] -> [Int]
    compute _ []         = []
    compute accum (x:xs) = let accum' = accum + x
                           in  accum' : compute accum' xs

    display = unlines . map show
If you run this interactively, you’ll get something like:
$ 1
1
$ 2
3
$ 3
6
$ 4
10
But you could also write compute tail-recursively, with an accumulating parameter:
main :: IO ()
main = interact (display . compute [] . parse . lines)
  where
    parse :: [String] -> [Int]
    parse = map read

    compute :: [Int] -> [Int] -> [Int]
    compute xs     []     = reverse xs
    compute []     (y:ys) = compute [y] ys
    compute (x:xs) (y:ys) = compute (x+y : x : xs) ys

    display = unlines . map show
This is an artificial example, but strict left folds are a common pattern. If, however, you write either compute or parse with an accumulating parameter, this is what you get when you run it interactively and hit EOF (control-D on Unix, control-Z on Windows) after the number 4:
$ 1
$ 2
$ 3
$ 4
1
3
6
10
This left-folded version needs to compute the entire data structure before it can read any of it. That can’t ever work on an infinite list (When would you reach the base case? How would you even reverse an infinite list if you did?) and an application that can’t respond to user input until it quits is a deal-breaker.
On the other hand, the tail-recursive version can be strict in its accumulating parameter, and will run more efficiently, especially when it's not being consumed immediately. It doesn't need to keep any thunks or context around other than its parameters, and it can even re-use the same stack frame. A strict accumulating function, such as Data.List.foldl', is a great choice whenever you're reducing a list to a value, not building an eagerly-evaluated list of output. Functions such as sum, product or any can't return any useful intermediate value. They inherently have to finish the computation first, then return the final result.
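For instance, a running-total reducer (a single value, not an interactive stream) is exactly where the strict left fold shines:

import Data.List (foldl')

-- Reduces the whole list to one value; the strict accumulator
-- means no chain of (+) thunks builds up.
total :: [Int] -> Int
total = foldl' (+) 0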

Dynamic Programming with Vectors in Haskell

I'm trying to code a kind of simple web crawler in Haskell, just for practice. To my own astonishment, neither the web request itself nor parsing the web site turned out to be complicated.
I coded the program purely functionally with a recursive function, but only some forty or fifty web requests later, the program eats up all the memory.
So I tried to do the task with dynamic programming, but here I'm totally stuck, which means I have no idea where to begin. In this tiny program I got so many errors that I'm not able to figure out where to start.
This is my current concept:
scanPage :: String -> IO (String, String, [String])
scanPage url = ....

crawler :: String -> IO [(String, Int)]
crawler startUrl = runST $ do
    toVisit <- newSTRef [startUrl] :: ST s (STRef s [String])
    visited <- newSTRef []         :: ST s (STRef s [String])
    result  <- newSTRef []         :: ST s (STRef s [(String, Int)])
    -- Iterate over urls to visit
    while (liftM not $ liftM null $ readSTRef toVisit) $ do
        url <- fmap head (readSTRef toVisit)
        (moreUrls, value_a, value_b) <- scanPage url
        -- Mark page as visited
        vis <- readSTRef visited
        writeSTRef visited (url : vis)
        -- Add results
        res <- readSTRef result
        writeSTRef result ((value_a, value_b) : res)
        -- Extend urls to visit
        nextUrls <- readSTRef toVisit
        writeSTRef toVisit (nextUrls ++ (moreUrls \\ vis))
    -- End of while
    return =<< readSTRef result

main = do
    putStrLn =<< fmap show (crawler "http://starturl.com")
I have already written a lot of programs like this with arrays, which are much more convenient, as I can simply read from and write to array elements. So I thought I could use mutable vectors for these lists, but they can't grow (at least not in the same instance) or shrink. So I ended up with simple lists in an STRef.
The first line I can't get to work is the line with the while command. I wrote my own while function like this:
while :: (Monad m) => m Bool -> m a -> m ()
while cond action = do
    c <- cond
    when c $ do
        action
        while cond action
because I couldn't find any other while command. I googled for many days about mutable vectors, but was not able to find a single tutorial or even an example that I could use here. Please, can anyone tell me how to write a syntactically correct crawler function? Yes, a pure functional solution would be nicer and more "haskellish", but I still consider myself a beginner, and all this monad stuff is still a bit strange for me. I'm willing to learn, but a hint or even an example would be really awesome.
EDIT:
Here is some pseudocode of my messy code.
toVisitList = startURL
visitedList = []
resultList = []
while (length toVisitList /= 0) {
    url = head toVisitList             -- Get the 1st element
    toVisitList -= url                 -- Remove this url from the list
    visitedList += url                 -- Append url to visitedList
    (moreUrls, val_a, val_b) = scanPage url
    resultList += (val_a, val_b)       -- Append the result
    toVisitList += (moreUrls - visitedList)
}
return resultList
EDIT:
I still haven't any clue how to put this pseudocode into real code, especially the while statement. Any hints appreciated.
The natural data structure for your toVisitList is a queue. There are a few implementations of queues around, but for this purpose, the simplest thing is to just use Data.Sequence.Seq. This lets you add things to the end with |> or <>, and to view the beginning with viewl. Consider something like
crawlOnce :: Seq Url -> [Url] -> IO (Either [Url] (Seq Url, [Url]))
crawlOnce toVisitList visitedList uses viewl to look at the front of the list of URLs to visit. If it's empty, it returns Left visitedList. Otherwise, it visits the first URL, appends it to the visited list, and adds the newly discovered URLs to the list to visit, then wraps them up in Right.
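Here's a minimal sketch of that (untested and mine, not the answerer's; scanPage's shape is assumed from the question, and the per-page results are ignored for brevity):

import Data.Foldable (foldl')
import Data.Sequence (Seq, ViewL (..), viewl, (|>))

type Url = String

-- Assumed interface, loosely based on the question's scanPage.
scanPage :: Url -> IO ([Url], String, String)
scanPage = undefined

crawlOnce :: Seq Url -> [Url] -> IO (Either [Url] (Seq Url, [Url]))
crawlOnce toVisitList visitedList =
    case viewl toVisitList of
        EmptyL -> return (Left visitedList)
        url :< rest -> do
            (moreUrls, _valA, _valB) <- scanPage url
            let new          = filter (`notElem` (url : visitedList)) moreUrls
                toVisitList' = foldl' (|>) rest new
            return (Right (toVisitList', url : visitedList))

Driving it to completion is then a small loop that recurses on Right and returns on Left.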
There are several reasonable variations. For instance, you could go for a type like ExceptT [Url] (StateT (Seq Url, [Url]) IO) a that "throws" its final result.
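A sketch of that variation (again untested and mine), reusing Url, scanPage, and the Data.Sequence imports from the sketch above:

import Control.Monad (forever)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)
import Control.Monad.Trans.State (StateT, evalStateT, get, put)
import qualified Data.Sequence as Seq

type Crawl = ExceptT [Url] (StateT (Seq Url, [Url]) IO)

-- One step of the crawl; "throws" the visited list once the queue
-- of urls to visit is empty.
step :: Crawl ()
step = do
    (toVisit, visited) <- lift get
    case viewl toVisit of
        EmptyL -> throwE visited
        url :< rest -> do
            (moreUrls, _, _) <- lift (lift (scanPage url))
            let new = filter (`notElem` (url : visited)) moreUrls
            lift (put (foldl' (|>) rest new, url : visited))

crawl :: Url -> IO [Url]
crawl start =
    either id (const [])  -- forever never finishes, so Right can't happen
        <$> evalStateT (runExceptT (forever step)) (Seq.singleton start, [])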

Reading files with references to other files in Haskell

I am trying to expand regular markdown with the ability to have references to other files, such that the content in the referenced files is rendered at the corresponding places in the "master" file.
But the furthest I've come is to implement
createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) =
    ifNExists f (_id f)
        (do childStrings <- mapM createF children
            withFile (_path f) ReadMode $ \handle -> do
                fc <- lines <$> hGetContents handle
                return $ merge fc childStrings)
ifNExists is just a helper that can be ignored; the real problem happens in the reading of the handle: it just returns the empty string. I assume this is due to lazy IO.
I thought that using withFile filepath ReadMode $ \handle -> {- do stuff -} hGetContents handle would be the right solution, as I've read that fcontent <- withFile filepath ReadMode hGetContents is a bad idea.
Another thing that confuses me is that the function
createFT :: File -> IO FTree
createFT f =
    ifNExists f Null
        (withFile (_path f) ReadMode $ \handle -> do
            let thisParse = fparse (_id f : _parents f)
            children <- rights . map (thisParse . trim) . lines <$> hGetContents handle
            c <- mapM createFT children
            return $ Node f c)
works like a charm.
So why does createF return just an empty string?
The whole project and a directory/file to test can be found on GitHub.
Here are the datatype definitions
type ID = String

data File = File { _id :: ID, _path :: FilePath, _parents :: [ID] }
    deriving (Show)

data FTree = Null
           | Node { _file     :: File
                  , _children :: [FTree] }
    deriving (Show)
As you suspected, lazy IO is probably the problem. Here's the (awful) rule you have to follow to use it properly without going totally nuts:
A withFile computation must not complete until all (lazy) I/O required to fully evaluate its result has been performed.
If something forces I/O after the handle is closed, you are not guaranteed to get an error, even though that would be very nice. Instead, you get completely undefined behavior.
You break this rule with return $ merge fc childStrings, because this value is returned before it's been fully evaluated. What you can do instead is something vaguely like
let retVal = merge fc childStrings
deepseq retVal $ return retVal
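Dropped into the question's createF, that would look something like this (untested):

import Control.DeepSeq (deepseq)

createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) =
    ifNExists f (_id f)
        (do childStrings <- mapM createF children
            withFile (_path f) ReadMode $ \handle -> do
                fc <- lines <$> hGetContents handle
                let retVal = merge fc childStrings
                -- Force the merged result before withFile closes the handle.
                retVal `deepseq` return retVal)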
An arguably cleaner alternative is to put all the rest of the code that relies on those results into the withFile argument. The only real reason not to do that is if you do a bunch of other work with the results after you're finished with that file. For example, if you're processing a bunch of different files and accumulating their results, then you want to be sure to close each of them when you're done with it. If you're just reading in one file and then acting on it, you can leave it open till you're finished.
By the way, I just submitted a feature request to the GHC team to see if they might be willing to make these kinds of programs more likely to fail early with useful error messages.
Update
The feature request was accepted, and such programs are now much more likely to produce useful error messages. See What caused this "delayed read on closed handle" error? for details.
I'd strongly suggest you avoid lazy IO, as it always creates problems like this, as described in What's so bad about Lazy I/O? Your case is exactly that: you need to keep the file open until it's fully read, but that would mean closing the file somewhere in pure code, at the point where the content is actually consumed.
One possibility would be to use strict ByteStrings and read files using readFile. This would also make many operations more efficient.
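For example, a strict replacement for the lines <$> hGetContents handle part might look like this (a sketch; error handling omitted):

import qualified Data.ByteString.Char8 as BS

-- The whole file is read before the handle is closed, so there is
-- no lazy-IO hazard and no withFile bookkeeping to get wrong.
readLinesStrict :: FilePath -> IO [String]
readLinesStrict path = map BS.unpack . BS.lines <$> BS.readFile path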
Another option would be to use one of the libraries that address the lazy IO problem (see What are the pros and cons of Enumerators vs. Conduits vs. Pipes?). These libraries allow you to separate content production from its processing or consumption. So you could have a producer that reads input files and produces a stream of some tokens, and a pure consumer (not depending on IO) that consumes the stream and produces some result. For example, in conduit-extra there is a module that converts an attoparsec parser into a consumer.
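A small sketch of that last option (mine, untested), using conduit and the Data.Conduit.Attoparsec module from conduit-extra:

import Conduit (runConduitRes, sourceFile, (.|))
import qualified Data.Attoparsec.ByteString as A
import Data.Conduit.Attoparsec (sinkParser)

-- Stream a file from disk through an attoparsec parser; the file
-- is opened, read in chunks, and closed by the conduit machinery.
parseFileWith :: A.Parser a -> FilePath -> IO a
parseFileWith p path = runConduitRes (sourceFile path .| sinkParser p)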
See also Is there a better way to walk a directory tree?

spanM, takeWhileM in Haskell

I have the following problem: given a [String] and a String -> IO Int, I can make a transformation (map) and get [IO Int]. Now I have to do two things: perform those actions from the start until a result is positive, and I need to know whether the whole list was processed.
I am forbidden to process anything after the first non-positive result.
takeWhileM does not answer the second question (length comparison is too impractical), and spanM performs the forbidden IO.
Of course, I can write a recursive function myself, but I want to do it the Haskell way, with all the goodness of higher-order functions.
Suggestions? Should I perhaps use a completely different approach?
The task above is a slightly simplified task from my project.
You can use allM from the monad-loops package:
Prelude Control.Monad.Loops> let xs = ["a", "bb", "ccc", "dddd", "eeeee"]
Prelude Control.Monad.Loops> let f x = putStrLn x >> return (length x)
Prelude Control.Monad.Loops> let p x = x < 2
Prelude Control.Monad.Loops> allM (fmap p . f) xs
a
bb
False
There's also an allM in Control.Monad.ListM, but it's not appropriately lazy—it will continue to perform computations after you hit a positive result.
(I'm with you on this, by the way—I hate writing one-off recursive functions.)
I'm not familiar with the functions takeWhileM and spanM (and neither is Hoogle) (edit: as per the comment, they can be found in Control.Monad.ListM).
Given that, I think the best thing for you to do is to make a one-off function to perform this task. If it later turns out that you need to write code to do something similar, then you can factor out the common parts and re-use them. There's nothing wrong with writing one-off code in general; it's code duplication that's bad.
There are a few ways to write the function you want - one possible way is like this:
process :: [IO Int] -> IO Bool
process []     = return True
process [a]    = a >> return True
process (a:as) = do
    n <- a
    if n > 0
      then return False
      else process as
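Wiring it back to the original problem (the hypothetical check stands for the given String -> IO Int):

check :: String -> IO Int
check = undefined  -- the String -> IO Int you were given

processedAll :: [String] -> IO Bool
processedAll = process . map check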
@illusionoflife: I don't see how using takeWhileM would improve on @Chris's solution.
For example:
import Control.Monad.ListM

process :: [IO Int] -> IO Bool
process as = do
    taken <- takeWhileM (>>= return . (<= 0)) as
    return (length taken >= length as - 1)
(Code not verified!)
@Chris's looks more readable, among other things because in his solution we don't need to figure out whether we should use >= or ==. Besides, since I call length, we can't use it on an infinite input list.
