I have a set of problems that I would like to evaluate in parallel. These problems are expressed using a simple expression type very similar to this:
-- Expressions are either a constant value or two expressions
-- combined using a certain operation
data Expr
  = Const NumType
  | Binary BinOp Expr Expr

-- The possible operations
data BinOp = Add | Sub | Mul | Div
  deriving (Eq)
These expressions are built on the fly and should evaluate to a certain result which may be valid or invalid. This is expressed as a monad to stop computation when encountering invalid results.
data Result a
  = Val { val :: a }
  | Exc { exc :: String }

instance Monad Result where
  return = Val
  (Exc e) >>= _ = (Exc e)
  (Val v) >>= g = g v
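(Note: with a modern GHC the Monad instance above won't compile without Functor and Applicative instances; a minimal version would be:)
instance Functor Result where
  fmap f (Val v) = Val (f v)
  fmap _ (Exc e) = Exc e

instance Applicative Result where
  pure = Val
  Exc e <*> _ = Exc e
  Val f <*> r = fmap f r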
To determine the value of each solved problem I have two relevant functions:
eval :: Expr -> Result NumType
score :: Expr -> NumType
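For reference, eval is where the Result monad does its short-circuiting; a minimal sketch (assuming NumType is an integer type and that division by zero is the only invalid case) might look like this:
-- Sketch only: NumType is assumed to be an integral type here.
eval :: Expr -> Result NumType
eval (Const n) = Val n
eval (Binary op a b) = do
  x <- eval a
  y <- eval b
  case op of
    Add -> Val (x + y)
    Sub -> Val (x - y)
    Mul -> Val (x * y)
    Div
      | y == 0    -> Exc "division by zero"
      | otherwise -> Val (x `div` y)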
And finally I have solve functions that will return a [Expr]. This leads to my main function currently looking like this:
main :: IO ()
main = do
  strAvailableNumbers <- getLine
  strTargetNumber <- getLine
  let numbers = parseList strAvailableNumbers
      target  = parseTargetNumber strTargetNumber in
    sequence $ map print $
      solveHeuristic1 (Problem target numbers) [Add] [Sub] ++
      solveHeuristic2 (Problem target numbers)
  return ()
The basic idea is that I read a list of numbers and a target number from stdin and then print expressions on stdout.
But I have two problems that I would like to solve and I am not quite sure how related they are:
Those heuristics run entirely unaware of each other and therefore don't know whether the score of their solution is higher than any other. I would like to introduce some kind of state into the map function so that a new Expr is only printed if its score is higher than that of the Expr printed previously.
I would like to do these computations in parallel, and attempted to do so by using (parMap rseq) instead of map, compiling with the -threaded option and running with +RTS -N2. The result is a runtime increase from 5 seconds to 7 seconds, which is not what I expected, although time shows the CPU utilization is higher. I guess I am not using parMap correctly, or am doing something wrong by using ++. So how would I run a list of independent functions, each returning a list of elements, in parallel?
Update: Created a gist with complete source code.
The problem here is that evaluating an IO action with seq does approximately nothing. So you're just running things sequentially with slightly more overhead.
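To see why, here is a toy illustration (not the original code; fib is just a stand-in for expensive work). The sparks created by parMap rseq print only reduce each IO-action thunk to weak head normal form, which is cheap and never touches the expensive computation inside it; that work only happens later, when the actions are actually run, one after another:
import Control.Parallel.Strategies (parMap, rseq)

-- Deliberately slow stand-in for real work.
fib :: Integer -> Integer
fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

-- Each spark evaluates a `print (fib n)` thunk to WHNF only,
-- which does not force the fib call at all.
actions :: [IO ()]
actions = parMap rseq print [fib 30, fib 31, fib 32]

main :: IO ()
main = sequence_ actions  -- the fib calls are forced here, sequentially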
You can refactor things to make them pure again:
main :: IO ()
main = do
  mapM_ (`seq` print "found it") -- make sure we're not
                                 -- benchmarking printing stuff
    . concat
    . parMap rdeepseq (solve [1..10000000])
    $ [42, 42]
  return ()
And add instances of NFData so we can use rdeepseq, which will fully evaluate things:
instance NFData BinOp -- BinOp is just an enum, WHNF = NF

instance NFData Expr where
  rnf (Const a) = a `deepseq` ()
  rnf (Binary b e1 e2) = b `deepseq` e1 `deepseq` e2 `deepseq` ()
And now if we run it we get... a stack overflow. I bumped up the size of the search space enough to make the program run long enough to be worth benchmarking, and now fully loading both structures into memory blows the stack. Bumping the stack size up to the point where nothing blows up leaves us running 40% faster (3 vs 5 seconds) with -N2 than without, which I would consider the expected result. Visually, when running this, I can see 2 cores briefly jump up to 100%.
Final compilation sequence
> ghc -O2 -threaded -rtsopts bench.hs
> ./bench +RTS -K10000000 -N2
I'm relatively new to Polysemy, and I'm trying to wrap my head around how to use NonDet correctly. Specifically, let's say I've got this computation
generate :: Member NonDet r => Sem r Int
generate = msum $ fmap pure [0..]
computation :: (Member NonDet r, Member (Final IO) r) => Sem r ()
computation = do
  n <- generate
  guard (n == 100)
  embedFinal $ print n
It's a horribly inefficient way to print the number 100, but it demonstrates the problem I'm having. Now, I want to run this effect only insofar as to get the first success. That is, I want to run this effect long enough to "find" the number 100 and print it, and then I want to stop.
My first attempt
attempt1 :: IO ()
attempt1 = void . runFinal . runNonDet @[] $ computation
This one fails to short-circuit. It prints 100 but then hangs forever, looking for the number 100 again. That makes sense; after all, I didn't actually tell it I only wanted one solution. So let's try that.
My second attempt
runNonDetOnce :: Sem (NonDet ': r) a -> Sem r (Maybe a)
runNonDetOnce = fmap listToMaybe . runNonDet
attempt2 :: IO ()
attempt2 = void . runFinal . runNonDetOnce $ computation
All we're doing here is discarding all but the head of the list. Understandably, this didn't change anything. Haskell already wasn't evaluating the list, so discarding an unused value changes nothing. Like attempt1, this solution hangs forever after printing 100.
My third attempt
attempt3 :: IO ()
attempt3 = void . runFinal . runNonDetMaybe $ computation
So I tried using runNonDetMaybe. This one, unfortunately, just exits without printing anything. Figuring out why that is took a bit, but I have a theory. The documentation says
Unlike runNonDet, uses of <|> will not execute the second branch at all if the first option succeeds.
So it's greedy and doesn't backtrack after success, basically. Thus, it runs my computation like this.
computation = do
  n <- generate        -- Ah yes, n = 0. Excellent!
  guard (n == 100)     -- Wait, 0 /= 100! Failure! We can't backtrack, so abort.
  embedFinal $ print n
Non-Solutions
In this small example, we could just alter the computation a bit, like so
computation :: (Member NonDet r, Member (Final IO) r) => Sem r ()
computation = msum $ fmap (\n -> guard (n == 100) >> embedFinal (print n)) [0..]
So rather than generate a number and then check it later, we simply move generate inside of computation. With this computation, attempt3 succeeds, since we can get to the "correct" answer without backtracking. This works in this small example, but it's infeasible for a larger codebase. Unless someone has a good systematic way of avoiding backtracking, I don't see a good way to generalize this solution to computations that span over multiple files in a large program.
The other non-solution is to cheat using IO.
computation :: (Member NonDet r, Member (Final IO) r) => Sem r ()
computation = do
  n <- generate
  guard (n == 100)
  embedFinal $ print n
  embedFinal $ exitSuccess
Now attempt1 and attempt2 succeed, since we simply forcibly exit the program after success. But, aside from feeling incredibly sloppy, this doesn't generalize either. I want to stop running the current computation after finding 100, not the whole program.
So, to summarize, I want the computation given in the first code snippet above to be run using Polysemy in some way that causes it to backtrack (in NonDet) until it finds one successful value (in the example above, n = 100) and then stop running side effects and end the computation. I tried delving into the source code of runNonDetMaybe and co in this hopes of being able to reproduce something similar to it that has the effect I want, but my Polysemy skills are not nearly to the level of understanding all of the Weaving and decomp shenanigans happening there. I hope someone here who has more expertise with this library than I do can point me in the right direction to running NonDet with the desired effects.
Now attempt1 and attempt2 succeed, since we simply forcibly exit the program after success. But, aside from feeling incredibly sloppy, this doesn't generalize either. I want to stop running the current computation after finding 100, not the whole program.
Rather than exitSuccess, a closely related idea is to throw an exception that you can catch in the interpreter.
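Here is a minimal sketch of that idea. It reuses the generate from the question; the FoundIt type and attempt4 are made-up names, and for simplicity the exception is caught around runFinal in plain IO rather than inside a dedicated interpreter:
{-# LANGUAGE DataKinds, FlexibleContexts, TypeApplications #-}
import Control.Exception (Exception, throwIO, try)
import Control.Monad (guard, msum)
import Polysemy
import Polysemy.Final
import Polysemy.NonDet

-- Hypothetical exception used only to carry the first success out of the search.
newtype FoundIt = FoundIt Int deriving Show
instance Exception FoundIt

generate :: Member NonDet r => Sem r Int
generate = msum $ fmap pure [0..]

computation' :: (Member NonDet r, Member (Final IO) r) => Sem r ()
computation' = do
  n <- generate
  guard (n == 100)
  embedFinal $ print n
  embedFinal $ throwIO (FoundIt n)  -- abort the enumeration, not the whole program

attempt4 :: IO ()
attempt4 = do
  r <- try @FoundIt . runFinal . runNonDet @[] $ computation'
  case r of
    Left (FoundIt _) -> pure ()  -- first success found; stop here
    Right _          -> pure ()  -- the search finished without a success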
main = do
  input <- sequence [getLine, getLine, getLine]
  mapM_ print input
Let's see this program in action:
m@m-X555LJ:~$ runhaskell wtf.hs
asdf
jkl
powe
"asdf"
"jkl"
"powe"
Surprisingly to me, there seems to be no laziness here. Instead, all 3 getLines are evaluated eagerly, the read values are stored in memory and then, not before, all are printed.
Compare to this:
main = do
  input <- fmap lines getContents
  mapM_ print input
Let's see this in action:
m@m-X555LJ:~$ runhaskell wtf.hs
asdf
"asdf"
lkj
"lkj"
power
"power"
Totally different stuff. Lines are read one by one and printed one by one. Which is odd to me because I don't really see any differences between these two programs.
From LearnYouAHaskell:
When used with I/O actions, sequenceA is the same thing as sequence!
It takes a list of I/O actions and returns an I/O action that will
perform each of those actions and have as its result a list of the
results of those I/O actions. That's because to turn an [IO a] value
into an IO [a] value, to make an I/O action that yields a list of
results when performed, all those I/O actions have to be sequenced so
that they're then performed one after the other when evaluation is
forced. You can't get the result of an I/O action without performing
it.
I'm confused. I don't need to perform ALL IO actions to get the results of just one.
A few paragraphs earlier the book shows a definition of sequenceA:
sequenceA :: (Applicative f) => [f a] -> f [a]
sequenceA [] = pure []
sequenceA (x:xs) = (:) <$> x <*> sequenceA xs
Nice recursion; nothing here hints to me that this recursion should not be lazy. Just like in any other recursion, to get the head of the returned list Haskell doesn't have to go through ALL the steps of the recursion!
Compare:
rec :: Int -> [Int]
rec n = n:(rec (n+1))
main = print (head (rec 5))
In action:
m@m-X555LJ:~$ runhaskell wtf.hs
5
m@m-X555LJ:~$
Clearly, the recursion here is performed lazily, not eagerly.
Then why is the recursion in the sequence [getLine, getLine, getLine] example performed eagerly?
As to why it is important that IO actions are run in order
regardless of the results: Imagine an action createFile :: IO () and
writeToFile :: IO (). When I do a sequence [createFile,
writeToFile] I'd hope that they're both done and in order, even
though I don't care about their actual results (which are both the
very boring value ()) at all!
I'm not sure how this applies to this Q.
Maybe I'll word my Q this way...
In my mind this:
do
  input <- sequence [getLine, getLine, getLine]
  mapM_ print input
should reduce to something like this:
do
  input <- do
    input <- concat ( map (fmap (:[])) [getLine, getLine, getLine] )
    return input
  mapM_ print input
Which, in turn, should reduce to something like this (pseudocode, sorry):
do
  [ perform print on the result of getLine,
    perform print on the result of getLine,
    perform print on the result of getLine
  ] and discard the results of those prints since print was applied with mapM_ which discards the results unlike mapM
getContents is lazy, getLine isn't. Lazy IO isn't a feature of Haskell per se, it's a feature of some particular IO actions.
I'm confused. I don't need to perform ALL IO actions to get the results of just one.
Yes you do! That is one of the most important features of IO, that if you write a >> b or equivalently,
do a
   b
then you can be sure that a is definitely "run" before b (see footnote). getContents is actually the same, it "runs" before whatever comes after it... but the result it returns is a sneaky result that sneakily does more IO when you try to evaluate it. That is actually the surprising bit, and it can lead to some very interesting results in practice (like the file you're reading the contents of being deleted or changed while you're processing the results of getContents), so in practical programs you probably shouldn't be using it, it mostly exists for convenience in programs where you don't care about such things (Code Golf, throwaway scripts or teaching for instance).
As to why it is important that IO actions are run in order regardless of the results: Imagine an action createFile :: IO () and writeToFile :: IO (). When I do a sequence [createFile, writeToFile] I'd hope that they're both done and in order, even though I don't care about their actual results (which are both the very boring value ()) at all!
Addressing the edit:
should reduce to something like this:
do
  input <- do
    input <- concat ( map (fmap (:[])) [getLine, getLine, getLine] )
    return input
  mapM_ print input
No, it actually turns into something like this:
do
  input <- do
    x <- getLine
    y <- getLine
    z <- getLine
    return [x,y,z]
  mapM_ print input
The actual definition of sequence is more or less this:
sequence [] = return []
sequence (a:as) = do
  x <- a
  fmap (x:) $ sequence as
Technically, in
sequenceA (x:xs) = (:) <$> x <*> sequenceA xs
we find <*>, which first runs the action on the left, then the action on the right, and finally combines their results. This is what makes the first effect in the list occur first, and so on.
Indeed, on monads, f <*> x is equivalent to
do theF <- f
   theX <- x
   return (theF theX)
More generally, note that IO actions are, as a rule, executed in order, first to last (see below for a few rare exceptions). Doing IO in a completely lazy way would be a nightmare for the programmer. For instance, consider:
do let aX = print "x" >> return 4
       aY = print "y" >> return 10
   x <- aX
   y <- aY
   print (x+y)
Haskell guarantees that the output is x y 14, in that order. If we had completely lazy IO we could also get y x 14, depending on which argument is forced first by +. In such case, we would need to know exactly the order in which the lazy thunks are demanded by every operation, which is something the programmer definitely does not want to care about. Under such detailed semantics, x + y is no longer equivalent to y + x, breaking equational reasoning in many cases.
Now, if we wanted to force IO to be lazy we could use one of the forbidden functions, e.g.
do let aX = unsafeInterleaveIO (print "x" >> return 4)
       aY = unsafeInterleaveIO (print "y" >> return 10)
   x <- aX
   y <- aY
   print (x+y)
The above code makes aX and aY lazy IO actions, and the order of the output is now at the whim of the compiler and the library implementation of +. This is in general dangerous, hence the unsafeness of lazy IO.
Now, about the exceptions. Some IO actions which only read from the environment, like getContents, were implemented with lazy IO (unsafeInterleaveIO). The designers felt that lazy IO can be acceptable for such reads, and that the precise timing of the reads is not that important in many cases.
Nowadays, this is controversial. While it can be convenient, lazy IO can be too unpredictable in many cases. For instance, we can't know where the file will be closed, and that could matter if we're reading from a socket. We also need to be very careful not to force the reads too early: that often leads to a deadlock when reading from a pipe. Today, it is usually preferred to avoid lazy IO, and resort to some library like pipes or conduit for "streaming"-like operations, where there is no ambiguity.
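For instance, here is how the interactive echo above might look with conduit instead of getContents (a minimal sketch, assuming the conduit package; the C-suffixed combinators are the ones re-exported by its Conduit module):
import Conduit

-- Echo stdin line by line: each line is read, printed and released
-- immediately, with no lazy IO involved.
main :: IO ()
main = runConduit $ stdinC .| linesUnboundedAsciiC .| mapM_C print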
I'm trying to get performance increases in a program I have that parses XML. The program can parse multiple XML files so I thought that I could make this run in parallel, but all my attempts have resulted in lower performance!
For XML parsing, I am using HXT.
I have a run function defined like this:
run printTasks xs = pExec xs >>= return . concat >>= doPrint printTasks 1
'pExec' is given a list of file names and is defined as:
pExec xs = do
  ex <- mapM exec xs
  as <- ex `usingIO` parList rdeepseq
  return as
where 'exec' is defined as:
exec = runX . process
Threadscope shows only one thread ever being used (until the very end).
Can anyone explain why I have failed so miserably to parallelise this code?
In case it helps:
exec :: FilePath -> [CV_scene]
pExec :: [FilePath] -> IO [[CV_scene]]
data CV_scene = Scene [CV_layer] Time deriving (Show)
data CV_layer = Layer [DirtyRects] SourceCrop deriving (Show)
data Rect = Rect Int Int Int Int deriving (Show) -- Left Top Width Height

instance NFData CV_scene where
  rnf = foldScene reduceScene
    where reduceScene l t = rnf (seq t l)

instance NFData CV_layer where
  rnf = foldLayer reduceLayer
    where reduceLayer d s = rnf (seq s d)

instance NFData Rect where
  rnf = foldRect reduceRect
    where reduceRect l t w h = rnf [l,t,w,h]
type SourceCrop = Rect
type DirtyRect = Rect
type Time = Int64
Thanks in advance for your help!
First, it looks like you mislabeled the signature of exec, which should probably be:
exec :: FilePath -> IO [CV_scene]
Now for the important part. I've commented inline on what I think you think is going on.
pExec xs = do
  -- A. Parse the file found at each location via exec.
  ex <- mapM exec xs
  -- B. Force the lazy parsing in parallel.
  as <- ex `usingIO` parList rdeepseq
  return as
Note that line A does not happen in parallel, which you might think is okay since it will just set up the parsing thunks, which are then forced in parallel in B. This is a fair assumption, and a clever use of laziness, but the results call that into question for me.
I suspect that the implementation of exec forces most of the parsing before line B is even reached, so that the deep seq doesn't do much. That fits pretty well with my experience of parsing, and the profiling supports that explanation.
Without the ability to test your code, I can only make the following suggestions. First try separating the parsing of the file from the IO and put the parsing in the parallel execution strategy. In that case lines A and B become something like:
  ex <- mapM readFile xs
  as <- map exec' ex `usingIO` parList rdeepseq
with exec' the portion of exec that runs after the file has been read from disk:
exec' :: String -> [CVScene]
Also, you may not even need rdeepseq after this change.
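If the strategies plumbing gets in the way, a roughly equivalent sketch using parMap would be (assuming exec' :: String -> [CVScene] as above; note that readFile is lazy, so a strict read such as Data.ByteString.readFile may be preferable in practice):
import Control.Parallel.Strategies (parMap, rdeepseq)

pExec' :: [FilePath] -> IO [[CVScene]]
pExec' xs = do
  contents <- mapM readFile xs             -- sequential IO: read each file
  return (parMap rdeepseq exec' contents)  -- pure parsing, one spark per file, fully forced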
As an alternative, you can do the IO and parsing in parallel using Software Transactional Memory. STM approaches are normally used for separate IO threads which act more like services, rather than pure computations. But if for some reason you can't get the strategies-based approach to work, this might be worth a try.
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TChan -- (from the stm package)
import Control.Concurrent (forkIO)

pExec'' :: [FilePath] -> IO [[CVScene]]
pExec'' xs = do
  -- A. Create a channel per file: [(FilePath, TChan [CVScene])]
  tcx <- mapM (\x -> (x,) <$> newTChanIO) xs  -- (x,) needs TupleSections
  -- B. Do the reading/parsing in separate threads
  mapM_ (forkIO . exec'') tcx
  -- C. Collect the results
  cvs <- mapM (atomically . readTChan . snd) tcx
  return cvs

exec'' :: (FilePath, TChan [CVScene]) -> IO ()
exec'' (x, tch) = do
  -- D. The original exec function
  cv <- exec x
  -- E. Put the result on the channel (FIFO buffer)
  atomically $ writeTChan tch cv
Good luck!
Consider the following code that is supposed to print out random numbers:
import System.Random.Mersenne
main =
  do g <- (newMTGen Nothing)
     xs <- (randoms g) :: IO [Double]
     mapM_ print xs
When run, I get a segmentation fault error. That is unsurprising, since the function 'randoms' produces an infinite list. Suppose I wanted to print out only the first ten values of xs. How could I do that? xs has type IO [Double], and I think I want a variable of type [IO Double]. What operators exist to convert between the two?
If you get a segmentation fault error, and you didn't use the FFI or any functions with unsafe in their name, that's surprising in any situation! It means there's a bug either in GHC or in a library you're using that is doing something unsafe.
Printing out an infinite list of Doubles with mapM_ print is perfectly fine; the list will be processed incrementally and the program should run with constant memory usage. I suspect there is a bug in the System.Random.Mersenne module you're using, a bug in the C library it's based on, or a problem with your computer (such as faulty RAM).[1] Note that newMTGen comes with this warning:
Due to the current SFMT library being vastly impure, currently only a single generator is allowed per-program. Attempts to reinitialise it will fail.
You might be better off using the provided global MTGen instead.
That said, you can't convert IO [Double] into [IO Double] in that way; there's no way to know how long the resulting list would be without executing the IO action, which is impossible, since you have a pure result (albeit one that happens to contain IO actions). For infinite lists, you could write:
desequence :: IO [a] -> [IO a]
desequence = desequence' 0
  where
    desequence' n m = fmap (!! n) m : desequence' (n+1) m
But every time you execute an action in this list, the IO [a] action would be executed again; it'd just discard the rest of the list.
The reason randoms can work and return an infinite list of random numbers is because it uses lazy IO with unsafeInterleaveIO. (Note that, despite the "unsafe" in the name, this one can't cause segfaults, so something else is afoot.)
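As a toy illustration of that trick (not the actual implementation of randoms), here is an IO action that returns an infinite list whose elements are only produced when they are demanded:
import System.IO.Unsafe (unsafeInterleaveIO)

lazyNats :: IO [Integer]
lazyNats = go 0
  where
    go n = unsafeInterleaveIO $ do
      rest <- go (n + 1)
      return (n : rest)

main :: IO ()
main = do
  xs <- lazyNats
  mapM_ print (take 5 xs)  -- only the first five elements are ever built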
[1] Other, less likely possibilities include a miscompilation of the C library or a bug in GHC.
Suppose I wanted to print out only the first ten values of xs. How could I do that?
Just use take:
main =
  do g <- (newMTGen Nothing)
     xs <- (randoms g) :: IO [Double]
     mapM_ print $ take 10 xs
You wrote
xs has type IO [Double]
But actually, randoms g has type IO [Double]; thanks to the do notation, xs has type [Double], so you can just apply take 10 to it.
You could also skip the binding using liftM:
main =
  do g <- newMTGen Nothing
     ys <- liftM (take 10) $ randoms g :: IO [Double]
     mapM_ print ys
Greetings,
I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.
EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'
Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.
(NB the source file is tab-separated lines)
Thanks in advance,
Kevin
module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad

type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
  liftM (map importRecord . BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
  putStrLn $ "Done" ++ (show $ last jrs)
  hFlush stdout
  x <- getLine
  return ()
  -- (***)
  loopInput jrs

main = do
  jrs <- importRecords "c:\\downloads\\lcg1m.txt"
  loopInput jrs
Your call to last forces the list jrs. To compute it, the program must run through the entire file, building up a thunk for each entry in jrs. Because you aren't evaluating any element of jrs (except the last one), those thunks hold on to references to the ByteString, so it must stay in memory.
The solution is to force the evaluation of those thunks. Because we're talking about space, the first thing I did was to store your info in a smaller format:
type Year = Word16
type Month = Word8
type Day = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day
  deriving (Eq, Ord, Show, Read)
This reduces that ugly 10-byte ByteString (plus roughly 16 bytes of structure overhead) to around 8 bytes.
importRecord now has to call toRecord r to get the right type:
toRecord :: BS.ByteString -> Record
toRecord bs =
  case BS.splitWith (== '-') bs of
    (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
    _          -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack
We'll need to evaluate the data when we convert from ByteString to Record, so let's use the parallel package and define an NFData instance (NFData comes from Control.DeepSeq):
instance NFData Record where
  rnf (Rec y m d) = y `seq` m `seq` d `seq` ()
Now we're ready to go. I modified main to use evalList, thus forcing the whole list before your function that wants the last element:
main = do
  jrs <- importRecords "./tabLines"
  let jrs' = using jrs (evalList rdeepseq)
  loopInput jrs'
And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).
Sorry about that other misleading, wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around; not sure why my brain glided over that. Though I do stand by the gist: you should process this information incrementally, which would make all of this answer moot.
FYI, the huge ByteString didn't show up in those previous heap profiles I posted because foreign allocations (which include ByteString data) aren't tracked by the heap profiler.
There seem to be two questions here:
why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.
I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)
As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like
map' f [] = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)
Replace the map in importRecords with map' and see whether that reduces your memory usage.
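For concreteness, that replacement would look something like this (the same importRecords as in the question, with map swapped for map'):
importRecords :: String -> IO [Record]
importRecords filename =
  liftM (map' importRecord . BS.lines) (BS.readFile filename)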