Parallel tree search - Haskell

Let's say I have a lazy Tree whose leaves are possible solutions to a problem:
data Tree a = Node [Tree a] | Leaf (Maybe a)
I need to find just one solution (or find out that there are none).
I have a P-core machine. For reasons of both time and memory efficiency, it only makes sense to search along P different branches in parallel.
For example, suppose you have four branches of about the same computational complexity (corresponding to T seconds of CPU time), and each of them has an answer.
If you evaluate all four branches truly in parallel on a dual-core machine, then they all will finish in about 2T seconds.
If you evaluate just the first two branches and postpone the other two, then you'll get an answer in only T seconds, while also using half as much memory.
My question is, is it possible to use any of the parallel Haskell infrastructure (Par monad, parallel strategies, ...) to achieve this, or do I have to use lower-level tools like async?

Both Strategies and the Par monad will only start evaluating a new parallel task if there is a CPU available, so in your example with four branches on a 2-core machine, only two will be evaluated. Furthermore, Strategies will GC the other tasks once you have an answer (although it might take a while to do that).
However, if each of those two branches creates more tasks, then you probably want to give priority to the newer tasks (i.e., depth-first), but Strategies at least will give priority to the older tasks. The Par monad, I think, gives priority to the newer ones (but I'd have to check that); however, the Par monad will evaluate all the tasks before returning an answer, because that is how it enforces determinism.
So probably the only way to get this to work exactly as you want it, at the moment, is to write a custom scheduler for the Par monad.
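For what it's worth, plain Strategies can approximate the bounded behaviour if you flatten the tree into a lazy list of leaves and use parBuffer, which only sparks a fixed number of elements ahead of the consumer. This is just a sketch, assuming the real work happens when each leaf's Maybe is forced to WHNF; leaves and firstSolution are invented names:

import Control.Parallel.Strategies
import Data.Maybe (catMaybes, listToMaybe)

data Tree a = Node [Tree a] | Leaf (Maybe a)   -- the question's Tree type

-- Flatten the tree to a lazy list of leaf results.
leaves :: Tree a -> [Maybe a]
leaves (Leaf m)  = [m]
leaves (Node ts) = concatMap leaves ts

-- Evaluate leaves in parallel, but keep only p sparks ahead of the
-- consumer; branches past the first solution found are never demanded.
firstSolution :: Int -> Tree a -> Maybe a
firstSolution p =
  listToMaybe . catMaybes . withStrategy (parBuffer p rseq) . leaves

This still searches in the list's fixed left-to-right order rather than depth-first per core, so it is an approximation of what the question asks for, not the custom scheduler suggested above.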

The Par monad and Strategies from the parallel package only allow you to build pure, unconditional parallel systems, which look pretty in pictures like this:

    a
   / \
  b   c
   \  /\
    d e
       \
        ...

while in the general case you really need impure inter-thread communication:
solve :: Tree a -> Maybe a

smartPartition :: Tree a -> Int -> [[Tree a]]
smartPartition tree p = ... -- split the tree into fairly even chunks,
                            -- one per machine core

-- `parallel` here is e.g. from the parallel-io package
solveP :: Int -> Tree a -> IO (Maybe a)
solveP p tree = do
    resRef <- newIORef Nothing
    let work [] = return Nothing
        work (t:ts) = do
            res <- readIORef resRef
            if isJust res then return res else do
                let tRes = solve t
                if isNothing tRes then work ts else do
                    writeIORef resRef tRes
                    return tRes
    results <- parallel (map work (smartPartition tree p))
    return (msum results)
However, if your individual leaf computations are sufficiently (and roughly equally) expensive, using Strategies should not (I'm not sure) harm performance much:
partitionLeafs :: Tree a -> Int -> [[Tree a]]

solveP :: NFData a => Int -> Tree a -> Maybe a
solveP p tree = msum . map step . transpose $ partitionLeafs tree p
  where step = msum . parMap rdeepseq solve
P.S. I feel I understand the problem domain no better than you do (at least), so you likely already know all of the above. I've written this answer to develop the discussion, because the question is very interesting to me.

Related

Efficient way to "keep turning the crank" on a stateful computation

I have a stateful process that is modelled as an i -> RWS r w s a. I want to feed it an input cmds :: [i]; currently I do that wholesale:
let play = runGame theGame . go
      where
        go [] = finished
        go ((v, n):cmds) = do
            end1 <- stepWorld
            end2 <- ite (SBV.isJust end1) (return end1) $ stepPlayer (v, n)
            ite (SBV.isJust end2) (return end2) $ go cmds
I can try searching for an input of a predetermined size like this:
result <- satWith z3{ verbose = True } $ do
    cmds <- mapM sCmd [1..inputLength]
    return $ SBV.fromMaybe sFalse $ fst $ play cmds
However, this gives me horrible performance in SBV itself, i.e. before Z3 is even called (I can see that this is the case because the verbose output shows that all the time is spent before the (check-sat) call). This happens even with inputLength set to something small like 4.
However, with inputLength set to 1 or 2, the whole process is very snappy. So this makes me hope that there is a way to run SBV to get the model of the behaviour of a single step i -> s -> (s, a), and then tell the SMT solver to keep iterating that model for n different is.
So that is my question: in a stateful computation like this, where I want to feed SMT variables as input into the stateful computation, is there a way to let the SMT solver turn its crank to avoid the bad performance of SBV?
I guess a simplified "model question" would be if I have a function f :: St -> St, and a predicate p :: St -> SBool, and I want to solve for n :: SInt such that p (iterateN n f x0), what is the recommended way of doing that with SBV, supposing Mergeable St?
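For concreteness, one standard shape for this model question is to unroll the iteration up to a concrete bound and select the n-th state with a chain of ite's. The sketch below is a toy stand-in, not the actual game code: the concrete St (here just SInteger), f, p, x0, and the bound are all invented.

import Data.SBV

f :: SInteger -> SInteger   -- placeholder for f :: St -> St
f x = x + 3

p :: SInteger -> SBool      -- placeholder for p :: St -> SBool
p x = x .== 12

-- Find n with 0 <= n < bound such that p (f^n x0) holds, by bounded unrolling.
findN :: Integer -> IO SatResult
findN bound = sat $ do
  n <- sInteger "n"
  constrain $ n .>= 0 .&& n .< literal bound
  let states = zip [0 .. bound - 1] (iterate f 0)   -- x0 = 0
  -- select the n-th state with ite's (this is where Mergeable St would
  -- come in for a richer state type)
  return $ foldr (\(i, s) rest -> ite (n .== literal i) (p s) rest)
                 sFalse states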
EDIT: I've uploaded the whole code to Github but bear in mind it is not a minimalized example; in fact it isn't even very nice Haskell code yet.
Full symbolic execution
It's hard to opine without seeing full code we can execute. (Stack Overflow works best when you post code segments people can actually run.) But some of the tell-tale signs of exponential complexity are creeping into your program. Consider the following segment you posted:
go [] = finished
go ((v, n):cmds) = do
    end1 <- stepWorld
    end2 <- ite (SBV.isJust end1) (return end1) $ stepPlayer (v, n)
    ite (SBV.isJust end2) (return end2) $ go cmds
This looks like a "linear" walk if you are programming with concrete values. But keep in mind that the ite construct has to "evaluate" both branches at each step, and you have a nested if: this is why you're getting an exponential slowdown, with a factor of 4 in each iteration. As you observed, this gets out of hand pretty quickly. (One way to think about this is that SBV has to run all possible outcomes of those nested ifs in each step. You can draw the call tree to see that it grows exponentially.)
Without knowing the details of your stepWorld or stepPlayer it's hard to suggest any alternative schemes. But bottom line, you want to eliminate those calls to ite as much as possible, and push them as low in the recursive chain as you possibly can. Perhaps continuation-passing style can help, but it all depends on what the semantics of these operations are, and if you can "defer" decisions successfully.
Query mode
However, I do believe that a better approach for you to try would be to actually use SBV's query mode. In this mode, you do not symbolically simulate everything first before calling the solver. Instead, you incrementally add constraints to the solver, query for satisfiability, and based on the values you get you take different paths. I believe this approach would work the best in your situation, where you dynamically explore the "state space" but also make decisions along the way. There is an example of this in the documentation: HexPuzzle. In particular, the search function shows how you can navigate one-move-at-a-time, using the solver in the incremental mode (using push/pop).
I'm not exactly sure if this model of execution matches the logic in your game. Hopefully, it can at least give you an idea. But I've had good luck with the incremental approach in the past where you can explore such large search-spaces incrementally by avoiding having to make all of the choices before you send things to z3.
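To give a flavour of that style (a toy sketch, not your game: x and the bounds here are invented), push/pop lets you probe the solver repeatedly without rebuilding the symbolic term each time:

import Data.SBV
import Data.SBV.Control

probe :: IO ()
probe = runSMT $ do
  x <- sInteger "x"
  constrain $ x .> 0
  query $
    let tryBound b = do
          push 1                        -- save the solver state
          constrain $ x .< literal b    -- tentative extra constraint
          cs <- checkSat
          case cs of
            Sat -> getValue x >>= io . print
            _   -> io $ putStrLn ("no model below " ++ show b)
          pop 1                         -- roll back to the saved state
    in mapM_ tryBound [1, 2, 4, 8]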

Data.Sequence.Seq lazy parallel Functor instance

I need a parallel (but lazy) version of fmap for Seq from the Data.Sequence package. But the package doesn't export any of Seq's data constructors, so I can't just wrap it in a newtype and implement Functor directly for the newtype.
Can I do it without rewriting the whole package?
The best you can do is probably to splitAt the sequence into chunks, fmap over each chunk, and then append the pieces back together. Seq is represented as a finger tree, so its underlying structure isn't particularly well suited to parallel algorithms—if you split it up by its natural structure, successive threads will get larger and larger pieces. If you want to give it a go, you can copy the definition of the FingerTree type from the Data.Sequence source, and use unsafeCoerce to convert between it and a Seq. You'll probably want to send the first few Deep nodes to one thread, but then you'll have to think pretty carefully about the rest. Finger trees can be very far from weight-balanced, primarily because 3^n grows asymptotically faster than 2^n; you'll need to take that into account to balance work among threads.
There are at least two sensible ways to split up the sequence, assuming you use splitAt:
Split it all before breaking the computation into threads. If you do this, you should split it from left to right or right to left, because splitting off small pieces is cheaper than splitting off large ones and then splitting those. You should append the results in a similar fashion. (A sketch of this approach follows the list.)
Split it recursively in multiple threads. This might make sense if you want a lot of pieces or more potential laziness. Split the list near the middle and send each piece to a thread for further splitting and processing.
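Here is a sketch of the first (left-to-right) approach, using Strategies to spark one task per chunk; chunksOfSeq, parFmapSeq, and the chunk size are invented for illustration. Note that it forces each element to WHNF, so it gives up the per-element laziness the question asks for, which is part of why full generality is hard:

import Control.Parallel.Strategies
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (><))

-- Split a Seq into chunks of at most n elements, left to right.
chunksOfSeq :: Int -> Seq a -> [Seq a]
chunksOfSeq n s
  | Seq.null s = []
  | otherwise  = let (l, r) = Seq.splitAt n s in l : chunksOfSeq n r

-- fmap over each chunk in its own spark, then append the pieces back.
parFmapSeq :: Int -> (a -> b) -> Seq a -> Seq b
parFmapSeq n f s =
  foldr (><) Seq.empty
        (map (fmap f) (chunksOfSeq n s) `using` parList (evalTraversable rseq))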
There's another splitting approach that might be nicer, using the machinery currently used to implement zipWith (see the GitHub ticket I filed requesting chunksOf), but I don't know that you'd get a huge benefit in this application.
The non-strict behavior you seek seems unlikely to work in general. You can probably make it work in many or most specific cases, but I'm not too optimistic that you'll find a totally general approach.
I found a solution, but it's actually not so efficient.
-- | A combination of 'parTraversable' and 'fmap', encapsulating a common pattern:
--
-- > parFmap strat f = withStrategy (parTraversable strat) . fmap f
--
parFmap :: Traversable t => Strategy b -> (a -> b) -> t a -> t b
parFmap strat f = (`using` parTraversable strat) . fmap f
-- | Parallel version of '<$>'
(<$|>) :: Traversable t => (a -> b) -> t a -> t b
(<$|>) = parFmap rpar
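A hypothetical usage, piggybacking on the definitions above (assuming elements that are cheap to describe but expensive to evaluate):

import qualified Data.Sequence as Seq

example :: Seq.Seq Integer
example = expensive <$|> Seq.fromList [1 .. 1000]
  where expensive x = x * x   -- stand-in for real per-element work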

Why is there no implicit parallelism in Haskell?

Haskell is functional and pure, so basically it has all the properties needed for a compiler to be able to tackle implicit parallelism.
Consider this trivial example:
f = do
  a <- Just 1
  b <- Just $ Just 2
  -- ^ The above line does not use the `a` variable, so it can be safely
  --   executed in parallel with the preceding line
  c <- b
  -- ^ The above line references the `b` variable, so it can only be
  --   executed sequentially after it
  return (a, c)
  -- On exit from the monad's scope we wait for all computations to finish
  -- and gather the results
Schematically, the execution plan can be described as:

            do
            |
  +---------+---------+
  |                   |
a <- Just 1    b <- Just $ Just 2
  |                   |
  |                c <- b
  |                   |
  +---------+---------+
            |
      return (a, c)
Why is there no such functionality implemented in the compiler with a flag or a pragma yet? What are the practical reasons?
This is a long studied topic. While you can implicitly derive parallelism in Haskell code, the problem is that there is too much parallelism, at too fine a grain, for current hardware.
So you end up spending effort on book keeping, not running things faster.
Since we don't have infinite parallel hardware, it is all about picking the right granularity -- too coarse and there will be idle processors, too fine and the overheads will be unacceptable.
What we have is more coarse grained parallelism (sparks) suitable for generating thousands or millions of parallel tasks (so not at the instruction level), which map down onto the mere handful of cores we typically have available today.
Note that for some subsets (e.g. array processing) there are fully automatic parallelization libraries with tight cost models.
For background on this see Feedback Directed Implicit Parallelism, where they introduce an automated approach to the insertion of par in arbitrary Haskell programs.
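For a sense of what such a tool has to insert, here is the kind of manual par annotation (from the parallel package) that feedback-directed insertion would generate automatically; sumTwo is an invented example:

import Control.Parallel (par, pseq)

-- Spark one summation while the current thread computes the other.
sumTwo :: [Int] -> [Int] -> Int
sumTwo xs ys = sx `par` (sy `pseq` sx + sy)
  where
    sx = sum xs   -- sparked: may be picked up by an idle core
    sy = sum ys   -- evaluated by the current thread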
While your code block may not be the best example due to the implicit data dependence between the a and b, it is worth noting that these two bindings commute, in that
f = do
  a <- Just 1
  b <- Just $ Just 2
  ...

will give the same results as

f = do
  b <- Just $ Just 2
  a <- Just 1
  ...
so this could still be parallelized in a speculative fashion. It is worth noting that this does not need to have anything to do with monads. We could, for instance, evaluate all independent expressions in a let-block in parallel, or we could introduce a version of let that would do so. The lparallel library for Common Lisp does this.
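In Haskell you can sketch such a speculative "parallel let" yourself with the Eval monad; parLet here is an invented helper, not a library function:

import Control.Parallel.Strategies (runEval, rpar, rseq)

-- Evaluate both arguments (to WHNF) before applying the continuation,
-- sparking the first one speculatively.
parLet :: a -> b -> (a -> b -> c) -> c
parLet a b k = runEval $ do
  a' <- rpar a   -- may run on another core
  b' <- rseq b   -- evaluated by the current thread
  return (k a' b')

-- e.g. parLet (fib 30) (fib 31) (+)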
Now, I am by no means an expert on the subject, but this is my understanding of the problem.

A major stumbling block is determining when it is advantageous to parallelize the evaluation of multiple expressions. There is overhead associated with starting separate threads for evaluation, and, as your example shows, it may result in wasted work. Some expressions may be too small to make parallel evaluation worth the overhead. As I understand it, coming up with a fully accurate metric of the cost of an expression would amount to solving the halting problem, so you are relegated to using a heuristic approach to determine what to evaluate in parallel.
Also, it is not always faster to throw more cores at a problem. Even when explicitly parallelizing a problem with the many Haskell libraries available, you will often not see much speedup just by evaluating expressions in parallel, due to heavy memory allocation and usage and the strain this puts on the garbage collector and CPU cache. You end up needing a nice compact memory layout and to traverse your data intelligently. Having 16 threads traverse linked lists will just bottleneck you at your memory bus and could actually make things slower.

At the very least, which expressions can be effectively parallelized is something that is not obvious to many programmers (at least it isn't to this one), so getting a compiler to do it effectively is non-trivial.
Short answer: Sometimes running stuff in parallel turns out to be slower, not faster. And figuring out when it is and when it isn't a good idea is an unsolved research problem.
However, you still can be "suddenly utilizing all those cores without ever bothering with threads, deadlocks and race conditions". It's not automatic; you just need to give the compiler some hints about where to do it! :-D
One reason is that Haskell is non-strict and does not evaluate anything by default. In general the compiler does not know whether the computations of a and b terminate, hence trying to compute them eagerly could be a waste of resources:
x :: Maybe ([Int], [Int])
x = Just undefined

y :: Maybe ([Int], [Int])
y = Just (undefined, undefined)

z :: Maybe ([Int], [Int])
z = Just ([0], [1..])

a :: Maybe ([Int], [Int])
a = undefined

b :: Maybe ([Int], [Int])
b = Just ([0], map fib [0..])
  where fib 0 = 1
        fib 1 = 1
        fib n = fib (n - 1) + fib (n - 2)
Consider them with the following functions:

main1 x = case x of
    Just _  -> putStrLn "Just"
    Nothing -> putStrLn "Nothing"

The (a, b) part does not need to be evaluated: as soon as you know that x = Just _, you can proceed to the branch. Hence this works for all the values above except a.
main2 x = case x of
    Just (_, _) -> putStrLn "Just"
    Nothing     -> putStrLn "Nothing"

This function forces evaluation of the tuple, so x will terminate with an error while the rest will work.
main3 x = case x of
    Just (a, b) -> print a >> print b
    Nothing     -> putStrLn "Nothing"

This function will print the first list and then the second. It will work for z (printing an infinite stream of numbers, which Haskell can deal with). b will eventually run out of memory.
Now, in general you don't know whether a computation terminates or how many resources it will consume. Infinite lists are perfectly fine in Haskell:

main = maybe (return ()) (print . take 5 . snd) b -- prints the first 5 Fibonacci numbers
Hence spawning threads to evaluate expressions in Haskell might end up evaluating something which is not meant to be evaluated fully (say, a list of all primes) yet which the programmer uses as part of a structure. The above examples are very simple, and you might argue that the compiler could notice them; however, that is not possible in general due to the halting problem (you cannot write a program which takes an arbitrary program and its input and checks whether it terminates), so it is not a safe optimization.
In addition, as mentioned in other answers, it is hard to predict whether the overhead of an additional thread is worth it. Even though GHC doesn't spawn new OS threads for sparks, and instead uses green threads over a fixed number of kernel threads (setting aside a few exceptions), you still need to move data from one core to another and synchronize between them, which can be quite costly.
However, Haskell does have guided parallelization which doesn't break the purity of the language, via par and similar functions.
Actually there was such an attempt, though not on commodity hardware, due to the low number of cores available. The project is called Reduceron. It runs Haskell code with a high level of parallelism. If it were ever released as a proper 2 GHz ASIC core, we'd have a serious breakthrough in Haskell execution speed.

ST Monad == code smell?

I'm working on implementing the UCT algorithm in Haskell, which requires a fair amount of data juggling. Without getting into too much detail, it's a simulation algorithm where, at each "step," a leaf node in the search tree is selected based on some statistical properties, a new child node is constructed at that leaf, and the stats corresponding to the new leaf and all of its ancestors are updated.
Given all that juggling, I'm not really sharp enough to figure out how to make the whole search tree a nice immutable data structure à la Okasaki. Instead, I've been playing around with the ST monad a bit, creating structures composed of mutable STRefs. A contrived example (unrelated to UCT):
import Control.Monad
import Control.Monad.ST
import Data.STRef

data STRefPair s a b = STRefPair { left :: STRef s a, right :: STRef s b }

mkStRefPair :: a -> b -> ST s (STRefPair s a b)
mkStRefPair a b = do
    a' <- newSTRef a
    b' <- newSTRef b
    return $ STRefPair a' b'

derp :: (Num a, Num b) => STRefPair s a b -> ST s ()
derp p = do
    modifySTRef (left p) (\x -> x + 1)
    modifySTRef (right p) (\x -> x - 1)

herp :: (Num a, Num b) => (a, b)
herp = runST $ do
    p <- mkStRefPair 0 0
    replicateM_ 10 $ derp p
    a <- readSTRef $ left p
    b <- readSTRef $ right p
    return (a, b)

main = print herp -- should print (10, -10)
Obviously this particular example would be much easier to write without using ST, but hopefully it's clear where I'm going with this... if I were to apply this sort of style to my UCT use case, is that wrong-headed?
Somebody asked a similar question here a couple years back, but I think my question is a bit different... I have no problem using monads to encapsulate mutable state when appropriate, but it's that "when appropriate" clause that gets me. I'm worried that I'm reverting to an object-oriented mindset prematurely, where I have a bunch of objects with getters and setters. Not exactly idiomatic Haskell...
On the other hand, if it is a reasonable coding style for some set of problems, I guess my question becomes: are there any well-known ways to keep this kind of code readable and maintainable? I'm sort of grossed out by all the explicit reads and writes, and especially grossed out by having to translate from my STRef-based structures inside the ST monad to isomorphic but immutable structures outside.
I don't use ST much, but sometimes it is just the best solution. This can happen in many scenarios:
There are already well-known, efficient ways to solve a problem. Quicksort is a perfect example of this. It is known for its speed and in-place behavior, which cannot be imitated by pure code very well.
You need rigid time and space bounds. Especially with lazy evaluation (and Haskell doesn't even specify whether there is lazy evaluation, just that it is non-strict), the behavior of your programs can be very unpredictable. Whether there is a memory leak could depend on whether a certain optimization is enabled. This is very different from imperative code, which has a fixed set of variables (usually) and defined evaluation order.
You've got a deadline. Although the pure style is almost always better practice and cleaner code, if you are used to writing imperatively and need the code soon, starting imperative and moving to functional later is a perfectly reasonable choice.
When I do use ST (and other monads), I try to follow these general guidelines:
Use Applicative style often. This makes the code easier to read and, if you do switch to an immutable version, much easier to convert. Not only that, but Applicative style is much more compact.
Don't just use ST. If you program only in ST, the result will be no better than a huge C program, possibly worse because of the explicit reads and writes. Instead, intersperse pure Haskell code where it applies. I often find myself using things like STRef s (Map k [v]): the map itself is being mutated, but much of the heavy lifting is done purely (see the sketch after this list).
Don't remake libraries if you don't have to. A lot of code written for IO can be cleanly, and fairly mechanically, converted to ST. Replacing all the IORefs with STRefs and IOs with STs in Data.HashTable was much easier than writing a hand-coded hash table implementation would have been, and probably faster too.
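As a small illustration of the second guideline, the sketch promised above: only the ref is mutated here, while the Map update itself is ordinary pure code.

import Control.Monad.ST
import Data.STRef
import qualified Data.Map as Map

insertVal :: Ord k => STRef s (Map.Map k [v]) -> k -> v -> ST s ()
insertVal ref k v = modifySTRef' ref (Map.insertWith (++) k [v])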
One last note - if you are having trouble with the explicit reads and writes, there are ways around it.
Algorithms which make use of mutation and algorithms which do not are different algorithms. Sometimes there is a straightforward bounds-preserving translation from the former to the latter, sometimes a difficult one, and sometimes only one which does not preserve complexity bounds.
A skim of the paper reveals to me that I don't think it makes essential use of mutation -- and so I think a potentially really nifty lazy functional algorithm could be developed. But it would be a different but related algorithm to that described.
Below, I describe one such approach -- not necessarily the best or most clever, but pretty straightforward:
Here's the setup as I understand it: A) a branching tree is constructed, and B) payoffs are then pushed back from the leaves to the root, which then indicates the best choice at any given step. But this is expensive, so instead only portions of the tree are explored to the leaves, in a nondeterministic manner. Furthermore, each further exploration of the tree is determined by what's been learned in previous explorations.
So we build code to describe the "stage-wise" tree. Then, we have another data structure to define a partially explored tree along with partial reward estimates. We then have a function of type randseed -> ptree -> ptree that, given a random seed and a partially explored tree, embarks on one further exploration of the tree, updating the ptree structure as it goes. Then we can just iterate this function over an empty initial ptree to get a list of increasingly more-sampled ptrees. We can then walk this list until some specified cutoff condition is met.
So now we've gone from one algorithm where everything is blended together to three distinct steps -- 1) building the whole state tree, lazily, 2) updating some partial exploration with some sampling of a structure and 3) deciding when we've gathered enough samples.
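Concretely, the three steps might have shapes like the following; every name and representation here is invented, with stub bodies just to make the shapes concrete:

import System.Random (StdGen, split)

data PTree = PTree   -- stub: a partially explored tree with reward estimates

explore :: StdGen -> PTree -> PTree    -- step 2: one further exploration
explore _gen t = t                     -- stub body

samples :: StdGen -> PTree -> [PTree]  -- the increasingly sampled ptrees
samples gen t = t : samples g2 (explore g1 t)
  where (g1, g2) = split gen

runUntil :: ([PTree] -> Bool) -> [PTree] -> PTree   -- step 3: the cutoff
runUntil done ts@(t:rest)
  | done ts   = t
  | otherwise = runUntil done rest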
It can be really difficult to tell when using ST is appropriate. I would suggest you do it with ST and without ST (not necessarily in that order). Keep the non-ST version simple; using ST should be seen as an optimization, and you don't want to do that until you know you need it.
I have to admit that I cannot read the Haskell code. But if you use ST for mutating the tree, then you can probably replace this with an immutable tree without losing much because:
Same complexity for mutable and immutable tree
You have to mutate every node above the new leaf. An immutable tree has to replace all nodes above the modified node. So in both cases the touched nodes are the same, thus you don't gain anything in complexity.
In e.g. Java, object creation is more expensive than mutation, so maybe you can gain a bit in Haskell by using mutation. But I don't know this for sure. A small gain does not buy you much anyway, because of the next point.
Updating the tree is presumably not the bottleneck
The evaluation of the new leaf will probably be much more expensive than updating the tree. At least this is the case for UCT in computer Go.
Use of the ST monad is usually (but not always) as an optimization. For any optimization, I apply the same procedure:
Write the code without it,
Profile and identify bottlenecks,
Incrementally rewrite the bottlenecks and test for improvements/regressions.
The other use case I know of is as an alternative to the state monad. The key difference being that with the state monad the type of all of the data stored is specified in a top-down way, whereas with the ST monad it is specified bottom-up. There are cases where this is useful.
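A minimal sketch of that contrast (an invented example): with State the single state type is declared up front, while in ST each computation allocates whatever refs it needs as it goes.

import Control.Monad.State
import Control.Monad.ST
import Data.STRef

topDown :: State Int Int    -- the whole state type is fixed at the top
topDown = modify (+ 1) >> get

bottomUp :: Int             -- refs of different types, introduced ad hoc
bottomUp = runST $ do
  n <- newSTRef (0 :: Int)
  s <- newSTRef "log"       -- a second ref of an unrelated type
  modifySTRef n (+ 1)
  _ <- readSTRef s
  readSTRef n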

Using parallel strategies with monads

I often see the usage and explanation of Haskell's parallel strategies connected to pure computations (for example fib). However, I do not often see them used with monadic constructions: is there a reasonable interpretation of the effect of par and related functions when applied to ST s or IO? Would any speedup be gained from such a usage?
Parallelism in the IO monad is more correctly called "Concurrency", and is supported by forkIO and friends in the Control.Concurrent module.
The difficulty with parallelising the ST monad is that ST is necessarily single-threaded - that's its purpose. There is a lazy variant of the ST monad, Control.Monad.ST.Lazy, which in principle could support parallel evaluation, but I'm not aware of anyone having tried to do this.
There's a new monad for parallel evaluation called Eval, which can be found in recent versions of the parallel package. I recommend using the Eval monad with rpar and rseq instead of par and pseq these days, because it leads to more robust and readable code. For example, the usual fib example can be written
fib n = if n < 2 then 1 else
          runEval $ do
            x <- rpar (fib (n-1))
            y <- rseq (fib (n-2))
            return (x+y)
There are some situations where this makes sense, but in general you shouldn't do it. Examine the following:
doPar =
  let a = unsafePerformIO $ someIOCalc 1
      b = unsafePerformIO $ someIOCalc 2
  in a `par` b `pseq` a+b
In doPar, a calculation for a is sparked, then the main thread evaluates b. But it's possible that after the main thread finishes calculating b, it will begin to evaluate a as well. Now you have two threads evaluating a, meaning that some of the IO actions will be performed twice (or possibly more). But if one thread finishes evaluating a, the other will just drop what it's done so far. In order for this to be safe, you need a few things to be true:
It's safe for the IO actions to be performed multiple times.
It's safe for only some of the IO actions to be performed (e.g. there's no cleanup)
The IO actions are free of any race conditions. If one thread mutates some data when evaluating a, will the other thread also working on a behave sensibly? Probably not.
Any foreign calls are re-entrant (you need this for concurrency in general of course)
If your someIOCalc looks like this
someIOCalc n = do
  prelaunchMissiles
  threadDelay n
  launchMissiles
it's absolutely not safe to use this with par and unsafePerformIO.
Now, is it ever worth it? Maybe. Sparks are cheap, even cheaper than threads, so in theory it should be a performance gain. In practice, perhaps not so much. Roman Leshchinskiy has a nice blog post about this.
Personally, I've found it much simpler to reason about forkIO.
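For instance, a minimal forkIO "race" (a simplified sketch of what the async package provides more safely) returns whichever computation finishes first:

import Control.Concurrent

raceIO :: IO a -> IO a -> IO a
raceIO x y = do
  box <- newEmptyMVar
  t1  <- forkIO (x >>= putMVar box)
  t2  <- forkIO (y >>= putMVar box)
  r   <- takeMVar box       -- first result wins
  killThread t1             -- killing a finished thread is harmless
  killThread t2
  return r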
