Parallel "any" or "all" in Haskell - haskell

A pattern I have come across a number of times now is one where a list of values needs to be checked by mapping some test over it and seeing whether any or all of the elements pass. The typical solution is just to use the convenient built-ins any and all.
The problem is that these evaluate serially. In many cases it would be much faster to evaluate in parallel, with the process complete as soon as any thread finds a "False" for all or a "True" for any. I'm pretty sure that this short-circuiting behavior can't be implemented using Control.Parallel, as it requires inter-thread communication, and I don't understand anywhere near enough of Control.Concurrent to implement it myself yet.
It's a pretty common pattern in math (e.g. Miller-Rabin primality testing), so I feel like someone has probably come up with a solution for this already, but for obvious reasons a Google search for "parallel or/and/any/all on list haskell" doesn't return many relevant results.

In many realistic programs, you can use parallel strategies for this purpose. That's because, even though there is no explicit mechanism to cancel unneeded computations, this will happen implicitly when the garbage collector runs. As a concrete example, consider the following program:
import Control.Concurrent
import Control.Parallel.Strategies
import Data.Int
import System.Mem
lcgs :: Int32 -> [Int32]
lcgs = iterate lcg
  where lcg x = 1664525 * x + 1013904223

hasWaldo :: Int32 -> Bool
hasWaldo x = waldo `elem` take 40000000 (lcgs x)

waldo :: Int32
waldo = 0

main :: IO ()
main = do
  print $ or (map hasWaldo [1..100] `using` parList rseq)
This uses a parallel list strategy to search for waldo = 0 (which will never be found) in the output of 100 PRNG streams of 40 million numbers each. Compile and run it:
ghc -threaded -O2 ParallelAny.hs
./ParallelAny +RTS -s -N4
and it pegs four cores for about 16s, eventually printing False. Note in the statistics that all 100 sparks are "converted" and so run to completion:
SPARKS: 100(100 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Now, change waldo to a value that can be found early:
waldo = 531186389 -- lcgs 5 !! 50000
and modify main to keep the thread alive for 10 seconds:
main :: IO ()
main = do
  print $ or (map hasWaldo [1..100] `using` parList rseq)
  threadDelay 10000000
You'll observe that it prints True almost immediately, but 4 cores remain pegged at 100% CPU (at least for a little while), illustrating that unneeded computations keep running and are not short-circuited, just as you might have feared.
BUT, things change if you force a garbage collection after getting the answer:
main :: IO ()
main = do
  print $ or (map hasWaldo [1..100] `using` parList rseq)
  performGC
  threadDelay 10000000
Now, you'll see that the CPU drops to idle shortly after printing True, and the statistics show that most of the computations were garbage collected before running:
SPARKS: 100(9 converted, 0 overflowed, 0 dud, 91 GC'd, 0 fizzled)
In realistic programs, an explicit performGC will not be needed, as GCs will be performed regularly as a matter of course. Some unnecessary computations will continue to run after the answer is found, but in many realistic scenarios, the fraction of unnecessary computations will not be a particularly important factor.
In particular, if the list is large and each individual test of a list element is fast, parallel strategies will give excellent real-world performance and are easy to implement into the bargain.
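To package the idea up, here is a minimal sketch of parallel counterparts to any, all, or and and, built directly on parList; the names parAny, parAll, parOr and parAnd are made up for this example (not a library API), and the same GC-based cancellation caveats discussed above apply:
import Control.Parallel.Strategies (parList, rseq, using)

-- One spark per element; once the overall or/and result is known, the
-- remaining unneeded sparks are reclaimed at the next garbage collection.
parAny, parAll :: (a -> Bool) -> [a] -> Bool
parAny p xs = or  (map p xs `using` parList rseq)
parAll p xs = and (map p xs `using` parList rseq)

parOr, parAnd :: [Bool] -> Bool
parOr  xs = or  (xs `using` parList rseq)
parAnd xs = and (xs `using` parList rseq)
With these, the example above becomes print (parAny hasWaldo [1..100]), compiled and run with -threaded and +RTS -N as before.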

Related

Heap is full of PINNED

I have a small program that has a reasonable maximum residency but allocates linearly. At first, I thought it should be cons cells or I#, but running the program with -p -hc shows a heap overwhelmed by PINNED. Does anyone understand the reason and/or can suggest an improvement?
The program
-- task27.hs
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad
import Control.Monad.ST
import Control.Exception
import System.Random
import Data.Functor
import qualified Data.Vector.Generic.Mutable as V
import qualified Data.Vector.Unboxed as U
m = 120
task27 :: [Int] -> (Int, Int)
task27 l = runST $ do
  r <- V.replicate m 0 :: ST s (U.MVector s Int)
  let go [] = return (1,2)
      go (a:as) = do
        let p = a `mod` m
        cur_lead <- r `V.read` p
        when (a > cur_lead) (V.write r p a)
        go as
  go l

randTest ::
  Int ->  -- Length of random testing sequence
  IO ()
randTest n =
  newStdGen <&>
  randoms <&>
  take n <&>
  task27 >>=
  print

main = randTest 1000000
My package.yaml:
name: task27
dependencies:
- base == 4.*
- vector
- random
executables:
  task27:
    main: task27.hs
    ghc-options: -O2
My cabal.project.local:
profiling: True
I run cabal -v0 run task27 -- +RTS -p -hc && hp2ps -e8in -c task27.hp, and the resulting heap profile is dominated by PINNED.
I tried to add bangs here and there but that did not seem to help.
As @WillemVanOnsem says, in GHC terms, 35kB resident is minuscule. Whatever performance issue you have, it's got nothing to do with this tiny bit of pinned memory. Originally, I said that this was probably the Vectors, but that's wrong: Data.Text uses pinned memory, but Data.Vector doesn't. This bit of PINNED memory looks like it's actually from the runtime system itself, so you can ignore it (see below).
In GHC code, "total allocation" is a measure of processing. A GHC program is an allocation engine. If it's not allocating, it's probably not doing anything (with rare exceptions). So, if you expect your algorithm to run in O(n) time, then it will also be O(n) in total allocation, usually gigabytes worth.
With respect to the "rare exceptions", a GHC program can run in constant "total allocation" but non-constant time if aggressive optimization allows computations using fully unboxed values. So, for example:
main = print (sum [1..10000000] :: Int)
runs in constant total allocation (e.g., 50kB allocated on the heap), because the Ints can be unboxed. For comparison,
main = print (sum [1..10000000] :: Integer)
runs with O(n) total allocation (e.g., 320MB allocated on the heap). By the way, try profiling this last program (and bump the count up until it runs long enough to generate a few seconds of profile data). You'll see that it uses the same amount of PINNED memory as your program, and the amount doesn't really change with the upper limit. So, this is just runtime system overhead.
Back to your example... If you are concerned about performance, the culprit is probably System.Random. This is an EXTREMELY slow random number generator. If I run your program with n = 10000000, it takes 4 seconds. If I replace the random number generator with a simple LCG:
randoms :: Word32 -> [Word32]
randoms seed = tail $ iterate lcg seed
  where lcg x = (a * x + c)
        a = 1664525
        c = 1013904223
it runs in 0.2 seconds, so about 20 times faster.
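For completeness, here is one way the replacement could be wired into the original program. This is only a sketch, not necessarily how the timing above was measured: the generator is renamed randomsLCG to avoid clashing with System.Random, the seed 42 is arbitrary, and fromIntegral converts the Word32 stream to the Int list that task27 expects.
import Data.Word (Word32)

randomsLCG :: Word32 -> [Word32]
randomsLCG seed = tail $ iterate lcg seed
  where lcg x = 1664525 * x + 1013904223

-- Drop-in replacement for randTest, using the LCG instead of newStdGen/randoms.
randTestLCG :: Int -> IO ()
randTestLCG n = print . task27 . take n . map fromIntegral $ randomsLCG 42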

Parallel Repa code doesn't create sparks

I'm writing code to do a subset product: it takes a list of elements and a list of indicator variables (of the same length). The product is computed in a tree, which is crucial to our application. Each product is expensive, so my goal was to compute each level of the tree in parallel, evaluating consecutive levels in sequence. Thus there isn't any nested parallelism going on.
I only have repa code in ONE function, near the top level of my overall code. Note that subsetProd is not monadic.
The steps:
chunk up the lists into pairs (no parallelism)
zip the chunked lists (no parallelism)
map the product function over this list (using Repa map), creating a Delayed array
call computeP to evaluate the map in parallel
convert the Repa result back to a list
make a recursive call (on lists half the size of the inputs)
The code:
{-# LANGUAGE TypeOperators, FlexibleContexts, BangPatterns #-}
import System.Random
import System.Environment (getArgs)
import Control.Monad.State
import Control.Monad.Identity (runIdentity)
import Data.Array.Repa as Repa
import Data.Array.Repa.Eval as Eval
import Data.Array.Repa.Repr.Vector

force :: (Shape sh) => Array D sh e -> Array V sh e
force = runIdentity . computeP

chunk :: [a] -> [(a,a)]
chunk [] = []
chunk (x1:x2:xs) = (x1,x2):(chunk xs)

slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1)

testSubsetProd :: Int -> Int -> IO ()
testSubsetProd size seed = do
  let work = do
        !flags <- replicateM size (state random)
        !values <- replicateM size (state $ randomR (1,10))
        return $ subsetProd values flags
      value = evalState work (mkStdGen seed)
  print value

subsetProd :: [Int] -> [Bool] -> Int
subsetProd [!x] _ = x
subsetProd !vals !flags =
  let len = (length vals) `div` 2
      !valpairs = Eval.fromList (Z :. len) $ chunk vals :: (Array V (Z :. Int) (Int, Int))
      !flagpairs = Eval.fromList (Z :. len) $ chunk flags :: (Array V (Z :. Int) (Bool, Bool))
      !prods = force $ Repa.zipWith mul valpairs flagpairs
      mul (!v0,!v1) (!f0,!f1)
        | (not f0) && (not f1) = 1
        | (not f0) = v0+1
        | (not f1) = v1+1
        | otherwise = fromInteger $ slow_fib ((v0*v1) `mod` 35)
  in subsetProd (toList prods) (Prelude.map (uncurry (||)) (toList flagpairs))

main :: IO ()
main = do
  args <- getArgs
  let [numleaves, seed] = Prelude.map read args :: [Int]
  testSubsetProd numleaves seed
The entire program is compiled with
ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3
per these instructions, on GHC 7.6.2 x64.
I ran my program (Subset) using
$> time ./Test 4096 4 +RTS -sstderr -N4
8 seconds later:
672,725,819,784 bytes allocated in the heap
11,312,267,200 bytes copied during GC
866,787,872 bytes maximum residency (49 sample(s))
433,225,376 bytes maximum slop
2360 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284212 colls, 1284212 par 174.17s 53.20s 0.0000s 0.0116s
Gen 1 49 colls, 48 par 13.76s 4.63s 0.0946s 0.6412s
Parallel GC work balance: 16.88% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 497.80s (448.38s elapsed)
GC time 187.93s ( 57.84s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 685.73s (506.21s elapsed)
Alloc rate 1,351,400,138 bytes per MUT second
Productivity 72.6% of total user, 98.3% of total elapsed
gc_alloc_block_sync: 8670031
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 571398
My code does get slower as I increase the -N parameter, (7.628 seconds for -N1, 7.891 seconds for -N2, 8.659 seconds for -N4) but I'm getting 0 sparks created, which seems like a prime suspect as to why I'm not getting any parallelism. Also, compiling with a whole slew of optimizations helps with the runtime, but not the parallelism.
Threadscope confirms that no serious work is being done on three HECs, but the garbage collector seems to be using all 4 HECs.
So why isn't Repa making any sparks? My product tree has 64 leaves, so even if Repa made a spark for every internal node, there should be ~63 sparks. I feel like it could have something to do with my use of the ST monad encapsulating the parallelism, though I'm not quite sure why this would cause an issue. Perhaps sparks can only be created in an IO monad?
If this is the case, does anyone have an idea of how I could perform this tree product where each level is done in parallel (without resulting in nested parallelism, which seems unnecessary for my task). In general, perhaps there is a better way to parallelize the tree product or make better use of Repa.
Bonus points for explaining why the runtime increases as I increase the -N parameter, even when no sparks are created.
EDIT
I changed the code example above to be a compiling example of my problem. The program flow almost perfectly matches my real code: I randomly choose some inputs, and then do a subset product on them. I am now using the identity monad. I have tried lots of small changes to my code: inlining or not, bang patterns or not, variations on using two Repa lists and a Repa zipWith vs zipping the lists sequentially and using a Repa map, etc, none of which helped at all.
Even if I'm running into this problem in my example code, my real program is much larger.
Why is there no parallelism?
The main reason (at least for your now simplified and working program) that there is no parallelism is that you're using computeP on an array with the boxed V representation, and normal boxed vectors aren't strict in their element types. So you aren't actually doing any real work in parallel. The easiest fix is to use an unboxed U array as the result, by changing force to this definition:
force :: (Shape sh, Unbox e) => Array D sh e -> Array U sh e
force a = runIdentity (computeP a)
I do recall that in your original code you claimed you're working with a complicated datatype that isn't unboxed. But is it really impossible to make it so? Perhaps you can extract the data you actually need into some unboxable representation? Or make the type an instance of the Unbox class? If not, then you can also use the following variant of force that works for a V-array:
import Control.DeepSeq (NFData(..))
...
force :: (Shape sh, NFData e) => Array D sh e -> Array V sh e
force a = runIdentity $ do
  r <- computeP a
  !b <- computeUnboxedP (Repa.map rnf r)
  return r
The idea here is that we first compute the V-array structure, and then we compute a U-array of () type from it by mapping rnf over the array. The resulting array is uninteresting, but each of the V-array's elements will be forced in the process.¹
Either of these changes brings runtime for a problem size of 4096 from ~9 down to ~3 seconds with -N4 on my machine.
In addition, I think it's strange that you convert between lists and arrays in every step. Why not make subsetProd take two arrays? Also, at least for the values, using an intermediate V array for the pairs seems unnecessary, you could just as well use a D array. But in my experiments these changes didn't have a significant beneficial effect on runtime.
Why are there no sparks?
Repa never creates sparks. Haskell has many different approaches to parallelism, and sparks are one particular mechanism that has special support in the run-time system. However, only some libraries, for example the parallel package and one particular scheduler of the monad-par package, actually make use of that mechanism. Repa does not: it uses forkIO, i.e., threads, internally, but provides a pure interface to the outside. So the absence of sparks is in itself nothing to worry about.
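For contrast, if you did want to see converted sparks in the +RTS -s output, you would have to express the parallel map with a spark-creating library such as parallel instead of Repa. A rough sketch, with mulPairs as a made-up name and mul copied from the question:
import Control.Parallel.Strategies (parList, rseq, using)

-- One spark per pair; these would show up under SPARKS in +RTS -s.
mulPairs :: [((Int, Int), (Bool, Bool))] -> [Int]
mulPairs ps = map (uncurry mul) ps `using` parList rseq
  where
    mul (v0, v1) (f0, f1)
      | not f0 && not f1 = 1
      | not f0           = v0 + 1
      | not f1           = v1 + 1
      | otherwise        = fromInteger $ slow_fib ((v0 * v1) `mod` 35)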
¹ I originally had no idea how to do that, so I asked Ben Lippmeier, the author of Repa. Thanks a lot to Ben for pointing out to me the option of mapping rnf to produce a different array, and the fact that there's an Unbox instance for ().

Sequencing IO actions in parallel

I have a function that returns an IO action,
f :: Int -> IO Int
I would like to compute this function in parallel for multiple values of the argument. My naive implementation was as follows:
import Control.Parallel.Strategies
vals = [1..10]
main = do
  results <- mapM f vals
  let results' = results `using` parList rseq
  mapM_ print results'
My reasoning for this was that the first mapM binds something of type IO [Int] to results, results' applies a parallel strategy to the contained list, and the mapM_ finally requests the actual values by printing them - but what is to be printed is already sparked in parallel, so the program should parallelize.
After being happy that it does indeed use all my CPUs, I noticed that the program is less effective (as in wall clock time) when run with +RTS -N8 than without any RTS flags. The only explanation I can think of is that the first mapM has to sequence - i.e. perform - all the IO actions already, but that would not lead to a slowdown; it would just make the -N8 execution as effective as the unparallelized one, because all the work is done by the master thread. Running the program with +RTS -N8 -s yields SPARKS: 36 (11 converted, 0 overflowed, 0 dud, 21 GC'd, 4 fizzled), which surely isn't optimal, but unfortunately I can't make any sense of it.
I suppose I've found one of the beginner's stepping stones in Haskell parallelization or the internals of the IO monad. What am I doing wrong?
Background info: f n is a function that returns the solution for Project Euler problem n. Since many of them have data to read, I put the result into the IO monad. An example of how it may look like is
-- Problem 13: Work out the first ten digits of the sum of one-hundred 50-digit numbers.
euler 13 = fmap (first10 . sum) numbers
  where
    numbers = fmap (map read . explode '\n') $ readFile "problem_13"
    first10 n
      | n < 10^10 = n -- 10^10 is the first number with 11 digits
      | otherwise = first10 $ n `div` 10
The full file can be found here (it's a bit long, but the first few "euler X" functions should be representative enough); the main file where I do the parallelism is this one.
Strategies are for parallel execution of pure computations. If it really is mandatory that your f returns an IO value, then consider using the async package instead. It provides useful combinators for running IO actions concurrently.
For your use case, mapConcurrently looks useful:
import Control.Concurrent.Async
vals = [1..10]
main = do
  results <- mapConcurrently f vals
  mapM_ print results
(I haven't tested though, because I don't know what your f is exactly.)
Try the parallel-io package. It allows you to change any mapM_ into parallel_.
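For what that might look like here, a minimal sketch assuming parallel-io's Control.Concurrent.ParallelIO.Global module (the global pool should be shut down before the program exits, and the program still needs -threaded and +RTS -N to actually use multiple cores):
import Control.Concurrent.ParallelIO.Global (parallel, stopGlobalPool)

main = do
  results <- parallel (map f vals)  -- run all the IO actions in parallel
  mapM_ print results
  stopGlobalPool                    -- flush the global worker pool before exiting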

Are there any problems with this Haskell function for strictly timing a computation?

Recently I was trying to determine the time needed to calculate a waveform using the vector storage type.
I wanted to do so without requiring to print the length or something like that. Finally I came up with the following two definitions. It seems simple enough, and from what I can tell it prints a non-zero computation time as expected the first time I run the function, but I'm wondering if there are any laziness caveats here that I've missed.
import System.IO
import System.CPUTime
import qualified Data.Vector.Storable as V
timerIO f = do
  start <- getCPUTime
  x <- f
  let !y = x
  end <- getCPUTime
  let diff = (fromIntegral (end - start)) / (10^12)
  print $ "Computation time: " ++ show diff ++ " sec\n"

timer f = timerIO $ do return f
main :: IO ()
main = do
  let sr = 1000.0
      time = V.map (/ sr) $ V.enumFromN 0 120000 :: V.Vector Float
      wave = V.map (\x -> sin $ x * 2 * pi * 10) time :: V.Vector Float
  timer wave
  timer wave
prints,
Computation time: 0.16001 sec
Computation time: 0.0 sec
Are there any hidden bugs here? I'm really not sure that the let with a strictness flag is the best way to go here. Is there a more concise way to write this? Are there any standard functions that already do this that I should know about?
Edit: I should mention that I had read about criterion but in this case I was not looking for a robust way to calculate average timing for profiling-only purposes; rather I was looking for a simple / low-overhead way to integrate single timers into my program for tracing the timing of some computations during normal running of the application. Criterion is cool, but this was a slightly different use case.
If evaluating to weak head normal form is enough - for strict Vectors or UArrays it is - then your timing code works well¹. However, instead of the bang pattern in the let-binding, you could put a bang on the monadic bind,
start <- getCPUTime
!x <- f
end <- getCPUTime
which to me looks nicer, or you could use Control.Exception.evaluate
start <- getCPUTime
evaluate f
end <- getCPUTime
which has the advantage of (supposed) portability, whereas bang patterns are a GHC extension. If WHNF is not enough, you would need to force full evaluation, for example using rnf or deepseq, like
start <- getCPUTime
!x <- rnf `fmap` f
end <- getCPUTime
However, repeatedly timing the same computation with that is hairy. If, as in your example, you give the thing a name, and call it
timer wave
timer wave
the compiler shares the computation, so it's only done once and all but the first timer calls return zero (or very close to zero) times. If you call it with code instead of a name,
timer (V.map (\x -> sin $ x * 2 * pi * 10) time :: V.Vector Float)
timer (V.map (\x -> sin $ x * 2 * pi * 10) time :: V.Vector Float)
the compiler can still share the computation, if it does common subexpression elimination. And although GHC doesn't do much CSE, it does some and I'm rather confident it would spot and share this (when compiling with optimisations). To reliably make the compiler repeat the computations, you need to hide the fact that they are the same from it (or use some low-level internals), which is not easy to do without influencing the time needed for the computation.
¹ It works well if the computation takes a significant amount of time. If it takes only a short time, the jitter introduced by outside influences (CPU load, scheduling, ...) will make single timings far too unreliable. Then you should do multiple measurements, and for that, as has been mentioned elsewhere, the criterion library is an excellent way to relieve you of the burden of writing robust timing code.
Are you familiar with the deepseq package? It's used by the criterion package for pretty much the purpose you describe.
Speaking of which, you may want to consider whether criterion itself does what you need anyway.
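For reference, a minimal criterion setup for the wave computation from the question might look like the following sketch; whnf suffices because storable vectors are strict in their elements (as noted in the answer above), and building the vector from the sample-rate argument lets criterion re-run the computation on every iteration:
import Criterion.Main
import qualified Data.Vector.Storable as V

main :: IO ()
main = defaultMain
  [ bench "wave" $ whnf waveFrom 1000.0 ]
  where
    -- Rebuild the whole pipeline from the sample rate on every run.
    waveFrom sr = V.map (\x -> sin $ x * 2 * pi * 10)
                        (V.map (/ sr) (V.enumFromN 0 120000 :: V.Vector Float))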

Haskell: can't use getCPUTime

I have:
main :: IO ()
main = do
  iniciofibonaccimap <- getCPUTime
  let fibonaccimap = map fib listaVintesete
  fimfibonaccimap <- getCPUTime
  let difffibonaccimap = (fromIntegral (fimfibonaccimap - iniciofibonaccimap)) / (10^12)
  printf "Computation time fibonaccimap: %0.3f sec\n" (difffibonaccimap :: Double)
listaVintesete :: [Integer]
listaVintesete = replicate 100 27
fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
But
*Main> main
Computation time fibonaccimap: 0.000 sec
I do not understand why this happens.
Help me, thanks.
As others have said, this is due to lazy evaluation. To force evaluation you should use the deepseq package and BangPatterns:
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq
import Text.Printf
import System.CPUTime
main :: IO ()
main = do
  iniciofibonaccimap <- getCPUTime
  let !fibonaccimap = rnf $ map fib listaVintesete
  fimfibonaccimap <- getCPUTime
  let difffibonaccimap = (fromIntegral (fimfibonaccimap - iniciofibonaccimap)) / (10^12)
  printf "Computation time fibonaccimap: %0.3f sec\n" (difffibonaccimap :: Double)
...
In the above code you should notice three things:
It compiles (modulo the ... of functions you defined above). When you post code for questions please make sure it runs (iow, you should include imports)
The use of rnf from deepseq. This forces the evaluation of each element in the list.
The bang pattern on !fibonaccimap, meaning "do this now, don't wait". This forces the list to be evaluated to weak-head normal form (whnf, basically just the first constructor (:)). Without this the rnf function would itself remain unevaluated.
Resulting in:
$ ghc --make ds.hs
$ ./ds
Computation time fibonaccimap: 6.603 sec
If you're intending to do benchmarking you should also use optimization (-O2) and the Criterion package instead of getCPUTime.
Haskell is lazy. The computation you request in the line
let fibonaccimap = map fib listaVintesete
doesn't actually happen until you somehow use the value of fibonaccimap. Thus to measure the time used, you'll need to introduce something that will force the program to perform the actual computation.
ETA: I originally suggested printing the last element to force evaluation. As TomMD points out, this is nowhere near good enough -- I strongly recommend reading his response here for an actually working way to deal with this particular piece of code.
I suspect you are a "victim" of lazy evaluation. Nothing forces the evaluation of fibonaccimap between the timing calls, so it's not computed.
Edit
I suspect you're trying to benchmark your code, and in that case it should be pointed out that there are better ways to do this more reliably.
10^12 is an integer, which forces the value of fromIntegral to be an integer, which means difffibonaccimap is assigned a rounded value, so it's 0 if the time is less than half a second. (That's my guess, anyway. I don't have time to look into it.)
Lazy evaluation has in fact bitten you, as the other answers have said. Specifically, 'let' doesn't force the evaluation of an expression, it just scopes a variable. The computation won't actually happen until its value is demanded by something, which probably won't happen until an actual IO action needs its value. So you need to put your print statement between your getCPUTime evaluations. Of course, this will also get the CPU time used by print in there, but most of print's time is waiting on IO. (Terminals are slow.)
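Concretely, a minimal sketch of that fix applied to the main from the question: printing the sum demands every element of the list (printing just one element would not, per the caveat above), and the measured interval then also includes the time spent printing.
main :: IO ()
main = do
  iniciofibonaccimap <- getCPUTime
  let fibonaccimap = map fib listaVintesete
  print (sum fibonaccimap)  -- demand the whole list before reading the clock again
  fimfibonaccimap <- getCPUTime
  let difffibonaccimap = (fromIntegral (fimfibonaccimap - iniciofibonaccimap)) / (10^12)
  printf "Computation time fibonaccimap: %0.3f sec\n" (difffibonaccimap :: Double)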
