I have an implementation of Conway's Game of Life. I want to speed it up if possible by using parallelism.
import Control.Arrow ((&&&))
import Data.List (group, sort)
import Control.Parallel.Strategies (parMap, rseq)

life :: [(Int, Int)] -> [(Int, Int)]
life cells = map snd . filter rules . freq $ concatMap neighbours cells
  where rules (n, c) = n == 3 || (n == 2 && c `elem` cells)
        freq = map (length &&& head) . group . sort

parLife :: [(Int, Int)] -> [(Int, Int)]
parLife cells = parMap rseq snd . filter rules . freq . concat $ parMap rseq neighbours cells
  where rules (n, c) = n == 3 || (n == 2 && c `elem` cells)
        freq = map (length &&& head) . group . sort
neighbours :: (Int, Int) -> [(Int, Int)]
neighbours (x, y) = [(x + dx, y + dy) | dx <- [-1..1], dy <- [-1..1], dx /= 0 || dy /= 0]
In profiling, neighbours accounts for 6.3% of the time spent, so while that is small, I expected a noticeable speedup from mapping it in parallel.
I tested with a simple function
main = print $ last $ take 200 $ iterate life fPent
  where fPent = [(1, 2), (2, 2), (2, 1), (2, 3), (3, 3)]
and compiled the parallel version as
ghc --make -O2 -threaded life.hs
and ran it as
./life +RTS -N3
It turns out that the parallel version is slower. Am I using parMap incorrectly here? Is this even a case where parallelism can be used?
I don't think you're measuring right. Your parLife is indeed a bit faster than life. In fact, on my machine (Phenom X4, 4 cores) the former only takes about 92.5% of the time the latter does, which, considering you said you expected only about a 6% improvement, is quite good.
What is your benchmarking setup? Have you tried using criterion? Here's what I did:
import Criterion
import Criterion.Main
-- your code, minus main
runGame f n = last $ take n $ iterate f fPent
  where fPent = [(1, 2), (2, 2), (2, 1), (2, 3), (3, 3)]

main = defaultMain
  [ bench "No parallelism 200" $ whnf (runGame life) 200
  , bench "Parallelism 200" $ whnf (runGame parLife) 200 ]
Compiled with ghc --make -O2 -threaded -o bench and ran with ./bench -o bench.html +RTS -N3.
Here's the detailed result of the report.
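One more knob worth trying (my suggestion, not part of the original answer): parMap rseq creates one spark per cell, and each neighbours call does very little work, so spark overhead can swallow the gains. Chunking with parListChunk from Control.Parallel.Strategies amortizes that overhead. A minimal sketch, reusing the question's neighbours; the chunk size of 64 is arbitrary and worth tuning:

import Control.Arrow ((&&&))
import Data.List (group, sort)
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- Identical to `life`, except neighbour generation is evaluated in
-- chunks of 64 cells per spark instead of one spark per element.
parLifeChunked :: [(Int, Int)] -> [(Int, Int)]
parLifeChunked cells =
    map snd . filter rules . freq . concat $
      withStrategy (parListChunk 64 rdeepseq) (map neighbours cells)
  where rules (n, c) = n == 3 || (n == 2 && c `elem` cells)
        freq = map (length &&& head) . group . sort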
I'm trying to solve the Knight's Open Tour in Haskell, and came up with a solution that generates all possible tours:
import Control.Monad (guard, replicateM)

knightsTour :: Int -> [[(Int, Int)]]
knightsTour size = go 1 [(1, 1)]
  where
    maxSteps = size^2
    isValid (x, y) = x >= 1 && x <= size && y >= 1 && y <= size
    go :: Int -> [(Int, Int)] -> [[(Int, Int)]]
    go count acc | count == maxSteps = return $ reverse acc
    go count acc = do
      next <- nextSteps (head acc)
      guard $ isValid next && next `notElem` acc
      go (count + 1) (next : acc)
    fs = replicateM 2 [(*1), (*(-1))]
    nextSteps :: (Int, Int) -> [(Int, Int)]
    nextSteps (x, y) = do
      (x', y') <- [(1, 2), (2, 1)]
      [f, f'] <- fs
      return (x + f x', y + f' y')
However, when tested with an 8-by-8 chess board, the above function never stops, because the solution space is insanely large (19,591,828,170,979,904 different open tours, according to [1]). So I want to find only one solution. First, I tried:
-- First try
head (knightsTour 8)
with the hope that Haskell's lazy evaluation would come to save the day. But that didn't happen; the program still runs forever.
Then, I tried:
-- second try
import Control.Monad (replicateM)
import Data.List (find)
import Data.Maybe (fromMaybe)

knightsTour' :: Int -> [(Int, Int)]
knightsTour' size = go 1 [(1, 1)]
  where
    maxSteps = size^2
    isValid (x, y) = x >= 1 && x <= size && y >= 1 && y <= size
    go :: Int -> [(Int, Int)] -> [(Int, Int)]
    go count acc | count == maxSteps = reverse acc
    go count acc =
      let
        nextSteps' = [step | step <- nextSteps (head acc), isValid step && step `notElem` acc]
      in
        fromMaybe [] (find (not . null) $ fmap (\step -> go (count+1) (step:acc)) nextSteps')
    fs = replicateM 2 [(*1), (*(-1))]
    nextSteps :: (Int, Int) -> [(Int, Int)]
    nextSteps (x, y) = do
      (x', y') <- [(1, 2), (2, 1)]
      [f, f'] <- fs
      return (x + f x', y + f' y')
But the solution above still cannot deliver: it also runs forever.
My questions are:
Why can't lazy evaluation work as I expected to produce only the first solution found? In my opinion, in both tries, only the first solution is required.
How to change the code above to produce only the first solution?
So first the good news: your code is doing what you expect, and only producing the first solution!
That's also the bad news: it really is taking this long to even find the first solution. I think something you underestimate greatly is how many "dead ends" need to be encountered in order to produce a solution.
For example, here's a tweak of your initial version using the Debug.Trace module to let us know how many dead ends you encounter while trying to find the first path:
import Control.Monad
import Debug.Trace (trace)
import System.Environment (getArgs)

knightsTour :: Int -> [[(Int, Int)]]
knightsTour size = go 1 [(1, 1)]
  where
    maxSteps = size * size
    isValid (x, y) = x >= 1 && x <= size && y >= 1 && y <= size
    go :: Int -> [(Int, Int)] -> [[(Int, Int)]]
    go count acc | count == maxSteps = return $ reverse acc
    go count acc = do
      let nextPossible' = [ next |
              next <- nextSteps (head acc)
            , isValid next && next `notElem` acc]
          nextPossible = if null nextPossible'
            then trace ("dead end; count: " ++ show count) []
            else nextPossible'
      next <- nextPossible
      -- guard $ isValid next && next `notElem` acc
      go (count + 1) (next : acc)
    fs = replicateM 2 [(*1), (*(-1))]
    nextSteps :: (Int, Int) -> [(Int, Int)]
    nextSteps (x, y) = do
      (x', y') <- [(1, 2), (2, 1)]
      [f, f'] <- fs
      return (x + f x', y + f' y')

main :: IO ()
main = do
  [n] <- getArgs
  print (head $ knightsTour (read n))
Now, let's see how much output that gives us for different board sizes:
/tmp$ ghc -o kntest -O2 kntest.hs
[1 of 1] Compiling Main ( kntest.hs, kntest.o )
Linking kntest ...
/tmp$ ./kntest 5 2>&1 | wc
27366 109461 547424
/tmp$ ./kntest 6 2>&1 | wc
783759 3135033 15675378
/tmp$ ./kntest 7 2>&1 | wc
818066 3272261 16361596
Okay, so we encountered 27,365 dead ends on a board size of 5 and over 800 thousand dead ends on a board size of 7. For a board size of eight, I redirected it to a file:
/tmp$ ./kntest 8 2> kn8.deadends.txt
It's still running. At this point, it's encountered over 38 million dead ends:
/tmp$ wc -l kn8.deadends.txt
38178728 kn8.deadends.txt
How many of those dead ends were really close to the end?
/tmp$ wc -l kn8.deadends.txt ; fgrep 'count: 61' kn8.deadends.txt | wc -l ; fgrep 'count: 62' kn8.deadends.txt | wc -l; fgrep 'count: 63' kn8.deadends.txt | wc -l ; wc -l kn8.deadends.txt
52759655 kn8.deadends.txt
1448
0
0
64656651 kn8.deadends.txt
So it's up to well over 64 million dead ends now and it still hasn't found a dead end longer than 61 steps.
And now it's at 85 million, and if I take too long to write the rest of this it could be at over 100 million by the time I finish this answer.
There are some things you might do to speed up your program (such as using a vector to track already visited spots rather than the O(n) notElem lookup), but fundamentally it's taking so long to get just the first answer because it's really much, much longer to the first answer than you initially thought.
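For instance, here is a sketch of that idea (my adaptation, using a Data.Set where the text above suggests a vector): thread a set of visited squares through go, so the membership test is O(log n) instead of an O(n) notElem scan:

import qualified Data.Set as Set

-- Drop-in replacement for the inner `go`, assuming the surrounding
-- maxSteps, isValid and nextSteps definitions are unchanged.
go :: Int -> [(Int, Int)] -> Set.Set (Int, Int) -> [[(Int, Int)]]
go count acc visited
  | count == maxSteps = return (reverse acc)
  | otherwise = do
      next <- nextSteps (head acc)
      guard (isValid next && next `Set.notMember` visited)
      go (count + 1) (next : acc) (Set.insert next visited)

The initial call then becomes go 1 [(1, 1)] (Set.singleton (1, 1)).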
EDIT: If you add a very simple, naive implementation of Warnsdorf's rule then you get the first knight's tour almost instantly even for very large (40x40) boards:
import Control.Monad
import System.Environment (getArgs)
import Data.List (sort)

knightsTour :: Int -> [[(Int, Int)]]
knightsTour size = go 1 [(1, 1)]
  where
    maxSteps = size * size
    isValid (x, y) = x >= 1 && x <= size && y >= 1 && y <= size
    getValidFor from acc = do
      next <- nextSteps from
      guard $ isValid next && next `notElem` acc
      return next
    go :: Int -> [(Int, Int)] -> [[(Int, Int)]]
    go count acc | count == maxSteps = return $ reverse acc
    go count acc = do
      let allPoss = getValidFor (head acc) acc
          sortedPossible = map snd $ sort $
            map (\x -> (length $ getValidFor x acc, x))
                allPoss
      next <- sortedPossible
      go (count + 1) (next : acc)
    fs = replicateM 2 [(*1), (*(-1))]
    nextSteps :: (Int, Int) -> [(Int, Int)]
    nextSteps (x, y) = do
      (x', y') <- [(1, 2), (2, 1)]
      [f, f'] <- fs
      return (x + f x', y + f' y')

main :: IO ()
main = do
  [n] <- getArgs
  print (head $ knightsTour (read n))
Hi, I am generating a sparse DAG of 1000 × 1000 nodes, each having ~4 directed edges. Here is the relevant code (imports omitted for brevity):
The problem I am solving has values in the range [0, 1500]; I have hardcoded 1501 as the upper value for now. I am trying to calculate the longest path of edges in the DAG. However, these details are not a direct part of my question.
My question is about how to judge the use of force and similar constructs while writing algorithms in Haskell:
type OutGoingEdges = Map.Map NodeId [ NodeId ]
type NodesData = Map.Map NodeId Node
type NodeId = Int

data DAG = AdjList
  { outGoingEdges :: OutGoingEdges
  , nodesData :: NodesData
  } deriving (Eq, Show)

makeDAG :: DAGDataPath -> IO (DAG, SourceNodes)
makeDAG filepath = do
  listOfListOfInts <- makeInteger <$> readLines filepath
  let [width, height] = head listOfListOfInts
      numNodes = width * height
      rows = (replicate width 1501) : (drop 1 listOfListOfInts) ++ [(replicate width 1501)]
      heightsWithNodeIdsRows =
        force . fmap (\ (row, rowId) -> fmap (\ (height, colId) -> (height, rowId * width + colId)) $ zip row [1..]) $ zip rows [1..]
      emptyGraph = AdjList Map.empty $ Map.fromList (fmap (\(h, nid) -> (nid, Node h)) . concat . tail . init $ heightsWithNodeIdsRows)
      emptyNodesWithEdges = Set.empty
      threeRowsInOneGo = zip3 heightsWithNodeIdsRows (drop 1 heightsWithNodeIdsRows) (drop 2 heightsWithNodeIdsRows)
      (graph, nodesWithInEdges) = DL.foldl' makeGraph (emptyGraph, emptyNodesWithEdges) threeRowsInOneGo
      sourceNodes = Set.difference (Set.fromList . Map.keys . nodesData $ graph) nodesWithInEdges
  -- traceShow [take 10 . Map.keys . nodesData $ graph] (return (Set.toList sourceNodes))
  -- traceShow graph (return (Set.toList sourceNodes))
  -- traceShow sourceNodes (return (Set.toList sourceNodes))
  return (graph, force $ Set.toList sourceNodes)
  where
    makeGraph (graphTillNow, nodesWithInEdges) (prevRow, row, nextRow) =
      let updownEdges = zip3 prevRow row nextRow
          (graph', nodesInEdges') = addEdges (graphTillNow, nodesWithInEdges) updownEdges
          leftRightEdges = zip3 ((1501, 0) : row) (drop 1 row) (drop 2 row)
          (graph'', nodesInEdges'') = addEdges (graph', nodesInEdges') leftRightEdges
          -- Next line is interesting: graph'' is a DAG and nodesInEdges'' is a Set NodeId
      in (graph'', nodesInEdges'')
    addEdges (g, n) edges =
      DL.foldl' (\ (!g', !n') ((p, pId), (c, cId), (n, nId)) ->
                   let (g'', n'') = if c > p
                                      then (makeEdge cId pId g', Set.insert pId n')
                                      else (g', n')
                       (g''', n''') = if c > n
                                        then (makeEdge cId nId g'', Set.insert nId n'')
                                        else (g'', n'')
                   in (g''', n'''))
                (g, n)
                edges
While profiling, I found that if I use (force graph'', force nodesInEdges'') instead of (graph'', nodesInEdges''), memory usage drops from 3 GB to 600 MB, but the running time of the program increases from 37 s to 69 s. These numbers are from the time command and Activity Monitor; profiling showed similar results.
I am compiling profile builds with :
stack build --executable-profiling --library-profiling --ghc-options="-fprof-auto -auto-all -caf-all -fforce-recomp -rtsopts" --file-watch
I have ghc-7.10.3 and stack 1.1.2.
I think that force traverses the data structure again and again.
Can force be told not to traverse the graph if it is already fully evaluated?
Can I use some other strategy?
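One alternative strategy in sketch form (my addition, not from the thread): keep the containers fully evaluated as they are built, so a final force finds nothing left to traverse. The makeEdge below is hypothetical; it works on the edge map directly, unlike the one in the full code, which takes the whole DAG:

import qualified Data.Map.Strict as Map

type NodeId = Int
type OutGoingEdges = Map.Map NodeId [NodeId]

-- Data.Map.Strict evaluates each stored value to WHNF on insert, and the
-- `seq` on the inserted id keeps the adjacency lists free of thunks, so a
-- later `force` has little work left to do.
makeEdge :: NodeId -> NodeId -> OutGoingEdges -> OutGoingEdges
makeEdge from to edges =
  to `seq` Map.insertWith (\_new old -> to : old) from [to] edges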
Sample Input:
2 2 -- width height
1 2
3 4
Output:
3
The output is the length of the longest path in the graph: [4 -> 2 -> 1], i.e. [(1,1), (0,1), (0,0)]. Just to be clear, the correctness of the program is not the question; space/time efficiency is. Thanks.
I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0 = 0
  | otherwise = ripple .|. ones
  where smallest = n .&. (-n)
        ripple = n + smallest
        newSmallest = ripple .&. (-ripple)
        ones = (newSmallest `div` smallest) `shiftR` 1 - 1
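As a quick sanity check (an added example, not from the question): starting from 7 = 0111 in binary, iterating the function walks through the numbers with exactly three bits set, in increasing order:

ghci> take 5 (iterate nextKBitNumber 7)
[7,11,13,14,19]   -- binary: 0111, 1011, 1101, 1110, 10011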
It is now easy to obtain sequentially all k-combinations in the range [2^k - 1, 2^(n-k) + ... + 2^(n-1)]:
import qualified Data.Stream as ST
import System.Environment (getArgs)

combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
  where start = 2^k - 1
        end = sum $ fmap (2^) [n-k..n-1]

kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber

main :: IO ()
main = do
  params <- getArgs
  let n = read $ params !! 0
      k = read $ params !! 1
  print $ length (combs n k)
My idea is that this should be easily parallelizable splitting this range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1

end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]

splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
  where s = start k
        e = end n k
        step = (e-s) `div` (min (e-s) (toInteger numSplits))
        initSplits = [s,s+step..e]
        ranges = zip initSplits (tail initSplits)
        fixedRanges [] acc = acc
        fixedRanges [x] acc = acc ++ [(fst x, e, k)]
        fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
  where fixStart s
          | popCount s == k = s
          | otherwise = fixStart $ s + 1
For parallelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S

main :: IO ()
main = do
  params <- getArgs
  let n = read $ params !! 0
      k = read $ params !! 1
      numTasks = read $ params !! 2
      batches = runPar $ parMap runSplit (splits n k numTasks)
      reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
  print $ S.size reducedNumbers
The result is that the sequential version is way faster and uses little memory, while the parallel version consumes a lot of memory and is noticeably slower.
What might be the reasons causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory, or are they simply meant to produce simple results (i.e. only CPU-intensive computations)?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version, calling combs does not build up a list in memory, since after length consumes an element it isn't needed anymore and is freed.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.
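In sketch form (my code, not the answerer's, reusing the question's splits and runSplit plus a hypothetical countSplit helper):

import Control.Monad.Par (runPar, parMap)
import System.Environment (getArgs)

-- Each task returns only a count, so no sub-list outlives its own
-- length computation and the heap stays small.
countSplit :: (Integer, Integer, Int) -> Int
countSplit = length . runSplit

main :: IO ()
main = do
  [n, k, numTasks] <- fmap (map read) getArgs
  print . sum . runPar $ parMap countSplit (splits n k numTasks)

One caveat: consecutive ranges produced by splits share an endpoint, so a boundary value that happens to have k bits set would be counted twice; a real version would make the ranges half-open.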
Update
Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
compile with: ghc -O2 -threaded -rtsopts foo.hs
compare:
time ./foo 26 +RTS -N1
time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0 = 0
  | otherwise = ripple .|. ones
  where smallest = n .&. (-n)
        ripple = n + smallest
        newSmallest = ripple .&. (-ripple)
        ones = (newSmallest `div` smallest) `shiftR` 1 - 1

combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
  where start = 2^k - 1
        end = shift start (n-k)

main :: IO ()
main = do
  (arg1 : _) <- getArgs
  let n = read arg1
  print $ parMap rseq (length . combs n) [1..n]
"good approaches for this problem"

What do you mean by "this problem"? If it's how to write, analyze, and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Chapter 15 (Debugging, Tuning, ...)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope
I need to generate an infinite stream of random integers, with numbers to be in range [1..n]. However the probability for each number p_i is given in advance thus the distribution is not uniform.
Is there a library function to do it in Haskell?
As people have pointed out, there is a function in Control.Monad.Random, but it has pretty poor complexity. Here is some code that, by some coincidence, I wrote this morning. It uses the beautiful alias method.
module Data.Random.Distribution.NonUniform(randomsDist) where

import Data.Array
import Data.List
import System.Random

genTable :: (Num a, Ord a) => [a] -> (Array Int a, Array Int Int)
genTable ps =
  let n = length ps
      n' = fromIntegral n
      (small, large) = partition ((< 1) . snd) $ zip [0..] $ map (n' *) ps
      loop ((l, pl):ls) ((g, pg):gs) probs aliases =
        let prob = (l,pl)
            alias = (l,g)
            pg' = (pg + pl) - 1
            gpg = (g, pg')
        in if pg' < 1 then loop (gpg:ls) gs (prob:probs) (alias:aliases)
                      else loop ls (gpg:gs) (prob:probs) (alias:aliases)
      loop ls gs probs aliases = loop' (ls ++ gs) probs aliases
      loop' [] probs aliases = (array (0,n-1) probs, array (0,n-1) aliases)
      loop' ((g,_):gs) probs aliases = loop' gs ((g,1):probs) ((g, -1):aliases)
  in loop small large [] []
-- | Generate an infinite list of random values with the given distribution.
-- The probabilities are scaled so they do not have to add up to one.
--
-- Uses Vose's alias method for generating the values.
-- For /n/ values this has O(/n/) setup complexity and O(1) complexity for each
-- generated item.
randomsDist :: (RandomGen g, Random r, Fractional r, Ord r)
            => g         -- ^ random number generator
            -> [(a, r)]  -- ^ list of values with the probabilities
            -> [a]
randomsDist g xps =
  let (xs, ps) = unzip xps
      n = length xps
      axs = listArray (0, n-1) xs
      s = sum ps
      (probs, aliases) = genTable $ map (/ s) ps
      (g', g'') = split g
      is = randomRs (0, n-1) g'
      rs = randoms g''
      ks = zipWith (\ i r -> if r <= probs!i then i else aliases!i) is rs
  in map (axs!) ks
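A usage sketch (my example, not part of the answer): draw an infinite stream from a skewed three-way distribution, using the global generator:

import System.Random (getStdGen)

main :: IO ()
main = do
  g <- getStdGen
  -- the weights are rescaled internally, so they need not sum to one
  print . take 20 $ randomsDist g [("a", 1), ("b", 2), ("c", 7 :: Double)]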
Just to expand on dflemstr's answer, you can create an infinite list of weighted values using Control.Monad.Random like this:
import Control.Monad.Random
import System.Random
weightedList :: RandomGen g => g -> [(a, Rational)] -> [a]
weightedList gen weights = evalRand m gen
  where m = sequence . repeat . fromList $ weights
And use it like this:
> let values = weightedList (mkStdGen 123) [(1, 2), (2, 5), (3, 10)]
> take 20 values
[2,1,3,2,1,2,2,3,3,3,3,3,3,2,3,3,2,2,2,3]
This doesn't require the IO monad, but you need to provide the random number generator that's used for the stream.
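If you would rather pull a generator out of IO first (my addition, reusing weightedList and the imports from above):

main :: IO ()
main = do
  gen <- getStdGen
  print . take 20 $ weightedList gen [(1, 2), (2, 5), (3, 10)]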
Control.Monad.Random offers this functionality in the form of fromList :: MonadRandom m => [(a, Rational)] -> m a.
You can use it in the IO monad with:
import Control.Monad.Random
-- ...
someNums <- evalRandIO . sequence . repeat . fromList $ [(1, 0.3), (2, 0.2), (3, 0.5)]
print $ take 200 someNums
There are other ways of running the Rand Monad as you can see in that package. The weights do not have to add up to 1.
EDIT: Rand is apparently lazier than I thought, so replicateM n can be replaced by sequence . repeat, as @shang suggested.
There is also System.Random.Distributions.frequency
frequency :: (Floating w, Ord w, Random w, RandomGen g) => [(w, a)] -> g -> (a, g)
See https://hackage.haskell.org/package/Euterpea-1.0.0/docs/System-Random-Distributions.html
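To turn that into the infinite stream the question asks for, something like this sketch works (my code; note that frequency takes the weight first in each pair):

import Data.List (unfoldr)
import System.Random (RandomGen, newStdGen)
import System.Random.Distributions (frequency) -- from the Euterpea package

-- unfoldr repeatedly threads the generator returned by frequency back in,
-- yielding an endless stream of weighted picks.
freqStream :: RandomGen g => [(Double, a)] -> g -> [a]
freqStream weights = unfoldr (Just . frequency weights)

main :: IO ()
main = do
  g <- newStdGen
  print . take 20 $ freqStream [(0.3, 'a'), (0.7, 'b')] g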
I have this Haskell code for the longest increasing subsequence:

longest'inc'subseq seq = maximum dp
  where dp = 1 : [val n | n <- [1..length seq - 1]]
        val n = (1 +) . filter'and'get'max ((<= top) . (seq!!)) $ [0..pred n]
          where top = seq!!n
        -----
        filter'and'get'max f [] = 0
        filter'and'get'max f [x] = if f x then dp!!x else 0
        filter'and'get'max f (x:xs) = if f x then ( if vx > vxs then vx else vxs ) else vxs
          where vx = dp!!x
                vxs = filter'and'get'max f xs
That takes about 1-2 s with a seq of length 1000, while in Python the result comes out immediately. In Python:
def longest(s):
    dp = [0]*len(s)
    dp[0] = 1
    for i in range(1, len(s)):
        need = 0
        for j in range(0, i):
            if s[j] <= s[i] and dp[j] > need:
                need = dp[j]
        dp[i] = need + 1
    return max(dp)
And when the length of seq is 10000, the Haskell program runs for a very long time, while Python returns the answer after 10-15 s. Can we improve the Haskell speed?
Your core problem is that you're using the wrong data structure in Haskell for this algorithm. You've translated an algorithm that depends on O(1) lookups on a sequence (as in your Python code), into one that does O(n) lookups on a list in Haskell.
Use like-for-like data structures, and then your complexity problems will take care of themselves. In this case, it means using something like Data.Vector.Unboxed to represent the sequence, which has O(1) indexing, as well as low constant overheads in general.
With nothing more than a really mindless wrapping of your lists into Vectors I get 2.5 seconds when the input list is [1..10000].
import qualified Data.Vector as V
import Data.Vector (Vector, (!))

main = print $ liss [0..10000]

liss :: [Int] -> Int
liss seqL = V.maximum dp
  where dp = V.fromList $ 1 : [val n | n <- [1..length seqL - 1]]
        seq = V.fromList seqL
        val n = (1 +) . filter'and'get'max ((<= top) . (seq!)) $ [0..pred n]
          where top = seq!n
        -----
        filter'and'get'max :: (Int -> Bool) -> [Int] -> Int
        filter'and'get'max f [] = 0
        filter'and'get'max f [x] = if f x then dp!x else 0
        filter'and'get'max f (x:xs) = if f x then ( if vx > vxs then vx else vxs ) else vxs
          where vx = dp!x
                vxs = filter'and'get'max f xs
The compilation and execution:
tommd#Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd#Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd#Mavlo:Test$ time ./so
10001
real 0m2.536s
user 0m2.528s
A worker-wrapper transformation on filter'and'get'max seems to shave off another second.
Also, I don't understand why you need that middle case (filter'and'get'max f [x]); shouldn't it work fine without it? I guess this changes the result only if dp!x < 0. Note that eliminating it saves 0.3 seconds right there.
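Concretely, the two-case version would be (my sketch; it lives in the same where block as before so dp is in scope, and it agrees with the original as long as dp has no negative entries, which holds here because every entry is at least 1):

filter'and'get'max :: (Int -> Bool) -> [Int] -> Int
filter'and'get'max f [] = 0
filter'and'get'max f (x:xs) = if f x then max (dp!x) vxs else vxs
  where vxs = filter'and'get'max f xs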
And the Python code you provided takes ~10.7 seconds (I added a call of longest(range(1, 10000))).
tommd#Mavlo:Test$ time python so.py
real 0m10.745s
user 0m10.729s