Haskell: parallel program not utilizing all cores - multithreading

The following code has the same performance whether it is compiled with -threaded or without, or when I write it in a single-threaded manner. Both blocks (the one using par and the commented-out forkIO/forkOS/forkOn one) result in the same performance. In fact, performance is slightly degraded in the parallel version (presumably due to the overhead of parallel GC). Viewing CPU utilization in a program like htop shows only one CPU getting pegged, which is pretty confusing, since my reading of the code is that it should use most of the cores.
The fact that forkOS doesn't use more cores is particularly confusing since the relevant section from ghc/rts/posix/OSThreads.c:forkOS_createThread seems to imply that it forces a call to pthread_create.
-- (Apologies if I have missed an import or two; the likely missing ones
-- are filled in below.)
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Monad (forM, void)
import Data.List (sort)
import Data.Monoid ((<>))  -- (<>) is not in the Prelude on GHC 7.10
import GHC.Conc
import Control.Concurrent
import Control.DeepSeq
import qualified Data.HashMap.Lazy as HM

main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  {-
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
  n <- getNumCapabilities
  putStrLn $ "Num capabilities " <> show n
  m <- newEmptyMVar
  forkIO $ void $ forM [(1::Int)..one] $ \cpu -> do
    -- forkOn cpu $ void $ do
    forkOS $ void $ do
    -- forkIO $ void $ do
    -- void $ do
      putStrLn $ "core " <> show cpu
      s <- return $ sort $ HM.keys $ HM.fromList $ zip [cpu..two + cpu] (repeat (0::Int))
      putStrLn $ "core " <> show cpu <> " done " <> show (sum s)
      putMVar m ()
  forM [1..one] $ \i -> takeMVar m
  let s :: String = "hey!"
  putStrLn s
  -}
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                in (y `par` ys) `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
Relevant section from my .cabal file
ghc-options:
-threaded
-rtsopts
"-with-rtsopts=-N15 -qg1"
Platform information
$ stack --version
Version 1.2.0, Git revision 241cd07d576d9c0c0e712e83d947e3dd64541c42 (4054 commits) x86_64 hpack-0.14.0
$ stack exec ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$ uname -r
4.4.0-36-generic
Why isn't my code getting parallelized?
EDIT: if it's helpful at all, adding the -s runtime flag produces the following report
21,829,377,776 bytes allocated in the heap
126,512,021,712 bytes copied during GC
86,659,312 bytes maximum residency (322 sample(s))
6,958,976 bytes maximum slop
218 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 41944 colls, 0 par 16.268s 17.272s 0.0004s 0.0011s
Gen 1 322 colls, 321 par 237.056s 23.822s 0.0740s 0.2514s
Parallel GC work balance: 13.01% (serial 0%, perfect 100%)
TASKS: 32 (1 bound, 31 peak workers (31 total), using -N15)
SPARKS: 15 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 15 fizzled)
INIT time 0.004s ( 0.003s elapsed)
MUT time 12.504s ( 13.301s elapsed)
GC time 253.324s ( 41.094s elapsed)
EXIT time 0.000s ( 0.017s elapsed)
Total time 265.920s ( 54.413s elapsed)
Alloc rate 1,745,791,568 bytes per MUT second
Productivity 4.7% of total user, 23.1% of total elapsed
gc_alloc_block_sync: 10725286
whitehole_spin: 0
gen[0].sync: 2171
gen[1].sync: 1057315
EDIT2: Messing with the arena size seems to have helped considerably. I added -H2G -A1G to the RTS options and the time came down from 43s to 5.2s. Is there anything else that can be improved about the situation to get a full 15x speedup?
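(With the .cabal convention shown above, that experiment would correspond to something like this sketch:)
ghc-options:
  -threaded
  -rtsopts
  "-with-rtsopts=-N15 -qg1 -H2G -A1G"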
EDIT3: Edited the code to reflect the par, pseq pattern suggested by two people giving feedback

The issue is caused by the definition of __pmap__. Specifically there is a problem in the following expression:
let y = f x
in y `par` (y: __pmap__ f xs)
You would expect that this would cause y and y: __pmap__ f xs to be evaluated in parallel, but this is not the case. What happens is that GHC tries to evaluate them in parallel, but the second subexpression contains y, which is the first subexpression. Because of that, the second subexpression depends on the first one and thus they cannot be evaluated in parallel. The correct way to write the above expression is
let y = f x
    ys = __pmap__ f xs
in y `par` (ys `pseq` (y : ys))
because the pseq will force ys to be evaluated before y : ys and thus the evaluation of the second subexpression can be started while the evaluation of y is running. See also this thread for some discussion on this.
So putting it all together, we get the following:
main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                in y `par` ys `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    traceShow i $ force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
Notice that I've added a traceShow (from Debug.Trace). If you run this with -N1 in rtsopts you will see that the list will be evaluated one element at a time, whereas if you use -N3, it will be evaluated 3 elements at a time.
The moral of the story is that par and pseq are easy to misuse; you should therefore prefer higher-level solutions such as parMap rdeepseq (which is equivalent to your __pmap__) from the parallel package.
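For reference, here is a sketch of that loop rewritten with parMap rdeepseq (assuming the same imports as the original program plus Control.Parallel.Strategies from the parallel package; rdeepseq deep-forces each element, playing the role of the explicit force):
import Control.Parallel.Strategies (parMap, rdeepseq)

-- inside main, replacing the __pmap__ loop:
n <- pure $ sum . concat $ parMap rdeepseq
       (\i -> sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int)))
       [1..one]
putStrLn $ "sum " <> show n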

Related

Sorting in parallel performance

I tried running some programs on multiple cores and am somewhat confused by the results.
By default, sorting in the program below takes 20 seconds; when I run it with +RTS -N2 it takes around 16 seconds, but with +RTS -N4 it takes 21 seconds!
Why is that? And is there an example of a program that gets faster with each extra core? (I had similar results with other programs from tutorials.)
Here's example of program:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX

qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let a = qsort $ filter (<= x) xs
                   b = qsort $ filter (> x) xs
               in b `par` a ++ x:b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  t1 <- getPOSIXTime
  print . sum $ qsort randints
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"
I can't duplicate your results. (Which is a good thing, since I think I was the one claiming to see a performance improvement with -N2 and -N4 with the code you posted.)
On Linux with GHC 8.8.3, and compiling to a standalone executable with -O2 -threaded, I get the following timings on a 4-core desktop:
$ stack ghc -- --version
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
The Glorious Glasgow Haskell Compilation System, version 8.8.3
$ stack ghc -- -O2 -threaded QuickSort3.hs
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
[1 of 1] Compiling Main ( QuickSort3.hs, QuickSort3.o )
Linking QuickSort3 ...
$ ./QuickSort3 +RTS -N1
10741167410134688
SORT TIME: 7.671760902s
$ ./QuickSort3 +RTS -N2
10741167410134688
SORT TIME: 5.700858877s
$ ./QuickSort3 +RTS -N3
10741167410134688
SORT TIME: 4.88330669s
$ ./QuickSort3 +RTS -N4
10741167410134688
SORT TIME: 4.93364958s
I get similar results with a 16-core Linux laptop and also similar results with a 4-core Windows virtual machine (also using GHC 8.8.3) running on that laptop.
I can think of a few possible explanations for your results.
First, I don't have a tremendously fast desktop machine, so your timings of 20secs seem suspicious. Is it possible you're doing something like:
$ stack runghc QuickSort3.hs +RTS -N4
If so, this passes the +RTS flags to stack, and then runs the Haskell program in single-threaded mode using the slow byte-code interpreter. In my tests, the sort then takes about 30secs no matter what -Nx flag value I pass.
Second, is it possible you're running this on a virtual machine with a limited number of cores (or an extremely old piece of two-core hardware)? As noted, I tried testing under a Windows virtual machine and got similar results to the Linux version with a 4-core virtual machine but quite erratic results with a 2-core virtual machine (e.g., 11.4, 13.0, and 51.3secs for -N1, -N2, and -N4 respectively, so worse performance for more cores in general, and off-the-charts bad performance for 4 cores).
You could try the following simple parallel sums benchmark, which might scale better:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX

randomList :: Int -> Int -> [Int]
randomList seed n = take n $ tail (iterate lcg seed)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  t1 <- getPOSIXTime
  let n = 50000000
      a = sum $ randomList 1 n
      b = sum $ randomList 2 n
      c = sum $ randomList 3 n
      d = sum $ randomList 4 n
      e = sum $ randomList 5 n
      f = sum $ randomList 6 n
      g = sum $ randomList 7 n
      h = sum $ randomList 8 n
  print $ a `par` b `par` c `par` d `par` e `par` f `par` g `par` h `par` (a+b+c+d+e+f+g+h)
  t2 <- getPOSIXTime
  putStrLn $ "SUM TIME: " ++ show (t2 - t1) ++ "\n"

Finding the size of a list that's too big for memory?

Brand new Haskell programmer here. Just finished "Learn You a Haskell"... I'm interested in how large a set with some particular properties is. I have working code for some small parameter values, but I'd like to know how to deal with larger structures. I know Haskell can do "infinite data structures", but I'm just not seeing how to get there from where I'm at, and Learn You a Haskell / Google isn't getting me over this.
Here's the working code for my eSet given "small" arguments r and t
import Control.Monad
import System.Environment
import System.Exit

myPred :: [Int] -> Bool
myPred a = myPred' [] a
  where
    myPred' [] [] = False
    myPred' [] [0] = True
    myPred' _ [] = True
    myPred' acc (0:aTail) = myPred' acc aTail
    myPred' acc (a:aTail)
      | a `elem` acc = False
      | otherwise = myPred' (a:acc) aTail

superSet :: Int -> Int -> [[Int]]
superSet r t = replicateM r [0..t]

eSet :: Int -> Int -> [[Int]]
eSet r t = filter myPred $ superSet r t

main :: IO ()
main = do
  args <- getArgs
  case args of
    [rArg, tArg] ->
      print $ length $ eSet (read rArg) (read tArg)
    [rArg, tArg, "set"] ->
      print $ eSet (read rArg) (read tArg)
    _ ->
      die "Usage: eSet r t [set]  -- optional 'set' prints the set itself; otherwise just the size"
When compiled/run I get
$ ghc eSet.hs -rtsopts
[1 of 1] Compiling Main ( eSet.hs, eSet.o )
Linking eSet ...
$ # Here's a tiny eSet to illustrate. It is the set of lists of r integers from zero to t with no repeated nonzero list entries
$ ./eSet 4 2 set
[[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,1,0],[0,0,1,2],[0,0,2,0],[0,0,2,1],[0,1,0,0],[0,1,0,2],[0,1,2,0],[0,2,0,0],[0,2,0,1],[0,2,1,0],[1,0,0,0],[1,0,0,2],[1,0,2,0],[1,2,0,0],[2,0,0,0],[2,0,0,1],[2,0,1,0],[2,1,0,0]]
$ ./eSet 8 4 +RTS -sstderr
3393
174,406,136 bytes allocated in the heap
29,061,152 bytes copied during GC
4,382,568 bytes maximum residency (7 sample(s))
148,664 bytes maximum slop
14 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328 colls, 0 par 0.047s 0.047s 0.0001s 0.0009s
Gen 1 7 colls, 0 par 0.055s 0.055s 0.0079s 0.0147s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.298s ( 0.301s elapsed)
GC time 0.102s ( 0.102s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.406s ( 0.405s elapsed)
%GC time 25.1% (25.2% elapsed)
Alloc rate 585,308,888 bytes per MUT second
Productivity 74.8% of total user, 75.0% of total elapsed
$ ./eSet 10 5 +RTS -sstderr
63591
27,478,010,744 bytes allocated in the heap
4,638,903,384 bytes copied during GC
532,163,096 bytes maximum residency (15 sample(s))
16,500,072 bytes maximum slop
1556 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 52656 colls, 0 par 6.865s 6.864s 0.0001s 0.0055s
Gen 1 15 colls, 0 par 8.853s 8.997s 0.5998s 1.8617s
INIT time 0.000s ( 0.000s elapsed)
MUT time 52.652s ( 52.796s elapsed)
GC time 15.717s ( 15.861s elapsed)
EXIT time 0.193s ( 0.211s elapsed)
Total time 68.564s ( 68.868s elapsed)
%GC time 22.9% (23.0% elapsed)
Alloc rate 521,883,277 bytes per MUT second
Productivity 77.1% of total user, 76.7% of total elapsed
I see my memory usage is very high and there's a lot of garbage collecting. When running eSet 12 6 I get a Segmentation fault.
I feel like filter myPred $ superSet r t is keeping me from lazily making the subset one part at a time so I can deal with much larger (but finite) sets, but I don't know how to change to another approach that would do that. I think that's the root of my question.
Also, as this is my first Haskell program, points on style and how to achieve the Haskell analog of "pythonic" are much appreciated!
I suspect the culprit here is replicateM, which has this implementation:
replicateM cnt0 f =
    loop cnt0
  where
    loop cnt
      | cnt <= 0 = pure []
      | otherwise = liftA2 (:) f (loop (cnt - 1))
The problem line is liftA2 (:) f (loop (cnt - 1)); probably loop (cnt - 1) is getting shared among all the calls to (:) partially applied to elements of f, and so loop (cnt - 1) must be kept in memory. Unfortunately loop (cnt - 1) is quite a long list...
It can be a bit fiddly to convince GHC not to share something. The following redefinition of superSet gives me a nice flat memory usage; it will probably be a bit slower on examples that do fit in memory, of course. The key idea is to make it look to the untrained eye (i.e. GHC) like the recursive monadic action depends on the choices made earlier, even though it doesn't.
superSet :: Int -> Int -> [[Int]]
superSet r t = go r 0 where
  -- Both branches are [[]]; the spurious dependency on 'ignored' is
  -- what stops GHC from sharing the recursive call between choices.
  go 0 ignored = if ignored == 0 then [[]] else [[]]
  go r ignored = do
    x <- [0..t]
    xs <- go (r-1) (ignored+x)
    return (x:xs)
If you don't mind avoiding optimizations, the more natural definition also works:
superSet 0 t = [[]]
superSet r t = do
  x <- [0..t]
  xs <- superSet (r-1) t
  return (x:xs)
...but with -O2 GHC is too clever and notices that it can share the recursive call.
A completely alternate approach is to just do a little bit of combinatorics analysis. Here's the process that eSet r t does, as near as I can make out:
Choose at most r elements without replacement from a set of size t.
Pad the sequence to a length of r by interleaving a sentinel value.
So let's just count the ways of doing each of these steps, rather than actually performing them. We'll introduce a new parameter, s, which is the length of the sequence produced by step (1) (and which we therefore know has s <= r and s <= t). How many sequences of size s are there when drawing elements without replacement from a set of size t? Well, there are t choices for the first element, t-1 choices for the second element, t-2 choices for the third element, and so on down to t-s+1 choices for the s-th.
-- sequencesWithoutReplacement is a very long name!
seqWORepSize :: Integer -> Integer -> Integer
seqWORepSize s t = product [t-s+1 .. t]
Then we want to pad the sequence out to a length of r. We're going to choose s positions in the r-long sequence to be drawn from our sequence, and the remainder will be sentinels. How many ways are there to do that? This one is a well-known combinatorics operator called choose.
choose :: Integer -> Integer -> Integer
choose r s = product [r-s+1 .. r] `div` product [2 .. s]
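As a quick sanity check of both counting functions (a hypothetical ghci session; drawing 2 of 4 elements gives 4*3 = 12 ordered sequences, and there are 6 ways to place 2 items in 4 slots):
> seqWORepSize 2 4
12
> 4 `choose` 2
6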
The number of ways to produce a padded sequence of a given length is just the product of these two numbers, since the choices of "what values to insert" and "where to insert values" can be made completely independently.
paddedSeqSize :: Integer -> Integer -> Integer -> Integer
paddedSeqSize r s t = seqWORepSize s t * (r `choose` s)
And now we're pretty much done. Just iterate over all possible sequence lengths and add up the paddedSeqSize.
eSetSize :: Integer -> Integer -> Integer
eSetSize r t = sum $ map (\s -> paddedSeqSize r s t) [0..r]
We can try it out in ghci:
> :set +s
> map length $ [eSet 1 1, eSet 4 4, eSet 6 4, eSet 4 6]
[2,209,1045,1045]
(0.13 secs, 26,924,264 bytes)
> [eSetSize 1 1, eSetSize 4 4, eSetSize 6 4, eSetSize 4 6]
[2,209,1045,1045]
(0.01 secs, 120,272 bytes)
This way is significantly faster and significantly more memory-friendly. Indeed, we can make queries and get answers about eSets that we would never be able to count the size of one-by-one, e.g.
> length . show $ eSetSize 1000 1000
2594
(0.26 secs, 909,746,448 bytes)
Good luck counting to 10^2594 one at a time. =P
Edit
This idea can also be adapted to produce the padded sequences themselves rather than just counting how many there are. First, a handy utility that I find myself defining over and over for picking out individual elements of a list and reporting on the leftovers:
zippers :: [a] -> [([a], a, [a])]
zippers = go [] where
  go ls [] = []
  go ls (h:rs) = (ls, h, rs) : go (h:ls) rs
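For example (a hypothetical ghci session; note the leftovers to the left of the focus come back reversed):
> zippers "abc"
[("",'a',"bc"),("a",'b',"c"),("ba",'c',"")]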
Now, the sequences without replacement can be done by repeatedly choosing a single element from the leftovers.
seqWORep :: Int -> [a] -> [[a]]
seqWORep 0 _ = [[]]
seqWORep n xs = do
  (ls, y, rs) <- zippers xs
  ys <- seqWORep (n-1) (ls++rs)
  return (y:ys)
Once we have a sequence, we can pad it to a desired size by producing all the interleavings of the sentinel value as follows:
interleavings :: Int -> a -> [a] -> [[a]]
interleavings 0 _ xs = [xs]
interleavings n z [] = [replicate n z]
interleavings n z xs@(x:xt) = map (z:) (interleavings (n-1) z xs)
                           ++ map (x:) (interleavings n z xt)
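For example, interleaving one sentinel 0 into [1,2] (a hypothetical ghci session):
> interleavings 1 0 [1,2]
[[0,1,2],[1,0,2],[1,2,0]]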
And finally, the top-level function just delegates to seqWORep and interleavings.
eSet :: Int -> Int -> [[Int]]
eSet r t = do
  s <- [0..r]
  xs <- seqWORep s [1..t]
  interleavings (r-s) 0 xs
In my tests this modified eSet has nice flat memory usage both with and without optimizations; it does not generate any spurious elements that need to be filtered out later, and so it is faster than your original proposal; and to me it looks like quite a natural definition compared to the answer that relies on tricking GHC. A nice collection of properties!
After re-reading parts of LYaH and thinking about @daniel-wagner's answer, monadic composition sounded like it would be worthwhile to try. The new code's total memory usage is flat, and it works with and without the -O2 optimization.
Source:
import Control.Monad
import System.Environment
import System.Exit

allowed :: [Int] -> Bool
allowed a = allowed' [] a
  where
    allowed' [] [] = False
    allowed' [] [0] = True
    allowed' _ [] = True
    allowed' acc (0:aTail) = allowed' acc aTail
    allowed' acc (a:aTail)
      | a `elem` acc = False
      | otherwise = allowed' (a:acc) aTail

branch :: Int -> [Int] -> [[Int]]
branch t x = filter allowed [n:x | n <- [0..t]]

eSet :: Int -> Int -> [[Int]]
eSet r t = return [] >>= foldr (<=<) return (replicate r (branch t))

main :: IO ()
main = do
  args <- getArgs
  case args of
    [rArg, tArg] ->
      print $ length $ eSet (read rArg) (read tArg)
    [rArg, tArg, "set"] ->
      print $ eSet (read rArg) (read tArg)
    _ -> die "Usage: eSet r t [set]"
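For intuition: foldr (<=<) return (replicate r (branch t)) chains r copies of branch t together with Kleisli composition, so each step prepends one element and immediately filters with allowed, pruning bad prefixes before they can grow. Written out for r = 3, it is equivalent to this sketch:
eSet3 :: Int -> [[Int]]
eSet3 t = return [] >>= (branch t <=< branch t <=< branch t)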
The version with monadic function composition tests much faster and without the memory issues...
$ ./eSetMonad 10 5 +RTS -sstderr
63591
289,726,000 bytes allocated in the heap
997,968 bytes copied during GC
63,360 bytes maximum residency (2 sample(s))
24,704 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 553 colls, 0 par 0.008s 0.008s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.426s ( 0.429s elapsed)
GC time 0.009s ( 0.009s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.439s ( 0.438s elapsed)
%GC time 2.0% (2.0% elapsed)
Alloc rate 680,079,893 bytes per MUT second
Productivity 98.0% of total user, 98.3% of total elapsed

Haskell: Parallel code is slower than sequential version

I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0 = 0
  | otherwise = ripple .|. ones
  where smallest = n .&. (-n)
        ripple = n + smallest
        newSmallest = ripple .&. (-ripple)
        ones = (newSmallest `div` smallest) `shiftR` 1 - 1
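To see the bit trick in action (a hypothetical ghci session; 3 is binary 011, and the 2-bit numbers that follow are 101, 110, 1001, 1010):
> nextKBitNumber 3
5
> take 5 $ iterate nextKBitNumber 3
[3,5,6,9,10]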
It is now easy to obtain all k-combinations sequentially in the range [2^k - 1, 2^(n-k) + ... + 2^(n-1)]:
import qualified Data.Stream as ST
import System.Environment  -- for getArgs (missing from the original snippet)

combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
  where start = 2^k - 1
        end = sum $ fmap (2^) [n-k..n-1]

kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber

main :: IO ()
main = do
  params <- getArgs
  let n = read $ params !! 0
      k = read $ params !! 1
  print $ length (combs n k)
My idea is that this should be easily parallelizable by splitting the range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1

end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]

splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
  where s = start k
        e = end n k
        step = (e-s) `div` (min (e-s) (toInteger numSplits))
        initSplits = [s,s+step..e]
        ranges = zip initSplits (tail initSplits)
        fixedRanges [] acc = acc
        fixedRanges [x] acc = acc ++ [(fst x, e, k)]
        fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
  where fixStart s
          | popCount s == k = s
          | otherwise = fixStart $ s + 1
For parallelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S

main :: IO ()
main = do
  params <- getArgs
  let n = read $ params !! 0
      k = read $ params !! 1
      numTasks = read $ params !! 2
      batches = runPar $ parMap runSplit (splits n k numTasks)
      reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
  print $ S.size reducedNumbers
The result is that the sequential version is way faster and uses little memory, while the parallel version consumes a lot of memory and is noticeably slower.
What might be the reasons for this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory, or are they simply meant to produce small results from CPU-intensive computations?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version, calling combs does not build up a list in memory, since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.
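A minimal sketch of that idea, reusing the splits and runSplit helpers above along with the imports from the parallel main (and ignoring any de-duplication at range boundaries that the Set-based version performs):
main :: IO ()
main = do
  params <- getArgs
  let n = read $ params !! 0
      k = read $ params !! 1
      numTasks = read $ params !! 2
      -- each task returns only an Int, so its sub-list is never retained
      counts = runPar $ parMap (length . runSplit) (splits n k numTasks)
  print $ sum counts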
Update
Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
compile with: ghc -O2 -threaded -rtsopts foo.hs
compare:
  time ./foo 26 +RTS -N1
  time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0 = 0
  | otherwise = ripple .|. ones
  where smallest = n .&. (-n)
        ripple = n + smallest
        newSmallest = ripple .&. (-ripple)
        ones = (newSmallest `div` smallest) `shiftR` 1 - 1

combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
  where start = 2^k - 1
        end = shift start (n-k)

main :: IO ()
main = do
  (arg1 : _) <- getArgs
  let n = read arg1
  print $ parMap rseq (length . combs n) [1..n]
"good approaches for this problem"
What do you mean by "this problem"? If it's how to write, analyze, and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope

Haskell MVar : How to execute shortest job first?

When more than one thread is waiting to write an MVar, they are executed in a first-in first-out scheme. I want to execute the threads according to shortest job first scheduling.
I have tried to code this using an MVar. Here the job is to calculate a Fibonacci number and write it to an MVar. The 1st thread calculates Fibonacci 30 and the 2nd thread calculates Fibonacci 10. As calculating Fibonacci 10 takes less time than Fibonacci 30, the 2nd thread should execute first, but I am not getting the desired result from the following block of code.
How do I implement shortest job first scheduling in Haskell (or maybe using Haskell STM)?
Code
module Main where

import Control.Parallel
import Control.Concurrent
import System.IO

nfib :: Int -> Int
nfib n | n <= 2 = 1
       | otherwise = par n1 (pseq n2 (n1 + n2))
  where n1 = nfib (n-1)
        n2 = nfib (n-2)

type MInt = MVar Int

updateMVar :: MInt -> Int -> IO ()
updateMVar n v = do x1 <- readMVar n
                    let y = nfib v
                    x2 <- readMVar n
                    if x1 == x2
                      then do t <- takeMVar n
                              putMVar n y
                      else return ()

main :: IO ()
main = do
  n <- newEmptyMVar
  putMVar n 0
  forkIO (updateMVar n 30)
  t <- readMVar n
  putStrLn ("n is : " ++ show t)
  forkIO (updateMVar n 10)
  t <- readMVar n
  putStrLn ("n is : " ++ show t)
Output
n is : 832040
n is : 55
To implement scheduling you need to use MVars and threads together. Start with an empty MVar. Fork the jobs you wish to run in the background. The main thread can then block on each result in turn. The fastest will come first. Like so:
{-# LANGUAGE BangPatterns #-}

import Control.Parallel
import Control.Concurrent
import System.IO

nfib :: Int -> Int
nfib n | n <= 2 = 1
       | otherwise = par n1 (pseq n2 (n1 + n2))
  where n1 = nfib (n-1)
        n2 = nfib (n-2)

main :: IO ()
main = do
  result <- newEmptyMVar
  forkIO $ do
    let !x = nfib 40
    putMVar result x
  forkIO $ do
    let !x = nfib 30
    putMVar result x
  t <- takeMVar result
  print $ "Fastest result was: " ++ show t
  t <- takeMVar result
  print $ "Slowest result was: " ++ show t
Note that it is important to use bang patterns to evaluate the Fibonacci calls before their results go into the MVar (we don't want to simply hand an unevaluated thunk to the main thread).
Compile with the threaded runtime:
$ ghc -o A --make A.hs -threaded -fforce-recomp -rtsopts
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A.exe ...
And run on two cores:
$ ./A.exe +RTS -N2
"Fastest result was: 832040"
"Slowest result was: 102334155"
Productivity is pretty good as well (use +RTS -s to see runtime performance statistics).
Productivity 89.3% of total user, 178.1% of total elapsed
The first thread to finish will have its result printed first. The main thread will then block until the second thread is done.
The main thing is to take advantage of MVar empty/full semantics to block the main thread on each of the children threads.
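A sketch generalizing the pattern to any number of jobs (a hypothetical helper; assumes the BangPatterns pragma and the nfib from the code above):
runJobsInCompletionOrder :: [Int] -> IO [Int]
runJobsInCompletionOrder inputs = do
  box <- newEmptyMVar
  mapM_ (\i -> forkIO $ let !x = nfib i in putMVar box x) inputs
  -- takeMVar blocks until some worker finishes, so results arrive
  -- in completion order: the shortest job comes back first.
  mapM (const (takeMVar box)) inputs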

Abnormally slow Haskell code

I've been trying to practice with the Digits-Recognizer Dojo in Haskell after having done it in F#. I'm getting results, but for some reason my Haskell code is insanely slow, and I cannot seem to find what's wrong.
Here is my code (the .csv files can be found on the Dojo's GitHub):
import Data.Char
import Data.List
import Data.List.Split
import Data.Ord
import System.IO

type Pixels = [Int]
data Digit = Digit { label :: Int, pixels :: Pixels }

distance :: Pixels -> Pixels -> Float
distance d1 d2 = sqrt . sum $ map pointDistance $ zip d1 d2
  where pointDistance (a, b) = fromIntegral $ (a - b) * (a - b)

parseDigit :: String -> Digit
parseDigit s = Digit label pixels
  where (label:pixels) = map read $ splitOn "," s

identify :: Digit -> [Digit] -> (Digit, Float)
identify digit training = minimumBy (comparing snd) distances
  where distances = map fn training
        fn ref = (ref, distance (pixels digit) (pixels ref))

readDigits :: String -> IO [Digit]
readDigits filename = do
  fileContent <- readFile filename
  return $ map parseDigit $ tail $ lines fileContent

main :: IO ()
main = do
  trainingSample <- readDigits "trainingsample.csv"
  validationSample <- readDigits "validationsample.csv"
  let result = [(d, identify d trainingSample) | d <- validationSample]
      fmt (d, (ref, dist)) = putStrLn $ "Found..."
  mapM_ fmt result
What would be the reason for this bad performance?
[UPDATE] Thank you for your many ideas! I have switched from String to Data.Text and from List to Data.Vector as suggested; unfortunately, the result is still far from satisfactory.
My updated code is available here.
To give you a better understanding of my question, here's the output of my Haskell (left) and F# (right) implementations. I'm a total newbie in both languages, so I sincerely believe that there has to be a major mistake in my Haskell version for it to be that much slower.
If you're patient, you'll notice that the second result is calculated much faster than the first. That's because your implementation takes some time to read in the csv files.
You may be tempted to stick a print statement to see when it's done loading like so:
main = do
  trainingSample <- readDigits "trainingsample.csv"
  validationSample <- readDigits "validationsample.csv"
  putStrLn "done loading data"
But due to lazy IO, this won't do what you think it does. trainingSample and validationSample are not yet fully evaluated. So your print statement will print almost immediately, and the first result will still take forever.
You can force readDigits to fully evaluate their return values, though, which will give you a better idea of how much time is spent there. You could either switch to using non-lazy IO, or just print something derived from the data:
readDigits :: String -> IO [Digit]
readDigits filename = do
  fileContent <- readFile filename
  putStr' $ filename ++ ": "
  -- forM comes from Control.Monad
  rows <- forM (tail $ lines fileContent) $ \line -> do
    let xs = parseDigit line
    putStr' $ case compare (sum $ pixels xs) 0 of
      LT -> "-"
      EQ -> "0"
      GT -> "+"
    return xs
  putStrLn ""
  return rows
  where putStr' s = putStr s >> hFlush stdout
On my machine, this let me see that it took about 27 seconds to fully read the digits from trainingsample.csv.
This is printf-style profiling, which isn't great (much better to use a real profiler, or use criterion to benchmark various parts of your code), but good enough for these purposes.
That's clearly a major part of the slowdown, so it's worth trying to switch to strict io. Using Data.Text.IO.readFile, which is strict, cut it down to ~18 seconds.
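For example, a sketch of readDigits over strict Text (assuming a parseDigit that takes T.Text, as in the updated code):
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

readDigits :: String -> IO [Digit]
readDigits filename = do
  fileContent <- TIO.readFile filename  -- strict: the whole file is read here
  return $ map parseDigit $ tail $ T.lines fileContent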
UPDATE
Here's how to speed up your updated code:
Use unboxed vectors for Pixels (small win):
import qualified Data.Vector.Unboxed as U
-- ...
type Pixels = U.Vector Int
-- ...
distance :: Pixels -> Pixels -> Float
distance d1 d2 = sqrt . U.sum $ U.zipWith pointDistance d1 d2
  where pointDistance a b = fromIntegral $ (a - b) * (a - b)

parseDigit :: T.Text -> Digit
parseDigit s = Digit label (U.fromList pixels)
  where (label:pixels) = map toDigit $ T.splitOn (T.pack ",") s
        toDigit s = either (\_ -> 0) fst (T.Read.decimal s)
Force the distance evaluation early by using seq (big win):
identify :: Digit -> V.Vector Digit -> (Digit, Float)
identify digit training = V.minimumBy (comparing snd) distances
  where distances = V.map fn training
        fn ref = let d = distance (pixels digit) (pixels ref) in d `seq` (ref, d)
On my machine, the whole program now runs in ~5s:
% ghc --make -O2 Main.hs
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...
% time ./Main
./Main 5.00s user 0.11s system 99% cpu 5.115 total
The thunks were killing you.
Your Vector's version, partially unboxed, adapted for ByteString and compiled with -O2 -fllvm runs in 8 seconds on my machine:
import Data.Ord
import Data.Maybe
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

type Pixels = U.Vector Int
data Digit = Digit { label :: !Int, pixels :: !Pixels }

distance :: Pixels -> Pixels -> Float
distance d1 d2 = sqrt . U.sum . U.zipWith pointDistance d1 $ d2
  where pointDistance a b = fromIntegral $ (a - b) * (a - b)

parseDigit :: B.ByteString -> Digit
parseDigit bs =
  let (label:pixels) = toIntegers bs []
  in Digit label (U.fromList pixels)
  where
    toIntegers bs is =
      let Just (i, bs') = BC.readInt bs
      in if B.null bs' then reverse is else toIntegers (BC.tail bs') (i:is)

identify :: Digit -> V.Vector Digit -> (Digit, Float)
identify digit training = V.minimumBy (comparing snd) distances
  where distances = V.map fn training
        fn ref = (ref, distance (pixels digit) (pixels ref))

readDigits :: String -> IO (V.Vector Digit)
readDigits filename = do
  fileContent <- B.readFile filename
  return . V.map parseDigit . V.fromList . tail . BC.lines $ fileContent

main :: IO ()
main = do
  trainingSample <- readDigits "trainingsample.csv"
  validationSample <- readDigits "validationsample.csv"
  let result = V.map (\d -> (d, identify d trainingSample)) validationSample
      fmt (d, (ref, dist)) = putStrLn $ "Found " ++ show (label ref) ++ " for " ++ show (label d) ++ " (distance=" ++ show dist ++ ")"
  V.mapM_ fmt result
Output of +RTS -s:
989,632,984 bytes allocated in the heap
19,875,368 bytes copied during GC
31,016,504 bytes maximum residency (5 sample(s))
22,748,608 bytes maximum slop
78 MB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1761 colls, 0 par 0.05s 0.05s 0.0000s 0.0008s
Gen 1 5 colls, 0 par 0.00s 0.02s 0.0030s 0.0085s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.42s ( 7.69s elapsed)
GC time 0.05s ( 0.06s elapsed)
EXIT time 0.00s ( 0.01s elapsed)
Total time 7.47s ( 7.77s elapsed)
%GC time 0.7% (0.8% elapsed)
Alloc rate 133,419,569 bytes per MUT second
Productivity 99.3% of total user, 95.5% of total elapsed

Resources