I have a small program that have reasonable maximum residency but allocates linearly. At first, I thought that should be cons cells or I#, but running the program with -p -hc shows heap overwhelmed by PINNED. Does anyone understand the reason and/or can suggest an improvement?
The program
-- task27.hs
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad
import Control.Monad.ST
import Control.Exception
import System.Random
import Data.Functor
import qualified Data.Vector.Generic.Mutable as V
import qualified Data.Vector.Unboxed as U
m = 120
task27 :: [Int] -> (Int, Int)
task27 l = runST $ do
r <- V.replicate m 0 :: ST s (U.MVector s Int)
let go [] = return (1,2)
go (a:as) = do
let p = a `mod` m
cur_lead <- r `V.read` p
when (a > cur_lead) (V.write r p a)
go as
go l
randTest ::
Int -> -- Length of random testing sequence
IO ()
randTest n =
newStdGen <&>
randoms <&>
take n <&>
task27 >>=
print
main = randTest 1000000
My package.yaml:
name: task27
dependencies:
- base == 4.*
- vector
- random
executables:
task27:
main: task27.hs
ghc-options: -O2
My cabal.project.local:
profiling: True
I do cabal -v0 run task27 -- +RTS -p -hc && hp2ps -e8in -c task27.hp and get this:
I tried to add bangs here and there but that did not seem to help.
As #WillemVanOnsem says, in GHC terms, 35kB resident is miniscule. Whatever performance issue you have, it's got nothing to do with this tiny bit of pinned memory. Originally, I said that this was probably the Vectors, but that's wrong. Data.Text uses pinned memory, but Data.Vector doesn't. This bit of PINNED memory looks like it's actually from the runtime system itself, so you can ignore it (see below).
In GHC code, "total allocation" is a measure of processing. A GHC program is an allocation engine. If it's not allocating, it's probably not doing anything (with rare exceptions). So, if you expect your algorithm to run in O(n) time, then it will also be O(n) in total allocation, usually gigabytes worth.
With respect to the "rare exceptions", a GHC program can run in constant "total allocation" but non-constant time if aggressive optimization allows computations using fully unboxed values. So, for example:
main = print (sum [1..10000000] :: Int)
runs in constant total allocation (e.g., 50kB allocated on the heap), because the Ints can be unboxed. For comparison,
main = print (sum [1..10000000] :: Integer)
runs with O(n) total allocation (e.g., 320MB allocated on the heap). By the way, try profiling this last program (and bump the count up until it runs long enough to generate a few seconds of profile data). You'll see that it uses the same amount of PINNED memory as your program, and the amount doesn't really change with the upper limit. So, this is just runtime system overhead.
Back to your example... If you are concerned about performance, the culprit is probably System.Random. This is an EXTREMELY slow random number generator. If I run your program with n = 10000000, it takes 4secs. If I replace the random number generator with a simple LCG:
randoms :: Word32 -> [Word32]
randoms seed = tail $ iterate lcg seed
where lcg x = (a * x + c)
a = 1664525
c = 1013904223
it runs in 0.2secs, so 20 times faster.
Related
I'm looking to try to run a moderately expensive function on a large list of inputs, using part of the output of that function as one of its inputs. The code runs as expected, unfortunately it consumes a large amount of memory in the process (just under 22GiB on the heap, just over 1GiB maximum residency). Here is a simplified example of what I mean:
{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import qualified Data.Text.Lazy.Builder as TB
main :: IO ()
main = TL.putStr $ TB.toLazyText showInts
showInts :: TB.Builder
showInts = foldMap fst shownLines
where
shownLines = map (showInt maxwidth) [0..10^7]
maxwidth = foldl' (\n -> max n . snd) 0 shownLines
showInt :: Int -> Int -> (TB.Builder, Int)
showInt maxwidth n = (builder, len)
where
builder = TB.fromText "This number: "
<> TB.fromText (T.replicate (maxwidth - len) " ") <> thisText
<> TB.singleton '\n'
(thisText, len) = expensiveShow n
expensiveShow :: Int -> (TB.Builder, Int)
expensiveShow n = (TB.fromText text, T.length text)
where text = T.pack (show n)
Note that in the where clause of showInts, showInt takes maxwidth as an argument, where maxwidth itself depends on the output of running showInt maxwidth on the whole list.
If, on the other hand, I do the naïve thing and replace the definition of maxwidth with foldl' max 0 $ map (snd . expensiveShow) [0..10^7], then maximum residency falls to just 44KiB. I would hope that performance like this would be achievable without workarounds like precomputing expensiveShow and then zipping it with the list [0..10^7].
I tried consuming the list strictly (using the foldl package), but this did not improve the situation.
I'm trying to have my cake and eat it too: exploiting laziness, while also making things strict enough that we don't build up a mountain of thunks. Is this possible to do? Or is there a better technique for accomplishing this?
You can't do it like this.
The problem is that your showInts has to traverse the list twice, first to find the longest number, second to print the numbers with the necessary format. That means the list has to be held in memory between the first and second passes. This isn't a problem with unevaluated thunks; it is simply that the whole list, completely evaluated, is being traversed twice.
The only solution is to generate the same list twice. In this case it is trivial; just have two [0..10^7] values, one for the maximum length and the second to format them. I suspect in your real application you are reading them from a file or something, in which case you need to read the file twice.
Trying to get to grips with the concepts I am trying to solve an exercise in Haskell using WriterT and State (it's advent of code day 15). For some reason I do not understand I end up using loads of memory and my notebook (just 4G Ram) comes to a halt.
My first idea was to use strictness and sprinkle bangs around - but the issue persists.
Could someone explain me where I did go wrong?
Here's cleaned up code:
{-# LANGUAGE BangPatterns #-}
module Main where
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
main = do
let generators = (Generator 65 16807, Generator 8921 48271)
res1 = compute generators (4*10^7)
putStrLn "Answer 1"
print res1
data Generator = Generator { _value :: Int
, _factor :: Int
}
deriving Show
newtype Value = Value Int
deriving (Show, Eq)
newtype Counter = Counter Int
deriving (Show, Eq)
instance Monoid Counter where
mempty = Counter 0
mappend (Counter !a) (Counter !b) = Counter (a+b)
generate :: Generator -> (Value, Generator)
generate (Generator v f) = (Value newval, Generator newval f)
where newval = (v * f) `mod` 2147483647
agree (Value a) (Value b) = (a `mod` mf) == (b `mod` mf)
where mf = 2^16
oneComp :: State (Generator, Generator) Bool
oneComp = do
(!ga, !gb) <- get
let (va, gan) = generate ga
(vb, gbn) = generate gb
!ag = agree va vb
put (gan, gbn)
pure ag
counterStep :: WriterT Counter (State (Generator, Generator)) ()
counterStep = do
!ag <- lift oneComp
when ag $ tell (Counter 1)
afterN :: Int -> WriterT Counter (State (Generator, Generator)) ()
afterN n = replicateM_ n counterStep
compute s0 n = evalState (execWriterT (afterN n)) s0
I compile it with stack. The entry in the cabal file is:
executable day15
hs-source-dirs: app
main-is: day15.hs
ghc-options: -threaded -rtsopts -with-rtsopts=-N
build-depends: base
, advent
, hspec
, mtl
default-language: Haskell2010
update
I had a little more time and followed the suggestion to make Generator strict. However still something is using too much memory.
Here's the part of the prof file that I think may be relevant.
Fri Dec 15 16:28 2017 Time and Allocation Profiling Report (Final)
day15 +RTS -N -p -RTS
total time = 71.66 secs (71662 ticks # 1000 us, 1 processor)
total alloc = 17,600,423,088 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
afterN Main app/day15.hs:79:1-36 41.1 20.0
mappend Main app/day15.hs:51:3-51 31.0 3.6
oneComp Main app/day15.hs:(64,1)-(71,9) 9.2 49.1
generate.(...) Main app/day15.hs:55:9-42 8.5 14.5
The cause is likely to be the WriterT layer.
Even the "strict" WriterT is completely lazy in the accumulator —it is strict in another sense unrelated to the accumulator.
For example, this program runs without errors:
import Data.Monoid
import Control.Monad.Trans.Writer
import Control.Exception
main :: IO ()
main = do
let (x,_) = runWriter $ do
tell $ Sum (1::Float)
tell (throw $ AssertionFailed "oops")
tell (error "oops")
tell undefined
tell (let z = z in z)
return True
print x
Furthermore, it is impossible to "strictify" the accumulator from within WriterT, because there's no way to get to it.
For long computations, thunks will accumulate and consume a lot of memory.
One solution is to store the counter in a StateT layer instead. The strict modify' function is helpful here.
Using StateT for an append-only accumulator is a bit unsatisfactory though. Another option is to use Accum with judiciously positioned BangPatterns. This program throws an error:
import Control.Monad.Trans.Accum
main :: IO ()
main = do
let (x,_) = flip runAccum mempty $ do
add $ Sum (1::Float)
add $ error "oops"
!_ <- look
return True
print x
Accum is like a Writer that lets you access the accumulator. It doesn't let you change it at will, only add to it. But getting hold of it is enough to introduce strictness.
I'm trying to learn/evaluate Haskell and I'm struggling with getting efficient executable for a simple case.
The test I'm using is a PRNG sequence (replicating PCG32 RNG). I've written it as an iteration of a basic state transition function (I'm looking only at the state for now).
{-# LANGUAGE BangPatterns #-}
import System.Environment (getArgs)
import Data.Bits
import Data.Word
iterate' f !x = x : iterate' f (f x)
main = print $ pcg32_rng 100000000
pcg32_random_r :: Word64 -> Word64 -> Word64
pcg32_random_r !i !state = state * (6364136223846793005 :: Word64) + (i .|. 1)
{-# INLINE pcg32_random_r #-}
pcg32_rng_s = iterate' (pcg32_random_r 1) 0
pcg32_rng n = pcg32_rng_s !! (n - 1)
I can get that code to compile and run. It still uses a lot more memory than it should and runs 10x slower than the C equivalent. The main issue seems to be that the iteration is not turned into a simple loop.
What am I missing to get GHC to produce faster / more efficient code here?
EDIT
This is the C version I compare against which captures in essence what I'm trying to achieve. I try for a fair comparison but let me know if I missed something.
#include <stdio.h>
#include <stdint.h>
int main() {
uint64_t oldstate,state;
int i;
for(i=0;i<100000000;i++) {
oldstate = state;
// Advance internal state
state = oldstate * 6364136223846793005ULL + (1|1);
}
printf("%ld\n",state);
}
I tried initally with the Prelude iterate function but this result in lazy evaluation and a stack overflow. The ìterate'`is aimed at fixing that issue.
My next step was to try to get GHC to inline pcg32_random_rand that's where I added the strictness to it but that doesn't seem to be enough. When I look at the GHC core, it is not inlined.
#WillemVanOnsem I confirm with performthe result is on par with C and actually the pcg32_random_rfunction was inlined. I'm reaching the limit of my grasp of Haskell and GHC at this stage. Can you elaborate on why perform performs better and how to decide when to use what?
Would this transformation be feasible automatically by the compiler or is it something that requires a design decision?
The reason to ask the last question is that I would like as much to separate functionality and implementation choice (speed / space tradeoffs, ...) to maximize reuse and I was hoping Haskell to help me there.
It looks to me that the issue is more that you produce a list, and obtain the i-th element from that list. As a result you are going to unfold that list function, and each time you construct a new element if you need to move further in the list.
Instead of constructing such list (which will construct new nodes, and perform memory allocations, and consume a lot of memory). You can construct a function that will perform a given function n times:
perform_n :: (a -> a) -> Int -> a -> a
perform_n !f = step
where step !n !x | n <= 0 = x
| otherwise = step (n-1) (f x)
So now we can perform a function f n times. We can thus rewrite it like:
pcg32_rng n = perform_n (pcg32_random_r 1) (n-1) 0
If I compile this with ghc -O2 file.hs (GHC 8.0.2) run this file with time, I get:
$ time ./file
2264354473547460187
0.14user 0.00system 0:00.14elapsed 99%CPU (0avgtext+0avgdata 3408maxresident)k
0inputs+0outputs (0major+161minor)pagefaults 0swaps
the original file produces the following benchmarks:
$ time ./file2
2264354473547460187
0.54user 0.00system 0:00.55elapsed 99%CPU (0avgtext+0avgdata 3912maxresident)k
0inputs+0outputs (0major+287minor)pagefaults 0swaps
EDIT:
As #WillNess says, if you do not name the list, at runtime the list will be garbage collected: if you process through a list, and do not keep a reference to the head of the list, then that head can be removed once we step over it.
If we however construct a file like:
{-# LANGUAGE BangPatterns #-}
import System.Environment (getArgs)
import Data.Bits
import Data.Word
iterate' f !x = x : iterate' f (f x)
main = print $ pcg32_rng 100000000
pcg32_random_r :: Word64 -> Word64 -> Word64
pcg32_random_r !i !state = state * (6364136223846793005 :: Word64) + (i .|. 1)
{-# INLINE pcg32_random_r #-}
pcg32_rng n = iterate' (pcg32_random_r 1) 0 !! (n - 1)
we obtain:
$ time ./speedtest3
2264354473547460187
0.54user 0.01system 0:00.56elapsed 99%CPU (0avgtext+0avgdata 3908maxresident)k
0inputs+0outputs (0major+291minor)pagefaults 0swaps
although the memory burden can be reduced, there is little impact on time. The reason is probably that working with list elements creates cons objects. So we do a lot of packing and unpacking into lists. This also results in constructing a lot of objects (and memory allocations) which still produces overhead.
I have a relatively simple "copy" program that merely copies all the lines of one file to another. I'm playing around with Haskell's concurrency support with TMQueue and STM so I thought I'd try it like this:
{-# LANGUAGE BangPatterns #-}
module Main where
import Control.Applicative
import Control.Concurrent.Async -- from async
import Control.Concurrent.Chan
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TMQueue -- from stm-chans
import Control.Monad (replicateM, forM_, forever, unless)
import qualified Data.ByteString.Char8 as B
import Data.Function (fix)
import Data.Maybe (catMaybes, maybe)
import System.IO (withFile, IOMode(..), hPutStrLn, hGetLine)
import System.IO.Error (catchIOError)
input = "data.dat"
output = "out.dat"
batch = 100 :: Int
consumer :: TMQueue B.ByteString -> IO ()
consumer q = withFile output WriteMode $ \fh -> fix $ \loop -> do
!items <- catMaybes <$> replicateM batch readitem
forM_ items $ B.hPutStrLn fh
unless (length items < batch) loop
where
readitem = do
!item <- atomically $ readTMQueue q
return item
producer :: TMQueue B.ByteString -> IO ()
producer q = withFile input ReadMode $ \fh ->
(forever (B.hGetLine fh >>= atomically . writeTMQueue q))
`catchIOError` const (atomically (closeTMQueue q) >> putStrLn "Done")
main :: IO ()
main = do
q <- atomically newTMQueue
thread <- async $ consumer q
producer q
wait thread
I can make a little test input file like this
ghc -e 'writeFile "data.dat" (unlines (map show [1..5000000]))'
And build it like this
ghc --make QueueTest.hs -O2 -prof -auto-all -caf-all -threaded -rtsopts -o q
When I run it like so ./q +RTS -s -prof -hc -L60 -N2, it says that "2117 MB total memory in use"! But the input file is only 38 MB!
I am new to profiling, but I have produced graph after graph and cannot pinpoint my mistake.
As the OP points out, by now I may as well write a real answer. Let's start with the memory consumption.
Two useful references are Memory footprint of Haskell data types and http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html. We'll also need to look at the definitions of some of our structures.
-- from http://hackage.haskell.org/package/stm-chans-3.0.0.2/docs/src/Control-Concurrent-STM-TMQueue.html
data TMQueue a = TMQueue
{-# UNPACK #-} !(TVar Bool)
{-# UNPACK #-} !(TQueue a)
deriving Typeable
-- from http://hackage.haskell.org/package/stm-2.4.3/docs/src/Control-Concurrent-STM-TQueue.html
-- | 'TQueue' is an abstract type representing an unbounded FIFO channel.
data TQueue a = TQueue {-# UNPACK #-} !(TVar [a])
{-# UNPACK #-} !(TVar [a])
The TQueue implementation uses a standard functional queue with a read end and write end.
Let's set an upper bound on memory usage and assume that we read the entire file into the TMQueue before the consumer does anything. In that case, the write end of our TQueue will contain a list with one element per input line (stored as a bytestring). Each list node will look like
(:) bytestring tail
which takes 3 words (1 per field + 1 for the constructor). Each bytestring is 9 words, so add the two together and there are 12 words of overhead per line, not including the actual data. Your test data is 5 million lines, so that's 60 million words of overhead for the whole file (plus some constants), which on a 64-bit system is about 460MB (assuming I did my math right, always questionable). Add in 40MB for the actual data, and we get values pretty close to what I see on my system.
So, why is our memory usage close to this upper bound? I have a theory (investigation left as an exercise!). First, the producer is likely to run a bit faster than the consumer simply because reading is usually faster than writing (I'm using spinning disks, maybe an SSD would be different). Here's the definition of readTQueue:
-- |Read the next value from the 'TQueue'.
readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
xs <- readTVar read
case xs of
(x:xs') -> do writeTVar read xs'
return x
[] -> do ys <- readTVar write
case ys of
[] -> retry
_ -> case reverse ys of
[] -> error "readTQueue"
(z:zs) -> do writeTVar write []
writeTVar read zs
return z
First we try to read from the read end, and if that's empty we try to read from the write end, after reversing that list.
What I think is happening is this: when the consumer needs to read from the write end, it needs to traverse the input list within the STM transaction. This takes some time, which will cause it to contend with the producer. As the producer gets further ahead, this list gets longer, causing the read to take yet more time, during which the producer is able to write more values, causing the read to fail. This process repeats until the producer finishes, and only then does the consumer get a chance to process the bulk of the data. Not only does this ruin concurrency, it adds more CPU overhead because the consumer transaction is continually retrying and failing.
So, what about unagi? There are a couple key differences. First, unagi-chan uses arrays internally instead of lists. This reduces the overhead a little. Most of the overhead is from the ByteString pointers, so not much, but a little. Secondly, unagi keeps chunks of arrays. Even if we pessimistically assume that the producer always wins contentions, after the array gets filled it's pushed off the producer's side of the channel. Now the producer is writing to a new array and the consumer reads from the old array. This situation is near-ideal; there's no contention to shared resources, the consumer has good locality of reference, and because the consumer is working on a different chunk of memory there aren't issues with cache coherence. Unlike my theoretical description of the TMQueue, now you're getting concurrent operations, allowing the producer to clear some of the memory usage so it never hits the upper bound.
As an aside, I think the consumer batching is not beneficial. Handles are buffered by the IO subsystem already, so I don't think this gains anything. For me performance improved a little when I changed the consumer to operate line-by-line anyway.
Now, what can you do about this problem? Going from my working hypothesis that TMQueue is suffering from contention problems, and your specified requirements, you'll just need to use another type of queue. Obviously unagi works pretty well. I also tried TMChan, it was about 25% slower than unagi but used 45% less memory, so that could be a good option too. (this isn't too surprising, TMChan has a different structure from TMQueue so it'll have different performance characteristics)
You could also try to change your algorithm so that the producer sends multi-line chunks. This would lower the memory overhead from all the ByteStrings.
So, when is it ok to use TMQueue? If the producer and consumer are about the same speed, or the consumer is faster, it should be ok. Also, if processing times are non-uniform, or the producer runs in bursts, you'll probably get good amortized performance. This is pretty much a worst-case situation, and perhaps it should be reported as a bug against stm? I think if the read function were changed to
-- |Read the next value from the 'TQueue'.
readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
xs <- readTVar read
case xs of
(x:xs') -> do writeTVar read xs'
return x
[] -> do ys <- readTVar write
case ys of
[] -> retry
_ -> do writeTVar write []
let (z:zs) = reverse ys
writeTVar read zs
return z
it would avoid this problem. Now the z and zs bindings should both be evaluated lazily, so the list traversal would happen outside this transaction, allowing the read operation to succeed sometimes under contention. Assuming I'm correct about the issue in the first place, of course (and that this definition is lazy enough). There might be other unexpected downsides though.
I'm writing code to do a subset product: it takes a list of elements and a list of indicator variables (of the same length). The product is computed in a tree, which is crucial to our application. Each product is expensive, so my goal was to compute each level of the tree in parallel, evaluating consecutive levels in sequence. Thus there isn't any nested parallelism going on.
I only have repa code in ONE function, near the top level of my overall code. Note that subsetProd is not monadic.
The steps:
chunk up the lists into pairs (no parallelism)
zip the chunked lists (no parallelism)
map the product function over this list (using Repa map), creating a Delayed array
call computeP to evaluate the map in parallel
convert the Repa result back to a list
make a recursive call (on lists half the size of the inputs)
The code:
{-# LANGUAGE TypeOperators, FlexibleContexts, BangPatterns #-}
import System.Random
import System.Environment (getArgs)
import Control.Monad.State
import Control.Monad.Identity (runIdentity)
import Data.Array.Repa as Repa
import Data.Array.Repa.Eval as Eval
import Data.Array.Repa.Repr.Vector
force :: (Shape sh) => Array D sh e -> Array V sh e
force = runIdentity . computeP
chunk :: [a] -> [(a,a)]
chunk [] = []
chunk (x1:x2:xs) = (x1,x2):(chunk xs)
slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1)
testSubsetProd :: Int -> Int -> IO ()
testSubsetProd size seed = do
let work = do
!flags <- replicateM size (state random)
!values <- replicateM size (state $ randomR (1,10))
return $ subsetProd values flags
value = evalState work (mkStdGen seed)
print value
subsetProd :: [Int] -> [Bool] -> Int
subsetProd [!x] _ = x
subsetProd !vals !flags =
let len = (length vals) `div` 2
!valpairs = Eval.fromList (Z :. len) $ chunk vals :: (Array V (Z :. Int) (Int, Int))
!flagpairs = Eval.fromList (Z :. len) $ chunk flags :: (Array V (Z :. Int) (Bool, Bool))
!prods = force $ Repa.zipWith mul valpairs flagpairs
mul (!v0,!v1) (!f0,!f1)
| (not f0) && (not f1) = 1
| (not f0) = v0+1
| (not f1) = v1+1
| otherwise = fromInteger $ slow_fib ((v0*v1) `mod` 35)
in subsetProd (toList prods) (Prelude.map (uncurry (||)) (toList flagpairs))
main :: IO ()
main = do
args <- getArgs
let [numleaves, seed] = Prelude.map read args :: [Int]
testSubsetProd numleaves seed
The entire program is compiled with
ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3
per these instructions, on GHC 7.6.2 x64.
I ran my program (Subset) using
$> time ./Test 4096 4 +RTS -sstderr -N4
8 seconds later later:
672,725,819,784 bytes allocated in the heap
11,312,267,200 bytes copied during GC
866,787,872 bytes maximum residency (49 sample(s))
433,225,376 bytes maximum slop
2360 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284212 colls, 1284212 par 174.17s 53.20s 0.0000s 0.0116s
Gen 1 49 colls, 48 par 13.76s 4.63s 0.0946s 0.6412s
Parallel GC work balance: 16.88% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 497.80s (448.38s elapsed)
GC time 187.93s ( 57.84s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 685.73s (506.21s elapsed)
Alloc rate 1,351,400,138 bytes per MUT second
Productivity 72.6% of total user, 98.3% of total elapsed
gc_alloc_block_sync: 8670031
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 571398
My code does get slower as I increase the -N parameter, (7.628 seconds for -N1, 7.891 seconds for -N2, 8.659 seconds for -N4) but I'm getting 0 sparks created, which seems like a prime suspect as to why I'm not getting any parallelism. Also, compiling with a whole slew of optimizations helps with the runtime, but not the parallelism.
Threadscope confirms that no serious work is being done on three HECs, but the garbage collector seems to be using all 4 HECs.
So why isn't Repa making any sparks? My product tree has 64 leaves, so even if Repa made a spark for every internal node, there should be ~63 sparks. I feel like it could have something to do with my use of the ST monad encapsulating the parallelism, though I'm not quite sure why this would cause an issue. Perhaps sparks can only be created in an IO monad?
If this is the case, does anyone have an idea of how I could perform this tree product where each level is done in parallel (without resulting in nested parallelism, which seems unnecessary for my task). In general, perhaps there is a better way to parallelize the tree product or make better use of Repa.
Bonus points for explaining why the runtime increases as I increase the -N parameter, even when no sparks are created.
EDIT
I changed the code example above to be a compiling example of my problem. The program flow almost perfectly matches my real code: I randomly choose some inputs, and then do a subset product on them. I am now using the identity monad. I have tried lots of small changes to my code: inlining or not, bang patterns or not, variations on using two Repa lists and a Repa zipWith vs zipping the lists sequentially and using a Repa map, etc, none of which helped at all.
Even if I'm running into this problem in my example code, my real program is much larger.
Why is there no parallelism?
The main reason (at least for your now simplified and working) program for there being no parallelism is that you're using computeP on an array of V representation, and normal vectors aren't strict in their element types. So you aren't actually doing any real work in parallel. The easiest fix is to use an unboxed U array as the result, by changing force to this definition:
force :: (Shape sh, Unbox e) => Array D sh e -> Array U sh e
force a = runIdentity (computeP a)
I do recall that in your original code you claimed you're working with a complicated datatype that isn't unboxed. But is it really impossible to make it so? Perhaps you can extract the data you actually need into some unboxable representation? Or make the type an instance of the Unbox class? If not, then you can also use the following variant of force that works for a V-array:
import Control.DeepSeq (NFData(..))
...
force :: (Shape sh, NFData e) => Array D sh e -> Array V sh e
force a = runIdentity $ do
r <- computeP a
!b <- computeUnboxedP (Repa.map rnf r)
return r
The idea here is that we first compute the V-array structure, and then we compute a U-array of () type from it by mapping rnf over the array. The resulting array is uninteresting, but each of the V-array's elements will be forced in the process1.
Either of these changes brings runtime for a problem size of 4096 from ~9 down to ~3 seconds with -N4 on my machine.
In addition, I think it's strange that you convert between lists and arrays in every step. Why not make subsetProd take two arrays? Also, at least for the values, using an intermediate V array for the pairs seems unnecessary, you could just as well use a D array. But in my experiments these changes didn't have a significant beneficial effect on runtime.
Why are there no sparks?
Repa does never create sparks. Haskell has many different approaches to parallelism, and sparks are one particular mechanism that has special support in the run-time system. However, only some libraries, for example the parallel package and one particular scheduler of the monad-par package, actually make use of the mechanism. Repa, however, does not. It uses forkIO, i.e., threads, internally, but provides a pure interface to the outside. So the absence of sparks is in itself nothing to worry about.
1. I originally had no idea how to do that, so I asked Ben Lippmeier, the author of Repa. Thanks a lot to Ben for pointing out the option of mapping rnf to produce a different array, and the fact that there's an Unbox instance for (), to me.