I'm building a project that stores words in a dictionary (using the dawg library). When it is compiled with -fprof-auto, the profiler doesn't tell me that most of the time is spent in functions and CAFs from the dawg modules.
The code (also using conduit, but it's pretty straightforward) is:
import Data.DAWG.Static as D
import qualified Data.DAWG.Dynamic as DD
import Conduit
import qualified Data.Conduit.Combinators as C
import qualified Data.Text as T
import Data.List (isSuffixOf)
import Control.Monad

insertEntry dawg word =
    DD.insertWith (+) (T.unpack word) 1 dawg

isWhitespace x = x `elem` [' ', '.', '\n', '\'']

appendFileToDDAWG dawg fp =
    C.sourceFile fp $= C.decodeUtf8
                    $= C.splitOnUnboundedE isWhitespace
                    $$ C.foldl insertEntry dawg

loadDirToDAWG :: FilePath -> IO (DAWG Char () Int)
loadDirToDAWG dir = runResourceT $ do
    d <- C.sourceDirectoryDeep True dir
             $= C.filter (".txt" `isSuffixOf`)
             $$ C.foldM appendFileToDDAWG DD.empty
    return $ D.freeze d

main = do
    d <- loadDirToDAWG some_directory
    mapM_ print $ D.assocs d
Running with +RTS -p shows me most of the time is spent in insertEntry (which is normal):
                                                                  individual     inherited
COST CENTRE          MODULE               no.   entries   %time %alloc   %time %alloc

MAIN                 MAIN                 211   0           0.0   0.0    100.0 100.0
 main                Main                 423   0           2.3   1.7     99.9 100.0
  loadDirToDAWG      Mymodule.BuildDAWG   425   0           1.9   1.0     97.6  98.2
   appendFileToDDAWG Mymodule.BuildDAWG   427   74          8.0   5.1     95.7  97.2
    insertEntry      Mymodule.BuildDAWG   430   71366      86.5  92.1     86.5  92.1
 [...]
 CAF                 Data.DAWG.Static     418   0           0.1   0.0      0.1   0.0
 CAF                 Data.DAWG.Trans.Map  412   0           0.0   0.0      0.0   0.0
But it doesn't tell me that the time is actually spent inside the Data.DAWG.Dynamic module. That's weird, because it does show the Data.DAWG.Static module, so it's able to "detect" some modules from dawg, just not all of them, and especially not the one where most of the work is done.
After downloading dawg, modifying its .cabal file so it's compiled with -fprof-auto-top, and rebuilding everything, I get a larger profiler output that shows all the inner functions of Data.DAWG.Dynamic and looks OK. But I don't want the full detail (I'm not profiling dawg, just my own code); I just want to be sure that the time is spent in dawg and not in my code (otherwise my code has a problem).
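For reference, the change amounts to something like this in dawg's .cabal file (a sketch; the exact layout of the library stanza in the real file may differ), plus rebuilding the package with profiling enabled, e.g. via cabal install -p:

library
  ghc-options: -fprof-auto-top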
So why in the first case was the Data.DAWG.Dynamic module not shown?
Is there something I'm missing regarding how GHC handles profiling?
In my computer science class we were using Haskell to solve the "queens" problem, in which you must find all possible placements of n queens on an n×n board. This was the code we were given:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | partial <- solve (k-1), h <- [0..(n-1)], safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
However, the first time I entered it I accidentally swapped the order of the generators in solve k, and found that it still gave a correct solution but took much longer:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | h <- [0..(n-1)], partial <- solve (k-1), safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
Why does this second version take so much longer? My thought process is that the second version does recursion at every step while the first version does recursion only once and then backtracks. This is not for a homework problem, I'm just curious and feel like it will help me better understand the language.
Simply put,
[ ... | x <- f 42, n <- [1..100] ]
will evaluate f 42 once to a list, and for each element x of that list it will generate all the ns from 1 to 100. Instead,
[ ... | n <- [1..100], x <- f 42 ]
will first generate each n from 1 to 100, and for each of them evaluate f 42. So f is now called 100 times instead of once.
This is no different from what happens in imperative programming when using nested loops:
for x in f(42):          # calls f once
    for n in range(1, 101):
        ...

for n in range(1, 101):
    for x in f(42):      # calls f 100 times
        ...
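You can watch the same difference in Haskell with Debug.Trace. A minimal sketch (f here is a hypothetical stand-in for any list-producing function; the counts assume no optimizations such as full laziness have floated f 42 out of the comprehension):

import Debug.Trace (trace)

f :: Int -> [Int]
f x = trace "f evaluated" [x, x + 1]

main :: IO ()
main = do
    print [ x + n | x <- f 42, n <- [1..3] ]  -- "f evaluated" is printed once
    print [ x + n | n <- [1..3], x <- f 42 ]  -- "f evaluated" is printed three times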
The fact that your algorithm is recursive makes this swap particularly expensive, since the additional cost factor (100, above) accumulates at each recursive call.
You can also try to bind the result of f 42 to some variable so that it does not need to be recomputed, even if you nest it the other way around:
[ ... | let xs = f 42, n <- [1..100], x <- xs ]
Note that this will keep the whole xs list in memory for the whole loop, preventing it from being garbage collected. Indeed, xs will be fully evaluated for n=1, and then reused for higher values of n.
My guess is that your first version does a depth-first traversal while your second version does a breadth-first traversal of the tree (see Tree Traversal on Wikipedia).
As the complexity of the problem grows with the size of the board, the second version uses more and more memory to keep track of each level of the tree while the first version quickly forgets the previous branch it visited.
Managing the memory takes a lot of time!
By enabling profiling, you can see how the Haskell runtime behaves with your functions.
If you compare the number of calls, they are exactly the same, but the second version still takes more time:
                                                       individual     inherited
COST CENTRE          MODULE  no.   entries    %time %alloc   %time %alloc

MAIN                 MAIN     44   0            0.0   0.0    100.0 100.0
 main                Main     89   0            0.3   0.0      0.3   0.0
 CAF                 Main     87   0            0.0   0.0     99.7 100.0
  main               Main     88   1            0.2   0.6     99.7 100.0
   queens2           Main     94   1            0.0   0.0     55.6  48.2
    queens2.solve    Main     95   13           3.2   0.8     55.6  48.2
     queens2.safe    Main     96   10103868    42.1  47.5     52.3  47.5
      queens2.checks Main    100   37512342    10.2   0.0     10.2   0.0
   queens1           Main     90   1            0.0   0.0     43.9  51.1
    queens1.solve    Main     91   13           2.0   1.6     43.9  51.1
     queens1.safe    Main     92   10103868    29.3  49.5     41.9  49.5
      queens1.checks Main     93   37512342    12.7   0.0     12.7   0.0
Looking at the heap profile tells you what really happens.
The first version's heap profile shows a small, constant heap use, while the second version's shows a huge heap use that must also face garbage collection (look at the peaks in the profile).
Looking at the Core, the first version generates a single function, which is tail recursive (constant stack space: a very fast and very nice function. Thanks, GHC!). However, the second version generates two functions: one to do a single step of the inner loop, and a second function which looks like
loop x = case x of { 0 -> someDefault; _ -> do1 (loop (x-1)) }
This function is likely not performant because do1 must traverse the entire input list, and each iteration appends new elements to the list (meaning the input list to do1 grows monotonically in length), whereas the Core function for the fast version generates the output list directly, without having to process some other list. It is quite difficult to reason about the performance of list comprehensions, I believe, so first translate the function to not use them:
guard b = if b then [()] else []

solve_good k =
    concatMap (\partial ->
        concatMap (\h ->
            guard (safe h partial) >> return (h:partial)
        ) [0..n-1]
    ) (solve $ k-1)

solve_bad k =
    concatMap (\h ->
        concatMap (\partial ->
            guard (safe h partial) >> return (h:partial)
        ) (solve $ k-1)
    ) [0..n-1]
The transformation is fairly mechanical and is detailed somewhere in the Haskell report, but essentially <- becomes concatMap and conditions become guards. It is much easier to see what is happening now: solve_good makes a recursive call a single time, then concatMaps over that recursively created list. However, solve_bad makes the recursive call inside the outer concatMap, meaning it will potentially (likely) be recomputed for every element in [0..n-1]. Note that there is no semantic reason for solve $ k-1 to be in the inner concatMap: it does not depend on the value bound by that concatMap (the h variable), so it can be safely lifted out above the concatMap which binds h (as is done in solve_good).
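For instance, here is a sketch of that lifting applied directly to solve_bad: let-binding the recursive call computes it once while keeping the original generator order (solve_lifted is an illustrative name; safe and n are assumed to be in scope, as in the original where clause):

solve_lifted 0 = [ [] ]
solve_lifted k =
    let partials = solve_lifted (k-1)  -- computed once, shared by every h
    in concatMap (\h ->
           concatMap (\partial ->
               guard (safe h partial) >> return (h:partial)
           ) partials
       ) [0..n-1]

In comprehension form this is the let trick from the first answer: [ h:partial | let partials = solve_lifted (k-1), h <- [0..n-1], partial <- partials, safe h partial ].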
I'm looking for optimization opportunities in my Haskell program by compiling with -prof, but I don't know how to interpret the cost centres that contain ellipses. What are filter.(...) and jankRoulette.select.(...)?
COST CENTRE               MODULE       %time %alloc

filter.(...)              Forest        46.5  22.3
set-union                 Forest        22.5   4.1
cache-lookup              Forest        16.0   0.1
removeMany                MultiMapSet    3.7   1.9
insertMany                MultiMapSet    3.3   1.8
jankRoulette.select.(...) Forest         1.4  15.2
I generated that with: $ ghc --make -rtsopts -prof -auto-all main.hs && ./main +RTS -p && cat main.prof
The function filter has a few definitions in a where clause, like this:
filter a b = blahblah
  where
    foo = bar
    bar = baz
    baz = bing
But those all show up as filter.foo, filter.bar, etc.
I thought they might be nested let expressions, but jankRoulette.select doesn't have any. And I've added SCC directives in front of most of them without any of those cost centres rising to the top.
Since most of the time is spent in filter.(...), I'd like to know what that is. :)
TL;DR: GHC generates this when you pattern match in a let binding, like let (x,y) = c. The cost of evaluating c is tracked by the (...) cost centre (since there is no unique name for it).
So how did I find this out?
A grep for (...) in the GHC source code finds the following (from compiler/deSugar/Coverage.hs):
-- TODO: Revisit this
addTickLHsBind (L pos (pat@(PatBind { pat_lhs = lhs, pat_rhs = rhs }))) = do
    let name = "(...)"
    (fvs, rhs') <- getFreeVars $ addPathEntry name $ addTickGRHSs False False rhs
    {- ... more code following, but not relevant to this purpose -}
That code tells us that it has something to do with pattern bindings.
So we can make a small test program to check the behavior:
x :: Int
(x:_) = reverse [1..1000000000]
main :: IO ()
main = print x
Then we can run this program with profiling enabled. And indeed, GHC generates the following output:
                                       individual     inherited
COST CENTRE MODULE  no.  entries  %time %alloc   %time %alloc

MAIN        MAIN     42  0          0.0   0.0    100.0 100.0
 CAF        Main     83  0          0.0   0.0    100.0 100.0
  (...)     Main     86  1        100.0 100.0    100.0 100.0
  x         Main     85  1          0.0   0.0      0.0   0.0
 main       Main     84  1          0.0   0.0      0.0   0.0
So it turns out the assumption made from the code was correct. All of the time of the program is spent evaluating the reverse [1..1000000000] expression, and it's assigned to the (...) cost centre.
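If you want that cost attributed to a meaningful name instead of (...), one option (a sketch, not the only way) is to avoid the pattern binding and mark the expensive expression with an explicit SCC annotation; "revList" below is an arbitrary label:

x :: Int
x = head ({-# SCC "revList" #-} reverse [1..1000000000])

main :: IO ()
main = print x

The time then shows up under revList in the .prof output.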
Purely Functional Data Structures has the following exercise:
-- 2.5 Sharing can be useful within a single object, not just between objects.
-- For example, if the two subtrees of a given node are identical, then they can
-- be represented by the same tree.
-- Part a: make a `complete a Int` function that creates a tree of
-- depth Int, putting a in every leaf of the tree.
complete :: a -> Integer -> Maybe (Tree a)
complete x depth
    | depth < 0 = Nothing
    | otherwise = Just $ complete' depth
    where complete' d
            | d == 0    = Empty
            | otherwise = let copiedTree = complete' (d-1)
                          in Node x copiedTree copiedTree
Does this implementation run in O(d) time? Could you please say why or why not?
The interesting part of the code is the complete' function:
complete' d
    | d == 0    = Empty
    | otherwise = let copiedTree = complete' (d-1)
                  in Node x copiedTree copiedTree
As Cirdec's answer suggests, we should be careful to analyze each part of the implementation to make sure our assumptions are valid. As a general rule, we can assume that the following take 1 unit of time each*:
Using a data constructor to construct a value (e.g., using Empty to make an empty tree or using Node to turn a value and two trees into a tree).
Pattern matching on a value to see what data constructor it was built from and what values the data constructor was applied to.
Guards and if/then/else expressions (which are implemented internally using pattern matching).
Comparing an Integer to 0.
Cirdec mentions that the operation of subtracting 1 from an Integer is logarithmic in the size of the integer. As they say, this is essentially an artifact of the way Integer is implemented. It is possible to implement integers so that it takes only one step to compare them to 0 and also takes only one step to decrement them by 1. To keep things very general, it's safe to assume that there is some function c such that the cost of decrementing an Integer is c(depth).
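For example, here is a minimal sketch of such a representation (Nat, isZero, and pred' are illustrative names, not from the book); both operations only inspect the outermost constructor, so each takes one step:

data Nat = Zero | Succ Nat

-- One step: inspect the outermost constructor.
isZero :: Nat -> Bool
isZero Zero = True
isZero _    = False

-- One step: peel off a single Succ.
pred' :: Nat -> Nat
pred' Zero     = Zero
pred' (Succ n) = n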
Now that we've taken care of those preliminaries, let's get down to work! As is generally the case, we need to set up a system of equations and solve it. Let f(d) be the number of steps needed to calculate complete' d. Then the first equation is very simple:
f(0) = 2
That's because it costs 1 step to compare d to 0, and another step to check that the result is True.
The other equation is the interesting part. Think about what happens when d > 0:
We calculate d == 0.
We check if that is True (it's not).
We calculate d-1 (let's call the result dm1).
We calculate complete' dm1, saving the result as copiedTree.
We apply a Node constructor to x, copiedTree, and copiedTree.
The first part takes 1 step. The second part takes 1 step. The third part takes c(depth) steps, and the fifth part takes 1 step. What about the fourth part? Well, that takes f(d-1) steps, so this will be a recursive definition.
f(0) = 2
f(d) = (3+c(depth)) + f(d-1) when d > 0
OK, now we're cooking with gas! Let's calculate the first few values of f:
f(0) = 2
f(1) = (3+c(depth)) + f(0) = (3+c(depth)) + 2
f(2) = (3+c(depth)) + f(1)
     = (3+c(depth)) + ((3+c(depth)) + 2)
     = 2*(3+c(depth)) + 2
f(3) = (3+c(depth)) + f(2)
     = (3+c(depth)) + (2*(3+c(depth)) + 2)
     = 3*(3+c(depth)) + 2
You should be starting to see a pattern by now:
f(d) = d*(3+c(depth)) + 2
We generally prove things about recursive functions using mathematical induction.
Base case:
The claim holds for d=0 because 0*(3+c(depth))+2=0+2=2=f(0).
Inductive step: Suppose that the claim holds for d=D. Then
f(D+1) = (3+c(depth)) + f(D)
       = (3+c(depth)) + (D*(3+c(depth))+2)
       = (D+1)*(3+c(depth))+2
So the claim holds for D+1 as well. Thus by induction, it holds for all natural numbers d. As a reminder, this gives the conclusion that complete' d takes
f(d) = d*(3+c(depth))+2
time. Now how do we express that in big O terms? Well, big O doesn't care about the constant coefficients of any of the terms, and only cares about the highest-order terms. We can safely assume that c(depth)>=1, so we get
f(d) ∈ O(d*c(depth))
Zooming out to complete, this looks like O(depth*c(depth)).
If you use the real cost of Integer decrement, this gives you O(depth*log(depth)). If you pretend that Integer decrement is O(1), this gives you O(depth).
Side note: As you continue to work through Okasaki, you will eventually reach section 10.2.1, where you will see a way to implement natural numbers supporting O(1) decrement and O(1) addition (but not efficient subtraction).
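To connect the two threads: if complete' recursed on unary naturals like the Nat sketched above instead of on Integer, each step would genuinely cost O(1), and the whole construction would be O(d). A sketch (complete'' is an illustrative name; Tree is the type from the question):

complete'' :: a -> Nat -> Tree a
complete'' _ Zero     = Empty
complete'' x (Succ d) = let t = complete'' x d  -- shared between both subtrees
                        in Node x t t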
* Haskell's lazy evaluation keeps this from being precisely true, but if you pretend that everything is evaluated strictly, you will get an upper bound for the true value, which will be good enough in this case. If you want to learn how to analyze data structures that use laziness to get good asymptotic bounds, you should keep reading Okasaki.
Theoretical Answer
No, it does not run in O(d) time. Its asymptotic performance is dominated by the Integer subtraction d-1, which takes O(log d) time. This is repeated O(d) times, giving an asymptotic upper bound on time of O(d log d).
This upper bound can improve if you use an Integer representation with an asymptotically optimal O(1) decrement. In practice we don't, since the asymptotically optimal Integer implementations are slower even for unimaginably large values.
In practice, the Integer arithmetic will be a small part of the program's running time. For practical "large" depths (smaller than a machine word), the running time will be dominated by allocating and populating memory. For larger depths you will exhaust the resources of the computer.
Practical Answer
Ask the run time system's profiler.
In order to profile your code, we first need to make sure it is run. Haskell is lazily evaluated, so, unless we do something to cause the tree to be completely evaluated, it might not be. Unfortunately, completely exploring the tree will take O(2^d) steps. We could avoid forcing nodes we had already visited if we kept track of their StableNames. Fortunately, traversing a structure and keeping track of visited nodes by their memory locations is already provided by the data-reify package. Since we will be using it for profiling, we need to install it with profiling enabled (-p).
cabal install -p data-reify
Using Data.Reify requires the TypeFamilies extension and Control.Applicative.
{-# LANGUAGE TypeFamilies #-}
import Data.Reify
import Control.Applicative
We reproduce your Tree code.
data Tree a = Empty | Node a (Tree a) (Tree a)

complete :: a -> Integer -> Maybe (Tree a)
complete x depth
    | depth < 0 = Nothing
    | otherwise = Just $ complete' depth
    where complete' d
            | d == 0    = Empty
            | otherwise = let copiedTree = complete' (d-1)
                          in Node x copiedTree copiedTree
Converting data to a graph with data-reify requires that we have a base functor for the data type. The base functor is a representation of the type with explicit recursion removed. The base functor for Tree is TreeF. An additional type parameter is added for the representation of recursive occurrences of the type, and each recursive occurrence is replaced by the new parameter.
data TreeF a x = EmptyF | NodeF a x x
    deriving (Show)
The MuRef instance required by reifyGraph needs us to provide a mapDeRef to traverse the structure with an Applicative and convert it to the base functor. The first argument provided to mapDeRef, which I have named deRef, is how we can convert the recursive occurrences of the structure.
instance MuRef (Tree a) where
    type DeRef (Tree a) = TreeF a
    mapDeRef deRef Empty        = pure EmptyF
    mapDeRef deRef (Node a l r) = NodeF a <$> deRef l <*> deRef r
We can make a little program to run to test the complete function. When the graph is small, we'll print it out to see what's going on. When the graph gets big, we'll only print out how many nodes it has.
main = do
    d <- getLine
    let (Just tree) = complete 0 (read d)
    graph@(Graph nodes _) <- reifyGraph tree
    if length nodes < 30
        then print graph
        else print (length nodes)
I put this code in a file named profileSymmetricTree.hs. To compile it, we need to enable profiling with -prof and run-time system options with -rtsopts.
ghc -fforce-recomp -O2 -prof -fprof-auto -rtsopts profileSymmetricTree.hs
When we run it, we'll enable the time profile with the +RTS option -p. We'll give it the depth input 3 for the first run.
profileSymmetricTree +RTS -p
3
let [(1,NodeF 0 2 2),(2,NodeF 0 3 3),(3,NodeF 0 4 4),(4,EmptyF)] in 1
We can already see from the graph that the nodes are being shared between the left and right sides of the tree.
The profiler makes a file, profileSymmetricTree.prof.
                                                                    individual     inherited
COST CENTRE                        MODULE  no.  entries   %time %alloc   %time %alloc

MAIN                               MAIN    43   0           0.0   0.7    100.0 100.0
 main                              Main    87   0         100.0  21.6    100.0  32.5
 ...
  main.(...)                       Main    88   1           0.0   4.8      0.0   5.1
   complete                        Main    90   1           0.0   0.0      0.0   0.3
    complete.complete'             Main    92   4           0.0   0.2      0.0   0.3
     complete.complete'.copiedTree Main    94   3           0.0   0.1      0.0   0.1
The entries column shows that complete.complete' was executed 4 times, and complete.complete'.copiedTree was evaluated 3 times.
If you repeat this experiment with different depths, and plot the results, you should get a good idea what the practical asymptotic performance of complete is.
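A sketch of one way to automate those runs from a POSIX shell (each run overwrites profileSymmetricTree.prof, so we save a copy per depth; the depth list is arbitrary):

for d in 1000 10000 100000 300000; do
    echo $d | ./profileSymmetricTree +RTS -p
    mv profileSymmetricTree.prof profileSymmetricTree-$d.prof
done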
Here are the profiling results for a much greater depth, 300000.
                                                                    individual     inherited
COST CENTRE                        MODULE  no.  entries   %time %alloc   %time %alloc

MAIN                               MAIN    43   0           0.0   0.0    100.0 100.0
 main                              Main    87   0           2.0   0.0     99.9 100.0
 ...
  main.(...)                       Main    88   1           0.0   0.0      2.1   5.6
   complete                        Main    90   1           0.0   0.0      2.1   5.6
    complete.complete'             Main    92   300001      1.3   4.4      2.1   5.6
     complete.complete'.copiedTree Main    94   300000      0.8   1.3      0.8   1.3
I'm benchmarking my program to see whether I can improve its performance. It takes an input file and runs some algorithm to split it into multiple files.
It takes roughly 14s to split a file into 3 parts, with -O2 compilation flag for both library and executable.
ghc-options: -Wall -fno-warn-orphans -O2 -auto-all
It looks like it is spending approximately 60% of its time in sinkFile, and I'm wondering whether there is anything I can do to improve the following code.
-- | Get the sink file, a list of FilePaths and the share number of the file to output to.
idxSinkFile :: MonadResource m
            => [FilePath]
            -> Int
            -> Consumer [Word8] m ()
idxSinkFile outFileNames shareNumber =
    let ccm = CC.concatMap $ flip atMay shareNumber
        cbs = CC.map BS.singleton
        sf  = sinkFile (outFileNames !! shareNumber)
    in ccm =$= cbs =$= sf

-- | Generate a sink which will take a list of bytes and write each byte to its corresponding file share.
sinkMultiFiles :: MonadResource m
               => [FilePath]
               -> [Int]
               -> Sink [Word8] m ()
sinkMultiFiles outFileNames xs =
    let len = [0..length xs - 1]
    in getZipSink $ otraverse_ (ZipSink . idxSinkFile outFileNames) len
Here is the output of GHC's profiling:
                                                                   individual     inherited
COST CENTRE          MODULE                   no.  entries  %time %alloc   %time %alloc

splitFile.sink       HaskSplit.Conduit.Split  289  1          0.0   0.0    66.8  74.2
 sinkMultiFiles      HaskSplit.Conduit.Split  290  1         27.4  33.2    66.8  74.2
  idxSinkFile        HaskSplit.Conduit.Split  303  3          7.9  11.3    39.4  41.0
   idxSinkFile.ccm   HaskSplit.Conduit.Split  319  3          3.1   3.6     3.1   3.6
   idxSinkFile.cbs   HaskSplit.Conduit.Split  317  3          3.5   4.2     3.5   4.2
   idxSinkFile.sf    HaskSplit.Conduit.Split  307  3         24.9  21.9    24.9  21.9
  sinkMultiFiles.len HaskSplit.Conduit.Split  291  1          0.0   0.0     0.0   0.0
This shows sinkFile taking a lot of time. (In case you're wondering, I've benchmarked the list accesses etc., and they account for 0% of the processing.)
While I understand that for a small program like this, IO is often the bottleneck, I'd like to see if I can improve the runtime performance of my program.
Cheers!
Following nh2's advice, I decided to pack the ByteStrings in 256-byte chunks instead of doing a BS.singleton on each Word8 instance.
cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
instead of
cbs = CC.map BS.singleton
and I'm able to reduce the running time as well as the memory usage quite significantly, as demonstrated below:
Original Run
total time = 194.37 secs (194367 ticks @ 1000 us, 1 processor)
total alloc = 102,021,859,892 bytes (excludes profiling overheads)
New Run, with CL.take
total time = 35.88 secs (35879 ticks @ 1000 us, 1 processor)
total alloc = 21,970,152,800 bytes (excludes profiling overheads)
That's some serious improvement! I'd like to optimize it more but that's for another question :)
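For reference, here is idxSinkFile with that change applied; a sketch assuming the same imports as above plus Data.Conduit.List as CL:

idxSinkFile :: MonadResource m
            => [FilePath]
            -> Int
            -> Consumer [Word8] m ()
idxSinkFile outFileNames shareNumber =
    let ccm = CC.concatMap $ flip atMay shareNumber
        -- pack up to 256 bytes into one strict ByteString before writing,
        -- so sinkFile sees far fewer, larger buffers
        cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
        sf  = sinkFile (outFileNames !! shareNumber)
    in ccm =$= cbs =$= sf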
I am getting my feet wet writing concurrent programs in Haskell with GHC for multicore machines. As a first step I decided to write a program that reads and writes concurrently to an IOArray. I had the impression that reads and writes to IOArray involve no synchronization. I'm doing this to establish a baseline to compare with the performance of other data structures that do use appropriate synchronization mechanisms. I ran into some surprising results, namely that in many cases, I am not getting any speed up at all. This makes me wonder if there is some low-level synchronization happening in the GHC runtime, for example, synchronization and blocking on evaluation of thunks (i.e. "black holes"). Here are the details...
I wrote a couple of variations on a single program. The main idea is a DirectAddressTable data structure, which is simply a wrapper around an IOArray providing insert and lookup methods:
-- file DirectAddressTable.hs
module DirectAddressTable
    ( DAT
    , newDAT
    , lookupDAT
    , insertDAT
    , getAssocsDAT
    ) where

import Data.Array.IO
import Data.Array.MArray

newtype DAT = DAT (IOArray Int Char)

-- create a fixed-size array; missing keys have value '-'.
newDAT :: Int -> IO DAT
newDAT n = do a <- newArray (0, n - 1) '-'
              return (DAT a)

-- look up an item.
lookupDAT :: DAT -> Int -> IO (Maybe Char)
lookupDAT (DAT a) i = do c <- readArray a i
                         return (if c == '-' then Nothing else Just c)

-- insert an item.
insertDAT :: DAT -> Int -> Char -> IO ()
insertDAT (DAT a) i v = writeArray a i v

-- get all associations (exclude missing items, i.e. those whose value is '-').
getAssocsDAT :: DAT -> IO [(Int,Char)]
getAssocsDAT (DAT a) = do
    assocs <- getAssocs a
    return [ (k,c) | (k,c) <- assocs, c /= '-' ]
I then have a main program that initializes a new table, forks some threads, and has each thread write and read some fixed number of values to the just-initialized table. The overall number of elements to write is fixed. The number of threads is taken from a command-line argument, and the elements to process are evenly divided among the threads.
-- file DirectTableTest.hs
import DirectAddressTable
import Control.Concurrent
import Control.Parallel
import System.Environment
main = do
    args <- getArgs
    let numThreads = read (args !! 0)
    vs <- sequence (replicate numThreads newEmptyMVar)
    a <- newDAT arraySize
    sequence_ [ forkIO (doLotsOfStuff numThreads i a >>= putMVar v)
              | (i,v) <- zip [1..] vs ]
    sequence_ [ takeMVar v >>= \a -> getAssocsDAT a >>= \xs -> print (last xs)
              | v <- vs ]

doLotsOfStuff :: Int -> Int -> DAT -> IO DAT
doLotsOfStuff numThreads i a = do
    let p j c = (c `seq` insertDAT a j c) >>
                lookupDAT a j >>= \v ->
                v `pseq` return ()
    sequence_ [ p j c | (j,c) <- bunchOfKeys i ]
    return a
  where
    bunchOfKeys i = take numElems $ zip cyclicIndices $ drop i cyclicChars
    numElems      = numberOfElems `div` numThreads
    cyclicIndices = cycle [0..highestIndex]
    cyclicChars   = cycle chars
    chars         = ['a'..'z']
-- Parameters
arraySize :: Int
arraySize = 100
highestIndex = arraySize - 1
numberOfElems = 10 * 1000 * 1000
I compiled this using ghc 7.2.1 (similar results with 7.0.3) with "ghc --make -rtsopts -threaded -fforce-recomp -O2 DirectTableTest.hs".
Running "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds and running "time ./DirectTableTest 2 +RTS -N2" take about 2.0 seconds! Using one more core than worker threads is a little better, with "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds and running "time ./DirectTableTest 1 +RTS -N2" and "time ./DirectTableTest 2 +RTS -N3" both taking about 1.4 seconds.
Running with the "-N2 -s" option shows that productivity is 95.4% and GC is 4.3%. Looking at a run of the program with ThreadScope I don't see anything too alarming. Each HEC yields once per ms when a GC occurs. Running with 4 cores gives a time of about 1.2 seconds, which is at least a little better than 1 core. More cores doesn't improve over this.
I found that changing the array type used in the implementation of DirectAddressTable from IOArray to IOUArray fixes this problem. With this change, the running time of "time ./DirectTableTest 1 +RTS -N1" is about 1.4 seconds, whereas running "time ./DirectTableTest 2 +RTS -N2" takes about 1.0 seconds. Increasing to 4 cores gives a run time of 0.55 seconds. Running with "-s" shows a GC time of 3.9%. Under ThreadScope I can see that both threads yield every 0.4 ms, more frequently than in the previous program.
Finally, I tried one more variation. Instead of having the threads work on the same shared array, I had each thread work on its own array. This scales nicely (as you would expect), more or less like the second program, with either IOArray or IOUArray implementing the DirectAddressTable data structure.
I understand why IOUArray might perform better than IOArray, but I don't know why it scales better to multiple threads and cores. Does anyone know why this might be happening or what I can do to find out what is going on? I wonder if this problem could be due to multiple threads blocking while evaluating the same thunk and whether it is related to this: http://hackage.haskell.org/trac/ghc/ticket/3838 .
Running "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds and running "time ./DirectTableTest 2 +RTS -N2" take about 2.0 seconds!
I cannot reproduce your results:
$ time ./so2 1 +RTS -N1
(99,'k')
real 0m0.950s
user 0m0.932s
sys 0m0.016s
tommd@Mavlo:Test$ time ./so2 2 +RTS -N2
(99,'s')
(99,'s')
real 0m0.589s
user 0m1.136s
sys 0m0.024s
And this seems to scale as expected as the number of lightweight threads increases too:
ghc -O2 so2.hs -threaded -rtsopts
[1 of 2] Compiling DirectAddressTable2 ( DirectAddressTable2.hs, DirectAddressTable2.o )
[2 of 2] Compiling Main ( so2.hs, so2.o )
Linking so2 ...
tommd@Mavlo:Test$ time ./so2 4
(99,'n')
(99,'z')
(99,'y')
(99,'y')
real 0m1.538s
user 0m1.320s
sys 0m0.216s
tommd@Mavlo:Test$ time ./so2 4 +RTS -N2
(99,'z')
(99,'x')
(99,'y')
(99,'y')
real 0m0.600s
user 0m1.156s
sys 0m0.020s
Do you actually have 2 CPUs? If you run with more GHC threads (-Nx) than you have available CPUs, then your results will be very poor. What I think I'm really asking is: are you sure no other CPU-intensive processes are running on your system?
As for the IOUArray (from your edit):
I understand why IOUArray might perform better than IOArray, but I don't know why it scales better to multiple threads and cores
An unboxed array will be contiguous and thus benefit much more from caching. Boxed values living in arbitrary locations on the heap could cause a large increase in cache invalidations between the cores.
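A sketch of the change described here: since IOUArray has an MArray instance for Char, only the wrapper's representation needs to change, and the rest of the module compiles as-is:

import Data.Array.IO (IOUArray)

-- Unboxed: all the Chars live contiguously in a single heap object, with no
-- per-element pointers for the GC to trace or for the cores to invalidate.
newtype DAT = DAT (IOUArray Int Char)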