Julia JuMP garbage collection time

I'm new to Julia and am trying to use JuMP to solve an NLP; below is the code:
m = JuMP.Model(solver=Ipopt.IpoptSolver(max_iter=50, tol=1e-6))
function lkhf(x1, x2, x3, x4, x5, x6, x7, x8)
    x = [x1, x2, x3, x4, x5, x6, x7, x8]
    @time ll = loglikelihood(x, pdpoeFacSim, intercept, pdpoeFacMean, pdpoeFacInitial, pdpoeTarget, pdVar, poeVar, corPDPOE, resRnd)
    return ll
end
JuMP.register(m, :lkhf, 8, lkhf, autodiff=true)
@NLobjective(m, Min, lkhf(x1, x2, x3, x4, x5, x6, x7, x8))
As you can see from the code, I time each call the solver makes to my objective function, and I noticed something; below is part of the output:
0.224411 seconds (47.41 k allocations: 198.016 MiB, 54.80% gc time)
0.027915 seconds (30.61 k allocations: 22.983 MiB, 21.84% gc time)
0.213348 seconds (47.41 k allocations: 198.016 MiB, 56.25% gc time)
0.026281 seconds (30.61 k allocations: 22.983 MiB, 22.52% gc time)
0.214388 seconds (47.41 k allocations: 198.016 MiB, 55.36% gc time)
0.028030 seconds (30.61 k allocations: 22.983 MiB, 22.65% gc time)
I don't quite understand why calling the same objective function causes such significantly different GC times.
Could you please give me some advice on how I can speed up my code?

Related

Slowdown when using GHC parallel strategies

In order to learn about GHC's parallel strategies, I've written a simple particle simulator, that, given a particle's position, velocity, and acceleration, will project that particle's path forward.
import Control.Parallel.Strategies

-- Use phantom a to store axis.
newtype Pos a = Pos Double deriving Show
newtype Vel a = Vel Double deriving Show
newtype Acc a = Acc Double deriving Show

newtype TimeStep = TimeStep Double deriving Show

-- Phantom axis
data X
data Y

-- Position, velocity, acceleration for a particle.
data Particle = Particle (Pos X) (Pos Y) (Vel X) (Vel Y) (Acc X) (Acc Y)
  deriving (Show)

stepParticle :: TimeStep -> Particle -> Particle
stepParticle ts (Particle x y xv yv xa ya) =
    Particle x' y' xv' yv' xa' ya'
  where
    (x', xv', xa') = step ts x xv xa
    (y', yv', ya') = step ts y yv ya

-- Given a position, velocity, and accel, calculate the pos, vel, acc after
-- a given TimeStep.
step :: TimeStep -> Pos a -> Vel a -> Acc a -> (Pos a, Vel a, Acc a)
step (TimeStep ts) (Pos p) (Vel v) (Acc a) = (Pos p', Vel v', Acc a)
  where
    v' = ts * a + v
    p' = ts * v + p

-- Build a list of lazy infinite lists of a particle's travel
-- with each update a TimeStep apart. Evaluate each inner list in
-- parallel.
simulateParticlesPar :: TimeStep -> [Particle] -> [[Particle]]
simulateParticlesPar ts =
    withStrategy (parList (parBuffer 250 particleStrategy))
  . fmap (simulateParticle ts)

-- Build a lazy infinite list of the particle's travel with each
-- update being a TimeStep apart.
simulateParticle :: TimeStep -> Particle -> [Particle]
simulateParticle ts m = m' : simulateParticle ts m'
  where
    m' = stepParticle ts m

particleStrategy :: Strategy Particle
particleStrategy (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) = do
  x'  <- rseq x
  y'  <- rseq y
  xv' <- rseq xv
  yv' <- rseq yv
  xa' <- rseq xa
  ya' <- rseq ya
  return $ Particle (Pos x') (Pos y') (Vel xv') (Vel yv') (Acc xa') (Acc ya')

main :: IO ()
main = do
  let world = replicate 100 (Particle (Pos 0) (Pos 0) (Vel 1) (Vel 1) (Acc 0) (Acc 0))
      ts    = TimeStep 0.1
  print $ fmap (take 10000) (simulateParticlesPar ts world)
For each particle, I create a lazy infinite list projecting the particle's path into the future. I start with 100 of these particles and project them all forward, my intention being to evaluate each list in parallel (roughly one spark per infinite list). If I project these lists forward long enough, I'd expect a significant speedup. Unfortunately, I see a slight slowdown.
Compilation: ghc phys.hs -rtsopts -threaded -eventlog -O2
With 1 thread:
$ ./phys +RTS -N1 -sstderr -ls > /dev/null
24,264,983,224 bytes allocated in the heap
441,881,088 bytes copied during GC
1,942,848 bytes maximum residency (104 sample(s))
75,880 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46820 colls, 0 par 0.82s 0.88s 0.0000s 0.0039s
Gen 1 104 colls, 0 par 0.23s 0.23s 0.0022s 0.0037s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 1025000 (25 converted, 0 overflowed, 0 dud, 28680 GC'd, 996295 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 9.90s ( 10.09s elapsed)
GC time 1.05s ( 1.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.95s ( 11.20s elapsed)
Alloc rate 2,451,939,648 bytes per MUT second
Productivity 90.4% of total user, 88.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
With 2 threads:
$ ./phys +RTS -N2 -sstderr -ls > /dev/null
24,314,635,280 bytes allocated in the heap
457,603,240 bytes copied during GC
1,962,152 bytes maximum residency (104 sample(s))
119,824 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 1.40s 0.85s 0.0000s 0.0048s
Gen 1 104 colls, 103 par 0.42s 0.25s 0.0024s 0.0043s
Parallel GC work balance: 16.85% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 1025000 (1023572 converted, 0 overflowed, 0 dud, 1367 GC'd, 61 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 11.07s ( 11.20s elapsed)
GC time 1.82s ( 1.10s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 12.89s ( 12.30s elapsed)
Alloc rate 2,196,259,905 bytes per MUT second
Productivity 85.9% of total user, 90.0% of total elapsed
gc_alloc_block_sync: 9222
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2393
I have an Intel i5 with 2 cores and 4 threads, and with -N4 it's 2x slower than -N1 (total time ~20 sec).
I've spent quite a bit of time trying different strategies, such as chunking the outer list (so each spark gets more than one stream to project forward) and using rpar for each field in particleStrategy, but I've yet to get any speedup at all. The chunking variant is sketched below.
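The chunked variant looked roughly like this (a sketch from memory; the chunk size of 10 is arbitrary, and parBuffer still rolls a window over each infinite inner list so the chunks stay evaluable):

-- Each spark now evaluates a chunk of 10 inner lists instead of one.
simulateParticlesChunked :: TimeStep -> [Particle] -> [[Particle]]
simulateParticlesChunked ts =
    withStrategy (parListChunk 10 (parBuffer 250 particleStrategy))
  . fmap (simulateParticle ts)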
Below is a zoomed in section of the eventlog under threadscope. As you can see, I'm getting almost no concurrency. Most of the work is being done by HEC0, with some activity from HEC1 interleaved in, but only one HEC is doing work at a time. This is pretty representative of all the strategies I've tried.
As a sanity check, I've run a few of the example programs from "Parallel and Concurrent Programming in Haskell" and also see slowdowns on these programs, even though I'm using the same params that give them significant speed-ups in the book! I'm beginning to think there's something wrong with my GHC.
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
Installed from: https://ghcformacosx.github.io/
OS X 10.10.2
Update:
I've found this thread in the GHC tracker about an OS X threaded-RTS performance regression: https://ghc.haskell.org/trac/ghc/ticket/7602. I'm hesitant to blame the compiler, but my -N4 output supports this hypothesis. The "parallel GC work balance" is terrible:
$ ./phys +RTS -N4 -sstderr -ls > /dev/null
24,392,146,832 bytes allocated in the heap
481,001,648 bytes copied during GC
1,989,272 bytes maximum residency (104 sample(s))
181,208 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 4.80s 1.98s 0.0000s 0.0055s
Gen 1 104 colls, 103 par 0.99s 0.39s 0.0037s 0.0048s
Parallel GC work balance: 7.59% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 1025000 (1023640 converted, 0 overflowed, 0 dud, 1331 GC'd, 29 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.85s ( 13.12s elapsed)
GC time 5.79s ( 2.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 20.65s ( 15.49s elapsed)
Alloc rate 1,642,170,155 bytes per MUT second
Productivity 71.9% of total user, 95.9% of total elapsed
gc_alloc_block_sync: 61429
whitehole_spin: 0
gen[0].sync: 1
gen[1].sync: 617
On the other hand, I don't know if this explains my threadscope output, which shows a lack of any concurrency at all.

Making a histogram computation in Haskell faster

I am quite new to Haskell and I want to create a histogram. I am using Data.Vector.Unboxed to fuse operations on the data, which is blazing fast (when compiled with -O -fllvm); the bottleneck is my fold application, which aggregates the bucket counts.
How can I make it faster? I read about reducing the number of thunks by keeping things strict, so I've used seq and foldr', but I'm not seeing much of a performance increase. Your ideas are strongly encouraged.
import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
  where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n (\i -> 1)
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,cv):as)
      | k == ck   = let a = (ck,cv+v):as in a `seq` a
      | otherwise = let a = kv:acc in a `seq` a

main :: IO ()
main = print histogram
Compiled with:
ghc --make -O -fllvm histogram.hs
First, compile the program with -O2 -rtsopts. Then, to get a first idea of where to optimize, run the program with the options +RTS -sstderr:
$ ./question +RTS -sstderr
[(0,1000000),(1000000,1000000),(2000000,1000000),(3000000,1000000),(4000000,1000000),(5000000,1000000),(6000000,1000000),(7000000,1000000),(8000000,1000000),(9000000,1000000)]
1,193,907,224 bytes allocated in the heap
1,078,027,784 bytes copied during GC
282,023,968 bytes maximum residency (7 sample(s))
86,755,184 bytes maximum slop
763 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1964 colls, 0 par 3.99s 4.05s 0.0021s 0.0116s
Gen 1 7 colls, 0 par 1.60s 1.68s 0.2399s 0.6665s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.67s ( 2.68s elapsed)
GC time 5.59s ( 5.73s elapsed)
EXIT time 0.02s ( 0.03s elapsed)
Total time 8.29s ( 8.43s elapsed)
%GC time 67.4% (67.9% elapsed)
Alloc rate 446,869,876 bytes per MUT second
Productivity 32.6% of total user, 32.0% of total elapsed
Notice that 67% of your time is spent in GC! There is clearly something wrong. To find out what, we can run the program with heap profiling enabled (using +RTS -h); the resulting heap profile (figure not reproduced here) is dominated by thunks.
So you're leaking thunks. How does this happen? Looking at the code, the only place where a thunk is built up (recursively) in agg is the addition. Making cv strict by adding a bang pattern thus fixes the issue:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
  where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n id
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,!cv):as) -- Note the !
      | k == ck   = (ck,cv+v):as
      | otherwise = kv:acc

main :: IO ()
main = print histogram
Output:
$ time ./improved +RTS -sstderr
[(0,499999500000),(1000000,1499999500000),(2000000,2499999500000),(3000000,3499999500000),(4000000,4499999500000),(5000000,5499999500000),(6000000,6499999500000),(7000000,7499999500000),(8000000,8499999500000),(9000000,9499999500000)]
672,063,056 bytes allocated in the heap
94,664 bytes copied during GC
160,028,816 bytes maximum residency (2 sample(s))
1,464,176 bytes maximum slop
155 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 992 colls, 0 par 0.03s 0.03s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.03s 0.03s 0.0161s 0.0319s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.24s ( 1.25s elapsed)
GC time 0.06s ( 0.06s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.34s ( 1.34s elapsed)
%GC time 4.4% (4.5% elapsed)
Alloc rate 540,674,868 bytes per MUT second
Productivity 95.5% of total user, 95.1% of total elapsed
./improved +RTS -sstderr 1,14s user 0,20s system 99% cpu 1,352 total
This is much better.
So now you could ask: why did the issue appear, even though you used seq? The reason is that seq only forces its first argument to WHNF, and for a pair, (_,_) (where the _ are unevaluated thunks) is already in WHNF! Also, seq a a is the same as a, because seq a b (informally) means: evaluate a before b is evaluated. So seq a a just means: evaluate a before a is evaluated, and that is the same as just evaluating a!
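A tiny demonstration of the WHNF point (a sketch; the pair's fields are deliberately bottom, yet seq is satisfied):

-- The pair constructor is already WHNF, so seq never touches the fields:
-- this prints the message instead of crashing on undefined.
main :: IO ()
main = (undefined, undefined) `seq` putStrLn "pair was already in WHNF"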

Project Euler #14 Tips in Haskell? [closed]

I am trying Euler challenge 14, and I was wondering if I could have any tips for calculating it quickly in Haskell. I tried this naive approach:
import Data.List
import Data.Function

collatz n | even n    = n `quot` 2
          | otherwise = 3*n+1

colSeq = takeWhile (/= 1) . iterate collatz

main = print $ maximumBy (compare `on` (length . colSeq)) [1..999999]
But that took too long.
λ <*Main System.Timeout>: timeout (10^6*60) main
Nothing
I also tried using the reverse Collatz relation and keeping the lengths in a map to eliminate redundant calculations, but that didn't work either. I don't want the solution, but does anyone have some mathematical literature or programming technique that will make this quicker, or do I just have to leave it running overnight?
Your program is not as slow as you might think…
First of all, your program runs fine and finishes in under two minutes if you compile with -O2 and increase the stack size (I used +RTS -K100m, but your system might vary):
$ .\collatz.exe +RTS -K100m -s
65,565,993,768 bytes allocated in the heap
16,662,910,752 bytes copied during GC
77,042,796 bytes maximum residency (1129 sample(s))
5,199,140 bytes maximum slop
184 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 124724 colls, 0 par 18.41s 18.19s 0.0001s 0.0032s
Gen 1 1129 colls, 0 par 16.67s 16.34s 0.0145s 0.1158s
INIT time 0.00s ( 0.00s elapsed)
MUT time 39.98s ( 41.17s elapsed)
GC time 35.08s ( 34.52s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 75.06s ( 75.69s elapsed)
%GC time 46.7% (45.6% elapsed)
Alloc rate 1,639,790,387 bytes per MUT second
Productivity 53.3% of total user, 52.8% of total elapsed
…but that's still slow
Productivity of ~50% means that the GC runs for half of the time we spend waiting for the result. In our case we create too much garbage by iterating the sequence for every value.
Improvements
The Collatz sequence is a recursive sequence. Therefore, we should define it as a recursive sequence instead of an iterative one and have a look at what happens.
colSeq 1 = [1]
colSeq n
  | even n    = n : colSeq (n `div` 2)
  | otherwise = n : colSeq (3 * n + 1)
The list in Haskell is a fundamental type, so GHC should have some nifty optimizations for it (-O2). So let's try this one:
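For the run below I assume the driver is unchanged from the question:

main = print $ maximumBy (compare `on` (length . colSeq)) [1..999999]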
Result
$ .\collatz_rec.exe +RTS -s
37,491,417,368 bytes allocated in the heap
4,288,084 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 72068 colls, 0 par 0.22s 0.22s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 32.89s ( 33.12s elapsed)
GC time 0.22s ( 0.22s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 33.11s ( 33.33s elapsed)
%GC time 0.7% (0.7% elapsed)
Alloc rate 1,139,881,573 bytes per MUT second
Productivity 99.3% of total user, 98.7% of total elapsed
Note that we're now up to 99% productivity with ~80% of the original version's MUT time. Just by this small change we decreased the runtime tremendously.
Wait, there's more!
There's something rather strange here: why are we calculating the sequence length of both 1024 and 512? After all, the latter cannot create a longer Collatz sequence.
Improvements
However, in this case we must see the problem as one big task, not as an independent map over each value. We need to keep track of the values we have already calculated, and remove the values we have already visited.
We use Data.Set for this:
import qualified Data.Set as S

problem_14 :: S.Set Integer -> [(Integer, Integer)]
problem_14 s
  | S.null s  = []
  | otherwise = (c, fromIntegral $ length csq) : problem_14 rest
  where
    (c, rest') = S.deleteFindMin s
    csq        = colSeq c
    rest       = rest' `S.difference` S.fromList csq
And we use problem_14 like this:
main = print $ maximumBy (compare `on` snd) $ problem_14 $ S.fromList [1..999999]
Result
$ .\collatz_set.exe +RTS -s
18,405,282,060 bytes allocated in the heap
1,645,842,328 bytes copied during GC
27,446,972 bytes maximum residency (40 sample(s))
373,056 bytes maximum slop
79 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35193 colls, 0 par 2.17s 2.03s 0.0001s 0.0002s
Gen 1 40 colls, 0 par 0.84s 0.77s 0.0194s 0.0468s
INIT time 0.00s ( 0.00s elapsed)
MUT time 14.91s ( 15.17s elapsed)
GC time 3.02s ( 2.81s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 17.92s ( 17.98s elapsed)
%GC time 16.8% (15.6% elapsed)
Alloc rate 1,234,735,903 bytes per MUT second
Productivity 83.2% of total user, 82.9% of total elapsed
We lose some productivity, but that's reasonable: after all, we're now using a Set instead of a list, and 79 MB instead of 1 MB. However, our program now runs in 17s instead of 34s; that's only 25% of the original time.
Using ST
Inspiration (C++)
#include <iostream>
#include <vector>

int main(){
    std::vector<bool> Q(1000000,true);
    unsigned long long max_l = 0, max_c = 1;
    for(unsigned long i = 1; i < Q.size(); ++i){
        if(!Q[i])
            continue;
        unsigned long long c = i, l = 0;
        while(c != 1){
            if(c < Q.size()) Q[c] = false;
            c = c % 2 == 0 ? c / 2 : 3 * c + 1;
            l++;
        }
        if(l > max_l){
            max_l = l;
            max_c = i;
        }
    }
    std::cout << max_c << std::endl;
}
This program runs in 130 ms. Our best version so far needs 100 times longer. We can fix that.
Haskell
import Control.Monad (forM_, when)
import Control.Monad.ST (runST)
import Data.STRef (newSTRef, readSTRef, writeSTRef)
import qualified Data.Vector.Unboxed.Mutable as V

problem_14_vector_st :: Int -> (Int, Int)
problem_14_vector_st limit =
  runST $ do
    q    <- V.replicate (limit+1) True
    best <- newSTRef (1,1)
    forM_ [1..limit] $ \i -> do
      b <- V.read q i
      when b $ do
        let csq = colSeq $ fromIntegral i
        let l   = fromIntegral $ length csq
        forM_ (map fromIntegral csq) $ \j ->
          when (j <= limit && j >= 0) $ V.write q j False
        m <- fmap snd $ readSTRef best
        when (l > m) $ writeSTRef best (i,l)
    readSTRef best
Result
$ collatz_vector_st.exe +RTS -s
2,762,282,216 bytes allocated in the heap
10,021,016 bytes copied during GC
1,026,580 bytes maximum residency (2 sample(s))
21,684 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5286 colls, 0 par 0.02s 0.02s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.09s ( 3.08s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.11s ( 3.11s elapsed)
%GC time 0.5% (0.7% elapsed)
Alloc rate 892,858,898 bytes per MUT second
Productivity 99.5% of total user, 99.6% of total elapsed
~3 seconds. Someone else might know more tricks, but that's the most I could squeeze out of Haskell.
Caching the values of integers you've already hit will save you a lot of time. If you toss in the number 1234 and find that it takes 273 steps to get to 1, associate the values: 1234 -> 273.
Now if you ever hit 1234 while in a sequence, you don't have to take 273 more steps to find the answer; just add 273 to the number of steps you've taken so far and you know the length of the sequence.
Do this for every number you calculate, even the ones in the middle of a sequence. For example, if you are at 1234 and you don't have a value yet, do the step (divide by 2) and calculate and cache the value for 617. You cache almost all the important values really quickly this way. There are some really long chains that you'll end up on again and again.
The easiest way to cache all the values as you go is to make a recursive function, like this (in pseudo-code):
function collatz(number) {
    if number is 1: return 1
    else if number is in cache: return cached value
    else:
        perform step: newnumber = number / 2 if even, 3 * number + 1 if odd
        steps = collatz(newnumber) + 1  // +1 for the step we just took
        cache steps as the result for number
        return steps
}
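One way to render that pseudocode in Haskell (a sketch; the name collatzLen and the State-plus-Data.Map cache are my own choices, and it needs the mtl and containers packages):

import qualified Data.Map.Strict as M
import Control.Monad.State (State, evalState, get, modify)

-- Memoized Collatz length: consult the cache first; otherwise take one
-- step, recurse, and store the result for this number.
collatzLen :: Integer -> State (M.Map Integer Int) Int
collatzLen 1 = return 1
collatzLen n = do
  cache <- get
  case M.lookup n cache of
    Just len -> return len
    Nothing  -> do
      let next = if even n then n `quot` 2 else 3 * n + 1
      len <- (+ 1) <$> collatzLen next
      modify (M.insert n len)
      return len

main :: IO ()
main = print $ maximum $
  evalState (mapM (\n -> (\l -> (l, n)) <$> collatzLen n) [1 .. 999999]) M.empty

Because the cache is threaded through all the starting values, each chain is walked only once.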
Hopefully Haskell won't have problems with the recursion depths you'll end up with this way. If Haskell doesn't like it, though, you can implement the same thing with an explicit stack; it is just less intuitive.
The main source of the time and memory issues is that you build the whole Collatz sequences, whereas for the task you only need their lengths, and unfortunately laziness doesn't save the day. The simple solution that calculates only lengths runs in a few seconds:
simpleCol :: Integer -> Int
simpleCol 1 = 1
simpleCol x | even x    = 1 + simpleCol (x `quot` 2)
            | otherwise = 1 + simpleCol (3 * x + 1)

problem14 = maximum $ map simpleCol [1 .. 999999]
It also takes much less memory and doesn't need an enlarged stack:
$> ./simpleCollatz +RTS -s
simpleCollatz +RTS -s
2,517,321,124 bytes allocated in the heap
217,468 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 4804 colls, 0 par 0.00s 0.02s 0.0000s 0.0046s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 4.47s ( 4.49s elapsed)
GC time 0.00s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.47s ( 4.52s elapsed)
%GC time 0.0% (0.5% elapsed)
Alloc rate 563,316,615 bytes per MUT second
Productivity 100.0% of total user, 98.9% of total elapsed
To illustrate the proposed caching solution, there is a nifty technique called memoization. Arguably the easiest way to use it is to install the memoize package:
import Data.Function.Memoize

memoCol :: Integer -> Int
memoCol = memoFix mc
  where
    mc _ 1 = 1
    mc f x | even x    = 1 + f (x `quot` 2)
           | otherwise = 1 + f (3 * x + 1)
This cuts down both the runtime and the memory usage, but also heavily uses the GC in order to maintain the cached values:
$> ./memoCollatz +RTS -s
memoCollatz +RTS -s
1,577,954,668 bytes allocated in the heap
1,056,591,780 bytes copied during GC
303,942,300 bytes maximum residency (12 sample(s))
341,468 bytes maximum slop
616 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3003 colls, 0 par 1.11s 1.19s 0.0004s 0.0010s
Gen 1 12 colls, 0 par 3.48s 3.65s 0.3043s 1.7065s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.55s ( 7.50s elapsed)
GC time 4.59s ( 4.84s elapsed)
EXIT time 0.00s ( 0.05s elapsed)
Total time 12.14s ( 12.39s elapsed)
%GC time 37.8% (39.1% elapsed)
Alloc rate 209,087,160 bytes per MUT second
Productivity 62.2% of total user, 60.9% of total elapsed
Make sure you use Integer instead of Int, because Int overflow (Int is only 32 bits wide on some platforms) will derail the recursion:
collatz :: Integer -> Integer
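A quick illustration of that hazard (a sketch; 113383 is a starting value below one million whose trajectory is known to climb past 2^31 - 1, so a 32-bit integer wraps around mid-sequence):

import Data.Int (Int32, Int64)

step :: Integral a => a -> a
step n | even n    = n `quot` 2
       | otherwise = 3 * n + 1

trajectory :: Integral a => a -> [a]
trajectory = takeWhile (/= 1) . iterate step

main :: IO ()
main = do
  print (maximum (trajectory (113383 :: Int64)))                  -- peak exceeds 2^31 - 1
  print (take 5 (dropWhile (> 0) (trajectory (113383 :: Int32)))) -- wraps negative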

Why does attoparsec use 100 times more memory than my input file?

I have a 2.5 MB file full of floats separated by spaces (the code below can generate it for you) and want to parse it into an array with attoparsec.
It is surprisingly slow, taking almost a second, and allocating a lot of memory:
time ./Attoparsec-problem +RTS -sstderr < floats.txt
299999.0
956,647,344 bytes allocated in the heap
752,875,520 bytes copied during GC
166,485,416 bytes maximum residency (7 sample(s))
8,874,168 bytes maximum slop
337 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1604 colls, 0 par 0.21s 0.27s 0.0002s 0.0092s
Gen 1 7 colls, 0 par 0.24s 0.36s 0.0520s 0.1783s
...
%GC time 65.5% (75.1% elapsed)
Alloc rate 3,985,781,488 bytes per MUT second
Productivity 34.5% of total user, 28.6% of total elapsed
My parser is incredibly simple: it is essentially double <* skipSpace. This is the code:
import Control.Applicative
import Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as U

-- Compile with:
--   ghc --make -O2 -prof -auto-all -caf-all -rtsopts -fforce-recomp Attoparsec-problem.hs
-- Run:
--   time ./Attoparsec-problem +RTS -sstderr < floats.txt

main :: IO ()
main = do
  -- For creating the test file (2.5 MB):
  -- writeFile "floats.txt" (Prelude.unwords $ Prelude.map show [1.0,2.0..300000.0])
  r <- parse parser <$> BS.getContents
  case r of
    Done _ arr -> print $ U.last arr
    x          -> print x
  where
    parser = U.replicateM (300000-1) (double <* skipSpace)
    -- This gives surprisingly bad productivity (70% GC time) and 180 MB max residency
    -- for a 2.5 MB file!
Can you explain to me what is going on?

Why does the strictness flag make memory usage increase?

The following two programs differ only by the strictness flag on the variable st:
$ cat testStrictL.hs
module Main (main) where

import qualified Data.Vector as V
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Mutable as MV

len = 5000000

testL = do
  mv <- MV.new len
  let go i = do
        if i >= len then return () else
          do let st = show (i+10000000) -- no strictness flag
             MV.write mv i st
             go (i+1)
  go 0
  v <- GV.unsafeFreeze mv :: IO (V.Vector String)
  return v

main = do
  v <- testL
  print (V.length v)
  mapM_ print $ V.toList $ V.slice 4000000 5 v
$ cat testStrictS.hs
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.Vector as V
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Mutable as MV

len = 5000000

testS = do
  mv <- MV.new len
  let go i = do
        if i >= len then return () else
          do let !st = show (i+10000000) -- this has the strictness flag
             MV.write mv i st
             go (i+1)
  go 0
  v <- GV.unsafeFreeze mv :: IO (V.Vector String)
  return v

main = do
  v <- testS
  print (V.length v)
  mapM_ print $ V.toList $ V.slice 4000000 5 v
Compiling and running these two programs on Ubuntu 10.10 with GHC 7.0.3, I get the following results:
$ ghc --make testStrictL.hs -O3 -rtsopts
[2 of 2] Compiling Main ( testStrictL.hs, testStrictL.o )
Linking testStrictL ...
$ ghc --make testStrictS.hs -O3 -rtsopts
[2 of 2] Compiling Main ( testStrictS.hs, testStrictS.o )
Linking testStrictS ...
$ ./testStrictS +RTS -sstderr
./testStrictS +RTS -sstderr
5000000
"14000000"
"14000001"
"14000002"
"14000003"
"14000004"
824,145,164 bytes allocated in the heap
1,531,590,312 bytes copied during GC
349,989,148 bytes maximum residency (6 sample(s))
1,464,492 bytes maximum slop
656 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 1526 collections, 0 parallel, 5.96s, 6.04s elapsed
Generation 1: 6 collections, 0 parallel, 2.79s, 4.36s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.77s ( 2.64s elapsed)
GC time 8.76s ( 10.40s elapsed)
EXIT time 0.00s ( 0.13s elapsed)
Total time 10.52s ( 13.04s elapsed)
%GC time 83.2% (79.8% elapsed)
Alloc rate 466,113,027 bytes per MUT second
Productivity 16.8% of total user, 13.6% of total elapsed
$ ./testStrictL +RTS -sstderr
./testStrictL +RTS -sstderr
5000000
"14000000"
"14000001"
"14000002"
"14000003"
"14000004"
81,091,372 bytes allocated in the heap
143,799,376 bytes copied during GC
44,653,636 bytes maximum residency (3 sample(s))
1,005,516 bytes maximum slop
79 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 112 collections, 0 parallel, 0.54s, 0.59s elapsed
Generation 1: 3 collections, 0 parallel, 0.41s, 0.45s elapsed
INIT time 0.00s ( 0.03s elapsed)
MUT time 0.12s ( 0.18s elapsed)
GC time 0.95s ( 1.04s elapsed)
EXIT time 0.00s ( 0.06s elapsed)
Total time 1.06s ( 1.24s elapsed)
%GC time 89.1% (83.3% elapsed)
Alloc rate 699,015,343 bytes per MUT second
Productivity 10.9% of total user, 9.3% of total elapsed
Could someone please explain why the strictness flag seems to cause the program to use so much
memory? This simple example came about from my attempts to understand why my programs
use so much memory when reading large files of 5 million lines and creating vectors
of records.
The problem here is mainly that you're using the String type (an alias for [Char]), which, due to its representation as a non-strict list of single Chars, requires about 5 words per character on the memory heap (see also this blog article for some memory footprint comparisons).
In the lazy case, you basically store an unevaluated thunk pointing to the (shared) evaluation function show . (+10000000) plus a varying integer, whereas in the strict case the complete strings of 8 characters seem to be materialized (usually the bang pattern would only force the outermost list constructor :, i.e. the first character of a lazy String, to be evaluated), which requires far more heap space the longer the strings become.
Storing 5000000 String-typed strings of length 8 thus requires 5000000*8*5 = 200000000 words, which on 32-bit corresponds to about ~763 MiB. If the Char digits are shared, you only need 3/5 of that, i.e. ~458 MiB, which seems to match your observed memory overhead.
If you replace the String with something more compact, such as a Data.ByteString.ByteString, you'll notice that the memory overhead is about an order of magnitude lower compared to a plain String.
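A minimal sketch of that swap, reusing the structure of testStrictS (only the element type changes; B.pack . show is kept for simplicity):

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.Vector as V
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Mutable as MV

len = 5000000

-- Same loop as testStrictS, but each element is a compact strict
-- ByteString instead of a lazy [Char].
testB = do
  mv <- MV.new len
  let go i = do
        if i >= len then return () else
          do let !st = B.pack (show (i+10000000))
             MV.write mv i st
             go (i+1)
  go 0
  v <- GV.unsafeFreeze mv :: IO (V.Vector B.ByteString)
  return v

main = do
  v <- testB
  print (V.length v)
  mapM_ print $ V.toList $ V.slice 4000000 5 v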
