Although I have a good LFSR C implementation, I thought I'd try the same in Haskell, just to see how it goes. What I came up with so far is two orders of magnitude slower than the C implementation, which raises the question: how can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.
Here's the baseline Haskell code using lists and Data.Bits:
import Control.Monad (when)
import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.))
import System.Environment (getArgs)
import System.Exit (exitFailure, exitSuccess)
tap :: [[Int]]
tap = [
[], [], [], [3, 2],
[4, 3], [5, 3], [6, 5], [7, 6],
[8, 6, 5, 4], [9, 5], [10, 7], [11, 9],
[12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
[16,15,13,4], [17, 14], [18, 11], [19, 6, 2, 1],
[20, 17], [21, 19], [22, 21], [23, 18],
[24,23,22,17], [25, 22], [26, 6, 2, 1], [27, 5, 2, 1],
[28, 25], [29, 27], [30, 6, 4, 1], [31, 28],
[32,22,2,1], [33,20], [34,27,2,1], [35,33],
[36,25], [37,5,4,3,2,1],[38,6,5,1], [39,35],
[40,38,21,19], [41,38], [42,41,20,19], [43,42,38,37],
[44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
[48,47,21,20], [49,40], [50,49,24,23], [51,50,36,35],
[52,49], [53,52,38,37], [54,53,18,17], [55,31],
[56,55,35,34], [57,50], [58,39], [59,58,38,37],
[60,59], [61,60,46,45], [62,61,6,5], [63,62] ]
xor' :: [Bool] -> Bool
xor' = foldr xor False
mask :: (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1
advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask len
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
main :: IO ()
main = do
args <- getArgs
when (null args) $ fail "Usage: lsfr <number-of-bits>"
let len = read $ head args
when (len < 8) $ fail "No need for LFSR"
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
if out == 0 then do
putStr "OK\n"
exitSuccess
else do
putStr "FAIL\n"
exitFailure
Basically it tests whether the LFSR defined in tap :: [[Int]] for any given bit-length is of maximum length. (More precisely, it just checks whether the LFSR reaches the initial state (zero) after 2^n iterations.)
According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.
What I've tried so far:
use Data.Array: Attempt abandoned because there's no foldl/r
use Data.Vector: Slightly faster than the baseline
The compiler options I use are: -O2, LTS Haskell 8.12 (GHC-8.0.2).
The reference C++ program can be found on gist.github.com.
The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much, there must be a better way to do the bit-fiddling.
Update: Results of applying the optimisations suggested in the answers
The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
The baseline Haskell code runs about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
With the rewrite of @Thomas M. DuBuisson, still using the default GHC backend, the execution time goes down to 5.2s
With the rewrite of @Thomas M. DuBuisson, now using the LLVM backend (GHC option -O2 -fllvm), the execution time goes down to 1.7s
Using GHC option -O2 -fllvm -optlc -mcpu=native brings this to 0.73s
Replacing iterate with iterate' of @cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.
So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
Note
The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew and then symlinking opt and llc. See 7.10. GHC Backends.
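For instance, something like the following (a sketch; the formula name and install prefix are assumptions on my part, so check brew --prefix for the actual location):

# assumed formula name; adjust if your brew names it differently
brew install llvm@3.7
ln -s "$(brew --prefix llvm@3.7)/bin/opt" /usr/local/bin/opt
ln -s "$(brew --prefix llvm@3.7)/bin/llc" /usr/local/bin/llc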
Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Let's dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. The first was the use of shift in high-performance code. shift supports both left and right shifts, and to do so it requires a branch. Let's kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):
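Applied to advance and mask in the question's file, the rewrite looks roughly like this (my reconstruction; the actual noShift.hs isn't shown):

mask :: (Num a, Bits a) => Int -> a
mask len = 2 ^ len - 1                  -- was: shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
  | d0 = shifted
  | otherwise = shifted .|. 1
  where
    shifted = (2 * lfsr) .&. mask len   -- was: shift lfsr 1 .&. mask len
    d0 = xor' $ map (testBit lfsr) tap'
    tap' = map (subtract 1) tap

The shift 1 len in main becomes 2^len the same way. With that change, the timing becomes: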
% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well, you have this awkward list of bit locations you're tapping that just seems like it's begging for inefficiency and/or leaning on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lfsr 24 `xor` testBit lfsr 21), but we want a more general fast solution.
I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.
This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:
let tp = sum $ map ((2^) . subtract 1) (tap !! len)
pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
mask = 2^len - 1
advance' :: Int -> Int
advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr
out :: Int
out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
OK
./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings, but this is often an educational exercise, so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beat our mark by a fair bit.
I suspect that the following line is building a large list-like thunk in memory before evaluating it.
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
Let's find out if I'm right, and if I am, we'll fix it. The first debugging step is to get an idea of the memory used by the program. To do this we're going to compile with the option -rtsopts in addition to -O2. This enables running the program with RTS options, including +RTS -s, which outputs a small memory summary.
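Concretely, assuming the source file is named lfsr.hs:

ghc -O2 -rtsopts lfsr.hs
./lfsr 25 +RTS -s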
Initial Performance
Running your program as lfsr 25 +RTS -s I get the following output
OK
5,420,148,768 bytes allocated in the heap
6,705,977,216 bytes copied during GC
1,567,511,384 bytes maximum residency (20 sample(s))
357,862,432 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.453s 2.522s 0.0002s 0.0009s
Gen 1 20 colls, 0 par 2.281s 3.065s 0.1533s 0.7128s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.438s ( 1.162s elapsed)
GC time 4.734s ( 5.587s elapsed)
EXIT time 0.016s ( 0.218s elapsed)
Total time 6.188s ( 6.967s elapsed)
%GC time 76.5% (80.2% elapsed)
Alloc rate 3,770,538,273 bytes per MUT second
Productivity 23.5% of total user, 19.8% of total elapsed
That's a lot of memory used at once. It's very likely there's a big thunk building up in there somewhere.
Trying to reduce the thunk size
I hypothesized that the thunk is being built in iterate (advance ...). If this is the case, we can try to reduce the thunk size by making advance more strict in its lfsr argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine is evaluated.
BangPatterns is an easy way to make a function strict in an argument. f !x = .. is shorthand for f x = seq x $ ...
{-# LANGUAGE BangPatterns #-}
advance :: Int -> [Int] -> Int -> Int
advance len tap = go
where
go !lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask len
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
Let's see what difference this makes ...
>lfsr 25 +RTS -s
OK
5,420,149,072 bytes allocated in the heap
6,705,979,368 bytes copied during GC
1,567,511,448 bytes maximum residency (20 sample(s))
357,862,448 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.688s 2.711s 0.0003s 0.0059s
Gen 1 20 colls, 0 par 2.438s 3.252s 0.1626s 0.8013s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.328s ( 1.146s elapsed)
GC time 5.125s ( 5.963s elapsed)
EXIT time 0.000s ( 0.226s elapsed)
Total time 6.484s ( 7.335s elapsed)
%GC time 79.0% (81.3% elapsed)
Alloc rate 4,081,053,418 bytes per MUT second
Productivity 21.0% of total user, 18.7% of total elapsed
None that's noticeable.
Eliminating the Spine
I guess it's the spine of that iterate (advance ...) that's being built. After all, for the command I'm running, the list would be 2^25 (a little over 33 million) items long. The list itself is probably being removed by list fusion, but the thunk for the last item of the list is over 33 million applications of advance.
To solve this problem we need a strict version of iterate so that the value is forced to an Int before applying the advance function again. This should keep the memory down to only a single lfsr value at a time, along with the currently computed application of advance.
Unfortunately, there isn't a strict iterate in Data.List. Here's one that doesn't give up on the list fusion that's providing other important (I think) performance optimizations to this problem.
{-# LANGUAGE BangPatterns #-}
import GHC.Base (build)
{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
where go !x = x : go (f x)
{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
where go !x = x `c` go (f x)
{-# RULES
"iterate'" [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'" [1] iterateFB' (:) = iterate'
#-}
This is just iterate from GHC.List (along with all its rewrite rules), but made strict in the accumulated argument.
Equipped with a strict iterate, iterate', we can change the troublesome line to
let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0
I expect that this will perform much better. Let's see ...
>lfsr 25 +RTS -s
OK
3,758,156,184 bytes allocated in the heap
297,976 bytes copied during GC
43,800 bytes maximum residency (1 sample(s))
21,736 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7281 colls, 0 par 0.047s 0.008s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.750s ( 0.783s elapsed)
GC time 0.047s ( 0.008s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.797s ( 0.792s elapsed)
%GC time 5.9% (1.0% elapsed)
Alloc rate 5,010,874,912 bytes per MUT second
Productivity 94.1% of total user, 99.0% of total elapsed
This used 0.00002 times as much memory and ran 10 times as fast.
I don't know if this will improve on Thomas DuBuisson's answer that improves advance but still leaves a lazy iterate advance' in place. It would be easy to check: add the iterate' code to that answer and use iterate' in place of iterate in that answer.
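That combination would look something like this (a sketch, reusing Thomas's advance' and the iterate' defined above):

out :: Int
out = last $ take (2^len) $ iterate' advance' 0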
Does the compiler lift tap !! len out of the loop? I suspect it does, but moving it out to guarantee this can't hurt:
let tap1 = tap !! len
let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0
In the comments you say "2^len is needed exactly once", but this is wrong. You do it each time in advance. So you could try
advance len tap mask lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
-- in main
let tap1 = tap !! len
let numIterations = 2^len
let mask = numIterations - 1
let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)
(the compiler can't optimize last $ take ... to !! in general, because they are different for finite lists, but iterate always returns an infinite one.)
You compared foldr with foldl, but foldl is almost never what you need; since xor always needs both arguments and is associative, foldl' is very likely to be the right choice (the compiler can optimize it, but if there is any real difference between foldl and foldr and not just random variation, it might have failed in this case).
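For reference, the strict-fold variant of xor' is a one-line change (a minimal sketch):

import Data.Bits (xor)
import Data.List (foldl')

-- the strict left fold forces the Bool accumulator at every step,
-- so no chain of pending xor applications can build up
xor' :: [Bool] -> Bool
xor' = foldl' xor False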
Related
I am trying to practice Haskell by solving some of the tasks on Project Euler. In Problem 3, we have to find the biggest prime factor of the number 600851475143, which I had done before in Java a few years back.
I came up with the following:
primes :: [Int]
primes = sieve [2..]
where sieve (p:xs) = p : sieve (filter (\x -> x `rem` p /= 0) xs)
biggestPrimeFactor :: Int -> Int
biggestPrimeFactor 1 = 0
biggestPrimeFactor x =
if x `elem` takeWhile (< x + 1) primes
then x
else last (filter (\y -> x `rem` y == 0) (takeWhile (< x `div` 2) primes))
which works great for smaller numbers, but is terribly inefficient and as a result doesn't work well on the number I have been given.
This seems obvious, because the program iterates over all primes smaller than the number divided by 2 (if it isn't prime itself), but I am unsure what to do about it. Ideally I would be able to further restrict the possible checks, but I don't know how to accomplish this.
Note that I am not looking for an "optimal solution", but rather one that is at least moderately efficient for bigger numbers, and simple to understand and implement, as I am still a beginner in Haskell.
You have two main sources of slowness here. The easier one to address is the boundary condition in biggestPrimeFactor. Checking up to p > x `div` 2 is asymptotically worse than checking up to p^2 > x. But even that is very suboptimal when a number has a lot of factors. The largest factor may be far smaller than sqrt x. If you continually reduce the target number as you find factors, you can account for this and speed up the processing of random inputs by quite a lot.
Here's an example of that, including Daniel Wagner's note from the comments:
-- Naive trial division against a list of primes. Doesn't do anything
-- intelligent when asked to factor a number less than 2.
factorsNaive :: [Integer] -> Integer -> [Integer]
factorsNaive primes@(p : ps) x
| p * p > x = [x]
| otherwise = case x `quotRem` p of
(q, 0) -> p : factorsNaive primes q
_ -> factorsNaive ps x
A few notes:
I decided to have the primes list passed in. This is relevant in the next section, but it also allowed me to write this without a helper.
I specialized to Integer instead of Int because I wanted to throw big numbers at it without caring what maxBound :: Int is. This is slower, but I decided to default to correctness first.
I removed a traversal of the input list. Doing it in one pass is a bit more efficient, but mostly it's cleaner.
Strictly speaking, this is correct even if the input list contains non-primes, so long as the list starts at 2, is monotonically non-decreasing, and eventually contains every prime.
Note that when it recurses, it either discards a prime or produces one. It never will do both at the same time. This is an easy way to ensure it doesn't miss repeated factors.
I named this factorsNaive just to make it clear that it's not doing anything clever with number theory. There are very many things that could be done which are far more complex than this, but this is a good stopping point for understandable factoring of relatively small numbers...
Or at least it is okay at factoring as long as you have a convenient list of prime numbers. It turns out this is the second major cause of slowdown in your code. Your list of prime numbers is slow to generate as it gets longer.
Your definition of primes essentially stacks a bunch of filters on an input list. Every prime produced must go through a filter test for each previous prime. This might sound familiar - it's at least O(n^2) work to generate the first n primes. (It's actually more because division gets more costly as numbers get bigger, but let's ignore that for now.) It's a known (to mathematicians, I had to look it up to be sure) result that the number of primes less than or equal to n approaches n/ln n as n gets large. That approaches linear as n gets large, so generating the list of primes up to n approaches O(n^2) as n gets big.
(Yes, that argument is a mess. A formal version of it is presented in Melissa O'Neill's paper "The Genuine Sieve of Eratosthenes". Refer to it for much more rigorous argumentation of the result.)
It's possible to write much more efficient definitions of primes that have both better constant factors and better asymptotics. As that's the entire point of the paper mentioned in the parenthetical above, I won't go into the details too far. I'll just point out the very first possible optimization:
-- trial division. let's work in Integer for predictable correctness
-- on positive numbers
trialPrimes :: [Integer]
trialPrimes = 2 : sieve [3, 5 ..]
where
sieve (p : ps) = p : sieve (filter (\x -> x `rem` p /= 0) ps)
This does less than you might think. It doesn't double the speed, as the performance improvement is eventually outweighed by the filter stack mentioned before. This version only removes one filter from that stack, but at least it's the filter that rejects the most inputs in the initial version.
In ghci (no compilation or optimizations, and those can really make a difference), this was fast enough to factor the product of two five-digit primes in a few seconds.
ghci> :set +s
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(5.98 secs, 4,103,321,840 bytes)
Numbers with several small factors are handled much faster. Also notice that because the list of primes is a top-level binding, calculations are cached. Running the computation again has the list of primes pre-computed now.
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(0.01 secs, 6,934,688 bytes)
That also shows that the run time is absolutely dominated by generating the list of primes. The naive factorization is almost instant at that scale when the list of primes is already in memory.
But you shouldn't really trust performance of interpreted code.
main :: IO ()
main = print (factorsNaive trialPrimes $ 84761 * 60821)
gives
carl@DESKTOP:~/hask/2023$ ghc -O2 -rtsopts factor.hs
[1 of 2] Compiling Main ( factor.hs, factor.o )
[2 of 2] Linking factor
carl@DESKTOP:~/hask/2023$ ./factor +RTS -s
[60821,84761]
1,884,787,896 bytes allocated in the heap
32,303,080 bytes copied during GC
89,072 bytes maximum residency (2 sample(s))
29,400 bytes maximum slop
7 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 326 colls, 0 par 0.021s 0.021s 0.0001s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0004s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.523s ( 0.522s elapsed)
GC time 0.021s ( 0.022s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 0.545s ( 0.550s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,603,678,988 bytes per MUT second
Productivity 96.0% of total user, 94.8% of total elapsed
That dropped the run time from six seconds to a half-second. (Yeah, +RTS -s is pretty verbose for this, but it's quick and easy.) I think this is a reasonable place to stop with beginner-level code.
If you want to look into more efficient prime generation, the primes package on hackage contains an implementation of the algorithm in O'Neill's paper and an implementation of naive factoring that's equivalent to the one here.
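If you go down that road, usage would look something like this (a sketch assuming the package's Data.Numbers.Primes module; check its documentation for the exact API):

import Data.Numbers.Primes (primeFactors)

-- primeFactors yields the prime factors in non-decreasing order,
-- so the largest one is simply the last
main :: IO ()
main = print (last (primeFactors (600851475143 :: Integer)))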
I needed an algorithm to solve a knapsack problem (KP) some time ago, in Haskell.
Here is what my code looks like:
stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
where bestOption a = max (a+v)
kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl stepKP [0 | i <- [0..pMax]] l
main = print $ kp (zip weights values) 20000
where weights = [0..2000]
values = reverse [8000..10000]
But when I try to execute it (after compilation with ghc, no flags), it seems pretty bad:
Here is the result of the command ./kp +RTS -s:
1980100
9,461,474,416 bytes allocated in the heap
6,103,730,184 bytes copied during GC
1,190,494,880 bytes maximum residency (18 sample(s))
5,098,848 bytes maximum slop
2624 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6473 colls, 0 par 2.173s 2.176s 0.0003s 0.0010s
Gen 1 18 colls, 0 par 4.185s 4.188s 0.2327s 1.4993s
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.320s ( 3.322s elapsed)
GC time 6.358s ( 6.365s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 9.679s ( 9.687s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,849,443,762 bytes per MUT second
Productivity 34.3% of total user, 34.3% of total elapsed
I think that my program takes O(n*w) memory, while it could do it in O(w)
(w is the total capacity).
Is that a problem of lazy evaluation taking too much space, or something else ?
How could this code be more memory and time efficient ?
We can think of a left fold as performing iterations while keeping an accumulator that is returned at the end.
When there are lots of iterations, one concern is that the accumulator might grow too large in memory. And because Haskell is lazy, this can happen even when the accumulator is of a primitive type like Int: behind some seemingly innocent Int value a large number of pending operations might lurk, in the form of thunks.
Here the strict left fold function foldl' is useful because it ensures that, as the left fold is being evaluated, the accumulator will always be kept in weak head normal form (WHNF).
Alas, sometimes this isn't enough. WHNF only says that evaluation has progressed up to the "outermost constructor" of the value. This is enough for Int, but for recursive types like lists or trees, that isn't saying much: the thunks might simply lurk further down the list, or in branches below.
This is the case here, where the accumulator is a list that is recreated at each iteration. Each iteration, the foldl' only evaluates the list up to _ : _. Unevaluated max and zipWith operations start to pile up.
What we need is a way to trigger a full evaluation of the accumulator list at each iteration, one which cleans any max and zipWith thunks from memory. And this is what force accomplishes. When force $ something is evaluated to WHNF, something is fully evaluated to normal form, that is, not only up to the outermost constructor but "deeply".
Notice that we still need the foldl' in order to "trigger" the force at each iteration.
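Putting that together with the code from the question, a minimal sketch (force comes from Control.DeepSeq in the deepseq package):

import Control.DeepSeq (force)
import Data.List (foldl')

stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
  where bestOption a = max (a + v)

-- force brings the whole row to normal form on every iteration,
-- so no pending max/zipWith thunks survive between steps
kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl' (\acc iw -> force (stepKP acc iw)) (replicate (pMax + 1) 0) l

main :: IO ()
main = print $ kp (zip [0..2000] (reverse [8000..10000])) 20000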
I am new to Haskell.
While studying foldr, I saw many suggesting to use it and to avoid explicit recursion, which can lead to memory-inefficient code.
https://www.reddit.com/r/haskell/comments/1nb80j/proper_use_of_recursion_in_haskell/
As I ran the sample mentioned in the above link, I could see the explicit recursion doing better in terms of memory. At first I thought maybe running in GHCi is not a fair benchmark, so I tried compiling it using stack ghc. (By the way, how can I pass compiler optimization flags via stack ghc?) What am I missing about the advice to avoid explicit recursion?
find p = foldr go Nothing
where go x rest = if p x then Just x else rest
findRec :: (a -> Bool) -> [a] -> Maybe a
findRec _ [] = Nothing
findRec p (x:xs) = if p x then Just x else (findRec p xs)
main :: IO ()
main = print $ find (\x -> x `mod` 2 == 0) [1, 3..1000000]
-- or, for the explicit-recursion version:
-- main = print $ findRec (\x -> x `mod` 2 == 0) [1, 3..1000000]
-- find
Nothing
92,081,224 bytes allocated in the heap
9,392 bytes copied during GC
58,848 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 87 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0008s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.043s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.044s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,946,599,168 bytes per MUT second
Productivity 100.0% of total user, 96.8% of total elapsed
-- findRec
Nothing
76,048,432 bytes allocated in the heap
13,768 bytes copied during GC
42,928 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 71 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0007s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.038s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.039s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,433,549,824 bytes per MUT second
Productivity 100.0% of total user, 96.6% of total elapsed
You are measuring how quickly GHC can do half a million modulus operations. As you might expect, "in the blink of an eye" is the answer regardless of how you iterate. There is no obvious difference in speed.
You claim that you can see that explicit recursion is using less memory, but the heap profiling data you provide shows the opposite: more allocation and higher max residency when using explicit recursion. I don't think the difference is significant, but if it were then your evidence would be contradicting your claim.
As to the question of why to avoid explicit recursion, it's not really clear what part of that thread you read that made you come to your conclusion. You linked to a giant thread which itself links to another giant thread, with many competing opinions. The comment that stands out the most to me is it's not about efficiency, it's about levels of abstraction. You are looking at this the wrong way by trying to measure its performance.
First, don't try to understand the performance of GHC-compiled code using anything other than optimized compilation:
$ stack ghc -- -O2 Find.hs
$ ./Find +RTS -s
With the -O2 flag (and GHC version 8.6.4), your find performs as follows:
16,051,544 bytes allocated in the heap
14,184 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
However, this is very misleading. None of this memory usage is due to the looping performed by foldr. Rather it's all due to the use of boxed Integers. If you switch to using plain Ints which the compiler can unbox:
main = print $ find (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
^^^^^
the memory performance changes drastically and demonstrates the true memory cost of foldr:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
If you test findRec with Ints like so:
main = print $ findRec (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
you'll see much worse memory performance:
40,051,528 bytes allocated in the heap
14,992 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
which seems to make a compelling case that recursion should be avoided in preference to foldr, but this, too, is very misleading. What you are seeing here is not the memory cost of recursion, but rather the memory cost of "list building".
See, foldr and the expression [1::Int, 3..1000000] both include some magic called "list fusion". This means that when they are used together (i.e., when foldr is applied to [1::Int, 3..1000000]), an optimization can be performed to completely eliminate the creation of a Haskell list. Critically, the foldr code, even using list fusion, compiles to recursive code which looks like this:
main_go
= \ x ->
case gtInteger# x lim of {
__DEFAULT ->
case eqInteger# (modInteger x lvl) lvl1 of {
__DEFAULT -> main_go (plusInteger x lvl);
-- ^^^^^^^ - SEE? IT'S JUST RECURSION
1# -> Just x
};
1# -> Nothing
}
end Rec }
So, it's list fusion, rather than "avoiding recursion" that makes find faster than findRec.
You can see this is true by considering the performance of:
find1 :: Int -> Maybe Int
find1 n | n >= 1000000 = Nothing
| n `mod` 2 == 0 = Just n
| otherwise = find1 (n+2)
main :: IO ()
main = print $ find1 1
Even though this uses recursion, it doesn't generate a list (or use boxed Integers), so it runs just like the foldr version:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
So, what are the take home lessons?
Always benchmark Haskell code using ghc -O2, never GHCi or ghc without optimization flags.
Less than 10% of people in any Reddit thread know what they're talking about.
foldr can sometimes perform better than explicit recursion when special optimizations like list fusion can apply.
But in the general case, explicit recursion performs just as well as foldr or other specialized constructs.
Also, optimizing Haskell code is hard.
Actually, here's a better (more serious) take-home lesson. Especially when you're getting started with Haskell, make every possible effort to avoid thinking about "optimizing" your code. Far more than any other language I know, there is an enormous gulf between the code you write and the code the compiler generates, so don't even try to figure it out right now. Instead, write code that is clear, straightforward, and idiomatic. If you try to learn the "rules" for high-performance code now, you'll get them all wrong and learn really bad programming style into the bargain.
I'm trying to solve AdventOfCode 2018 day 14. The task is roughly to create a number with a lot of digits by iteratively appending one or two digits based on two already existing digits. With Haskell I thought Integer might be a good fit for representing the huge number. I think my program is correct, at least it seems to work for the samples AoC provides. However I noticed that the performance of the program drops drastically when the number contains more than 10^4 digits (recipeCount in the appended program). I observed the following execution times when increasing the number up to the following number of digits:
10000 digits: 0.314s
20000 digits: 1.596s
30000 digits: 4.306s
40000 digits: 8.954s
Looks like O(n^2) or worse, doesn't it?
Why is that? The program only does basic calculations as far as I can tell.
import Data.Bool (bool)
main :: IO ()
main = print solve
recipeCount :: Int
recipeCount = 10000
solve :: Integer
solve = loop 0 1 37 2
where
loop recipeA recipeB recipes recipesLength
| recipesLength >= recipeCount + 10 = recipes `rem` (10 ^ 10)
| otherwise =
let recipeAScore = digitAt (recipesLength - 1 - recipeA) recipes
recipeBScore = digitAt (recipesLength - 1 - recipeB) recipes
recipeSum = fromIntegral $ recipeAScore + recipeBScore
recipeSumDigitCount = bool 2 1 $ recipeSum < 10
recipes' = recipes * (10 ^ recipeSumDigitCount) + recipeSum
recipesLength' = recipesLength + recipeSumDigitCount
recipeA' = (recipeA + recipeAScore + 1) `rem` recipesLength'
recipeB' = (recipeB + recipeBScore + 1) `rem` recipesLength'
in loop recipeA' recipeB' recipes' recipesLength'
digitAt :: Int -> Integer -> Int
digitAt i number = fromIntegral $ number `quot` (10 ^ i) `rem` 10
P.S.: Because I'm very new to Haskell I also kindly appreciate feedback on the program itself (style, algorithm, etc.).
EDIT:
I found options to profile both versions of my program.
Both versions are compiled with ghc -O2 -rtsopts ./Program.hs and run with ./Program +RTS -sstderr.
The first version with integers produces the following output when generating 50,000 recipes:
2,435,108,280 bytes allocated in the heap
886,656 bytes copied during GC
44,672 bytes maximum residency (2 sample(s))
29,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1925 colls, 0 par 0.018s 0.017s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 15.208s ( 15.225s elapsed)
GC time 0.018s ( 0.017s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 15.227s ( 15.242s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 160,115,875 bytes per MUT second
Productivity 99.9% of total user, 99.9% of total elapsed
The second version with mutable arrays produces the following output when generating ~500,000 recipes:
93,437,744 bytes allocated in the heap
16,120 bytes copied during GC
538,408 bytes maximum residency (2 sample(s))
29,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 88 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.021s ( 0.020s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.021s ( 0.021s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,552,375,284 bytes per MUT second
Productivity 97.0% of total user, 97.2% of total elapsed
I think using Integer for your recipes list in the first place is a big red flag. Integers store numbers, but your problem does not call for a number. It calls for a list of digits. An Integer, whose first priority is to be a number, is basically "compressed": it's in binary, not decimal, and trying to extract a decimal digit from it means you have to do funky, nontrivial math, as others have said. Also, purity works against you, because each time you add new digits to your list, you end up copying the whole list. With problem sizes on the order of 100,000-1,000,000 digits (I was given a problem input of about 800,000), that's copying Integers on the order of log_(2^8)(10^(10^5)) = ~41000 bytes in size each time. This part also seems quadratic.
I would recommend "decompressing" your list of digits. You can represent a single digit by 1 byte (which does waste a lot of space!)
import Data.Word
type Digit = Word8
addDigit :: Digit -> Digit -> (Digit, Digit)
addDigit = _yourJob
You can implement the meat of the logic as a function using arrays. Yes, Haskell does have arrays, in the sense of contiguous hunks of memory with practically O(1) indexing. It's just that we like to find "more functional" ways to phrase a problem than with arrays. But, they're always there if you need them.
import Data.Array.Unboxed -- from the array package, which is a core library
makeRecipes ::
-- | Elf 1's starting score
Digit ->
-- | Elf 2's starting score
Digit ->
-- | Number of recipes to make
Int ->
-- | Scores of the recipes made, indices running from 0 upwards
UArray Int Digit
The cool thing about arrays is that you can mutate them inside the ST monad, while getting a pure result. Thus, this array does not suffer any copying, and the math involved for indexing it is minimal.
import Control.Monad.ST
import Data.Array.ST
makeRecipes elf1 elf2 need = runSTUArray $ do
arr <- newArray_ (0, need)
writeArray arr 0 elf1
writeArray arr 1 elf2
loop arr 0 1 2
return arr
where
loop :: STUArray s Int Digit -> Int -> Int -> Int -> ST s ()
loop arr loc1 loc2 done = _yourJob
loop is given the array, which is partially filled with done recipe scores, and the locations of the two elves, loc1, loc2 < done. It should calculate the new recipes' scores with addDigit and readArray and add them to the array at the correct location with writeArray. If the array is full, it should terminate (it doesn't return anything useful). Otherwise, it should go on to figure out the new locations of the elves, and then recurse.
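For what it's worth, here is one way the blank might be filled in. This is my sketch, not the answer's withheld code, and it computes the digit sum directly rather than going through addDigit:

-- my sketch: goes in makeRecipes's where-clause in place of _yourJob;
-- `need` is the upper bound passed to newArray_
loop arr loc1 loc2 done
  | done > need = return ()              -- array is full, stop
  | otherwise = do
      s1 <- readArray arr loc1
      s2 <- readArray arr loc2
      let total = s1 + s2
      done' <-
        if total >= 10
          then do                        -- two new digits: a 1, then total - 10
            writeArray arr done 1
            if done + 1 <= need
              then do writeArray arr (done + 1) (total - 10)
                      return (done + 2)
              else return (done + 1)
          else do                        -- one new digit
            writeArray arr done total
            return (done + 1)
      -- each elf moves forward (1 + own score) recipes, wrapping around
      let recipes = min done' (need + 1) -- number of slots actually written
          step loc s = (loc + fromIntegral s + 1) `rem` recipes
      loop arr (step loc1 s1) (step loc2 s2) done'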
You can then write a little adapter on top of makeRecipes to actually extract the last ten recipes, supply the correct inputs, etc. When I filled in all the blanks in the program, I got a runtime of .07s on my input (about 800,000) with -O2, and about 0.8s with -O0. It seems to take O(n) time in the input.
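For example, a hypothetical adapter (lastTen is my own naming; the puzzle's two starting scores are 3 and 7):

import Data.Array.Unboxed ((!))

-- the ten scores that follow the first n recipes, printed as digits
lastTen :: Int -> String
lastTen n = concatMap show [recipes ! i | i <- [n .. n + 9]]
  where recipes = makeRecipes 3 7 (n + 10)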
(this is exciting!) I know, the subject matter is well known. The state of the art (in Haskell as well as other languages) for efficient generation of unbounded increasing sequence of Hamming numbers, without duplicates and without omissions, has long been the following (AFAIK - and by the way it is equivalent to the original Edsger Dijkstra's solution, too):
hamm :: [Integer]
hamm = 1 : map (2*) hamm `union` map (3*) hamm `union` map (5*) hamm
where
union a@(x:xs) b@(y:ys) = case compare x y of
LT -> x : union xs b
EQ -> x : union xs ys
GT -> y : union a ys
The question I'm asking is, can you find the way to make it more efficient in any significant measure? Is it still the state of the art or is it in fact possible to improve this to run twice faster?
If your answer is yes, please show the code and discuss its speed and empirical orders of growth in comparison to the above (it runs at about ~ n^1.05...1.10 for the first few hundreds of thousands of numbers produced). Also, if it exists, can this efficient algorithm be extended to producing a sequence of smooth numbers with any given set of primes?
(clarification: I'm not asking about the much faster direct generation of an nth Hamming number, but rather generating all first n numbers in the sequence.)
If a constant factor(1) speedup counts as significant, then I can offer a significantly more efficient version:
hamm :: [Integer]
hamm = mrg1 hamm3 (map (2*) hamm)
where
hamm5 = iterate (5*) 1
hamm3 = mrg1 hamm5 (map (3*) hamm3)
merge a@(x:xs) b@(y:ys)
| x < y = x : merge xs b
| otherwise = y : merge a ys
mrg1 (x:xs) ys = x : merge xs ys
You can easily generalise it to smooth numbers for a given set of primes:
import Data.List (foldl', sortBy)

hamm :: [Integer] -> [Integer]
hamm [] = [1]
hamm [p] = iterate (p*) 1
hamm ps = foldl' next (iterate (q*) 1) qs
where
(q:qs) = sortBy (flip compare) ps
next prev m = let res = mrg1 prev (map (m*) res) in res
merge a@(x:xs) b@(y:ys)
| x < y = x : merge xs b
| otherwise = y : merge a ys
mrg1 (x:xs) ys = x : merge xs ys
It's more efficient because that algorithm doesn't produce any duplicates and it uses less memory. In your version, when a Hamming number near h is produced, the part of the list between h/5 and h has to be in memory. In my version, only the part between h/2 and h of the full list, and the part between h/3 and h of the 3-5-list, needs to be in memory. Since the 3-5-list is much sparser, and the density of k-smooth numbers decreases, those two list parts need much less memory than the larger part of the full list.
Some timings for the two algorithms to produce the kth Hamming number, with empirical complexity of each target relative to the previous, excluding and including GC time:
k        Yours (MUT/GC)  order (excl/incl GC)  Mine (MUT/GC)  order (excl/incl GC)
10^5     0.03/0.01                             0.01/0.01      -- too short to say much, really
2*10^5   0.07/0.02                             0.02/0.01
5*10^5   0.17/0.06       0.968 1.024           0.06/0.04      1.199 1.314
10^6     0.36/0.13       1.082 1.091           0.11/0.10      0.874 1.070
2*10^6   0.77/0.27       1.097 1.086           0.21/0.21      0.933 1.000
5*10^6   1.96/0.71       1.020 1.029           0.55/0.59      1.051 1.090
10^7     4.05/1.45       1.047 1.043           1.14/1.25      1.052 1.068
2*10^7   8.73/2.99       1.108 1.091           2.31/2.65      1.019 1.053
5*10^7   21.53/7.83      0.985 1.002           6.01/7.05      1.044 1.057
10^8     45.83/16.79     1.090 1.093           12.42/15.26    1.047 1.084
As you can see, the factor between the MUT times is about 3.5, but the GC time is not much different.
(1) Well, it looks constant, and I think both variants have the same computational complexity, but I haven't pulled out pencil and paper to prove it, nor do I intend to.
So basically, now that Daniel Fischer gave his answer, I can say that I came across this recently, and I think this is an exciting development, since the classical code was known for ages, since Dijkstra.
Daniel correctly identified the redundancy of the classical version: it generates duplicates which must then be removed.
The credit for the original discovery (AFAIK) goes to Rosettacode.org's contributor Ledrug, as of 2012-08-26, and of course to the independent discovery by Daniel Fischer, here (2012-09-18).
Re-written slightly, that code is:
import Data.Function (fix)
hamm = 1 : foldr (\n s -> fix (merge s . (n:) . map (n*))) [] [2,3,5]
with the usual implementation of merge,
merge a@(x:xs) b@(y:ys) | x < y = x : merge xs b
| otherwise = y : merge a ys
merge [] b = b
merge a [] = a
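As a quick sanity check, the first dozen values it should produce are:

> take 12 hamm
[1,2,3,4,5,6,8,9,10,12,15,16]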
It gives about a 2.0x - 2.5x speedup vs. the classical version.
Well, this was easier than I thought. This will do 1000 Hammings in 0.05 seconds on my slow PC at home. This afternoon at work, on a faster PC, times for fewer than 600 were coming out as zero seconds.
This takes Hammings from Hammings. It's based on doing it fastest in Excel.
I was getting wrong numbers after 250000, with Int. The numbers grow very big very fast, so Integer must be used to be sure, because Int is bounded.
mkHamm :: [Integer] -> [Integer] -> [Integer] -> [Integer]
-> Int -> (Integer, [Int])
mkHamm ml (x:xs) (y:ys) (z:zs) n =
if n <= 1
then (last ml, map length [(x:xs), (y:ys), (z:zs)])
else mkHamm (ml++[m]) as bs cs (n-1)
where
m = minimum [x,y,z]
as = if x == m then xs ++ [m*2] else (x:xs) ++ [m*2]
bs = if y == m then ys ++ [m*3] else (y:ys) ++ [m*3]
cs = if z == m then zs ++ [m*5] else (z:zs) ++ [m*5]
Testing,
> mkHamm [1] [2] [3] [5] 5000
(50837316566580,[306,479,692]) -- (0.41 secs)
> mkHamm [1] [2] [3] [5] 10000
(288325195312500000,[488,767,1109]) -- (1.79 secs)
> logBase 2 (1.79/0.41) -- log of times ratio =
2.1262637726461726 -- empirical order of growth
> map (logBase 2) [488/306, 767/479, 1109/692] :: [Float]
[0.6733495, 0.6792009, 0.68041545] -- leftovers sizes ratios
This means that this code's run time's empirical order of growth is above quadratic (~n^2.13 as measured, interpreted, at GHCi prompt).
Also, the sizes of the three dangling overproduced segments of the sequence are each ~n^0.67 i.e. ~n^(2/3).
Additionally, this code is non-lazy: the resulting sequence's first element can only be accessed after the very last one is calculated.
The state of the art code in the question is linear, overproduces exactly 0 elements past the point of interest, and is properly lazy: it starts producing its numbers immediately.
So, though an immense improvement over the previous answers by this poster, it is still significantly worse than the original, let alone its improvement as appearing in the top two answers.
12.31.2018
Only the very best people educate. @Will Ness has also authored or co-authored 19 chapters in GoalKicker.com's "Haskell for Professionals". The free book is a treasure.
I had carried around the idea of a function that would do this. I was apprehensive because I thought it would be convoluted and involve logic like in some modern languages. I decided to start writing and was amazed how easy Haskell makes the realization of even bad ideas.
I've not had difficulty generating unique lists. My problem is the lists I generate do not end well. Even when I use diagonalization they leave residual values making their use unreliable at best.
Here is a reworked 3's and 5's list with nothing residual at the end. The diagonalization is to reduce residual values, not to eliminate duplicates, which are never included anyway.
g3s5s n=[t*b|(a,b)<-[ (((d+n)-(d*2)), 5^d) | d <- [0..n]],
t <-[ 3^e | e <- [0..a+8]],
(t*b) <= (3^(n+6))+a]
ham2 n = take n $ ham2' (drop 1.sort.g3s5s $ 48) [1]
ham2' o@(f:fo) e@(h:hx) = if h == min h f
then h:ham2' o (hx ++ [h*2])
else f:ham2' fo ( e ++ [f*2])
The twos list can be generated with all 2^e multiplied by each of the 3s5s; when the identity 2^0 is included then, in total, it is the Hammings.
3/25/2019
Well, finally. I knew this some time ago but could not implement it without excess values at the end. The problem was how to not generate the excess that is the result of a Cartesian Product. I use Excel a lot and could not see the pattern of values to exclude from the Cartesian Product worksheet. Then, eureka! The functions generate lists of each lead factor. The value to limit the values in each list is the end point of the first list. When this is done, all Hammings are produced with no excess.
Two functions for Hammings. The first is a new 3's & 5's list which is then used to create multiples with the 2's. The multiples are Hammings.
h35r x = h3s5s x (5^x)
h3s5s x c = [t| n<-[3^e|e<-[0..x]],
m<-[5^e|e<-[0..x]],
t<-[n*m],
t <= c ]
a2r n = sort $ a2s n (2^n)
a2s n c = [h| b<-h35r n,
a<-[2^e| e<-[0..n]],
h<-[a*b],
h <= c ]
last $ a2r 50
1125899906842624
(0.16 secs, 321,326,648 bytes)
2^50
1125899906842624
(0.00 secs, 95,424 bytes)
This is an alternate implementation - cleaner and faster, with less memory usage.
gnf n f = scanl (*) 1 $ replicate f n
mk35 n = (\c-> [m| t<- gnf 3 n, f<- gnf 5 n, m<- [t*f], m<= c]) (2^(n+1))
mkHams n = (\c-> sort [m| t<- mk35 n, f<- gnf 2 (n+1), m<- [t*f], m<= c]) (2^(n+1))
last $ mkHams 50
2251799813685248
(0.03 secs, 12,869,000 bytes)
2^51
2251799813685248
5/6/2019
Well, I tried limiting differently but always come back to what is simplest. I am opting for the least memory usage as also seeming to be the fastest.
I also opted to use map with an implicit parameter.
I also found that mergeAll from Data.List.Ordered is faster than sort, or sort and concat.
I also like when sublists are created so I can analyze the data much easier.
Then, because of @Will Ness, I switched to iterate instead of scanl, making much cleaner code. Also because of @Will Ness, I stopped using the last of the 2s list and switched to one value determining all lengths.
I do think recursively defined lists are more efficient, the previous number multiplied by a factor.
Just separating the function into two doesn't make a difference so the 3 and 5 multiples would be
import Data.List.Ordered (mergeAll)  -- data-ordlist package

m35 lim = mergeAll $
map (takeWhile (<=lim).iterate (*3)) $
takeWhile (<=lim).iterate (*5) $ 1
And the 2s each multiplied by the product of 3s and 5s
ham n = mergeAll $
map (takeWhile (<=lim).iterate (*2)) $ m35 lim
where lim= 2^n
After editing the function I ran it
last $ ham 50
1125899906842624
(0.00 secs, 7,029,728 bytes)
then
last $ ham 100
1267650600228229401496703205376
(0.03 secs, 64,395,928 bytes)
It is probably better to use 10^n but for comparison I again used 2^n
5/11/2019
Because I so prefer infinite and recursive lists I became a bit obsessed with making these infinite.
I was so impressed and inspired with @Daniel Wagner and his Data.Universe.Helpers that I started using +*+ and +++, but then added my own infinite list. I had to mergeAll my list to make it work, but then realized the infinite 3 and 5 multiples were exactly what they should be. So, I added the 2s and mergeAll'd everything, and they came out. Before, I stupidly thought mergeAll would not handle infinite lists, but it does most marvelously.
When a list is infinite in Haskell, Haskell calculates just what is needed, that is, it is lazy. The adjunct is that it does calculate from the start.
Now, since Haskell multiplies until the limit of what is wanted, no limit is needed in the function, that is, no more takeWhile. The speed-up is incredible and the memory is lowered too.
The following is on my slow home PC with 3GB of RAM.
tia = mergeAll.map (iterate (*2)) $
mergeAll.map (iterate (*3)) $ iterate (*5) 1
last $ take 10000 tia
288325195312500000
(0.02 secs, 5,861,656 bytes)
6.5.2019
I learned how to ghc -O2. So the following is for 50000 Hammings, up to 2.38E+30. And this is further proof my code is garbage.
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.000s ( 0.916s elapsed)
GC time 0.047s ( 0.041s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 0.047s ( 0.962s elapsed)
Alloc rate 0 bytes per MUT second
Productivity 0.0% of total user, 95.8% of total elapsed
6.13.2019
@Will Ness rawks. He provided a clean and elegant revision of tia above, and it proved to be five times as fast in GHCi. When I ghc -O2 +RTS -s his against mine, mine was several times as fast. There had to be a compromise.
So, I started reading about fusion that I had encountered in R. Bird's Thinking Functionally with Haskell and almost immediately tried this.
mai n = mergeAll.map (iterate (*n))
mai 2 $ mai 3 $ iterate (*5) 1
It matched Will's at 0.08 for 100K Hammings in GHCi, but what really surprised me (also for 100K Hammings) is this, and especially the elapsed times. 100K is up to 2.9e+38.
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.000s ( 0.002s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.000s ( 0.002s elapsed)
Alloc rate 0 bytes per MUT second
Productivity 100.0% of total user, 90.2% of total elapsed