New state of the art in unlimited generation of Hamming sequence - haskell

(this is exciting!) I know, the subject matter is well known. The state of the art (in Haskell as well as other languages) for efficient generation of unbounded increasing sequence of Hamming numbers, without duplicates and without omissions, has long been the following (AFAIK - and by the way it is equivalent to the original Edsger Dijkstra's solution, too):
hamm :: [Integer]
hamm = 1 : map (2*) hamm `union` map (3*) hamm `union` map (5*) hamm
  where
    union a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x : union xs b
        EQ -> x : union xs ys
        GT -> y : union a ys
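For reference, the expected start of the sequence, e.g. at the GHCi prompt:

> take 12 hamm
[1,2,3,4,5,6,8,9,10,12,15,16]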
The question I'm asking is, can you find a way to make it more efficient in any significant measure? Is it still the state of the art, or is it in fact possible to improve this to run twice as fast?
If your answer is yes, please show the code and discuss its speed and empirical orders of growth in comparison to the above (it runs at about ~ n^1.05…1.10 for the first few hundreds of thousands of numbers produced). Also, if it exists, can this efficient algorithm be extended to producing a sequence of smooth numbers with any given set of primes?
(clarification: I'm not asking about the much faster direct generation of an nth Hamming number, but rather generating all first n numbers in the sequence.)

If a constant factor(1) speedup counts as significant, then I can offer a significantly more efficient version:
hamm :: [Integer]
hamm = mrg1 hamm3 (map (2*) hamm)
  where
    hamm5 = iterate (5*) 1
    hamm3 = mrg1 hamm5 (map (3*) hamm3)
    merge a@(x:xs) b@(y:ys)
        | x < y     = x : merge xs b
        | otherwise = y : merge a ys
    mrg1 (x:xs) ys = x : merge xs ys
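Note the role of mrg1 here: it emits the head of its first argument without inspecting its second, which is what lets these knot-tied definitions get off the ground. A sketch of what would go wrong with plain merge (hamm3Bad is a name invented for illustration):

hamm3Bad = merge hamm5 (map (3*) hamm3Bad)
-- <<loop>>: to produce its first element, merge must force the head
-- of map (3*) hamm3Bad, i.e. the head of hamm3Bad itself.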
You can easily generalise it to smooth numbers for a given set of primes:
import Data.List (foldl', sortBy)

hamm :: [Integer] -> [Integer]
hamm []  = [1]
hamm [p] = iterate (p*) 1
hamm ps  = foldl' next (iterate (q*) 1) qs
  where
    (q:qs) = sortBy (flip compare) ps
    next prev m = let res = mrg1 prev (map (m*) res) in res
    merge a@(x:xs) b@(y:ys)
        | x < y     = x : merge xs b
        | otherwise = y : merge a ys
    mrg1 (x:xs) ys = x : merge xs ys
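For example, at the GHCi prompt:

> take 12 (hamm [2,3,5])
[1,2,3,4,5,6,8,9,10,12,15,16]
> take 10 (hamm [3,5,7])
[1,3,5,7,9,15,21,25,27,35]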
It's more efficient because this algorithm doesn't produce any duplicates and it uses less memory. In your version, when a Hamming number near h is produced, the part of the list between h/5 and h has to be in memory. In my version, only the part between h/2 and h of the full list, and the part between h/3 and h of the 3-5-list, need to be in memory. Since the 3-5-list is much sparser, and the density of k-smooth numbers decreases, those two list parts need much less memory than the larger part of the full list.
Some timings for the two algorithms to produce the k-th Hamming number, with the empirical complexity of each target relative to the previous one, excluding and including GC time:

k        Yours (MUT/GC)                Mine (MUT/GC)
10^5     0.03/0.01                     0.01/0.01                  -- too short to say much, really
2*10^5   0.07/0.02                     0.02/0.01
5*10^5   0.17/0.06    0.968 1.024      0.06/0.04     1.199 1.314
10^6     0.36/0.13    1.082 1.091      0.11/0.10     0.874 1.070
2*10^6   0.77/0.27    1.097 1.086      0.21/0.21     0.933 1.000
5*10^6   1.96/0.71    1.020 1.029      0.55/0.59     1.051 1.090
10^7     4.05/1.45    1.047 1.043      1.14/1.25     1.052 1.068
2*10^7   8.73/2.99    1.108 1.091      2.31/2.65     1.019 1.053
5*10^7   21.53/7.83   0.985 1.002      6.01/7.05     1.044 1.057
10^8     45.83/16.79  1.090 1.093      12.42/15.26   1.047 1.084
As you can see, the factor between the MUT times is about 3.5, but the GC time is not much different.
(1) Well, it looks constant, and I think both variants have the same computational complexity, but I haven't pulled out pencil and paper to prove it, nor do I intend to.
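For reference, a minimal sketch of how such empirical orders of growth are computed from two (size, time) data points (orderOfGrowth is a name invented here):

orderOfGrowth :: (Double, Double) -> (Double, Double) -> Double
orderOfGrowth (n1, t1) (n2, t2) = logBase (n2 / n1) (t2 / t1)

-- e.g. orderOfGrowth (5e6, 1.96) (1e7, 4.05) ~ 1.047, matching the table.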

So, now that Daniel Fischer has given his answer, I can say that I came across this recently, and I think this is an exciting development, since the classical code had been known for ages, since Dijkstra.
Daniel correctly identified the redundancy in the classical version: it generates duplicates, which must then be removed.
The credit for the original discovery (AFAIK) goes to Rosettacode.org's contributor Ledrug, as of 2012-08-26. And of course the independent discovery by Daniel Fischer, here (2012-09-18).
Re-written slightly, that code is:
import Data.Function (fix)
hamm = 1 : foldr (\n s -> fix (merge s . (n:) . map (n*))) [] [2,3,5]
with the usual implementation of merge,
merge a@(x:xs) b@(y:ys) | x < y     = x : merge xs b
                        | otherwise = y : merge a ys
merge [] b = b
merge a  [] = a
It gives about a 2.0x - 2.5x speedup vs. the classical version.
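For clarity, the foldr can be unrolled by hand into three knot-tied lists (a sketch; hammU, h5, h35 and h235 are names invented here):

import Data.Function (fix)

hammU :: [Integer]
hammU = 1 : h235
  where
    h5   = fix (\s -> merge []  (5 : map (5*) s))  -- 5, 25, 125, ...
    h35  = fix (\s -> merge h5  (3 : map (3*) s))  -- 3-5-smooth numbers > 1
    h235 = fix (\s -> merge h35 (2 : map (2*) s))  -- 2-3-5-smooth numbers > 1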

Well, this was easier than I thought. This will do 1000 Hammings in 0.05 seconds on my slow PC at home. This afternoon at work, on a faster PC, times for less than 600 were coming out as zero seconds.
This takes Hammings from Hammings. It's based on doing it fastest in Excel.
I was getting wrong numbers after 250000, with Int. The numbers grow very big very fast, so Integer must be used to be sure, because Int is bounded.
mkHamm :: [Integer] -> [Integer] -> [Integer] -> [Integer]
       -> Int -> (Integer, [Int])
mkHamm ml (x:xs) (y:ys) (z:zs) n =
    if n <= 1
        then (last ml, map length [(x:xs), (y:ys), (z:zs)])
        else mkHamm (ml ++ [m]) as bs cs (n-1)
  where
    m  = minimum [x, y, z]
    as = if x == m then xs ++ [m*2] else (x:xs) ++ [m*2]
    bs = if y == m then ys ++ [m*3] else (y:ys) ++ [m*3]
    cs = if z == m then zs ++ [m*5] else (z:zs) ++ [m*5]
Testing,
> mkHamm [1] [2] [3] [5] 5000
(50837316566580,[306,479,692]) -- (0.41 secs)
> mkHamm [1] [2] [3] [5] 10000
(288325195312500000,[488,767,1109]) -- (1.79 secs)
> logBase 2 (1.79/0.41) -- log of times ratio =
2.1262637726461726 -- empirical order of growth
> map (logBase 2) [488/306, 767/479, 1109/692] :: [Float]
[0.6733495, 0.6792009, 0.68041545] -- leftovers sizes ratios
This means that this code's run time has an empirical order of growth above quadratic (~n^2.13, as measured for interpreted code at the GHCi prompt).
Also, the sizes of the three dangling overproduced segments of the sequence are each ~n^0.67 i.e. ~n^(2/3).
Additionally, this code is non-lazy: the resulting sequence's first element can only be accessed after the very last one is calculated.
The state of the art code in the question is linear, overproduces exactly 0 elements past the point of interest, and is properly lazy: it starts producing its numbers immediately.
So, though an immense improvement over the previous answers by this poster, it is still significantly worse than the original, let alone its improvement as appearing in the top two answers.
12.31.2018
Only the very best people educate. @Will Ness also has authored or co-authored 19 chapters in GoalKicker.com “Haskell for Professionals”. The free book is a treasure.
I had carried around the idea of a function that would do this, like this. I was apprehensive because I thought it would be convoluted and involved logic like in some modern languages. I decided to start writing and was amazed how easy Haskell makes the realization of even bad ideas.
I've not had difficulty generating unique lists. My problem is the lists I generate do not end well. Even when I use diagonalization they leave residual values making their use unreliable at best.
Here is a reworked 3's and 5's list with nothing residual at the end. The diagonalization is to reduce residual values, not to eliminate duplicates, which are never included anyway.
g3s5s n = [ t*b | (a,b) <- [ ((d+n) - d*2, 5^d) | d <- [0..n] ],
                  t <- [ 3^e | e <- [0..a+8] ],
                  t*b <= 3^(n+6) + a ]   -- guard reconstructed; presumably a limit on t*b

ham2 n = take n $ ham2' (drop 1 . sort . g3s5s $ 48) [1]   -- sort from Data.List

ham2' o@(f:fo) e@(h:hx) =
    if h == min h f
        then h : ham2' o  (hx ++ [h*2])
        else f : ham2' fo (e ++ [f*2])
The twos list can be generated by multiplying each of the 3s5s by all powers 2^e; when the identity 2^0 is included, then, in total, it is the Hammings.
3/25/2019
Well, finally. I knew this some time ago but could not implement it without excess values at the end. The problem was how to not generate the excess that is the result of a Cartesian Product. I use Excel a lot and could not see the pattern of values to exclude from the Cartesian Product worksheet. Then, eureka! The functions generate lists of each lead factor. The value to limit the values in each list is the end point of the first list. When this is done, all Hammings are produced with no excess.
Two functions for Hammings. The first is a new 3's & 5's list which is then used to create multiples with the 2's. The multiples are Hammings.
h35r x = h3s5s x (5^x)

h3s5s x c = [ t | n <- [ 3^e | e <- [0..x] ],
                  m <- [ 5^e | e <- [0..x] ],
                  t <- [n*m],
                  t <= c ]

a2r n = sort $ a2s n (2^n)

a2s n c = [ h | b <- h35r n,
                a <- [ 2^e | e <- [0..n] ],
                h <- [a*b],
                h <= c ]
last $ a2r 50
1125899906842624
(0.16 secs, 321,326,648 bytes)
2^50
1125899906842624
(0.00 secs, 95,424 bytes)
This is an alternate implementation: cleaner and faster, with less memory usage.
gnf n f = scanl (*) 1 $ replicate f n

mk35 n = (\c -> [ m | t <- gnf 3 n, f <- gnf 5 n, m <- [t*f], m <= c ]) (2^(n+1))

mkHams n = (\c -> sort [ m | t <- mk35 n, f <- gnf 2 (n+1), m <- [t*f], m <= c ]) (2^(n+1))
last $ mkHams 50
2251799813685248
(0.03 secs, 12,869,000 bytes)
2^51
2251799813685248
5/6/2019
Well, I tried limiting differently but always come back to what is simplest. I am opting for the least memory usage, which also seems to be the fastest.
I also opted to use map with an implicit parameter.
I also found that mergeAll from Data.List.Ordered is faster than sort, or sort and concat.
I also like when sublists are created so I can analyze the data much easier.
Then, because of @Will Ness, I switched to iterate instead of scanl, making much cleaner code. Also because of @Will Ness, I stopped using the last of the 2s list and switched to one value determining all lengths.
I do think recursively defined lists are more efficient, the previous number multiplied by a factor.
Just separating the function into two doesn't make a difference so the 3 and 5 multiples would be
import Data.List.Ordered (mergeAll)  -- from the data-ordlist package

m35 lim = mergeAll $
          map (takeWhile (<= lim) . iterate (*3)) $
          takeWhile (<= lim) . iterate (*5) $ 1
And the 2s each multiplied by the product of 3s and 5s
ham n = mergeAll $
        map (takeWhile (<= lim) . iterate (*2)) $ m35 lim
  where lim = 2^n
After editing the function I ran it
last $ ham 50
1125899906842624
(0.00 secs, 7,029,728 bytes)
then
last $ ham 100
1267650600228229401496703205376
(0.03 secs, 64,395,928 bytes)
It is probably better to use 10^n but for comparison I again used 2^n
5/11/2019
Because I so prefer infinite and recursive lists I became a bit obsessed with making these infinite.
I was so impressed and inspired by @Daniel Wagner and his Data.Universe.Helpers that I started using +*+ and +++, but then added my own infinite list. I had to mergeAll my list to make it work, but then realized the infinite 3 and 5 multiples were exactly what they should be. So, I added the 2s and mergeAll'd everything, and they came out. Before, I stupidly thought mergeAll would not handle infinite lists, but it does, most marvelously.
When a list is infinite in Haskell, Haskell calculates just what is needed, that is, it is lazy. The adjunct is that it does calculate from the start.
Now, since Haskell multiplies only until the limit of what is wanted, no limit is needed in the function, that is, no more takeWhile. The speedup is incredible and the memory is lowered too.
The following is on my slow home PC with 3GB of RAM.
tia = mergeAll . map (iterate (*2)) $
      mergeAll . map (iterate (*3)) $ iterate (*5) 1
last $ take 10000 tia
288325195312500000
(0.02 secs, 5,861,656 bytes)
6.5.2019
I learned how to use ghc -O2. So the following is for 50000 Hammings, up to 2.38E+30. And this is further proof my code is garbage.
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.000s ( 0.916s elapsed)
GC time 0.047s ( 0.041s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 0.047s ( 0.962s elapsed)
Alloc rate 0 bytes per MUT second
Productivity 0.0% of total user, 95.8% of total elapsed
6.13.2019
@Will Ness rawks. He provided a clean and elegant revision of tia above and it proved to be five times as fast in GHCi. When I ghc -O2 +RTS -s his against mine, mine was several times as fast. There had to be a compromise.
So, I started reading about fusion that I had encountered in R. Bird's Thinking Functionally with Haskell and almost immediately tried this.
mai n = mergeAll . map (iterate (*n))

mai 2 $ mai 3 $ iterate (*5) 1
It matched Will's at 0.08 for 100K Hammings in GHCi, but what really surprised me (also for 100K Hammings) is the following, and especially the elapsed times. 100K is up to 2.9e+38.
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.000s ( 0.002s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.000s ( 0.002s elapsed)
Alloc rate 0 bytes per MUT second
Productivity 100.0% of total user, 90.2% of total elapsed

Related

How to (efficiently) find the biggest prime factor of a number using Haskell?

I am trying to practice Haskell by solving some of the tasks on Project Euler. In Problem 3, we have to find the biggest prime factor of the number 600851475143, which I had done before in Java a few years back.
I came up with the following:
primes :: [Int]
primes = sieve [2..]
  where sieve (p:xs) = p : sieve (filter (\x -> x `rem` p /= 0) xs)

biggestPrimeFactor :: Int -> Int
biggestPrimeFactor 1 = 0
biggestPrimeFactor x =
    if x `elem` takeWhile (< x + 1) primes
        then x
        else last (filter (\y -> x `rem` y == 0) (takeWhile (< x `div` 2) primes))
which works great for smaller numbers, but is terribly inefficient and as a result doesn't work well on the number I have been given.
This seems obvious, because the program iterates over all primes smaller than the number divided by 2 (if it isn't prime itself), but I am unsure what to do about it. Ideally I would be able to further restrict the possible checks, but I don't know how to accomplish this.
Note that I am not looking for an "optimal solution", but rather one that is at least moderately efficient for bigger numbers, and simple to understand and implement, as I am still a beginner in Haskell.
You have two main sources of slowness here. The easier one to address is the boundary condition in biggestPrimeFactor. Checking up to p > x `div` 2 is asymptotically worse than checking up to p^2 > x. But even that is very suboptimal when a number has a lot of factors. The largest factor may be far smaller than sqrt x. If you continually reduce the target number as you find factors, you can account for this and speed up the processing of random inputs by quite a lot.
Here's an example of that, including Daniel Wagner's note from the comments:
-- Naive trial division against a list of primes. Doesn't do anything
-- intelligent when asked to factor a number less than 2.
factorsNaive :: [Integer] -> Integer -> [Integer]
factorsNaive :: [Integer] -> Integer -> [Integer]
factorsNaive primes@(p : ps) x
    | p * p > x = [x]
    | otherwise = case x `quotRem` p of
        (q, 0) -> p : factorsNaive primes q
        _      -> factorsNaive ps x
A few notes:
I decided to have the primes list passed in. This is relevant in the next section, but it also allowed me to write this without a helper.
I specialized to Integer instead of Int because I wanted to throw big numbers at it without caring what maxBound :: Int is. This is slower, but I decided to default to correctness first.
I removed a traversal of the input list. Doing it in one pass is a bit more efficient, but mostly it's cleaner.
Strictly speaking, this is correct even if the input list contains non-primes, so long as the list starts at 2, is monotonically non-decreasing, and eventually contains every prime.
Note that when it recurses, it either discards a prime or produces one. It never will do both at the same time. This is an easy way to ensure it doesn't miss repeated factors.
I named this factorsNaive just to make it clear that it's not doing anything clever with number theory. There are very many things that could be done which are far more complex than this, but this is a good stopping point for understandable factoring of relatively small numbers...
Or at least it is okay at factoring as long as you have a convenient list of prime numbers. It turns out this is the second major cause of slowdown in your code. Your list of prime numbers is slow to generate as it gets longer.
Your definition of primes essentially stacks a bunch of filters on an input list. Every prime produced must go through a filter test for each previous prime. This might sound familiar - it's at least O(n^2) work to generate the first n primes. (It's actually more because division gets more costly as numbers get bigger, but let's ignore that for now.) It's a known (to mathematicians, I had to look it up to be sure) result that the number of primes less than or equal to n approaches n/ln n as n gets large. That approaches linear as n gets large, so generating the list of primes up to n approaches O(n^2) as n gets big.
(Yes, that argument is a mess. A formal version of it is presented in Melissa O'Neill's paper "The Genuine Sieve of Eratosthenes". Refer to it for much more rigorous argumentation of the result.)
It's possible to write much more efficient definitions of primes that have both better constant factors and better asymptotics. As that's the entire point of the paper mentioned in the parenthetical above, I won't go into the details too far. I'll just point out the very first possible optimization:
-- trial division. let's work in Integer for predictable correctness
-- on positive numbers
trialPrimes :: [Integer]
trialPrimes = 2 : sieve [3, 5 ..]
  where
    sieve (p : ps) = p : sieve (filter (\x -> x `rem` p /= 0) ps)
This does less than you might think. It doesn't double the speed, as the performance improvement is eventually outweighed by the filter stack mentioned before. This version only removes one filter from that stack, but at least it's the filter that rejects the most inputs in the initial version.
In ghci (no compilation or optimizations, and those can really make a difference), this was fast enough to factor the product of two five-digit primes in a few seconds.
ghci> :set +s
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(5.98 secs, 4,103,321,840 bytes)
Numbers with several small factors are handled much faster. Also notice that because the list of primes is a top-level binding, calculations are cached. Running the computation again has the list of primes pre-computed now.
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(0.01 secs, 6,934,688 bytes)
That also shows that the run time is absolutely dominated by generating the list of primes. The naive factorization is almost instant at that scale when the list of primes is already in memory.
But you shouldn't really trust performance of interpreted code.
main :: IO ()
main = print (factorsNaive trialPrimes $ 84761 * 60821)
gives
carl@DESKTOP:~/hask/2023$ ghc -O2 -rtsopts factor.hs
[1 of 2] Compiling Main ( factor.hs, factor.o )
[2 of 2] Linking factor
carl@DESKTOP:~/hask/2023$ ./factor +RTS -s
[60821,84761]
1,884,787,896 bytes allocated in the heap
32,303,080 bytes copied during GC
89,072 bytes maximum residency (2 sample(s))
29,400 bytes maximum slop
7 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 326 colls, 0 par 0.021s 0.021s 0.0001s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0004s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.523s ( 0.522s elapsed)
GC time 0.021s ( 0.022s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 0.545s ( 0.550s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,603,678,988 bytes per MUT second
Productivity 96.0% of total user, 94.8% of total elapsed
That dropped the run time from six seconds to a half-second. (Yeah, +RTS -s is pretty verbose for this, but it's quick and easy.) I think this is a reasonable place to stop with beginner-level code.
If you want to look into more efficient prime generation, the primes package on hackage contains an implementation of the algorithm in O'Neill's paper and an implementation of naive factoring that's equivalent to the one here.

Efficient bit-fiddling in a LFSR implementation

Although I have a good LFSR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which begs the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.
Here's the baseline Haskell code using lists and Data.Bits:
import Control.Monad (when)
import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.))
import System.Environment (getArgs)
import System.Exit (exitFailure, exitSuccess)
tap :: [[Int]]
tap = [
[], [], [], [3, 2],
[4, 3], [5, 3], [6, 5], [7, 6],
[8, 6, 5, 4], [9, 5], [10, 7], [11, 9],
[12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
[16,15,13,4], [17, 14], [18, 11], [19, 6, 2, 1],
[20, 17], [21, 19], [22, 21], [23, 18],
[24,23,22,17], [25, 22], [26, 6, 2, 1], [27, 5, 2, 1],
[28, 25], [29, 27], [30, 6, 4, 1], [31, 28],
[32,22,2,1], [33,20], [34,27,2,1], [35,33],
[36,25], [37,5,4,3,2,1],[38,6,5,1], [39,35],
[40,38,21,19], [41,38], [42,41,20,19], [43,42,38,37],
[44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
[48,47,21,20], [49,40], [50,49,24,23], [51,50,36,35],
[52,49], [53,52,38,37], [54,53,18,17], [55,31],
[56,55,35,34], [57,50], [58,39], [59,58,38,37],
[60,59], [61,60,46,45], [62,61,6,5], [63,62] ]
xor' :: [Bool] -> Bool
xor' = foldr xor False

mask :: (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
  where
    shifted = shift lfsr 1 .&. mask len
    d0      = xor' $ map (testBit lfsr) tap'
    tap'    = map (subtract 1) tap
main :: IO ()
main = do
    args <- getArgs
    when (null args) $ fail "Usage: lsfr <number-of-bits>"
    let len = read $ head args
    when (len < 8) $ fail "No need for LFSR"
    let out = last $ take (shift 1 len) $ iterate (advance len (tap !! len)) 0
    if out == 0
        then do
            putStr "OK\n"
            exitSuccess
        else do
            putStr "FAIL\n"
            exitFailure
Basically it tests whether the LFSR defined in tap :: [[Int]] for any given bit-length is of maximum length. (More precisely, it just checks whether the LFSR returns to the initial state (zero) after 2^n iterations.)
According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.
What I've tried so far:
use Data.Array: Attempt abandoned because there's no foldl/r
use Data.Vector: Slightly faster than the baseline
The compiler options I use are: -O2, LTS Haskell 8.12 (GHC-8.0.2).
The reference C++ program can be found on gist.github.com.
The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much; there must be a better way to do the bit-fiddling.
Update: Results of applying the optimisations suggested in the answers
The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
The baseline Haskell code runs about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
With the rewrite of @Thomas M. DuBuisson, still using the default GHC backend, the execution time goes down to 5.2s
With the rewrite of @Thomas M. DuBuisson, now using the LLVM backend (GHC option -O2 -fllvm), the execution time goes down to 1.7s
Using GHC option -O2 -fllvm -optlc -mcpu=native brings this to 0.73s
Replacing iterate with iterate' of @Cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.
So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
Note
The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew and then symlinking opt and llc. See 7.10. GHC Backends.
Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Let's dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. First was the use of shift in high-performance code. shift supports both left and right shifts, and to do so it requires a branch. Let's kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x).
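Applied to advance, that transformation might look like this (a sketch of the change; behaviour is unchanged):

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
  where
    shifted = (2 * lfsr) .&. (2 ^ len - 1)  -- was: shift lfsr 1 .&. mask len
    d0      = xor' $ map (testBit lfsr) tap'
    tap'    = map (subtract 1) tap

With that change: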
% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well, you have this awkward list of bit locations you're tapping, and that just seems like it's begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21) but we want a more general fast solution.
I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.
This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:
let tp = sum $ map ((2^) . subtract 1) (tap !! len)
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
    mask = 2^len - 1
    advance' :: Int -> Int
    advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr
    out :: Int
    out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
OK
./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beat our mark by a fair bit.
I suspect that the following line is building a large list-like thunk in memory before evaluating it.
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
Let's find out if I'm right, and if I am, we'll fix it. The first debugging step is to get an idea of the memory used by the program. To do this we're going to compile with the option -rtsopts in addition to -O2. This enables running the program with RTS options, including +RTS -s, which outputs a small memory summary.
Initial Performance
Running your program as lfsr 25 +RTS -s I get the following output
OK
5,420,148,768 bytes allocated in the heap
6,705,977,216 bytes copied during GC
1,567,511,384 bytes maximum residency (20 sample(s))
357,862,432 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.453s 2.522s 0.0002s 0.0009s
Gen 1 20 colls, 0 par 2.281s 3.065s 0.1533s 0.7128s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.438s ( 1.162s elapsed)
GC time 4.734s ( 5.587s elapsed)
EXIT time 0.016s ( 0.218s elapsed)
Total time 6.188s ( 6.967s elapsed)
%GC time 76.5% (80.2% elapsed)
Alloc rate 3,770,538,273 bytes per MUT second
Productivity 23.5% of total user, 19.8% of total elapsed
That's a lot of memory used at once. It's very likely there's a big thunk building up in there somewhere.
Trying to reduce the thunk size
I hypothesized that the thunk is being built in iterate (advance ...). If this is the case, we can try to reduce the thunk size by making advance more strict in its lsfr argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine's evaluated.
BangPatterns is an easy way to make a function strict in an argument. f !x = .. is shorthand for f x = seq x $ ...
{-# LANGUAGE BangPatterns #-}

advance :: Int -> [Int] -> Int -> Int
advance len tap = go
  where
    go !lfsr
        | d0        = shifted
        | otherwise = shifted .|. 1
      where
        shifted = shift lfsr 1 .&. mask len
        d0      = xor' $ map (testBit lfsr) tap'
    tap' = map (subtract 1) tap
Let's see what difference this makes ...
>lfsr 25 +RTS -s
OK
5,420,149,072 bytes allocated in the heap
6,705,979,368 bytes copied during GC
1,567,511,448 bytes maximum residency (20 sample(s))
357,862,448 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.688s 2.711s 0.0003s 0.0059s
Gen 1 20 colls, 0 par 2.438s 3.252s 0.1626s 0.8013s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.328s ( 1.146s elapsed)
GC time 5.125s ( 5.963s elapsed)
EXIT time 0.000s ( 0.226s elapsed)
Total time 6.484s ( 7.335s elapsed)
%GC time 79.0% (81.3% elapsed)
Alloc rate 4,081,053,418 bytes per MUT second
Productivity 21.0% of total user, 18.7% of total elapsed
None that's noticeable.
Eliminating the Spine
I guess it's the spine of that iterate (advance ...) that's being built. After all, for the command I'm running the list would be 2^25, or a little over 33 million items long. The list itself is probably being removed by list fusion, but the thunk for the last item of the list is over 33 million applications of advance ...
To solve this problem we need a strict version of iterate so that the value is forced to an Int before applying the advance function again. This should keep the memory down to only a single lfsr value at a time, along with the currently computed application of advance.
Unfortunately, there isn't a strict iterate in Data.List. Here's one that doesn't give up on the list fusion that's providing other important (I think) performance optimizations to this problem.
{-# LANGUAGE BangPatterns #-}

import GHC.Base (build)

{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
  where go !x = x : go (f x)

{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
  where go !x = x `c` go (f x)

{-# RULES
"iterate'"   [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'" [1]  iterateFB' (:) = iterate'
  #-}
This is just iterate from GHC.List (along with all its rewrite rules), but made strict in the accumulated argument.
Equipped with a strict iterate, iterate', we can change the troublesome line to
let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0
I expect that this will perform much better. Let's see ...
>lfsr 25 +RTS -s
OK
3,758,156,184 bytes allocated in the heap
297,976 bytes copied during GC
43,800 bytes maximum residency (1 sample(s))
21,736 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7281 colls, 0 par 0.047s 0.008s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.750s ( 0.783s elapsed)
GC time 0.047s ( 0.008s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.797s ( 0.792s elapsed)
%GC time 5.9% (1.0% elapsed)
Alloc rate 5,010,874,912 bytes per MUT second
Productivity 94.1% of total user, 99.0% of total elapsed
This used 0.00002 times as much memory and ran 10 times as fast.
I don't know if this will improve on Thomas M. DuBuisson's answer that improves advance but still leaves a lazy iterate advance' in place. It would be easy to check: add the iterate' code to that answer and use iterate' in place of iterate in that answer.
Does the compiler lift tap !! len out of the loop? I suspect it does, but moving it out to guarantee this can't hurt:
let tap1 = tap !! len
let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0
In the comments you say "2^len is needed exactly once", but this is wrong. You compute it each time in advance. So you could try
advance len tap mask lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
  where
    shifted = shift lfsr 1 .&. mask
    d0      = xor' $ map (testBit lfsr) tap'
    tap'    = map (subtract 1) tap

-- in main
let tap1 = tap !! len
let numIterations = 2^len
let mask = numIterations - 1
let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)
(the compiler can't optimize last $ take ... to !! in general, because they are different for finite lists, but iterate always returns an infinite one.)
You compared foldr with foldl, but foldl is almost never what you need; since xor always needs both arguments and is associative, foldl' is very likely to be the right choice (the compiler can optimize it, but if there is any real difference between foldl and foldr and not just random variation, it might have failed in this case).
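Concretely, that suggestion is a one-line change (a sketch, assuming import Data.List (foldl')):

xor' :: [Bool] -> Bool
xor' = foldl' xor False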

Running into memory issues with Data.Sequence on a manageably sized dataset

TL;DR: I'm working on a piece of code which generates a (long) array of numbers. I'm able to generate this array, convert it to a List and then calculate the maximum (using a strict left fold). BUT, I run into memory issues when I try to convert the list to a Sequence prior to calculating the maximum. This is quite counter-intuitive to me.
My question: Why is this happening and what is the correct approach for converting the data to a Sequence structure?
Background:
I'm working on a problem which I've chosen to tackle using three steps (below).
*Note: I'm intentionally keeping the problem statement vague so this post doesn't serve as a hint.
Anyways, my proposed approach:
(i) First, generate a long list of integers, namely, the number of factors for each integer from 1 to 100 million (NOT the factors themselves, just the number of factors)
(ii) second, convert this list into a Sequence.
(iii) lastly, use an efficient sliding window maximum algorithm to calc my answer (this step requires dequeue operations, hence the need for a Sequence)
(Again, the specifics of the problem aren't that relevant as I'm just curious as to why I'm running into this particular issue in the first place.)
What I've done so far?
Step 1 was fairly straightforward - see output below (full code is included at the bottom). I just brute-force a sieve using an unboxed array and the accumArray function, nothing fancy. Note: I've used this same algorithm to solve a number of other such problems, so I'm reasonably confident that it's giving the right answer.
For the purposes of showing execution time / memory-usage stats, I've (admittedly arbitrarily) chosen to calculate the maximum element in the resulting array - the idea is simply to use a function which forces construction of all elements of the Array, thereby ensuring that we see meaningful stats for exec time / memory-usage.
The following main function...
main = print $ maximum' $ elems (sieve (10^8))
...results in the following (i.e., it says that the number below 100 million with the most divisors has a total of 768 divisors):
Linking maxdivSO ...
768
33.73s user 70.80s system 99% cpu 1:44.62 total
344,214,504,640 bytes allocated in the heap
58,471,497,320 bytes copied during GC
200,062,352 bytes maximum residency (298 sample(s))
3,413,824 bytes maximum slop
386 MB total memory in use (0 MB lost due to fragmentation)
%GC time 24.7% (30.5% elapsed)
The problem
It seems like we can accomplish the first step without breaking a sweat since I've allocated a total of 5gb to my VirtualBox and the above code uses <400mb (as reference, I've seen programs execute successfully and report using 3gb+ of total memory). In other words, it seems like we've accomplished Step 1 with plenty of headroom.
So I'm a bit surprised as to why the following version of the main function fails. We attempt to perform the same calculation of the maximum but after converting the list of integers to a Sequence. The following code...
main = print $ maximum' $ fromList $ elems (sieve (10^8))
...results in the following:
Linking maxdivSO ...
maxdivSO: out of memory (requested 2097152 bytes)
39.48s user 76.35s system 99% cpu 1:56.03 total
My question: Why does the algorithm (as currently written) run out of memory if we try to convert the list to a Sequence? And how might I go about successfully converting this list into a Sequence?
(I'm not one to stubbornly stick to brute-force for these types of problems - but I have a strong suspicion that this particular issue is due to my not being able to reason well about evaluation.)
The code itself:
{-# LANGUAGE NoMonomorphismRestriction #-}

import Data.Word (Word32, Word16)
import Data.Foldable (Foldable, foldl')
import Data.Array.Unboxed (UArray, accumArray, elems)
import Data.Sequence (fromList)

main :: IO ()
main = print $ maximum' $ elems (sieve (10^8))               -- <-- works
--main = print $ maximum' $ fromList $ elems (sieve (10^8))  -- <-- doesn't work

-- NB: foldl' passes the accumulator first, so the parameter names here
-- are swapped; taking the larger of the two is symmetric, though, so it
-- still computes the maximum (of the elements and 0).
maximum' :: (Foldable t, Ord a, Num a) => t a -> a
maximum' = foldl' (\x acc -> if x > acc then x else acc) 0

sieve :: Int -> UArray Word32 Word16
sieve u' = accumArray (+) 2 (1,u) ((1,-1) : factors)
  where
    u = fromIntegral u'
    cap = floor $ sqrt (fromIntegral u) :: Word32
    factors = [ (i*d, j) | d <- [2..cap]
                         , i <- [2..(u `quot` d)]
                         , d <= i
                         , let j = if i == d then 1 else 2
              ]
I think the reason for this is that to get the first element of a sequence requires the full sequence to be constructed in memory (since the internal representation of the sequence is a tree). In the list case, elems yields the elements lazily.
Rather than turning the full array into a sequence, why not make the sequence only as long as your sliding window?
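A sketch of that suggestion (windowMaxima and its helpers are names invented here, not from the original post): keep a deque of (index, value) pairs whose values are decreasing, so the front is always the current window's maximum, and the deque never holds more than w elements.

import Data.Sequence (ViewL(..), ViewR(..), (|>), viewl, viewr)
import qualified Data.Sequence as Seq

-- Sliding-window maxima in one pass over the input list.
windowMaxima :: Ord a => Int -> [a] -> [a]
windowMaxima w = go 0 Seq.empty
  where
    go _ _ [] = []
    go i dq (x:rest) =
        let dq'  = dropSmaller dq              -- pop entries <= x from the back
            dq'' = dropExpired (dq' |> (i, x)) -- drop entries outside the window
            out  = case viewl dq'' of
                     (_, m) :< _ -> [m | i >= w - 1]
                     EmptyL      -> []
        in out ++ go (i + 1) dq'' rest
      where
        dropSmaller s = case viewr s of
            s' :> (_, v) | v <= x -> dropSmaller s'
            _                     -> s
        dropExpired s = case viewl s of
            (j, _) :< s' | j <= i - w -> dropExpired s'
            _                         -> s

-- e.g. windowMaxima 3 [1,3,2,5,4,1] == [3,5,5,5]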

Can Haskell evaluate and not garbage collect random indexes in a list?

As I understand it, Haskell only garbage collects when something goes out of scope, so a top level binding will only be evaluated once and never go out of scope. So if I run this code in GHCI, the first 50 elements will be evaluated and saved.
let xs = map f [0..]
take 50 xs
My question is, what happens when I execute the following snippet: xs !! 99. What does the garbage collector save? Does it
Keep results for indexes 0 - 49, thunk for indexes 50 - 98, result for index 99, thunk for indexes 100+
Keep results for indexes 0 - 49, thunk for indexes 50+
Keep results for indexes 0 - 99, thunk for indexes 100+
Haskell lists are linked lists made up of (:) ("cons") cells and terminated by a [] ("nil") value. I'll draw such cells like this
[x] -> (tail - remainder of list)
 |
 v
(head - a value)
So when thinking about what is evaluated, there are two pieces to consider. First is the spine, that is the structure of cons cells, and second is the values the list contains. Instead of 50 and 99, let's use 2 and 4, respectively.
ghci> take 2 xs
[0,1]
Printing this list forces the evaluation of the first two cons cells as well as the values within them. So your list looks like this:
[x] -> [x] -> (thunk)
 |      |
 v      v
 0      1
Now, when we
ghci> xs !! 4
4
we haven't demanded the 2nd or 3rd values, but we need to evaluate those cons cells to get to the 4th element. So we have forced the spine all the way up to the 4th element, but we have only evaluated the 4th value, so the list now looks like this:
[x] -> [x] -> [x]     -> [x]     -> [x] -> (thunk)
 |      |      |          |          |
 v      v      v          v          v
 0      1   (thunk)    (thunk)       4
Nothing in this picture will be garbage-collected. However, sometimes those thunks can take up a lot of space or reference something large, and evaluating them to a plain value will allow those resources to be freed. See this answer for a small discussion of those subtleties.
Let's ask the profiler.
We will compile the following example program that should do approximately what your GHCI session did. It's important that we print the results, like GHCI did, as this forces the computations.
f x = (-x)

xs = map f [0..]

main = do
    print (take 50 xs)
    print (xs !! 99)
I saved mine as example.hs. We'll compile it with options to enable profiling
ghc -prof -fprof-auto -rtsopts example.hs
Time profile
We can find out how many times f was applied with a time profile.
example +RTS -p
This produces an output file named example.prof, the following is the interesting portion:
COST CENTRE  MODULE  no.  entries
...
f            Main    78   51
We can see that f was evaluated 51 times: 50 times for print (take 50 xs) and once for print (xs !! 99). Therefore we can rule out your third possibility; since f was only evaluated 51 times, there can't be results for all of the indices 0-99:
Keep results for indexes 0 - 99, thunk for indexes 100+
Heap profile of results
Profiling the memory on the heap is a bit trickier. The heap profiler takes samples, by default once every .1 seconds. Our program will run so fast that the heap profiler won't take any samples while it's running. We'll add a spin to our program so that the heap profiler will get a chance to take a sample. The following will spin for a number of seconds.
import Data.Time.Clock

spin :: Real a => a -> IO ()
spin seconds = do
    startTime <- getCurrentTime
    let endTime = addUTCTime (fromRational (toRational seconds)) startTime
    let go = do
            now <- getCurrentTime
            if now < endTime then go else return ()
    go
We don't want the garbage collector to collect the data while the program is running, so we'll add another usage of xs after the spin.
main = do
    print (take 50 xs)
    print (xs !! 99)
    spin 1
    print (xs !! 0)
We'll run this with the default heap profiling option, which groups memory usage by cost center.
example +RTS -h
This produces the file example.hp. We'll pull a sample out of the middle of the file where the numbers are stable (while it was in a spin).
BEGIN_SAMPLE 0.57
(42)PINNED 32720
(48)Data.Time.Clock.POSIX.CAF 48
(85)spin.endTime/spin/mai... 56
(84)spin.go/spin/main/Mai... 64
(81)xs/Main.CAF 4848
(82)f/xs/Main.CAF 816
(80)main/Main.CAF 160
(64)GHC.IO.Encoding.CAF 72
(68)GHC.IO.Encoding.CodeP... 136
(57)GHC.IO.Handle.FD.CAF 712
(47)Main.CAF 96
END_SAMPLE 0.57
We can see that f has produced 816 bytes of memory. For "small" Integers, an Integer consumes 2 words of memory. On my system, a word is 8 bytes of memory, so a "small" Integer takes 16 bytes. Therefore 816/16 = 51 of the Integers produced by f are probably still in memory.
We can check that all of this memory is actually being used for "small" Integers by asking for a profile by closure description with -hd. We can't both group memory usage by closure description and cost-centre, but we can restrict the profiling to a single cost-centre with -hc; in this case, we are interested in the f cost-centre
example +RTS -hd -hcf
This says that all 816 bytes due to f are being used by S#, the constructor for "small" Integers
BEGIN_SAMPLE 0.57
S# 816
END_SAMPLE 0.57
We can certainly rule out the following, since 51 Integer results are being kept, and it expects only 50 Integers to be kept
Keep results for indexes 0 - 49, thunk for indexes 50+
Heap profile of structure and thunks
This leaves us with option
Keep results for indexes 0 - 49, thunks for indexes 50 - 98, result for index 99, thunk for indexes 100+
Let's guess how much memory this situation would be consuming.
In general, Haskell data types require 1 word of memory for the constructor, and 1 word for each field. The [] type's : constructor has two fields, so it should take 3 words of memory, or 24 bytes. 100 :s would then take 2400 bytes of memory. We will see that this is exactly correct when we ask about the closure description for xs.
It's difficult to reason about the size of thunks, but we'll give it a try. There would be 49 thunks for the values at indices [50, 98]. Each of these thunks is probably holding the Integer from the generator [0..]. It also needs to hold the structure of a thunk, which unfortunately changes when profiling. There will also be a thunk for the remainder of the list. It will need the Integer from which the remainder of the list is generated, and the structure of a thunk.
Asking for a memory breakdown by closure description for the xs cost-centre
example +RTS -hd -hcxs
Gives us the following sample
BEGIN_SAMPLE 0.60
<base:GHC.Enum.sat_s34b> 32
<base:GHC.Base.sat_s1be> 32
<main:Main.sat_s1w0> 16
S# 800
<base:GHC.Base.sat_s1bd> 1568
: 2400
END_SAMPLE 0.60
We were exactly correct about there being 100 :s requiring 2400 bytes of memory. There are 49+1 = 50 "small" Integers (S#) occupying 800 bytes: one held by each of the 49 thunks for the uncalculated values, and one held by the thunk for the remainder of the list. There are 1568 bytes which are probably the 49 thunks for the uncalculated values, which would each then be 32 bytes, or 4 words. There are another 80 bytes we can't exactly explain that are left over for the thunk for the remainder of the list.
Both memory and time profiles are consistent with our belief that the program will
Keep results for indexes 0 - 49, thunks for indexes 50 - 98, result for index 99, thunk for indexes 100+

Sieving prime numbers with Haskell

OK, so I'm trying to write a Haskell program which counts prime numbers extremely fast. Presumably I am not the first person to try to do this. (In particular, I'm damned sure I saw some prior art, but I can't find it now...)
Initially, I want to count the number of primes less than 10^11. Currently I've left my program running for about 15 minutes and it's not even half way there yet. Some rabid C++ programmer claims his program only takes 8 seconds. So clearly I'm doing something horrifyingly wrong.
(In case it matters, my current implementation uses IOUArray Integer Bool and multiple threads to process independent subranges of the search space. Currently it takes several seconds to remove all the multiples of 2 from a 10MB array chunk...)
Note that 10^11 is too big for 32-bit arithmetic. Also, 10^11 bits = 12.5 GB, far too much data to fit into Haskell's 32-bit address space. So you can't have the entire bitmap in memory at once. Finally, note that the number of primes less than 10^11 is just a shade less than 2^32, so there's no way you can store the actual integers all at once either.
Edit: Apparently I misread the timing information. What the C++ guy actually claimed was:
Counting primes < 10^11 takes 8 minutes using just one core, and 56 seconds using 4 cores. (CPU type not specified.)
Counting primes < 10^10 takes 5 seconds. (Not sure how many cores that's using.)
Sorry about the mistake...
Edit: My source code can be found here: http://hpaste.org/72898
Using the package arithmoi by the excellent StackOverflow teacher Daniel Fischer:
import Math.NumberTheory.Primes.Counting

main = print $ primeCount (10^11)

-- $ time ./arith
-- 4118054813
--
-- real    0m0.209s
-- user    0m0.198s
-- sys     0m0.008s
Which is 40 times faster than whatever your 'rabid' C++ friend has written; maybe he can learn a thing or two looking at the Haskell source ... Here are the haddocks
Some rabid C++ programmer claims his program only takes 8 seconds.
Is that wall-clock time or CPU time?
If wall-clock, and the task is split across 100 CPUs, say, it's not very impressive (it's decent), if split across 1000, it's pitiful.
If it's CPU time:
I'm pretty sure that time is not reached by actually sieving up to 10^11.
With a few more than 4×10^9 primes until then, assuming a somewhat normal 2-3GHz CPU, you'd have 4-6 cycles per prime.
You cannot achieve that with a sieve of Eratosthenes, nor with a sieve of Atkin. Each prime has to be inspected and counted, each composite marked as such and inspected. That gives a theoretical lower bound of two cycles per number in the sieve, not counting e.g. array initialisation, loop bound checking, loop variable updates, redundant markings. You're not going to come near that theoretical bound.
A few data points:
Daniel Bernstein's primegen (Sieve of Atkin), with the sieving blocks adjusted to take full advantage of my 32KB L1-cache, takes 90 seconds to sieve the primes to 10^11 and count them (234 seconds with the default sieve-block size of 8K words) on my Core i5 2410M (2.3GHz). (It's much optimised for the range up to 2^32, but above that, it becomes noticeably slower; for the limit 10^9 the times are 0.49 resp. 0.64 seconds.)
My segmented Sieve of Eratosthenes, using some not exposed internals to avoid list creation, sieves and counts to 10^11 in 340 seconds (sniff :-/, but hey, for 10^9 it took 2.92 seconds - it's getting closer, and somewhere between 10^12 and 10^13 it overtakes primegen :) Using the exposed interface creating a list of primes roughly doubles the time taken, as does compiling it with a 32-bit GHC.
So I'd wager that the reported time of 8 seconds - if CPU time - is, if correct, for an algorithm counting the number of primes without actually sieving the whole way. As indicated by applicative's answer, that can be done much faster.
dafis@schwartz:~/Haskell/Repos/arithmoi> time tests/primeCount 100000000000
4118054813
real 0m0.145s
user 0m0.139s
sys 0m0.006s
Note that 10^11 is too big for 32-bit arithmetic. Also, 10^11 bits = 12.5 GB, far too much data to fit into Haskell's 32-bit address space. So you can't have the entire bitmap in memory at once.
To sieve that range, you have to use a segmented sieve. Even if you're not restricted by a 32-bit address space, using such a large array will yield abysmal performance due to frequent cache misses. Your programme will spend most of its time waiting for data to be transferred from main memory. Sieve in chunks that fit in your L2 cache (I haven't succeeded in making it faster by fitting the sieve in L1; I guess the overhead of the GHC runtime is too large to make it work).
Also, eliminate the multiples of some small primes from the sieve; that reduces the needed work, and additionally improves performance by making the sieve smaller. Eliminating even numbers is trivial, multiples of 3 easy, multiples of 5 not very difficult (a sketch of such a wheel follows).
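For illustration, a minimal sketch of a 2-3-5 wheel (wheel235 is a name invented here): the candidates coprime to 2, 3 and 5 can be generated by cycling the gaps between the residues mod 30 that survive.

wheel235 :: [Int]
wheel235 = 7 : zipWith (+) wheel235 (cycle [4,2,4,2,4,6,2,6])
-- 7,11,13,17,19,23,29,31,37,... (candidates only: 2, 3 and 5 themselves
-- must be handled separately, and composites like 49 still appear)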
Finally, note that the number of primes less than 10^11 is just a shade less than 2^32, so there's no way you can store the actual integers all at once either.
If you store the sieve as a list of bit-arrays, with the multiples of 2, 3 and 5 removed, you need about 3.3GB to store the chunks, so if you really can have up to 4GB, it would fit. But you should rather let the chunks you don't need anymore be garbage-collected immediately.
(In case it matters, my current implementation uses IOUArray Integer Bool and multiple threads to process independent subranges of the search space. Currently it takes several seconds to remove all the multiples of 2 from a 10MB array chunk...)
It does matter.
Use Int for the indices and unsafeRead/unsafeWrite to read and modify the array. Integer computations are much slower than Int computations, and the bounds-checking you get with readArray/writeArray really hurts.
10MB chunks are far too large, you lose cache-locality. Use chunks of a few hundred KB at most (L2 cache minus some space for other things needed).
Still, it shouldn't take several seconds to remove multiples of 2 even with Integer indices, bounds-checking and 10MB chunks. Can we see your code?
Post-vacation update:
Eight minutes to sieve the primes up to 10^11 is possible without deep wizardry. I don't see how going from one to four cores could yield an eightfold speedup, since there should be no cache-effects here, but whatever, it may be; without seeing the code, I can't investigate.
So let's take a look at your code.
First, an incorrectness:
vs <- mapM
        (\ start -> do
            let block = (start, start + block_size)
            v <- newEmptyMVar
            writeChan queue $ do
                say $ "New chunk " ++ show block
                chunk <- chunk_new block
                sieve_top base chunk
                c <- chunk_count chunk
                putMVar v c
            return v)
        (takeWhile (< target) $ iterate (+ block_size) base_max)
The numbers base_max + k*block_size appear in two chunks each; if any of them is prime, that prime is counted twice. Also, you should cap the upper bound at target.
Now to the performance aspect:
One thing that jumps out is that it's real chatty, so chatty that it's measurable once you have adjusted the block_size to the cache (I took 256KB blocks for a 512KB L2 cache) - then the threads are slowed down by fighting for stdout for the if prime < 100 then say $ "Sieve " ++ show prime else return () message.
Let's look at your (silenced) sieving loop:
chunk_sieve :: Chunk -> Integer -> IO ()
chunk_sieve array prime = do
    (min, max) <- getBounds array
    let n0 = min `mod` prime
    let n1 = if n0 == 0 then min else min - n0 + prime
    mapM_
        (\ n -> writeArray array n (n == prime))
        (takeWhile (<= max) $ iterate (+ prime) n1)
One thing that costs time is that each index is compared to the prime whose multiples are marked off. Each single comparison is cheap (though considerably more expensive than an Int comparison), but the huge number of comparisons, only one of which may yield True, adds up. Unconditionally writing False and if necessary writing True at the prime's index after the loop yields a considerable speedup.
For timing purposes I've reduced the target to 10^9 and ran it on two cores. The original code took 155s (elapsed, 292s user), with the reduced block_size 148s, silenced 143s. Omitting the comparison,
mapM_
    (\ n -> writeArray array n False)
    (takeWhile (<= max) $ iterate (+ prime) n1)
when (min <= prime && prime <= max) $ writeArray array prime True
it runs in 131s.
Now it's time for some bigger speedups. Did I already mention that bounds-checking costs a lot of time? Since the loop condition guarantees that no out-of-bounds access is attempted (and the primes are small enough that no Int-overflow can happen), we should really use the unchecked access:
chunk_sieve :: Chunk -> Integer -> IO ()
chunk_sieve array prime = do
    (min, max) <- getBounds array
    let n0 = min `mod` prime
        n1 = if n0 == 0 then min else min - n0 + prime
        n2 = fromInteger (n1 - min)
        mx = fromInteger (max - min)
        pr = fromInteger prime
    mapM_
        (\ n -> unsafeWrite array n False)
        (takeWhile (<= mx) $ iterate (+ pr) n2)
    when (min <= prime && prime <= max) $ writeArray array prime True
which reduces the running time to 96s. Much better, but still abysmal. The culprit is
takeWhile (<= mx) $ iterate (+pr) n2
GHC can't fuse that composition well, and you get a list of boxed Ints that is traversed. Replace that with an arithmetic sequence, [n2, n2+pr .. mx] and GHC happily creates a loop using unboxed Int#s, 37 seconds.
Much much better, but still bad. The biggest time-consumer now is
chunk_count :: Chunk -> IO Integer
chunk_count array = do
    (min, max) <- getBounds array
    work min max 0
  where
    work i max count = do
        b <- readArray array i
        let count' = count + if b then 1 else 0
        evaluate count'
        let i' = i + 1
        if i' > max
            then return count'
            else work i' max count'
Again, the bounds-checking costs a lot of time. With
chunk_count :: Chunk -> IO Integer
chunk_count array = do
    (min, max) <- getBounds array
    work 0 (fromInteger (max - min)) 0
  where
    work i max count = do
        b <- unsafeRead array i
        let count' = count + if b then 1 else 0
        evaluate count'
        let i' = i + 1
        if i' > max
            then return count'
            else work i' max count'
we're down to 15 seconds. Now, evaluate count' is a somewhat expensive way to make work strict in count. Using else work i' max $! count' in the last line instead of evaluate reduces the running time to 13 seconds. Defining work in a more suitable (for GHC, at least) way,
chunk_count :: Chunk -> IO Integer
chunk_count array = do
    (min, max) <- getBounds array
    let mx = fromInteger (max - min)
        work i !ct
            | mx < i    = return ct
            | otherwise = do
                b <- unsafeRead array i
                work (i + 1) (if b then ct + 1 else ct)
    work 0 0
brings the time down to 6.55 seconds. Now we're in a situation where say $ "New chunk " ++ show block makes a measurable difference, disabling that gets us down to 6.18 seconds.
However, counting set bits by reading a byte from the array, masking off the undesired bits and comparing to 0 for each individual bit is not the most efficient way. It's faster to read entire Words from the array (via castIOUArray) and use popCount, if "you know what you're doing..."; that gets us down to 4.25 seconds. Stopping the marking when the square of the prime becomes larger than the upper bound of the chunk,
sieve_top :: Chunk -> Chunk -> IO ()
sieve_top base chunk = work 2
  where
    work prime = do
        chunk_sieve chunk prime
        mp <- chunk_next_prime base prime
        case mp of
            Nothing -> return ()
            Just p' -> do
                (_, mx) <- getBounds chunk
                when (p' * p' <= mx) $ work p'
to 3.9 seconds. Still not spectacular, but considering where we started, not bad. Just to illustrate the importance of cache locality once other bad behaviour has been reduced: the same code with the original 10MB block size takes 8.5 seconds.
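For reference, the popCount idea mentioned above might look roughly like this (a sketch; countBits is a name invented here, and it assumes the array's bit length is a multiple of the word size, since otherwise the final partial word would need masking):

{-# LANGUAGE BangPatterns #-}

import Data.Array.Base (castIOUArray, unsafeRead)
import Data.Array.IO (IOUArray, getBounds)
import Data.Bits (popCount, finiteBitSize)

-- Count the True bits of a bit-packed Bool array by reinterpreting it
-- as Words and popCount-ing a whole Word at a time. The cast keeps the
-- original bounds, so the Word count must be computed by hand.
countBits :: IOUArray Int Bool -> IO Int
countBits arr = do
    (lo, hi) <- getBounds arr
    let nWords = (hi - lo + 1) `div` finiteBitSize (0 :: Word)
    warr <- castIOUArray arr :: IO (IOUArray Int Word)
    let go !i !acc
            | i >= nWords = return acc
            | otherwise   = do
                w <- unsafeRead warr i
                go (i + 1) (acc + popCount w)
    go 0 0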
Another small problem in your code is that all threads use the same mutable array of small primes for sieving. Since it is mutable, access to it must be synchronised, which adds a bit of overhead. With only two threads, the overhead isn't too big: using an immutable copy to do the sieving only reduces the time to 3.75 seconds here, but I expect that the effect would be larger for more threads. (I have only two physical cores - with hyperthreading - so using more than two threads doing the same kind of work introduces a slowdown that may invalidate conclusions drawn from that, but using four threads, I get 4.55 seconds with the immutable array versus 5.3 seconds with the mutable array. That seems to corroborate the growing synchronisation overhead.)
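The fix is essentially a one-liner; a sketch with a hypothetical name, assuming the table of small primes is completely built before any sieving thread starts:

import Data.Array.IO (IOUArray)
import Data.Array.Unboxed (UArray)
import Data.Array.Unsafe (unsafeFreeze)

-- freeze the finished table once; every thread then reads the pure UArray
-- without any synchronisation (safe only because nothing writes it again)
freezeBasePrimes :: IOUArray Int Int -> IO (UArray Int Int)
freezeBasePrimes = unsafeFreeze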
There's still a bit to be gained by eliminating more Integer calculations and writing code for GHC's optimiser (more worker/wrapper transformations, some static argument transformations), but not very much, maybe 10-15%.
The next big improvement is to be obtained by eliminating even numbers from the sieve. That reduces the work, allocation and running time by more than half. No prime sieve should ever consider even numbers, really; that's just a pointless waste of work.
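To make the index bookkeeping concrete, here is a self-contained toy (not page-segmented) odds-only sieve; the whole trick is the mapping between bit i and the odd number 2*i+3:

import Control.Monad (forM_)
import Data.Array.ST (newArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- toy odds-only sieve: bit i stands for the odd number 2*i+3, so even
-- numbers are never stored or touched and the array is half the size
oddPrimes :: Int -> [Int]
oddPrimes limit
  | limit < 2 = []
  | limit < 3 = [2]
  | otherwise = 2 : [2 * i + 3 | (i, False) <- assocs sieve]
  where
    lastIx = (limit - 3) `div` 2
    sqrtIx = (floor (sqrt (fromIntegral limit :: Double)) - 3) `div` 2
    sieve :: UArray Int Bool
    sieve = runSTUArray $ do
      arr <- newArray (0, lastIx) False
      forM_ [0 .. sqrtIx] $ \i ->        -- (also culls from composite i;
        let p = 2 * i + 3                --  harmless for a toy version)
        in forM_ [(p * p - 3) `div` 2, (p * p - 3) `div` 2 + p .. lastIx] $
             \j -> writeArray arr j True
      return arr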
This is a strange question/answer pair in that the accepted answer doesn't match the question: the question asks for help improving the speed of a sieve (correctly choosing a Page-Segmented Sieve of Eratosthenes implementation), but the accepted answer doesn't use a sieve at all; it is just a library that uses a numerical analysis technique. Although that is fine for finding the total number of primes up to a large range very quickly (and there are faster and broader versions for doing that in other languages, such as Kim Walisch's primecount in C++, which can also quickly calculate the sums of primes over a range), a sieve is useful for particular types of analysis such as finding prime gaps, the existence of prime doubles, triples, etc. (generally K-Tuple primes). In fact, general numerical analysis techniques such as the Meissel–Lehmer algorithm, upon which most of these are based, require a source of "seed primes" to start, which is best produced by an optimized Sieve of Eratosthenes.
In fact, Kim Walisch's primecount as per the above link has a GHC/Haskell API already built for it and can easily be called through the Foreign Function Interface (FFI), so it would be a better answer than the arithmoi library as it is faster. It is so fast that it is currently the record holder in calculating the number of primes up to 1e28! If one must make such a value available to a Haskell program and doesn't care about understanding how it is obtained, it calculates the number of primes to 1e11 in tens of milliseconds.
In a similar fashion, if a sieve is what is really required, then Kim Walisch's primesieve also has a GHC Haskell FFI and could also be called directly.
While using libraries gets the job done, one doesn't learn anything about how to implement them by just using them; hence Daniel Fischer's (DF's) very good tutorial answer and this follow-on series to what he started. DF's answer shows how to improve the question's code, but nowhere is there a summary showing what the code should look like after working through all of his suggestions. This matters all the more because the original question code in the OP's hpaste link has disappeared (mercifully, as it was only a good example of how not to do it, though perhaps the code should have been embedded in the question for reference), and we can only reconstruct what it did from DF's comments in his answer. This series of answers seeks to rectify that in case someone has a need for such a sieve in pure GHC Haskell, starting with a summary of the code to which DF's teaching leads and following that up with further staged improvements.
TLDR; Jump to the end of the last of my answers for posted Haskell code that is almost as fast as, or as fast as, Kim Walisch's primesieve, which is probably the fastest in the world up to ranges of ten to a hundred billion or so, and above that is beaten only slightly by an extremely optimized version of YAFU, which may be up to about 5% faster for large ranges. DF's final code before wheel factorization is stated to be about 40 times faster than the original question code; I extend that to where my code is 30 to 32 times faster yet again, for a total of about 1200 to 1280 times faster than the original question code!
The original question code
This will be the only time I reference it, since it is no longer available (and in my opinion wasn't worth modifying anyway). The only thing I liked about that code was the implementation of a thread pool, but even that was flawed: it used mapM to feed the entire job queue to a channel for processing by the threads, which is a non-deferred operation that could potentially push a huge amount of work onto the job channel, consuming a lot of memory, rather than just pushing enough work to keep all the threads busy and then feeding one more job for every one returned from the results Channel. I correct that in my code at the bottom of this answer and in the follow-on codes. In fact, only a results MVar pool is necessary, as the GHC runtime forks new threads faster than it takes to pipe new work to a waiting pool.
One problem with both the original code and DF's improved code is that neither of them used the "-fllvm" compiler flag to use the LLVM back end code generator. For tight loops such as we are trying to write here, LLVM can reduce the time per loop by as much as a factor of about two. It didn't matter for the original code, whose loops were so non-tight that LLVM couldn't help, but DF's code does have tight loops and would have benefited, with the loop time reduced to about 60 per cent.
The other problem with the use of MVar's (and thus Chan's) is that they aren't particularly fast, with an overhead of about three milliseconds per activation. We had evidence of this problem in DF's final analysis, where he said "using four threads, I get 4.55 seconds with the immutable array versus 5.3 seconds with the mutable array" as compared to the 3.75 seconds using two threads. Now, his machine had only two cores and the extra two threads were Hyper-Threaded, sharing most of the same resources as the other two, so one doesn't expect much better performance from using them, but one doesn't expect worse performance either, as here. The reason is that there is so much overhead that adding the inefficient threads actually adds extra work and slows the final result. I also see this on my four "real" thread/core machine after increasing efficiency by using the LLVM back end: I only get a reduction in time to about 55% when using all four cores, and in line with that, when I use only two cores the total execution time actually increases. Since MVar's are the only way we can implement "wait for result" using GHC, the solution is to make the work slices much larger (more coarse-grained multi-threading) so this overhead becomes a negligible fraction, which is part of my algorithmic improvements in my second answer.
There is also no need to use channels as infinite-depth places to receive work and return results once the overloading problem is fixed, so I eliminated them in favour of just a "round robin" array which has as many elements as there are processes in the pool, as sketched below.
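Here is a simplified sketch of that scheme in plain IO (the real code further down returns a lazy list via unsafePerformIO instead; all names here are hypothetical): one MVar slot per worker, each refilled with the next job the moment its result is taken, so at most n jobs are ever in flight:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- run jobs through n worker slots round-robin, keeping at most n in flight
roundRobin :: Int -> (job -> IO result) -> [job] -> IO [result]
roundRobin n run jobs = do
  let (now, later) = splitAt n jobs
  slots <- mapM (\j -> do mv <- newEmptyMVar; start mv j; return mv) now
  collect (length jobs) (zip (cycle slots) (map Just later ++ repeat Nothing))
  where
    start mv j = () <$ forkIO (run j >>= putMVar mv)
    collect 0 _ = return []
    collect k ((mv, next) : rest) = do
      r <- takeMVar mv                    -- wait for the oldest in-flight job
      maybe (return ()) (start mv) next   -- refill the slot straight away
      (r :) <$> collect (k - 1) rest
    collect _ [] = return []              -- unreachable; pattern totality only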
The test environments
I'm not sure if DF still has his Sandy Bridge laptop, and although I have had a Sandy Bridge CPU, it is currently down for maintenance. However, the online IDE website Wandbox uses a Broadwell CPU of almost exactly the same rating as DF's machine used in his answer, at 2.3 GHz with a turbo boost to 2.9 GHz for single-threaded use, with two cores/four threads (Hyper-Threaded). It has the same performance as DF's machine, as proven by my taking his referenced internals of the "arithmoi" library and forcing them to run for a range of a billion: this Wandbox link shows it running at almost exactly the same 2.92 seconds as he mentioned in his answer. I did not bother to count the result (which takes only about 0.01 milliseconds using pop count) as that would not change the comparison, but only forced it to run over the range with a 128 Kilobyte buffer size, the default.
So, in Wandbox we have a readily referenced comparable machine; however, it has the limitation that it does not support the use of the LLVM back end, which is important for optimization of the tight culling loops we will be using. Accordingly, I will be doing comparisons with and without LLVM on my own machine, which is an Intel Skylake i5-6500 at 3.2 GHz with a boost to 3.6 GHz for single-threaded use. There is a slight limitation in this in that the results will not scale directly by clock speed, because Skylake has a further architectural improvement: better branch prediction and elision of branches down to as low as zero time when they are correctly predicted. As the code we are developing spends almost all of its time in tight loops, this can make a ten to fifteen percent reduction in execution time for the sieve implementations.
The principles behind a fast Sieve algorithm
These principles are just two, as follows:
1. It is important to keep the total number of operations low for a given sieving range.
2. Each operation must take a small number of CPU clock cycles on average.
The end execution efficiency is then the product of these two: for example, at the 0.2587 operations per unit of range discussed below and 3.5 CPU clocks per operation, sieving to a billion on a 2.9 GHz CPU would take about 0.2587e9 * 3.5 / 2.9e9, or roughly 0.31 seconds.
The performance target
DF seems to think that Atkin and Bernstein's "primegen" is a "gold standard" in sieve performance. It is not, for the major reason that it does not and cannot take a small number of CPU clock cycles per operation on average (principle 2), and the number of cycles it consumes increases with range faster than what I regard as the real "gold standard" - Kim Walisch's primesieve as referenced in the TLDR. It is absolutely true that this implementation of the Sieve of Atkin (SoA) passes principle 1 above: it quickly converges to a constant number of operations of 0.2587 times the range for ranges as low as about 100, and at a range of about 1e11 this is less than the best practical implementations of a Maximally Wheel Factorized Sieve of Eratosthenes as per the combo sieve here, as estimated by the formulas above that point on the web page (0.2984 for 1e11, higher with increasing range). However, it does not live up to expectations as to efficiency. The comparison made by the SoA document against the Sieve of Eratosthenes (SoE) is flawed: in the same download as the code for "primegen" is the code for "eratspeed", a reference version of the SoE that sieves over about a billion. However, they crippled that reference version by limiting it to the same 2/3/5 wheel factorization as is baked into the SoA (which can't be increased) instead of using the Maximally Wheel Factorized combo sieve (for which there is evidence in the file that they knew of it). This made the number of operations over this range a little over 400 million as compared to the SoA's roughly 258.7 million. Next, it appears that they further crippled the reference SoE by making its sieve buffer smaller than that used for the SoA, so as to increase the time per operation of the SoE to about that of the SoA, in spite of those operations being simpler than those of the SoA. In this way they claimed that the SoA was about 40% faster than the SoE.
Bernstein did some hand optimizations to the tight inner operation loop for both of these in a similar way, perhaps because the C compilers of the day could not fully optimize these loops; in order for those compilers not to undo these hand optimizations, he states in the notes that the compiler should be run with only the first level of optimization. That is no longer true for today's "gcc", as the performance of both is increased with "-O3" high-level optimizations. If both are set to 8192 32-bit words (32 Kilobytes) of buffer and compiled with the optimizations as above, they both sieve to a billion in about 0.49/0.50 seconds on the DF-type machine, indicating that the number of CPU cycles per cull is about 36 per cent less for "eratspeed". If the Maximum Wheel Factorization combo principles were applied, it should be about yet another 40% faster, and this is before doing optimizations on the tight inner culling loop; those optimizations are not possible with the SoA because it must use a variable-span-per-operation loop as compared to the fixed span per loop of the SoE. This reference "eratspeed" SoE implementation will be discussed further down the answer(s), as this is the approach that leads to my improved algorithm answer.
As a final note about the SoA "primegen": Bernstein seems to believe that the sieve buffer needs to be limited to less than the CPU L1 cache size. While that may have been true for the CPU's on which he developed this work, it is no longer true for modern CPU's, whose L2 cache performance is much faster than the SoA needs and close to the reference SoE tight inner loop time. Thus, if one makes the sieve buffer equal to the CPU L2 cache size (256 Kilobytes in the case of these CPU's), the time to sieve to a billion changes almost not at all at about 0.50 seconds for "primegen", but the time to sieve to 100 billion almost scales linearly, to about 52 seconds (as it should). This works because the increased buffer means the SoA isn't plagued so quickly by operation span overflows, but it doesn't fix the problem; it just moves it further up the range, and the SoA still won't be faster than a maximally optimized SoE at even the highest practical ranges.
For reference, a "primesieve" type of algorithm sieves to a billion in about 0.18 seconds for a range of a billion and about 25 seconds to 100 billion when single threaded with both times reduced by a factor of about two using two threads on the Wandbox/DF range of CPU.
State of DF's final results of his answer and proposed further work
DF stated that his final answer code would sieve to a billion in about 7.5/3.75 seconds when single-threaded/run on two threads, respectively. This represents about 2.5514 billion operations, and at a CPU clock of 2.9 GHz (7.5 s times 2.9e9 cycles/s is about 21.8 billion cycles over 2.5514 billion operations) that is about 9 CPU clocks per cull (CpC). This is not good, as a basic culling loop should take about 3.5 CpC; this is the result of not using LLVM, as discussed above.
He suggested that the first further improvement would be wheel factorization by odds-only for an improvement in speed of about 2.5 times, and that further improvements using more extended wheel factorization are reasonably easily possible. This is true. However, his attempt to do extended wheel factorization in the primes function of the "arithmoi" library is very much a failure: 2.92 seconds to do 404.8 million culls at 2.9 GHz is about 21 CpC, just as the 340 seconds to do the 46.07 billion culls to sieve to 100 billion is also abysmal at about the same clocks per cull. This is so slow that there is no reason for this extended 2/3/5 wheel factorization, as the result will be the same or slower than if one just used odds-only even at 9 CpC. The reason for this terrible efficiency is that he uses some complex, and thus slow, mathematics to do the reduction in culls for wheel factorization, and those computations take a lot of machine time. There are Look Up Table ways of doing this that are about twice as fast at about 12 CPU clock cycles per cull, but they are still too slow for use in such small ranges; their use should be limited to extending the efficient range for very large ranges, where the percentage of the time they take is a small part of the overall time.
To show how bad these results are, here is a reference Wandbox JavaScript odds-only version sieving to a billion in about 2.14 seconds, or about 6.15 CpC; this runs on my Skylake machine in 1.54 seconds, with the reduction beyond the ratio in clock rates due to the architectural improvements mentioned above, for 5.4 CpC.
Further, in Haskell, here is an odds-only version from my submission to RosettaCode, the second faster version, which runs on the reference Wandbox CPU in 2.26 seconds sieving to a billion compiled without LLVM, for about 6.4 CpC. On my Skylake machine, this runs in 1.83/1.023 seconds without/with LLVM respectively (6.4/3.6 CpC) and sieves to 100 billion in about 210/127 seconds without/with LLVM (6.7/4.0 CpC). Note that these are faster than the "arithmoi" library's wheel factorized version. This will form the basis for further algorithmic improvements in my second answer.
So, the following code is an odds-only algorithm which performs as per DF mentions as his final answer:
-- Multi-threaded Page-Segmented Bit-Packed Odds-Only Sieve of Eratosthenes...
-- "Running a modern CPU single threaded is like
-- running a race car on one cylinder" me ...
-- compile with "-threaded" to use maximum available cores and threads...
-- compile with "-fllvm" for highest speed by a factor of up to two times.
{-# LANGUAGE FlexibleContexts, ScopedTypeVariables #-} -- , BangPatterns, MagicHash, UnboxedTuples, Strict
{-# OPTIONS_GHC -O2 -fllvm #-} -- or -O3 -keep-s-files -fno-cse -rtsopts
import Data.Int ( Int32, Int64 )
import Data.Word ( Word32, Word64 )
import Data.Bits ( (.&.), (.|.), shiftL, shiftR, popCount )
import Data.Array.Base (
UArray(..), listArray, assocs, unsafeAt, elems,
STUArray(..), newArray,
unsafeRead, unsafeWrite,
unsafeThaw, unsafeFreezeSTUArray, castSTUArray )
import Data.Array.ST ( runSTUArray )
import Control.Monad.ST ( ST, runST )
import Data.Time.Clock.POSIX ( getPOSIXTime )
-- imports to do with multi-threading...
import Data.Array (Array)
import Control.Monad ( forever, when )
import GHC.Conc ( getNumProcessors )
import Control.Monad.Cont ( join )
import Control.Concurrent
( ThreadId,
forkIO,
getNumCapabilities,
myThreadId,
setNumCapabilities )
import Control.Concurrent.MVar ( MVar, newEmptyMVar, putMVar, takeMVar )
import System.IO.Unsafe ( unsafePerformIO )
type Prime = Word64
type PrimeNdx = Int64
type StartAddr = Int32
type StartAddrArr = UArray Int StartAddr
type BasePrimeRep = Word32
type BasePrimeRepArr = UArray Int BasePrimeRep
type SieveBuffer = UArray Int Bool -- no point to artificial index!
-- constants related to odds-only...
cWHLPRMS :: [Prime]
cWHLPRMS = [2] -- excludes even numbers other than 2
cFRSTSVPRM :: Prime
cFRSTSVPRM = 3 -- start at first prime past the wheel prime(s)
makeSieveBuffer :: Int -> SieveBuffer
{-# INLINE makeSieveBuffer #-}
makeSieveBuffer szbts = runSTUArray $ do
newArray (0, szbts - 1) False
-- count the remaining un-marked composite bits using very fast popcount...
{-# INLINE countSieveBuffer #-}
countSieveBuffer :: Int -> SieveBuffer -> Int
countSieveBuffer lstndx sb = runST $ do
cmpsts <- unsafeThaw sb -- :: ST s (STUArray s PrimeNdx Bool)
wrdcmpsts <-
(castSTUArray :: STUArray s Int Bool ->
ST s (STUArray s Int Word64)) cmpsts
let lstwrd = lstndx `shiftR` 6
let lstmsk = 0xFFFFFFFFFFFFFFFE `shiftL` (lstndx .&. 63)
let loopwi wi cnt =
if wi < lstwrd then do
v <- unsafeRead wrdcmpsts wi
case cnt - popCount v of
ncnt -> ncnt `seq` loopwi (wi + 1) ncnt
else do
v <- unsafeRead wrdcmpsts lstwrd
return $ fromIntegral (cnt - popCount (v .|. lstmsk))
loopwi 0 (lstwrd * 64 + 64)
cWHLPTRNLEN64 :: Int
cWHLPTRNLEN64 = 2048
cWHLPTRN :: SieveBuffer -- twice as big to allow for overflow...
cWHLPTRN = makeSieveBuffer (131072 + 131072)
-- could be faster using primitive copyByteArray#...
-- in preparation for filling with pre-cull pattern...
fillSieveBuffer :: PrimeNdx -> SieveBuffer -> SieveBuffer
fillSieveBuffer lwi sb@(UArray _ _ rng _) = runSTUArray $ do
ptrn <- unsafeThaw cWHLPTRN :: ST s (STUArray s Int Bool)
ptrnu64 <- (castSTUArray :: STUArray s Int Bool ->
ST s (STUArray s Int Word64)) ptrn
cmpsts <- unsafeThaw sb :: ST s (STUArray s Int Bool)
cmpstsu64 <- (castSTUArray :: STUArray s Int Bool ->
ST s (STUArray s Int Word64)) cmpsts
let lmt = rng `shiftR` 6
lwi64 = lwi `shiftR` 6
loop i | i >= lmt = return cmpsts
| otherwise =
let mdlo = fromIntegral $ lwi64 `mod` fromIntegral cWHLPTRNLEN64
sloop j
| j >= cWHLPTRNLEN64 = loop (i + cWHLPTRNLEN64)
| otherwise = do
v <- unsafeRead ptrnu64 (mdlo + j)
unsafeWrite cmpstsu64 (i + j) v; sloop (j + 1) in sloop 0
loop 0
cullSieveBuffer :: PrimeNdx -> [BasePrimeRepArr] -> SieveBuffer -> SieveBuffer
cullSieveBuffer lwi bpras sb@(UArray _ _ rng _) = runSTUArray $ do
cmpsts <- unsafeThaw sb :: ST s (STUArray s Int Bool)
let limi = lwi + fromIntegral rng - 1
loopbpras [] = return cmpsts -- stop warning incomplete pattern match!
loopbpras (bpra@(UArray _ _ bprrng _) : bprastl) =
let loopbpi bpi
| bpi >= bprrng = loopbpras bprastl
| otherwise =
let bp = unsafeAt bpra bpi
bpndx = (fromIntegral bp - cFRSTSVPRM) `shiftR` 1
rsqri = fromIntegral ((bpndx + bpndx) * (bpndx + cFRSTSVPRM)
+ cFRSTSVPRM) - lwi in
if rsqri >= fromIntegral rng then return cmpsts else
let bpint = fromIntegral bp
bppn = fromIntegral bp
cullbits c | c >= rng = loopbpi (bpi + 1)
| otherwise = do unsafeWrite cmpsts c True
cullbits (c + bpint)
s = if rsqri >= 0 then fromIntegral rsqri else
let r = fromIntegral (-rsqri `rem` bppn)
in if r == 0 then 0 else fromIntegral (bppn - r)
in cullbits s in loopbpi 0
loopbpras bpras
-- multithreading goes here...
{-# NOINLINE cNUMPROCS #-}
cNUMPROCS :: Int -- force to the maximum number of threads available
cNUMPROCS = -- 1
-- {-
unsafePerformIO $ do -- no side effects because global!
np <- getNumProcessors; setNumCapabilities np
getNumCapabilities
--}
-- list of culled sieve buffers from the start index with the given bit size...
makePrimePagesFrom :: forall r. PrimeNdx -> Int ->
(PrimeNdx -> SieveBuffer -> r) -> Bool -> [r]
makePrimePagesFrom stwi szbts cnvrtrf thrdd =
-- great, we can make an extra thread pool whenever we might need more, and
-- it should die and be collected whenever this goes out of scope!
let bpras = makeBasePrimeRepArrs thrdd
jbparms() =
let loop lwi szb =
(lwi, szb) : loop (lwi + fromIntegral szb) szb
in loop stwi szbts in
if thrdd then
let
{-# NOINLINE strttsk #-}
strttsk lwi szbts bpras mvr = -- do some strict work but define it non-strictly,
forkIO $ do -- else it will run in foreground before threading!
-- and return it using a MVar; force strict execution in thread...
putMVar mvr $! cnvrtrf lwi $ cullSieveBuffer lwi bpras
$ fillSieveBuffer lwi $ makeSieveBuffer szbts
-- start a result pool, initialized to start with the first tasks...
{-# NOINLINE rsltpool #-}
rsltpool :: Array Int (MVar r) = unsafePerformIO $! do
mvlst <- mapM (const newEmptyMVar) [ 1 .. cNUMPROCS ] -- unique copies
mapM_ (\ (mvr, (lwi, szb)) -> strttsk lwi szb bpras mvr)
$ zip mvlst $ jbparms()
return $! listArray (0, cNUMPROCS - 1) mvlst
-- lazily loop over the entire job list...
loop (fdhd : fdtl) =
let {-# NOINLINE getnxt #-}
getnxt ((lwi, szb), i) = unsafePerformIO $! do -- wait for and get result of next page
let mvr = unsafeAt rsltpool i
r <- takeMVar mvr -- recycle mvr for next
strttsk lwi szb bpras mvr; return $! r
in getnxt fdhd : loop fdtl
-- lazily cycle over the rest of the jobs forever...
in rsltpool `seq` loop $ zip (drop cNUMPROCS $ jbparms())
(cycle [ 0 .. cNUMPROCS - 1 ]) else
-- back to non multi-threaded functions...
let loop ((lwi, szb) : jbpmstl) =
(cnvrtrf lwi . cullSieveBuffer lwi bpras . fillSieveBuffer lwi .
makeSieveBuffer) szb : loop jbpmstl
in loop $ jbparms()
makeBasePrimeRepArrs :: Bool -> [BasePrimeRepArr]
makeBasePrimeRepArrs thrdd =
let sb2bpra :: PrimeNdx -> SieveBuffer -> BasePrimeRepArr
sb2bpra lwi sb@(UArray _ _ rng _) =
let len = countSieveBuffer (rng - 1) sb
bpbs = fromIntegral cFRSTSVPRM + fromIntegral (lwi + lwi) in
listArray (0, len - 1) [ bpbs + fromIntegral (i + i) |
(i, False) <- assocs sb ]
fkbpras = [ sb2bpra 0 $ makeSieveBuffer 512 ]
bpra0 = sb2bpra 0 $ cullSieveBuffer 0 fkbpras $ makeSieveBuffer 131072
in bpra0 : makePrimePagesFrom 131072 131072 sb2bpra thrdd
-- result functions are here...
-- prepends the wheel factorized initial primes to the sieved primes output...
-- somewhat faster not using higher-order functions, but still slow, so who cares?
primes :: Int -> Bool -> [Prime]
primes szbts thrdd = cWHLPRMS ++ concat prmslsts where
-- convert a list of sieve buffers to a UArray of primes...
sb2prmsa :: PrimeNdx -> SieveBuffer -> UArray Int Prime
sb2prmsa lwi sb@(UArray _ _ rng _) = -- bsprm `seq` loop 0 where
let bsprm = cFRSTSVPRM + fromIntegral (lwi + lwi)
len = countSieveBuffer (rng - 1) sb in
bsprm `seq` len `seq`
listArray (0, len - 1)
[ bsprm + fromIntegral (i + i) | (i, False) <- assocs sb ]
prmslsts = map elems $ makePrimePagesFrom 0 szbts sb2prmsa thrdd
-- count the primes from the sieved page list to the limit...
countPrimesTo :: Prime -> Int -> Bool -> Int64
countPrimesTo limit szbts thrdd =
let lmtndx = fromIntegral $ (limit - cFRSTSVPRM) `shiftR` 1 :: PrimeNdx
sb2cnt lwi sb@(UArray _ _ rng _) =
let nlwi = lwi + fromIntegral rng in
if nlwi < lmtndx then (countSieveBuffer (rng - 1) sb, nlwi)
else (countSieveBuffer (fromIntegral (lmtndx - lwi)) sb, nlwi)
loop [] cnt = cnt
loop ((cnt, nxtlwi) : cntstl) ocnt =
if nxtlwi > lmtndx then ocnt + fromIntegral cnt
else loop cntstl $ ocnt + fromIntegral cnt
in if limit < cFRSTSVPRM then
if limit < 2 then 0 else 1
else loop (makePrimePagesFrom 0 szbts sb2cnt thrdd) 1
-- test it...
main :: IO ()
main = do
let limit = 10^9 :: Prime
-- page segmentation sized for most efficiency;
-- fastest with CPU L1 cache size but more address calculation overhead;
-- a little slower with CPU L2 cache size but just about enough to
-- cancel out the gain from reduced page start address calculations...
let cSIEVEPGSZ = (2^18) * 8 :: Int -- CPU L2 cache size in bits
let threaded = True
putStrLn $ "There are " ++ show cNUMPROCS ++ " threads available."
strt <- getPOSIXTime
-- let answr = length $ takeWhile (<= limit) $ primes cSIEVEPGSZ threaded -- slow way
let answr = countPrimesTo limit cSIEVEPGSZ threaded -- fast way
stop <- answr `seq` getPOSIXTime -- force evaluation of answr before stop time!
let elpsd = round $ 1e3 * (stop - strt) :: Int64
putStr $ "Found " ++ show answr
putStr $ " primes up to " ++ show limit
putStrLn $ " in " ++ show elpsd ++ " milliseconds."
This has been refactored from my RosettaCode submission referenced above by making it possible to have different sieve buffer sizes for the main sieve loop and the secondary base prime feed loop, as well as adding multi-threading (improved over DF's, as discussed above). It runs at about the same CpC speed as DF mentions for his final answer on a machine equivalent to his without LLVM (about 3/1.5 seconds, 9 CpC, to one billion) and runs at one second to a billion/125 seconds to 100 billion (3.7/4.1 CpC, respectively) on my Skylake machine with LLVM single-threaded, and about half of those times when multi-threaded, limited by not being "coarse-grained" enough, as explained above.
This answer is only a factor of less than two faster than DF's code, mostly due to the recommended use of the LLVM back end.
This answer presumes that the questioner just wanted a count of the primes to a limit by the fastest means possible and didn't know that a direct Sieve is not the fastest way of doing this, which explains why the prime counting function from the "arithmoi" library was chosen as the accepted answer. However, as analytic techniques go, the "arithmoi" prime counting function isn't the best, so this answer makes it possible to access faster methods.
The fastest currently available library to count the primes to a limit is Kim Walisch's C++ primecount repository, which can count the primes to 1e11 in about 8 milliseconds single-threaded on an Intel i5-6500 (3.6 GHz single threaded boost clock rate), about 1.5 seconds to 1e15, and about 99 seconds to 1e18, all using the fastest Xavier Gourdon algorithm, with all of these times reduced by approximately a factor of four when run multi-threaded (four cores). At this rate, this library can calculate the number of primes to the 64-bit number range (18446744073709551615) in about 163 seconds multi-threaded.
If one needs to make this computational ability and its results available to Haskell, one can call the primecount library from GHC Haskell using FFI, as will be described in the remainder of this answer. The primecount library is much better than the counting function from "arithmoi" as provided in the accepted answer because it uses a much better algorithm (Xavier Gourdon's, of about 2000, with corrections and tuning by Kim Walisch), has greatly reduced use of RAM that is much less than proportional to the square root of the counting range, is based on a much better underlying Sieve of Eratosthenes implementation, and is multi-threaded so as to effectively use all the cores of a given computer. It holds the world record in being able to count the primes to a range of 1e29 (currently).
So the steps to call Kim Walisch's primecount from GHC Haskell are as follows:
Download the source code from the GitHub repo for your selected version from the "releases" link in either .zip or .tar.gz format (I used version 7.4).
Install a version of gcc/g++, cmake, and make onto your machine as per the detailed build instructions (using MSYS2 on Windows, which also needs to be downloaded and installed).
Decompress the above downloaded file to whatever location using the utilities available on your machine (install 7zip if you don't already have a decompression program).
Open a terminal inside the resulting outer uncompressed folder level ("primecount-7.4" in my case).
Type the following command into the terminal (or copy and paste from here) followed by a Enter/Carriage Return: cmake . -DBUILD_SHARED_LIBS=ON -DBUILD_STATIC_LIBS=OFF
Type the following command followed by an Enter/Carriage Return to compile and link the target files: make -j
Make a GHC Haskell source code folder in a location of your choice (I named mine PrimeCount).
Copy or move the following files from the above primecount folder to this GHC Haskell source code folder: libprimecount.so.7.4 (for example, with the last digits the version number) and libprimecount.so.7; on Windows these will have a .dll extension rather than .so and there may be no other shortcut files. Then, if necessary, create the soft sym link file in the source folder with the following terminal command: ln -s ./libprimecount.so.7 libprimecount.so (creating a Windows shortcut can be done more graphically by opening the source folder in File Explorer, right-clicking the destination .dll file, selecting "create shortcut", then renaming the shortcut to the name of the target dll without the version numbers, if the Windows shared libraries have associated version numbers).
Copy the file "primecount.h" from the subfolder "include" to your GHC Haskell source code folder.
Create a file called, say, PrimeCountFFI.hs inside the GHC Haskell source code folder with the following contents:
-- this shows how to call Kim Walisch's C++ primecount...
-- compile with "ghc PrimeCountFFI -lprimecount -L."
{-# OPTIONS_GHC -O2 #-}
import System.Environment ( getArgs )
import Data.Int ( Int64, Int32 )
import Data.Word ( Word32 )
import Foreign.C.String ( CString, withCString, peekCString )
-- Pseudo FFI CTypes
type CInt = Int32
type CWord = Word32
{- 128-bit prime counting function.
- Count the number of primes <= x using Xavier Gourdon's
- algorithm. Uses all CPU cores by default.
-
- @param x Null-terminated string integer e.g. "12345".
- Note that x must be <= primecount_get_max_x() which is
- 10^31 on 64-bit systems and 2^63-1 on 32-bit systems.
- @param res Result output buffer.
- #param len Length of the res buffer. The length must be sufficiently
- large to fit the result, 32 is always enough.
- @return Returns -1 if an error occurs, else returns the number
- of characters (>= 1) that have been written to the
- res buffer, not counting the terminating null character.
-
- Run time: O(x^(2/3) / (log x)^2)
- Memory usage: O(x^(1/3) * (log x)^3) -}
foreign import ccall unsafe "primecount.h primecount_pi_str"
primeCountStr :: CString -> CString -> CWord -> IO CInt
-- Get the currently set number of threads
foreign import ccall unsafe "primecount.h primecount_get_num_threads"
getNumThreads :: IO CInt
main :: IO ()
main = do
input <- head <$> getArgs :: IO String
(rslt, answr) <- withCString input $ \ ip ->
withCString "01234567890123456789012345678901" $ \ rslt -> do
rtn <- primeCountStr ip rslt 32
answrstr <- peekCString rslt
return (rtn, answrstr)
numthrds <- getNumThreads
if rslt < 0 then error "error in computation!!!" else do
putStrLn $ "There are " ++ answr ++ " primes up to "
++ input ++ " using "
++ show numthrds ++ " threads."
Open a terminal inside the GHC Haskell source code folder and compile the program with the following command: ghc -lprimecount -L. PrimeCountFFI.
For some UNIX-like Operating Systems, one may have to tell the system to look for the linked shared library in the current directory rather than only in the standard install locations; this can be done for the current terminal session with the following shell command followed by Enter/the return key: export LD_LIBRARY_PATH=$PWD, and the search path will remain valid as long as the terminal session is open. You can run the program from here with a command such as one of the following equivalent example calls: ./PrimeCountFFI 1000000000000 or ./PrimeCountFFI 10^12 or ./PrimeCountFFI 10**12 or ./PrimeCountFFI 1e12 to find the count of primes up to the given limit, in this case returning the following:
There are 37607912018 primes up to 1e12 using 4 threads.
which will take just a few milliseconds (1e14 takes about 150 milliseconds).
Now let's think about how pointless this exercise has been, in that we may as well be using the primecount console application directly, in which case we have command line help and all kinds of options available for tracking the time, status, and the prime counting algorithm used, as well as being able to tweak the algorithms and turn multi-threading off. Being able to call the primecount library from Haskell as shown here would only be useful if one wanted to do other processing on the results in Haskell, as in using arguments generated from some other algorithm and/or passing the results for further processing in yet another algorithm.
Also, in just calling a library function whether it be primecount or the arithmoi function, we don't learn anything about how these functions work or what it takes to code the various prime counting algorithms.
Given the relative speeds of this test machine (an Intel i5-6500 at 3.6 GHz with single-threaded boost) as compared to Daniel Fischer's test machine and that of the accepted answer using arithmoi, the arithmoi prime counting function should take about 100 milliseconds to count the primes to 1e11, and that is about what it takes on this machine at about 116 milliseconds, increasing to 446 milliseconds for 1e12, 2111 milliseconds for 1e13, 10535 milliseconds for 1e14, and 52263 milliseconds for 1e15, which is about the maximum range that can be used due to the high memory use. This compares to 468 milliseconds for 1e15 on this machine using Kim Walisch's primecount, and 1629 milliseconds even single-threaded, making arithmoi over 30 times slower as well as of limited use due to the high memory use.
I think that the arithmoi function is so slow because it uses a not very efficient Sieve of Eratosthenes as its base and also uses multi-precision Integers in far more places than it needs to, especially given that this program would never be used to count the number of primes higher than the 64-bit number range due to the high memory use.
It seems to me that there are much easier ways to be able to count the primes to these ranges even within the memory limitations of the arithmoi counting function, which I will show in my next answer.
I think another answer is required, as the currently accepted answer only uses a "black box" library function with little explanation or understanding of how it works, and my own answer calling Kim Walisch's primecount through FFI is not much better as to explanation, although about 30 times faster single-threaded and faster again by the number of effective CPU cores due to multi-threading. This answer seeks to teach how prime counting functions work using the simplest of fast algorithms, which is faster than the accepted answer's use of the arithmoi library although obviously not as fast as the specialized algorithms of primecount.
First, the arithmoi prime counting function is not as fast nor as elegant as it could be for the following reasons:
It uses the Meissel prime counting technique (not Meissel-Lehmer as stated), which is highly dependent on the speed of sieving, yet the implementation of sieving isn't all that fast, using too much math in calculating internal addresses for the wheel factorization, such that it is no faster than an odds-only page-segmented Sieve of Eratosthenes (SoE) and likely slower.
It uses a non-recursive implementation of the "Phi" calculation, which is fine, but unlike usual implementations that use the Meissel algorithm to reduce the memory use, it uses memory proportional to the square root of the counting range rather than the cube root of the counting range.
An elegant implementation does not store all of the "P2" quotients (requiring an additional array of about O(n^(1/2)/log(n^(1/2))) times eight bytes of storage) but instead runs two sieving operations, one to produce the quotients and one to process them.
The essential parts of the arithmoi files that get used by the prime counting function span multiple files and total about 1000 Lines of Code (LoC), so this very obscure and complex code will be hard to comprehend for a programmer who isn't also a mathematician in the prime counting field. Now, I could translate the JavaScript implementation of the Lagarias, Miller, and Odlyzko (LMO) algorithm in another answer and also add the further sieving improvements mentioned in that answer; it should then run at about the same speed as the LMO mode of primecount, only a few times slower than the fastest algorithm (depending on the counting range), but again that would approach 1000 LoC and would be difficult for programmers not accustomed to the field to comprehend as a first prime counting project.
This answer seeks to teach how prime counting functions work using the simpler Legendre algorithm that predates the Meissel work (which was a follow-on to Legendre), and will be shown to be faster than the arithmoi prime counting function, although it will still use RAM proportional to the square root of the counting range (the general characteristic of algorithms of the Legendre type, as opposed to the Meissel type).
This Legendre algorithm is well known to the competitive programming communities, who use it as often the fastest of prime counting functions up to this tested limit of 1e11, although many competitive programmers miscall it the Meissel-Lehmer algorithm (just as the arithmoi prime counting function is misnamed). I have posted a Nim language contribution to [the RosettaCode task for Legendre prime counting](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions) (the last of those versions) that explains how it works by using partial sieving and which could be translated to other languages including Haskell (it was originally translated from C++); however, it still has very high memory use of about eight times the square root of the counting range in bytes, so about 0.8 Gigabytes to count the number of primes to 1e16, which is quite high.
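Before diving into the optimized code, it may help to see the bare Legendre identity that the partial sieving implements: pi(n) = phi(n, a) + a - 1 with a = pi(sqrt n), where phi(x, a) counts the numbers 1 <= k <= x with no prime factor among the first a primes, via the recurrence phi(x, a) = phi(x, a-1) - phi(x div p_a, a-1). A deliberately naive sketch (exponential without memoization; for exposition only, with hypothetical names):

-- naive Legendre prime counting; e.g. legendrePi 100 == 25
legendrePi :: Int -> Int
legendrePi n
  | n < 2 = 0
  | otherwise = phi n a + a - 1
  where
    rt = floor (sqrt (fromIntegral n :: Double))
    ps = sievePrimes rt            -- the "seed primes" up to sqrt n
    a = length ps
    phi x 0 = x
    phi x k = phi x (k - 1) - phi (x `div` (ps !! (k - 1))) (k - 1)

-- trial-division seed primes; fine for a sketch
sievePrimes :: Int -> [Int]
sievePrimes m = go [2 .. m]
  where
    go [] = []
    go (p : xs) = p : go [x | x <- xs, x `mod` p /= 0]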
I highly recommend reading the text from the above linked RosettaCode "partial sieving" Legendre prime counting function implementation article if one wants to understand the technique. The following Haskell code is a translation of the Nim code from that article, with a modification to the algorithm as discussed below:
{-# OPTIONS_GHC -O2 -fllvm #-}
{-# LANGUAGE FlexibleContexts, BangPatterns #-}
import Data.Time.Clock.POSIX ( getPOSIXTime ) -- for timing
import Data.Int ( Int64, Int32 )
import Data.Bits ( shiftL, shiftR, (.&.), (.|.) )
import Control.Monad ( forM_, when )
import Control.Monad.ST (ST, runST)
import Data.Array.Base ( STUArray(..), unsafeAt,
castSTUArray, unsafeFreezeSTUArray,
MArray(unsafeNewArray_, unsafeRead, unsafeWrite) )
range :: Int64
range = 10^(11 :: Int)
primeCount :: Int64 -> Int64
primeCount n =
if n < 3 then (if n < 2 then 0 else 1) else
let
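-- division via Double is faster than 64-bit integer division on typical
-- hardware and is safe at this range since the operands fit well within a
-- Double's 53-bit mantissa...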
{-# INLINE divide #-}
divide :: Int64 -> Int64 -> Int
divide nm d = truncate $ (fromIntegral nm :: Double) / fromIntegral d
{-# INLINE half #-}
half :: Int -> Int
half x = (x - 1) `shiftR` 1
rtlmt = floor $ sqrt (fromIntegral n :: Double)
mxndx = (rtlmt - 1) `div` 2
(!nbps, !nrs, !smalls, !roughs, !larges) = runST $ do
-- becomes `smalls` LUT -> the current counts of odd primes to index...
mss <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int32)
let msscst =
castSTUArray :: STUArray s Int Int32 -> ST s (STUArray s Int Int64)
mdss <- msscst mss -- for use in adjusting the counts LUT
forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mss i (fromIntegral i)
-- becomes `roughs` LUT -> the current "k-roughs" for base prime sieved...
mrs <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int32)
forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mrs i (fromIntegral i * 2 + 1)
-- becomes `larges` LUT -> the current count of odd primes indexed for
-- the inverse of the current "k-roughs" in the table above...
mls <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int64)
forM_ [ 0 .. mxndx ] $ \ i ->
let d = fromIntegral (i + i + 1)
in unsafeWrite mls i (fromIntegral (divide n d - 1) `div` 2)
cmpsts <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Bool)
-- partial sieves to quad root of counting range, adjusting and
-- accumulating LUT's so that the overall current results are
-- accumulated to the `mls`/`larges` array...
-- also outputs `cbpi`/`nbps` is the number of base prime sieved and
-- `rlmti`/`nrs` is the effective size of the "k-roughs" sized LUT's...
let loop i !cbpi !rlmti =
let sqri = (i + i) * (i + 1) in
if sqri > mxndx then do
fss <- unsafeFreezeSTUArray mss
frs <- unsafeFreezeSTUArray mrs
fls <- unsafeFreezeSTUArray mls
return (cbpi, rlmti + 1, fss, frs, fls)
else do
v <- unsafeRead cmpsts i
if v then loop (i + 1) cbpi rlmti else do
unsafeWrite cmpsts i True -- cull current bp so not a "k-rough"!
let bp = i + i + 1
-- partial cull by current base prime...
cull c = if c > mxndx then return () else do
unsafeWrite cmpsts c True; cull (c + bp)
-- adjust `mls` array for current partial sieve;
-- also adjusts effective sizes of `mrs` and `mls`...
part ri nri = -- old "rough" index to new one...
if ri > rlmti then return (nri - 1) else do
r <- unsafeRead mrs ri -- "rough" always odd!
t <- unsafeRead cmpsts (fromIntegral r `shiftR` 1)
if t then part (ri + 1) nri else do -- skip newly culled
olv <- unsafeRead mls ri
let m = fromIntegral r * fromIntegral bp
-- split -> when multiple <= square root:
-- quotient `n / m` will be at least the square root so
-- the `mls` index will be found by indexing `mss`
-- (adjusted by the current number of bp's not in `mls`)...
adjv <- if m <= fromIntegral rtlmt then do
let ndx = fromIntegral m `shiftR` 1
sv <- unsafeRead mss ndx
unsafeRead mls (fromIntegral sv - cbpi)
-- else quotient will be less than square root so
-- quotient can be directly indexed from `mss`...
else do
sv <- unsafeRead mss (half (divide n m))
return (fromIntegral sv)
-- move "rough" and new "large" values to new places:
-- adjv includes number base primes already in `olv`
unsafeWrite mls nri (olv - (adjv - fromIntegral cbpi))
unsafeWrite mrs nri r; part (ri + 1) (nri + 1)
!pm0 = ((rtlmt `div` bp) - 1) .|. 1 -- max base prime mult
-- adjust `mss` counting table for current partial sieve;
-- for array range to `lmti`; prime multiple to `pm`...
-- adjust 64-bits at a time where possible for speed...
adjc lmti pm =
if pm < bp then return () else do
c <- unsafeRead mss (pm `shiftR` 1)
let ac = c - fromIntegral cbpi -- correction
bi = (pm * bp) `shiftR` 1 -- start array index
adj si = if si > lmti then adjc (bi - 1) (pm - 2)
else do ov <- unsafeRead mss si
unsafeWrite mss si (ov - ac)
adj (si + 1)
ac64 = fromIntegral ac :: Int64
dac = (ac64 `shiftL` 32) .|. ac64
dbi = (bi + 1) `shiftR` 1
dlmti = (lmti - 1) `shiftR` 1
dadj dsi = if dsi > dlmti then return ()
else do dov <- unsafeRead mdss dsi
unsafeWrite mdss dsi (dov - dac)
dadj (dsi + 1)
when (bi .&. 1 /= 0) $ do
ov <- unsafeRead mss bi
unsafeWrite mss bi (ov - ac)
dadj dbi
when (lmti .&. 1 == 0) $ do
ov <- unsafeRead mss lmti
unsafeWrite mss lmti (ov - ac)
adjc (bi - 1) (pm - 2)
cull sqri; nrlmti <- part 0 0; adjc mxndx pm0
loop (i + 1) (cbpi + 1) nrlmti
loop 1 0 mxndx
!ans0 = unsafeAt larges 0 - -- combine all counts; each includes nbps...
sum [ unsafeAt larges i | i <- [ 1 .. nrs - 1 ] ]
-- adjust for all the base prime counts subtracted above...
!adj = (nrs + 2 * (nbps - 1)) * (nrs - 1) `div` 2
!adjans0 = ans0 + fromIntegral adj
-- add counts for base primes above quad root counting range
-- to cube root counting range multiplied by rough primes above
-- the base prime as long as the quotient of `n` divided by the
-- multiple is greater than the base prime; counts indexed by
-- the quotient as above...
-- since all `roughs` are now prime, the multiple will always be
-- just two primes so the compensation will always be added;
-- also, the product will always be > the square root of the range so
-- the quotient will always be less than the square root of the range and
-- only the `smalls` count LUT needs be used (second case from above loop).
loopr ri !acc =
if ri >= nrs then acc else
let r = fromIntegral (unsafeAt roughs ri)
q = n `div` r
lmtsi = half (fromIntegral (q `div` r))
lmti = fromIntegral (unsafeAt smalls lmtsi) - nbps
addcnt pi !ac =
if pi > lmti then ac else
let p = fromIntegral (unsafeAt roughs pi)
ci = half (fromIntegral (divide q p))
in addcnt (pi + 1) (ac + fromIntegral (unsafeAt smalls ci))
in if lmti <= ri then acc else
-- adjust for the `nbps`'s over added in the `smalls` counts...
let !adj = fromIntegral ((lmti - ri) * (nbps + ri - 1))
in loopr (ri + 1) (addcnt (ri + 1) acc - adj)
in loopr 1 adjans0 + 1 -- add one for the only even prime, two!
main :: IO ()
main = do
strt <- getPOSIXTime
let rslt = primeCount range
stop <- rslt `seq` getPOSIXTime -- force evaluation of rslt before stop time!
let elpsd = round $ 1e3 * (stop - strt) :: Int64
putStrLn $ "Found " ++ show rslt ++ " primes to " ++
show range ++ " in " ++ show elpsd ++ " milliseconds."
Unfortunately, even when using the LLVM GHC back-end, before the algorithm modification the above code ran about twenty to twenty-five percent slower than the Nim or C++ code from which it was translated, even when the C/C++ is compiled with clang, which also has an LLVM back-end. This is primarily due to the current (2022) GHC Haskell compiler not emitting LLVM code that optimizes to the forms of SIMD vector instructions the GCC C/C++ compiler can produce, but also likely due to some optimization problem with the LLVM back-end, as an equivalent translation to Rust, which also uses an LLVM back-end, is even slower. Also, the memory use is very high, just as mentioned above for the arithmoi function, at about eight Gigabytes of RAM to count the primes to 1e18 (which would take about an hour to complete, instead of under two minutes using the fastest algorithm of primecount single-threaded).
The code was modified to speed up the "smalls" count LUT adjustment by 64-bits per loop rather than 32-bits for a gain of about ten percent so that it is only about ten percent slower than the C++ code compiled with clang/LLVM. Manually using GHC SIMD vector operations was also tried but without any further gain, likely because the memory addressing of GHC Haskell SIMD vectors isn't as efficient as the optimizations made by the C++ compilers; it may be that using GHC Haskell SIMD address offset (GHC equivalent to pointers) operations would make it faster, but for an extra small benefit it was a lot of work to try as it would require allocating the "smalls" array with pinning.
The above code is better than the arithmoi prime counting function in several respects, as follows:
It is small at only about 100 LoC for the counting function without comments and fairly easy to understand (with the explanation from the RosettaCode contribution).
It is almost five times faster than the arithmoi code while using about the same RAM memory and likely would be five times faster if manually using the GHC SIMD address offset primitives.
It is so much simpler than the arithmoi code as to the sieving algorithm because sieving is a negligible part of the overall execution time for a prime counting function of the Legendre type so can use a simple sieve as in the odds-only SoE used here.
There are some improvements that could be made to the algorithm as follows:
Page-segmentation of the culling buffer would mean that the "counts" LUT wouldn't be one huge array but would be proportional to the size of the page segments.
The memory use of the above code can be reduced by different use of LUT's: using the "splitting" technique from LMO would make the "roughs" and "larges" LUT's unnecessary, although there would be a "primes" table of the primes up to the square root of the counting range, which would be about O(n^(1/2)/log n) in size - reduced by the extra log term; this array could be encoded as "delta" values from the previous prime to reduce the size by a further factor of four. The large "smalls" LUT would be essentially reduced to just the size of each page segment when page segmented as mentioned above. However, an additional table of the "special leaves" roots would be necessary, which would be the square root of the counting range in size unless compressed by the "log n" factor, but compressing would require a sort and cost some time and wouldn't save all that much space, since it would need to include the base prime factors in each element. In short, it isn't easy to reduce the space needed by any algorithm of the Legendre type to below being proportional to the square root of the counting range, and not by all that much (a single "log n" factor plus some small constant factors).
Using page-segmentation would make it relatively easy to convert to a multi-threaded algorithm.
The above improvements will come at the cost of some extra code complexity, but not as much as the arithmoi code or LMO implementations, because the much simpler sieving remains adequate. I have not included a version with these improvements, as if one were to proceed with this amount of work, one may as well implement LMO and enjoy the benefits of memory use proportional to the cube root of the counting range, which would make it practically useful to the 64-bit counting range and higher. The main potential advantage of LMO over this function isn't so much speed but the reduced use of RAM as counting ranges increase, which is perhaps one of the main reasons LMO was invented, given the limited RAM available on computers of the time; the main disadvantage of LMO as compared to this function is that it won't be much faster, if at all (even for larger counting ranges), without the ultimate in sieving algorithms; with an "ordinary" odds-only page-segmented sieve implementation, it won't be any faster and perhaps slower than this Legendre implementation.

Resources