Haskell parMap performance?

Haskell parMap performance? - haskell

I was trying to bench parMap vs map with a very simple example:
import Control.Parallel.Strategies
import Criterion.Main
sq x = x^2
a = whnf sum $ map sq [1..1000000]
b = whnf sum $ parMap rseq sq [1..1000000]
main = defaultMain [
bench "1" a,
bench "2" b
]
My results seem to indicate zero speedup from parMap and I was wondering why this might be?
benchmarking 1
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
time 177.7 ms (165.5 ms .. 186.1 ms)
0.997 R² (0.992 R² .. 1.000 R²)
mean 185.1 ms (179.9 ms .. 194.1 ms)
std dev 8.265 ms (602.3 us .. 10.57 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 2
time 182.7 ms (165.4 ms .. 199.5 ms)
0.993 R² (0.976 R² .. 1.000 R²)
mean 189.4 ms (181.1 ms .. 195.3 ms)
std dev 8.242 ms (5.896 ms .. 10.16 ms)
variance introduced by outliers: 14% (moderately inflated)

The problem is that parMap sparks a parallel computation for each individual list element. It doesn't chunk the list at all as you seem to think from your comments—that would require the use of the parListChunk strategy.
So parMap has high overheads, so the fact that each spark simply squares one number means that its cost is overwhelmed by that overhead.

Related

How to (efficiently) find the biggest prime factor of a number using Haskell?

I am trying to practice Haskell by solving some of the tasks on Project Euler. In Problem 3, we have to find the biggest prime factor of the number 600851475143, which I had done before in Java a few years back.
I came up with the following:
primes :: [Int]
primes = sieve [2..]
where sieve (p:xs) = p : sieve (filter (\x -> x `rem` p /= 0) xs)
biggestPrimeFactor :: Int -> Int
biggestPrimeFactor 1 = 0
biggestPrimeFactor x =
if x `elem` takeWhile (< x + 1) primes
then x
else last (filter (\y -> x `rem` y == 0) (takeWhile (< x `div` 2) primes))
which works great for smaller numbers, but is terribly inefficient and as a result doesn't work well on the number I have been given.
This seems obvious, because the program iterates over all primes smaller than the number divided by 2 (if it isn't prime itself), but I am unsure what to do about it. Ideally I would be able to further restrict the possible checks, but I don't know how to accomplish this.
Note that I am not looking for an "optimal solution", but rather one that is at least moderately efficient for bigger numbers, and simple to understand and implement, as I am still a beginner in Haskell.

You have two main sources of slowness here. The easier one to address is the boundary condition in biggestPrimeFactor. Checking up to p > x `div` 2 is asymptotically worse than checking up to p^2 > x. But even that is very suboptimal when a number has a lot of factors. The largest factor may be far smaller than sqrt x. If you continually reduce the target number as you find factors, you can account for this and speed up the processing of random inputs by quite a lot.
Here's an example of that, including Daniel Wagner's note from the comments:
-- Naive trial division against a list of primes. Doesn't do anything
-- intelligent when asked to factor a number less than 2.
factorsNaive :: [Integer] -> Integer -> [Integer]
factorsNaive primes#(p : ps) x
| p * p > x = [x]
| otherwise = case x `quotRem` p of
(q, 0) -> p : factorsNaive primes q
_ -> factorsNaive ps x
A few notes:
I decided to have the primes list passed in. This is relevant in the next section, but it also allowed me to write this without a helper.
I specialized to Integer instead of Int because I wanted to throw big numbers at it without caring what maxBound :: Int is. This is slower, but I decided to default to correctness first.
I removed a traversal of the input list. Doing it in one pass is a bit more efficient, but mostly it's cleaner.
Strictly speaking, this is correct even if the input list contains non-primes, so long as the list starts at 2, is monotonically non-decreasing, and eventually contains every prime.
Note that when it recurses, it either discards a prime or produces one. It never will do both at the same time. This is an easy way to ensure it doesn't miss repeated factors.
I named this factorsNaive just to make it clear that it's not doing anything clever with number theory. There are very many things that could be done which are far more complex than this, but this is a good stopping point for understandable factoring of relatively small numbers...
Or at least it is okay at factoring as long as you have a convenient list of prime numbers. It turns out this is the second major cause of slowdown in your code. Your list of prime numbers is slow to generate as it gets longer.
Your definition of primes essentially stacks a bunch of filters on an input list. Every prime produced must go through a filter test for each previous prime. This might sound familiar - it's at least O(n^2) work to generate the first n primes. (It's actually more because division gets more costly as numbers get bigger, but let's ignore that for now.) It's a known (to mathematicians, I had to look it up to be sure) result that the number of primes less than or equal to n approaches n/ln n as n gets large. That approaches linear as n gets large, so generating the list of primes up to n approaches O(n^2) as n gets big.
(Yes, that argument is a mess. A formal version of it is presented in Melissa O'Neill's paper "The Genuine Sieve of Eratosthenes". Refer to it for much more rigorous argumentation of the result.)
It's possible to write much more efficient definitions of primes that have both better constant factors and better asymptotics. As that's the entire point of the paper mentioned in the parenthetical above, I won't go into the details too far. I'll just point out the very first possible optimization:
-- trial division. let's work in Integer for predictable correctness
-- on positive numbers
trialPrimes :: [Integer]
trialPrimes = 2 : sieve [3, 5 ..]
where
sieve (p : ps) = p : sieve (filter (\x -> x `rem` p /= 0) ps)
This does less than you might think. It doesn't double the speed, as the performance improvement is eventually outweighed by the filter stack mentioned before. This version only removes one filter from that stack, but at least it's the filter that rejects the most inputs in the initial version.
In ghci (no compilation or optimizations, and those can really make a difference), this was fast enough to factor the product of two five-digit primes in a few seconds.
ghci> :set +s
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(5.98 secs, 4,103,321,840 bytes)
Numbers with several small factors are handled much faster. Also notice that because the list of primes is a top-level binding, calculations are cached. Running the computation again has the list of primes pre-computed now.
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(0.01 secs, 6,934,688 bytes)
That also shows that the run time is absolutely dominated by generating the list of primes. The naive factorization is almost instant at that scale when the list of primes is already in memory.
But you shouldn't really trust performance of interpreted code.
main :: IO ()
main = print (factorsNaive trialPrimes $ 84761 * 60821)
gives
carl#DESKTOP:~/hask/2023$ ghc -O2 -rtsopts factor.hs
[1 of 2] Compiling Main ( factor.hs, factor.o )
[2 of 2] Linking factor
carl#DESKTOP:~/hask/2023$ ./factor +RTS -s
[60821,84761]
1,884,787,896 bytes allocated in the heap
32,303,080 bytes copied during GC
89,072 bytes maximum residency (2 sample(s))
29,400 bytes maximum slop
7 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 326 colls, 0 par 0.021s 0.021s 0.0001s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0004s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.523s ( 0.522s elapsed)
GC time 0.021s ( 0.022s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 0.545s ( 0.550s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,603,678,988 bytes per MUT second
Productivity 96.0% of total user, 94.8% of total elapsed
That dropped the run time from six seconds to a half-second. (Yeah, +RTS -s is pretty verbose for this, but it's quick and easy.) I think this is a reasonable place to stop with beginner-level code.
If you want to look into more efficient prime generation, the primes package on hackage contains an implementation of the algorithm in O'Neill's paper and an implementation of naive factoring that's equivalent to the one here.

Benchmarking IO action with Criterion

I want to know how long it takes my program to read a 12.9MB .wav file into memory. The function that reads a file into memory looks as follows:
import qualified Data.ByteString as BS
getSamplesFromFileAsBS :: FilePath -> IO (BS.ByteString)
It takes the name of the file and returns the samples as a ByteString. It also performs some other validity checks on the data and ignores the header information. I read the ByteString of samples into memory using ByteString.hGet.
If I now benchmark this function with a 12.9MB file, using Criterion:
bencher :: FilePath -> IO ()
bencher fp = defaultMain [
bench "Reading all the samples from a file." $ nfIO (getSamplesFromFileAsBS fp)
]
I get the following result:
benchmarking Reading all the samples from a file.
time 3.617 ms (3.520 ms .. 3.730 ms)
0.989 R² (0.981 R² .. 0.994 R²)
mean 3.760 ms (3.662 ms .. 3.875 ms)
std dev 354.0 μs (259.9 μs .. 552.5 μs)
variance introduced by outliers: 62% (severely inflated)
It seems to load 12.9MB into memory in 3.617ms. This doesn't seem realistic since it indicates that my SSD can read 3+GB/s, which is not the case at all. What am I doing wrong?
I decided to try this another (more naive) way, by manually measuring the time difference:
runBenchmarks :: FilePath -> IO ()
runBenchmarks fp = do
start <- getCurrentTime
samplesBS <- getSamplesFromFileAsBS fp
end <- samplesBS `deepseq` getCurrentTime
print (diffUTCTime end start)
This gives me the following result: 0.023105s. This is realistic because it would mean my SSD can read at a speed of around 600MB/s. What is wrong with the Criterion result?

I looked at the visual results of my Criterion benchmark by writing the output to a html file. I could clearly see that the first run took around 0.020s, whereas the rest (after caching) takes around 0.003s.
So I'm getting these results because of caching.

Lazy IO + Parallelism: converting an image to grayscale

I am trying to add parallelism to a program that converts a .bmp to a grayscale .bmp. I am seeing usually 2-4x worse performance for the parallel code. I am tweaking parBuffer / chunking sizes and still cannot seem to reason about it. Looking for guidance.
The entire source file used here: http://lpaste.net/106832
We use Codec.BMP to read in a stream of pixels represented by type RGBA = (Word8, Word8, Word8, Word8). To convert to grayscale, simply map a 'luma' transform across all the pixels.
The serial implementation is literally:
toGray :: [RGBA] -> [RGBA]
toGray x = map luma x
The test input .bmp is 5184 x 3456 (71.7 MB).
The serial implementation runs in ~10s, ~550ns/pixel. Threadscope looks clean:
Why is this so fast? I suppose it has something with lazy ByteString (even though Codec.BMP uses strict ByteString--is there implicit conversion occurring here?) and fusion.
Adding Parallelism
First attempt at adding parallelism was via parList. Oh boy. The program used ~4-5GB memory and system started swapping.
I then read "Parallelizing Lazy Streams with parBuffer" section of Simon Marlow's O'Reilly book and tried parBuffer with a large size. This still did not produce desirable performance. The spark sizes were incredibly small.
I then tried to increase the spark size by chunking the lazy list and then sticking with parBuffer for the parallelism:
toGrayPar :: [RGBA] -> [RGBA]
toGrayPar x = concat $ (withStrategy (parBuffer 500 rpar) . map (map luma))
(chunk 8000 x)
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = as : chunk n bs where
(as,bs) = splitAt (fromIntegral n) xs
But this still does not yield desirable performance:
18,934,235,760 bytes allocated in the heap
15,274,565,976 bytes copied during GC
639,588,840 bytes maximum residency (27 sample(s))
238,163,792 bytes maximum slop
1910 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35277 colls, 35277 par 19.62s 14.75s 0.0004s 0.0234s
Gen 1 27 colls, 26 par 13.47s 7.40s 0.2741s 0.5764s
Parallel GC work balance: 30.76% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 4480 (2240 converted, 0 overflowed, 0 dud, 2 GC'd, 2238 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.31s ( 14.75s elapsed)
GC time 33.09s ( 22.15s elapsed)
EXIT time 0.01s ( 0.12s elapsed)
Total time 47.41s ( 37.02s elapsed)
Alloc rate 1,323,504,434 bytes per MUT second
Productivity 30.2% of total user, 38.7% of total elapsed
gc_alloc_block_sync: 7433188
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1017408
How can I better reason about what is going on here?

You have a big list of RGBA pixels. Why don't you use parListChunk with a reasonable chunk size?

Optimizing Conduit pipelines

I'm currently benchmarking my program to see whether I can improve its performance. Currently my program will take an input file and run some algorithm to split it into multiple files.
It takes roughly 14s to split a file into 3 parts, with -O2 compilation flag for both library and executable.
ghc-options: -Wall -fno-warn-orphans -O2 -auto-all
It looks like it is spending approximately 60% of its time in sinkFile, and I'm wondering whether there is anything I can do to improve the following code.
-- | Get the sink file, a list of FilePaths and the share number of the file to output to.
idxSinkFile :: MonadResource m
=> [FilePath]
-> Int
-> Consumer [Word8] m ()
idxSinkFile outFileNames shareNumber =
let ccm = CC.concatMap $ flip atMay shareNumber
cbs = CC.map BS.singleton
sf = sinkFile (outFileNames !! shareNumber)
in ccm =$= cbs =$= sf
-- | Generate a sink which will take a list of bytes and write each byte to its corresponding file share
sinkMultiFiles :: MonadResource m
=> [FilePath]
-> [Int]
-> Sink [Word8] m ()
sinkMultiFiles outFileNames xs =
let len = [0..length xs - 1]
in getZipSink $ otraverse_ (ZipSink . idxSinkFile outFileNames) len
Here are the output of GHC's profiling:
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
splitFile.sink HaskSplit.Conduit.Split 289 1 0.0 0.0 66.8 74.2
sinkMultiFiles HaskSplit.Conduit.Split 290 1 27.4 33.2 66.8 74.2
idxSinkFile HaskSplit.Conduit.Split 303 3 7.9 11.3 39.4 41.0
idxSinkFile.ccm HaskSplit.Conduit.Split 319 3 3.1 3.6 3.1 3.6
idxSinkFile.cbs HaskSplit.Conduit.Split 317 3 3.5 4.2 3.5 4.2
idxSinkFile.sf HaskSplit.Conduit.Split 307 3 24.9 21.9 24.9 21.9
sinkMultiFiles.len HaskSplit.Conduit.Split 291 1 0.0 0.0 0.0 0.0
Which shows sinkFile taking a lot of time. (I've benchmarked the list access etc in case you're wondering and they have 0% of processing)
While I understand for a small program like this IO is often the bottleneck, I'd like to see if I can improve the runtime performance of my program.
Cheers!

Following nh2's advice, I decided to pack the ByteStrings in 256 byte chunks instead of doing a BS.singleton on each Word8 instance.
cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
instead of
cbs = CC.map BS.singleton
and I'm able to reduce the running time as well as the memory usage quite significantly, as demonstrated below:
Original Run
total time = 194.37 secs (194367 ticks # 1000 us, 1 processor)
total alloc = 102,021,859,892 bytes (excludes profiling overheads)
New Run, with CL.take
total time = 35.88 secs (35879 ticks # 1000 us, 1 processor)
total alloc = 21,970,152,800 bytes (excludes profiling overheads)
That's some serious improvement! I'd like to optimize it more but that's for another question :)

New state of the art in unlimited generation of Hamming sequence

(this is exciting!) I know, the subject matter is well known. The state of the art (in Haskell as well as other languages) for efficient generation of unbounded increasing sequence of Hamming numbers, without duplicates and without omissions, has long been the following (AFAIK - and by the way it is equivalent to the original Edsger Dijkstra's solution, too):
hamm :: [Integer]
hamm = 1 : map (2*) hamm `union` map (3*) hamm `union` map (5*) hamm
where
union a#(x:xs) b#(y:ys) = case compare x y of
LT -> x : union xs b
EQ -> x : union xs ys
GT -> y : union a ys
The question I'm asking is, can you find the way to make it more efficient in any significant measure? Is it still the state of the art or is it in fact possible to improve this to run twice faster?
If your answer is yes, please show the code and discuss its speed and empirical orders of growth in comparison to the above (it runs at about ~ n1.05…1.10 for the first few hundreds of thousands of numbers produced). Also, if it exists, can this efficient algorithm be extended to producing a sequence of smooth numbers with any given set of primes?
(clarification: I'm not asking about the much faster direct generation of an nth Hamming number, but rather generating all first n numbers in the sequence.)

If a constant factor(1) speedup counts as significant, then I can offer a significantly more efficient version:
hamm :: [Integer]
hamm = mrg1 hamm3 (map (2*) hamm)
where
hamm5 = iterate (5*) 1
hamm3 = mrg1 hamm5 (map (3*) hamm3)
merge a#(x:xs) b#(y:ys)
| x < y = x : merge xs b
| otherwise = y : merge a ys
mrg1 (x:xs) ys = x : merge xs ys
You can easily generalise it to smooth numbers for a given set of primes:
hamm :: [Integer] -> [Integer]
hamm [] = [1]
hamm [p] = iterate (p*) 1
hamm ps = foldl' next (iterate (q*) 1) qs
where
(q:qs) = sortBy (flip compare) ps
next prev m = let res = mrg1 prev (map (m*) res) in res
merge a#(x:xs) b#(y:ys)
| x < y = x : merge xs b
| otherwise = y : merge a ys
mrg1 (x:xs) ys = x : merge xs ys
It's more efficient because that algorithm doesn't produce any duplicates and it uses less memory. In your version, when a Hamming number near h is produced, the part of the list between h/5 and h has to be in memory. In my version, only the part between h/2 and h of the full list, and the part between h/3 and h of the 3-5-list needs to be in memory. Since the 3-5-list is much sparser, and the density of k-smooth numbers decreases, those two list parts need much less memory that the larger part of the full list.
Some timings for the two algorithms to produce the kth Hamming number, with empirical complexity of each target relative to the previous, excluding and including GC time:
k Yours (MUT/GC) Mine (MUT/GC)
10^5 0.03/0.01 0.01/0.01 -- too short to say much, really
2*10^5 0.07/0.02 0.02/0.01
5*10^5 0.17/0.06 0.968 1.024 0.06/0.04 1.199 1.314
10^6 0.36/0.13 1.082 1.091 0.11/0.10 0.874 1.070
2*10^6 0.77/0.27 1.097 1.086 0.21/0.21 0.933 1.000
5*10^6 1.96/0.71 1.020 1.029 0.55/0.59 1.051 1.090
10^7 4.05/1.45 1.047 1.043 1.14/1.25 1.052 1.068
2*10^7 8.73/2.99 1.108 1.091 2.31/2.65 1.019 1.053
5*10^7 21.53/7.83 0.985 1.002 6.01/7.05 1.044 1.057
10^8 45.83/16.79 1.090 1.093 12.42/15.26 1.047 1.084
As you can see, the factor between the MUT times is about 3.5, but the GC time is not much different.
(1) Well, it looks constant, and I think both variants have the same computational complexity, but I haven't pulled out pencil and paper to prove it, nor do I intend to.

So basically, now that Daniel Fischer gave his answer, I can say that I came across this recently, and I think this is an exciting development, since the classical code was known for ages, since Dijkstra.
Daniel correctly identified the redundancy of the duplicates generation which must then be removed, in the classical version.
The credit for the original discovery (AFAIK) goes to Rosettacode.org's contributor Ledrug, as of 2012-08-26. And of course the independent discovery by Daniel Fischer, here (2012-09-18).
Re-written slightly, that code is:
import Data.Function (fix)
hamm = 1 : foldr (\n s -> fix (merge s . (n:) . map (n*))) [] [2,3,5]
with the usual implementation of merge,
merge a#(x:xs) b#(y:ys) | x < y = x : merge xs b
| otherwise = y : merge a ys
merge [] b = b
merge a [] = a
It gives about 2.0x - 2.5x a speedup vs. the classical version.

Well this was easier than I thought. This will do 1000 Hammings in 0.05 seconds on my slow PC at home. This afternoon at work and a faster PC times of less than 600 were coming out as zero seconds.
This take Hammings from Hammings. It's based on doing it fastest in Excel.
I was getting wrong numbers after 250000, with Int. The numbers grow very big very fast, so Integer must be used to be sure, because Int is bounded.
mkHamm :: [Integer] -> [Integer] -> [Integer] -> [Integer]
-> Int -> (Integer, [Int])
mkHamm ml (x:xs) (y:ys) (z:zs) n =
if n <= 1
then (last ml, map length [(x:xs), (y:ys), (z:zs)])
else mkHamm (ml++[m]) as bs cs (n-1)
where
m = minimum [x,y,z]
as = if x == m then xs ++ [m*2] else (x:xs) ++ [m*2]
bs = if y == m then ys ++ [m*3] else (y:ys) ++ [m*3]
cs = if z == m then zs ++ [m*5] else (z:zs) ++ [m*5]
Testing,
> mkHamm [1] [2] [3] [5] 5000
(50837316566580,[306,479,692]) -- (0.41 secs)
> mkHamm [1] [2] [3] [5] 10000
(288325195312500000,[488,767,1109]) -- (1.79 secs)
> logBase 2 (1.79/0.41) -- log of times ratio =
2.1262637726461726 -- empirical order of growth
> map (logBase 2) [488/306, 767/479, 1109/692] :: [Float]
[0.6733495, 0.6792009, 0.68041545] -- leftovers sizes ratios
This means that this code's run time's empirical order of growth is above quadratic (~n^2.13 as measured, interpreted, at GHCi prompt).
Also, the sizes of the three dangling overproduced segments of the sequence are each ~n^0.67 i.e. ~n^(2/3).
Additionally, this code is non-lazy: the resulting sequence's first element can only be accessed only after the very last one is calculated.
The state of the art code in the question is linear, overproduces exactly 0 elements past the point of interest, and is properly lazy: it starts producing its numbers immediately.
So, though an immense improvement over the previous answers by this poster, it is still significantly worse than the original, let alone its improvement as appearing in the top two answers.
12.31.2018
Only the very best people educate. #Will Ness also has authored or co-authored 19 chapters in GoalKicker.com “Haskell for Professionals”. The free book is a treasure.
I had carried around the idea of a function that would do this, like this. I was apprehensive because I thought it would be convoluted and involved logic like in some modern languages. I decided to start writing and was amazed how easy Haskell makes the realization of even bad ideas.
I've not had difficulty generating unique lists. My problem is the lists I generate do not end well. Even when I use diagonalization they leave residual values making their use unreliable at best.
Here is a reworked 3's and 5's list with nothing residual at the end. The denationalization is to reduce residual values not to eliminate duplicates which are never included anyway.
g3s5s n=[t*b|(a,b)<-[ (((d+n)-(d*2)), 5^d) | d <- [0..n]],
t <-[ 3^e | e <- [0..a+8]],
(t*b)<-(3^(n+6))+a]
ham2 n = take n $ ham2' (drop 1.sort.g3s5s $ 48) [1]
ham2' o#(f:fo) e#(h:hx) = if h == min h f
then h:ham2' o (hx ++ [h*2])
else f:ham2' fo ( e ++ [f*2])
The twos list can be generated with all 2^es multiplied by each of the 3s5s but when identity 2^0 is included, then, in total, it is the Hammings.
3/25/2019
Well, finally. I knew this some time ago but could not implement it without excess values at the end. The problem was how to not generate the excess that is the result of a Cartesian Product. I use Excel a lot and could not see the pattern of values to exclude from the Cartesian Product worksheet. Then, eureka! The functions generate lists of each lead factor. The value to limit the values in each list is the end point of the first list. When this is done, all Hammings are produced with no excess.
Two functions for Hammings. The first is a new 3's & 5's list which is then used to create multiples with the 2's. The multiples are Hammings.
h35r x = h3s5s x (5^x)
h3s5s x c = [t| n<-[3^e|e<-[0..x]],
m<-[5^e|e<-[0..x]],
t<-[n*m],
t <= c ]
a2r n = sort $ a2s n (2^n)
a2s n c = [h| b<-h35r n,
a<-[2^e| e<-[0..n]],
h<-[a*b],
h <= c ]
last $ a2r 50
1125899906842624
(0.16 secs, 321,326,648 bytes)
2^50
1125899906842624
(0.00 secs, 95,424 bytes
This is an alternate, cleaner & faster with less memory usage implementation.
gnf n f = scanl (*) 1 $ replicate f n
mk35 n = (\c-> [m| t<- gnf 3 n, f<- gnf 5 n, m<- [t*f], m<= c]) (2^(n+1))
mkHams n = (\c-> sort [m| t<- mk35 n, f<- gnf 2 (n+1), m<- [t*f], m<= c]) (2^(n+1))
last $ mkHams 50
2251799813685248
(0.03 secs, 12,869,000 bytes)
2^51
2251799813685248
5/6/2019
Well, I tried limiting differently but always come back to what is simplest. I am opting for the least memory usage as also seeming to be the fastest.
I also opted to use map with an implicit parameter.
I also found that mergeAll from Data.List.Ordered is faster that sort or sort and concat.
I also like when sublists are created so I can analyze the data much easier.
Then, because of #Will Ness switched to iterate instead of scanl making much cleaner code. Also because of #Will Ness I stopped using the last of of 2s list and switched to one value determining all lengths.
I do think recursively defined lists are more efficient, the previous number multiplied by a factor.
Just separating the function into two doesn't make a difference so the 3 and 5 multiples would be
m35 lim = mergeAll $
map (takeWhile (<=lim).iterate (*3)) $
takeWhile (<=lim).iterate (*5) $ 1
And the 2s each multiplied by the product of 3s and 5s
ham n = mergeAll $
map (takeWhile (<=lim).iterate (*2)) $ m35 lim
where lim= 2^n
After editing the function I ran it
last $ ham 50
1125899906842624
(0.00 secs, 7,029,728 bytes)
then
last $ ham 100
1267650600228229401496703205376
(0.03 secs, 64,395,928 bytes)
It is probably better to use 10^n but for comparison I again used 2^n
5/11/2019
Because I so prefer infinite and recursive lists I became a bit obsessed with making these infinite.
I was so impressed and inspired with #Daniel Wagner and his Data.Universe.Helpers I started using +*+ and +++ but then added my own infinite list. I had to mergeAll my list to work but then realized the infinite 3 and 5 multiples were exactly what they should be. So, I added the 2s and mergeAlld everything and they came out. Before, I stupidly thought mergeAll would not handle infinite list but it does most marvelously.
When a list is infinite in Haskell, Haskell calculates just what is needed, that is, is lazy. The adjunct is that it does calculate from, the start.
Now, since Haskell multiples until the limit of what is wanted, no limit is needed in the function, that is, no more takeWhile. The speed up is incredible and the memory lowered too,
The following is on my slow home PC with 3GB of RAM.
tia = mergeAll.map (iterate (*2)) $
mergeAll.map (iterate (*3)) $ iterate (*5) 1
last $ take 10000 tia
288325195312500000
(0.02 secs, 5,861,656 bytes)
6.5.2019
I learned how to ghc -02 So the following is for 50000 Hammings to 2.38E+30. And this is further proof my code is garbage.
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.000s ( 0.916s elapsed)
GC time 0.047s ( 0.041s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 0.047s ( 0.962s elapsed)
Alloc rate 0 bytes per MUT second
Productivity 0.0% of total user, 95.8% of total elapsed
6.13.2019
#Will Ness rawks. He provided a clean and elegant revision of tia above and it proved to be five times as fast in GHCi. When I ghc -O2 +RTS -s his against mine, mine was several times as fast. There had to be a compromise.
So, I started reading about fusion that I had encountered in R. Bird's Thinking Functionally with Haskell and almost immediately tried this.
mai n = mergeAll.map (iterate (*n))
mai 2 $ mai 3 $ iterate (*5) 1
It matched Will's at 0.08 for 100K Hammings in GHCi but what really surprised me is (also for 100K Hammings.) this and especially the elapsed times. 100K is up to 2.9e+38.
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.000s ( 0.002s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.000s ( 0.002s elapsed)
Alloc rate 0 bytes per MUT second
Productivity 100.0% of total user, 90.2% of total elapsed

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Haskell parMap performance? - haskell

Related

How to (efficiently) find the biggest prime factor of a number using Haskell?

Benchmarking IO action with Criterion

Lazy IO + Parallelism: converting an image to grayscale

Optimizing Conduit pipelines

New state of the art in unlimited generation of Hamming sequence

Categories

Resources