Significant overhead implementing Catalan numbers with histomorphism - haskell

I'm exploring recursion-schemes recently and want to find some use cases for histomorphism - for which I think Catalan numbers could be a fun one (I'm aware there are better ways to implement Catalan numbers, which are not the focus of this question). What I come up with is the following:
import Control.Comonad.Cofree
import Control.Monad
import Data.Foldable
import Data.Function.Memoize (memoFix)
import Data.Functor.Foldable
import GHC.Natural
type Nat = Natural
-- unrelated lines omitted
catalanHisto :: Nat -> Nat
catalanHisto = histo \case
Nothing ->
1
Just fs ->
let xs = toList fs -- this is line 101 in my original code.
ys = reverse xs
in sum $ zipWith (*) xs ys
catalanMemo :: Integer -> Integer
catalanMemo = memoFix \q n ->
if n == 0
then 1
else
let xs = fmap q [0 .. n -1]
ys = reverse xs
in sum $ zipWith (*) xs ys
main :: IO ()
main = do
-- print $ catalanMemo 1000
print $ catalanHisto 1000
Performance suffers however, with catalanHisto:
real 49.36s
user 416.48s
sys 99.38s
which is pretty bad comparing with catalanMemo:
real 0.84s
user 5.09s
sys 2.08s
Given that at least it terminates, the histo version definitely memoized something, but there is a huge overhead that I'm not sure whether I'm misusing histo, or it's just the price to pay for writing programs this way. As I went ahead with some basic profiling:
Sat Feb 19 22:58 2022 Time and Allocation Profiling Report (Final)
demo +RTS -N -s -p -RTS
total time = 20.78 secs (52462 ticks # 1000 us, 24 processors)
total alloc = 122,870,767,920 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
catalanHisto.\ Catalan src/Catalan.hs:(101,5)-(103,31) 68.0 71.5
foldMap.go Control.Comonad.Cofree src/Control/Comonad/Cofree.hs:301:5-46 28.4 25.0
catalanHisto Catalan src/Catalan.hs:(97,1)-(103,31) 1.7 0.0
catalanHisto.\.ys Catalan src/Catalan.hs:102:9-23 1.3 3.3
Not an expert interpreting those results, I guess in addition to some allocations in Control.Comonad.Cofree, it spents a majority of time doing allocation in the non-trivial branch of catalanHisto, probably due to toList and reverse, which I'm not sure how much room there is for optimization.

Related

GHC profiling file and chart are contradictory

I have a sieve of Eratosthenes program written in ST.Strict, and I was profiling it when I saw that it was taking a ridiculous amount of memory:
Sun Jul 10 18:27 2016 Time and Allocation Profiling Report (Final)
Primes +RTS -hc -p -K1000M -RTS 10000000
total time = 2.32 secs (2317 ticks # 1000 us, 1 processor)
total alloc = 5,128,702,952 bytes (excludes profiling overheads)
(where 10^7) is the amount of primes I asked it to generate.
Weirdly, the profiling graph shows something completely different:
Am I misreading something in one of these graphs? Or is there something wrong with one of these tools?
For reference, my code is
{-# LANGUAGE BangPatterns #-}
import Prelude hiding (replicate, read)
import qualified Text.Read as T
import Data.Vector.Unboxed.Mutable(replicate, write, read)
import Control.Monad.ST.Strict
import Data.STRef
import Control.Monad.Primitive
import Control.Monad
import System.Environment
main = print . length . primesUpTo . T.read . head =<< getArgs
primesUpTo :: Int -> [Int]
primesUpTo n = runST $ do
primes <- replicate n True
write primes 0 False
write primes 1 False
sieve 2 primes
return []
-- Removed to avoid the memory allocation of creating the list for profiling purposes
-- filterM (read primes) [0..n-1]
where
sieve !i primes | i * i >= n = return primes
sieve !i primes = do
v <- read primes i
counter <- newSTRef $ i * i
when v $ whileM_ ((< n) <$!> readSTRef counter) $ do
curr_count <- readSTRef counter
write primes curr_count False
writeSTRef counter (curr_count + i)
sieve (i + 1) primes
whileM_ :: (Monad m) => m Bool -> m a -> m ()
whileM_ condition body = do
cond <- condition
when cond $ do
body
whileM_ condition body
This seems to confuse many people.
total alloc = 5,128,702,952 bytes (excludes profiling overheads)
This is literally the total size of all the allocations ever performed by your program, including "temporary" objects that become dead almost immediately after being allocated. Allocation itself is nearly free, and generally Haskell programs allocate at a rate of around 1-2 GB/s.
Weirdly, the profiling graph shows something completely different:
Indeed, the profiling graph shows the total size of all the objects that are live on the heap at any particular time. This reflects the space usage of your program. If your program runs in constant space, then the number shown in this graph will stay constant.

Multithreaded Subset Sum with Haskell Parallel Strategies

I am attempting to use Parallel.Strategies to parallelize my subset sum solver, and need a little help understanding what is happening.
The Problem
Find the subset of numbers in numbers :: [Int] which sum to 100000000.
Single-threaded solution:
import Data.List (find)
import Data.Maybe (fromJust, isJust)
numbers = [14920416,14602041,14088921,13371291,13216099,12153625,10896437
,10884343,10228468,10177453,9998564,9920883,9511265,8924305
,8452302,8103727,7519471,7043381,7028847,6418450,6222190,6215767
,6190960,5514135,4798322,3823984,3247980,837289] :: [Int]
subsequencesOfSize :: Int -> [Int] -> [[Int]]
subsequencesOfSize n xs = let l = length xs
in if n>l then [] else subsequencesBySize xs !! (l-n)
where
subsequencesBySize [] = [[[]]]
subsequencesBySize (x:xs) = let next = subsequencesBySize xs
in zipWith (++) ([]:next) (map (map (x:)) next ++ [[]])
subsetSum :: [Int] -> Int -> Maybe [Int]
subsetSum seq n = find ((==target) . sum) (subsequencesOfSize n seq)
where target = 100000000
solve = map (subsetSum numbers) [n,n-1 .. 1]
where n = (length numbers)
main = do
print $ fromJust $ find isJust solve
Parallel Strategy
Since I am calculating the subsets of size n separately already, I figured I could just use parMap to spawn calculation of each list of subsets of size n simultaneously. I replaced the map in the solve function like this:
import Control.Parallel.Strategies
solve = parMap rpar (subsetSum numbers) [n,n-1 .. 1]
where n = (length numbers)
Single Core
newproblem +RTS -p -N1 -RTS
total time = 35.05 secs (35047 ticks # 1000 us, 1 processor)
total alloc = 22,628,052,232 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
subsetSum Main 86.6 24.5
subsequencesOfSize.subsequencesBySize Main 11.0 75.5
solve Main 2.4 0.0
Two Cores
newproblem +RTS -p -N2 -RTS
total time = 28.80 secs (57590 ticks # 1000 us, 2 processors)
total alloc = 26,537,237,440 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
subsetSum Main 70.2 21.4
subsequencesOfSize.subsequencesBySize Main 28.8 78.6
Four Cores
newproblem +RTS -p -N4 -RTS
total time = 26.68 secs (106727 ticks # 1000 us, 4 processors)
total alloc = 35,925,142,744 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
subsetSum Main 68.2 22.4
subsequencesOfSize.subsequencesBySize Main 30.8 77.6
threadscope comparisons
As you can see, the program does run quite a bit faster with 2 or 4 cores than on a single core. However, I do not believe the lists of subsets of size n are being handled by separate processors as I intended.
Looking at the threadscopes, it seems to me as if each "bump" in processor activity is the calculation on each subset of size n. I expected not a reduction in the runtime of each "bump", but "bumps" spawned in parallel across each processor. However, the former description is more accurate than the latter.
What is happening here? Where are the speedups coming from? Why is there so much garbage collection happening in between subset calculations?
Thanks in advance for any enlightenment whatsoever :D

Improving Haskell code performance (BangPatterns, LazyByteString)

I've used BangPatterns, Lazy ByteString. Don't know what else to do to improve performance of this code. Any ideas and suggestions? It's clearly not the fastest version as it exceeds time limit.
-- Find the sum of all the multiples of 3 or 5 below N
-- Input Format
-- First line contains T that denotes the number of test cases. This is followed by T lines, each containing an integer, N.
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -optc-O2 #-}
import qualified Data.ByteString.Lazy as L
import Control.Monad (mapM_)
readInt :: L.ByteString -> Int
readInt !s = L.foldl' (\x c -> 10 * x + fromIntegral c - 48) 0 s
main :: IO ()
main = do
-- don't need the number of inputs, since it is read lazily.
-- split input by lines
(_:ls) <- L.split 10 `fmap` L.getContents
-- length ls <= 10^5
mapM_ (print . f . readInt) ls
-- n <= 10^9
f :: Int -> Int
f n = go 0 0
where
go !i !a | i == n = a
go !i !a | i `mod` 3 == 0
|| i `mod` 5 == 0 = go (i+1) (a+i)
go !i !a = go (i+1) a
danidiaz has already discussed the input and output issue somewhat.
One fast way to produce multiples of 3 or 5 is to use a "wheel" of the sort commonly used for prime sieves.
multiples3or5 = go 0 $ cycle [3,2,1,3,1,2,3]
where
go n (x : xs) = n : go (n+x) xs
go n [] = error "impossible"
In fact, since the circular list never ends, it's cleaner to use a different type. And since you're using Int, it might as well be specialized and unpacked for performance. Note that the UNPACK pragma in this context is not needed for GHC version 7.8 or above.
data IntStream = {-# UNPACK #-} !Int :> IntStream
infixr 5 :>
wheel :: IntStream
wheel = 3 :> 2 :> 1 :> 3 :> 1 :> 2 :> 3 :> wheel
multiples3or5 = go 0 wheel
where
go !n (x :> xs) = n : go (n+x) xs
As fgv commented, this is in the nature of an anamorphism. You can see this by writing
multiples3or5 = unfoldr go (0, wheel) where
go (!n, (x :> xs)) = Just (n, (n+x, xs))
but note that unfoldr did not become efficient enough to be much use for anything until base 4.8, which has not officially been released.
When printing out the results, the system has to divide a lot of things by 10. I don't know if those routines are specially optimized, but I do know that GHC's native code generator does not currently optimize division by a known divisor unless that divisor is a power of 2. So you might find that you can improve performance by using -fllvm, and being careful to use a compatible version of LLVM.
Edit
See Chad Groft's answer for a better way.
Your use of print in the line
mapM_ (print . f . readInt) ls
may be introducing some overhead, because print depends on the Show instance for Int, meaning a conversion to inefficient Strings will take place.
Add the following imports
import qualified Data.ByteString.Builder as BB
import qualified Data.Foldable as F
import Data.List.Split (chunksOf) -- from the "split" package
import System.IO -- for stdout
and try to change that line with something like
let resultList = map (f . readInt) ls
F.mapM_ (BB.hPutBuilder stdout . F.foldMap BB.intDec) (chunksOf 1000 resultList)
that takes chunks of size 1000 from the list of Ints and uses the efficient Builder type and the specialized hPutBuilder function to write them to stdout.
(I added the chunking because otherwise I feared constructing the Builder would force the whole input list into memory. And we don't want that, because the list is being read lazily.)
I'm not sure if that's the main bottleneck, though.
If you're really concerned with efficiency, rethink the algorithm. Your main bottleneck is that you're manually summing a bunch of numbers between 1 and N, which will perform poorly on large N no matter what you do.
Instead, think mathematically. The sum of all multiples of 3 or 5 up to N is almost the sum of all multiples of 3 up to N (call this S_3), plus the sum of all multiples of 5 up to N (call this S_5). I say "almost" because some numbers get double-counted; call their sum T. Now the sum you want is exactly S_3 + S_5 – T, and each term has a nice closed formula (what is it?). Calculating these three numbers is much faster than what you're doing.
Here you the formula without those "think about" mentor answers
sumMultiplesOf::Integral n=>n->n->n
sumMultiplesOf k n = d * (1 + d) `div` 2 * k where d = (n - 1) `div` k
sumMultiplesOf3or5::Integral n=>n->n
sumMultiplesOf3or5 n = sumMultiplesOf 3 n + sumMultiplesOf 5 n - sumMultiplesOf 15 n

Project euler 10 - [haskell] Why so inefficient?

Alright, so i've picked up project euler where i left off when using java, and i'm at problem 10. I use Haskell now and i figured it'd be good to learn some haskell since i'm still very much a beginner.
http://projecteuler.net/problem=10
My friend who still codes in java came up with a very straight forward way to implement the sieve of eratosthenes:
http://puu.sh/5zQoU.png
I tried implementing a better looking (and what i thought was gonna be a slightly more efficient) Haskell function to find all primes up to 2,000,000.
I came to this very elegant, yet apparently enormously inefficient function:
primeSieveV2 :: [Integer] -> [Integer]
primeSieveV2 [] = []
primeSieveV2 (x:xs) = x:primeSieveV2( (filter (\n -> ( mod n x ) /= 0) xs) )
Now i'm not sure why my function is so much slower than his (he claim his works in 5ms), if anything mine should be faster, since i only check composites once (they are removed from the list when they are found) whereas his checks them as many times as they can be formed.
Any help?
You don't actually have a sieve here. In Haskell you could write a sieve as
import Data.Vector.Unboxed hiding (forM_)
import Data.Vector.Unboxed.Mutable
import Control.Monad.ST (runST)
import Control.Monad (forM_, when)
import Prelude hiding (read)
sieve :: Int -> Vector Bool
sieve n = runST $ do
vec <- new (n + 1) -- Create the mutable vector
set vec True -- Set all the elements to True
forM_ [2..n] $ \ i -> do -- Loop for i from 2 to n
val <- read vec i -- read the value at i
when val $ -- if the value is true, set all it's multiples to false
forM_ [2*i, 3*i .. n] $ \j -> write vec j False
freeze vec -- return the immutable vector
main = print . ifoldl' summer 0 $ sieve 2000000
where summer s i b = if b then i + s else s
This "cheats" by using a mutable unboxed vector, but it's pretty darn fast
$ ghc -O2 primes.hs
$ time ./primes
142913828923
real: 0.238 s
This is about 5x faster than my benchmarking of augustss's solution.
To actually implement the sieve efficiently in Haskell you probably need to do it the Java way (i.e., allocate a mutable array an modify it).
For just generating primes I like this:
primes = 2 : filter (isPrime primes) [3,5 ..]
where isPrime (p:ps) x = p*p > x || x `rem` p /= 0 && isPrime ps x
And then you can print the sum of all primes primes < 2,000,000
main = print $ sum $ takeWhile (< 2000000) primes
You can speed it up by adding a type signature primes :: [Int].
But it works well with Integer as well and that also gives you the correct sum (which 32 bit Int will not).
See The Genuine Sieve of Eratosthenes for more information.
The time complexity of your code is n2 (in n primes produced). It is impractical to run for producing more than first 10...20 thousand primes.
The main problem with that code is not that it uses rem but that it starts its filters prematurely, so creates too many of them. Here's how you fix it, with a small tweak:
{-# LANGUAGE PatternGuards #-}
primes = 2 : sieve primes [3..]
sieve (p:ps) xs | (h,t) <- span (< p*p) xs = h ++ sieve ps [x | x <- t, rem x p /= 0]
-- sieve ps (filter (\x->rem x p/=0) t)
main = print $ sum $ takeWhile (< 100000) primes
This improves the time complexity by about n1/2 (in n primes produced) and gives it a drastic speedup: it gets to 100,000 75x faster. Your 28 seconds should become ~ 0.4 sec. But, you probably tested it in GHCi as interpreted code, not compiled. Marking it1) as :: [Int] and compiling with -O2 flag gives it another ~ 40x speedup, so it'll be ~ 0.01 sec. To reach 2,000,000 with this code takes ~ 90x longer, for a whopping ~ 1 sec of projected run time.
1) be sure to use sum $ map (fromIntegral :: Int -> Integer) $ takeWhile ... in main.
see also: http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth

Slowdown when using parallel strategies in Haskell

I was working through the exercises of Andre Loh's deterministic parallel programming in haskell exercises. I was trying to convert the N-Queens sequential code into parallel by using strategies, but I noticed that the parallel code runs much slower than the sequential code and also errors out with insufficient stack space.
This is the code for the parallel N-Queens,
import Control.Monad
import System.Environment
import GHC.Conc
import Control.Parallel.Strategies
import Data.List
import Data.Function
type PartialSolution = [Int] -- per column, list the row the queen is in
type Solution = PartialSolution
type BoardSize = Int
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = case splitAt n xs of
(ys, zs) -> ys : chunk n zs
-- Generate all solutions for a given board size.
queens :: BoardSize -> [Solution]
--queens n = iterate (concatMap (addQueen n)) [[]] !! n
queens n = iterate (\l -> concat (map (addQueen n) l `using` parListChunk (n `div` numCapabilities) rdeepseq)) [[]] !! n
-- Given the size of the problem and a partial solution for the
-- first few columns, find all possible assignments for the next
-- column and extend the partial solution.
addQueen :: BoardSize -> PartialSolution -> [PartialSolution]
addQueen n s = [ x : s | x <- [1..n], safe x s 1 ]
-- Given a row number, a partial solution and an offset, check
-- that a queen placed at that row threatens no queen in the
-- partial solution.
safe :: Int -> PartialSolution -> Int -> Bool
safe x [] n = True
safe x (c:y) n = x /= c && x /= c + n && x /= c - n && safe x y (n + 1)
main = do
[n] <- getArgs
print $ length $ queens (read n)
The line (\l -> concat (map (addQueen n) l using parListChunk (n div numCapabilities) rdeepseq)) is what I changed from the original code. I have seen Simon Marlow's solution but I wanted to know the reason for the slowdown and error in my code.
Thanks in advance.
You are sparking way too much work. The parListChunk parameter of div n numCapabilities is probably, what, 7 on your system (2 cores and you're running with n ~ 14). The list is going to grow large very quickly so there is no point in sparking such small units of work (and I don't see why it makes sense tying it to the value of n).
If I add a factor of ten (making the sparking unit 70 in this case) then I get a clear performance win over single threading. Also, I don't have the stack issue you refer to - if it goes away with a change to your parListChunk value then I'd report that as a bug.
If I make the chunking every 800 then the times top off at 5.375s vs 7.9s. Over 800 and the performance starts to get worse again, ymmv.
EDIT:
[tommd#mavlo Test]$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.4
[tommd#mavlo Test]$ ghc -O2 so.hs -rtsopts -threaded -fforce-recomp ; time ./so 13 +RTS -N2
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m5.404s
[tommd#mavlo Test]$ ghc -O2 so.hs -rtsopts -fforce-recomp ; time ./so 13
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m8.134s

Resources