Multithreaded Subset Sum with Haskell Parallel Strategies

I am attempting to use Parallel.Strategies to parallelize my subset sum solver, and need a little help understanding what is happening.
The Problem
Find the subset of numbers in numbers :: [Int] that sums to 100000000.
Single-threaded solution:
import Data.List (find)
import Data.Maybe (fromJust, isJust)

numbers = [14920416,14602041,14088921,13371291,13216099,12153625,10896437
          ,10884343,10228468,10177453,9998564,9920883,9511265,8924305
          ,8452302,8103727,7519471,7043381,7028847,6418450,6222190,6215767
          ,6190960,5514135,4798322,3823984,3247980,837289] :: [Int]

-- all subsequences of exactly size n, taken from a table of
-- subsequences grouped by size
subsequencesOfSize :: Int -> [Int] -> [[Int]]
subsequencesOfSize n xs = let l = length xs
                          in if n > l then [] else subsequencesBySize xs !! (l - n)
  where
    subsequencesBySize []     = [[[]]]
    subsequencesBySize (x:xs) = let next = subsequencesBySize xs
                                in zipWith (++) ([] : next)
                                               (map (map (x:)) next ++ [[]])

subsetSum :: [Int] -> Int -> Maybe [Int]
subsetSum seq n = find ((== target) . sum) (subsequencesOfSize n seq)
  where target = 100000000

-- try subset sizes from largest to smallest
solve = map (subsetSum numbers) [n, n-1 .. 1]
  where n = length numbers

main = do
  print $ fromJust $ find isJust solve
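To get a feel for what subsequencesOfSize returns, here is a quick GHCi check (my example, worked out by hand, not from the question):

ghci> subsequencesOfSize 2 [1,2,3]
[[2,3],[1,3],[1,2]]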
Parallel Strategy
Since I am already calculating the subsets of each size n separately, I figured I could just use parMap to spawn the calculation for each list of subsets of size n simultaneously. I replaced the map in the solve function like this:
import Control.Parallel.Strategies

solve = parMap rpar (subsetSum numbers) [n, n-1 .. 1]
  where n = length numbers
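One caveat worth ruling out (a hedged sketch, not a confirmed diagnosis): rpar only evaluates each spark to weak head normal form. Here that happens to be enough, because find must run its whole search before it can commit to Just or Nothing, but attaching a deepseq-based strategy makes the intent explicit and leaves nothing for the consumer to evaluate:

import Control.Parallel.Strategies

-- a sketch: fully evaluate each Maybe [Int] inside its own spark
solve = map (subsetSum numbers) [n, n-1 .. 1] `using` parList rdeepseq
  where n = length numbers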
Single Core
newproblem +RTS -p -N1 -RTS

total time  =       35.05 secs   (35047 ticks @ 1000 us, 1 processor)
total alloc = 22,628,052,232 bytes  (excludes profiling overheads)

COST CENTRE                            MODULE  %time  %alloc
subsetSum                              Main     86.6    24.5
subsequencesOfSize.subsequencesBySize  Main     11.0    75.5
solve                                  Main      2.4     0.0
Two Cores
newproblem +RTS -p -N2 -RTS

total time  =       28.80 secs   (57590 ticks @ 1000 us, 2 processors)
total alloc = 26,537,237,440 bytes  (excludes profiling overheads)

COST CENTRE                            MODULE  %time  %alloc
subsetSum                              Main     70.2    21.4
subsequencesOfSize.subsequencesBySize  Main     28.8    78.6
Four Cores
newproblem +RTS -p -N4 -RTS

total time  =       26.68 secs   (106727 ticks @ 1000 us, 4 processors)
total alloc = 35,925,142,744 bytes  (excludes profiling overheads)

COST CENTRE                            MODULE  %time  %alloc
subsetSum                              Main     68.2    22.4
subsequencesOfSize.subsequencesBySize  Main     30.8    77.6
ThreadScope comparisons

As you can see, the program does run quite a bit faster on 2 or 4 cores than on a single core. However, I do not believe the lists of subsets of size n are being handled by separate processors as I intended.

Looking at the ThreadScope output, each "bump" in processor activity seems to be the calculation over the subsets of one size n. I expected the bumps to be spawned in parallel across the processors, not each individual bump to run faster; yet the latter describes what actually happens.

What is happening here? Where are the speedups coming from? Why is there so much garbage collection happening between subset calculations?
Thanks in advance for any enlightenment whatsoever :D

Related

Significant overhead implementing Catalan numbers with histomorphism

I've been exploring recursion-schemes recently and want to find some use cases for histomorphism, for which I think Catalan numbers could be a fun one (I'm aware there are better ways to implement Catalan numbers; they are not the focus of this question). What I came up with is the following:
{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE LambdaCase #-}
-- (pragmas needed for the \case syntax below; presumably among the omitted lines)

import Control.Comonad.Cofree
import Control.Monad
import Data.Foldable
import Data.Function.Memoize (memoFix)
import Data.Functor.Foldable
import GHC.Natural

type Nat = Natural

-- unrelated lines omitted

catalanHisto :: Nat -> Nat
catalanHisto = histo \case
  Nothing ->
    1
  Just fs ->
    let xs = toList fs -- this is line 101 in my original code.
        ys = reverse xs
     in sum $ zipWith (*) xs ys

catalanMemo :: Integer -> Integer
catalanMemo = memoFix \q n ->
  if n == 0
    then 1
    else
      let xs = fmap q [0 .. n - 1]
          ys = reverse xs
       in sum $ zipWith (*) xs ys

main :: IO ()
main = do
  -- print $ catalanMemo 1000
  print $ catalanHisto 1000
Performance suffers, however, with catalanHisto:

real 49.36s
user 416.48s
sys  99.38s

which is pretty bad compared with catalanMemo:

real 0.84s
user 5.09s
sys  2.08s
Given that it at least terminates, the histo version definitely memoized something, but there is a huge overhead, and I'm not sure whether I'm misusing histo or whether this is just the price to pay for writing programs this way. I went ahead with some basic profiling:
Sat Feb 19 22:58 2022 Time and Allocation Profiling Report (Final)

demo +RTS -N -s -p -RTS

total time  =       20.78 secs  (52462 ticks @ 1000 us, 24 processors)
total alloc = 122,870,767,920 bytes  (excludes profiling overheads)

COST CENTRE        MODULE                  SRC                                     %time  %alloc
catalanHisto.\     Catalan                 src/Catalan.hs:(101,5)-(103,31)          68.0    71.5
foldMap.go         Control.Comonad.Cofree  src/Control/Comonad/Cofree.hs:301:5-46   28.4    25.0
catalanHisto       Catalan                 src/Catalan.hs:(97,1)-(103,31)            1.7     0.0
catalanHisto.\.ys  Catalan                 src/Catalan.hs:102:9-23                   1.3     3.3
I'm not an expert at interpreting those results, but my guess is that, in addition to some allocation in Control.Comonad.Cofree, it spends the majority of its time allocating in the non-trivial branch of catalanHisto, probably due to toList and reverse, and I'm not sure how much room there is for optimization.
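For comparison, here is the classic lazy self-referential list (my sketch, not from the question); it performs the same O(n^2) zipWith/reverse work but without the Cofree bookkeeping, which can help separate the algorithmic cost from histo's constant-factor overhead:

-- catalans !! n is the n-th Catalan number; e.g. catalans !! 3 == 5
catalans :: [Integer]
catalans = 1 : [ sum (zipWith (*) xs (reverse xs))
               | n <- [1 ..], let xs = take n catalans ]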

GHC profiling file and chart are contradictory

I have a sieve of Eratosthenes program written in ST.Strict, and I was profiling it when I saw that it was taking a ridiculous amount of memory:
Sun Jul 10 18:27 2016 Time and Allocation Profiling Report (Final)

Primes +RTS -hc -p -K1000M -RTS 10000000

total time  =  2.32 secs  (2317 ticks @ 1000 us, 1 processor)
total alloc = 5,128,702,952 bytes  (excludes profiling overheads)
(where 10^7 is the number of primes I asked it to generate).
Weirdly, the profiling graph shows something completely different:
Am I misreading something in one of these graphs? Or is there something wrong with one of these tools?
For reference, my code is
{-# LANGUAGE BangPatterns #-}

import Prelude hiding (replicate, read)
import qualified Text.Read as T
import Data.Vector.Unboxed.Mutable (replicate, write, read)
import Control.Monad.ST.Strict
import Data.STRef
import Control.Monad.Primitive
import Control.Monad
import System.Environment

main = print . length . primesUpTo . T.read . head =<< getArgs

primesUpTo :: Int -> [Int]
primesUpTo n = runST $ do
    primes <- replicate n True
    write primes 0 False
    write primes 1 False
    sieve 2 primes
    return []
    -- Removed to avoid the memory allocation of creating the list for profiling purposes
    -- filterM (read primes) [0..n-1]
  where
    sieve !i primes | i * i >= n = return primes
    sieve !i primes = do
      v <- read primes i
      counter <- newSTRef $ i * i
      when v $ whileM_ ((< n) <$!> readSTRef counter) $ do
        curr_count <- readSTRef counter
        write primes curr_count False
        writeSTRef counter (curr_count + i)
      sieve (i + 1) primes

whileM_ :: (Monad m) => m Bool -> m a -> m ()
whileM_ condition body = do
  cond <- condition
  when cond $ do
    body
    whileM_ condition body
This seems to confuse many people.
total alloc = 5,128,702,952 bytes (excludes profiling overheads)
This is literally the total size of all the allocations ever performed by your program, including "temporary" objects that become dead almost immediately after being allocated. Allocation itself is nearly free, and generally Haskell programs allocate at a rate of around 1-2 GB/s.
Weirdly, the profiling graph shows something completely different:
Indeed, the profiling graph shows the total size of all the objects that are live on the heap at any particular time. This reflects the space usage of your program. If your program runs in constant space, then the number shown in this graph will stay constant.
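To make the distinction concrete, here is a toy example (mine, not derived from the question's program). Compiled without optimisation, the loop below allocates a fresh boxed Int on almost every iteration, so "total alloc" climbs into the gigabytes, while the live heap that the graph reports stays small and flat:

{-# LANGUAGE BangPatterns #-}

-- Each iteration allocates new boxed Ints for the accumulator (when
-- compiled without -O), but only one accumulator is live at any time,
-- so total allocation is huge while residency stays constant.
main :: IO ()
main = print (go 0 (100000000 :: Int))
  where
    go :: Int -> Int -> Int
    go !acc 0 = acc
    go !acc n = go (acc + n) (n - 1)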

How to read a big CSV file?

I am trying to read a big CSV file in Haskell and generate a word count for each column.
The file has more than 4M rows, so I chose to read it one block at a time and compute the word counts per block (5k rows per block), then sum them together.
When I test the function with 12,000 and 120,000 rows, the time increases almost linearly; but when reading 180,000 rows, the run time increases by more than a factor of four.
I think it's because there isn't enough memory, and swapping to disk makes the function much slower.
I wrote my code in map/reduce style, but how do I make Haskell not hold all the data in memory?
Below are my code and the profiling results.
import Data.Ord
import Text.CSV.Lazy.String
import Data.List
import System.IO
import Data.Function (on)
import System.Environment

splitLength = 5000

mySplit' [] = []
mySplit' xs = [x] ++ mySplit' t
  where
    x = take splitLength xs
    t = drop splitLength xs

getBlockCount :: Ord a => [[a]] -> [[(a,Int)]]
getBlockCount t = map
    (map (\x -> (head x, length x))) $
    map group $ map sort $ transpose t

foldData :: Ord a => [(a,Int)] -> [(a,Int)] -> [(a,Int)]
foldData lxs rxs = map combind wlist
  where
    wlist = groupBy ((==) `on` fst) $ sortBy (comparing fst) $ lxs ++ rxs
    combind xs
      | 1 == length xs = head xs
      | 2 == length xs = (fst (head xs), snd (head xs) + snd (last xs))

loadTestData datalen = do
    testFile <- readFile "data/test_csv"
    let cfile = fromCSVTable $ csvTable $ parseCSV testFile
    let column = head cfile
    let body = take datalen $ tail cfile
    let countData = foldl1' (zipWith foldData) $ map getBlockCount $ mySplit' body
    let output = zip column $ map (reverse . sortBy (comparing snd)) countData
    appendFile "testdata" $ foldl1 (\x y -> x ++ "\n" ++ y) $ map show $ tail output

main = do
    s <- getArgs
    loadTestData $ read $ last s
Profiling results:

loadData +RTS -p -RTS 12000

total time  =  1.02 secs  (1025 ticks @ 1000 us, 1 processor)
total alloc = 991,266,560 bytes  (excludes profiling overheads)

loadData +RTS -p -RTS 120000

total time  = 17.28 secs  (17284 ticks @ 1000 us, 1 processor)
total alloc = 9,202,259,064 bytes  (excludes profiling overheads)

loadData +RTS -p -RTS 180000

total time  = 85.06 secs  (85059 ticks @ 1000 us, 1 processor)
total alloc = 13,760,818,848 bytes  (excludes profiling overheads)
So first, a few suggestions.
Lists aren't fast. Okay, okay, cons is constant time, but in general, lists aren't fast. You're using lists. (Data.Sequence would've been faster for two-ended cons'ing and consumption)
Strings are slow. Strings are slow because they're [Char] (List of Char). The library you're currently using is written in terms of lists of Strings. Usually linked-lists of linked-lists of characters aren't what you want for text processing. This is no bueno. Use Text (for, uh, text) or ByteString (for bytes) instead of String in future unless it's something small and not performance sensitive.
The library you are using is just lazy, not streaming. You'd have to handle getting streaming behavior overlaid onto the lazy semantics to get constant memory use. Streaming libraries solve the problem of incrementally processing data and limiting memory use. I'd suggest learning Pipes or Conduit for that general class of problems. Some problem-specific libraries will also offer an iteratee API which can be used for streaming. Iteratee APIs can be used directly or hooked up to Pipes/Conduit/etc.
I don't think the library you're using is a good idea.
I suggest you use one of the following libraries:
http://hackage.haskell.org/package/pipes-csv (Pipes based)
https://hackage.haskell.org/package/cassava-0.4.2.0/docs/Data-Csv-Streaming.html (Common CSV library, not based on a particular streaming library)
https://hackage.haskell.org/package/csv-conduit (Conduit based)
These should give you good performance and constant memory use modulo whatever you might be accumulating.
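To give a flavor of the streaming style, here is a hedged sketch against cassava's Data.Csv.Streaming interface (the file path is taken from the question; everything else is my assumption). Records is Foldable, so you can fold over the rows strictly without holding the whole file:

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Csv (HasHeader (NoHeader))
import qualified Data.Csv.Streaming as S
import Data.Foldable (foldl')
import qualified Data.Vector as V

main :: IO ()
main = do
  contents <- BL.readFile "data/test_csv"
  let records = S.decode NoHeader contents
                  :: S.Records (V.Vector B.ByteString)
      -- fold strictly; only the current record needs to be live
      nRows = foldl' (\n _ -> n + 1) (0 :: Int) records
  print nRows

In the word-count case you would fold into a strict map instead of an Int (see the note on foldData below).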
There are a couple of things to be aware of:
You want to stream the data so that you are only holding in memory a small portion of the input file at any time. You might be able to accomplish this with lazy IO and the lazy-csv package. However, it still is easy to inadvertently hold on to references which keep all of your input in memory. A better option is to use a streaming library like csv-conduit or pipes-csv.
Use ByteString or Text when processing large amounts of string data.
You want to make sure to use strict operations when reducing your data. Otherwise you will just be building up thunks of unevaluated expressions in memory until the very end when you print out the result. One place where thunks could be building up is your foldData function - the word count expressions do not appear to be getting reduced.
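As a minimal illustration of that last point (my sketch, assuming the per-column counts live in a map): Data.Map.Strict evaluates each combined count as it is inserted, so no chain of (+) thunks builds up the way it can in foldData:

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- strictly accumulate word counts for one column; insertWith in the
-- strict Map forces each updated count immediately
countColumn :: [String] -> M.Map String Int
countColumn = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty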
Here is an example of a program which will compute the total length of all of the words in each column of a CSV file and does it in constant memory. The main features are:
uses lazy IO
uses the lazy-csv package with (lazy) ByteString instead of String
uses BangPatterns to strictify the computation of the number of lines
uses an unboxed array to hold the column counters
The code:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.ByteString.Lazy (ByteString)
import Text.CSV.Lazy.ByteString
import System.Environment (getArgs)
import Data.List (foldl')
import Data.Int
import Data.Array.IO
import Data.Array.Unboxed
import Control.Monad

type Length = Int64 -- use Int on 32-bit systems

main = do
  (arg:_) <- getArgs
  (line1:lns) <- fmap BS.lines $ BS.readFile arg
  -- line1 contains the header
  let (headers:_) = [ map csvFieldContent r | r <- csvTable (parseCSV line1) ]
      ncols = length headers :: Int
  arr <- newArray (1,ncols) 0 :: IO (IOUArray Int Length)
  let inc i a = do v <- readArray arr i; writeArray arr i (v+a)
  let loop !n [] = return n
      loop !n (b:bs) = do
        let lengths = map BS.length $ head [ map csvFieldContent r | r <- csvTable (parseCSV b) ]
        forM_ (zip [1..] lengths) $ \(i,a) -> inc i a
        loop (n+1) bs
  print headers
  n <- loop 0 lns
  putStrLn $ "n = " ++ show (n :: Int)
  arr' <- freeze arr :: IO (UArray Int Length)
  putStrLn $ "totals = " ++ show arr'
I have had this issue before in another language. The trick is not to read the whole data set into memory, but rather to read it one line at a time; when you read the next line, you just overwrite your variables, since you are only looking for a word count.
Then test for an EOF (end-of-file) condition on your IO stream and exit at that point. That way you don't have to split the file.
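In Haskell terms, that strategy might look like this sketch (mine; it just counts lines, with the file name from the question assumed):

{-# LANGUAGE BangPatterns #-}

import System.IO

-- read one line at a time, keeping only a running count live
main :: IO ()
main = withFile "data/test_csv" ReadMode (loop 0) >>= print
  where
    loop :: Int -> Handle -> IO Int
    loop !n h = do
      eof <- hIsEOF h
      if eof
        then return n
        else do _ <- hGetLine h
                loop (n + 1) h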
Hope that helps

memoFix of Data.Function.Memoize does not free memory after itself

I am wondering why the GC never frees memory when main = memo_main. If main = slow_main, the computation takes the same amount of time each time I enter 37. Because of that, I expected to see memory usage drop each time the top-level memoFix returns. Yet it is stuck at 370 MB.
module Main where

import Data.Function.Memoize

memo_fib :: (Int -> Int) -> Int -> Int
memo_fib _ 0 = 1
memo_fib _ 1 = 1
memo_fib f n = f (n-1) + f (n-2)

memo_main = interact $ unlines . map (\x -> show $ memoFix memo_fib $ read x) . lines

slow_fib 0 = 1
slow_fib 1 = 1
slow_fib n = slow_fib (n-1) + slow_fib (n-2)

slow_main = interact $ unlines . map (\x -> show $ slow_fib $ read x) . lines

main :: IO ()
main = memo_main
UPD: Resolved. Memory was actually freed, but not returned to the OS. This is by design of the GHC RTS.
Just a couple notes:
Are you compiling with optimization? Optimization might cause memoFix memo_fib to be floated out of the lambda, since it doesn't depend on x, thus sharing the memo table across all calls (see the sketch after these notes).
Another thing to keep in mind is that the GHC runtime never shrinks the heap. As soon as it has allocated a certain amount of memory, the GC collects it but keeps it around for other parts of the program to use, rather than freeing it to the OS. So the GC may be collecting this after all, but you won't see it with whatever tool you're using to watch it.
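To make the sharing explicit rather than leaving it to the optimizer, you can bind the memoized function once yourself. This is a hypothetical rewrite of memo_main (a drop-in for the module above), not the OP's code:

-- one memo table, deliberately shared across all input lines
memo_main_shared :: IO ()
memo_main_shared = interact $ unlines . map (show . fib . read) . lines
  where
    fib :: Int -> Int
    fib = memoFix memo_fib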

Optimizing longest Collatz chain in Haskell

I've been doing Project Euler problems to learn Haskell.
I've hit some bumps along the way but managed to get to problem 14.
The question is: which starting number under 1,000,000 produces the longest Collatz chain (numbers are allowed to go above one million after the chain starts)?
I've tried a couple of solutions, but none of them worked.
I wanted to work in reverse, starting from 1 and terminating when the number gets above one million, but that obviously doesn't work, since terms in a chain can exceed one million.
I've tried memoizing the normal algorithm, but the numbers get too large and there is too much to memoize.
I've read that the most obvious solution should work here, yet for some reason my solution takes over 10 seconds to get the maximum for starting numbers up to just 20,000, let alone 1,000,000.
This is the code I'm using at the moment:
reg_collatz 1 = 1
reg_collatz n
  | even n    = 1 + reg_collatz (n `div` 2)
  | otherwise = 1 + reg_collatz (n * 3 + 1)

solution = foldl1 (\a n -> max a (reg_collatz n)) [1..20000]
Any help is very welcome.
The answer is simple: don't memoise numbers above one million, but do memoise those below.
module Main where

import qualified Data.Map as M
import Data.List
import Data.Ord

main = print $ fst $ maximumBy (comparing snd) $ M.toList ansMap

-- ansMap is defined in terms of itself: fromAscList is lazy in the
-- values, so each collatz i can look up already-memoised entries
ansMap :: M.Map Integer Int
ansMap = M.fromAscList [(i, collatz i) | i <- [1..1000000]]
  where
    collatz 1 = 0
    collatz x = if x' <= 1000000 then 1 + ansMap M.! x'
                                 else 1 + collatz x'
      where x' = if even x then x `div` 2 else x*3 + 1
This is obv waaay late but I thought I'd post anyways for future readers' benefit (I imagine OP is long done with this problem).
TL;DR:
I think we probably want to use the Data.Vector package for this problem (and similar types of problems).
Longer version:
According to the Haskell docs, a Map (from Data.Map) has O(log N) access time whereas a Vector (from Data.Vector) has O(1) access; we can see the difference in the results below: the vector implementation runs ~3x faster. (Both are way better than lists which have O(N) access time.)
A couple of benchmarks are included below. The tests were intentionally not run one after another so as to prevent any cache-based optimization.
A couple of observations:
The largest absolute improvement (from the code in the original post) was due to the addition of type signatures; without being explicitly told that the data was of type Int, Haskell's type system inferred Integer (which is obv bigger and slower). A reconstruction of this variant appears right after this list.
A bit counterintuitively, the results are virtually indistinguishable between foldl1' and foldl1. (I double-checked the code and ran these a few times just to make sure.)
Vector and Array (and, to a certain extent, Map) allow for decent improvement primarily as a result of memoization. (Note that OP's solution is likely a lot faster than a list-based solution that tried to use memoization given lists' O(N) access time.)
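For reference, here is a hypothetical reconstruction of the "Int type sigs + foldl'" variant from the table below (the question's code plus signatures and a strict fold; the exact benchmarked range isn't stated, so the question's [1..20000] is kept):

import Data.List (foldl1')

reg_collatz :: Int -> Int
reg_collatz 1 = 1
reg_collatz n
  | even n    = 1 + reg_collatz (n `div` 2)
  | otherwise = 1 + reg_collatz (n * 3 + 1)

-- the strict fold avoids building a chain of max thunks
solution :: Int
solution = foldl1' (\a n -> max a (reg_collatz n)) [1..20000]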
Here are a couple of benchmarks (all compiled using -O2):

                                                    Probably want to look
                                                    at these numbers
                                                              |
                                                              V
Data.Vector                     0.35s user 0.10s system 97% cpu 0.468 total
Data.Array (Haskell.org)        0.31s user 0.21s system 98% cpu 0.530 total
Data.Map (above answer)         1.31s user 0.46s system 99% cpu 1.772 total
Control.Parallel (Haskell.org)  1.75s user 0.05s system 99% cpu 1.799 total
OP (`Int` type sigs + foldl')   3.60s user 0.06s system 99% cpu 3.674 total
OP (`Int` type sigs)            3.53s user 0.16s system 99% cpu 3.709 total
OP (verbatim)                   3.36s user 4.77s system 99% cpu 8.146 total
Source of figures from Haskell.org: https://www.haskell.org/haskellwiki/Euler_problems/11_to_20#Problem_14
The Data.Vector implementation used to generate the above results:
import Data.Vector ( Vector, fromList, maxIndex, (!) )

main :: IO ()
main = putStrLn $ show $ largestCollatz 1000000

largestCollatz :: Int -> Int
largestCollatz n = maxIndex vector
  where
    -- vector is defined in terms of itself: boxed vectors are lazy in
    -- their elements, so collatz can look up already-computed lengths
    vector :: Vector Int
    vector = fromList $ 0 : 1 : [collatz x x 0 | x <- [2..n]]
    collatz m i c =
      case i < m of
        True  -> c + vector ! i
        False -> let j = if even i then i `div` 2 else 3*i + 1
                 in collatz m j (c+1)
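For completeness, here is my sketch of the Data.Array approach the benchmarks mention (assuming the same chain-length convention as the vector version; boxed arrays are likewise lazy in their elements, so the same self-referential memoization works):

import Data.Array

main :: IO ()
main = print (largestCollatzArr 1000000)

largestCollatzArr :: Int -> Int
largestCollatzArr n = snd $ maximum [ (arr ! i, i) | i <- [1..n] ]
  where
    arr :: Array Int Int
    arr = listArray (1, n) $ 1 : [collatz x x 0 | x <- [2..n]]
    -- i < m means the rest of the chain is already memoised in arr
    collatz m i c
      | i < m     = c + arr ! i
      | even i    = collatz m (i `div` 2) (c + 1)
      | otherwise = collatz m (3*i + 1) (c + 1)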
