GHC profiling file and chart are contradictory - haskell

I have a sieve of Eratosthenes program written in ST.Strict, and I was profiling it when I saw that it was taking a ridiculous amount of memory:
Sun Jul 10 18:27 2016 Time and Allocation Profiling Report (Final)
Primes +RTS -hc -p -K1000M -RTS 10000000
total time = 2.32 secs (2317 ticks @ 1000 us, 1 processor)
total alloc = 5,128,702,952 bytes (excludes profiling overheads)
where 10^7 is the limit up to which I asked it to generate primes.
Weirdly, the profiling graph shows something completely different:
Am I misreading something in one of these graphs? Or is there something wrong with one of these tools?
For reference, my code is
{-# LANGUAGE BangPatterns #-}
import Prelude hiding (replicate, read)
import qualified Text.Read as T
import Data.Vector.Unboxed.Mutable (replicate, write, read)
import Control.Monad.ST.Strict
import Data.STRef
import Control.Monad.Primitive
import Control.Monad
import System.Environment

main = print . length . primesUpTo . T.read . head =<< getArgs

primesUpTo :: Int -> [Int]
primesUpTo n = runST $ do
    primes <- replicate n True
    write primes 0 False
    write primes 1 False
    sieve 2 primes
    return []
    -- Removed to avoid the memory allocation of creating the list for profiling purposes
    -- filterM (read primes) [0..n-1]
  where
    sieve !i primes | i * i >= n = return primes
    sieve !i primes = do
        v <- read primes i
        counter <- newSTRef $ i * i
        when v $ whileM_ ((< n) <$!> readSTRef counter) $ do
            curr_count <- readSTRef counter
            write primes curr_count False
            writeSTRef counter (curr_count + i)
        sieve (i + 1) primes

whileM_ :: (Monad m) => m Bool -> m a -> m ()
whileM_ condition body = do
    cond <- condition
    when cond $ do
        body
        whileM_ condition body

This seems to confuse many people.
total alloc = 5,128,702,952 bytes (excludes profiling overheads)
This is literally the total size of all the allocations ever performed by your program, including "temporary" objects that become dead almost immediately after being allocated. Allocation itself is nearly free, and generally Haskell programs allocate at a rate of around 1-2 GB/s.
Weirdly, the profiling graph shows something completely different:
Indeed, the profiling graph shows the total size of all the objects that are live on the heap at any particular time. This reflects the space usage of your program. If your program runs in constant space, then the number shown in this graph will stay constant.
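To make the difference concrete, here is a small, self-contained sketch (not the sieve program above): on a typical 64-bit GHC it reports total allocation in the hundreds of megabytes under +RTS -s, yet its heap profile stays essentially flat, because each intermediate value dies as soon as the next one is produced.

-- Large "total alloc", tiny residency: the strict left fold keeps only one
-- live accumulator at a time, so the intermediate Integers become garbage
-- almost immediately after being allocated.
import Data.List (foldl')

main :: IO ()
main = print (foldl' (+) 0 [1 .. 10000000 :: Integer])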

Related

Significant overhead implementing Catalan numbers with histomorphism

I've been exploring recursion-schemes recently and want to find some use cases for histomorphism, and I think Catalan numbers could be a fun one (I'm aware there are better ways to implement Catalan numbers; they are not the focus of this question). What I came up with is the following:
import Control.Comonad.Cofree
import Control.Monad
import Data.Foldable
import Data.Function.Memoize (memoFix)
import Data.Functor.Foldable
import GHC.Natural

type Nat = Natural

-- unrelated lines omitted

catalanHisto :: Nat -> Nat
catalanHisto = histo \case
  Nothing ->
    1
  Just fs ->
    let xs = toList fs -- this is line 101 in my original code.
        ys = reverse xs
     in sum $ zipWith (*) xs ys

catalanMemo :: Integer -> Integer
catalanMemo = memoFix \q n ->
  if n == 0
    then 1
    else
      let xs = fmap q [0 .. n - 1]
          ys = reverse xs
       in sum $ zipWith (*) xs ys

main :: IO ()
main = do
  -- print $ catalanMemo 1000
  print $ catalanHisto 1000
Performance suffers, however, with catalanHisto:
real 49.36s
user 416.48s
sys 99.38s
which is pretty bad compared with catalanMemo:
real 0.84s
user 5.09s
sys 2.08s
Given that it at least terminates, the histo version definitely memoized something, but there is a huge overhead, and I'm not sure whether I'm misusing histo or whether it's just the price to pay for writing programs this way. I went ahead with some basic profiling:
Sat Feb 19 22:58 2022 Time and Allocation Profiling Report (Final)
demo +RTS -N -s -p -RTS
total time = 20.78 secs (52462 ticks @ 1000 us, 24 processors)
total alloc = 122,870,767,920 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
catalanHisto.\ Catalan src/Catalan.hs:(101,5)-(103,31) 68.0 71.5
foldMap.go Control.Comonad.Cofree src/Control/Comonad/Cofree.hs:301:5-46 28.4 25.0
catalanHisto Catalan src/Catalan.hs:(97,1)-(103,31) 1.7 0.0
catalanHisto.\.ys Catalan src/Catalan.hs:102:9-23 1.3 3.3
I'm not an expert at interpreting those results, but I guess that, in addition to some allocations in Control.Comonad.Cofree, it spends the majority of its time doing allocation in the non-trivial branch of catalanHisto, probably due to toList and reverse, and I'm not sure how much room there is for optimization.

Multithreaded Subset Sum with Haskell Parallel Strategies

I am attempting to use Parallel.Strategies to parallelize my subset sum solver, and need a little help understanding what is happening.
The Problem
Find the subset of numbers in numbers :: [Int] which sum to 100000000.
Single-threaded solution:
import Data.List (find)
import Data.Maybe (fromJust, isJust)

numbers = [14920416,14602041,14088921,13371291,13216099,12153625,10896437
          ,10884343,10228468,10177453,9998564,9920883,9511265,8924305
          ,8452302,8103727,7519471,7043381,7028847,6418450,6222190,6215767
          ,6190960,5514135,4798322,3823984,3247980,837289] :: [Int]

subsequencesOfSize :: Int -> [Int] -> [[Int]]
subsequencesOfSize n xs = let l = length xs
                          in if n > l then [] else subsequencesBySize xs !! (l - n)
  where
    subsequencesBySize [] = [[[]]]
    subsequencesBySize (x:xs) = let next = subsequencesBySize xs
                                in zipWith (++) ([]:next) (map (map (x:)) next ++ [[]])

subsetSum :: [Int] -> Int -> Maybe [Int]
subsetSum seq n = find ((== target) . sum) (subsequencesOfSize n seq)
  where target = 100000000

solve = map (subsetSum numbers) [n, n-1 .. 1]
  where n = length numbers

main = do
  print $ fromJust $ find isJust solve
Parallel Strategy
Since I am calculating the subsets of size n separately already, I figured I could just use parMap to spawn calculation of each list of subsets of size n simultaneously. I replaced the map in the solve function like this:
import Control.Parallel.Strategies

solve = parMap rpar (subsetSum numbers) [n, n-1 .. 1]
  where n = length numbers
Single Core
newproblem +RTS -p -N1 -RTS
total time = 35.05 secs (35047 ticks @ 1000 us, 1 processor)
total alloc = 22,628,052,232 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
subsetSum Main 86.6 24.5
subsequencesOfSize.subsequencesBySize Main 11.0 75.5
solve Main 2.4 0.0
Two Cores
newproblem +RTS -p -N2 -RTS
total time = 28.80 secs (57590 ticks @ 1000 us, 2 processors)
total alloc = 26,537,237,440 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
subsetSum Main 70.2 21.4
subsequencesOfSize.subsequencesBySize Main 28.8 78.6
Four Cores
newproblem +RTS -p -N4 -RTS
total time = 26.68 secs (106727 ticks @ 1000 us, 4 processors)
total alloc = 35,925,142,744 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
subsetSum Main 68.2 22.4
subsequencesOfSize.subsequencesBySize Main 30.8 77.6
threadscope comparisons
As you can see, the program does run quite a bit faster with 2 or 4 cores than on a single core. However, I do not believe the lists of subsets of size n are being handled by separate processors as I intended.
Looking at the threadscopes, it seems to me as if each "bump" in processor activity is the calculation on the subsets of one size n. What I expected was not a reduction in the runtime of each "bump", but rather "bumps" spawned in parallel across the processors. However, the former description is closer to what actually happens than the latter.
What is happening here? Where are the speedups coming from? Why is there so much garbage collection happening in between subset calculations?
Thanks in advance for any enlightenment whatsoever :D

How to read a big CSV file?

I am trying to read a big CSV file in Haskell and generate a word count for each column.
There are more than 4M rows in the file.
So I chose to read one block at a time and compute the word counts for it (5k rows per block).
Then I sum the counts together.
When I test the function with 12000 rows and 120000 rows, the time increases almost linearly.
But when reading 180000 rows, the run time increases by more than a factor of four.
I think it is because there is not enough memory, and swapping to disk makes the function much slower.
I wrote my code in a map/reduce style, but how do I make the Haskell program not hold all the data in memory?
Below are my code and the profiling results.
import Data.Ord
import Text.CSV.Lazy.String
import Data.List
import System.IO
import Data.Function (on)
import System.Environment

splitLength = 5000

mySplit' [] = []
mySplit' xs = [x] ++ mySplit' t
  where
    x = take splitLength xs
    t = drop splitLength xs

getBlockCount :: Ord a => [[a]] -> [[(a,Int)]]
getBlockCount t = map (map (\x -> (head x, length x))) $
                  map group $ map sort $ transpose t

foldData :: Ord a => [(a,Int)] -> [(a,Int)] -> [(a,Int)]
foldData lxs rxs = map combind wlist
  where
    wlist = groupBy ((==) `on` fst) $ sortBy (comparing fst) $ lxs ++ rxs
    combind xs
      | 1 == length xs = head xs
      | 2 == length xs = ((fst . head) xs, (snd . head) xs + (snd . last) xs)

loadTestData datalen = do
    testFile <- readFile "data/test_csv"
    let cfile = fromCSVTable $ csvTable $ parseCSV testFile
    let column = head cfile
    let body = take datalen $ tail cfile
    let countData = foldl1' (zipWith foldData) $ map getBlockCount $ mySplit' body
    let output = zip column $ map (reverse . sortBy (comparing snd)) countData
    appendFile "testdata" $ foldl1 (\x y -> x ++ "\n" ++ y) $ map show $ tail output

main = do
    s <- getArgs
    loadTestData $ read $ last s
profiling result
loadData +RTS -p -RTS 12000
total time = 1.02 secs (1025 ticks @ 1000 us, 1 processor)
total alloc = 991,266,560 bytes (excludes profiling overheads)
loadData +RTS -p -RTS 120000
total time = 17.28 secs (17284 ticks @ 1000 us, 1 processor)
total alloc = 9,202,259,064 bytes (excludes profiling overheads)
loadData +RTS -p -RTS 180000
total time = 85.06 secs (85059 ticks @ 1000 us, 1 processor)
total alloc = 13,760,818,848 bytes (excludes profiling overheads)
So first, a few suggestions.
Lists aren't fast. Okay, okay, cons is constant time, but in general, lists aren't fast. You're using lists. (Data.Sequence would've been faster for two-ended cons'ing and consumption)
Strings are slow. Strings are slow because they're [Char] (List of Char). The library you're currently using is written in terms of lists of Strings. Usually linked-lists of linked-lists of characters aren't what you want for text processing. This is no bueno. Use Text (for, uh, text) or ByteString (for bytes) instead of String in future unless it's something small and not performance sensitive.
The library you are using is just lazy, not streaming. You'd have to handle getting streaming behavior overlaid onto the lazy semantics to get constant memory use. Streaming libraries solve the problem of incrementally processing data and limiting memory use. I'd suggest learning Pipes or Conduit for that general class of problems. Some problem-specific libraries will also offer an iteratee API which can be used for streaming. Iteratee APIs can be used directly or hooked up to Pipes/Conduit/etc.
I don't think the library you're using is a good idea.
I suggest you use one of the following libraries:
http://hackage.haskell.org/package/pipes-csv (Pipes based)
https://hackage.haskell.org/package/cassava-0.4.2.0/docs/Data-Csv-Streaming.html (Common CSV library, not based on a particular streaming library)
https://hackage.haskell.org/package/csv-conduit (Conduit based)
These should give you good performance and constant memory use modulo whatever you might be accumulating.
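For instance, here is a minimal sketch using cassava's streaming interface (Data.Csv.Streaming). It only counts rows rather than reproducing the per-column word counts from the question, but it shows the shape of a pipeline that keeps memory use roughly constant; the file path is the one from the question.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Csv (HasHeader (..))
import qualified Data.Csv.Streaming as S
import Data.Foldable (foldl')
import qualified Data.Vector as V

main :: IO ()
main = do
  contents <- BL.readFile "data/test_csv"
  -- decode lazily into a stream of rows, skipping the header line
  let rows = S.decode HasHeader contents :: S.Records (V.Vector B.ByteString)
      -- fold over the parsed rows, keeping only a strict count so no
      -- thunks (and no old rows) accumulate
      !n   = foldl' (\ !acc _ -> acc + 1) (0 :: Int) rows
  print n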
There are a couple of things to be aware of:
You want to stream the data so that you are only holding in memory a small portion of the input file at any time. You might be able to accomplish this with lazy IO and the lazy-csv package. However, it still is easy to inadvertently hold on to references which keep all of your input in memory. A better option is to use a streaming library like csv-conduit or pipes-csv.
Use ByteString or Text when processing large amounts of string data.
You want to make sure to use strict operations when reducing your data. Otherwise you will just be building up thunks of unevaluated expressions in memory until the very end when you print out the result. One place where thunks could be building up is your foldData function - the word count expressions do not appear to be getting reduced.
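For example, here is a sketch of your foldData with the merged count forced by a bang pattern, so each pair carries an evaluated Int rather than a chain of (+) thunks (everything else unchanged; the combining function assumes groups of size one or two, as in your code):

{-# LANGUAGE BangPatterns #-}
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

foldData :: Ord a => [(a, Int)] -> [(a, Int)] -> [(a, Int)]
foldData lxs rxs = map combind wlist
  where
    wlist = groupBy ((==) `on` fst) $ sortBy (comparing fst) $ lxs ++ rxs
    combind [p]               = p
    combind ((k, a):(_, b):_) = let !s = a + b in (k, s)  -- force the sum now
    combind []                = error "groupBy never produces empty groups"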
Here is an example of a program which will compute the total length of all of the words in each column of a CSV file and does it in constant memory. The main features are:
uses lazy IO
uses the lazy-csv package with (lazy) ByteString instead of String
uses BangPatterns to strictify the computation of the number of lines
uses an unboxed array to hold the column counters
The code:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.ByteString.Lazy (ByteString)
import Text.CSV.Lazy.ByteString
import System.Environment (getArgs)
import Data.List (foldl')
import Data.Int
import Data.Array.IO
import Data.Array.Unboxed
import Control.Monad

type Length = Int64 -- use Int on 32-bit systems

main = do
    (arg:_) <- getArgs
    (line1:lns) <- fmap BS.lines $ BS.readFile arg
    -- line1 contains the header
    let (headers:_) = [ map csvFieldContent r | r <- csvTable (parseCSV line1) ]
        ncols = length headers :: Int
    arr <- newArray (1,ncols) 0 :: IO (IOUArray Int Length)
    let inc i a = do v <- readArray arr i; writeArray arr i (v+a)
    let loop !n [] = return n
        loop !n (b:bs) = do
            let lengths = map BS.length $ head [ map csvFieldContent r | r <- csvTable (parseCSV b) ]
            forM_ (zip [1..] lengths) $ \(i,a) -> inc i a
            loop (n+1) bs
    print headers
    n <- loop 0 lns
    putStrLn $ "n = " ++ show (n :: Int)
    arr' <- freeze arr :: IO (UArray Int Length)
    putStrLn $ "totals = " ++ show arr'
I have had this issue before in another language. The trick is not to read the data into memory, but rather to read it one line at a time. When you read the next line, just overwrite your variables, since you are only looking for a word count.
Just test for an EOF (end-of-file) condition on your I/O stream and exit then. That way you don't have to split the file.
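In Haskell terms, a minimal sketch of that idea might look like this (it only counts lines of the question's data/test_csv, checking for EOF and keeping just a strict accumulator):

{-# LANGUAGE BangPatterns #-}
import System.IO

-- Read one line at a time until EOF, never holding old lines.
countLines :: Handle -> IO Int
countLines h = go 0
  where
    go !n = do
      eof <- hIsEOF h
      if eof
        then return n
        else hGetLine h >> go (n + 1)

main :: IO ()
main = withFile "data/test_csv" ReadMode countLines >>= print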
Hope that helps

memoFix of Data.Function.Memoize does not free memory after itself

I am wondering why the GC never frees memory for main = memo_main. If main = slow_main, the computation takes the same amount of time each time I enter 37. Because of this, the behavior I expected was to see memory usage decrease each time the top-level memoFix returns. Yet it is stuck at 370 MB.
module Main where
import Data.Function.Memoize
memo_fib :: (Int -> Int) -> Int -> Int
memo_fib _ 0 = 1
memo_fib _ 1 = 1
memo_fib f n = f (n-1) + f (n-2)
memo_main = interact $ unlines . map (\x -> show $ memoFix memo_fib $ read x) . lines
slow_fib 0 = 1
slow_fib 1 = 1
slow_fib n = slow_fib (n-1) + slow_fib (n-2)
slow_main = interact $ unlines . map (\x -> show $ slow_fib $ read x) . lines
main :: IO ()
main = memo_main
UPD: Resolved. Memory was actually freed, but not returned to the OS. This is by design of the GHC RTS.
Just a couple notes:
Are you compiling with optimization? Optimization might cause memoFix memo_fib to be lifted out of the lambda, since it doesn't depend on x, thus sharing the memo table across all calls.
Another thing to keep in mind is that the GHC runtime never shrinks the heap. As soon as it has allocated a certain amount of memory, the GC collects it but keeps it around for other parts of the program to use, rather than freeing it to the OS. So the GC may be collecting this after all, but you won't see it with whatever tool you're using to watch it.
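To make the first point concrete, here is a small sketch reusing memo_fib from the question's module (perLine and shared are hypothetical names): perLine re-applies memoFix inside the lambda, while shared binds the memoized function once, which is roughly what the optimizer may produce on its own by floating the constant expression out of the lambda.

-- perLine: memoFix memo_fib is applied inside the lambda, so each input line
-- gets (in principle) its own memo table, which can be collected afterwards.
perLine :: IO ()
perLine = interact $ unlines . map (\x -> show $ memoFix memo_fib $ read x) . lines

-- shared: the memoized function is bound once outside the lambda, so one
-- table is shared across all lines and stays live for the whole run.
shared :: IO ()
shared = let f = memoFix memo_fib
         in interact $ unlines . map (\x -> show $ f $ read x) . lines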

How to use Criterion to measure performance of Haskell programs?

I'm trying to measure the performance of a simple Haar DWT program using the Criterion framework. (It is erroneously slow, but I'll leave that for another question). I can't find any good documentation on the web, unfortunately. My two primary problems are
How can one pass data from one benchmark to another? I want to time each stage of the program.
How does the sampling work, and avoid lazy evaluation reusing its previous computations?
This source is relatively pared down; the first function getRandList generates a list of random numbers; haarStep transforms an input signal into differences and sums, and haarDWT calls the former and recurses on the sums. I'm trying to pass the getRandList to the haarDWT via lazy evaluation, but perhaps my usage is incorrect / unsupported. The timings don't seem to make sense.
{-# LANGUAGE ViewPatterns #-}
import Control.Arrow
import qualified Data.Vector.Unboxed as V
import System.Random
import Criterion.Main

invSqrt2 = 0.70710678118654752440

getRandList :: RandomGen g => g -> Int -> [Float]
getRandList gen 0 = []
getRandList gen n = v:rest where
    (v, gen') = random gen
    rest = getRandList gen' (n - 1)

haarStep :: V.Vector Float -> (V.Vector Float, V.Vector Float)
haarStep = (alternatingOp (-) &&& alternatingOp (+)) where
    alternatingOp op x = V.generate (V.length x `div` 2) (\i ->
        ((x V.! (2 * i)) `op` (x V.! (2 * i + 1))) * invSqrt2)

haarDWT :: V.Vector Float -> V.Vector Float
haarDWT xl@(V.length -> 1) = xl
haarDWT (haarStep -> (d, s)) = haarDWT s V.++ d

main = do
    gen <- getStdGen
    inData <- return $ getRandList gen 2097152
    outData <- return $ haarDWT (V.fromList inData)
    defaultMain [
        bench "get input" $ nf id inData,
        bench "transform" $ nf V.toList outData
        ]
    writeFile "input.dat" (unlines $ map show inData)
    writeFile "output.dat" (unlines $ map show $ V.toList outData)
Finally, I'm getting an error when I try to call it with -s 1; maybe this is just a Criterion bug.
Main: ./Data/Vector/Generic.hs:237 ((!)): index out of bounds (1,1)
Thanks in advance!
The posted benchmark is erroneously slow... or is it?
Are you sure it's erroneous? You're touching (well, the "nf" call is touching) 2 million boxed elements - that's 4 million pointers. You can call this erroneous if you want, but the issue is just what you think you're measuring compared to what you really are measuring.
Sharing Data Between Benchmarks
Data sharing can be accomplished through partial application. In my benchmarks I commonly have
let var = somethingCommon in
  defaultMain [ bench "one" (nf (func1 var) input1)
              , bench "two" (nf (func2 var) input2) ]
Avoiding Reuse in the presence of lazy evaluation
Criterion avoids sharing by separating out your function and your input. You have signatures such as:
funcToBenchmark :: (NFData b) => a -> b
inputForFunc :: a
In Haskell, every time you apply funcToBenchmark inputForFunc it will create a thunk that needs to be evaluated. There is no sharing unless you use the same variable name as a previous computation. There is no automatic memoization - this seems to be a common misunderstanding.
Notice the nuance in what isn't shared. We aren't sharing the final result, but the input is shared. If the generation of the input is what you want to benchmark (i.e. getRandList, in this case) then benchmark that and not just the identity + nf function:
main = do
    gen <- getStdGen
    let inData = getRandList gen size
        inVec  = V.fromList inData
        size   = 2097152
    defaultMain
        [ bench "get input for real" $ nf (getRandList gen) size
        , bench "get input for real and run harrDWT and listify a vector" $ nf (V.toList . haarDWT . V.fromList . getRandList gen) size
        , bench "screw generation, how fast is haarDWT" $ whnf haarDWT inVec ] -- for unboxed vectors whnf is sufficient
Interpreting Data
The third benchmark is rather instructive. Let's look at what Criterion prints out:
benchmarking screw generation, how fast is haarDWT
collecting 100 samples, 1 iterations each, in estimated 137.3525 s
bootstrapping with 100000 resamples
mean: 134.7204 ms, lb 134.5117 ms, ub 135.0135 ms, ci 0.950
Based on a single run, Criterion thinks it will take 137 seconds to perform its 100 samples. About ten seconds later it was done - what happened? Well, the first run forced all the inputs (inVec), which was expensive. The subsequent runs found a value instead of a thunk, and thus we truly benchmarked haarDWT and not the StdGen RNG (which is known to be painfully slow).
