No heap profiling data for module Data.ByteString

I was trying to generate heap memory profile for following naive Haskell code that copies a file:
import System.Environment
import System.IO
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB

naiveCopy :: String -> String -> IO ()
naiveCopy from to = do
  putStrLn $ "From: " ++ from
  putStrLn $ "To: " ++ to
  s <- B.readFile from
  B.writeFile to s

main :: IO ()
main = do
  args <- getArgs
  mapM_ putStrLn args
  naiveCopy (head args) ((head . tail) args)
Command that build the code with ghc 8.0.1:
ghc -o t -rtsopts -prof -fprof-auto t.hs
Command that collect the profiling data:
./t +RTS -p -h -RTS in/data out/data && hp2ps -e8in -c t.hp
where in/data is a fairly big file (approx. 500 MB) that takes the program about 2 seconds to copy.
The problem is that I couldn't get heap profiling data when using the strict Data.ByteString: there's only a small t.hp file without any sample data, which looks like this:
JOB "t in/data out/data +RTS -p -h"
DATE "Thu Aug 4 20:19 2016"
SAMPLE_UNIT "seconds"
VALUE_UNIT "bytes"
BEGIN_SAMPLE 0.000000
END_SAMPLE 0.000000
BEGIN_SAMPLE 0.943188
END_SAMPLE 0.943188
and the corresponding profile chart is empty.
However, I could get heap profiling data if I switched to the lazy version, Data.ByteString.Lazy.
Update: Thanks @ryachza, I added the -i0 RTS option to set the sampling interval and tried again. This time I got sample data for the strict ByteString and it looked reasonable: I was copying a 500 MB file, and the memory allocation peak in the profiling chart was about 500 MB.
./t +RTS -p -h -i0 -RTS in/data out/data && hp2ps -e8in -c t.hp

It appears as though the runtime isn't "getting the chance to measure" the heap. If you add -s to your RTS options, it should print some time and allocation information. When I run this, I see the bytes allocated and total memory use is very high (size of the file), but the maximum residency (and the number of samples) is very low, and while the elapsed time is high the actual "work" time is practically 0.
Adding the RTS option -i0 allowed me to reproducibly visualize the bytestring allocation as PINNED (this is the classification because the byte arrays that bytestring uses internally are allocated in an area in which the GC can't move things). You could experiment with different -h options which associate allocations to different cost centers (for example, -hy should show ARR_WORDS) but it probably wouldn't have much value in this case as the bytestrings are really just "big chunks of raw memory".
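If the goal is to keep residency low rather than just to observe it, the copy can be done in fixed-size chunks so that at most one chunk is pinned at a time instead of the whole 500 MB. This is a minimal sketch of mine, not code from the question; the 64 KiB chunk size is an arbitrary choice:

```haskell
import qualified Data.ByteString as B
import System.IO

-- Copy a file in 64 KiB chunks; heap residency stays around one
-- chunk rather than the whole file.
chunkedCopy :: FilePath -> FilePath -> IO ()
chunkedCopy from to =
  withBinaryFile from ReadMode $ \hIn ->
    withBinaryFile to WriteMode $ \hOut ->
      let loop = do
            chunk <- B.hGetSome hIn 65536
            if B.null chunk
              then return ()
              else B.hPut hOut chunk >> loop
      in loop
```

With this variant the -h profile should show only a small, flat band of PINNED memory rather than one file-sized allocation.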
The references I used to find the RTS options were (clearly I wasn't particular about the GHC version - I can't imagine these flags change frequently):
https://downloads.haskell.org/~ghc/7.0.1/docs/html/users_guide/runtime-control.html
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/profiling.html

Worse forConcurrently performance than sequential

I've written a function that up-samples a file from 48kHz to 192kHz by means of a filter:
upsample :: Coefficients -> FilePath -> IO ()
It takes the filter coefficients and the path of the file to be upsampled, and writes the result to a new file.
I have to up-sample many files so I've written a function to up-sample a full directory in parallel, using forConcurrently_ from Control.Concurrent.Async:
upsampleDirectory :: Directory -> FilePath -> IO ()
upsampleDirectory dir coefPath = do
  files <- getAllFilesFromDirectory dir
  coefs <- loadCoefficients coefPath
  forConcurrently_ files $ upsample coefs
I'm compiling with the -threaded option and running using +RTS -N2. What I see is that up-sampling 2 files sequentially is faster than up-sampling both files in parallel.
Upsampling file1.wav takes 18.863s.
Upsampling file2.wav takes 18.707s.
Upsampling a directory with file1.wav and file2.wav takes 66.250s.
What am I doing wrong?
I've tried to keep this post concise, so ask me if you need more details on some of the functions.
Here are a couple of possibilities. First, make yourself 100% sure you're actually running your program with +RTS -N2 -RTS. I can't tell you how many times I've been benchmarking a parallel program and written:
stack exec myprogram +RTS -N2 -RTS
in place of:
stack exec myprogram -- +RTS -N2 -RTS
and gotten myself hopelessly confused. (The first version runs the stack executable on two processors but the target executable on one!) Maybe add a print =<< getNumCapabilities at the beginning of your main program to be sure.
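A minimal sketch of that sanity check (reportCaps is a name of my choosing):

```haskell
import GHC.Conc (getNumCapabilities)

-- Report how many capabilities the RTS actually got: with a correctly
-- passed +RTS -N2 this prints 2; if stack swallowed the flag, it
-- prints 1.
reportCaps :: IO Int
reportCaps = do
  caps <- getNumCapabilities
  putStrLn ("Capabilities: " ++ show caps)
  pure caps
```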
After confirming you're running on two processors, then the next most likely issue is that your implementation is not running in constant space and is blowing up the heap. Here's a simple test program I used to try to duplicate your problem. (Feel free to use my awesome upsampling filter yourself!)
module Main where

import Control.Concurrent.Async
import System.Environment
import qualified Data.ByteString as B

upsample :: FilePath -> IO ()
upsample fp = do
  c <- B.readFile fp
  let c' = B.pack $ concatMap (replicate 4) $ B.unpack c
  B.writeFile (fp ++ ".out") c'

upsampleFiles :: [FilePath] -> IO ()
upsampleFiles files = forConcurrently_ files upsample

main :: IO ()
main = upsampleFiles =<< getArgs -- upsample every file named on the command line
When I ran this on a single 70meg test file, it ran in 14 secs. When I ran it on two copies in parallel, it ran for more than a minute before it started swapping like mad, and I had to kill it. After switching to:
import qualified Data.ByteString.Lazy as B
it ran in 3.7 secs on a single file, 7.8 secs on two copies on a single processor, and 4.0 secs on two copies on two processors with +RTS -N2.
Make sure you're compiling with optimizations on, profile your program, and make sure it's running in a constant (or at least reasonable) heap space. The above program runs in a constant 100k bytes of heap. A similar version that uses a strict ByteString for reading and a lazy ByteString for writing reads the whole file into memory: the heap grows to 70 megs (the size of the file) within a fraction of a second and then stays constant while the file is processed.
No matter how complicated your filter is, if your program is growing gigabytes of heap, the implementation is broken, and you'll need to fix it before you worry about performance, parallel or otherwise.
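For reference, here is a lazy-ByteString version of the toy upsampler (a sketch of mine, not the answer's exact code). Lazy I/O streams the file chunk by chunk, which is why the heap stays roughly constant:

```haskell
import qualified Data.ByteString.Lazy as BL

-- 4x upsample each byte, reading and writing lazily (chunk by chunk).
upsampleLazy :: FilePath -> IO ()
upsampleLazy fp = do
  c <- BL.readFile fp
  BL.writeFile (fp ++ ".out") (BL.concatMap (BL.replicate 4) c)
```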

Runtime memory Haskell

I was trying to write a program in Haskell, and I wanted a function that, when called, gives me the free memory and the total memory at that instant. I know that in Java I can write something like Runtime.getRuntime().totalMemory() inside a method. I am fairly new to Haskell and can't figure out how to do something like this inside my program.
You can use the getGCStats function in GHC.Stats to get the amount of memory used as of the last GC (currentBytesUsed).
You might have to compile your program with -rtsopts and run it with +RTS -T to enable the stats.
Apparently the API changed a bit in newer GHCs. The following first checks whether you've run with +RTS -T and then (based on https://ofb.net/~elaforge/karya/hscolour/Util/Memory.html ) selects the correct API with CPP. Note that the predefined macro is __GLASGOW_HASKELL__ (802 means GHC 8.2, where the new RTSStats API arrived); the old GCStats API was removed in GHC 8.6:
{-# LANGUAGE CPP #-}
import GHC.Int (Int64)
import qualified GHC.Stats

rtsMemUsage :: IO (Maybe Int64)
rtsMemUsage = do
#if __GLASGOW_HASKELL__ >= 802
  enabled <- GHC.Stats.getRTSStatsEnabled
  if enabled
    then Just . fromIntegral . GHC.Stats.gcdetails_mem_in_use_bytes . GHC.Stats.gc
           <$> GHC.Stats.getRTSStats
    else return Nothing
#else
  enabled <- GHC.Stats.getGCStatsEnabled
  if enabled
    then Just . GHC.Stats.currentBytesUsed <$> GHC.Stats.getGCStats
    else return Nothing
#endif
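For GHC 8.2 and later, a minimal usage sketch (printLiveBytes is a name of my choosing) that prints the live heap when the program is started with +RTS -T and stays silent otherwise:

```haskell
import qualified GHC.Stats as Stats
import Control.Monad (when)

-- Report live heap bytes as of the last GC if the program was
-- started with +RTS -T; otherwise do nothing.
printLiveBytes :: IO ()
printLiveBytes = do
  ok <- Stats.getRTSStatsEnabled
  when ok $ do
    s <- Stats.getRTSStats
    putStrLn ("Live bytes after last GC: "
              ++ show (Stats.gcdetails_live_bytes (Stats.gc s)))
```

Remember to link with -rtsopts so the -T flag is accepted at run time.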

Basic I/O performance in Haskell

Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for loop in plain C using the non-buffered write(2) syscall? I am trying to collect Haskell gotchas.
Even this slow C solution is much faster than Haskell
int i;
char buf[16];
for (i = 0; i <= 100000000; i++) {
    sprintf(buf, "%d\n", i);
    write(1, buf, strlen(buf));
}
Okay, on my box the C code, compiled with gcc -O3, takes about 21.5 seconds to run, and the original Haskell code about 56 seconds. So not a factor of 5, a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers, which is a bit slower because it involves a check upfront, and then works with boxed Ints, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.
On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes you read this right. Haskell is several times faster than C. Or rather, normally buffered Haskell is faster than C with write(2) non-buffered syscall.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc 4.6.1, ghc 7.6.1.
The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO
main = hPutBuilder stdout $ build [0..100000000::Int]

build = foldr add_line mempty
  where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to the Haskell approach you used:
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's surprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html . I won't compare it with your C, since my results were much slower than Daniel's and n.m.'s, so I figure something was going wrong.
Edit: I made the imports consistent with all versions of bytestring-0.10.x. It occurred to me that the following might be clearer, the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
  where unlines_ = mconcat . map (<> charUtf8 '\n')
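In bytestring 0.10.4 and later the builder lives in Data.ByteString.Builder (the Data.ByteString.Lazy.Builder modules above are deprecated aliases). A minimal sketch of the same idea against the current module name; buildLines is a name of my choosing:

```haskell
import Data.ByteString.Builder (Builder, hPutBuilder, intDec, charUtf8, toLazyByteString)
import Data.Monoid ((<>))
import System.IO (stdout)

-- Builder equivalent of unlines . map show for Ints.
buildLines :: [Int] -> Builder
buildLines = foldMap (\n -> intDec n <> charUtf8 '\n')
```

Then main = hPutBuilder stdout (buildLines [0 .. 100000000]) reproduces the benchmark above.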

Can't get C2HS with "foreign" Pointers to work

General Information
I'm currently experimenting with the C->Haskell (C2HS) Interface Generator for Haskell. At first glance, it was just awesome: I interfaced a rather complicated C++ library (using a small extern C wrapper) in just a couple of hours. (And I had never done any FFI before.)
There was just one problem: How to get the memory allocated in the C/C++ library freed? I found {#pointer ... foreign #} in the C2HS documentation and that looks exactly like what I'm after. Since my C-wrapper turns the C++ library into a library with referential transparency with a functional interface, the Haskell Storage Manager should be able to do the hard work for me :-). Unfortunately, I was not able to get this working. To better explain my problem, I set up a small demo project on GitHub that has the same properties as the C/C++ library+wrapper but hasn't the overhead. As you can see, the library is completely safe to use with pure unsafe FFI.
Demo Project
On GitHub, I created a small demo project organized as follows:
C library
The C library is very simple and useless: you can pass it an integer and get as many integers (currently [0..n]) back from the library. Remember: the library is useless, just a demo. The interface is quite simple, too: the function LTIData lti_new_data(int n) will, after being passed an integer, return some kind of opaque object containing data allocated by the C library. The library also has two accessor functions, int lti_element_count(LTIData data) and int lti_get_element(LTIData data, int n); the former returns the number of elements and the latter returns element n. And last but not least, the user of the library should free the opaque LTIData after use, via void lti_free_data(LTIData data).
Experiment/C2HSBinding/Foreign/lib_to_interface.h
Experiment/C2HSBinding/Foreign/lib_to_interface.c
Low-level Haskell Binding
The low-level Haskell Binding is set up using C2HS, you can find it in
Experiment/C2HSBinding/Foreign/HsLTI.chs
High-level Haskell API
For fun I also set up kind of a high-level Haskell API using the low-level API binding and a simple driver program that uses the high-level API. Using the driver program and e.g. valgrind one is easily able to see the leaked memory (for every parameter p_1, p_2, ..., p_n the library does \sum_{i = 1..n} 1 + p_i allocations; easily observable as below):
$ valgrind dist/build/TestHsLTI/TestHsLTI 100 2>&1 | grep -e allocs -e frees
==22647== total heap usage: 184 allocs, 74 frees, 148,119 bytes allocated
$ valgrind dist/build/TestHsLTI/TestHsLTI 100 100 2>&1 | grep -e allocs -e frees
==22651== total heap usage: 292 allocs, 80 frees, 181,799 bytes allocated
$ valgrind dist/build/TestHsLTI/TestHsLTI 100 100 100 2>&1 | grep -e allocs -e frees
==22655== total heap usage: 400 allocs, 86 frees, 215,479 bytes allocated
$ valgrind dist/build/TestHsLTI/TestHsLTI 100 100 100 100 2>&1 | grep -e allocs -e frees
==22659== total heap usage: 508 allocs, 92 frees, 249,159 bytes allocated
Current State of the Demo
You should be able to clone, compile and run the project by simply typing git clone https://github.com/weissi/c2hs-experiments.git && cd c2hs-experiments && cabal configure && cabal build && dist/build/TestHsLTI/TestHsLTI
So what's the Problem again?
The problem is that the project only uses Foreign.Ptr and not the "managed" version Foreign.ForeignPtr via C2HS's {#pointer ... foreign #}, and I can't get the latter to work. In the demo project I also added a .chs file trying to use these foreign pointers, but it does not work :-(. I tried very hard but didn't have any success.
And there is one thing I don't understand, too: how to tell GHC, via C2HS, how to free the library's data. The demo project's library provides a function void lti_free_data(LTIData data) that should be called to free the memory, but GHC can't guess that!?! If GHC uses a regular free(), not all of the memory will get freed :-(.
Problem solved: I found this file doing something similar on the internet and was able to solve it :-).
All it needed was some boilerplate marshalling code:
foreign import ccall "lib_to_interface.h &lti_free_data"
  ltiFreeDataPtr :: FunPtr (Ptr LTIDataHs -> IO ())

newObjectHandle :: Ptr LTIDataHs -> IO LTIDataHs
newObjectHandle p = do
  fp <- newForeignPtr ltiFreeDataPtr p
  return $ LTIDataHs fp
Here's the final managed (ForeignPtr) version of the .chs file.
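The same pattern can be exercised without any C of your own by using the standard free finalizer from Foreign.Marshal.Alloc in place of lti_free_data. This is a Haskell-only sketch (newManagedCInt is a name of my choosing):

```haskell
import Foreign
import Foreign.C.Types (CInt)

-- Wrap freshly allocated memory in a ForeignPtr whose finalizer
-- (plain free here, standing in for lti_free_data) runs when the
-- GC drops the last reference.
newManagedCInt :: CInt -> IO (ForeignPtr CInt)
newManagedCInt x = do
  p <- malloc                     -- C-side allocation
  poke p x                        -- initialise it
  newForeignPtr finalizerFree p   -- attach the free() finalizer
```

Access goes through withForeignPtr, which keeps the pointer alive for the duration of the action.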

Tools for analyzing performance of a Haskell program

While solving some Project Euler problems to learn Haskell (so currently I'm a complete beginner), I came across Problem 12. I wrote this (naive) solution:
-- Get the number of divisors of n
numDivs :: Integer -> Integer
numDivs n = toInteger $ length [ x | x <- [2 .. (n `quot` 2) + 1], n `rem` x == 0 ] + 2

-- Generate a list of triangular numbers
triaList :: [Integer]
triaList = [ foldr (+) 0 [1..n] | n <- [1..] ]

-- The same, recursively
triaList2 :: [Integer]
triaList2 = go 0 1
  where go cs n = (cs + n) : go (cs + n) (n + 1)

-- Find the first triangular number with more than n divisors
sol :: Integer -> Integer
sol n = head $ filter (\x -> numDivs x > n) triaList2
This solution for n=500 (sol 500) is extremely slow (it has been running for more than 2 hours now), so I wondered how to find out why it is so slow. Are there any commands that tell me where most of the computation time is spent, so I know which part of my Haskell program is slow? Something like a simple profiler.
To be clear, I'm not asking for a faster solution, but for a way to find that solution. How would you start if you had no Haskell knowledge?
I wrote two triaList functions but found no way to test which one is faster; this is where my problems start.
Thanks
how to find out why this solution is so slow. Are there any commands that tell me where most of the computation time is spent, so I know which part of my Haskell program is slow?
Precisely! GHC provides many excellent tools, including:
runtime statistics
time profiling
heap profiling
thread analysis
core analysis
comparative benchmarking
GC tuning
A tutorial on using time and space profiling is part of Real World Haskell.
GC Statistics
Firstly, ensure you're compiling with ghc -O2. And make sure you're using a modern GHC (e.g. GHC 6.12.x).
The first thing we can do is check that garbage collection isn't the problem.
Run your program with +RTS -s
$ time ./A +RTS -s
./A +RTS -s
749700
9,961,432,992 bytes allocated in the heap
2,463,072 bytes copied during GC
29,200 bytes maximum residency (1 sample(s))
187,336 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 19002 collections, 0 parallel, 0.11s, 0.15s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 13.15s ( 13.32s elapsed)
GC time 0.11s ( 0.15s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 13.26s ( 13.47s elapsed)
%GC time 0.8% (1.1% elapsed)
Alloc rate 757,764,753 bytes per MUT second
Productivity 99.2% of total user, 97.6% of total elapsed
./A +RTS -s 13.26s user 0.05s system 98% cpu 13.479 total
Which already gives us a lot of information: you only have a 2M heap, and GC takes up 0.8% of time. So no need to worry that allocation is the problem.
Time Profiles
Getting a time profile for your program is straightforward: compile with -prof -auto-all
$ ghc -O2 --make A.hs -prof -auto-all
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
And, for N=200:
$ time ./A +RTS -p
749700
./A +RTS -p 13.23s user 0.06s system 98% cpu 13.547 total
which creates a file, A.prof, containing:
Sun Jul 18 10:08 2010 Time and Allocation Profiling Report (Final)
A +RTS -p -RTS
total time = 13.18 secs (659 ticks @ 20 ms)
total alloc = 4,904,116,696 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
numDivs Main 100.0 100.0
Indicating that all your time is spent in numDivs, and it is also the source of all your allocations.
Heap Profiles
You can also get a breakdown of those allocations by running with +RTS -p -hy, which creates A.hp; you can view it by converting it to a PostScript file (hp2ps -c A.hp), generating a graph
which tells us there's nothing wrong with your memory use: it is allocating in constant space.
So your problem is algorithmic complexity of numDivs:
toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2
Fix that, which is 100% of your running time, and everything else is easy.
Optimizations
This expression is a good candidate for the stream fusion optimization, so I'll rewrite it
to use unboxed Data.Vector, like so:
import qualified Data.Vector.Unboxed as U

numDivs n = fromIntegral $
    2 + U.length (U.filter (\x -> fromIntegral n `rem` x == 0)
                           (U.enumFromN 2 ((fromIntegral n `div` 2) + 1) :: U.Vector Int))
Which should fuse into a single loop with no unnecessary heap allocations. That is, it will have better complexity (by constant factors) than the list version. You can use the ghc-core tool (for advanced users) to inspect the intermediate code after optimization.
Testing this, ghc -O2 --make Z.hs
$ time ./Z
749700
./Z 3.73s user 0.01s system 99% cpu 3.753 total
So it reduced running time for N=150 by 3.5x, without changing the algorithm itself.
Conclusion
Your problem is numDivs. It is 100% of your running time, and has terrible complexity. Think about numDivs, and how, for example, for each N you are generating [2 .. n div 2 + 1] N times.
Try memoizing that, since the values don't change.
To measure which of your functions is faster, consider using criterion, which will provide statistically robust information about sub-microsecond improvements in running time.
Addenda
Since numDivs is 100% of your running time, touching other parts of the program won't make much difference,
however, for pedagogical purposes, we can also rewrite those using stream fusion.
We can also rewrite triaList, and rely on fusion to turn it into the loop you wrote by hand in triaList2,
which is a "prefix scan" function (aka scanl):
triaList = U.scanl (+) 0 (U.enumFromTo 1 top)
  where
    top = 10 ^ 6
Similarly for sol:
sol :: Int -> Int
sol n = U.head $ U.filter (\x -> numDivs x > n) triaList
With the same overall running time, but a bit cleaner code.
Dons' answer is great, and it avoids being a spoiler by not handing you a direct solution to the problem.
Here I want to suggest a little tool that I wrote recently. It saves you the time of writing SCC annotations by hand when you want a more detailed profile than the default ghc -prof -auto-all gives. Besides that, it's colorful!
Here's an example with the code you gave (*); green is OK, red is slow:
All the time goes in creating the list of divisors. This suggests a few things you can do:
1. Make the filtering n rem x == 0 faster; but since it's a built-in function, it's probably already fast.
2. Create a shorter list. You've already done something in that direction by checking only up to n quot 2.
3. Throw away the list generation completely and use some math to get a faster solution. This is the usual way for project Euler problems.
(*) I got this by putting your code in a file called eu13.hs, adding a main function main = print $ sol 90. Then running visual-prof -px eu13.hs eu13 and the result is in eu13.hs.html.
Haskell-related note: triaList2 is of course faster than triaList, because the latter performs a lot of unnecessary computations. It takes quadratic time to compute the first n elements of triaList, but linear for triaList2. There is another elegant (and efficient) way to define an infinite lazy list of triangle numbers:
triaList = 1 : zipWith (+) triaList [2..]
Math related note: there is no need to check all divisors up to n / 2, it's enough to check up to sqrt(n).
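The square-root trick can be sketched like this (numDivs' is a hypothetical replacement of mine, not code from the answers): every divisor d <= sqrt n pairs with n `div` d, so scanning up to sqrt n finds them all.

```haskell
-- Count divisors of n by scanning only up to sqrt n, counting both
-- members of each divisor pair (d, n `div` d); a perfect square's
-- root is counted once.
numDivs' :: Int -> Int
numDivs' n = sum [ if d * d == n then 1 else 2
                 | d <- takeWhile (\d -> d * d <= n) [1 ..]
                 , n `rem` d == 0 ]
```

For example, numDivs' 28 counts 1, 2, 4 and their partners 28, 14, 7, giving 6 divisors in 5 trial divisions instead of 14.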
You can run your program with flags to enable time profiling. Something like this:
./program +RTS -P -sprogram.stats -RTS
That should run the program and produce a file called program.stats which will have how much time was spent in each function. You can find more information about profiling with GHC in the GHC user guide. For benchmarking, there is the Criterion library. I've found this blog post has a useful introduction.