I understand from here that to use ThreadScope I need to compile with the event log and RTS options enabled, e.g. "-rtsopts -eventlog -threaded".
I am using Stack, so my compilation call looks like:
$ stack ghc -- -O2 -funfolding-use-threshold=16 -optc-O3 -rtsopts -eventlog -threaded mycoolprogram.hs
Whereas normally, I do:
$ stack ghc -- -O2 -funfolding-use-threshold=16 -optc-O3 -threaded mycoolprogram.hs
This compiles fine. However, my program takes exactly two positional arguments:
./mycoolprogram arg1 arg2
I'm trying to add the RTS options +RTS -N2 -l, like so:
./mycoolprogram arg1 arg2 -- +RTS -N2 -l
Or
./mycoolprogram +RTS -N2 -l -- arg1 arg2
How can I run my program with arguments going into System.Environment.getArgs (like, e.g., here) and simultaneously include these profiling flags?
As @sjakobi said, you can use the "+RTS ... -RTS other arguments" or "other arguments +RTS ..." forms (shown below), but there is also the option of passing them in the GHCRTS environment variable:
GHCRTS='-N2 -l' ./mycoolprogram arg1 arg2
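For illustration, here is what the two argument-order forms mentioned above look like with the question's program (in both cases arg1 and arg2 still reach getArgs, because the RTS consumes everything between +RTS and -RTS, or to the end of the command line):
./mycoolprogram +RTS -N2 -l -RTS arg1 arg2
./mycoolprogram arg1 arg2 +RTS -N2 -l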
Lots more info is available in the GHC users guide.
I'm running "perf" in the following way:
perf record -a --call-graph -p some_pid
perf report --call-graph --stdio
Then, I see this:
1.60%  my_binary  my_binary  [.] my_func
       |
       --- my_func
          |
          |--71.10%-- (nil)
          |           (nil)
          |
           --28.90%-- 0x17f310000000a
I can't see which functions call my_func(). I see "nil" and "0x17f310000000a" instead. Am I doing something wrong? It is probably not a debug-info problem, because some symbols are shown while others are not.
More info:
I'm running CentOS 6.2 (kernel 2.6.32-220.4.1).
perf rpm - perf-2.6.32-279.5.2.el6.x86_64.
Make sure you compiled the code with the -fno-omit-frame-pointer gcc option; without frame pointers, perf often cannot unwind the call stack and shows raw addresses instead of caller names.
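For example, a hypothetical compile line (the source and binary names are placeholders; -fno-omit-frame-pointer is the important part, and -g adds the symbol information perf needs):
$ gcc -g -fno-omit-frame-pointer -O2 my_func.c -o my_binary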
You're almost there; you're just missing the -G option (you might need a more recent perf than the one installed on your system):
$ perf report --call-graph --stdio -G
From perf help report:
-G, --inverted
        alias for inverted caller based call graph.
I'm comparing the performance of two haskell programs running the same computation.
The first one is sequential:
main :: IO ()
main = putStr $ unlines . map (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)
The second one uses Control.Parallel.Strategies:
import Control.Parallel.Strategies
main :: IO ()
main = putStr $ unlines . parMap rdeepseq (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)
In both cases, pow is modular exponentiation, naively implemented as:
pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * (pow a (b-1) m) `mod` m
The sequential program runs in about 3 seconds using, as expected, 100% CPU.
$ stack ghc seq.hs -- -O2
$ \time -f "%e s - %P" ./seq > /dev/null
2.96 s - 100%
The parallel program also runs in about 3 seconds using 100% CPU when limited to a single core.
$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.14 s - 99%
But when I ran it on 4 cores, I did not observe the performance gain I expected:
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
3.31 s - 235%
Even more surprisingly, the sequential program uses more than 100% CPU when run on several cores:
$ stack ghc seq.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./seq +RTS -N4 > /dev/null
3.26 s - 232%
How can those results be explained?
EDIT - As advised by @RobertK and @Yuras, I replaced rdeepseq with rpar and it did fix the initial issue. However, the performance is still much less than I expected:
$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.12 s - 99%
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
1.91 s - 368%
The execution time is barely halved, even though all 4 cores are busy more than 90% of the time on average.
Also, some parts of the ThreadScope graph look very sequential.
First of all, rdeepseq seems to be buggy. Try running ./par +RTS -N4 -s, and you'll see no sparks created. That is why you don't see any speedup on 4 cores. Use rnf x `pseq` return x instead.
Also note the GC statistics in the +RTS -s output: GC actually takes most of the CPU. With -N4 you have 4 parallel GC threads, and they take more time. That is why the sequential program uses much more CPU on 4 cores: basically you have 3 GC threads sitting idle in a spin lock, waiting for synchronization. They do nothing useful, but eat CPU in a busy loop. Try limiting the number of parallel GC threads with the -qn1 option.
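For example, re-running the measurement from the question with a single GC thread (note that -qn is only available in reasonably recent GHCs):
$ \time -f "%e s - %P" ./par +RTS -N4 -qn1 > /dev/null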
Regarding the performance gain: you should not expect perfect scaling. Also, I think you have 1 fizzled spark: it is evaluated in parallel, but its result is not used.
Added: Comparing with the Python implementation you linked in the comments, I see that you are using a completely different algorithm in Haskell. A more or less similar approach is the following (requires BangPatterns):
{-# LANGUAGE BangPatterns #-}

pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
  where
    go !r 0  = r
    go r  b' = go ((r * a) `mod` m) (pred b')
Your original algorithm uses the stack to build the result, so it is bound by GC, not by the actual computation. That is why you don't see a big speedup. With the new one I see a 3x speedup (I had to increase the amount of work to see the speedup, because otherwise the algorithm finishes too quickly).
I do not believe your parallel example is actually parallel. parMap accepts a strategy, and your strategy simply tells it to perform a deepseq. You need to combine this strategy with one that defines the parallel behaviour, e.g. rpar. You are telling Haskell 'perform this map, using this strategy', and right now your strategy does not define any parallel behaviour.
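As a concrete sketch of that combination (my illustration, not code from the question), rparWith from Control.Parallel.Strategies wraps a strategy so that it is applied inside a spark:

import Control.Parallel.Strategies

-- modular exponentiation, as defined in the question
pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * pow a (b-1) m `mod` m

main :: IO ()
-- rparWith rdeepseq sparks the full evaluation of each list element,
-- so the eight results can be computed on different capabilities
main = putStr $ unlines (parMap (rparWith rdeepseq) (show . solve) [100..107])
  where solve x = pow x (10^7) 982451653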
Also make sure that you compile your program specifying the -rtsopts flag (I do not know if stack does this for you, but ghc requires it to enable runtime options).
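For instance, adapting the compile command used earlier in the question:
$ stack ghc par.hs -- -O2 -threaded -rtsopts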
How do I pass +RTS options to a program run with stack exec?
I've added -rtsopts to ghc-options in my cabal file, and built a program with stack build. If I run the program manually, both normal and +RTS command-line arguments work:
>.stack-work\dist\ca59d0ab\build\iterate-strict-exe\iterate-strict-exe.exe 25 +RTS -s
OK
3,758,156,184 bytes allocated in the heap
297,976 bytes copied during GC
...
But if I run it with stack exec, only the normal options reach the program:
>stack exec iterate-strict-exe -- 25 +RTS -s
OK
Other things that don't work
If I juggle the order of the arguments around as suggested by @epsilonhalbe I get the same result.
>stack exec -- iterate-strict-exe 25 +RTS -s
OK
There doesn't seem to be the suggested --rts-options option to pass to stack exec.
>stack exec --rts-options "-s" -- iterate-strict-exe 25
Invalid option `--rts-options'
Usage: stack exec CMD [-- ARGS (e.g. stack ghc -- X.hs -o x)] ([--plain] |
[--[no-]ghc-package-path] [--[no-]stack-exe] [--package ARG])
[--help]
Execute a command
I'm using stack version 1.1.2
>stack --version
Version 1.1.2, Git revision c6dac65e3174dea79df54ce6d56f3e98bc060ecc (3647 commits) x86_64 hpack-0.14.0
The same after a stack upgrade to 1.4.0.
Passing the entire command as a string (another suggestion) results in a command with that name not being found:
>stack exec -- "iterate-strict-exe 25 +RTS -s"
Executable named iterate-strict-exe 25 +RTS -s not found on path: ...
It looks like you are on Windows and encountering GHC bug #13287 (to be fixed in 8.2). See also stack issues 2022 and 2640. Apparently a workaround is to add --RTS before the --, like:
stack exec iterate-strict-exe --RTS -- 25 +RTS -s
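If upgrading is not an option, the GHCRTS environment variable mentioned earlier should also sidestep the problem, since no +RTS arguments have to travel through stack exec at all. On cmd.exe that would look something like this (untested on this setup; it relies on the -rtsopts flag the program was already built with):
>set GHCRTS=-s
>stack exec iterate-strict-exe -- 25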
I would like to profile a program that is being managed by Stack. The file was built with the following command:
stack build --executable-profiling --library-profiling --ghc-options="-fprof-auto -rtsopts"
And run with this command:
stack exec myProgram.exe -- inputArg +RTS -p
I know that the program has run (from the output file), but while I expect a myProgram.prof file to be produced as well, I cannot find this file.
If I execute the program without using stack the profiling file is produced, but is there a way to get this to work using Stack?
-- stops the RTS from processing further command-line arguments, but it is passed through to the program. So your -- is visible to both stack and myProgram.exe, and therefore the +RTS -p flags are not visible to myProgram.exe's RTS. Instead, try:
stack exec -- myProgram.exe inputArg +RTS -p
To prevent the escape of privileged data, setcap executables on Linux don't dump core:
ijw@build$ cat > test.c
main() { abort(); }
ijw@build$ gcc test.c
test.c: In function ‘main’:
test.c:1: warning: incompatible implicit declaration of built-in function ‘abort’
ijw@build$ ./a.out
Aborted (core dumped)
ijw@build$ sudo setcap "cap_net_admin=+ep" a.out
ijw@build$ ./a.out
Aborted
Is there any way to enable it when you're debugging and actually want to see the core file?
I have two answers after more research.
You can change the system behaviour in its entirety. This isn't really suitable beyond a one-user development machine, but it does the trick:
echo 1 > /proc/sys/fs/suid_dumpable
Tested, works.
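The same setting can also be applied through the sysctl interface, which is equivalent to the echo above:
sysctl -w fs.suid_dumpable=1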
You can change the behaviour of the specific program by having it call prctl() (declared in sys/prctl.h):
prctl(PR_SET_DUMPABLE, 1);
In this way, the privileged program determines for itself that it should be dumpable, and the system as a whole is not affected.
I've not tried this one.