Performance issue with parallel computation in Haskell

Performance issue with parallel computation in Haskell - haskell

I'm comparing the performance of two haskell programs running the same computation.
The first one is sequential:
main :: IO()
main = putStr $ unlines . map (show . solve) $ [100..107]
where solve x = pow x (10^7) (982451653)
The second one uses Control.Parallel.Strategies:
import Control.Parallel.Strategies
main :: IO()
main = putStr $ unlines . parMap rdeepseq (show . solve) $ [100..107]
where solve x = pow x (10^7) (982451653)
In both cases, pow is the modular exponentiation naively implemented as:
pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * (pow a (b-1) m) `mod` m
The sequential program runs in about 3 seconds using, as expected, 100% CPU.
$ stack ghc seq.hs -- -O2
$ \time -f "%e s - %P" ./seq > /dev/null
2.96 s - 100%
The parallel program also runs in about 3 seconds using 100% CPU when limited to a single core.
$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.14 s - 99%
But when I ran it on 4 cores, I did not observe the performance gain I was expected:
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
3.31 s - 235%
Even more surprising, the sequential program uses more than 100% CPU when run on several cores:
$ stack ghc seq.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./seq +RTS -N4 > /dev/null
3.26 s - 232%
How can those results be explained?
EDIT - As advised by #RobertK and #Yuras, I replaced the rdeeseq by rpar and it did fix the initial issue. However, the performance is still much less than what I expected:
$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.12 s - 99%
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
1.91 s - 368%
The execution time is barely divided by two even though the 4 cores are running more than 90% of the time on average.
Also, some parts of the threadscope graph look very sequential:

First of all, rdeepseq seems to be buggy. Try to run ./seq +RTS -N4 -s, and you'll see no sparks created. That is why you don't see any speedup on 4 cores. Use rnf x ‘pseq‘ return x instead.
Also note GC statictics in +RTS -s output. Actually GC takes most of the CPU. With -N4 you have 4 parallel GC running, they take more time. That is why sequencial progral takes much more CPU on 4 cores. Basically you have 3 GC threads idle in spin lock waiting for synchronization. The do nothing useful, by eat CPU in a busy loop. Try to limit number of parallel GC threads using -qn1 option.
Regarding performance gain. You should not expect perfect scaling. Also I think you have 1 fizzled spark -- it is evaluated in parallel, but its result is not used.
Added: Comparing with the python implementation you linked in the comments, I see that you are using completely different algorithm in haskell. More or less similar approach is the next (requires BangPatterns):
pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
where
go !r 0 = r
go r b' = go ((r * a) `mod` m) (pred b')
Your ariginal algorithm uses stack to build the result, so it is bound by GC, not by actuall computation. So you don't see big speedup. With new one I see 3x speedup (I had to increase amount of work to see the speedup because the algorithm becomes too slow).

I do not believe your parallel example is parallel. parMap accepts a strategy, and your strategy simply tells it to perform a deepseq. You need to combine this strategy with one that defines the parallel behaviour, e.g rpar. You are telling haskell 'perform this map, using this strategy', and right now your strategy does not define any parallel behaviour.
Also make sure that you compile your program specifying the -rtsopts flag (I do not know if stack does this for you, but ghc requires it to enable runtime options).

Related

Within RT Linux, is it possible to enforce linux background tasks to run on a specific core?

Assumptions:
RT Linux patch is applied
POSIX RT threads are available which can have core affinity applied
Using a multi-core processor, for simplicity a dual core with cores labelled A and B
No interrupts
In a system which is designed to be Real Time, at least Soft Real Time rather than Hard Real Time. I was curious if you could apply core affinity for the background linux tasks that can "disrupt" the real time tasks to a single core, core A, and leave the real time tasks to run on the other core, B.
This way the linux kernel can execute its low priority tasks whilst being able to perform real time task execution.
This is quite a simplistic explanation but I'm curious simply if it is possible to do so. Granted there is other factors at play which can reduce determinism within a system such as interrupts but assumption removes those from this current scenario.
Any guidance to material or corrections on this question are welcome

CPUSETS
A very easy approach to isolate cores inside Linux is to use CPUSETS.
See the excellent documentation there for further information.
The following example script assumes a cpu consisting of four cpu-cores. It will use cores 0-2 for all available tasks and reserve core 3 for tasks that are specifically attached to that core.
This is done by creating two cpusets "rt0" and "system".
Global load-balancing gets disabled and only reenabled on the "system" set, which makes the remaining core inaccessible to the load-balancer.
To attach tasks to the reserved core you have to write its pid to the corresponding tasks file.
$ mkdir /dev/cpuset
$ mount -t cgroup -o cpuset cpuset /dev/cpuset
$ cd /dev/cpuset
$ /bin/echo 1 > cpuset.cpu_exclusive
$ /bin/echo 0 > cpuset.sched_load_balance
$ mkdir rt0
$ /bin/echo 3 > rt0/cpuset.cpus
$ /bin/echo 0 > rt0/cpuset.mems
$ /bin/echo 1 > rt0/cpuset.cpu_exclusive
$ /bin/echo 0 > rt0/cpuset.sched_load_balance
$ /bin/echo $RT_PROC_PID > rt0/tasks
$ mkdir system
$ /bin/echo 0-2 > system/cpuset.cpus
$ /bin/echo 0 > system/cpuset.mems
$ /bin/echo 1 > system/cpuset.cpu_exclusive
$ /bin/echo 1 > system/cpuset.sched_load_balance
$ for pid in $(cat tasks); do /bin/echo $pid > system/tasks; done
Systemd slices
A newer approach to deal with resource partitioning is systemd-slices. As far as I know it only works with Control Group v2 and obviously systemd. Since I did never really work with it I cannot provide further information here, but I thought it is worth mentioning.

Profiling with ThreadScope with command line arguments in Haskell

I understand from here that to use ThreadScope I need to compile with an eventlog and rtsoptions, eg "-rtsopts -eventlog -threaded"
I am using Stack, so my compilation call looks like:
$ stack ghc -- -O2 -funfolding-use-threshold=16 -optc-O3 -rtsopts -eventlog -threaded mycoolprogram.hs
Whereas normally, I do:
$ stack ghc -- -O2 -funfolding-use-threshold=16 -optc-O3 -threaded mycoolprogram.hs
This compiles fine. However, my program takes 2 and only 2 positional arguments:
./mycoolprogram arg1 arg2
I'm trying to add the RTS options +RTS -N2 -l, like so:
./mycoolprogram arg1 arg2 -- +RTS -N2 -l
Or
./mycoolprogram +RTS -N2 -l -- arg1 arg2
How can I simultaneously run my program with arguments going into System.Environment.getArgs (like eg here) and also include these profiling flags?

As #sjakobi said you can use the +RTS ... -RTS other arguments or other arguments +RTS ... forms but there is also the option to pass them in an environment variable GHCRTS:
GHCRTS='-N2 -l' ./mycoolprogram arg1 arg2
Lost of more info is available in the GHC users guide.

/usr/bin/time and time in multicore platform

guys. I am confused about the result below:
1). time xxxxx
real 0m28.942s
user 0m28.702s
sys 0m0.328s
2). /usr/bin/time -p xxxxx
real 28.48
user 0.00
sys 0.13
so. I have some question(user: 0m28.702s != 0, sys: 0m0.328s != 0.13):
what's different between time and /usr/bin/time ?
what's different in differnt cpu platform, one core or multicore ?
any suggestion？

It's quite easy to find out the answer to your first question using type:
$ type time
time is a shell keyword
$ type /usr/bin/time
/usr/bin/time is /usr/bin/time
So the first command uses a bash built-in, while the latter defers to an external program. However, not knowing what system you are using, I have no idea where that program comes from. On Gentoo Linux, there's no /usr/bin/time by default, and the only implementation available is GNU time that has different output.
That said, I have tried a command similar to yours (assuming it's working on a 1G file), and got the following results:
$ time sed -e 's/0//g' big-file > big-file2
real 0m40.600s
user 0m31.295s
sys 0m4.174s
$ /usr/bin/time sed -e 's/0//g' big-file > big-file2
35.06user 3.31system 0:40.58elapsed 94%CPU (0avgtext+0avgdata 3488maxresident)k
8inputs+2179176outputs (0major+276minor)pagefaults 0swaps
As you can see, the numbers are similar.
Then, given your results (0 userspace time is quite impossible) I'd say that your /usr/bin/time is simply broken. This might be worth reporting a bug to its author.

how to shield a cpu from the linux scheduler (prevent it scheduling threads onto that cpu)?

It is possible to use sched_setaffinity to pin a thread to a cpu, increasing performance (in some situations)
From the linux man page:
Restricting a process to run on a single CPU also avoids the
performance cost caused by the cache invalidation that occurs when a
process ceases to execute on one CPU and then recommences execution on
a different CPU
Further, if I desire a more real-time response, I can change the scheduler policy for that thread to SCHED_FIFO, and up the priority to some high value (up to sched_get_priority_max), meaning the thread in question should always pre-empt any other thread running on its cpu when it becomes ready.
However, at this point, the thread running on the cpu which the real-time thread just pre-empted will possibly have evicted much of the real-time thread's level-1 cache entries.
My questions are as follows:
Is it possible to prevent the scheduler from scheduling any threads onto a given cpu? (eg: either hide the cpu completely from the scheduler, or some other way)
Are there some threads which absolutely have to be able to run on that cpu? (eg: kernel threads / interrupt threads)
If I need to have kernel threads running on that cpu, what is a reasonable maximum priority value to use such that I don't starve out the kernel threads?

The answer is to use cpusets. The python cpuset utility makes it easy to configure them.
Basic concepts
3 cpusets
root: present in all configurations and contains all cpus (unshielded)
system: contains cpus used for system tasks - the ones which need to run but aren't "important" (unshielded)
user: contains cpus used for "important" tasks - the ones we want to run in "realtime" mode (shielded)
The shield command manages these 3 cpusets.
During setup it moves all movable tasks into the unshielded cpuset (system) and during teardown it moves all movable tasks into the root cpuset.
After setup, the subcommand lets you move tasks into the shield (user) cpuset, and additionally, to move special tasks (kernel threads) from root to system (and therefore out of the user cpuset).
Commands:
First we create a shield. Naturally the layout of the shield will be machine/task dependent. For example, say we have a 4-core non-NUMA machine: we want to dedicate 3 cores to the shield, and leave 1 core for unimportant tasks; since it is non-NUMA we don't need to specify any memory node parameters, and we leave the kernel threads running in the root cpuset (ie: across all cpus)
$ cset shield --cpu 1-3
Some kernel threads (those which aren't bound to specific cpus) can be moved into the system cpuset. (In general it is not a good idea to move kernel threads which have been bound to a specific cpu)
$ cset shield --kthread on
Now let's list what's running in the shield (user) or unshielded (system) cpusets: (-v for verbose, which will list the process names) (add a 2nd -v to display more than 80 characters)
$ cset shield --shield -v
$ cset shield --unshield -v -v
If we want to stop the shield (teardown)
$ cset shield --reset
Now let's execute a process in the shield (commands following '--' are passed to the command to be executed, not to cset)
$ cset shield --exec mycommand -- -arg1 -arg2
If we already have a running process which we want to move into the shield (note we can move multiple processes by passing a comma separated list, or ranges (any process in the range will be moved, even if there are gaps))
$ cset shield --shield --pid 1234
$ cset shield --shield --pid 1234,1236
$ cset shield --shield --pid 1234,1237,1238-1240
Advanced concepts
cset set/proc - these give you finer control of cpusets
Set
Create, adjust, rename, move and destroy cpusets
Commands
Create a cpuset, using cpus 1-3, use NUMA node 1 and call it "my_cpuset1"
$ cset set --cpu=1-3 --mem=1 --set=my_cpuset1
Change "my_cpuset1" to only use cpus 1 and 3
$ cset set --cpu=1,3 --mem=1 --set=my_cpuset1
Destroy a cpuset
$ cset set --destroy --set=my_cpuset1
Rename an existing cpuset
$ cset set --set=my_cpuset1 --newname=your_cpuset1
Create a hierarchical cpuset
$ cset set --cpu=3 --mem=1 --set=my_cpuset1/my_subset1
List existing cpusets (depth of level 1)
$ cset set --list
List existing cpuset and its children
$ cset set --list --set=my_cpuset1
List all existing cpusets
$ cset set --list --recurse
Proc
Manage threads and processes
Commands
List tasks running in a cpuset
$ cset proc --list --set=my_cpuset1 --verbose
Execute a task in a cpuset
$ cset proc --set=my_cpuset1 --exec myApp -- --arg1 --arg2
Moving a task
$ cset proc --toset=my_cpuset1 --move --pid 1234
$ cset proc --toset=my_cpuset1 --move --pid 1234,1236
$ cset proc --toset=my_cpuset1 --move --pid 1238-1340
Moving a task and all its siblings
$ cset proc --move --toset=my_cpuset1 --pid 1234 --threads
Move all tasks from one cpuset to another
$ cset proc --move --fromset=my_cpuset1 --toset=system
Move unpinned kernel threads into a cpuset
$ cset proc --kthread --fromset=root --toset=system
Forcibly move kernel threads (including those that are pinned to a specific cpu) into a cpuset (note: this may have dire consequences for the system - make sure you know what you're doing)
$ cset proc --kthread --fromset=root --toset=system --force
Hierarchy example
We can use hierarchical cpusets to create prioritised groupings
Create a system cpuset with 1 cpu (0)
Create a prio_low cpuset with 1 cpu (1)
Create a prio_met cpuset with 2 cpus (1-2)
Create a prio_high cpuset with 3 cpus (1-3)
Create a prio_all cpuset with all 4 cpus (0-3) (note this the same as root; it is considered good practice to keep a separation from root)
To achieve the above you create prio_all, and then create subset prio_high under prio_all, etc
$ cset set --cpu=0 --set=system
$ cset set --cpu=0-3 --set=prio_all
$ cset set --cpu=1-3 --set=/prio_all/prio_high
$ cset set --cpu=1-2 --set=/prio_all/prio_high/prio_med
$ cset set --cpu=1 --set=/prio_all/prio_high/prio_med/prio_low

There are two other ways I can think of doing this (though not as elegant as cset, which doesn't seem to have a fantastic level of support from Redhat):
1) Taskset everything including PID 1 - nice and easy (but, alledgly -- I've never seen any issues myself -- may cause inefficiencies in the scheduler). The script below (which must be run as root) runs taskset on all the running processes, including init (pid 1); this will pin all the running processes to one or more 'junk cores', and by also pinning init, it will ensure that any future processes are also started in the list of 'junk cores':
#!/bin/bash
if [[ -z $1 ]]; then
printf "Usage: %s '<csv list of cores to set as junk in double quotes>'", $0
exit -1;
fi
for i in `ps -eLfad |awk '{ print $4 } '|grep -v PID | xargs echo `; do
taskset -pc $1 $i;
done
2) use the isolcpus kernel parameter (here's the documentation from https://www.kernel.org/doc/Documentation/kernel-parameters.txt):
isolcpus= [KNL,SMP] Isolate CPUs from the general scheduler.
Format:
<cpu number>,...,<cpu number>
or
<cpu number>-<cpu number>
(must be a positive range in ascending order)
or a mixture
<cpu number>,...,<cpu number>-<cpu number>
This option can be used to specify one or more CPUs
to isolate from the general SMP balancing and scheduling
algorithms. You can move a process onto or off an
"isolated" CPU via the CPU affinity syscalls or cpuset.
<cpu number> begins at 0 and the maximum value is
"number of CPUs in system - 1".
This option is the preferred way to isolate CPUs. The
alternative -- manually setting the CPU mask of all
tasks in the system -- can cause problems and
suboptimal load balancer performance.
I've used these two plus the cset mechanisms for several projects (incidentally, please pardon the blatant self promotion :-)), I've just filed a patent for a tool called Pontus Vision ThreadManager that comes up with optimal pinning strategies for any given x86 platform to any given software work loads; after testing it in a customer site, I got really good results (270% reduction in peak latencies), so it's well worth doing pinning and CPU isolation.

Here's how to do it the old-fashioned way, using cgroups. I have a Fedora 28 machine and RedHat/Fedora want you to use systemd-run, but I wasn't able to find this functionality in there. I would love to know how to do it using systemd-run, if anyone would care to enlighten me.
Let's say I want to exclude my fourth CPU (of CPUs 0-3) from scheduling, and move all existing processes to CPUs 0-2. Then I want to put a process on CPU 3 all by itself.
sudo su -
cgcreate -g cpuset:not_cpu_3
echo 0-2 > /sys/fs/cgroup/cpuset/not_cpu_3/cpuset.cpus
# This "0" is the memory node. See https://utcc.utoronto.ca/~cks/space/blog/linux/NUMAMemoryInfo
# for more information *
echo 0 > /sys/fs/cgroup/cpuset/not_cpu_3/cpuset.mems
Specifically, on your machine you'll want to review /proc/zoneinfo and the /sys/devices/system/node heirarchy. Getting the proper node information is left as an exercise for the reader.
Now that we have our cgroup, we need to create our isolated CPU 3 cgroup:
cgcreate -g cpuset:cpu_3
echo 3 > /sys/fs/cgroup/cpuset/cpu_3/cpuset.cpus
# Again, the memory node(s) you want to specify.
echo 0 > /sys/fs/cgroup/cpuset/cpu_3/cpuset.mems
Put all processes/threads on the not_cpu_3 cgroup:
for pid in $(ps -eLo pid) ; do cgclassify -g cpuset:not_cpu_3 $pid; done
Review:
ps -eL k psr o psr,pid,tid,args | sort | cut -c -80
NOTE! Processes currently in sleep will not move. They must be awakened so that the scheduler will put them on a different CPU. To see this, choose your favorite sleeping process in the above list- a process, say a web browser, that you thought should be on CPU 0-2 but it's still on 3. Using its thread ID from the above list, perform:
kill -CONT <thread_id>
example
kill -CONT 9812
Rerun the ps command, and note that it's moved to another CPU.
DOUBLE NOTE! Some kernel threads cannot and will not move! For example, you may note that every CPU has a kernel thread [kthreadd] on it. Assigning processes to cgroups works for userspace processes, not for kernel threads. This is life in the multitasking world.
Now to move a process and all its children to control group cpu_3:
pid=12566 # for example
cgclassify -g cpuset:cpu_3 $pid
taskset -c -p 3 $pid
Again, if $pid is sleeping, you'll need to wake it up for the CPU move to actually take place.
To undo all of this, simply delete the cgroups you've created. Everybody will be stuck back into the root cgroup:
cgdelete -r cpuset:cpu_3
cgdelete -r cpuset:not_cpu_3
No need to reboot.
(Sorry, I don't understand the 3rd question from the original poster. I can't comment on that.)

If you are using rhel instance you can use Tuna for this (May be available for other linux distros also, but not sure about that). It can easily installed from yum command. Tuna can be used to isolate a cpu core and it dynamically moves processes run in that particular cpu to neighboring cpu. The command to isolate a cpu core is as follow,
# tuna --cpus=CPU-LIST --isolate
You can use htop to see how tuna isolate the cpu cores in real-time.

How do I determine the slowest component of my shell pipeline?

I have an extremely long and complicated shell pipeline set up to grab 2.2Gb of data and process it. It currently takes 45 minutes to process. The pipeline is a number of cut, grep, sort, uniq, grep and awk commands tied together. I have my suspicion that it's the grep portion that is causing it to take so much time but I have no way of confirming it.
Is there anyway to "profile" the entire pipeline from end to end to determine which component is the slowest and if it is CPU or IO bound so it can be optimised?
I cannot post the entire command here unfortunately as it would require posting proprietary information but I suspect it is the following bit checking it out with htop:
grep -v ^[0-9]

One way to do this is to gradually build up the pipeline, timing each addition, and taking as much out of the equation as possible (such as outputting to a terminal or file). A very simple example is shown below:
pax:~$ time ( cat bigfile >/dev/null )
real 0m4.364s
user 0m0.004s
sys 0m0.300s
pax:~$ time ( cat bigfile | tr 'a' 'b' >/dev/null )
real 0m0.446s
user 0m0.312s
sys 0m0.428s
pax:~$ time ( cat bigfile | tr 'a' 'b' | tail -1000l >/dev/null )
real 0m0.796s
user 0m0.516s
sys 0m0.688s
pax:~$ time ( cat bigfile | tr 'a' 'b' | tail -1000l | sort -u >/dev/null )
real 0m0.892s
user 0m0.556s
sys 0m0.756s
If you add up the user and system times above, you'll see that the incremental increases are:
0.304 (0.004 + 0.300) seconds for the cat;
0.436 (0.312 + 0.428 - 0.304) seconds for the tr;
0.464 (0.516 + 0.688 - 0.436 - 0.304) seconds for the tail; and
0.108 (0.556 + 0.756 - 0.464 - 0.436 - 0.304) seconds for the sort.
This tells me that the main things to look into are the tail and the tr.
Now obviously, that's for CPU only, and I probably should have done multiple runs at each stage for averaging purposes, but that's the basic first approach I would take.
If it turns out it really is your grep, there are a few other options available to you. There are numerous other commands that can strip lines not starting with a digit but you may find that a custom-built command for doing this may be faster still, pseudo-code like (untested, but you should get the idea):
state = echo
lastchar = newline
while not end of file:
read big chunk from file
for every char in chunk:
if lastchar is newline:
if state is echo and char is non-digit:
state = skip
else if state is skip and and char is digit:
state = echo
if state is echo:
output char
lastchar = char
Custom, targeted code like this can sometimes be made more efficient than a general-purpose regex processing engine, simply because it can be optimised to the specific case. Whether that's true is this case, or any case for that matter, is something you should test. My number one optimisation mantra is measure, don't guess!

I found the problem myself after some further experimentation. It appears to be due to the encoding support in grep. Using the following hung the pipeline:
grep -v ^[0-9]
I replaced it with sed as follows and it finished in under 45 seconds!
sed '/^[0-9]/d'

This is straightforward with zsh:
zsh-4.3.12[sysadmin]% time sleep 3 | sleep 5 | sleep 2
sleep 3 0.01s user 0.03s system 1% cpu 3.182 total
sleep 5 0.01s user 0.01s system 0% cpu 5.105 total
sleep 2 0.00s user 0.05s system 2% cpu 2.121 total

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string