Run perf with an MPI application - linux

perf is a performance analysis tool which can report hardware and software events. I am trying to run it with an MPI application in order to learn how much time the application spends within each core on data transfers and compute operations.
Normally, I would run my application with
mpirun -np $NUMBER_OF_CORES app_name
and it would spawn processes across several cores, possibly on several nodes. Is it possible to add perf on top? I've tried
perf stat mpirun -np $NUMBER_OF_CORES app_name
But the output looks like an aggregate over the whole mpirun job. Is there a way to collect perf-type data from each core?

Something like:
mpirun -np $NUMBER_OF_CORES ./myscript.sh
might work with myscript.sh containing:
#!/bin/bash
perf stat app_name "$@"
You should add some parameter to the perf call to produce differently named result files.
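For example, here is a minimal sketch of such a wrapper, assuming your launcher exports a rank variable (Open MPI sets OMPI_COMM_WORLD_RANK, Hydra-based MPICH sets PMI_RANK; these names are assumptions about your MPI implementation, with the PID as a fallback):
#!/bin/bash
# myscript.sh -- sketch: run perf stat per rank, writing one result file per rank
RANK=${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-$$}}
exec perf stat -o "perf_stat.rank${RANK}.txt" app_name "$@"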

perf can follow spawned child processes. To profile the MPI processes located on the same node, you can simply do
perf stat mpiexec -n 2 ./my-mpi-app
You can use perf record as well. It will create a single perf.data file containing the profiling information for all the local MPI processes. However, this won't allow you to profile individual MPI ranks.
To find out information about individual MPI ranks, you need to run
mpiexec -n 2 perf stat ./my-mpi-app
This will profile the individual ranks and will also work across multiple nodes. However, this does not work with some perf commands such as perf record.
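If per-rank perf stat works for you, you can also ask for specific events to separate compute from memory traffic, e.g. (a sketch; these generic event names may or may not be available on your CPU):
mpiexec -n 2 perf stat -e cycles,instructions,cache-references,cache-misses ./my-mpi-app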

Related

Which tool or command gives very accurate memory usage status in Linux?

I have been asked in my project to profile the memory usage of a C++ application that runs on Linux on an embedded-like device. We need to know this in order to decide how much RAM we need.
I have done some research and found many tools or commands to find the max memory usage of a process when it is running.
Here are those:
top
Command: top -p $pid
ps
Command: ps -o rss= -p $pid
pmap
Command: pmap -x $pid
valgrind massif
Command: valgrind --tool=massif --pages-as-heap=yes program
smaps
Used the following link: Script
Linux system monitor app
But I get different memory usage figures from each of these. I have tried to understand them in depth, but I am still confused about which one is close enough to trust. Could someone with experience share which one they use, and explain why there are so many ways to measure memory that all give different results?
The VM, RSS and shared values differ between all of them.
Thanks
You can get the maximum resident set size of the process during its lifetime, in kilobytes, by using the following command:
/usr/bin/time -f %M
followed by the execution of your C++ binary.
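For example (the binary name and output file below are placeholders of my own):
/usr/bin/time -f "max RSS: %M KB" -o memusage.txt ./my_app
Note that this is the GNU time binary at /usr/bin/time, not the shell built-in time, which does not support -f.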

linux perf report inconsistent behavior

I have an application that I'm profiling using perf, and I find that the results from perf report are not consistent; I can't discern the pattern.
I start the application and profile it by pid for 60 seconds:
perf record -p <pid> -o <file> sleep 60
When I pull the results in with perf report -i <file>, sometimes I see a "+" in the far left column that allows me to drill down into the function call trees when I press ENTER, and sometimes that "+" is not there. It seems to depend on some property of the recorded file, in that I have a collection of recorded files, some of which allow this drill down and some of which do not.
Any suggestions on how to get consistent behavior here would be appreciated.
The default event being measured by perf record is cpu-cycles.
(Or depending on the machine, sometimes cpu-cycles:p or cpu-cycles:pp)
Are you sure your application is not sleeping a lot? Does it consume a lot of CPU cycles?
Try a perf measurement on something that stresses the CPU by doing a lot of computations:
$ apt-get install stress
$ perf record -e cpu-cycles --call-graph fp stress --cpu 1 --timeout 5
$ perf report
Subsequent runs should then show more or less similar results.
In case your program is CPU intensive, and call stacks do differ a lot between runs, then you may want to look at the --call-graph option, as perf can record call-graphs with different methods:
fp (frame pointers)
lbr (last branch record)
dwarf
Maybe different methods give better results.
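For example, to record with DWARF-based unwinding instead of frame pointers (my_app here is a placeholder for your own binary; dwarf mode produces noticeably larger perf.data files):
$ perf record --call-graph dwarf ./my_app
$ perf report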

Can I specify one CPU core while using oProfile?

I need to do a performance counter analysis on an 8-core server using oProfile. Can oProfile record events only on core 7? Thank you!
The operf man page does not describe such an option (which would seem like the --cpu option of the perf record command).
With operf you can try the --separate-cpu / -c option (which categorizes samples by CPU) together with the --system-wide option (which performs a system-wide profile), and then provide a cpu:cpulist profile specification to opreport (to only consider profiles for the given numbered CPU).
For example:
$ sudo operf --separate-cpu --system-wide
... <Ctrl-C or kill -SIGINT>
$ opreport cpu:0
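For comparison, if plain perf is also available on that server, perf stat and perf record accept -C/--cpu to restrict system-wide measurement to one core, e.g. (a sketch; the sleep 10 is just an arbitrary measurement window):
$ sudo perf stat -C 7 -a sleep 10
$ sudo perf record -C 7 -a sleep 10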

Running MPI programs on a specified number of cores

I am new to MPI programming. I want to run my MPI program on a specified number of cores. I referred to the help option by typing mpirun --help. It gave the following output:
...
-c|-np|--np <arg0> Number of processes to run
...
-n|--n <arg0> Number of processes to run
...
However, when I referred to this website, it specified two different things in two different places:
in the introduction:
mpirun typically works like this
mpirun -np <number of processes> <program name and arguments>
and in the options help menu:
-np <np>
- specify the number of processors to run on
In this scenario, does -np specify the number of processes to run or processors to run on? Moreover, how do I run my MPI programs on multiple PCs?
Any help would be appreciated in this regard.
The use of -np specifies processes. The number of actual processors that the job runs on depends on how you have configured MPI and your computer architecture. If you have MPI set up correctly on your local machine, mpirun -np 2 ./a.out will run two processes on two processors. If your local machine has four cores and you run mpirun -np 8 ./a.out, this should run 8 processes with two per processor (which may be sensible if the cores support multi-threading). Check with top to see how many processors are actually used in various cases.
To run on multiple PCs, you will need to specify a list of the PCs' network addresses in a host file and start a ring with a process manager like Hydra or MPD, e.g. for 8 PCs or nodes: mpdboot -n 8 -f ~/mpd.hosts. You will need to set up SSH key authentication and install MPI on every PC. There are a number of good tutorials that can walk you through this process (check the tutorials for the version of MPI you are using, probably MPICH or Open MPI).
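For example, a minimal host file plus launch command might look like this (the host names below are placeholders; the host-file flag is -f for MPICH/Hydra and --hostfile for Open MPI):
# mpd.hosts -- one machine per line
node01
node02
# MPICH / Hydra
mpiexec -f mpd.hosts -n 8 ./a.out
# Open MPI
mpirun --hostfile mpd.hosts -np 8 ./a.out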

How to set core dump naming scheme without su/sudo?

I am developing a MPI program on a Linux machine where I do not have sudo/su access. As my program currently segfaults, I would like to examine the core dumps via gdb. Unfortunately, as the program is multi-threaded, all the threads write to one core dump. So I would like to be able to append the PID to each separate core dump for every process.
I know there is a way to do it via /proc/sys/kernel/core_pattern, however I do not have access to write to this.
Thanks for any help.
It can be a pain to debug MPI apps on systems that are configured this way when you do not have root access. One option for working around this is to use Valgrind to get stack traces for your segfault(s). This will only be useful provided that your application will fail in a reasonable period of time when slowed down via Valgrind, and that it still segfaults at all in this case.
I usually run MPI apps under Valgrind like this:
% mpiexec -n 5 valgrind -q /path/to/my_app
That will send all of the Valgrind output to standard error. If you want the output separated into different files, you can get a bit fancier:
% mpiexec -n 5 valgrind -q --log-file='vg_out.%q{PMI_RANK}' /path/to/my_app
That's the setup for MPICH2. I think that for Open MPI you'll need to replace PMI_RANK with OMPI_MCA_ns_nds_vpid, but if that doesn't work for you then you'll need to check with the Open MPI developers on their discussion list. In either case, this will yield N files, where N is the size of MPI_COMM_WORLD, named vg_out.0, vg_out.1, ..., vg_out.$(($N-1)), each corresponding to a rank in MPI_COMM_WORLD.
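With more recent Open MPI releases each process also gets OMPI_COMM_WORLD_RANK in its environment, so (an untested assumption on my part) the equivalent invocation would be:
% mpiexec -n 5 valgrind -q --log-file='vg_out.%q{OMPI_COMM_WORLD_RANK}' /path/to/my_app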
