Running MPI programs on a specified number of cores - multithreading

I am new to MPI programming. I want to run my MPI program on a specified number of cores. I referred to the help option by typing mpirun --help. It gave the following output:
...
-c|-np|--np <arg0> Number of processes to run
...
-n|--n <arg0> Number of processes to run
...
However, when I referred to this website, it specifies the following two different things in two different places:
in the introduction:
mpirun typically works like this
mpirun -np <number of processes> <program name and arguments>
and in the options help menu:
-np <np>
- specify the number of processors to run on
In this scenario, does -np specify the number of processes to run or processors to run on? Moreover, how do I run my MPI programs on multiple PCs?
Any help would be appreciated in this regard.

The -np option specifies the number of processes, not processors. How many physical processors the job actually uses depends on how MPI is configured and on your machine's architecture. With MPI set up correctly on your local machine, mpirun -np 2 ./a.out will run two processes, typically one per core. If your local machine has four cores and you run mpirun -np 8 ./a.out, you get 8 processes with two per core (which may be sensible if the cores support multithreading). Check top to see how many processors are actually used in each case.
To run on multiple PCs, you will need to list the PCs' network addresses in a host file and start a ring with a process manager such as Hydra or MPD, e.g. for 8 PCs or nodes: mpd -n 8 -f ~/mpd.hosts. You will also need to set up SSH key-based authentication and install MPI on every PC. There are a number of good tutorials that walk you through this process (check the tutorials for the MPI implementation you are using, probably MPICH or Open MPI).
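As a sketch of the multi-PC launch (the hostnames, slot counts, and a.out binary below are placeholders, and Open MPI and MPICH spell the hostfile option differently):

```shell
# hosts file -- one line per PC (Open MPI syntax; MPICH would use "node1:4")
#   node1 slots=4
#   node2 slots=4

# Open MPI: launch 8 processes spread across the PCs in the hostfile
mpirun --hostfile hosts -np 8 ./a.out

# MPICH (Hydra process manager) uses -f instead:
mpiexec -f hosts -n 8 ./a.out
```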

Related

Monitor the CPU usage of an OpenFOAM simulation running on a slurm job

I'm running an OpenFOAM simulation on a cluster. I have used the Scotch decomposition method and my decomposeParDict looks like this:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  6;
method              scotch;
checkMesh and decomposePar finish with no issues. I have assigned 6 nodes to the Slurm job with
srun -N6 -l sonicFoam
and the solver runs smoothly without any errors.
The issue is that the solution speed is not improved compared to the non-parallel simulation I ran before. I want to monitor the CPU usage to see whether all 6 nodes I have assigned are similarly loaded. The squeue --user=foobar command returns the job number and the list of assigned nodes (NODELIST(REASON)), which looks like this:
foo,bar[061-065]
According to sinfo, these nodes are in both the debug and main* PARTITIONs (I have absolutely no idea what that means!).
This post says that you can use the sacct or sstat commands to monitor CPU time and memory usage of a slurm job. But when I run
sacct --format="CPUTime,MaxRSS"
it gives me:
CPUTime MaxRSS
---------- ----------
00:00:00
00:00:00
00:07:36
00:00:56
00:00:26
15:26:24
which I cannot interpret. And when I specify the job number with
sacct --job=<jobNumber> --format="UserCPU"
The return is empty. So my questions are:
Is my simulation loading all nodes, or is it running on one or two while the rest sit idle?
Am I running the right commands? If yes, what do those numbers mean, and how do they represent CPU usage per node?
If not, what are the right --format="..." fields for sacct and/or sstat (or other Slurm commands) to get the CPU usage/load?
P.S.1. I compiled OpenFOAM following the official instructions. I did not do anything with OpenMPI and its mpicc compiler, though.
P.S.2. For those of you who might end up here: maybe I'm running the wrong command. Apparently one can first allocate some resources with:
srun -N 1 --ntasks-per-node=7 --pty bash
where 7 is the number of cores you want and --pty bash opens an interactive shell on the allocation, and then run the solver with:
mpirun -np 7 sonicFoam -parallel -fileHandler uncollated
I'm not sure yet, though.
You can use
sacct --format='jobid,AveCPU,MinCPU,MinCPUTask,MinCPUNode'
to check whether all CPUs have been active. Compare AveCPU (average CPU time of all tasks in job) with MinCPU (minimum CPU time of all tasks in job). If they are equal, all 6 tasks (you requested 6 nodes, with, implicitly, 1 task per node) worked equally. If they are not equal, or even MinCPU is zero, then some tasks have been doing nothing.
But in your case, I believe you will observe that all tasks have been working hard, but they were all doing the same thing.
Besides the remark concerning the -parallel flag by @timdykes, you also must be aware that launching an MPI job with srun requires that OpenMPI was compiled with Slurm support. During your installation of OpenFOAM, it installed its own version of OpenMPI; if the file /usr/include/slurm/slurm.h or /usr/include/slurm.h exists, then Slurm support was probably compiled in. But the safest option is probably to use mpirun.
But to do that, you will first have to request an allocation from Slurm with either sbatch or salloc.
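A minimal sketch of that workflow, using the node count from the question (the exact salloc behaviour and the solver invocation are assumptions, not tested here):

```shell
# interactive: request 6 tasks on 6 nodes, then launch inside the allocation
salloc -N 6 -n 6
mpirun -np 6 sonicFoam -parallel

# batch equivalent, submitted with: sbatch job.sh
#   #!/bin/bash
#   #SBATCH --nodes=6
#   #SBATCH --ntasks=6
#   mpirun -np 6 sonicFoam -parallel
```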
Have you tried running with the -parallel argument? All of the OpenFOAM examples online use this argument when running a parallel job; one example is the official guide for running in parallel.
srun -N $NTASKS -l sonicFoam -parallel
As an aside: I saw you built OpenFOAM yourself. Have you checked whether the cluster admins provide a module for it? You can usually run module avail to see the list of available modules, and then module load moduleName if an OpenFOAM module exists. This is useful because you can probably trust it's been built with the right options, and it will automatically set up your $PATH etc.

Limit number of cores used by OMPython

Background
I need to run a blocks simulation. I've used OMEdit to create the system and I call omc to run the simulation using OMPython with zmq for messaging. The simulation works fine but now I need to move it to a server to simulate the system for long times.
Since the server is shared among a team of people, it uses slurm to queue the jobs. The server has 32 cores but they asked me to use only 8 while I tune my script and then 24 when I want to run my final simulation.
I've configured slurm to call my script in the following manner:
#!/bin/bash
#
#SBATCH --job-name=simulation_Test_ADC2_pipe_4096s
#SBATCH --output=simulation_Test_ADC2_pipe_4096s.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=10000
source activate /home/jabozzo/conda_envs/py27
#which python
python ./Test_ADC2_pipe_4096s.py
Then I submit the Slurm file with sbatch.
Problem
The omc compilation works fine. When the simulation starts, all 32 cores of the server become loaded, even though the job was configured to use only 8.
I've tried
There are compilation and simulation flags that can be passed to omc. I've tried --numProcs (a compilation flag), but it only seems to apply during the compilation process and does not affect the final executable. I've scanned the page of simulation flags looking for something related, but it seems there is no option to change the CPU usage.
The only thing we add when running our OpenModelica tests in parallel is the GC_MARKERS=1 environment variable and --numProcs=1; this makes our nightly library testing of 10000 tests run entirely in serial. GC_MARKERS shouldn't affect simulations, though, unless they allocate extreme amounts of memory. Other than that, OpenModelica simulations are serial unless you use a parallel BLAS/LAPACK/sundials library, which might use more cores without OpenModelica knowing anything about it; in that case you would need to read the documentation of the library that's consuming your resources.
What's also a bit surprising is that Slurm allows your process to consume more CPUs than you requested; it could use the taskset command to make the kernel force the process onto only certain CPUs.
My server administrator was unsure whether taskset would interfere with Slurm internals, so we found another option: if omc uses OpenMP, we can limit the number of cores by replacing the last line of the Slurm file with:
OMP_NUM_THREADS=8 python ./Test_ADC2_pipe_4096s.py
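OMP_NUM_THREADS is just an environment variable exported to the child process; any OpenMP runtime the executable links against reads it at startup. A quick way to confirm the variable actually reaches the child:

```shell
# the child process sees the variable set on the launch line
OMP_NUM_THREADS=8 sh -c 'echo "$OMP_NUM_THREADS"'
# prints: 8
```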
I'm leaving this answer here to complement sjoelund.se's answer.

Run processes use two CPU in different terminals

I have a complex script (the script is just an example; it might be an unzip command, etc., with a different, unrelated command in the other terminal) and two CPUs. Can I run two different processes (or commands, etc.) in two terminals, each on a different CPU, simultaneously? Is that possible? And can I specify a particular processor for each terminal to use?
You can run 2 or more commands, even in the same terminal, with taskset.
From the man page (http://linuxcommand.org/man_pages/taskset1.html):
taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity. CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs.
Note that the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.
@eddiem already shared the link (http://xmodulo.com/run-program-process-specific-cpu-cores-linux.html) on how to install taskset; that link also explains how to run it.
In short:
$ taskset 0x1 tar -xzvf test.tar.gz
That runs the tar command on CPU 0 (the mask 0x1 selects the first CPU).
If you want to run several commands/scripts in the same terminal on different CPUs, you can send them to the background by appending & at the end, e.g.:
$ taskset 0x1 tar -xzvf test.tar.gz &
You can use the taskset program to control the CPU affinity of specific processes. If you set the affinity of the shell process controlling terminal A to core 0 and that of terminal B to core 1, any child process started from A should run on core 0 and any started from B on core 1.
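A sketch of that approach (taskset is part of util-linux; $$ expands to the current shell's PID):

```shell
# pin the current shell (and all future children) to core 0;
# 0x1 is a bitmask selecting the first CPU
taskset -p 0x1 $$

# or launch a single command pinned to core 0 using the list form
taskset -c 0 echo "pinned to core 0"
```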
http://xmodulo.com/run-program-process-specific-cpu-cores-linux.html

Run perf with an MPI application

perf is a performance analysis tool which can report hardware and software events. I am trying to run it with an MPI application in order to learn how much time the application spends within each core on data transfers and compute operations.
Normally, I would run my application with
mpirun -np $NUMBER_OF_CORES app_name
And it would spawn processes on several cores, or possibly several nodes. Is it possible to add perf on top? I've tried
perf stat mpirun -np $NUMBER_OF_CORES app_name
But the output looks like some sort of aggregate over mpirun. Is there a way to collect perf-type data from each core?
Something like:
mpirun -np $NUMBER_OF_CORES ./myscript.sh
might work, with myscript.sh containing:
#!/bin/bash
perf stat ./app_name "$@"
You should add a parameter to the perf call (for example -o with a per-process file name) so that each process produces a differently named result file.
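To get differently named files per process, the wrapper can read the rank from the environment. A sketch, assuming Open MPI (which exports OMPI_COMM_WORLD_RANK) or MPICH/Hydra (which exports PMI_RANK); app_name is a placeholder for your binary:

```shell
#!/bin/bash
# determine this process's MPI rank (0 if neither variable is set)
RANK=${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-0}}
# write one perf result file per rank
perf stat -o "perf_rank_${RANK}.txt" ./app_name "$@"
```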
perf can follow spawned child processes. To profile the MPI processes located on the same node, you can simply do
perf stat mpiexec -n 2 ./my-mpi-app
You can use perf record as well. It will create a single perf.data file containing the profiling information for all the local MPI processes. However, this won't allow you to profile individual MPI ranks.
To find out information about individual mpi ranks, you need to run
mpiexec -n 2 perf stat ./my-mpi-app
This will profile the individual ranks and will also work across multiple nodes. However, this does not work with some perf commands such as perf record.

Is tesseract 3.00 multi-threaded?

I read some other posts suggesting that they would add multi-threading support in 3.00. But I'm not sure if it's added in 3.00 when it was released.
Other than multi-threading, is running multiple processes of tesseract a feasible option to achieve concurrency?
Thanks.
One thing I've done is invoke GNU Parallel to run as many instances of Tesseract as possible on a multi-core system, for multi-page documents converted to single-page images.
It's a short program, easily compiled on most Linux distros (I'm using openSUSE 11.4).
Here's the command line that I use:
/usr/local/bin/parallel -j 4 \
/usr/local/bin/tesseract -psm 1 -l eng {} {.} \
::: /tmp/tmp/*.jpg
The -j 4 tells parallel to use all four CPU cores that I have on a server.
If you run this and do a top in another terminal, you'll see up to four processes at a time until it has worked through all of the JPGs in the specified directory.
Your load should never exceed the number of CPU cores in your system (if you run Linux).
Here's the link to GNU Parallel:
http://www.gnu.org/software/parallel/
No. You can browse the code at http://code.google.com/p/tesseract-ocr/source/browse/. None of the current code in trunk seems to make use of multi-threading (at least looking through the base classes, API, and neural-network classes).
I used parallel as well, on CentOS, this way:
ls | parallel --gnu "tesseract {} {.}"
I used the --gnu option as suggested by the stdout log, which said:
parallel: Warning: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.
The {} and {.} are placeholders for parallel: here you're telling tesseract to use the listed file as the first argument and the same file name without its extension as the second argument. Everything is well explained in the parallel man pages.
Now, if you have, say, three .tif files and you run tesseract three times, once per file, summing up the execution times, and then run the command above prefixed with time, you can easily measure the speedup.
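That comparison can be scripted directly (a sketch assuming a directory of .tif files, with tesseract and parallel on the PATH):

```shell
# serial: one tesseract run per file, timed as a whole
time ( for f in *.tif; do tesseract "$f" "${f%.tif}"; done )

# parallel: the same files spread across the cores
time ( ls *.tif | parallel --gnu "tesseract {} {.}" )
```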
