This post is related to a previous post on binding threads to certain MPI processes. There, it was asked how MPI ranks could be assigned different numbers of OpenMP threads. One possibility is the following:
$ mpiexec <global parameters>
-n n1 <local parameters> executable_1 <args1> :
-n n2 <local parameters> executable_2 <args2> :
...
-n nk <local parameters> executable_k <argsk>
What I don't know is how the independent instances executable_1, executable_2, ..., executable_k communicate with each other. I mean,
if at some point during execution they need to exchange data, do they
use an inter-communicator (among instances) and an intra-communicator
(within the same instance, for example executable_1)?
Thanks.
All processes launched as a result of that command form a single MIMD/MPMD MPI job, i.e. they share the same world communicator. The first n1 ranks are running executable_1, the following n2 ranks are running executable_2, etc.
rank | executable
----------------------------------------+---------------
0 .. n1-1 | executable_1
n1 .. n1+n2-1 | executable_2
n1+n2 .. n1+n2+n3-1 | executable_3
.... | ....
n1+n2+n3+..+n(k-1) .. n1+n2+n3+..+nk-1 | executable_k
Communication happens simply by sending messages in MPI_COMM_WORLD. The separate executables do not automatically form communicator groups of their own. This is what distinguishes MPMD from starting child jobs with MPI_Comm_spawn: child jobs have their own world communicators and one uses inter-communicators to talk to them, while the separate sub-jobs of an MIMD/MPMD job do not.
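For illustration only (this sketch is not part of the original answer), the following C program could be linked into any of the executables; since all ranks share MPI_COMM_WORLD, the highest rank can send a message to rank 0 even when the two ranks run different executables:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, payload = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size > 1) {
        if (rank == size - 1) {
            /* last rank of the job, possibly running executable_k */
            payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            /* rank 0, running executable_1 */
            MPI_Recv(&payload, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", payload, size - 1);
        }
    }

    MPI_Finalize();
    return 0;
}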
It is still possible for a rank to find out which application context it belongs to by querying the MPI_APPNUM attribute of MPI_COMM_WORLD. This makes it possible to create a separate sub-communicator for each context (the different contexts being the commands separated by :) simply by performing a split with the appnum value as the colour:
int *appnum, present;

/* The attribute value is a pointer to the application number: 0 for the
   first command on the mpiexec line, 1 for the second, and so on. */
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &appnum, &present);
if (!present)
{
    printf("MPI_APPNUM is not provided!\n");
    MPI_Abort(MPI_COMM_WORLD, 0);
}

/* Ranks with the same appnum (i.e. running the same executable) end up
   together in appcomm. */
MPI_Comm appcomm;
MPI_Comm_split(MPI_COMM_WORLD, *appnum, 0, &appcomm);
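After the split, appcomm on each rank contains exactly the ranks that run the same executable, i.e. it plays the role of the per-instance intra-communicator asked about in the question, while plain MPI_COMM_WORLD messaging covers communication between the instances.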
So far I've been using OPEN(fid, FILE='IN', ...) and it seems that all MPI processes read the same file IN without interfering with each other.
Furthermore, in order to allow the input file to be chosen among several, I simply made the IN file a symbolic link pointing to the desired input. This means that when I want to change the input file I have to run ln -sf desired-input IN before running the program (mpirun -n $np ./program).
I'd really like to be able to run the program as mpirun -n $np ./program < input-file. To do so I removed the OPEN statement and the corresponding CLOSE statement, and changed all READ(fid,*) statements to READ(INPUT_UNIT,*) (I'm using the ISO_FORTRAN_ENV module).
But, after all the edits, I've realized that only one process (always rank 0, I noticed) reads from standard input, since all the others reach EOF immediately. Here is an MWE, using Open MPI 2.0.1.
! cat main.f90
program main
use, intrinsic :: iso_fortran_env
use mpi
implicit none
integer :: myid, x, ierr, stat
x = 12
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, myid, ierr)
read(input_unit,*, iostat=stat) x
if (is_iostat_end(stat)) write(output_unit,*) myid, "I'm out"
if (.not. is_iostat_end(stat)) write(output_unit,*) myid, "I'm in", myid, x
call mpi_finalize(ierr)
end program main
which can be compiled with mpifort -o main main.f90 and run with mpirun -np 4 ./main, resulting in this output:
1 I'm out
2 I'm out
3 I'm out
17 this is my input from keyboard
0 I'm in 0 17
I know that MPI has proper routines to perform parallel I/O, but I've found nothing about reading from standard input.
You are seeing the expected behaviour with Open MPI. By default, mpirun
directs UNIX standard input to /dev/null for all processes except the MPI_COMM_WORLD rank 0 process, which inherits standard input from mpirun.
The --stdin option can be used to direct standard input to another process, but not to all of them at once.
One could also note that the behaviour of standard-input redirection isn't consistent across MPI implementations (the notion isn't specified by the MPI standard). For example, Intel MPI's mpirun has an -s option: mpirun -np 4 -s all ./main does allow all processes access to mpirun's standard input. There is also no guarantee that processes without that redirection will fail, rather than wait, when they try to read.
I am trying to read an input file in a cluster environment. Different nodes will read different parts of it. However the parts are not clearly separated, but interleaved in a "grid".
For example, a file with 16 elements (assume integers):
0 1 2 3
4 5 6 7
8 9 A B
C D E F
If I use four nodes, the first node will read the top left 2x2 square (0,1,4,5), the second node will read the top right 2x2 square and so on.
How should I handle this? I can use MPI or OpenMP. I have two ideas but I don't know which would work better:
Each node will open the file and have its own handle to it. Each node would read the file independently, using only the part of the file it needs and skipping over the rest of it. In this case, what would be the difference between using fopen and MPI_File_open? Which one would be better?
Use one node to read the whole file and send each part of the input to the node that needs it.
Regarding your question,
I would not suggest the second option you mentioned, that is, using one node to read the file and then distributing the parts. The reason is that this is slow, especially if the file is large: you pay the overhead twice, first by keeping the other processes waiting and then by sending them the data that was read. So it is clearly a no-go for me.
Regarding your first option, there is no big difference between using fopen and MPI_File_open. However, I would still suggest MPI_File_open, in order to take advantage of facilities like non-blocking I/O operations and shared file pointers (they make life easier).
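As a minimal sketch of the first option, here is one common approach: a subarray file view with a collective read (which is a different facility from the shared file pointers mentioned above). It assumes the 4x4 grid is stored row-major as raw binary ints in a file named grid.bin and that exactly 4 ranks are used; both the file name and the binary layout are assumptions, not taken from the question.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 4)                      /* the sketch assumes 4 ranks */
        MPI_Abort(MPI_COMM_WORLD, 1);

    int gsizes[2] = {4, 4};             /* whole grid: 4 x 4  */
    int lsizes[2] = {2, 2};             /* local block: 2 x 2 */
    int starts[2] = { (rank / 2) * 2,   /* block row offset    */
                      (rank % 2) * 2 }; /* block column offset */

    /* Describe "my 2x2 block inside the 4x4 grid" as an MPI datatype. */
    MPI_Datatype block;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &block);
    MPI_Type_commit(&block);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "grid.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* The file view makes the file look as if it contained only my block. */
    MPI_File_set_view(fh, 0, MPI_INT, block, "native", MPI_INFO_NULL);

    int local[4];
    MPI_File_read_all(fh, local, 4, MPI_INT, MPI_STATUS_IGNORE);

    printf("rank %d read %d %d %d %d\n",
           rank, local[0], local[1], local[2], local[3]);

    MPI_File_close(&fh);
    MPI_Type_free(&block);
    MPI_Finalize();
    return 0;
}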
I have a simple mono-threaded application that does almost pure processing:
It uses two int buffers of the same size.
It reads, one by one, all the values of the first buffer.
Each value is a random index into the second buffer.
It reads the value at that index in the second buffer.
It sums all the values taken from the second buffer.
It does all the previous steps for bigger and bigger buffers.
At the end, I print the number of voluntary and involuntary CPU context switches (one way to obtain these counters is sketched below).
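For reference, one way to obtain such counters on Linux is getrusage(); this is only a guess at how they might be printed, not the asker's actual code:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* ... the buffer-summing work would go here ... */

    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        printf("Number of voluntary context switch: %ld\n", usage.ru_nvcsw);
        printf("Number of involuntary context switch: %ld\n", usage.ru_nivcsw);
    }
    return 0;
}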
If the size of the buffers becomes quite big, my PC starts to slow down: why? I have 4 cores with hyper-threading, so 3 cores remain unused and only one is 100% busy. Is it because my process uses almost 100% of the memory bandwidth (the "RAM bus")?
Then, I created a CPU set that I want to dedicate to my process (my CPU set contains both hardware threads of the same core):
$ cat /sys/devices/system/cpu/cpu3/topology/core_id
3
$ cat /sys/devices/system/cpu/cpu7/topology/core_id
3
$ cset set -c 3,7 -s my_cpuset
$ cset set -l
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-7 y 0 y 934 1 /
my_cpuset 3,7 n 0 n 0 0 /my_cpuset
It seems that absolutely no task at all is running on my CPU set. I relaunch my process and, while it is running, I run the following:
$ taskset -c 7 ./TestCpuset # Here, I launch my process
...
$ ps -mo pid,tid,fname,user,psr -p 25244 # 25244 being the PID of my process
PID TID COMMAND USER PSR
25244 - TestCpus phil -
- 25244 - phil 7
PSR = 7: my process is indeed running on the expected CPU thread. I hope it is the only one running on it, but at the end my process displays:
Number of voluntary context switch: 2
Number of involuntary context switch: 1231
If I get involuntary context switches, it means that other processes are running on my core: how is that possible? What must I do in order to get Number of involuntary context switch = 0?
Last question: When my process is running, if I launch
$ cset set -l
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0-7 y 0 y 1031 1 /
my_cpuset 3,7 n 0 n 0 0 /my_cpuset
Once again I get 0 tasks in my CPU set. But I know that there is a process running on it: it seems that a task is not the same thing as a process?
If the size of the buffers becomes quite big, my PC starts to slow down: why? I have 4 cores with hyper-threading, so 3 cores remain unused and only one is 100% busy. Is it because my process uses almost 100% of the memory bandwidth (the "RAM bus")?
You have reached the hardware performance limit of a single-threaded application, namely 100% CPU time on the single CPU your program is allocated to. Your application thread will not run on more than one CPU at a time.
What must I do in order to get Number of involuntary context switch = 0?
Aren't you missing the --cpu_exclusive option in the cset set command?
By the way, if you want to achieve a lower execution time, I suggest making the application multithreaded and letting the operating system and the hardware beneath it parallelise the execution instead. Locking a process to a CPU set and preventing it from context switching might degrade overall operating system performance and is not a portable solution.
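A minimal sketch of what that multithreaded variant could look like, with the random-gather sum parallelised by an OpenMP reduction (the buffer names and sizes are invented here, not taken from the question):

/* compile with e.g. gcc -O2 -fopenmp gather_sum.c */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 26;                    /* example size: 64M ints (256 MB) per buffer */
    int *idx = malloc((size_t)n * sizeof *idx);   /* first buffer: random indices */
    int *val = malloc((size_t)n * sizeof *val);   /* second buffer: values        */
    if (!idx || !val)
        return 1;

    for (int i = 0; i < n; ++i) {
        idx[i] = rand() % n;
        val[i] = i;
    }

    long long sum = 0;
    /* Each thread sums a chunk of the gathers; the reduction combines the partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += val[idx[i]];

    printf("sum = %lld\n", sum);
    free(idx);
    free(val);
    return 0;
}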
Running top interactively you can do different things. Is there a way to write a bash script that would interact with top without using programs like xdotool?
Your best option, if you want to interact with top from a script, would be to have the script run it once and capture the output (top -n1 will make it run once and then terminate). Your script can run top -n1 with the appropriate other parameters every time it wants to capture new output, and process it accordingly.
Trying to run an interactive top session and have a script send it keystrokes would be very fragile and quite a mess.
top isn't the best tool to run from scripts. Foremost, its output isn't designed to be processed by scripts. But if you tell us what you want to achieve, then we can tell you which commands do similar things:
ps lists active processes. iostat gives you disk and other I/O data. sar reports system activity (network, interrupts, ...)
top -b -n n1 -d d1
where,
n1 = the number of top snapshots to collect (-n sets the number of iterations before top exits)
d1 = the delay between two successive snapshots, in seconds (-d accepts fractional values)
Example:
top -b -n $n1 -d $d1 | grep "Cpu" > top.txt (n1 and d1 accepted from user)
I have some code similar to this:
!$omp parallel do
do k = 1, NUM_JOBS
call asynchronous_task( parameter_array(k) )
end do
!$omp end parallel do
I've tried many different strategies, including
$ micnativeloadex $exe -e "KMP_PLACE_THREADS=59Cx4T OMP_NUM_THREADS=236"
But, when I check the MIC with top, I'm only getting 25% usage.
I'm having a great deal of difficulty finding any specific help in the Intel docs/forums and the OpenMP forums, and now I'm thinking that my only shot at having 59 tasks with 4 threads working on each task is to combine MPI with OpenMP.
Does anyone have any experience with this and any recommendations for moving forward? I've been running 236 asynchronous tasks instead, but I have a suspicion that 59 tasks will run over 4 times faster than 236 due to the memory overhead of my task.
KMP_PLACE_THREADS will set OMP_NUM_THREADS implicitly, so you don't need to specify it in your MIC environment variables.
If you would like to use 59 tasks with 4 threads per task you have a few options.
MPI/OpenMP
As you mentioned, you could use a hybrid MPI/OpenMP approach. In this case you will utilise a different OpenMP domain per rank. I achieved this in the past by running mpirun natively on the MIC, something like this:
#!/bin/bash
export I_MPI_PIN=off
mpirun -n 1 -env KMP_PLACE_THREADS=10c,4t,1o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,11o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,21o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,31o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,41o ./scaling : \
-n 1 -env KMP_PLACE_THREADS=10c,4t,51o ./scaling
This will create 6 MPI ranks, with each rank's threads explicitly placed at core offsets 1, 11, 21, 31, 41 and 51, and 40 OpenMP threads (10 cores x 4 threads) per rank.
You will have to design your MPI code to split NUM_JOBS over your ranks and use OpenMP internally inside your asynchronous_task(), for example as sketched below.
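For instance, a round-robin split of the jobs over the ranks could look like this (sketched in C even though the question's code is Fortran; asynchronous_task is just a stub here and would use its OpenMP threads internally):

#include <mpi.h>

#define NUM_JOBS 59

static void asynchronous_task(int k)
{
    /* placeholder: the real task would use its 4 OpenMP threads here */
    (void)k;
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rank r handles jobs r, r+size, r+2*size, ... */
    for (int k = rank; k < NUM_JOBS; k += size)
        asynchronous_task(k);

    MPI_Finalize();
    return 0;
}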
Nested OpenMP
The other possibility is to use nested OpenMP. This will almost certainly be more advantageous for total memory consumption on the Xeon Phi.
In this case, you will also need to expose parallelism inside your asynchronous_task using OpenMP directives.
At the top level loop you can start 59 tasks and then use 4 threads internally in asynchronous_task. It is critical that you can expose this parallelism internally or your performance will not scale well.
To use nested OpenMP you can use something like this:
call omp_set_nested(.true.)
!$OMP parallel do NUM_THREADS(59)
do k = 1, NUM_JOBS
call asynchronous_task( parameter_array(k) )
end do
!$OMP end parallel do
subroutine asynchronous_task(param)
!$OMP parallel NUM_THREADS(4)
call work(param)
!$OMP end parallel
end subroutine
In both use cases, you will need to utilise OpenMP inside your task subroutine, in order to use more than one thread per task.