OpenMP and the number of cores vs. CPUs on Linux

I'm wondering how OpenMP figures out how many threads it can run, as reported by the omp_get_max_threads() library call. I'm running on a CentOS Linux machine using gcc -fopenmp. My machine has 16 AMD Opteron(tm) Processor 6136 CPUs, each with 8 cores, all according to /proc/cpuinfo. If I run omp_get_num_procs() it returns 16. But omp_get_max_threads() also returns 16. Why isn't the max threads number 16*8?
When I run a program that uses 16 threads I see the program in top running at ~1600% of CPU and if I toggle 'Last used cpu (SMP)' that number moves around a bit. So the 1600% makes sense but is there any way to know which cores of which CPUs the threads are running on?
I'm pretty new to OpenMP, so sorry if these questions seem naive.

You can use the hwloc tool set to find out how the threads of any application are bound to hardware threads/cores. You only need the name or the PID of the running target process. Here is an example:
$ hwloc-ps --pid 2038168 --threads --get-last-cpu-location
2038168 Machine:0 ./a.out
2038168 Core:5 a.out
2038169 Core:3 a.out
2038170 Core:1 a.out
2038171 Core:4 a.out
2038172 Core:0 a.out
2038173 Core:2 a.out
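Here we can see that the process a.out (PID 2038168) has 6 threads, each mapped to a different core.
However, the mapping of threads to cores can change over time if you do not configure OpenMP properly (a starting point is to set the environment variables OMP_PROC_BIND and OMP_PLACES).
Additionally, you can use hwloc-ls (or lstopo) to understand the topology of your machine (how many cores there are, how many hardware threads, how they are connected, etc.).
I am very surprised that you can have 16 "AMD Opteron(tm) Processor 6136 CPUs". This processor uses the G34 socket, which is available in arrangements of up to 4 sockets (8 dies). /proc/cpuinfo prints one entry per logical processor, not per physical package, so the 16 entries most likely describe 16 cores in total (for example, two 8-core processors). That would also explain why omp_get_num_procs() and omp_get_max_threads() both return 16: by default, GCC's OpenMP runtime uses one thread per logical processor it can see. So, please check this with hwloc-ls!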
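To check the socket and core counts without hwloc, /proc/cpuinfo can be summarized directly. The following is only a rough sketch: it assumes the x86 Linux field names "processor", "physical id" and "core id", and it assumes core ids are unique within one physical package. The function name summarize_cpuinfo and the SAMPLE text are made up for this example and are not output from the machine in the question.

```python
from collections import defaultdict

# Hypothetical, shortened /proc/cpuinfo text: 2 packages with 2 cores each.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 1
core id : 0

processor : 3
physical id : 1
core id : 1
"""


def summarize_cpuinfo(text):
    """Count logical processors, packages (sockets) and physical cores."""
    logical = 0
    cores_by_socket = defaultdict(set)  # physical id -> set of core ids
    socket = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1   # each block starts with a "processor" line
            socket = None
        elif key == "physical id":
            socket = value
        elif key == "core id" and socket is not None:
            # Assumption: core ids do not repeat inside one package.
            cores_by_socket[socket].add(value)
    physical = sum(len(ids) for ids in cores_by_socket.values())
    return {"logical": logical, "sockets": len(cores_by_socket), "cores": physical}


if __name__ == "__main__":
    print(summarize_cpuinfo(SAMPLE))
```

On a real machine the input would be the contents of /proc/cpuinfo, for example summarize_cpuinfo(open("/proc/cpuinfo").read()). If "logical" comes out as 16, that is the number omp_get_num_procs() is reporting.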
An alternative way is to use a profiling tool (such as Intel VTune).
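If neither hwloc nor a profiler is installed, the CPU each thread last ran on can also be read from /proc on Linux. This is a minimal sketch, assuming the layout documented in the proc(5) man page, where field 39 of /proc/<pid>/task/<tid>/stat is the CPU number last executed on. The function names are made up for this example.

```python
import os


def last_cpu_from_stat(stat_line):
    """Return field 39 ("processor") of a /proc/<pid>/task/<tid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so only the part after the last ')' is split on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 sits at index 39 - 3 = 36.
    return int(rest[36])


def thread_cpus(pid):
    """Map each thread id of a process to the CPU it last ran on (Linux only)."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                result[int(tid)] = last_cpu_from_stat(f.read())
        except OSError:
            pass  # the thread exited between listdir() and open()
    return result
```

Calling thread_cpus(pid) repeatedly for a running OpenMP program shows whether its threads stay put or migrate. The numbers are logical CPU numbers as the OS sees them, the same numbering top uses for its last-used-CPU column.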

mpirun with Intel OneAPI running processes n times

I am new to Intel OneAPI, but I installed the OneAPI package and when I run
mpirun -n ...
I receive an output like the following if I set N = 3 (for example):
Iteration #1...
Iteration #1...
Iteration #1...
Iteration #2...
Iteration #2...
Iteration #2...
Rather than giving the N cores I specify to a single run of the program, it runs the program N times with one core assigned to each process. I was wondering how to set this up so that N cores are assigned to one process.
Other useful information: I am running a program called Quantum Espresso on a NUMA machine with two 18-core processors and 2 threads per core. I initially installed Intel OneAPI because I noticed that if I specify 72 cores with mpirun, the computational demand increases 50-60 fold compared with running on 1 core, and I was hoping OneAPI might be able to resolve this.
mpirun with -np (or -n) specifies how many instances of a given process to run, as you saw.
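As a rough illustration of why each line shows up N times, the sketch below starts N independent copies of a small stand-in program. Every copy runs the whole program and prints its own output. It uses plain subprocesses, not MPI, and the names PROGRAM and launch_copies are made up for this example.

```python
import subprocess
import sys

# Stand-in for the real executable: prints two "iterations".
PROGRAM = "for i in (1, 2): print(f'Iteration #{i}...')"


def launch_copies(n):
    """Start n independent copies of PROGRAM and collect all output lines."""
    procs = [
        subprocess.Popen([sys.executable, "-c", PROGRAM],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(n)
    ]
    lines = []
    for proc in procs:
        out, _ = proc.communicate()  # wait for this copy and read its output
        lines.extend(out.splitlines())
    return lines


if __name__ == "__main__":
    for line in sorted(launch_copies(3)):
        print(line)
```

A program that really uses MPI would normally have only one rank print, or would split the work by rank. Seeing every line N times suggests that each process is doing the full computation on its own.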
Have you read this part of their documentation?
https://www.quantum-espresso.org/Doc/user_guide/
I’m not sure how you’ve built it or which functions you are using, but if you have built against their multithreaded libraries with OpenMP then you should get N threads in a process for those library calls.
Otherwise you will be limited to the MPI parallelism in their MPI parallel code.
I’m not sure what you expect when you say that you use all 72 cores and the computational demand increases. Isn’t that what you want, with the goal that the final result is completed sooner? In either the OpenMP or the MPI case you should see computational resource usage go up.
Good luck!

How many threads does my machine really have?

I have a Mac Pro with 12 cores and 24 threads (2.7 GHz 12-Core Intel Xeon E5), but when I go into the terminal and type the "top" command, it says there are 1621 threads. How can this even be possible? Is the word "thread" being used differently by top? I thought there were only 24. It seems there are far more because, aside from what I've said about top, I can compile with several dozen threads. When I type "make -j60", for example, the computer has no issue with launching 60 different compile processes, each working independently, compiling its own object file (or at least that's how it appears).
Thanks in advance,
-AA
The CPU can execute 24 threads at the same time, two in each of its 12 cores. There are currently 1,621 threads created by the software on your system. The vast majority of them currently have nothing to do. As those threads become ready to run, the system's scheduler will cause some of them to be executed, subject to the 24 at a time maximum the CPU is capable of.
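The difference between software threads and hardware threads can be seen in a few lines. This is a small sketch with a made-up helper, spawn_idle_threads: it creates far more threads than most machines have hardware threads, and they all exist at once because they are blocked and have nothing to do.

```python
import os
import threading


def spawn_idle_threads(count):
    """Start `count` threads that block until the returned event is set."""
    release = threading.Event()
    threads = [threading.Thread(target=release.wait) for _ in range(count)]
    for t in threads:
        t.start()
    return threads, release


if __name__ == "__main__":
    threads, release = spawn_idle_threads(100)
    print("hardware threads visible to the OS:", os.cpu_count())
    print("software threads in this process:  ", threading.active_count())
    release.set()  # wake the idle threads so they can finish
    for t in threads:
        t.join()
```

The waiting threads consume almost no CPU time. The scheduler only has to pick among the threads that are actually ready to run.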

available threads in Knights Landing

I am programming on a Knights Landing node which has 68 cores and 4 hyperthreads/core. I am working on a hybrid MPI/OpenMP application.
My question is if the 4 hyperthreads are meant to be used as OpenMP
threads or how could I use them? When I run my program using the
following scheme:
export OMP_NUM_THREADS=1
mpirun -np 68 ./app
it runs much faster than when I use the scheme:
export OMP_NUM_THREADS=4
mpirun -np 68 ./app
Maybe the problem is that the threads of a given MPI rank are not placed close to
each other. However, I don't know how to control that.
In summary, can I use the 4 hyperthreads/core as OpenMP threads?
Thanks.
As you're probably using the Intel MPI and OpenMP runtimes, here are some pointers on pinning MPI processes and OpenMP threads to processor cores/threads. Process/thread binding is a must nowadays to achieve high performance. Even though the OS tries to do its best, moving a process/thread from one core/thread to another means that its data has to be moved as well. For details, take a look at "Running an MPI/OpenMP Program" and "Environment Variables for Process Pinning" in the Intel MPI documentation. For instance, if you run with 68 MPI ranks, you probably want to start by placing each MPI rank on a different core. You can double-check whether mpirun is honoring your requests by setting the I_MPI_DEBUG environment variable.
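A possible starting point for the pinning described above is sketched here as an environment for the launcher. It is an assumption-laden example, not a tuned configuration: OMP_PLACES and OMP_PROC_BIND are standard OpenMP variables, I_MPI_PIN_DOMAIN and I_MPI_DEBUG are Intel MPI variables, and the values chosen (threads, close, omp, 4) are guesses to be verified on the actual node. The helper name pinning_env is made up.

```python
import os


def pinning_env(omp_threads, base_env=None):
    """Return an environment that asks for one pinned domain per MPI rank."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "OMP_NUM_THREADS": str(omp_threads),
        # Standard OpenMP: one place per hardware thread, and keep the
        # threads of a team on neighbouring places.
        "OMP_PLACES": "threads",
        "OMP_PROC_BIND": "close",
        # Intel MPI: size each rank's pinning domain by OMP_NUM_THREADS.
        "I_MPI_PIN_DOMAIN": "omp",
        # Intel MPI: assumed debug level that prints the pinning map.
        "I_MPI_DEBUG": "4",
    })
    return env


# Hypothetical use with the command from the question:
#   import subprocess
#   subprocess.run(["mpirun", "-np", "68", "./app"], env=pinning_env(4))
```

Whether the four logical CPUs of a domain really are the four hardware threads of one core should be confirmed from the I_MPI_DEBUG output before comparing timings.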

Curious about how to specify the number of cores for MPI in order to get the fastest scientific computation

I have been running several scientific program packages in conjunction with MPI by using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that this gave me clearly the fastest computation.
Why does setting N and M to 12 and 2 give higher performance than the first case?
There is no absolute rule on how to run an MPI+OpenMP application.
The only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today most systems are NUMA).
Then just experiment.
Some apps run best in flat MPI (i.e. one thread per task), while others work best with one MPI task per socket and all available cores used for OpenMP.
Last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
For example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time sharing and performance will be horrible.
Or, if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket).
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
Indeed, there is no absolute rule on how to run an MPI+OpenMP application.
I agree with everything Gilles said,
so I want to talk about the CPU in your case.
The specification you give shows that hyper-threading is enabled on the system,
but this does not always help. Your computer in fact has 12 physical cores.
So I advise you to try some combinations that make M * N between 12 and 24,
like 12*1, 6*2, 6*3.
Which one is best depends on your application.
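The advice in these answers (keep the threads of one rank inside a socket, then experiment) can be turned into a short list of candidates to benchmark. The sketch below is one way to generate such a list from lscpu output. The function names are made up, the sample is an excerpt of the lscpu listing in the question, and the rule "M must divide the cores of a socket" is an assumption made for this example, not a requirement of MPI or OpenMP.

```python
# Excerpt of the lscpu output shown in the question.
LSCPU_SAMPLE = """\
CPU(s): 24
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
"""


def parse_lscpu(text):
    """Extract the topology counts from `lscpu` output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "sockets": int(fields["Socket(s)"]),
        "cores_per_socket": int(fields["Core(s) per socket"]),
        "threads_per_core": int(fields["Thread(s) per core"]),
    }


def rank_thread_candidates(topo, use_hyperthreads=False):
    """List (N, M) pairs: N MPI ranks with M OpenMP threads each.

    Each rank's threads fit inside one socket, and N * M equals the number
    of physical cores (or of hardware threads if use_hyperthreads is True).
    """
    per_socket = topo["cores_per_socket"]
    if use_hyperthreads:
        per_socket *= topo["threads_per_core"]
    pairs = []
    for m in range(1, per_socket + 1):
        if per_socket % m == 0:  # ranks of size m tile a socket exactly
            pairs.append((topo["sockets"] * per_socket // m, m))
    return pairs


if __name__ == "__main__":
    topo = parse_lscpu(LSCPU_SAMPLE)
    print(rank_thread_candidates(topo))
    print(rank_thread_candidates(topo, use_hyperthreads=True))
```

Each pair still has to be timed on the real application, with the MPI binding checked as described above. The list only narrows down what to try.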

How do I force MPI to not run on all cores if I have more threads than cores?

Context: I'm debugging a simulation code that requires that the number of MPI threads does not change when continuing the simulation from a restart file. This code was running on a large cluster, but I'm debugging it on a smaller local machine so that I don't have to wait to submit the job to a queue. The code requires 72 threads, which is more than the number of cores on the local machine. This is not a problem in itself - I can run with more threads than cores, and just take the performance hit, which is not a major issue when debugging.
The Problem: I want to leave some cores free for other tasks and other users. For instance, if my small local computer has 48 cores, I want to run my 72 threads on, say, 36 cores, and leave 12 cores free. I want to debug my large code locally without completely taking over the machine.
Assuming I'm willing to deal with the memory and performance issues of running on more threads than cores, how do I actually do this? Do I have to get into the back-end of the scheduler somehow? Does it depend on whether I'm using MPICH or Open-MPI etc?
I'm essentially looking for something like mpirun -np 72 --cpus-per-proc 0.5, if that were possible.
taskset -c 0-35 mpiexec -np 72 ./a.out should do the trick if the processes are all launched on the same host, and it should work with basically all MPI distributions (Open MPI, MPICH, Intel MPI, etc.). Also, make sure to disable any process binding by the MPI library, i.e. pass --bind-to none for Open MPI 1.8+, -bind-to none for MPICH with Hydra, or -genv I_MPI_PIN=0 for Intel MPI.
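For scripting this, the command line can be assembled per MPI flavor. The following is a small sketch with made-up helper names (cpu_list, confined_mpi_command). The binding-off flags are copied from the answer above and have not been checked against every MPI version.

```python
# Flags that switch off the MPI library's own binding, as listed above.
BIND_OFF = {
    "openmpi": ["--bind-to", "none"],      # Open MPI 1.8+
    "mpich": ["-bind-to", "none"],         # MPICH with Hydra
    "intelmpi": ["-genv", "I_MPI_PIN=0"],  # Intel MPI
}


def cpu_list(cpus):
    """Format CPU numbers as a taskset -c list, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    cpus = sorted(set(cpus))
    if not cpus:
        raise ValueError("at least one CPU is required")
    parts = []
    start = prev = cpus[0]
    for c in cpus[1:] + [None]:  # None marks the end of the last run
        if c is not None and c == prev + 1:
            prev = c
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        if c is not None:
            start = prev = c
    return ",".join(parts)


def confined_mpi_command(cpus, ranks, program, flavor="openmpi"):
    """Build the argv that runs `ranks` MPI processes on the given CPUs only."""
    return (["taskset", "-c", cpu_list(cpus), "mpiexec"]
            + BIND_OFF[flavor]
            + ["-np", str(ranks), program])


if __name__ == "__main__":
    print(" ".join(confined_mpi_command(range(36), 72, "./a.out")))
```

The resulting list can be passed to subprocess.run(). The processes started by mpiexec inherit the CPU mask set by taskset, which is why the library's own binding has to be switched off.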
