Hybrid MPI / OpenMP - multithreading

I've been trying to use OpenMPI with OpenMP. When I try to run 2 MPI processes and 4 threads on one machine, all threads are executed on the same core at 25% usage instead of on 4 separate cores. I was able to fix this by using --enable-mpi-threads when building OpenMPI, but now I am having an issue because this is a dual-CPU machine.
There are 8 cores per processor and 2 processors in each server. If I run 2 MPI processes and 8 threads, everything is fine as long as the 2 processes start on separate processors, but if I try to run 1 MPI process with 16 threads, it reverts to stacking every thread on one core.
Has anyone had any experience running OpenMPI and OpenMP together?
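One way to see whether a process has been restricted to a single core is to look at its CPU affinity mask, since threads created later inherit the mask of the thread that creates them. The following is only a diagnostic sketch, not a fix: it assumes Linux (os.sched_getaffinity is not available everywhere), and the helper names allowed_cpus and looks_pinned are made up for illustration.

```python
import os


def allowed_cpus():
    """Return the sorted list of CPU ids the calling thread may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: threads created from here on inherit this mask.
        return sorted(os.sched_getaffinity(0))
    # Assumption for platforms without an affinity API: every CPU is allowed.
    return list(range(os.cpu_count() or 1))


def looks_pinned(cpus, threads_wanted):
    """True when fewer CPUs are allowed than threads we intend to run."""
    return len(cpus) < threads_wanted


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("allowed CPUs:", cpus)
    print("too few CPUs for 16 threads:", looks_pinned(cpus, 16))
```

Running this inside each MPI rank would show whether a rank that is meant to host 16 threads is allowed to use only one CPU.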

Related

When would 2 threads be executed on 1 physical CPU core of a multi-core machine?

Let's say there's a machine with an 8-core CPU.
I'm creating 2 POSIX threads using the standard pthread_create(...) function.
As far as I know, there is no guarantee that these threads will always be executed on 2 different physical cores, but in practice, about 90% of the time, they run simultaneously (in parallel). At least in my cases, the top command shows 2 CPUs running, with around 160-180% CPU usage.
The question is:
In what scenario would 2 threads within a single process run on only 1 physical core?
Two cases:
1) The other physical cores are busy doing other stuff, so only one core gets used by this process. The two threads run in alternation on that core.
2) The physical core supports executing more than one thread concurrently using hyperthreading or something similar. The other physical cores are busy doing other stuff, so the best the scheduler can do is run both threads in a single physical core.

Hybrid parallelization with OpenMP and MPI

I'm trying to set up a program which runs across a cluster of 20 nodes, each with 12 cores. The idea is to have the head process distribute some data out to each node, and have each node perform some operations on the data using OpenMP to utilize the 12 cores. I'm relatively new to this and not sure about the best way to set this up.
We use PBS as the scheduler and my original plan was to create a single MPI process on each node, and let OpenMP create 12 threads per node.
#PBS -l nodes=20:ppn=1
But when I run this, OpenMP seems to only create 1 thread per process. How can I set this up so OpenMP will always create 12 threads per MPI process?
Edit: As soon as I specify more than 1 process in PBS, OpenMP starts using 6 threads per process. I can't figure out why using only 1 process per node isn't working.
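The arithmetic behind the desired thread count can be made explicit instead of being left to the runtime's default. The sketch below divides the cores on a node among the MPI processes placed on it and puts the result in OMP_NUM_THREADS, the standard OpenMP environment variable. The function names are hypothetical, and how the variable reaches the ranks on remote nodes depends on the MPI launcher, which is not covered here.

```python
def threads_per_process(cores_per_node, processes_per_node):
    """Split a node's cores evenly among the MPI processes placed on it."""
    if processes_per_node < 1:
        raise ValueError("need at least one process per node")
    if cores_per_node < processes_per_node:
        raise ValueError("more processes than cores on the node")
    return cores_per_node // processes_per_node


def openmp_env(cores_per_node, processes_per_node):
    """Build the environment entry that requests that many OpenMP threads."""
    count = threads_per_process(cores_per_node, processes_per_node)
    return {"OMP_NUM_THREADS": str(count)}


if __name__ == "__main__":
    # 12 cores per node, as in the question.
    print(openmp_env(12, 1))  # {'OMP_NUM_THREADS': '12'}
    print(openmp_env(12, 2))  # {'OMP_NUM_THREADS': '6'}
```

With 2 processes per node the result is 6, which matches the thread count reported in the edit above.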

How many cores does a process occupy?

Let's say I have 4 cores on my machine and a process that spawns 4 threads. While this process is the one currently scheduled, are all 4 of those cores reserved for the process's 4 threads?
That is a complex question, but I can help. As a general rule, a single-threaded process uses only 1 core. More precisely, 1 thread can be executed by only 1 core at a time. A dual-core processor is effectively 2 CPUs in one package. These are called physical cores, and each one executes 1 thread at a time. However, some CPUs have 2 physical cores but can run 4 threads simultaneously. The extra 2 threads run on logical cores, which do not exist physically but are presented by the CPU as additional cores.
If by process you mean thread, then yes: 1 thread, 1 core. You can run 4 threads at once on a CPU with 4 compute cores (a term that covers both physical and logical cores, since a single-core CPU may have only 1 compute core).
If by process you mean a program, or an entry in the Processes tab of Task Manager, then it depends on how the program is written.
As for your question: if a process spawns 4 threads, when they run depends on where they sit in the pool of threads waiting to be executed. There can be thousands of such threads, and the threads of one program or executable file do not have to be executed at the same time.
The 4 threads of your process are scheduled independently - the process itself isn't scheduled.
If all 4 threads are runnable at the same time, and there are no other higher-priority runnable threads in the system, then all 4 threads may be scheduled simultaneously on your 4 cores.

How does more than one thread execute on a processor core

I want to know how a multi-threaded program with more threads than cores executes on the processor. For example, my program has 12 threads and I am running it on an Intel Core i5 machine, which has four cores. Will each core run 3 threads? I am confused because I have seen programs with 30 threads running on a 4-core machine.
Thanks
Each core can execute one thread at a time. So if there are 30 threads and 4 cores, 26 threads are waiting to be context-switched in at any moment. For example, threads 1-4 run for 200 ms, then threads 5-8 run for 200 ms, and so on.
A processor core is capable of executing one thread at a time, so on a quad-core machine 4 threads execute simultaneously. Not all of those are user-space threads: kernel threads also run, to schedule the next thread or to do other kernel tasks.
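The "threads 1-4, then 5-8" picture can be sketched as a toy round-robin schedule. This is only an illustration of time slicing under the assumption that every thread is always runnable and gets equal turns. Real schedulers also weigh priorities, blocking, and fairness, and time_slices is a made-up name.

```python
from itertools import islice


def time_slices(num_threads, num_cores):
    """Yield, slice after slice, the thread ids that run at the same time."""
    if num_threads < 1 or num_cores < 1:
        raise ValueError("need at least one thread and one core")
    running = min(num_cores, num_threads)  # one thread per core at most
    start = 0
    while True:
        # Thread ids are 1-based and wrap around after the last thread.
        yield [(start + i) % num_threads + 1 for i in range(running)]
        start = (start + num_cores) % num_threads


if __name__ == "__main__":
    for number, group in enumerate(islice(time_slices(30, 4), 3), start=1):
        print("slice", number, "runs threads", group, "while", 30 - len(group), "wait")
```

With 30 threads and 4 cores, each slice runs 4 threads while the other 26 wait, which is the situation described above.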

POSIX Threads on a Multiprocessor System

I have written software which takes advantage of POSIX threads so that I can utilize shared memory within the process. My question: if I have a machine running Ubuntu with 4 processors, each with 16 cores, is it more efficient to run 4 processes with 16 threads each or 1 process with 64 threads? Each processor has a dedicated 32 GB of RAM.
My main worry is that there will be a lot of memory copying happening behind the scenes with 1 process.
In summary:
On a machine with 4 processors (16 cores each):
1 process with 64 threads, or 4 processes with 16 threads each?
If the process requires more than 32 GB of RAM (the amount dedicated to one processor), does the answer differ?
Thanks for your help
Depends on what your application does.
A thread in a single-threaded process runs faster than a thread in a multi-threaded process, since the latter requires synchronization between threads in library functions like malloc(), fprintf(), etc. Also, more threads in a multi-threaded process are likely to cause more lock contention, slowing each other down. If threads don't need to communicate and don't share data, they don't need to be in the same process.
In your case, you may get better parallelism with 4 processes of 16 threads each rather than 1 process with 64 threads.
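The memory side of the question comes down to simple arithmetic, sketched here under a loud assumption: that the total working set splits evenly across the processes, which the question does not state. The 32 GB per processor comes from the question. The 24 GB and 48 GB totals and the function names are hypothetical.

```python
def per_process_memory_gb(total_gb, processes):
    """Memory each process needs, ASSUMING the working set splits evenly."""
    if processes < 1:
        raise ValueError("need at least one process")
    return total_gb / processes


def fits_on_one_socket(total_gb, processes, local_gb_per_socket):
    """True when one process's share fits in the RAM attached to one processor."""
    return per_process_memory_gb(total_gb, processes) <= local_gb_per_socket


if __name__ == "__main__":
    sockets, cores_per_socket, local_gb = 4, 16, 32
    for processes in (1, sockets):
        threads = sockets * cores_per_socket // processes
        for total_gb in (24, 48):  # hypothetical working-set sizes
            fits = fits_on_one_socket(total_gb, processes, local_gb)
            print(processes, "process(es) x", threads, "threads,",
                  total_gb, "GB total: fits in local RAM =", fits)
```

Under that assumption, a single 48 GB process cannot stay within one processor's 32 GB, while four processes of 12 GB each can.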

Resources