MPI code using more CPU resources than allocated (over-committing) - multithreading

Here is the situation:
I'm running a genetic algorithm (GA) optimization code that calls a standalone package (ccx) to do the function evaluation. To improve efficiency, the GA code is parallelized using MPI. The program is allocated, say, N+1 cores on an HPC cluster. It then uses N cores to call the standalone evaluation package in parallel, waits for them to finish, assesses the results, and generates a new set of parameters to start a new round of evaluations. This process is repeated until a certain criterion is met.
Here's the problem:
The program sometimes over-commits the allocated resources.
The following is an example output of the top command sent to me by the administrator of the cluster. Note that the CPU usage is 200%.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7498 bgeng 20 0 148512 17912 5300 R 199.3 0.0 534:31.08 GATOOL
The admin tells me that my code is doing threading somewhere. Since hyper-threading is disabled on the cluster (i.e. one core per process), it is effectively running two processes on one core, which slows down the entire compute node.
I don't think the threading is happening in the standalone package, because the over-committing does not happen when I run it alone on one core. Besides, the over-committing seems to occur in only one process among all the MPI processes allocated.
What could be the places or conditions where the over-committing can happen? I know little about MPI and I'm more of a user than a developer of the program, but I can try to test some clues. My apologies if the question is too abstract ...
The admin asked me to add
export OMP_NUM_THREADS=1
in the job submission script, which I tried, but it did not help.
Here is pseudocode showing the structure of the program:
main()
{
    MPI_Init()
    while (criteria not met) {
        if (myid > 0) execute_command_line('./ccx', wait=.true.)
        if (myid == 0) assess results
    }
    MPI_Finalize()
}
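For illustration, here is a minimal runnable sketch of the same structure in Python with mpi4py (./ccx, the stopping criterion, and assess_results are placeholders taken from the pseudocode, not the real program). The comment marks one place where a blocking MPI call by itself can show up as extra CPU in top, since most MPI implementations busy-wait inside blocking calls:

# Launch with, e.g.:  mpirun -np 5 python ga_loop.py
from mpi4py import MPI
import subprocess

def assess_results():
    # Placeholder for the GA bookkeeping done by rank 0.
    return True  # pretend the stopping criterion is met after one round

comm = MPI.COMM_WORLD
myid = comm.Get_rank()

done = False
while not done:
    if myid > 0:
        # Worker ranks each run one external evaluation and block
        # until it exits (like execute_command_line with wait=.true.).
        subprocess.run(["./ccx"], check=False)
    # Most MPI implementations spin (busy-wait) inside blocking calls
    # such as Barrier, so a rank parked here can still show ~100% CPU
    # on top of whatever its child process uses.
    comm.Barrier()
    if myid == 0:
        done = assess_results()
    done = comm.bcast(done, root=0)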


how are the multiprocessing and threading and thread pooling working

https://code.tutsplus.com/articles/introduction-to-parallel-and-concurrent-programming-in-python--cms-28612
I have studied this link, and I have a few questions:
Q1: How are the thread pool (concurrent) and threading approaches different here? Why do we see the performance improvement? The threading-with-a-queue version has 4 threads, each of which runs cooperatively during idle time and picks an item from the queue once it gets a website response. As I see it, the thread pool is doing much the same thing: completing its work and waiting for the manager to assign a task, which is very similar to picking a new item from the queue. I'm not sure how this is different and why I see the performance improvement. It seems I'm wrong in interpreting the pooling here. Could you explain?
Q2: Using multiprocessing, the time taken is longer. If I have a multiprocessor machine that can handle multiple processes at a time, then all 4 of my processes should be handled at once; that is where the real parallelization happens. I also have a question here: since the 4 processes run the same function, doesn't the GIL try to stop them from executing the same piece of code? Suppose all of them share a common variable that gets updated, like the number of websites checked. How does the GIL work in these cases of multiprocessing?
Also, are the same processes used again and again, or are they killed and created anew each time after finishing their job? I think the same processes are reused. I also think the performance problem comes from process creation, which is costly compared to the lightweight threads of the concurrent threading phase. So could you explain in more detail how the GIL works here and how the processes run: do they run cooperatively (each process waiting for its turn, like threads within a process do), or do they use the multiple processors to run truly in parallel? My other question is: if I have an 8-core machine, I think I can run 8 threads of the same process simultaneously or in parallel. Can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores? I had thought cores were only for the threads of one process, which would mean I can't run 8 processes on 8 cores but only as many processes as there are CPUs in my multiprocessor system; am I right? So, can I run 2 processes with 4 threads each on my 8-core machine with 2 processors, each having 4 cores?
Python has a rich set of libraries for multitasking with processes and threads. However, there is overlap between the libraries, and the choice depends on how abstractly you view the computational tasks. For example, the concurrent.futures library views threads as asynchronous tasks, while the threading library deals with them as high-level threads. Further, _thread implements a low-level interface for threading, exposing all the synchronization mechanisms.
The GIL (Global Interpreter Lock) is just a synchronization primitive, specifically a mutex, which prevents multiple threads of the same process from executing Python bytecode at the same time (so that certain objects remain consistent under concurrent operations). This is exactly why Python threads excel at I/O-bound work but not at compute-intensive tasks: the GIL is released during certain blocking calls and inside computationally intensive libraries such as numpy. Note that only the CPython and PyPy implementations of Python are constrained by a GIL.
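As a quick, self-contained illustration of that last point (timings are approximate and machine-dependent):

import threading
import time

def cpu_bound(n=10_000_000):
    # Pure Python bytecode: this holds the GIL while it runs.
    while n:
        n -= 1

def io_bound():
    # time.sleep releases the GIL, just like most blocking I/O calls.
    time.sleep(1)

def timed(target, workers=4):
    threads = [threading.Thread(target=target) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print("4 CPU-bound threads:", timed(cpu_bound))  # roughly 4x a single run
print("4 I/O-bound threads:", timed(io_bound))   # about 1 second total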
Now, let's see those questions...
How thread pool (Concurrent) and threading are different here? Why do we see the performance improvement?
Coming to the comparison between Threading and concurrent.futures.ThreadPoolExecutor (aka threading_squirrel vs future_squirrel), I've executed both programs with the same test case. There are two factors that contribute to this "performance improvement":
Network HEAD requests: Remember that network operations need not complete in the same time period every time you execute them... due to the very nature of packet transfer delays...
Order of thread execution: In the website you've linked, the author creates all threads initially, sets up the queue full of website links, and then starts all of them in a list comprehension loop. With concurrent.futures.ThreadPoolExecutor, each time a task is submitted, a thread is assigned to it if the predefined maximum number of threads/workers has not been reached. I've changed the code to mirror this technique. It seems to give a speedup, as the first thread begins work early on and doesn't need to wait for the queue to be filled up...
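For concreteness, the submit-as-you-go pattern described above looks roughly like this (the URL list is made up for illustration; substitute the links from the article):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical list of sites to check.
urls = ["https://www.python.org", "https://www.example.com"]

def head_request(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return url, resp.status

# Workers start on tasks as soon as they are submitted; there is no
# separate "fill the queue, then start every thread" phase.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(head_request, urls):
        print(url, status)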
How does GIL work in these cases of multiprocessing?
Remember that the GIL applies only to the threads within a process, not across processes. The GIL locks up the whole interpreter while a thread executes bytecode, so the other threads have to wait for their turn. This is the reason multiprocessing uses processes instead of threads: each process has its own interpreter and, consequently, its own GIL.
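To make the shared-variable case from Q2 concrete: because each process has its own GIL, the GIL does nothing to protect data shared between processes. You need an explicit lock, such as the one built into multiprocessing.Value. A minimal sketch:

from multiprocessing import Process, Value

def check_sites(counter, n):
    for _ in range(n):
        # get_lock() is the mutex attached to the shared Value; the
        # processes' GILs are independent and give no protection here.
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    checked = Value("i", 0)  # shared int, e.g. "websites checked"
    procs = [Process(target=check_sites, args=(checked, 100_000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(checked.value)  # reliably 400000 thanks to the lock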
Are the same processes used again and again, or are they killed and created anew every time after finishing their job?
The concept of pooling is to reduce the overhead of creating and destroying workers (be it threads or processes) during computation. However, the processes are kind of "brand new" in the sense that the library effectively asks the OS to perform a fork on a UNIX-based OS or to spawn on an NT-based OS...
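You can observe the reuse directly by printing worker PIDs; the same small set of process IDs comes back for every task. A minimal sketch:

from multiprocessing import Pool
import os

def worker_pid(_):
    return os.getpid()

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        pids = pool.map(worker_pid, range(8))
    # Only two distinct PIDs appear: the pool created two workers once
    # and keeps feeding them tasks instead of spawning new processes.
    print(sorted(set(pids)))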
Also, are the processes running co-operatively?
Maybe. They have to run in co-operation if they use shared memory (though they need not be running at the same time). There is definitely going to be a context switch if there are more processes than the OS can allocate to its processors' cores. They can run in parallel if there are no shared-memory updates to make.
If I have the 8 core machine can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores?
Sure (subject to the GIL, in Python). Each process can be allocated to a processing unit for execution; a processing unit can be a physical or a virtual core of a CPU. As long as the OS scheduler supports it, it's possible. Any reasonable split-up of processes and threads is possible. If all of them are allocatable, that's the best situation; otherwise you will encounter context switches (which are more expensive when it comes to processes).
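For instance, nothing stops you from combining both levels on an 8-core machine; whether all 8 threads make progress in parallel is then up to the OS scheduler (and, for pure-Python work, each process's GIL). A sketch of 2 processes with 4 threads each:

from multiprocessing import Process
import threading

def spin(n=5_000_000):
    while n:
        n -= 1

def run_four_threads():
    # Four threads inside ONE process share that process's GIL, so
    # pure-Python work is serialized within the process...
    threads = [threading.Thread(target=spin) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    # ...but the two processes have independent GILs and can run on
    # different cores at the same time.
    procs = [Process(target=run_four_threads) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()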
Hope I've answered all those questions!
Here are a few resources:
MultiCore CPUs, Multithreading and context switching?
Why does multiprocessing use only a single core after I import numpy?
Bonus celery-squirrel resource

Changing processes priority in linux

I have a C code that generates two processes, but I want to change their priority in the kernel, so I wrote a syscall in which I increase and decrease the priorities of the two processes (I tried this with all types of priorities: static_prio, normal_prio, prio, rt_priority), but the processes still ran simultaneously!
I can't use other syscalls in my syscall code, and the operating system is Ubuntu 16.04, kernel 4.4.
First, about concurrency: with sufficient resources, a single processor can be "working on" more than one program at a time. The processor cannot execute instructions from several programs simultaneously (it can only execute one instruction at any time), but it may be executing instructions from one program after it has started another program and before it has finished that other program. Such programs are said to be executing "concurrently". Typically, the processor will be executing instructions for one program while another program (or several others) waits for I/O from external devices or from an end-user.
Second, about efficiency: the scheduler should keep the system (in particular, the CPU) busy 100% of the time when possible. If the CPU and all the input/output devices can be kept running all the time, more work gets done per second than if some components are idle. If we assume that you have a dual-core processor, the scheduler will assign each of your two processes to one of the two cores (at a minimum, you need more runnable processes than processors before they have to compete).
Third, you must put a significant load on your system to start seeing the effect of your code, because priority scheduling only matters when processes are contending for limited CPU resources. So, if you have a dual-core processor, you need about three or more CPU-bound processes to start seeing results, and so on (as mentioned above about efficiency).
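A hedged sketch of that last point: priority only becomes visible once there are more runnable CPU-bound processes than cores. On Linux you can reproduce this from user space with niceness (no custom syscall needed); the heavily niced processes should accumulate noticeably less CPU time:

from multiprocessing import Process
import os
import time

def burn(niceness, seconds=10):
    os.nice(niceness)           # raise niceness = lower priority
    end = time.time() + seconds
    n = 0
    while time.time() < end:    # pure CPU burn
        n += 1
    print(f"nice {niceness}: {os.times().user:.1f}s of user CPU time")

if __name__ == "__main__":
    # Start more CPU hogs than there are cores, half at normal
    # priority and half heavily niced.
    count = os.cpu_count() * 2
    procs = [Process(target=burn, args=(19 if i % 2 else 0,))
             for i in range(count)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()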

Matlab: -maxNumCompThreads, hyperthreading, and parpool

I'm running Matlab R2014a on a node in a Linux cluster that has 20 cores and hyperthreading enabled. I know this has been discussed before, but I'm looking for some clarification. Here's what my understanding is of the threads vs. cores issue in Matlab:
Matlab has inherent multithreading capabilities, and will utilize extra cores on a multicore machine.
Matlab runs its threads in such a way that putting multiple Matlab threads on the same core (i.e. hyperthreading) isn't useful. So by default, the maximum number of threads that Matlab will create is the number of cores on your system.
When using parpool(), regardless of the number of workers you create, each worker will use only one physical core, as mentioned in this thread.
However, I've also read that using the (deprecated) function maxNumCompThreads(), you can either decrease or increase the number of threads that Matlab or one of the workers will generate. This can be useful in several scenarios:
You want to utilize Matlab's implicit multithreading capabilities to run some code on a cluster node without allocating the entire node. It would be nice if there was some other way to do this if maxNumCompThreads ever gets removed.
You want to do a parameter sweep but have fewer parameters than the number of cores on your machine. In this case you might want to increase the number of threads per worker so that all of your cores are utilized. This was suggested recently in this thread. However, in my experience, while the individual workers seem quite happy to use maxNumCompThreads() to increase their thread count, inspecting the actual CPU usage using the "top" command suggests that it doesn't have any effect, i.e. each worker still only gets to use one core. It's possible that what is happening is that the individual Matlab processes spawned by the parpool are run with the argument -singleCompThread. I've confirmed that if the parent Matlab process is run with -singleCompThread, the command maxNumCompThreads(n), where n > 1, throws an error due to the fact that Matlab is running in single-threaded mode. So the result seems to be that (at least in 2014a) you can't increase the number of computational threads on the parallel pool workers.
Related to this is that I can't seem to get the parent Matlab process to start more threads than there are cores, even though the computer itself has hyperthreading enabled. Again, it will happily run maxNumCompThreads(n), where n > # physical cores, but the fact that top shows CPU utilization of 50% suggests otherwise. So what is happening, or what am I misunderstanding?
Edit: to lay out my questions more explicitly:
Within a parfor loop, why doesn't setting maxNumCompThreads(n) with n > 1 seem to work? If it's because the worker process is started with -singleCompThread, why doesn't maxNumCompThreads() return an error like it does in the parent process started with -singleCompThread?
In the parent process, why doesn't using maxNumCompThreads(n), where n > # physical cores, do anything?
Note: I posted this previously on Matlab answers, but haven't received any feedback.
Edit2: It looks like the problem in (1) was an issue with the test code I was using.
That's quite a long question, but I think the straightforward answer is that yes, as I understand it, MATLAB workers are started with -singleCompThread.
First, a few quick tests to confirm our understanding:
> matlab.exe -singleCompThread
>> warning('off', 'MATLAB:maxNumCompThreads:Deprecated')
>> maxNumCompThreads
ans =
1
>> maxNumCompThreads(2)
Error using feature
MATLAB has computational multithreading disabled.
To enable multithreading please restart MATLAB without singleCompThread option.
Error in maxNumCompThreadsHelper (line 37)
Error in maxNumCompThreads (line 27)
lastn = maxNumCompThreadsHelper(varargin{:});
As indicated, when MATLAB is started with the -singleCompThread option, we cannot override it using maxNumCompThreads.
> matlab.exe
>> parpool(2); % local pool
>> spmd, n = maxNumCompThreads, end
Lab 1:
n =
1
Lab 2:
n =
1
We can see that each worker is by default limited to a single computation thread. This is a good thing, because we want to avoid over-subscription and unnecessary context switches, which occur when the number of threads trying to run exceeds the number of available physical/logical cores. So in theory, the best way to maximize CPU utilization is to start as many single-threaded workers as we have cores.
Now, by looking at the local worker processes running in the background, we see that each is launched as:
matlab.exe -dmlworker -noFigureWindows [...]
I believe the undocumented -dmlworker option does something similar to -singleCompThread, but probably a bit different. For one, I was able to override it using maxNumCompThreads(2) without it throwing an error like before.
Remember that even if a MATLAB session is running in single-threaded computation mode, it doesn't mean the computational thread is exclusively restricted to one CPU core (the thread can jump around between cores as assigned by the OS scheduler). You'll have to set the affinity of the worker processes if you want to control that.
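On Linux you could do that from outside MATLAB, for instance with taskset, or with a small helper script. A minimal Python sketch (the PIDs here are hypothetical; look up the real worker PIDs first, e.g. with ps -ef | grep dmlworker):

import os

# Hypothetical PIDs of the MATLAB worker processes.
worker_pids = [7501, 7502]

for i, pid in enumerate(worker_pids):
    # Pin worker i to core i (Linux-only; needs permission on those PIDs).
    os.sched_setaffinity(pid, {i})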
So I did some profiling using Intel VTune Amplifier. Basically I ran some linear algebra code, and performed hotspots analysis by attaching to the MATLAB process and filtering on the mkl.dll module (this is the Intel MKL library that MATLAB uses as an optimized BLAS/LAPACK implementation). Here are my results:
- Serial mode
I used the following code: eig(rand(500));
- Starting MATLAB normally: the computation spawns 4 threads (the default automatic value, chosen because I have a quad-core Intel i7 CPU).
- Starting MATLAB normally, but calling maxNumCompThreads(1) before the computation: as expected, only 1 thread is used.
- Starting MATLAB with the -singleCompThread option: again, only 1 thread is used.
- Parallel mode (parpool)
I used the following code: parpool(2); spmd, eig(rand(500)); end. In both cases below, MATLAB is started normally.
- When running code on the workers with the default settings, each worker is limited to one computation thread.
- When I override the settings on the workers using maxNumCompThreads(2), each worker uses 2 threads.
Hope that answers your questions :)
I was wrong about maxNumCompThreads not working on parpool workers. It looks like the problem was that the code I was using:
parfor j = 1:2
    tic
    maxNumCompThreads(2);
    workersCompThreads(j) = maxNumCompThreads;
    i = 1;
    while toc < 200
        a = randn(10^i)*randn(10^i);
        i = i + 1;
    end
end
used so much memory by the time I checked CPU utilization that the bottleneck was I/O and the extra threads were already shut down. When I did the following:
parfor j = 1:2
    tic
    maxNumCompThreads(2);
    workersCompThreads(j) = maxNumCompThreads;
    i = 4;
    while toc < 200
        a = randn(10^i)*randn(10^i);
    end
end
The extra threads started and stayed running.
As for the second issue, I got confirmation from MathWorks that the parent Matlab process won't start more threads than the number of physical cores, even if you explicitly raise the limit beyond that. So in the documentation, the sentence:
"Currently, the maximum number of computational threads is equal to the number of computational cores on your machine."
should say:
"Currently, the maximum number of computational threads is equal to the number of physical cores on your machine."

Will forking more workers allow me to balance CPU-heavy work?

I love node.js' evented model, but it only takes you so far - when you have a function (say, a request handler for HTTP connections) that does a lot of heavy work on the CPU, it's still "blocking" until the function returns. That's to be expected. But what if I want to balance this out a bit, so that a given request takes longer to process but the overall response time is shorter, using the operating system's ability to schedule the processes?
My production code uses node's wonderfully simple Cluster module to fork a number of workers equal to the number of cores the system's CPU has. Would it be bad to fork more than this - perhaps two or three workers per core? I know there'll be a memory overhead here, but memory is not my limitation. The reading I did mentioned that you want to avoid "oversubscribing", but surely on a modern system you're not going crazy by having two or three processes vying for time on the processor.
I think your idea sounds like a good one; especially because many processors support hyperthreading. Hyperthreading is not magical and won't suddenly double your application's speed or throughput but it can make sense to have another thread ready to execute in a core when the first thread needs to wait for a memory request to be filled.
Be careful when you start multiple workers: the Linux kernel really prefers to keep processes executing on the same processor for their entire lifetime to provide strong cache affinity. That makes enough sense. But I've seen several CPU-hungry processes vying for a single core, or worse a single hyperthread instance, rather than the system re-balancing the processes across all cores or all siblings. Check your processor affinities by running ps -eo pid,psr,comm (or whatever your favorite ps(1) command is; add the psr column).
To combat this you might want to start your workers with an explicitly limited CPU affinity:
taskset -c 0,1 node worker 1
taskset -c 2,3 node worker 2
taskset -c 4,5 node worker 3
taskset -c 6,7 node worker 4
Or perhaps start eight, one per HT sibling, or eight and confine each one to their own set of CPUs, or perhaps sixteen, confine four per core or two per sibling, etc. (You can go nuts trying to micromanage. I suggest keeping it simple if you can.) See the taskset(1) manpage for details.

Linux 2.6.31 Scheduler and Multithreaded Jobs

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems that when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set to high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, using Linux notation) are idle. Can anyone explain this behavior and what I can do about it to get all 23 of the cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Is it at all possible that it's relevant that my 24-core jobs are 32-bit while the other jobs I'm competing with are 64-bit?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.
It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here's a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two fewer threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread runs at 3/4 normal speed, making the total utilization (4 x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).
Do your threads communicate with each other?
Try manually binding every thread to a CPU, with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with a lot of related threads.
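In Python the same experiment looks like this; on Linux, os.sched_setaffinity also accepts a thread's native ID, since the underlying syscall operates on TIDs. (Note that the GIL means only one of these Python threads executes bytecode at a time; the point of the sketch is only the binding API, which in C would be pthread_setaffinity_np.)

import os
import threading
import time

def pinned_spin(cpu):
    # Bind this thread (via its kernel TID) to a single CPU.
    os.sched_setaffinity(threading.get_native_id(), {cpu})
    n = 0
    while True:  # CPU-bound loop that stays on its assigned core
        n += 1

threads = [threading.Thread(target=pinned_spin, args=(cpu,), daemon=True)
           for cpu in range(os.cpu_count())]
for t in threads:
    t.start()
# Keep the main thread alive; watch placement with:
#   ps -eLo pid,tid,psr,comm
time.sleep(30)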
It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>

#define NUM_CORES 24

/* Spin forever so each thread keeps one core fully busy. */
void* loop_forever(void* argument) {
    volatile int a = 0;
    while (1) a++;
    return 0;
}

int main(void) {
    int i;
    pthread_t threads[NUM_CORES];
    /* Start one CPU-bound thread per core... */
    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], 0, loop_forever, 0);
    /* ...and wait on them (they never exit; watch utilization with top). */
    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], 0);
    return 0;
}
