How do I force MPI to not run on all cores if I have more threads than cores?

Context: I'm debugging a simulation code that requires that the number of MPI threads does not change when continuing the simulation from a restart file. This code was running on a large cluster, but I'm debugging it on a smaller local machine so that I don't have to wait to submit the job to a queue. The code requires 72 threads, which is more than the number of cores on the local machine. This is not a problem in itself - I can run with more threads than cores, and just take the performance hit, which is not a major issue when debugging.
The Problem: I want to leave some cores free for other tasks and other users. For instance, if my small local computer has 48 cores, I want to run my 72 threads on, say, 36 cores, and leave 12 cores free. I want to debug my large code locally without completely taking over the machine.
Assuming I'm willing to deal with the memory and performance issues of running more threads than cores, how do I actually do this? Do I have to get into the back-end of the scheduler somehow? Does it depend on whether I'm using MPICH or Open MPI, etc.?
I'm essentially looking for something like mpirun -np 72 --cpus-per-proc 0.5, if that were possible.

taskset -c 0-35 mpiexec -np 72 ./a.out should do the trick if the processes are all to be launched on the same host, and it should work with basically all MPI distributions (Open MPI, MPICH, Intel MPI, etc.). Also, make sure to disable any process binding by the MPI library, i.e. pass --bind-to none for Open MPI 1.8+, -bind-to none for MPICH with Hydra, or -genv I_MPI_PIN=0 for Intel MPI.
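Putting the pieces together, a minimal sketch for the 48-core example above (confining 72 ranks to cores 0-35; adjust the core list to your own machine and MPI flavour):
$ taskset -c 0-35 mpiexec -np 72 --bind-to none ./a.out    # Open MPI 1.8+
$ taskset -c 0-35 mpiexec -np 72 -bind-to none ./a.out     # MPICH with Hydra
$ taskset -c 0-35 mpiexec -np 72 -genv I_MPI_PIN=0 ./a.out # Intel MPI
The affinity mask set by taskset is inherited by the launched ranks, and with library-level binding disabled the kernel scheduler is free to juggle the 72 processes across those 36 cores.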

Related

OpenMP and the number of cores vs. CPUs

I'm wondering how OpenMP figures out how many threads it can run via the omp_get_max_threads() library call. I'm running on a CentOS Linux machine using gcc -fopenmp. My machine has 16 AMD Opteron(tm) Processor 6136 CPUs, each with 8 cores, all according to /proc/cpuinfo. If I run omp_get_num_procs() it returns 16. But omp_get_max_threads() also returns 16. Why isn't the max threads number 16*8?
When I run a program that uses 16 threads I see the program in top running at ~1600% of CPU and if I toggle 'Last used cpu (SMP)' that number moves around a bit. So the 1600% makes sense but is there any way to know which cores of which CPUs the threads are running on?
I'm pretty new to OpenMP, so sorry if these questions seem naive.
You can use the hwloc tool set to find out where the threads of any running application are bound or where they last ran on the hardware threads/cores. You only need the name or the PID of the target running process. Here is an example:
$ hwloc-ps --pid 2038168 --threads --get-last-cpu-location
2038168 Machine:0 ./a.out
2038168 Core:5 a.out
2038169 Core:3 a.out
2038170 Core:1 a.out
2038171 Core:4 a.out
2038172 Core:0 a.out
2038173 Core:2 a.out
Here we can see that the process a.out (with the PID 2038168) uses 6 threads, each mapped to a different core.
However, the mapping of threads to cores can change over time if you do not configure OpenMP properly (a starting point is to set the environment variables OMP_PROC_BIND and OMP_PLACES).
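For instance, one reasonable (but by no means the only) configuration that pins threads to cores:
$ export OMP_PLACES=cores     # one place per physical core
$ export OMP_PROC_BIND=close  # pack the threads onto neighbouring places
$ ./a.out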
Additionally, you can use lstopo (also installed as hwloc-ls) from the same hwloc tool set to understand the topology of your machine (how many packages and cores there are, how many hardware threads, how they are connected, etc.).
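For example (the exact output depends on your hwloc version and machine):
$ lstopo --no-io    # packages, caches, cores and hardware threads, without the I/O devices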
I am very surprised that you can have 16 "AMD Opteron(tm) Processor 6136" CPUs. Indeed, this processor uses the G34 socket, which is available in arrangements of up to 4 sockets (8 dies in total). So please check this with lstopo!
An alternative way is to use a profiling tool (such as Intel VTune).

available threads in Knights Landing

I am programming on a Knights Landing node which has 68 cores and 4 hyperthreads/core. I am working on a hybrid MPI/OpenMP application.
My question is whether the 4 hyperthreads are meant to be used as OpenMP threads, or how else I could use them. When I run my program using the following scheme:
export OMP_NUM_THREADS=1
mpirun -np 68 ./app
it runs much faster than when I use the scheme:
export OMP_NUM_THREADS=4
mpirun -np 68 ./app
Maybe the problem is that the threads belonging to a given MPI rank are not placed close to each other. However, I don't know how to control that.
In summary, can I use the 4 hyperthreads/core as OpenMP threads?
Thanks.
As you're probably using the Intel MPI and OpenMP runtimes, allow me to forward you some links with valuable information for pinning MPI ranks and OpenMP threads to processor cores/threads. Process/thread binding is a must nowadays to achieve high performance. Even though the OS tries to do its best, moving a process/thread from one core/thread to another location implies that its data needs to be transferred as well. For that matter, take a look at Running an MPI/OpenMP Program and Environment Variables for Process Pinning. For instance, if you run with 68 MPI ranks, you probably want to start by placing each MPI rank on a different core. You can double-check whether mpirun is honoring your requests by setting the I_MPI_DEBUG environment variable (as described here).
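As a rough sketch of what such a hybrid launch can look like with the Intel runtimes on a 68-core KNL node (the values below are one plausible choice, not a prescription):
$ export OMP_NUM_THREADS=4       # 4 OpenMP threads per MPI rank
$ export OMP_PLACES=threads      # one place per hardware thread
$ export OMP_PROC_BIND=close     # keep a rank's threads on neighbouring hardware threads
$ export I_MPI_PIN_DOMAIN=core   # give each rank one physical core, i.e. its 4 hyperthreads
$ export I_MPI_DEBUG=4           # print the resulting pinning at startup
$ mpirun -np 68 ./app
With this layout each of the 68 ranks owns one core and its OpenMP threads stay on that core's hyperthreads, which makes the OMP_NUM_THREADS=4 case directly comparable to the OMP_NUM_THREADS=1 case.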

Performance Problems with Node.js (Mac OSX) - Processes

I hope to find some help here.
We are using Node, MongoDB, supertest, mocha and spawn in our test environment.
We've tried to improve our mocha test environment to run tests in parallel, because our test cases now take almost 5 minutes (600 cases)!
We are spawning, for example, 4 processes and running tests in parallel. It is very successful, but only on Linux.
On my Mac the tests still run very slowly. It seems like the different processes are not really running in parallel.
test times:
Mac OS X:
- running 9 tests in parallel: 37s
- running 9 tests not in parallel: 41s
Linux:
- running 9 tests in parallel: 16s
- running 9 tests not in parallel: 25s
Mac OS X (early 2011):
10.9.2
16 GB RAM
Core i7 2.2 GHz
physical processors: 1
cores: 4
threads: 8
Linux (Dell):
Ubuntu
8 GB RAM
Core i5-2520M 2.5 GHz
physical processors: 1
cores: 2
threads: 4
My questions are:
Are there any tips for improving the process performance on Mac OS X (other than ulimit and launchctl (maxfiles))?
Why do the tests run much faster on Linux?
thanks,
kate
I copied my comment here, as it describes most of the answer.
Given the specs of your machines, I doubt the extra 8 GB of RAM and processing power affect things much, especially given Node's single-process model and that you're only launching 4 processes. I doubt that 8 GB, 2.5 GHz and 4 threads are a bottleneck at all for the Linux machine. As such, I would actually expect the time the processor spends running your tests to be roughly equivalent for both machines. I'd be more interested in your disk I/O performance, given you're running Mongo. Your disk I/O has the most potential to slow stuff down. What are the specs there?
Your specs:
Mac OS X: Toshiba 5400 RPM, 8 MB cache
Linux: Seagate 7200 RPM, 16 MB cache
Your Linux drive is significantly faster (about 1.33x) than your Mac drive, as well as having a significantly larger cache. For database-backed applications, hard drive performance is crucial. Most of the time spent in your application will be waiting for I/O, particularly with Node's single-process way of doing work. I would suggest this as the culprit for 90% of the performance difference, and chalk the rest up to the fact that Linux probably has less going on in the background, further exacerbating your Mac disk drive's performance issues.
Furthermore, launching multiple Node processes isn't likely to help this. Since processor time isn't your bottleneck, launching too many processes will just slow your disk down. Another indication that this is the problem is that the performance of multiple processes on Linux improves proportionally more than the performance of multiple processes on Mac. One process is nearly maxing out the performance of your 5400 RPM drive, so you don't see a significant performance increase from running multiple processes, whereas the multiple Linux Node processes use the disk to its full potential. You would likely see diminishing returns on Linux too if you were to launch many more processes, unless of course you were to upgrade to an SSD.

Matlabpool number of threads vs core

I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website for the above processor (located here), this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer, thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, so with one Matlab thread (i.e. one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a Core i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors, such as what other programs you might be running alongside Matlab.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores. I can run it with 1, 2, 3, etc. threads and get linear speedup, up to about 5.5x speed on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high-end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads.
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, or 16 threads all top out at a net 1.9x! I can see all the threads are running (not stalled or sleeping), they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
- My program does not access the disk or network. It's CPU limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads; 6 threads is effectively a 6x speedup.
- My program never runs faster than a 2x speedup on this 16-core Sunfire Xeon box, for any number of threads from 2-16.
- Running 16 copies of my program single-threaded works perfectly, all 16 running at once at full speed.
- top shows 1600% of CPUs allocated. /proc/cpuinfo shows all 16 cores running at the full 2.9GHz speed (not the low-frequency idle speed of 1.6GHz).
- There's 48GB of RAM free, and it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels across both tests.
It seems unlikely that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine? (And how does that answer compare to the old machine?) You're not, by chance, running in a virtualized environment?
Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down your possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user account you're running in has some RH-default or admin-set resource limits in place.
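A quick way to check, assuming a typical Red Hat-style setup:
$ ulimit -a                        # limits of the current shell/user session
$ cat /proc/<pid>/limits           # limits of an already-running process (substitute its PID)
$ cat /etc/security/limits.conf    # admin-set per-user defaults
These limits do not normally throttle scheduling in the way described, but they will expose any unusual admin-set restrictions.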
When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if the scaling up to 4 threads improves when the process affinity is set to lock it to a single physical CPU. Even better, though, would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html
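For example, a sketch of that affinity experiment on Linux (the core numbering is hypothetical; first check with lscpu or lstopo which core IDs share a physical package, and ./my_program stands in for the OP's binary):
$ taskset -c 0-3 ./my_program                        # confine all threads to four cores on one package
$ numactl --cpunodebind=0 --membind=0 ./my_program   # or pin to NUMA node 0 and its local memory
If the 4-thread case scales well inside one package but stalls as soon as threads land on different packages, cross-socket traffic from locks or shared data is the likely culprit.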
