I want to run a number of benchmarks on a multi-core system running Linux. I want to reserve one of the cores for my benchmarks. I know that I can use sched_setaffinity to limit my benchmarks to that core. How can I keep all other processes off my core? In other words, how can I set the default affinity of all processes to not include my core?
Even if you keep all the other processes off your "reserved for benchmarking" core, bear in mind that you can't stop them from consuming a variable and unpredictable proportion of the limited memory bandwidth to a multi-core chip, and that you can't stop them making variable demands on the shared L2 and L3 caches.
IMHO reproducible, scientific benchmarking needs a machine all to itself.
In a 32-core system, a process (A) fully consumes 4 cores (400% CPU usage in top); the rest of the cores are available. Does it impact the performance of another process (B)? Would process B run better if process A were not running, and if so, why?
Process B uses Boost and multiple threads (say 24).
I was expecting the performance of process B not to be impacted by process A, since there are 32 cores.
In general, yes: running a process can slow down others even when not all cores are busy. In practice, the impact depends strongly on the code being executed.
This can happen because some hardware resources are shared. The most common ones are storage devices, the network, the RAM, and the last-level cache (typically the L3). For example, a few cores are generally enough to saturate the RAM bandwidth, so if the two processes are memory-bound, using more than 8 cores is generally not significantly faster. HDD storage devices tend not to be faster when accessed in parallel, so when 2 processes try to use one massively at the same time, both are often significantly slower. In practice they can be more than 2 times slower, because HDDs have a high seek time, and a process doing many random accesses can drastically slow down a process reading or writing large contiguous files.
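To make the bandwidth point concrete, here is a minimal, hedged sketch (thread count and buffer sizes are arbitrary choices of mine): each thread streams through its own private buffer, so the code itself shares nothing, yet the aggregate throughput typically stops improving after a few threads because the memory bus saturates. Compile with something like gcc -O2 -pthread and vary NTHREADS.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NTHREADS 4                      /* vary: 1, 2, 4, 8, ... */
    #define N (64L * 1024 * 1024)           /* doubles per thread (512 MiB) */

    static void *stream_sum(void *arg)
    {
        double *buf = arg, sum = 0.0;
        for (long i = 0; i < N; i++)
            sum += buf[i];                  /* sequential reads: memory-bound */
        return sum > 0 ? arg : NULL;        /* defeat dead-code elimination */
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double *buf[NTHREADS];
        struct timespec t0, t1;

        for (int i = 0; i < NTHREADS; i++) {
            buf[i] = malloc(N * sizeof(double));
            for (long j = 0; j < N; j++)
                buf[i][j] = 1.0;            /* touch every page up front */
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, stream_sum, buf[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double gib = (double)NTHREADS * N * sizeof(double) / (1L << 30);
        printf("%d threads: %.1f GiB in %.2f s = %.2f GiB/s\n",
               NTHREADS, gib, s, gib / s);
        return 0;
    }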
On NUMA systems, things can be more complex: 2 processes operating on the same NUMA node can be slower than 2 processes running on different NUMA nodes, due to saturation of the RAM of the target node and the NUMA allocation policy. In some rare cases, 2 processes running on different NUMA nodes can be slower than on the same node. This is true if the processes communicate with each other (due to the higher latency between cores belonging to different NUMA nodes), or if they communicate with hardware resources bound to specific NUMA nodes other than the ones where the processes are running (e.g. a GPU with a high-performance interconnect, a high-performance InfiniBand device, etc.).
Note that some software resources can also be shared. The operating system can lock them, either to ease the maintenance of parts of its code or because the resource fundamentally cannot be used in parallel in a way that scales. Historically, some OSes used a giant lock that prevented nearly all system calls from scaling. Such locks have progressively been replaced with finer-grained locks, or with no locks at all (e.g. atomics), as multi-core processors became ubiquitous. Note that even atomic data structures do not scale very well on most processors, so system calls operating on the same data structure tend to impact other running processes on many-core systems. Still, the biggest issue is generally the saturation of shared hardware resources.
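The "atomics do not scale" remark is easy to reproduce. A minimal sketch (thread and iteration counts are arbitrary): the same increments are done first against one shared atomic counter, then against padded per-thread counters. The shared version typically gets slower as threads are added, because every core must bounce the same cache line around.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 8
    #define ITERS    10000000L

    static atomic_long shared;                      /* every thread hits this */
    static struct { volatile long n; char pad[56]; } priv[NTHREADS]; /* padded */

    static void *bump_shared(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared, 1);
        return NULL;
    }

    static void *bump_private(void *arg)
    {
        volatile long *n = arg;
        for (long i = 0; i < ITERS; i++)
            (*n)++;
        return NULL;
    }

    static double run(void *(*fn)(void *), int use_priv)
    {
        pthread_t tid[NTHREADS];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, fn,
                           use_priv ? (void *)&priv[i].n : NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("shared atomic:        %.2f s\n", run(bump_shared, 0));
        printf("private per-thread:   %.2f s\n", run(bump_private, 1));
        return 0;
    }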
I'm writing an application that needs to be executed on a specific core of a processor.
For example:
If we have 4 cores, I want to execute my code on the 2nd core only. I need help with how to do this.
"I'm writing an application that needs to be executed on a specific core of a processor."
This is extremely improbable on most platforms (since most multi-core processors are homogeneous). You really need to explain, motivate, and justify such an unusual requirement.
You can't do that in general. And even if you could, how exactly to proceed is operating-system specific and platform specific. Most multi-core processors are homogeneous (all the cores are the same); some are not.
On Linux/x86-64, the kernel scheduler sees all cores as the same and will move a task (e.g. a thread of a multi-threaded process) from one core to another at arbitrary moments, since scheduling is preemptive.
On some processors, periodically moving a task from one core to another (e.g. dozens of times per second) is actually recommended (and done automagically by the kernel, or by the firmware, e.g. SMM) to avoid overheating of that core. Read about dark silicon.
Some unusual hardware (e.g. ARM big.LITTLE) has two sets of different cores (e.g. 2 high-end ARM cores together with 2 low-end ones, all sharing the same memory). If your platform is like that, please state so in your question, and ask how to achieve processor affinity on your specific platform. Very likely your OS has appropriate system calls for that purpose.
Some high-end motherboards are multi-socket. In such a case, a RAM module is closer (in timing) to one socket than to another. You then care about non-uniform memory access.
So read more about processor affinity and non-uniform memory access. Most OSes have some support for both. On Linux, see pthread_setaffinity_np(3), sched_setaffinity(2), numa(7) etc...
To learn more about OSes, read Operating Systems: Three Easy Pieces.
Notice that by pinning a thread to some fixed core, you might lower the performance of your program; processor affinity is rarely useful.
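That said, if you do have a good reason, here is a minimal sketch of one of the calls mentioned above, pthread_setaffinity_np (the core number 1, i.e. the "2nd core" counting from 0, is just an example; compile with -pthread):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);                 /* allow only core 1 */

        /* Pin the calling thread; returns an errno value on failure. */
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0)
            fprintf(stderr, "pthread_setaffinity_np: %d\n", rc);

        printf("now running on core %d\n", sched_getcpu());
        return 0;
    }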
The programmer can prescribe his/her own affinities (hard affinities), but the rule of thumb is: use the default scheduler unless you have a good reason not to.
Here is the C/C++ kernel scheduler API used to assign a thread to a certain core.
Kernel scheduler API

    #include <sched.h>

    /* Sets the CPU affinity mask of the process or thread 'pid' to *mask;
       'cpusetsize' is the size of that mask in bytes, i.e. sizeof(cpu_set_t). */
    int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
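Building on that call, a minimal sketch of such a function (pid 0 means "the calling thread"; error handling is left to the caller):

    #define _GNU_SOURCE
    #include <sched.h>

    /* Assign the calling process/thread to a single core.
       Returns 0 on success, -1 on error with errno set. */
    int assign_to_core(int core)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);              /* start from an empty CPU set */
        CPU_SET(core, &mask);         /* allow exactly one core */
        return sched_setaffinity(0, sizeof(mask), &mask);
    }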
To query the affinity of a running process:

    [~]$ taskset -p 3935
    pid 3935's current affinity mask: f
I have a dual core board running Linux in which I installed PJSIP (VoIP software). I want to add an echo/noise canceler algorithm but I don't want it to work on the same core as PJSIP.
How can I split the use of the cores between the two applications?
It is called CPU affinity. You can set it from the command line using taskset(1), or from within your application using sched_setaffinity(2) and sched_getaffinity(2).
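If you would rather wire the split into code than into a startup script, here is a hedged sketch (the two program names are placeholders, and the core numbers assume your dual-core board): fork once per application and pin each child to its own core before exec.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Fork, pin the child to 'core', then exec 'argv'. */
    static pid_t spawn_on_core(int core, char *const argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {                     /* child */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");
            execvp(argv[0], argv);
            perror("execvp");
            _exit(127);
        }
        return pid;
    }

    int main(void)
    {
        /* Placeholder command lines; substitute the real binaries. */
        char *voip[]     = { "pjsip_app", NULL };
        char *canceler[] = { "echo_canceler", NULL };

        spawn_on_core(0, voip);      /* core 0: PJSIP */
        spawn_on_core(1, canceler);  /* core 1: echo/noise canceler */

        while (wait(NULL) > 0) {}    /* reap both children */
        return 0;
    }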
The term you are looking for is affinity: see http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html or http://www.glennklockwood.com/hpc-howtos/process-affinity.html. That being said, if you are using a "slow" CPU you are probably going to want a real-time scheduler (SCHED_FIFO, SCHED_RR or SCHED_DEADLINE), and if you are using a "fast" CPU you probably don't need to worry about affinity. The probability of being in the "middle", where affinity matters but the scheduler doesn't, is pretty low.
In Linux (our system is CentOS 5), is it possible to allocate CPU resources to processes? For example, I have one Tomcat application, and I want all the processes and threads invoked by Tomcat to get p% of the total CPU cycles no matter how many other applications are running. And I want to tune the p% dynamically (e.g., in this time slot Tomcat gets 40% of the CPU cycles, and in the next one it gets 70%).
If the above is possible, is it possible to do it conservatively? I mean that even though Tomcat is entitled to 40% of the CPU cycles, if its current workload only consumes 10%, other applications can use the remaining 30%.
Thanks.
If you can use RHEL6/CentOS6 (or upgrade the kernel), you can use cgroups to do what you want to do:
http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
https://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
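For the conservative behaviour you describe, the cpu.shares knob is the relevant one: it is a relative weight that only bites under contention, so an idle group's cycles remain available to others. A minimal sketch, assuming a group named "tomcat" has already been created under the cpu controller (mounted at /cgroup/cpu, the RHEL6 default; the group name and mount point are my assumptions) and Tomcat's PIDs have been added to its tasks file:

    #include <stdio.h>

    /* Rewrite a cgroup control file; returns 0 on success. The path
       assumes the RHEL6 default mount and a pre-created 'tomcat' group. */
    static int cg_set(const char *file, long value)
    {
        char path[256];
        snprintf(path, sizeof(path), "/cgroup/cpu/tomcat/%s", file);
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fprintf(f, "%ld\n", value);
        return fclose(f);
    }

    int main(void)
    {
        /* Against a sibling group left at the default weight of 1024,
           1024 shares is roughly a 50% entitlement under contention;
           call cg_set again at any time to retune it dynamically. */
        if (cg_set("cpu.shares", 1024) != 0)
            perror("cpu.shares");
        return 0;
    }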
Are you familiar with the tool nice and niceness levels?
Rather than trying to dictate exact percentages, you might want to look into niceness levels and how to set them on CentOS. Your applications will run as expected: higher-priority processes can claim more resources under contention, while lower-priority processes still get the CPU whenever the higher-priority ones are idle.
If you really, really wanted to do this (and esnyder's suggestion of prioritising with nice levels is almost certainly a better solution for whatever you're actually trying to achieve), AND you're happy to work at a granularity of 1/number-of-CPUs (e.g. on an 8-core system, specifying utilisation as a multiple of 12.5% of the total CPU resource), then you could use sched_setaffinity to set the CPU affinity mask for the process you want to control (you can do this from another process). Actually, I think you'd need to identify all of that process's threads and invoke it on each of them, as in the sketch below.
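A minimal sketch of that approach (run it as the same user as the target, or as root; the 1/N granularity is inherent): walk /proc/<pid>/task and apply the mask to every thread.

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Restrict every thread of 'pid' to cores 0..ncpus-1. On an 8-core
       system, ncpus = 2 caps the process near 25% of the machine. */
    static int limit_process(pid_t pid, int ncpus)
    {
        char dir[64];
        snprintf(dir, sizeof(dir), "/proc/%d/task", (int)pid);

        DIR *d = opendir(dir);
        if (d == NULL)
            return -1;

        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = 0; c < ncpus; c++)
            CPU_SET(c, &set);

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.')
                continue;                    /* skip "." and ".." */
            /* Affinity is per-thread: apply the mask to each TID. */
            if (sched_setaffinity((pid_t)atoi(e->d_name), sizeof(set), &set) != 0)
                perror("sched_setaffinity");
        }
        closedir(d);
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <ncpus>\n", argv[0]);
            return 1;
        }
        return limit_process((pid_t)atoi(argv[1]), atoi(argv[2])) != 0;
    }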
Alternatively, cpusets might be of interest but I've no idea what it takes to enable them or how dynamic they can be.
I have my own multithreaded C program which scales smoothly with the number of CPU cores: I can run it with 1, 2, 3, etc. threads and get nearly linear speedup, up to about 5.5x on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high-end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads.
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, and they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, and 16 threads all run at just a net 1.9x! I can see all the threads are running (not stalled or sleeping); they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
- My program does not access the disk or network. It's CPU-limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads; 6 threads is effectively a 6x speedup.
- My program never runs faster than a 2x speedup on this 16-core Sunfire Xeon box, for any number of threads from 2-16.
- Running 16 copies of my program single-threaded works perfectly: all 16 run at once at full speed.
- top shows 1600% of CPU allocated. /proc/cpuinfo shows all 16 cores running at the full 2.9GHz speed (not the low-frequency idle speed of 1.6GHz).
- There's 48GB of RAM free; the machine is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels in both tests.
It's improbable that the motherboard itself would peak at utilizing just 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine (and how does the answer compare to the old machine)? You're not, by chance, running in a virtualized environment?
Overall, your evidence points to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down the possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user account you're running in has some RH-default or admin-set resource limits in place.
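For instance, a quick sketch to dump the limits the process actually inherited (RLIMIT_CPU caps total CPU seconds, and RLIMIT_NPROC constrains processes/threads per user on Linux):

    #include <stdio.h>
    #include <sys/resource.h>

    static void show(const char *name, int which)
    {
        struct rlimit rl;
        if (getrlimit(which, &rl) == 0)
            printf("%-12s soft=%llu hard=%llu\n", name,
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
    }

    int main(void)
    {
        show("RLIMIT_CPU", RLIMIT_CPU);       /* CPU seconds per process */
        show("RLIMIT_NPROC", RLIMIT_NPROC);   /* processes/threads per user */
        show("RLIMIT_AS", RLIMIT_AS);         /* address-space size */
        return 0;
    }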
When you see this kind of odd scaling behaviour, especially when problems appear with multiple threads but not with multiple processes, one thing to start looking at is the impact of lock contention and other synchronisation primitives, which can force threads running on different processors to wait for each other, potentially forcing multiple cores to flush their caches to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single-CPU case likely doesn't need to hit main memory for locking operations at all; everything is probably handled at the L3 cache level, letting the CPU get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may no longer have access to the hardware), one way to check this would be to see whether the scaling up to 4 threads improves when process affinity is set to lock it to a single physical CPU. Even better would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html
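A cheap way to see the cache-coherency effect described above is a false-sharing microbenchmark (a generic sketch, not the OP's code): threads writing to counters that share a cache line scale badly, and the penalty is typically far worse when the threads sit on different sockets. With the padding in place each counter owns its 64-byte line and the problem disappears.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 4
    #define ITERS    100000000L

    /* With the padding, each counter owns its 64-byte cache line. Delete
       'pad' and the counters share lines, so every write by one thread
       invalidates the line in the other cores' caches (false sharing). */
    struct counter { volatile long n; char pad[56]; };
    static struct counter counters[NTHREADS];

    static void *worker(void *arg)
    {
        struct counter *c = arg;
        for (long i = 0; i < ITERS; i++)
            c->n++;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, &counters[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.2f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }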