I have a dual core board running Linux in which I installed PJSIP (VoIP software). I want to add an echo/noise canceler algorithm but I don't want it to work on the same core as PJSIP.
How can I split the use of the cores between the two applications?
It is called CPU affinity. You can set it from the command line using taskset(1) or from your application using sched_setaffinity(2) sched_getaffinity(2).
The term you are looking for is affinity. http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html or http://www.glennklockwood.com/hpc-howtos/process-affinity.html. That being said, if you are using a "slow" cpu, you are probably going to look at a real time scheduler (SCHED_FIFO, SCHED_RR or SCHED_DEADLINE) and if you are using a "fast" cpu you probably don't need to worry about affinity. The probability of you being in the "middle" where affinity will matter but scheduler won't is pretty low.
Related
I have the latest coffeelake machine which is primarily used as a storage server. The average workload on each core (4 cores) is around 5-10% when running a storage server alone.
I want to run vtune measurements of a workload on this machine using Intel Sampling drivers. However, I'm doubtful whether or not the measurements will be accurate given the storage server application is concurrently running.
But as the intel's documents suggest, the sampling drivers get installed on the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the intel sampling drivers work? Are they able to distinguish between the workload process and other processes running on the system?
If VTune is like the Linux PAPI subsystem that perf uses, it basically saves/restores HW event counter registers on context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected. And effects on other events will be due to actual impacts, like extra cache misses.
(The basic mechanism for HW performance events are that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt for the driver to collect the count. Or for perf record type of functionality, perf or VTune would program them to count down so trigger an interrupt regularly, and sample the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data, not the cache miss load itself, for example. But the key point is that the inside-the-core events are totally per-core. The uncore / L3 cache events count stuff about shared resources like L3 cache, so are more easily disturbed by system load.)
Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)
So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might be a bit different from what would happen if it were run on a fully idle system, but probably not much different. Especially if it's single-threaded, or you can limit it to fewer than all of your cores, so there's at least one left for the OS to schedule other tasks onto.
I'm writing an application that needs to be executed on a specific core of a processor.
For Example:
If we have 4 cores and i want to execute code on 2nd core only. I need help how to do this.
I'm writing an application that needs to be executed on a specific core of a processor.
This is extremely improbable on most platforms (since most multi-core processors are homogeneous). You really need to explain, motivate and justify such an usual requirement.
You can't do that in general. And if you could do that, how exactly you should proceed is operating system specific, and platform specific. Most multi-core processors are homogeneous (all the cores are the same), some are not.
On Linux/x86-64, the kernel scheduler sees all cores as the same, and will move a task (e.g. a thread of a multi-threaded process) from one core to another at arbitrary moments. Since scheduling is preemptive.
On some processors, moving periodically (e.g dozen of times per second) a task from one core to another is actually recommended (and done automagically by the kernel, or the firmware - e.g. SMM) to avoid overheating of that core. Read about dark silicon.
Some unusual hardware (e.g. ARM big.LITTLE) have two sets of different cores (e.g. 2 high-end ARM cores with 2 low-end ones, all sharing the same memory). If your platform is such, please state that in your question, and ask how to achieve processor affinity on your specific platform. Very likely your OS has appropriate system calls for that purpose.
Some high-end motherboards are multi-sockets. In such case, a RAM module is closer to one socket (in timing) than to another. You then care about non-uniform memory access.
So read more about processor affinity and non-uniform memory access. Most OSes have some support for both. On Linux, see pthread_setaffinity_np(3), sched_setaffinity(2), numa(7) etc...
To learn more about OSes, read Operating Systems: Three Easy pieces.
Notice that by pinning some thread to a some fixed core, you might lower the performance of your program. Since processor affinity is rarely useful.
The programmer can prescribe his/her own affinities (hard affinities) but
Rule of thumb: use the default scheduler unless a good reason not to.
here is a C/C++ function to assign a thread to a certain core
Kernel scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid, unsigned int len, unsigned long * mask);
sets the current affinity mask of process 'pid' to *mask
'len' is the system word size: sizeof(unsigned int long)
To query affinity of a running process:
[~]$ taskset -p 3935
pid 3945's current affinity mask: f
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me
I want to run a number of benchmarks on a multi-core system running Linux. I want to reserve one of the cores for my benchmarks. I know that I can use sched_setaffinity to limit my benchmarks to that core. How can I keep all other processes off my core? In other words, how can I set the default affinity of all processes to not include my core?
Even if you keep all the other processes off your "reserved for benchmarking" core, bear in mind that you can't stop them from consuming a variable and unpredictable proportion of the limited memory bandwidth to a multi-core chip, and that you can't stop them making variable demands on the shared L2 and L3 caches.
IMHO reproducible, scientific benchmarking needs a machine all to itself.
In Linux (our system is CentOS5), is it possible to allocation CPU resources to processes? For example, I have one tomcat application, I want all the processes and threads invoked by tomcat has p% of total CPU cycles no matter how many other applications are running. And I want to tune the p% dynamically (e.g., at this time slot, tomcat has 40% cpu cycles, and at the next time slot, it has 70% cpu cycles).
If the above is possible, is it possible to do it conservatively? I mean, even though tomcat has 40% cpu cycles, but if it's current workload only consumes 10%, other applications can use the remaining 30% CPU cycles.
Thank.
If you can use RHEL6/CentOS6 (or upgrade kernel), you can use cgroup to do what you want to do.
http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
https://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
Are you familiar with the tool nice and niceness levels?
Rather than trying to dictate exact percentages, you might want to check into niceness levels and how to set them in CentOS. Your applications will run as expected with the higher priority processes being able to claim more resources while the lower priority processes will not suffer from lack of resources even when a higher priority process is idle.
If you really really wanted to do this (and esnyder's suggestion of prioritising using nice levels is almost certainly a better solution for whatever you're really trying to achieve) AND you're happy to do it at the granularity of 1/number-of-CPUs (e.g on an 8 core system, specify utilisation as a multiple of 12.5% of total CPU resource) then you could use sched_setaffinity to set the CPU affinity mask for the process you want to control (you can do this from another process). (Actually, I think you'd need to identify all that process' threads and invoke it on each of them).
Alternatively, cpusets might be of interest but I've no idea what it takes to enable them or how dynamic they can be.