Reserve a processor for only one process (which already has the max priority) - Linux

I have used this piece of code to try to pin the program to one core and give it the same high priority every time it executes:
cpu_set_t cmask;
CPU_ZERO(&cmask);                      /* clear the mask before setting a bit */
CPU_SET(CPU_NUM, &cmask);
/* pthread_setaffinity_np() returns 0 on success or a positive error number */
if (pthread_setaffinity_np(pid, sizeof(cmask), &cmask) != 0) {
    LOG_ERROR("Could not set cpu affinity to core %d", CPU_NUM);
    goto exit_err;
}
errno = 0;
setpriority(PRIO_PROCESS, 0, -19);     /* nice -19: close to the highest non-realtime priority */
The purpose of the program is to perform a computation on each fixed-size chunk (80 bytes) of input.
But when executing the program, the time elapsed for this computation varies from 30% to 150%.
When plotting the computation time values, I was expecting a fairly smooth graph where the deviation would be something like 10%-15%, but instead it is more than 40%!
So I would like to ask whether the CPU is interleaving the execution of my program with other programs, and if so, whether I can force the CPU to run ONLY a specific program.
Thanks in advance !
P.S. I haven't found a thread that answers my question yet...
The most relevant one is :) :
Linux reserve a processor for a group of processes (dynamically)

To try and reduce jitter some of the things you can do are:
Ensure you've turned off CPU frequency scaling.
Set the scheduling policy to SCHED_FIFO for that program (a minimal sketch follows this list).
Try and pin your process to a single processor if you have more than one.
Try and run as few other processes at the same time while you're measuring your program.
Don't trigger sources of time related non-determinism (e.g. disk I/O).
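A minimal sketch of the SCHED_FIFO and pinning points (untested; pin_and_make_fifo, the core_id parameter and the priority value 50 are illustrative choices, and SCHED_FIFO requires root or CAP_SYS_NICE):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to one core and switch it to SCHED_FIFO. */
int pin_and_make_fifo(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling process */
        perror("sched_setaffinity");
        return -1;
    }

    struct sched_param sp = { .sched_priority = 50 };      /* SCHED_FIFO priorities run 1..99 */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}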
It is probably useful to skim through How to build a Linux RT application, because accurate measurement lives in the same domain - it's possible to be more extreme though:
Ensure your program doesn't use dynamic memory allocations.
Use a realtime Linux kernel.
Prevent Linux from scheduling unrelated userspace programs on a given CPU (e.g. with the isolcpus boot parameter or cpusets; see the example after this list).
Even disable timer ticks on a given CPU (CONFIG_TASK_ISOLATION).
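As a hedged illustration of the last two points (exact behaviour depends on your kernel configuration, and core 3 is just a placeholder), booting with kernel parameters such as
isolcpus=3 nohz_full=3 rcu_nocbs=3
keeps the general scheduler and most timer ticks off core 3; you then place only your own program there with taskset or sched_setaffinity().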
Modern desktop/server processors are so complicated that trying to precisely measure a single program's execution time with low variance is extremely hard. Things like the various caches and pipeline starting states can perturb execution times in any number of ways so there are always going to be limits.

Related

CPU percentage and heavy multi-threading

I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing their (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each; it's an i7-3612QE), only 8 threads can really run at a time, so many threads will have to wait. Also some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: If I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses fewer CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: the process is memory bound. Hence the limiting resource is not the CPU but the memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly has the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just rather slowly. All my other observations are consistent with this.
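A minimal sketch that reproduces the effect (my own illustration, not the original application; NTHREADS and the array size are arbitrary; compile with gcc -O2 -pthread): several threads stream over an array far larger than the caches, so they are limited by memory bandwidth rather than by the CPU.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define N (64UL * 1024 * 1024)        /* 64M doubles = 512 MB, far larger than any cache */

static double *data;

/* Each thread repeatedly streams through the whole array: almost pure memory traffic. */
static void *stream_sum(void *arg)
{
    (void)arg;
    double s = 0.0;
    for (int pass = 0; pass < 10; pass++)
        for (size_t i = 0; i < N; i++)
            s += data[i];
    printf("sum = %f\n", s);          /* keep the result live so the loop isn't optimized away */
    return NULL;
}

int main(void)
{
    data = malloc(N * sizeof(double));
    if (!data)
        return 1;
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, stream_sum, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    free(data);
    return 0;
}
Run it once unpinned and once under taskset -c 0: top should report a far higher CPU percentage in the unpinned case, while the wall-clock time shrinks much less than that percentage suggests, because the extra cores spend most of their time waiting on RAM.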

How to start two CPU cores to run instructions at the same time?

For example, in X86, 2 CPU cores are running different software threads.
At a moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?
Use a shared variable to communicate a rdtsc based deadline between the two threads. E.g., set a deadline of say the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the current rdtsc value and the deadline is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions executed is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
Some notes:
The above assumes the CPU frequency is equal to the nominal rdtsc frequency - if that's not the case you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Organize for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread to write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation and hence the subsequent TSX abort at nearly the same time. Call the code you want to run "in sync" from the abort handler.
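A rough sketch of that TSX idea (assumes RTM support and gcc/clang with -mrtm; flag, run_payload and wait_for_release are hypothetical names, and a real transaction can also abort for unrelated reasons such as interrupts or capacity):
#include <immintrin.h>
#include <stdio.h>

volatile char flag;                      /* the shared line the third (writer) thread stores to */

static void run_payload(void)            /* stand-in for the code both waiters should start together */
{
    puts("released");
}

void wait_for_release(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        char c = flag;                   /* pull the shared line into the transaction's read set */
        (void)c;
        for (;;)                         /* spin; the writer's store to 'flag' aborts us */
            ;
        /* _xend() is never reached */
    } else {
        /* Abort handler: both waiting threads should land here at nearly the same time. */
        run_payload();
    }
}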
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1 because that's when an instruction like rdtsc samples the timestamp counter. Thus it's the one whose timing you can actually record and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these is available: umonitor / umwait. But in the kernel you can probably just use monitor/mwait.)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has a throughput of one per 25 cycles on Skylake (https://agner.org/optimize/), so you'd expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost for both threads is the same. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
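For instance, a hedged C version of that deadline spin (spin_until_tsc is my own name for it; __rdtsc() is the compiler intrinsic from x86intrin.h):
#include <x86intrin.h>               /* __rdtsc() */

/* Busy-wait until the TSC passes 'deadline'. Both threads call this with the
 * same shared deadline; each leaves within roughly one rdtsc read (~25 cycles
 * on Skylake) of the target, per the throughput figures above. */
static inline void spin_until_tsc(unsigned long long deadline)
{
    while (__rdtsc() < deadline)
        ;                            /* rdtsc read throughput bounds the exit error */
}
One thread can publish the deadline as, say, __rdtsc() + 10000 in a shared variable before both call spin_until_tsc(); the add-slide compensation from the first answer is easier to bolt on in assembly, where the exact instruction count is under your control.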
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.

Thread Quantum: How to compute it

I have been reading a few posts and articles regarding thread quanta (here, here and here). Apparently Windows allocates a fixed number of CPU ticks for a thread quantum, depending on the Windows "mode" (server, or something else). However from the last link we can read:
(A thread quantum) between 10-200 clock ticks (i.e. 10-200 ms) under Linux, though some
granularity is introduced in the calculation
Is there any way to compute the quantum length on Linux?
Does that make any sense to compute it anyway? (since from my understanding threads can still be pre-empted, nothing forces a thread to run during the full duration of the quantum)
From a developer's perspective, I could see the interest in writing a program that could predict the running time of a program given its number of threads, and "what they do" (possibly removing all the testing to find the optimal number of threads would be kind of neat, although I am not sure it is the right approach)
On Linux, the default realtime quantum length is given by the constant RR_TIMESLICE, at least in 4.x kernels; it is expressed in terms of HZ, which is fixed when the kernel is configured.
The interval between pausing the thread whose quantum has expired and resuming it may depend on a lot of things like, say, load average.
To be able to predict the running time at least with some degree of accuracy, give the target process realtime priority; realtime processes are scheduled following a round-robin algorithm, which is generally simpler and more predictable than the common Linux scheduling algo.
To get the realtime quantum length, call sched_rr_get_interval().
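A minimal sketch of that call (the returned timeslice is only meaningful for a process actually scheduled under SCHED_RR):
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    if (sched_rr_get_interval(0, &ts) != 0) {   /* 0 = the calling process */
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("SCHED_RR quantum: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}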

How can I utilize a multithreaded CPU the most in Matlab?

I just bought the Matlab Parallel Computing toolbox.
The command matlabpool open opens as many parallel workers as there are cores in my CPU.
But each of my CPU cores has two threads. According to Windows Task Manager, each worker can only use half the performance of one CPU core, which seems to mean one worker = one thread = "half a core".
Therefore, even after all workers are opened, only half of the total CPU power is utilized.
Is there any other command that could help with that?
By default, the local cluster type for matlabpool considers only "real" cores when choosing the default number of workers to launch. This is because for MATLAB workloads, hyperthreading often does not provide much benefit. However, this value is only a default - you can edit the cluster type and run anything up to 12 local workers.
You need to understand HyperThreading to answer this question.
Matlab launches a worker thread for every CPU. Suppose you now use a directive like parfor to distribute computation over multiple threads. Every thread will now be crunching numbers happily.
Suppose you are doing a sum of a large vector of numbers. What actually happens is the following:
sum = sum + a[0]
(array a is not in my CPU cache yet; I will fetch a small part of a from main memory and put it in the CPU cache)
sum = sum + a[1]
sum = sum + a[2]
...
During the fetch of a, the CPU stalls, waiting for the system memory. This is called a pipeline bubble, and it is not good for performance. Sometimes, a part of the array a was swapped out to the hard drive. The operating system will need to access the drive to put that part into main memory, after which it will be transferred to the CPU cache. When this happens, your operating system will not let the CPU wait for +200 ms. It will use that time to execute another task instead (like the backup running on your system, or refreshing your screen, or ...).
Switching tasks on a CPU results in a performance penalty. To switch to a different task, the operating system must save the CPU registers in main memory, and load the CPU registers of the other task back into the CPU first. This takes time.
With HyperThreading, the number of registers per CPU is doubled. This means that two processes can 'occupy' the CPU. Only one can be executed, but during a stall, the operating system can switch to the second process without any performance penalty.
Forget how Microsoft Windows reports CPU usage. It's wrong. CPU usage is a lot more complicated than only a simple 47%. The real question is rather: should matlab register two threads per core, or only one?
Arguments pro:
During a stall, the CPU can quickly switch to the other thread and continue executing.
Arguments contra:
There are more threads, and the problem is divided in smaller pieces. This may actually reduce performance, as you need to put more pieces together to get the final result.
A context switch will still 'poison' the L1 and L2 cache, loading in pieces of memory that are of no use to the other thread on the CPU.
If there are no stalls, you have more overhead.
On a desktop, the operating system will also want to run: redrawing the screen, moving your mouse, etc. When all logical CPUs are in use, the operating system is required to do an actual (costly) context switch.
Your problem will only be complete if all pieces of the problem have been calculated. Using all the cores / threads increases the risk of one thread taking more time.
My guess is that the Matlab developers considered the arguments contra to be more important than the arguments pro. My own performance tests certainly suggest that there is little performance gain from HyperThreading for cpu-intensive calculations.

Linux 2.6.31 Scheduler and Multithreaded Jobs

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems like when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set for high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, using Linux notation) are idle. Can anyone explain this behavior and what I can do about it to get all of the 23 cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Is it at all possible that the fact that my 24-core jobs are 32-bit and the other jobs I'm competing w/ are 64-bit is relevant?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.
It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here's a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two less threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).
Do your threads communicate with each other?
Try to manually bind every thread to a CPU, with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with a lot of related threads.
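A minimal sketch of the binding (pin_to_core is just an illustrative helper; it assumes core numbers 0..N-1 exist on the machine):
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Bind an already-created thread to a single core; returns 0 on success. */
int pin_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}
Calling pin_to_core(threads[i], i) right after each pthread_create() gives every worker its own core, which is essentially the experiment suggested in the next answer.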
It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>

#define NUM_CORES 24

/* Busy-loop forever so each thread fully occupies one core. */
void *loop_forever(void *argument)
{
    volatile int a = 0;   /* volatile keeps the loop from being optimized away */
    while (1)
        a++;
    return NULL;
}

int main(void)
{
    int i;
    pthread_t threads[NUM_CORES];
    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], NULL, loop_forever, NULL);
    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
