How can OS actually measure the CPU power? - linux

Currently I think that processor only has two states: run and not run. If it's running, it will use its full power to process a task. If there are multiple processes, processes will be shared by a portion of CPU.
How can the computing power can be divided into "portions"? So, suppose a CPU has 1 million transistors, only half of the transistors are used if the CPU is only at 50%?
Or is this related to allocated processing time for each process? i.e. assume "100%" means a process seizes a CPU for 200 milliseconds, if a process with a default nice value (priority value) 0, which means the process will receive 50% computing power or, in other word, 100 milliseconds. What is the correct idea?

Let me explain this on the example of Intel x86 CPUs and Windows NT (and its derivatives). One of the built-in system processes on these OSes is the System Idle Process. This process represents how much CPU time is utilized by the operating system's "idle loop". That idle loop does nothing else but executes the HLT instruction of the CPU. That instruction, in turn, commands the CPU to do nothing until the next interrupt arrives.
Therefore, if the scheduler decides that there are no processes that require CPU time at the given moment, it is given to the System Idle Process. If, say, 99% of the time in the last n seconds was spent by "executing" that process, it means that the CPU was really utilized only in 1% in these n seconds.
I believe it is totally analogous with Linux, only that it doesn't have a separate process to model the "idleness" of the CPU.
On a side note : it is, of course, possible, to have a OS that doesn't execute the HLT instruction at all. That was the case with Windows 98 and earlier (including, obviously, MS-DOS), whose idle loop simply consisted of a jmp $. That caused the CPU to use much more power.

Related

Purpose of multiprocessors and multi-core processor

I do want to clarify things in my head and model concrete knowledge. dual-core with one processor system, only two threads within the one process can be executed concurrently by each core. Uni-core with two processor system, two different process can be executed by each CPU.
So can we say, each processor can execute processes concurrently. While multi-core processor execute threads within the process concurrently?
I think you have a fundamental misunderstanding of what a process and thread are and how they relate to the hardware itself.
A CPU core can only execute 1 machine level instruction per clock cycle (so essentially, just 1 assembly instruction). CPU's are typically measured by the number of clock cycles they go through in a second. So a 2.5 GHz core can execute 2.5 billion instructions per second.
The OS (the operating system, like Windows, Linux, macOS, Android, iOS, etc.) is responsible for launching programs and giving them access to the hardware resources. Each program can be considered a "process".
Each process can launch multiple threads.
To ensure that multiple processes can share the same hardware resources, the idea of pre-emptive computing came about over 40 years ago.
In a nut-shell, pre-emptive computing, or time-slicing, is a function of the OS. It basically gives a few milliseconds to each thread that is running, regardless of which process that thread is a part of, and keeps the "context" of each thread so that the state of each thread can be handled appropriately when it's time for that thread to run; that's also known as a context switch.
A dual, quad, or even 128 core CPU does not change that, nor will the amount of CPU's in the system (e.g. 4 CPU's each with 128 cores). Each core can only execute 1 instruction per clock cycle.
What changes is how many instructions can be run in true parallel. If my CPU has 16 cores, then that means it can execute 16 instructions per clock cycle, and thus run 16 separate threads of execution without any context switching being necessary (though it does still happen, but that's a different issue).
This doesn't cover hyper-threading, in which 1 core can execute 2 instructions per cycle, essentially doubling your CPU count, and doesn't cover the idea of cache-misses or other low-level ideas in which extra cycles could be spent on a thread, but it covers the general idea of CPU scheduling.

CPU percentage and heavy multi-threading

I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one special application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing there (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores each 2 threads, it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also some of these tasks have interdependencies, so they anyway have to wait, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: If I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks on the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that still all data is being processed. So the application is doing the same amount of work, but it seemingly uses less CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down, if I start other processes which take a lot of CPU, even if the do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application reaches again 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: The process is memory bound. Hence not the CPU is the limited resource but the memory bandwidth. Allowing such process to run on multiple CPU cores in parallel will mainly have the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, but just quite slowly. All my other observations go along with this.

How to start two CPU cores to run instructions at the same time?

For example, in X86, 2 CPU cores are running different software threads.
At a moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?
Use a shared variable to communicate a rdtsc based deadline between the two threads. E.g., set a deadline of say the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc waiting until the gap between the current rdtsc value and the threshold is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus last read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
Some notes:
The above assumes CPU frequency equal to the nominal rdtsc frequency - if that's not the case you'll have to apply a compensation factor based on the nominal to true frequency ration when calculating where to jump into the add slide.
Don't expect your CPUs to say synced for long: anything like an interrupt, a variable latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but no so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Organize for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread to write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation and hence the subsequent TSX abort at nearly the same time. Call the code you want to run "in sync" from the abort handler.
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1 because that's when an instruction like rdtsc samples the timestamp counter. This it's the one you can actually record the timing of and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these are available: umonitor / umwait. But in kernel you can probably just use monitor/mwait)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has one per 25 cycle throughput on Skylake (https://agner.org/optimize/) so you expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost for both threads is the same. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.

How can I utilize multithread CPU most in Matlab?

I just bought the Matlab Parallel Computing toolbox.
The command matlabpool open opens parallel workers with the number of the cores in my CPU.
But each of my CPU core has two threads. According to Windows Task Manager, each worker can only use half performance of one CPU core, which seems could be interpreted as one worker = one thread = "half core".
Therefore, after all workers opened, still half of the total power of CPU could be utilized.
Is there any other command could help with that?
By default, the local cluster type for matlabpool considers only "real" cores when choosing the default number of workers to launch. This is because for MATLAB workloads, hyperthreading often does not provide much benefit. However, this value is only a default - you can edit the cluster type and run anything up to 12 local workers.
You need to understand HyperThreading to answer this question.
Matlab launches a worker thread for every CPU. Suppose you now use a directive like parfor to distribute computation over multiple threads. Every thread will now be crunching numbers happily.
Suppose you are doing a sum of a large vector of numbers. What actually happens is the following:
sum = sum + a[0]
array a is not in my CPU cache yet
I will fetch a small part of a from main memory and put it in the CPU cache
sum = sum + a[1]
sum = sum + a[2]
...
During the fetch of a, the CPU stalls, waiting for the system memory. This is called a pipeline bubble, and it is not good for performance. Sometimes, a part of the array a was swapped out to the hard drive. The operating system will need to access the drive to put that part into main memory, after which it will be transferred to the CPU cache. When this happens, your operating system will not let the CPU wait for +200 ms. It will use that time to execute another task instead (like the backup running on your system, or refreshing your screen, or ...).
Switching tasks on a CPU results in a performance penalty. To switch to a different task, the operating system must save the CPU registers in main memory, and load the CPU registers of the other task back into the CPU first. This takes time.
With HyperThreading, the number of registers per CPU is doubled. This means that two processes can 'occupy' the CPU. Only one can be executed, but during a stall, the operating system can switch to the second process without any performance penalty.
Forget how Microsoft Windows reports CPU usage. It's wrong. CPU usage is a lot more complicated than only a simple 47%. The real question is rather: should matlab register two threads per core, or only one?
Arguments pro:
During a stall, the CPU can quickly switch to the other thread and continue executing.
Arguments contra:
There are more threads, and the problem is divided in smaller pieces. This may actually reduce performance, as you need to put more pieces together to get the final result.
A context switch will still 'poison' the L1 and L2 cache, loading in pieces of memory that are of no use to the other thread on the CPU.
If there are no stalls, you have more overhead.
On a desktop, the operating system will also want to run: redrawing the screen, moving your mouse, etc. When all logical cpu's are in use, the operating system is required to do an actual (costly) context switch.
Your problem will only be complete if all pieces of the problem have been calculated. Using all the cores / threads increases the risk of one thread taking more time.
My guess is that the Matlab developers considered the arguments contra to be more important than the arguments pro. My own performance tests certainly suggest that there is little performance gain from HyperThreading for cpu-intensive calculations.

Reserve a processor for only one process (with already the max priority)

I have used this piece of code for trying to set the -same- high priority while executing a program :
CPU_SET(CPU_NUM, &cmask);
if (pthread_setaffinity_np(pid, sizeof(cmask), &cmask) < 0) {
LOG_ERROR("Could not set cpu affinity to core %d", CPU_NUM); goto exit_err;
}
errno = 0;
setpriority(PRIO_PROCESS, 0, -19);
The purpose of the program is to perform a computation for a constant bunch (every 80 bytes) of input.
But when executing the program, the time elapsed for this computation varies from 30% to 150%.
When plotting the computation time values, I was waiting for a -quite- smooth graph were the deviation would be something like 10%-15%, but instead there is more than 40% !!!
So I would like to ask, if the CPU is interfering the execution of the program with an other, and if so could I force the CPU to run ONLY a specific program?
Thanks in advance !
P.S. I haven't found a thread that could answer to my question yet...
The most relevant is :) :
Linux reserve a processor for a group of processes (dynamically)
To try and reduce jitter some of the things you can do are:
Ensure sure you've turned off CPU scaling.
Set scheduling policy to SCHED_FIFO for that program.
Try and pin your process to a single processor if you have more than one.
Try and run as few other processes at the same time while you're measuring your program.
Don't trigger sources of time related non-determinism (e.g. disk I/O).
It is probably useful to skim through How to build a Linux RT application because accurate measurement is the same domain - it's possible to be more extreme though:
Ensure your program doesn't use dynamic memory allocations.
Use a realtime Linux kernel.
Prevent Linux from scheduling non-specific userspace programs on a given CPU.
Even disable timer ticks on a given CPU (CONFIG_TASK_ISOLATION).
Modern desktop/server processors are so complicated that trying to precisely measure a single program's execution time with low variance is extremely hard. Things like the various caches and pipeline starting states can perturb execution times in any number of ways so there are always going to be limits.

Resources