How can I measure the queuing time of a process (CPU intensive) before it gets executed? - linux

I am trying to run some experiments in which I need to run benchmarks under heavy load. Starting with CPU load, I schedule a sysbench daemon that generates 1000 primes. I set its priority to low so that it only runs when the CPU is not busy with other tasks, which reduces its impact on the regular workload. Because the priority is low, the process keeps waiting in the queue until it finds a free CPU core to run on. The problem is that the reported execution time includes the wait period in the queue, which renders the result invalid.
Is there some way to calculate the wait period and subtract it from the result to get a valid result?
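One possible approach, sketched here under assumptions rather than taken from the question: on Linux kernels built with scheduler statistics, /proc/<pid>/schedstat exposes three numbers for a task (time spent on the CPU in nanoseconds, time spent waiting on a run queue in nanoseconds, and the number of timeslices run). Sampling it before and after the benchmark gives the queue wait directly. The helper names below are hypothetical, and for a multi-threaded benchmark the per-thread files under /proc/<pid>/task/ would probably have to be summed.

```python
from pathlib import Path
from typing import NamedTuple


class SchedStat(NamedTuple):
    run_ns: int       # time spent running on a CPU, in nanoseconds
    wait_ns: int      # time spent waiting on a run queue, in nanoseconds
    timeslices: int   # number of timeslices run


def parse_schedstat(text: str) -> SchedStat:
    """Parse the three whitespace-separated fields of a schedstat line."""
    run_ns, wait_ns, timeslices = (int(field) for field in text.split())
    return SchedStat(run_ns, wait_ns, timeslices)


def read_schedstat(pid: int) -> SchedStat:
    """Read /proc/<pid>/schedstat (assumes the kernel provides this file)."""
    return parse_schedstat(Path(f"/proc/{pid}/schedstat").read_text())


def queue_wait_seconds(before: SchedStat, after: SchedStat) -> float:
    """Run-queue wait accumulated between two samples, in seconds."""
    return (after.wait_ns - before.wait_ns) / 1e9
```

Subtracting `queue_wait_seconds(before, after)` from the wall-clock duration of the benchmark would then approximate the time it actually spent executing.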

Related

Multi threading with Millicores in Kubernetes

I am confused by the concept of millicores in Kubernetes. As far as I know, only one thread can run per core, so why would one give a limit in millicores?
For example, if I give a CPU limit of 600m to a container, can I use the remaining 400m for another pod or container?
I have installed minikube and tried this on it.
Will both containers or pods run different threads? Could anyone please explain?
It's best to see millicores as a way to express fractions: x millicores correspond to the fraction x/1000 (e.g. 250 millicores = 250/1000 = 1/4).
The value 1 represents the complete usage of 1 core (or hardware thread if hyperthreading or any other SMT is enabled).
So 100mcpu means the process is using 1/10 of a single CPU's time. This means that it is running for 1 second out of 10, or 100ms out of a second, or 10us out of 100.
Just take any unit of time and divide it into ten parts: the process is running for only one of them.
Of course, if you take too short an interval (say, 1us), the overhead of the scheduler becomes non-negligible, but that's not important.
If the value is above 1, then the process is using more than one CPU. A value of 2300mcpu means that out of, say, 10 seconds, the process is running for... 23!
This is used to mean that the process is using 2 whole CPUs and 3/10 of a third one.
This may sound weird but it's no different to saying: "I work out 3.5 times a week" to mean that "I work out 7 days every 2 weeks".
Remember: millicores represent a fraction of CPU time not of CPU number. So 2300mcpu is 230% the time of a single CPU.
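As a small illustration of the arithmetic above (the function name is made up for this sketch), millicores can be converted into CPU time consumed over any wall-clock interval:

```python
def cpu_seconds(millicores: int, interval_seconds: float) -> float:
    """CPU time, in seconds, that `millicores` represents over an interval.

    Values above 1000 millicores yield more CPU time than wall-clock time,
    which is only possible when running on more than one CPU.
    """
    return millicores * interval_seconds / 1000


if __name__ == "__main__":
    print(cpu_seconds(100, 10))    # 1/10 of one CPU over 10 seconds
    print(cpu_seconds(2300, 10))   # more CPU time than wall-clock time
```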
What I hate about technologies like Kubernetes and Docker is that they hide too much, confusing seasoned programmers.
The millicores unit arises, at its base, from the way the Linux scheduler works. It doesn't divide time into quanta and assign each thread the CPU for one quantum; instead, it runs a thread until it would be unfair to keep it running. So a thread can run for a variable time.
The current Linux scheduler, named CFS, works with the concept of waiting time.
Each thread has a waiting time, a counter that is incremented each nanosecond (but any sufficiently fine unit of time will do) that the thread is waiting to execute and decremented each nanosecond that the thread is executing.
The threads are then ordered by their wait time divided by the total number of threads; the thread with the greatest wait time is picked and run until its wait time (which is now decreasing) falls below the wait time of another thread (which will then be scheduled).
So if we have one core (without HyperThreading or any other SMT) and four threads, after, say, a second, the scheduler will have allocated 1/4 of that second (250ms) to each thread.
You can say that each thread used 250 millicores. This means it uses 250/1000 = 1/4 of the core time on average. The "core time" can be any amount of time, provided it is far greater than the time scale at which the scheduler switches threads. So 250 millicores means 1 minute of time every 4, or 2 days every 8.
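The simplified waiting-time model described above can be tried out with a toy simulation. This is only a sketch of that description, not of the real CFS implementation; the 1-tick granularity, the tie-breaking rule (lowest index first) and all names are assumptions made for the example.

```python
from typing import List


def simulate_fair_share(num_threads: int, ticks: int) -> List[int]:
    """Return how many ticks each thread ran on a single core."""
    wait = [0] * num_threads   # per-thread waiting-time counters
    ran = [0] * num_threads    # ticks each thread actually ran
    for _ in range(ticks):
        # Pick the thread that has waited the longest (first one on ties).
        current = max(range(num_threads), key=lambda i: wait[i])
        for i in range(num_threads):
            if i == current:
                wait[i] -= 1   # running decreases the waiting time
                ran[i] += 1
            else:
                wait[i] += 1   # everyone else keeps waiting
    return ran


def to_millicores(ran_ticks: int, total_ticks: int) -> float:
    """Express a thread's share of one core in millicores."""
    return ran_ticks * 1000 / total_ticks
```

With four threads and 1000 ticks of 1ms each, every thread ends up with 250 ticks, which is the 250 millicores mentioned in the text.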
When a system has multiple CPUs/cores, the waiting time is scaled to account for that.
Now if a thread is scheduled, over the course of 1 second, to two CPUs for the whole second, we have a usage of 1/1 for the first CPU and 1/1 for the second one. A total of 1/1 + 1/1 = 2 or 2000mcpu.
This way of counting CPU time, albeit weird at first, has the advantage that it is absolute. 100mcpu means 1/10 of a CPU, no matter how many CPUs there are; this is by design.
If we counted time in a relative manner (i.e. where the value 1 means all the CPUs), then a value like 0.5 would mean 24 CPUs in a 48-CPU system and 4 in an 8-CPU system.
It would be hard to compare timings.
The Linux scheduler doesn't actually know about millicores; as we have seen, it uses the waiting time and doesn't need any other measurement unit.
The millicores unit is, so far, just a unit we made up for our convenience.
However, it will turn out this unit will arise naturally due to how containers are constrained.
As implied by its name, the Linux scheduler is fair: all threads are equal. But you don't always want that; a process in a container should not hog all the cores on a machine.
This is where cgroups come into play. Cgroups are a kernel feature that is used, along with namespaces and union file systems, to implement containers.
Their main goal is to restrict processes, including their CPU bandwidth.
This is done with two parameters, a period and a quota.
The restricted thread is allowed, by the scheduler, to run for quota microseconds (us) every period us.
Here, again, a quota greater than the period means using more than one CPU. Quoting the kernel documentation:
Limit a group to 1 CPU worth of runtime.
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms.
We see how, given x millicores, we can compute the quota and the period.
We can fix the period to 100ms and set the quota to (100 * x) / 1000 ms.
This is how Docker does it.
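The conversion can be written out as a short sketch. The function name is hypothetical; the default period of 100ms is the value stated above, expressed in microseconds because that is the unit the text gives for the quota and the period.

```python
def cfs_quota_us(millicores: int, period_us: int = 100_000) -> int:
    """Quota, in microseconds per period, that corresponds to `millicores`.

    A quota larger than the period means more than one CPU's worth of time.
    """
    return millicores * period_us // 1000


if __name__ == "__main__":
    # 600m with the default 100ms period: 60ms of runtime every 100ms.
    print(cfs_quota_us(600))
    # The two kernel documentation examples quoted above:
    print(cfs_quota_us(1000, period_us=250_000))   # 1 CPU, 250ms period
    print(cfs_quota_us(2000, period_us=500_000))   # 2 CPUs, 500ms period
```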
Of course, there are many possible pairs: we set the period to 100ms, but we could use any other value (strictly speaking, the number of valid values is finite, but still large).
Larger values of the period mean the thread can run for a longer time but will also pause for a longer time.
Here is where Docker is hiding things from the programmer, using an arbitrary value for the period in order to compute the quota (given the millicores, which the authors dub as more "user-friendly").
Kubernetes is designed around Docker (yes, it can use other container managers, but they must expose an interface similar to Docker's), and the Kubernetes millicores unit matches the unit used by Docker in its --cpus parameter.
So, long story short, millicores are the fractions of time of a single CPU (not the fraction of number of CPUs).
Cgroups, and hence Docker, and hence Kubernetes, don't restrict CPU usage by assigning cores to processes (like VMs do); instead they restrict CPU usage by restricting the amount of time (quota over period) the process can run on each CPU (with each CPU contributing up to 1000mcpu worth of allowed time).
The scheduler of the kernel running the containers (e.g. Linux) has means to reserve time slices for a process so that it runs concurrently with other processes on the same CPU.
You can throttle a process, giving it fewer time slices, if it uses too much CPU. This happens when a (hard) limit is hit. You can schedule a pod to a different node if the CPU requests exceed the available CPU resources on a node.
So the request is a hint that tells the Kubernetes scheduler how to place pods optimally across nodes, and the limit is enforced by the kernel scheduler to ensure that no more resources are actually used.
In fact, if you configure only requests and no limits, all pods will be scheduled according to the kernel scheduler policy, which tries to be fair and balance the resources across all processes to maximize usage while not starving any single process.

CPU percentage and heavy multi-threading

I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger"; the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads wake up basically simultaneously within a short period of time, do their (relatively short) computation and go back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each, it's an i7-3612QE), only 8 threads can really run at a time, so many threads will have to wait. Some of these tasks also have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now to the strange effect: if I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks at the CPU time (user + system) that the kernel accounts to this process, so 250% would mean the process uses about 2.5 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses less CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
I essentially answered my own question in the last paragraph: the process is memory bound. Hence the limiting resource is not the CPU but the memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly has the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just quite slowly. All my other observations are consistent with this.

Thread pool: Simple example for the wait time and execution time to determine the size of the pool

I am trying to find simple examples of what exactly the wait time and execution time are when determining the size of a thread pool. According to Brian Goetz:
For tasks that may wait for I/O to complete -- for example, a task
that reads an HTTP request from a socket -- you will want to increase
the pool size beyond the number of available processors, because not
all threads will be working at all times. Using profiling, you can
estimate the ratio of waiting time (WT) to service time (ST) for a
typical request. If we call this ratio WT/ST, for an N-processor
system, you'll want to have approximately N*(1+WT/ST) threads to keep
the processors fully utilized.
I really didn't understand what he meant by input/output. Who is doing the I/O tasks?
Imagine a task that reads some data from disk. What actually happens:
1. Open the file.
2. Wait for the (spinning) disk to awake from sleep, position the head at the right spot, and for the desired blocks to appear underneath the head, until all bytes arrive in a buffer.
3. Read from the buffer.
The whole task takes 0.1s to complete. Of this 0.1s, 10 percent is spent on steps 1 and 3 and the remaining 90 percent on step 2. So 0.01s is "working time" and 0.09s is "wait time" spent waiting for the disk.
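Plugging these numbers into the quoted formula N*(1+WT/ST) can be sketched as follows. The function name is made up for the example, and rounding to the nearest integer is an assumption justified only by the word "approximately" in the quote.

```python
def pool_size(processors: int, wait_time: float, service_time: float) -> int:
    """Approximate thread count N * (1 + WT/ST) from the quoted rule.

    wait_time and service_time must use the same unit; only their ratio
    matters.
    """
    return round(processors * (1 + wait_time / service_time))


if __name__ == "__main__":
    # Disk example from above: 0.09s waiting, 0.01s working, on 4 processors.
    print(pool_size(4, wait_time=0.09, service_time=0.01))
```

For a task that never waits (WT = 0), the formula reduces to one thread per processor.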

Execute process for n cpu cycles

How can I execute a process for n CPU cycles on Linux? I have a batch processing system on a multi-core server and would like to ensure that each task gets exactly the same amount of CPU time. Once that amount of CPU time is consumed, I would like to stop the process. So far I have tried to do something with utime and stime from /proc/pid/stat, but I did not succeed.
I believe it is impossible to give the exact same number of cycles to several processes (a CPU cycle is often less than a nanosecond). You could, however, execute a process for x CPU seconds. For that, use setrlimit(2) with RLIMIT_CPU.
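A minimal sketch of the RLIMIT_CPU approach, using Python's standard resource module (a wrapper around setrlimit(2)). The helper name is hypothetical, and the sketch assumes a Unix system where the existing hard limit is not lower than the requested soft limit. The soft limit is set in the child just before exec; when the child has consumed that much CPU time, the kernel sends it SIGXCPU, whose default action terminates the process.

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds: int) -> int:
    """Run argv with a soft CPU-time limit; return the child's return code.

    A negative return code -N means the child was killed by signal N.
    """
    def limit_cpu():
        # Runs in the child between fork and exec; keep the hard limit as is.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, hard))

    return subprocess.run(argv, preexec_fn=limit_cpu).returncode


if __name__ == "__main__":
    # A busy loop that would never finish on its own.
    rc = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
    print("killed by SIGXCPU" if rc == -signal.SIGXCPU else f"exit code {rc}")
```

Note that the limit counts CPU seconds, not cycles, so tasks get equal CPU time only to that granularity.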
Your batch processor could also manage time itself; see time(7). You could use timers (see timer_create(2) and timerfd_create(2)), have an event loop around poll(2), and measure time with clock_gettime(2).
I'm not sure it is useful to write your own batch processing system. You could use the existing batch, slurm or gnqs (see also commercial products like pbsworks, lsf, ...).

Measuring thread idle time with TBB

I have a threaded application with timing blocks around all the sections of code I believe are performance intensive. These sections are long enough that the timing code can remain active even in final production runs. The application has a bunch of worker threads pulling work from a task pool.
In my current custom pool implementation, I compute two extra times for each worker thread: (1) the amount of idle time spent waiting on an empty pool for a job to arrive and (2) the amount of "missing" time not accounted for by either idling or a timing block (indicating a performance intensive region that I've missed).
I'm now considering switching my code over to Intel Threading Building Blocks (TBB). Is it possible to measure idle time when using TBB tasks? The sum of idle time and missing time can easily be computed by subtracting each timed section from the total wall clock interval, but getting at each separately requires special support (or perhaps a special idle task with low priority that spins trying to preempt itself?).
Note: I also asked this question at http://software.intel.com/en-us/forums/showthread.php?t=107203, and will synchronize any answers.
