Multithreading with millicores in Kubernetes

I am confused about the concept of millicores in Kubernetes. As far as I know, only one thread can run per core, so why would you give a limit in millicores?
For example, if I give a CPU limit of 600m to a container, can I use 400m for another pod or container? Is that possible?
I have tried installing minikube and running workloads on it.
Will both containers or pods run different threads? If anyone can explain, please do.

It's best to see millicores as a way to express fractions: x millicores correspond to the fraction x/1000 (e.g. 250 millicores = 250/1000 = 1/4).
The value 1 represents the complete usage of one core (or hardware thread if hyperthreading or any other form of SMT is enabled).
So 100mcpu means the process is using 1/10th of a single CPU's time. That is, it is using 1 second out of every 10, or 100ms out of every second, or 10us out of every 100us.
Just take any unit of time, divide it into ten parts, the process is running only for one of them.
Of course, if you take too short an interval (say, 1us), the overhead of the scheduler becomes non-negligible, but that's not important here.
If the value is above 1, then the process is using more than one CPU. A value of 2300mcpu means that out of, say, 10 seconds of wall time, the process accumulates... 23 seconds of CPU time!
This is used to mean that the process is using 2 whole CPUs and a 3/10 of a third one.
This may sound weird but it's no different to saying: "I work out 3.5 times a week" to mean that "I work out 7 days every 2 weeks".
Remember: millicores represent a fraction of CPU time not of CPU number. So 2300mcpu is 230% the time of a single CPU.
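As a quick sanity check of the arithmetic above, here is a small Python sketch (Python is used purely for illustration; the function name is mine) that turns a millicore value into the fraction of CPU time it represents:

    # x millicores = x/1000 of one CPU's worth of time
    def describe(mcpu):
        whole, rest = divmod(mcpu, 1000)
        return (f"{mcpu}m = {mcpu / 1000:.2f} CPUs worth of time "
                f"({whole} whole CPU(s) plus {rest}/1000 of another)")

    print(describe(250))    # 250m = 0.25 CPUs worth of time (0 whole CPU(s) plus 250/1000 of another)
    print(describe(2300))   # 2300m = 2.30 CPUs worth of time (2 whole CPU(s) plus 300/1000 of another)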
What I hate about technologies like Kubernetes and Docker is that they hide too much, confusing seasoned programmers.
The millicores unit arises, at its base, from the way the Linux scheduler works. It doesn't divide time into quanta and assign each thread the CPU for a quantum; instead, it runs a thread until it's unfair to keep it running. So a thread can run for a variable amount of time.
The current Linux scheduler, named CFS, works with the concept of waiting time.
Each thread has a waiting time, a counter that is incremented each nanosecond (but any sufficiently fine unit of time will do) that the thread is waiting to execute and that is decremented each nanosecond the thread is executing.
The threads are then ordered by their wait time divided by the total number of threads; the thread with the greatest wait time is picked and run until its wait time (which is now decreasing) falls below the wait time of another thread (which will then be scheduled).
So if we have one core (without HyperThreading or any other SMT) and four threads, after, say, a second, the scheduler will have allocated 1/4 of that second (250ms) to each thread.
You can say that each thread used 250 millicores. This means it uses 250/1000 = 1/4 of the core's time on average. The "core time" can be any amount of time, provided it is much longer than the scheduler's switching granularity. So 250 millicores means 1 minute of time every 4, or 2 days every 8.
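To make the wait-time description above a bit more concrete, here is a deliberately simplified toy model in Python (not the real CFS algorithm, just the bookkeeping described in this answer): one core, four runnable threads, a counter per thread that goes up while waiting and down while running, and a switch whenever some other thread has now waited longer.

    def simulate(num_threads=4, ticks=1_000_000):
        wait = [0] * num_threads      # per-thread wait counters
        runtime = [0] * num_threads   # ticks actually spent on the core
        current = 0                   # thread currently on the core
        for _ in range(ticks):
            runtime[current] += 1
            for t in range(num_threads):
                wait[t] += -1 if t == current else 1
            # preempt if another thread has now waited longer
            hungriest = max(range(num_threads), key=lambda t: wait[t])
            if wait[hungriest] > wait[current]:
                current = hungriest
        return [r / ticks for r in runtime]

    print(simulate())   # roughly [0.25, 0.25, 0.25, 0.25], i.e. 250 millicores each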
When a system has multiple CPUs/cores, the waiting time is scaled to account for that.
Now if a thread is scheduled, over the course of 1 second, on two CPUs for the whole second, we have a usage of 1/1 for the first CPU and 1/1 for the second one. A total of 1/1 + 1/1 = 2, or 2000mcpu.
This way of counting CPU time, albeit weird at first, has the advantage that it is absolute. 100mcpu means 1/10 of a CPU, no matter how many CPUs there are; this is by design.
If we counted time in a relative manner (i.e. where the value 1 means all the CPUs) then a value like 0.5 would mean 24 CPUs in a 48-CPU system and 4 in an 8-CPU system.
It would be hard to compare timings.
The Linux scheduler doesn't actually know about millicores, as we have seen it uses the waiting time and doesn't need any other measurement unit.
That millicores unit is just a unit we make up, so far, for our convenience.
However, it will turn out this unit will arise naturally due to how containers are constrained.
As implied by its name, the Linux scheduler is fair: all threads are equal. But you don't always want that; a process in a container should not hog all the cores on a machine.
This is where cgroups come into play: a kernel feature that is used, along with namespaces and union filesystems, to implement containers.
Its main goal is to restrict processes, including their CPU bandwidth.
This is done with two parameters, a period and a quota.
The restricted thread is allowed, by the scheduler, to run for quota microseconds (us) every period us.
Here, again, a quota greater than the period means using more than one CPU. Quoting the kernel documentation:
Limit a group to 1 CPU worth of runtime.
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms.
We see how, given x millicores, we can compute the quota and the period.
We can fix the period to 100ms and the quota to (100 * x) / 1000.
This is how Docker does it.
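A minimal sketch of that conversion (the constant and function names are mine): fix the period at 100ms, Docker's default, and derive the quota from the requested millicores.

    CFS_PERIOD_US = 100_000                      # 100ms period, expressed in microseconds

    def millicores_to_cfs(mcpu):
        quota_us = CFS_PERIOD_US * mcpu // 1000  # quota = period * x/1000
        return quota_us, CFS_PERIOD_US

    print(millicores_to_cfs(600))    # (60000, 100000)  -> 60ms of CPU time every 100ms
    print(millicores_to_cfs(2300))   # (230000, 100000) -> 230ms every 100ms, i.e. 2.3 CPUs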
Of course, we have an infinite choice of pairs; we set the period to 100ms, but we could indeed use any value (strictly speaking there aren't infinitely many values, but still).
Larger values of the period mean the thread can run for a longer time but will also pause for a longer time.
Here is where Docker is hiding things from the programmer, using an arbitrary value for the period in order to compute the quota (given the millicores, which the authors deem more "user-friendly").
Kubernetes is designed around Docker (yes, it can use other container runtimes, but they must expose an interface similar to Docker's), and the Kubernetes millicores unit matches the unit used by Docker in its --cpus parameter.
So, long story short, millicores are the fractions of time of a single CPU (not the fraction of number of CPUs).
Cgroups, and hence Docker, and hence Kubernetes, don't restrict CPU usage by assigning cores to processes (like VMs do); instead, they restrict CPU usage by restricting the amount of time (quota over period) the process can run on each CPU (with each CPU contributing up to 1000mcpu worth of allowed time).
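If you want to see the quota/period pair a running container actually received, you can read it back from the cgroup filesystem. The sketch below assumes cgroup v2, where the file cpu.max contains "<quota_us> <period_us>" (or "max" for unlimited); the exact path of a container's cgroup depends on your distro and container runtime, so the default path here is only illustrative.

    def read_millicores(cpu_max_path="/sys/fs/cgroup/cpu.max"):
        quota, period = open(cpu_max_path).read().split()
        if quota == "max":
            return None                            # no CPU limit set
        return int(quota) * 1000 // int(period)    # convert back to millicores

    print(read_millicores())   # e.g. 600 for a container limited to 600m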

The scheduler of the kernel running the containers (e.g. Linux) has the means to reserve time slices for a process to run concurrently with other processes on the same CPU.
You can throttle a process, giving it fewer time slices, if it uses too much CPU. This happens when a (hard) limit is hit. A pod can also be scheduled to a different node if its CPU requests exceed the available CPU resources on a node.
So the request is a hint for the Kubernetes scheduler on how to optimally place pods across nodes, and the limit ensures, via the kernel scheduler, that no more resources are actually used.
In fact, if you configure only requests and no limits, all pods are scheduled according to the kernel scheduler's policy, which tries to be fair and to balance resources across all processes, maximizing usage while not starving any single process.
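For reference, this is roughly where requests and limits live in a pod spec. The sketch uses the official kubernetes Python client (pip install kubernetes) and only builds the objects locally; nothing is sent to a cluster, and the pod name, container name and image are placeholders.

    from kubernetes import client

    container = client.V1Container(
        name="app",
        image="nginx",                        # placeholder image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "400m"},         # hint for the Kubernetes scheduler (placement)
            limits={"cpu": "600m"},           # enforced by the kernel via the CFS quota
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="demo"),
        spec=client.V1PodSpec(containers=[container]),
    )
    print(pod.spec.containers[0].resources.limits)   # {'cpu': '600m'}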

Related

Concurrent threads, processes and multiple cores

I'm trying to understand the usage of CPU cores with regard to concurrent threads and processes. Please see the below questions:
Assume I have 2 CPU cores. When there are 2 processes running, each process has only 1 thread. Are the two processes using the 2 cores?
Assume I have 2 CPU cores. When there is 1 process running, which has 2 threads. Are the two threads using the 2 cores?
Assume I have 2 CPU cores. When there are 2 processes running, each process has 2 threads. How are the two cores used by those processes and threads?
How do I calculate the maximum real concurrent execution given the number of CPU cores? What other factors should I take into account?
1,2: Quite likely, but not definitely. A portion of the system software determines what runs where. It would be unlikely to choose to keep a process or thread waiting for CPU attention when there is a core that is otherwise idle, but it isn't absolute.
Most processing involves some sort of transfer to and from a device, network, etc.. Typically this necessitates a period of inactivity waiting for the transfer to complete. During this inactivity, another process / thread can run on that cpu. So, if a given process is 30% cpu time and 70% I/O time, then I can run about 3 of them concurrently on a single cpu without degrading performance.
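A back-of-the-envelope version of that estimate (the function name and numbers are just illustrative): if each task needs only a fraction f of a core's time, roughly cores / f of them can run concurrently before the CPU itself becomes the bottleneck.

    def max_concurrent(cores, cpu_fraction):
        return cores / cpu_fraction

    print(max_concurrent(1, 0.3))   # ~3.3 tasks of 30% CPU / 70% I/O on one core
    print(max_concurrent(2, 0.3))   # ~6.7 such tasks on two cores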
3,4: As the paragraph above implies, depending upon the workload, there could be any distribution of the threads among the CPUs. If the threads were all compute bound (100% CPU), most operating systems switch between them at a granularity small enough that all remain lively, and large enough that the switching has a minimal impact on them.
This scheduling may take other notions into consideration, such as data affinity. Recently touched bits of data are likely to remain in the cpu cache when a thread has relinquished it. The next time the thread is to be scheduled, it would be best to put it onto that cpu, to retain the effort required to warm the cache for it. It might also think that two threads of one process (address space) are more likely to share data, so should prefer the same cpu.
4: depending upon your system, there are likely to be many performance analysis tools available. Top, on UNIX-inspired systems is a simple tool which gives system wide utilization information, and the simple tool time will show how much time a process spent on a cpu vs real-world time. If you run each of your tasks sequentially, noting the cpu-time that they take, then time them running concurrently, the ratio between these cpu-times indicates the scaling factor of your concurrent app. Note that real-world time can be misleading because of io-overlap.
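A rough Python analogue of that measurement, for what it's worth: process_time() counts CPU time for the current process while perf_counter() counts wall-clock time, and their ratio is the scaling/overlap factor described above.

    import time

    def busy(n=5_000_000):          # stand-in for a CPU-bound task
        s = 0
        for i in range(n):
            s += i * i
        return s

    wall0, cpu0 = time.perf_counter(), time.process_time()
    busy()
    wall1, cpu1 = time.perf_counter(), time.process_time()
    print(f"cpu: {cpu1 - cpu0:.2f}s, wall: {wall1 - wall0:.2f}s, "
          f"ratio: {(cpu1 - cpu0) / (wall1 - wall0):.2f}")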

Weird CPU usage: 100% utilization, but temperature abnormally low

I have encountered a weird behavior with my algorithm/cpu, I was wondering what could be causing this.
CPU that I am using: AMD 2990WX 32c/64t, OS: Ubuntu 18.04LTS with 4.15.0-64-generic kernel.
The algorithm (Julia 1.0.3):
@sync @distributed for var in range(0.1,step=0.1,stop=10.0)
    res = do_heavy_stuff(var)  # solves a differential equation;
                               # basically, multiplying 200x200 matrices many times
    save(filename, "RES", res)
end
Function do_heavy_stuff(var) takes ~3 hours to solve on a single CPU core.
When I launch it in parallel with 10 processes (julia -p 10 my_code.jl), it takes ~4 hours for each parallel loop, meaning every 4 hours I get 10 files saved. The slowdown is expected, as the CPU frequency goes down from 4.1GHz to 3.4GHz.
If I launch 3 separate instances with 10 processes each, for a total CPU utilization of 30 cores, it still takes ~4 hours for one loop cycle, meaning I get 30 runs completed and saved every 4 hours.
However, if I run 2 instances with 30 processes each at once (julia -p 30 my_code.jl, one with a nice value of 0, the other with a nice value of +10), I see (using htop) that CPU utilization is 60(+) threads, but the algorithm becomes extremely slow (after 20 hours still zero files saved). Furthermore, I see that the CPU temperature is abnormally low (~45°C instead of the expected 65°C).
From this information I can guess that using (almost) all threads of my CPU makes it do something useless that is eating up CPU cycles, but no floating point operations are being done. I see no I/O to the SSD, and I utilize only half of my RAM.
I launched mpstat (mpstat -A): https://pastebin.com/c19nycsT and I can see that all of my cores are just sitting in the idle state, which explains the low temperature. However, I still don't understand what exactly the bottleneck is. How do I troubleshoot from here? Is there any way to see (without touching hardware) whether the problem is RAM bandwidth or something else?
EDIT: It came to my attention that I was using mpstat wrong. Apparently mpstat -A gives CPU stats since the launch of the computer, while what I needed were short-interval results, which can be obtained with mpstat -P ALL 2. Unfortunately, I only learned this after I killed the code in question, so there is no real data from mpstat. However, I am still interested: how would one troubleshoot such a situation, where cores seem to be doing something, but the result is not showing? How do I find the bottleneck?
Since you are using multiprocessing, there are 2 most likely reasons for the observed behavior:
long delays on I/O. When you are processing lots of disk data or reading data from the network, your processes naturally stall. In this case CPU utilization can be low, combined with long execution times.
high variance of execution time for do_heavy_stuff. This variance could arise from unstable I/O or from different model parameters resulting in different execution times. Why this is a problem requires understanding how @distributed shares the workload among worker processes. Namely, each worker gets an equal share of the for loop's range. For example, if you have 4 workers, the first one gets var in the range 0.1:0.1:2.5, the second one 2.6:0.1:5.0, and so on. Now if some of the var values result in heavy tasks, the first worker might get 5h of work and the other workers 1h of work. This means that @sync completes after 5 hours with only one CPU actually working the whole time.
Looking at your post I would strongly bet on the second reason.
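A rough illustration of that static split (sketched in Python rather than Julia, with a made-up cost model): the parameter range is cut into one equal-sized chunk per worker, so if the cost grows with var, the last worker ends up with most of the work and @sync has to wait for it alone.

    def split_evenly(values, workers):
        size = (len(values) + workers - 1) // workers
        return [values[i:i + size] for i in range(0, len(values), size)]

    def cost(v):                 # pretend the solve time grows with var
        return v ** 2

    vars_ = [round(0.1 * i, 1) for i in range(1, 101)]     # 0.1, 0.2, ..., 10.0

    for w, chunk in enumerate(split_evenly(vars_, 4), start=1):
        total = sum(cost(v) for v in chunk)
        print(f"worker {w}: vars {chunk[0]}..{chunk[-1]}, total cost {total:.0f}")
    # worker 4's chunk costs far more than worker 1's -> it finishes last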

CPU percentage and heavy multi-threading

I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing their (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each; it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms, each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: if I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses fewer CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: the process is memory bound. Hence the limiting resource is not the CPU but the memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel will mainly have the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just rather slowly. All my other observations are consistent with this.

Context switch: what happens in a worst case scenario?

I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM-intensive independent threads is most often a near-optimal situation. The amount of time spent in context switches is rather negligible. While the system may sometimes decide to reassign threads to different cores in a round-robin fashion, causing the CPU caches to go cold, this has a minor effect and it works almost as if each thread were running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But it does not seem to be the case and it looks like threads are switched intensively causing a strongly suboptimal performance (throughput). Am I right about it? What is the main cause for this suboptimal performance? A few numbers would be nice if you have any idea of orders of magnitude of (for example): switches per second, performance loss caused by switching...
I'm going to answer my own question (after some search).
On windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (core i7/Windows 10) and the order of magnitude is around 1000/s by core when the number of running threads is more than the number of cores (and these threads are full CPU).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic avg. order of magnitude seems to be 1000 ns. Thus the total time for all context switches on each core is 1ms per second, that is 0.1%.
This does not depend on the number of threads: whether you run 100 or 1000 threads, the number of switches does not change. In conclusion, the time spent in context switching is essentially negligible.
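Spelling out that arithmetic (the two inputs are the measured/estimated orders of magnitude quoted above):

    switches_per_sec_per_core = 1000       # measured order of magnitude
    cost_per_switch_ns = 1000              # pessimistic average cost
    overhead = switches_per_sec_per_core * cost_per_switch_ns * 1e-9
    print(f"{overhead:.1%} of each core is spent context switching")   # 0.1%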
This reasoning is correct as long as the threads are pure CPU with only small memory read/write like a few local variables. I ran a test with full CPU threads and the difference between a few and 1000 threads is not noticeable.
But the situation changes when RAM is involved and switching makes the CPU (memory) caches less efficient. A worst case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others... would take full advantage of the cache, while running 1000 threads at a time would cause the cache to be useful only during 1ms.
But if the data of several threads could fit into the cache, or if the threads read common data to some degree, or if each thread reads the data just once, then it is possible that running 1000 threads vs. running 10 threads a hundred times will have similar throughput.
It is more a matter of adapting parallelism to memory access. And it depends very much on the way memory needs to be accessed.
The time spent in context switching is negligible; the time lost because of poor usage of caches may sometimes be a problem, sometimes not, depending on how the memory is accessed and shared.

Question about an app with multiple threads on a machine with few CPUs

Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is: whose throughput, in terms of transactions processed per minute, might be better? Why?
Note that the above is not a real environment; I just made up the data to make the question clear. My current thinking is that there should be no difference, because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given the fact that there are other applications running on the machine and one application might not always be given 100% of the CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads is generally better. Also, if the threads aren't executing the same part of the code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single-proc VM since I couldn't find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one-proc PC with lots of threads doing processor-intensive work, Windows is smart enough not to rapidly switch them back and forth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
The result: Windows' scheduler is fair on a thread level, but not on a process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on the next. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand, you must consider the overhead involved in dealing with multiple threads. The details of the application are an important part of the consideration here.

Resources