What is the concept of vruntime in CFS - linux

I have been reading about the Linux kernel and the CFS scheduler in the kernel. I came across vruntime (virtual runtime), which is the core concept behind the CFS scheduler. I read “Linux Kernel Development” and also other blogs on the internet but could not understand the basic calculations behind vruntime. Does vruntime belong to a particular process, or does it belong to a group of processes with the same nice value? What is the weighting factor and how is it calculated? I went through all these concepts but could not understand them. Also, what is the difference between vruntime and min_vruntime?

vruntime is per-thread; it is a member of the sched_entity that is embedded within each task_struct.
Essentially, vruntime is a measure of the "runtime" of the thread - the amount of time it has spent on the processor. The whole point of CFS is to be fair to all; hence, the algo kind of boils down to a simple thing: (among the tasks on a given runqueue) the task with the lowest vruntime is the task that most deserves to run, hence select it as 'next'. (The actual implementation is done using an rbtree for efficiency).
Taking into account various factors - like priority, nice value, cgroups, etc - the calculation of vruntime is not as straightforward as a simple increment. I'd suggest reading the relevant section in "Professional Linux Kernel Architecture", Mauerer, Wrox Press - it's explained in great detail.
Please see below for a quick attempt at summarizing some of this.
Other resource:
Documentation/scheduler/sched-design-CFS.txt
Quick summary - vruntime calculation:
(based on the book)
Most of the work is done in kernel/sched_fair.c:__update_curr()
Called on timer tick
Updates the physical and virtual time 'current' has just spent on the processor
For tasks that run at default priority, i.e., nice value 0, the physical and virtual time spent is identical
Not so for tasks at other priority (nice) levels; thus the calculation of vruntime is affected by the priority of current using a load weight factor
delta_exec = (unsigned long)(now - curr->exec_start);
// ...
delta_exec_weighted = calc_delta_fair(delta_exec, curr);
curr->vruntime += delta_exec_weighted;
Neglecting some rounding and overflow checking, what calc_delta_fair does is to
compute the value given by the following formula:
delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)
The thing is, more important tasks (those with a lower nice value) will have larger
weights; thus, by the above equations, the vruntime accounted to them will be smaller
(thus having them enqueued more to the left on the rbtree!).
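To make the arithmetic concrete, here is a minimal user-space sketch of that weighting (names and fields loosely mirror the kernel's; the fixed-point scaling, rounding and overflow handling of the real calc_delta_fair() are deliberately left out):

#include <stdint.h>

#define NICE_0_LOAD 1024ULL            /* weight of a nice-0 task */

struct load_weight  { uint64_t weight; };
struct sched_entity { struct load_weight load; uint64_t vruntime; uint64_t exec_start; };

/* Simplified model of __update_curr(): charge the elapsed physical time,
 * scaled by NICE_0_LOAD / weight, to the task's virtual runtime. */
static void update_vruntime(struct sched_entity *curr, uint64_t now_ns)
{
    uint64_t delta_exec = now_ns - curr->exec_start;                 /* physical ns */
    curr->vruntime += delta_exec * NICE_0_LOAD / curr->load.weight;  /* virtual ns  */
    curr->exec_start = now_ns;
}

With weight == NICE_0_LOAD (nice 0), virtual and physical time are identical; a task re-niced to 1 (weight of roughly 820, as in the example further down) accrues vruntime about 25% faster and therefore drifts to the right of the rbtree sooner.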

The vruntime is the virtual runtime of a process, which tracks how much time a process has run. vruntime is a member of the sched_entity structure defined in include/linux/sched.h.
The min_vruntime represents the minimum vruntime of a CFS runqueue, i.e. the minimum of the vruntime values of all the processes scheduled on that CFS runqueue. min_vruntime is a member of the cfs_rq structure defined in kernel/sched/sched.h.
The purpose of min_vruntime is to select the next process in the cfs runqueue to run. In order to be fair to all the processes, the CFS scheduler selects the process with the minimum vruntime to execute first.
The link to include/linux/sched.h is: https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h

Related

Thread Quantum: How to compute it

I have been reading a few posts and articles regarding thread quanta (here, here and here). Apparently Windows allocates a fixed number of CPU ticks for a thread quantum, depending on the Windows "mode" (server, or something else). However, from the last link we can read:
(A thread quantum) between 10-200 clock ticks (i.e. 10-200 ms) under Linux, though some
granularity is introduced in the calculation
Is there any way to compute the quantum length on Linux?
Does it make any sense to compute it anyway? (Since, from my understanding, threads can still be preempted; nothing forces a thread to run for the full duration of the quantum.)
From a developer's perspective, I could see the interest in writing a program that could predict the running time of a program given its number of threads and "what they do" (possibly removing all the testing needed to find the optimal number of threads would be kind of neat, although I am not sure it is the right approach).
On Linux, the default realtime quantum length is the constant RR_TIMESLICE, at least in 4.x kernels; it is expressed in timer ticks, so its wall-clock value depends on HZ, which is set while configuring the kernel.
The interval between pausing the thread whose quantum has expired and resuming it may depend on a lot of things like, say, load average.
To be able to predict the running time at least with some degree of accuracy, give the target process realtime priority under the SCHED_RR policy; SCHED_RR processes are scheduled with a round-robin algorithm, which is generally simpler and more predictable than the common Linux scheduling algorithm.
To get the realtime quantum length, call sched_rr_get_interval().
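For example, a small program along these lines (a sketch, not production code) reports the round-robin quantum of the calling process:

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;

    /* pid 0 means "the calling process" */
    if (sched_rr_get_interval(0, &ts) != 0) {
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("SCHED_RR quantum: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}

Run it after giving the process the SCHED_RR policy (for instance via chrt -r) to see the realtime quantum discussed above.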

How linux process scheduler prevents starvation of a process

I have read that the Linux kernel contains several scheduler classes, each having its own priority. To select a new process to run, the process scheduler iterates from the highest-priority class to the lowest-priority class. If a runnable process is found in a class, the highest-priority process is selected to run from that class.
Extract from Linux Kernel Development by Robert Love:
The main entry point into the process scheduler is the function schedule(), defined in kernel/sched.c. This is the function that the rest of the kernel uses to invoke the process scheduler, deciding which process to run and then running it. schedule() is generic with respect to scheduler classes. That is, it finds the highest priority scheduler class with a runnable process and asks it what to run next. Given that, it should be no surprise that schedule() is simple. The only important part of the function—which is otherwise too uninteresting to reproduce here—is its invocation of pick_next_task(), also defined in kernel/sched.c. The pick_next_task() function goes through each scheduler class, starting with the highest priority, and selects the highest priority process in the highest priority class.
Let's imagine the following scenario. There are some processes waiting in lower priority classes and processes are being added to higher priority classes continuously. Won't the processes in lower priority classes starve?
The Linux kernel implements the Completely Fair Scheduling (CFS) algorithm, which is based on a virtual clock.
Each scheduling entity has a sched_entity structure associated with it, a snapshot of which looks like this:
struct sched_entity {
    ...
    u64 exec_start;            /* timestamp taken when the task is put on the CPU  */
    u64 sum_exec_runtime;      /* total physical CPU time consumed (cumulative)    */
    u64 vruntime;              /* weighted (virtual) runtime                       */
    u64 prev_sum_exec_runtime; /* sum_exec_runtime saved when taken off the CPU    */
    ...
};
The above four attributes are used to track the runtime of a process; using these attributes, together with update_curr() (where they are updated), the virtual clock is implemented.
When a process is assigned to a CPU, exec_start is updated to the current time and the consumed CPU time is recorded in sum_exec_runtime. When the process is taken off the CPU, the sum_exec_runtime value is preserved in prev_sum_exec_runtime. sum_exec_runtime accumulates over time (meaning it grows monotonically).
vruntime stores the amount of time that has elapsed on virtual clock during process execution.
How is vruntime calculated?
Ignoring all the complex calculations, the core concept of how it is calculated is:
vruntime += delta_exec_weighted;
delta_exec_weighted = delta_exec * (NICE_0_LOAD/load.weight);
Here delta_exec is the time the process spent on the CPU between being assigned to it and being taken off it, whereas load.weight is the weight of the process, which depends on its priority (nice value). Usually, an increase of 1 in a process's nice value means it gets about 10 percent less CPU time, i.e. a lower weight.
Process with nice value 0: weight = 1024
Process re-niced to 1: weight = 1024 / 1.25 = 820 (approx.)
Points drawn from the above:
So vruntime increases when a process gets the CPU.
And vruntime increases more slowly for higher-priority processes than for lower-priority ones.
The runqueue is maintained as a red-black tree, and each runqueue has a min_vruntime variable associated with it that holds the smallest vruntime among all the processes in the runqueue. (min_vruntime can only increase, never decrease, as processes are scheduled.)
The key for a node in the red-black tree is process->vruntime - min_vruntime.
When the scheduler is invoked, the kernel basically picks up the task which has the smallest key (the leftmost node) and assigns it the CPU.
Elements with a smaller key will be placed more to the left, and thus be scheduled more quickly.
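A minimal, self-contained sketch of that key (modelled on the kernel's entity_key() helper; the structures are reduced to the two fields that matter here):

#include <stdint.h>

struct cfs_rq       { uint64_t min_vruntime; };
struct sched_entity { uint64_t vruntime; };

/* The rbtree key: the entity's vruntime relative to the runqueue's
 * min_vruntime. The leftmost node has the smallest key, i.e. it is the
 * task that has received the least (weighted) CPU time so far. */
static inline int64_t entity_key(const struct cfs_rq *cfs_rq,
                                 const struct sched_entity *se)
{
    return (int64_t)(se->vruntime - cfs_rq->min_vruntime);
}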
When a process is running, its vruntime will steadily increase, so it will finally move rightwards in the red-black tree.
Because vruntime increases more slowly for more important processes, they will also move rightwards more slowly, so their chance of being scheduled is bigger than that of a less important process - just as required.
If a process sleeps, its vruntime will remain unchanged. Because the per-queue min_vruntime increases in the meantime, the sleeping process will be placed more to the left after waking up, because its key (mentioned above) has become smaller.
Therefore there is no chance of starvation: if a lower-priority process is deprived of the CPU, its vruntime will become the smallest, hence its key will be the smallest, so it moves to the left of the tree quickly and is therefore scheduled.
It would indeed starve.
There are many ways of dealing with such scenario.
Aging: the longer a process is in the system, the more its priority is increased.
Scheduling algorithms that give every process a time quantum on the CPU. The time quantum varies; usually, interactive processes are given a smaller quantum, as they spend more time doing I/O, while time-consuming/computational processes are given a bigger quantum.
After a process runs out its time quantum, it is put in an expired queue until there are no active processes in the system.
Then the expired queue becomes the active queue and vice versa.
These are two ways of preventing starvation.
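As a rough illustration of the last two points, here is a self-contained sketch of an active/expired queue swap, in the spirit of the pre-CFS O(1) scheduler (purely illustrative, not current kernel code):

struct prio_array {
    unsigned int nr_active;     /* runnable tasks still holding a quantum */
    /* ... per-priority task lists would live here ... */
};

struct runqueue {
    struct prio_array *active;  /* tasks that still have quantum left      */
    struct prio_array *expired; /* tasks that have used up their quantum   */
    struct prio_array  arrays[2];
};

/* Once every active task has consumed its quantum, the roles of the two
 * arrays are swapped, so previously expired tasks run again and cannot
 * starve behind a stream of fresh higher-priority arrivals. */
static void swap_if_active_empty(struct runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;
        rq->active  = rq->expired;
        rq->expired = tmp;
    }
}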

Why do we define a scheduler timeslice in CFS as well?

To be specific, I am talking about Linux kernel Scheduling system after CFS patch merged.
Everywhere it is mentioned that in CFS (the Completely Fair Scheduler) there is no fixed timeslice for a process; the timeslice is calculated by dividing the targeted latency equally among the processes running in the system, as if they were executing in parallel in hardware.
Still, why do we define a scheduler timeslice in the kernel?
http://lxr.free-electrons.com/source/include/linux/sched/rt.h#L62
As the comment at that link says, that is the default time slice. For each scheduler class implemented, the value of the time slice may differ, where it makes sense.
For example, the realtime scheduler uses that default time slice for the SCHED_RR policy, whereas for the SCHED_FIFO policy the time slice is 0, because SCHED_FIFO tasks are not time-sliced at all: they run until they block, yield, or are preempted by a higher-priority task.
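For reference, the definition behind that link looks roughly like this in 4.x kernels (reproduced approximately; HZ is given an example value here only to make the fragment self-contained):

#define HZ           250                  /* example kernel configuration value */
/* Approximation of include/linux/sched/rt.h: the default SCHED_RR timeslice
 * is 100 ms, converted to timer ticks (jiffies). */
#define RR_TIMESLICE (100 * HZ / 1000)    /* = 25 ticks with HZ = 250 */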
In the case of Completely Fair Scheduling, the time slice is computed in get_rr_interval_fair by calling sched_slice. It computes the slice based on the number of running tasks and the task's weight (which in turn is determined by the process's nice level).
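A hedged sketch of what that boils down to (names mirror sched_slice(), but the real function walks the scheduling-entity hierarchy and uses the kernel's fixed-point weight arithmetic):

#include <stdint.h>

struct load_weight  { uint64_t weight; };
struct sched_entity { struct load_weight load; };
struct cfs_rq       { struct load_weight load; };  /* load.weight = sum of queued weights */

/* Simplified model of sched_slice(): the scheduling period is shared out
 * among the runnable entities in proportion to their weights. */
static uint64_t sched_slice_approx(uint64_t period_ns,
                                   const struct cfs_rq *cfs_rq,
                                   const struct sched_entity *se)
{
    return period_ns * se->load.weight / cfs_rq->load.weight;
}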

Linux HZ and fair schedule timeslice

In sched_fair.c it has:
unsigned int sysctl_sched_latency = 5000000ULL;         // 5 ms
unsigned int sysctl_sched_min_granularity = 1000000ULL; // 1 ms
I understand that the Linux fair timeslice varies depending on nr_running and the relative weight of the fair task, but through code study I figured out that the main idea is to keep the timeslice within 1 to 5 ms. Please correct me if I understand it wrong - I must be wrong here, but I just cannot figure out how!
Also, HZ - the number of system ticks per second, i.e. the number of timer interrupts every second - is normally 100 or 200 on ARM machines (and most non-desktop machines too), which gives us a 5 to 10 ms tick period.
The timeslice is put into action by starting rq->hrtick_timer in set_next_entity() every time a fair task is scheduled to run, and by invoking resched_task() in the timeout callback function hrtick(). This timer is simply one of the queued timers that are processed by the timer IRQ handler on every tick, timer_tick()...run_local_timer(). There seems to be no other hidden secret.
Then how can we get a timeslice shorter than 5 ms? Please help me understand this. Thank you very much!
As stated in Robert Love's Linux Kernel Development, the only way to get a shorter timeslice is to increase the number of running processes (or to run ones with lower priority than others).
Increasing the number of running processes creates the need for a shorter timeslice in order to guarantee the appropriate target latency (but the timeslice is bounded below by the minimum granularity). However, there is no guarantee that a process will be preempted exactly within the given timeslice, because time accounting is driven by timer interrupts.
Increasing the value of HZ makes timer interrupts happen more frequently, which makes time accounting more precise, so rescheduling may occur more frequently.
The vruntime variable stores the virtual runtime of a process, which is the actual runtime normalized by the number of runnable processes. On an ideal multitasking system the vruntime of all processes would be identical—all tasks would have received an equal, fair share of the processor.
Typically the timeslice is the target latency divided by the number of running processes. But as the number of running processes approaches infinity, the timeslice approaches 0. As this would eventually result in unacceptable switching costs, CFS imposes a floor on the timeslice assigned to each process. This floor is called the minimum granularity. So the timeslice is a value between sysctl_sched_latency and sysctl_sched_min_granularity. (See sched_slice().)
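A simplified sketch of that period/floor logic, using the sysctl values quoted in the question (the real __sched_period() works along the same lines, though the exact constants differ between kernel versions):

#include <stdint.h>

static const uint64_t sysctl_sched_latency         = 5000000ULL; /* 5 ms */
static const uint64_t sysctl_sched_min_granularity = 1000000ULL; /* 1 ms */

/* If dividing the target latency among nr_running tasks would give each
 * task less than the minimum granularity, stretch the period instead. */
static uint64_t sched_period_approx(unsigned int nr_running)
{
    uint64_t nr_latency = sysctl_sched_latency / sysctl_sched_min_granularity;

    if (nr_running > nr_latency)
        return (uint64_t)nr_running * sysctl_sched_min_granularity;
    return sysctl_sched_latency;
}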
The vruntime variable is managed by update_curr(). update_curr() is invoked periodically by the system timer and also whenever a process becomes runnable or blocks, becoming unrunnable.
To drive preemption between tasks, hrtick() calls task_tick_fair() on each timer interrupt, which in turn calls entity_tick(). entity_tick() calls update_curr() to update the process's vruntime and then calls check_preempt_tick(). check_preempt_tick() checks whether the current runtime is greater than the ideal runtime (the timeslice) and, if so, calls resched_task(), which sets the TIF_NEED_RESCHED flag.
When TIF_NEED_RESCHED is set, schedule() gets called at the nearest possible occasion.
So, with an increasing value of HZ, timer interrupts happen more frequently, making time accounting more precise and allowing the scheduler to reschedule tasks more frequently.
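A minimal model of the check described above (the real check_preempt_tick() handles a few additional corner cases):

#include <stdbool.h>
#include <stdint.h>

struct sched_entity { uint64_t sum_exec_runtime, prev_sum_exec_runtime; };

/* Simplified model of check_preempt_tick(): has 'curr' already run for its
 * ideal slice since it was last picked? If so, the kernel would call
 * resched_task() to set TIF_NEED_RESCHED. */
static bool wants_resched(const struct sched_entity *curr, uint64_t ideal_runtime_ns)
{
    uint64_t delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    return delta_exec > ideal_runtime_ns;
}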

Linux CFS (Completely Fair Scheduler) latency

I am a beginner to the Linux Kernel and I am trying to learn how Linux schedules processes.
I have read some books on the Linux Kernel and gone through the links from IBM http://www.ibm.com/developerworks/linux/library/l-cfs/ and all, but I am still left with some doubts.
How does the scheduler schedule all of the tasks within the sysctl_sched_latency time?
When a process wakes up what actually is done in the place_entity function?
When a process wakes up why is the vruntime adjusted by subtracting from sched_latency? Can't that lead to processes in the run queue with large differences in the vruntime value?
Firstly, the virtual runtime of a task:
in theory, is when the task would start its next time slice of execution on a theoretically perfect multi-threaded CPU;
in practice, is its actual runtime normalized to the total number of running tasks.
1. How does the scheduler schedule all of the tasks within the
sysctl_sched_latency time?
It maintains a time-ordered red-black tree, where all the runnable tasks are sorted by their virtual runtime. Nodes on the left have run for the shortest amount of time.
CFS picks the leftmost task and runs it until the task schedules (blocks or yields) or the scheduler ticks; the CPU time it spent running is then added to its virtual runtime.
When it is no longer the leftmost node, the new task with the shortest virtual runtime is run and the old task is preempted.
2. When a process wakes up what actually is done in the place_entity function?
Short version:
When a process wakes up the place_entity function either leaves the
task's virtual runtime as it was or increases it.
Long version:
When a process wakes up, the place_entity function does the following things:
Initialise a temporary virtual runtime to the CFS runqueue's min_vruntime (the virtual runtime of the smallest task).
As sleeps shorter than a single latency don't count, initialise a threshold variable to sysctl_sched_latency.
If the GENTLE_FAIR_SLEEPERS feature is enabled, then halve the value of this variable.
Decrement the previously initialised temporary virtual runtime by this threshold value.
Ensure that the temporary virtual runtime is at least equal to the task's virtual runtime, by setting the calculated virtual runtime to the maximum of itself and the task's virtual runtime.
Set the task's virtual runtime to the temporary runtime.
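Putting those steps together, a simplified, self-contained sketch of the wakeup path of place_entity() might look like this (GENTLE_FAIR_SLEEPERS is passed as a plain flag, and the sysctl value is hard-coded to the 5 ms quoted elsewhere in this document):

#include <stdbool.h>
#include <stdint.h>

#define SYSCTL_SCHED_LATENCY 5000000ULL   /* 5 ms, as quoted earlier */

struct cfs_rq       { uint64_t min_vruntime; };
struct sched_entity { uint64_t vruntime; };

static inline uint64_t max_vruntime(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* Simplified wakeup path of place_entity(): credit the sleeper at most one
 * (or half a) scheduling latency, but never move its vruntime backwards. */
static void place_entity_wakeup(struct cfs_rq *cfs_rq, struct sched_entity *se,
                                bool gentle_fair_sleepers)
{
    uint64_t vruntime = cfs_rq->min_vruntime;
    uint64_t thresh   = SYSCTL_SCHED_LATENCY;

    if (gentle_fair_sleepers)
        thresh >>= 1;            /* only credit half a latency */

    vruntime -= thresh;
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}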
3. When a process wakes up why is the vruntime adjusted by subtracting from sched_latency?
The virtual runtime is decremented because sleeps less than a single latency don't count.
E.g., the task shouldn't have its position in the red-black tree changed if it has only slept for a single scheduler latency.
4. Can't that lead to processes in the run queue with large differences in the vruntime value?
I believe that the logic described in Step 3 for Question 2 prevents, or at least minimises, that.
References
sched Linux Kernel Source
sched_fair.c Linux Kernel Source
Notes on the CFS Scheduler Design
