I have read that the Linux kernel contains many scheduler classes, each having its own priority. To select a new process to run, the process scheduler iterates from the highest-priority class to the lowest-priority class. If a runnable process is found in a class, the highest-priority process is selected from that class.
Extract from Linux Kernel Development by Robert Love:
The main entry point into the process scheduler is the function schedule(), defined in kernel/sched.c. This is the function that the rest of the kernel uses to invoke the process scheduler, deciding which process to run and then running it. schedule() is generic with respect to scheduler classes. That is, it finds the highest priority scheduler class with a runnable process and asks it what to run next. Given that, it should be no surprise that schedule() is simple. The only important part of the function—which is otherwise too uninteresting to reproduce here—is its invocation of pick_next_task(), also defined in kernel/sched.c. The pick_next_task() function goes through each scheduler class, starting with the highest priority, and selects the highest priority process in the highest priority class.
Let's imagine the following scenario: some processes are waiting in lower-priority classes while processes are continuously being added to higher-priority classes. Won't the processes in the lower-priority classes starve?
The Linux kernel implements the Completely Fair Scheduling (CFS) algorithm, which is based on a virtual clock.
Each scheduling entity has a sched_entity structure associated with it, a snapshot of which looks like:
struct sched_entity {
    ...
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    ...
};
The above four attributes are used to track the runtime of a process; using these attributes, along with a few helper functions (chiefly update_curr(), where they are updated), the virtual clock is implemented.
When a process is assigned to a CPU, exec_start is updated to the current time, and the consumed CPU time is recorded in sum_exec_runtime. When the process is taken off the CPU, the sum_exec_runtime value is preserved in prev_sum_exec_runtime. sum_exec_runtime is accumulated cumulatively, meaning it grows monotonically.
vruntime stores the amount of time that has elapsed on the virtual clock during the process's execution.
How is vruntime calculated?
Ignoring all the complex calculations, the core concept of how it is calculated is:
vruntime += delta_exec_weighted;
delta_exec_weighted = delta_exec * (NICE_0_LOAD/load.weight);
Here delta_exec is the time the process actually spent on the CPU (the difference between when it was assigned to the CPU and when it was taken off), whereas load.weight is the weight of the process, which depends on its priority (nice value). Usually, an increase of 1 in a process's nice value means it gets about 10 percent less CPU time, resulting in a lower weight.
A process with nice value 0 has weight = 1024.
A process re-niced to value 1 has weight = 1024/1.25 = 820 (approx).
Points drawn from the above:
vruntime increases when a process gets the CPU.
vruntime increases more slowly for higher-priority processes than for lower-priority ones (see the sketch below).
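To make the arithmetic concrete, here is a minimal userspace C sketch (illustration only, not kernel code; the 10 ms slice and the weights 1024 and 820 are taken from the examples above) that applies the formula to two tasks:

#include <stdio.h>

#define NICE_0_LOAD 1024ULL

/* Illustration: both tasks run for the same 10 ms of wall-clock time,
 * but vruntime advances at a rate scaled by the task's load weight. */
int main(void)
{
    unsigned long long delta_exec = 10000000ULL;        /* 10 ms in ns */
    unsigned long long weights[] = { 1024ULL, 820ULL }; /* nice 0, nice 1 */

    for (int i = 0; i < 2; i++) {
        unsigned long long delta_exec_weighted =
            delta_exec * NICE_0_LOAD / weights[i];
        printf("weight %4llu: vruntime += %llu ns\n",
               weights[i], delta_exec_weighted);
    }
    /* The nice-1 task's vruntime grows about 25% faster, so it drifts
     * rightwards in the tree and is picked less often. */
    return 0;
}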
The run queue is maintained as a red-black tree, and each run queue has a min_vruntime variable associated with it that holds the smallest vruntime among all the processes in the run queue. (min_vruntime can only increase, not decrease, as processes are scheduled.)
The key for a node in the red-black tree is process->vruntime - min_vruntime.
When the scheduler is invoked, the kernel basically picks up the task with the smallest key (the leftmost node) and assigns it the CPU.
Elements with smaller keys are placed more to the left, and thus are scheduled more quickly.
When a process is running, its vruntime will steadily increase, so it will gradually move rightwards in the red-black tree.
Because vruntime increases more slowly for more important processes, they will also move rightwards more slowly, so their chance of being scheduled is bigger than that of a less important process - just as required.
If a process sleeps, its vruntime will remain unchanged. Because the per-queue min_vruntime will increase in the meantime, the sleeper process will be placed more to the left after waking up, because its key (mentioned above) got smaller.
Therefore there is no chance of starvation: if a lower-priority process is deprived of the CPU, its vruntime will become the smallest, hence its key will be the smallest, so it moves to the left of the tree quickly and is therefore scheduled.
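To see why the leftmost pick prevents starvation, here is a small, purely illustrative C simulation (the weights, task names, and the 1 ms slice are invented for the example): each round picks the task with the smallest vruntime, mimicking the leftmost node, then charges it a weighted slice. The low-priority task keeps getting turns, just proportionally fewer of them.

#include <stdio.h>

#define NICE_0_LOAD 1024
#define NTASKS 3

struct task {
    const char *name;
    unsigned long weight;           /* higher = more important */
    unsigned long long vruntime;
};

int main(void)
{
    /* Two important tasks and one low-priority task (weights invented;
     * 335 is roughly nice +5 in the kernel's weight table). */
    struct task tasks[NTASKS] = {
        { "high-A", 2048, 0 },
        { "high-B", 2048, 0 },
        { "low",     335, 0 },
    };
    unsigned long long slice = 1000000; /* 1 ms of real time per turn */

    for (int round = 0; round < 20; round++) {
        /* Pick the task with the smallest vruntime ("leftmost node"). */
        int next = 0;
        for (int i = 1; i < NTASKS; i++)
            if (tasks[i].vruntime < tasks[next].vruntime)
                next = i;
        /* Charge a weighted slice, as in the formula above. */
        tasks[next].vruntime += slice * NICE_0_LOAD / tasks[next].weight;
        printf("round %2d: ran %-6s (vruntime now %llu)\n",
               round, tasks[next].name, tasks[next].vruntime);
    }
    /* "low" still runs periodically (rounds 2 and 15 here): it is never
     * starved, just given proportionally fewer slices. */
    return 0;
}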
It would indeed starve.
There are many ways of dealing with such a scenario.
Aging: the longer a process stays in the system, the more its priority is increased.
Scheduling algorithms that give every process a time quantum to use the CPU. The time quantum varies; usually, interactive processes are given a smaller time quantum, as they spend more time doing I/O, while time-consuming/computational processes are given a bigger time quantum.
After a process uses up its time quantum, it is put in an expired queue until there are no active processes left in the system.
Then, the expired queue becomes the active queue and vice versa (see the sketch below).
These are two ways of preventing starvation.
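As a toy illustration of the second approach, in the spirit of the O(1) scheduler's two-array design (the task names and tick count here are invented, not taken from any kernel), a sketch might look like:

#include <stdio.h>

#define MAXT 8

/* Toy model of active/expired queues: a task that uses up its quantum
 * moves to the expired queue; when the active queue empties, the two
 * queues swap roles, so no task waits forever. */
struct queue { const char *tasks[MAXT]; int n; };

static const char *pop(struct queue *q)          { return q->tasks[--q->n]; }
static void push(struct queue *q, const char *t) { q->tasks[q->n++] = t; }

int main(void)
{
    struct queue a = { { "editor", "compiler", "daemon" }, 3 };
    struct queue b = { { 0 }, 0 };
    struct queue *active = &a, *expired = &b;

    for (int tick = 0; tick < 6; tick++) {
        if (active->n == 0) {           /* active drained: swap queues */
            struct queue *tmp = active;
            active = expired;
            expired = tmp;
            printf("-- swap: expired queue becomes active --\n");
        }
        const char *t = pop(active);    /* run one task for its quantum */
        printf("tick %d: ran %s for its quantum\n", tick, t);
        push(expired, t);               /* quantum used up: move to expired */
    }
    return 0;
}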
Related
I am trying to implement a scheduler for the Linux kernel (version 2.4.27) and I am trying to find out whether a task is CPU bound or I/O bound. Are there any variables/functions which I can use to get this information?
If talking about the O(1) scheduler:
A process can be determined to be CPU bound or I/O bound based on how much of its timeslice it consumes on the CPU.
Every process has a default timeslice (100 ms) set before it is allowed to be scheduled on a processor.
A process is called CPU bound if it consumes its full timeslice (i.e., runs on the processor for the entire timeslice).
Conversely, any process that doesn't consume its entire timeslice, but calls sched_yield() before its timeslice has run out, or waits/sleeps for some event to occur (in which case the scheduler is invoked to move it to the sleeping queue, meaning it is waiting for some I/O to happen), is an I/O bound process.
Every such CPU bound process is penalized with a priority penalty, keeping the timeslice the same, and every such I/O bound process is rewarded with a priority bonus, again keeping the timeslice the same.
So, on a GPOS (General Purpose Operating System), it is the effective_priority, or dynamic_priority, that will tell you whether a process is well-behaved (I/O bound) or ill-behaved (CPU bound), as the default priority will be 20 for a new process unless it is altered otherwise.
There are some parameters based on which you can determine the same:
effective_prio: Returns the effective priority of a task (based on the static priority, but includes any rewards or penalties).
recalc_task_prio: Determines a task's bonus or penalty based on its idle time.
Ref: https://www.cs.columbia.edu/~smb/classes/s06-4118/l13.pdf
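A purely illustrative userspace sketch of that heuristic (the struct fields, the one-step bonus/penalty, and the task names are invented for the example, not the actual O(1) scheduler code):

#include <stdio.h>

/* Toy classification: a task that keeps exhausting its timeslice is
 * treated as CPU bound and penalized; a task that yields/sleeps early
 * is treated as I/O bound and rewarded. All names/values are invented. */
struct task {
    const char *name;
    int dynamic_prio;       /* lower value = higher priority here */
    int slice_ms;           /* timeslice, kept constant */
};

static void account(struct task *t, int ran_ms)
{
    if (ran_ms >= t->slice_ms)
        t->dynamic_prio += 1;   /* consumed full slice: CPU bound, penalty */
    else
        t->dynamic_prio -= 1;   /* slept/yielded early: I/O bound, bonus */
}

int main(void)
{
    struct task crunch = { "number-cruncher", 20, 100 };
    struct task shell  = { "shell",           20, 100 };

    for (int i = 0; i < 5; i++) {
        account(&crunch, 100);  /* always burns the whole 100 ms slice */
        account(&shell,  5);    /* mostly waits for keystrokes */
    }
    printf("%s: dynamic priority %d (ill-behaved, CPU bound)\n",
           crunch.name, crunch.dynamic_prio);
    printf("%s: dynamic priority %d (well-behaved, I/O bound)\n",
           shell.name, shell.dynamic_prio);
    return 0;
}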
I have been reading about the Linux kernel and the CFS scheduler in the kernel. I came across vruntime (virtual runtime), which is the core concept behind the CFS scheduler. I read "Linux Kernel Development" and also other blogs on the internet but could not understand the basic calculations behind vruntime. Does vruntime belong to a particular process or does it belong to a group of processes with the same nice value? What is the weighting factor and how is it calculated? I went through all these concepts but could not understand them. Also, what is the difference between vruntime and min_vruntime?
vruntime is per-thread; it is a member of the sched_entity structure, which is nested within the task_struct.
Essentially, vruntime is a measure of the "runtime" of the thread - the amount of time it has spent on the processor. The whole point of CFS is to be fair to all; hence, the algo kind of boils down to a simple thing: (among the tasks on a given runqueue) the task with the lowest vruntime is the task that most deserves to run, hence select it as 'next'. (The actual implementation is done using an rbtree for efficiency).
Taking into account various factors - like priority, nice value, cgroups, etc. - the calculation of vruntime is not as straightforward as a simple increment. I'd suggest reading the relevant section in "Professional Linux Kernel Architecture", Mauerer, Wrox Press - it's explained in great detail.
Please see below for a quick attempt at summarizing some of this.
Other resource:
Documentation/scheduler/sched-design-CFS.txt
Quick summary - vruntime calculation:
(based on the book)
Most of the work is done in kernel/sched_fair.c:__update_curr()
Called on timer tick
Updates the physical and virtual time 'current' has just spent on the processor
For tasks that run at default priority, i.e., nice value 0, the physical and virtual time spent is identical
Not so for tasks at other priority (nice) levels; thus the calculation of vruntime is affected by the priority of current using a load weight factor
delta_exec = (unsigned long)(now - curr->exec_start);
// ...
delta_exec_weighted = calc_delta_fair(delta_exec, curr);
curr->vruntime += delta_exec_weighted;
Neglecting some rounding and overflow checking, what calc_delta_fair does is to compute the value given by the following formula:
delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)
The thing is, more important tasks (those with a lower nice value) will have larger weights; thus, by the above equations, the vruntime accounted to them will be smaller (thus having them enqueued more to the left on the rbtree!).
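To connect this with the "10 percent per nice level" rule mentioned earlier in the thread, here is a tiny sketch using weights for nice -2 through +2 that match the kernel's prio_to_weight table; adjacent levels differ by roughly a factor of 1.25:

#include <stdio.h>

int main(void)
{
    /* Weights for nice -2..+2, as in the kernel's prio_to_weight table. */
    int nice[] = { -2, -1, 0, 1, 2 };
    double w[] = { 1586, 1277, 1024, 820, 655 };

    for (int i = 0; i < 4; i++)
        printf("nice %+d -> %+d: weight ratio %.3f (~1.25)\n",
               nice[i], nice[i + 1], w[i] / w[i + 1]);
    return 0;
}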
The vruntime is the virtual runtime of a process, which helps track how much time a process has run. vruntime is a member of the sched_entity structure, defined in include/linux/sched.h.
The min_vruntime represents the minimum vruntime of a CFS runqueue, i.e., the smallest of the vruntime values of the processes scheduled on that CFS runqueue. min_vruntime is a member of the cfs_rq structure, defined in kernel/sched/sched.h.
The purpose of min_vruntime is to help select the next process in the CFS runqueue to run. In order to be fair to all processes, the CFS scheduler selects the process with the minimum vruntime to execute first.
The link to include/linux/sched.h is: https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h
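Here is a hedged sketch of how such a min_vruntime tracker stays monotonic (loosely modeled on the kernel's update_min_vruntime(), heavily simplified; cfs_rq_sim is a stand-in struct, not the kernel's):

#include <stdio.h>

/* Simplified stand-in for the kernel's cfs_rq: min_vruntime only ever
 * moves forward, tracking the smallest vruntime still in the queue. */
struct cfs_rq_sim {
    unsigned long long min_vruntime;
};

static unsigned long long max_vruntime(unsigned long long a,
                                       unsigned long long b)
{
    return a > b ? a : b;
}

/* Called with the smallest vruntime currently runnable; mirrors the
 * monotonic-advance idea of update_min_vruntime(), much simplified. */
static void update_min_vruntime(struct cfs_rq_sim *rq,
                                unsigned long long smallest)
{
    rq->min_vruntime = max_vruntime(rq->min_vruntime, smallest);
}

int main(void)
{
    struct cfs_rq_sim rq = { 0 };

    update_min_vruntime(&rq, 100);  /* leftmost task at vruntime 100 */
    update_min_vruntime(&rq, 250);  /* queue has advanced */
    update_min_vruntime(&rq, 180);  /* a waker with lower vruntime must
                                       not drag min_vruntime backwards */
    printf("min_vruntime = %llu\n", rq.min_vruntime); /* prints 250 */
    return 0;
}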
I have read here about situations where the scheduler is called. But what happens when a high-priority task arrives?
High-priority tasks are scheduled more often than low-priority tasks, but when a high-priority task arrives it still has to wait until the quantum of the currently running task is over.
Priority is dynamic and is adjusted based on past CPU usage.
The longer version
In Linux, process priority is dynamic. The scheduler keeps track of what processes are doing and adjusts their priorities periodically; in this way, processes that have been denied the use of the CPU for a long time interval are boosted by dynamically increasing their priority. Correspondingly, processes running for a long time are penalized by decreasing their priority.
The scheduler maintains a set of all tasks that are ready to run in the system. In a multi-priority system, the task set usually supports the notion of priority. When a high-priority task arrives in the system, it is put into this set of tasks, sorted by priority.
There are certain points in the kernel where we check whether a better process is available to run than the currently running process. This can happen when the time slice expires, when an ISR completes, when a lock is unlocked, and so on. Look for calls to switch() or _switch() or something similar... this is the routine that checks the set of tasks and determines whether the current task is the highest priority one (a generic sketch of this check follows below).
If the current task is not the highest priority task, then the current task is switched out and the highest priority task is obtained from the task set and scheduled to run.
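In pseudo-C, the check performed at those points might look like the following generic sketch (all names here are invented stand-ins, not actual kernel symbols):

#include <stdbool.h>
#include <stdio.h>

/* Generic sketch of a priority-preemption check; lower prio value =
 * higher priority. All names are illustrative, not kernel symbols. */
struct task {
    const char *name;
    int prio;
    bool need_resched;
};

/* A toy "ready set": returns the highest-priority ready task. */
static struct task *pick_highest_prio_ready(struct task *ready, int n)
{
    struct task *best = NULL;
    for (int i = 0; i < n; i++)
        if (!best || ready[i].prio < best->prio)
            best = &ready[i];
    return best;
}

/* Called at scheduling points: tick expiry, ISR exit, unlock, wakeup... */
static void check_preempt(struct task *current, struct task *ready, int n)
{
    struct task *best = pick_highest_prio_ready(ready, n);
    if (best && best->prio < current->prio)
        current->need_resched = true;   /* switch out at next opportunity */
}

int main(void)
{
    struct task current = { "batch-job", 20, false };
    struct task ready[] = { { "logger", 25, false }, { "audio", 5, false } };

    check_preempt(&current, ready, 2);
    printf("%s need_resched = %s\n", current.name,
           current.need_resched ? "true" : "false"); /* true: audio wins */
    return 0;
}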
In sched_fair.c there is:
unsigned int sysctl_sched_latency = 5000000ULL // 5 ms
unsigned int sysctl_sched_min_granularity = 1000000ULL // 1 ms
I understand that the Linux fair timeslice varies depending on nr_running and the relative weight of the fair task, but through code study I figured out that the main idea is to keep the timeslice between 1 and 5 ms. Please correct me if I understand it wrong. I must be wrong here but I just cannot figure out how!
Also, HZ, i.e. the number of system ticks per second, or the number of timer interrupts every second, is normally 200 or 100 on ARM machines (and most non-desktop machines too), which gives us a 5 to 10 ms tick period.
The timeslice is put into action by starting rq->hrtick_timer in set_next_entity() every time a fair task is scheduled to run, and invoking resched_task() in the timeout callback function hrtick(). This timer is simply one of the queued timers that are processed by the timer IRQ handler on every tick, timer_tick()...run_local_timer(). There seems to be no other hidden secret.
Then how can we get a timeslice shorter than 5 ms? Please help me understand this. Thank you very much!
As stated in Robert Love's Linux Kernel Development, the only way to get a shorter timeslice is to increase the number of running processes (or to give some of them lower priority than others).
Increasing the number of running processes creates a need to shorten the timeslice to guarantee the appropriate target latency (but the timeslice is lower-bounded by the minimum granularity). However, there is no guarantee that a process will be preempted within the given timeslice. That's because time accounting is driven by timer interrupts.
Increasing the value of HZ makes timer interrupts happen more frequently, which makes time accounting more precise, so rescheduling may occur more frequently.
The vruntime variable stores the virtual runtime of a process, which is the actual runtime normalized by the number of runnable processes. On an ideal multitasking system, the vruntime of all processes would be identical—all tasks would have received an equal, fair share of the processor.
Typically the timeslice is the target latency divided by the number of running processes. But as the number of running processes approaches infinity, the timeslice approaches 0. As this will eventually result in unacceptable switching costs, CFS imposes a floor on the timeslice assigned to each process. This floor is called the minimum granularity. So the timeslice is a value between sysctl_sched_latency and sysctl_sched_min_granularity. (See sched_slice().)
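A minimal sketch of that floor logic, using the two tunables quoted above (this simplifies what sched_slice() computes; the real function also scales the slice by the task's share of the queue's load):

#include <stdio.h>

/* Values quoted from sched_fair.c earlier in the thread. */
static unsigned long long sysctl_sched_latency         = 5000000ULL; /* 5 ms */
static unsigned long long sysctl_sched_min_granularity = 1000000ULL; /* 1 ms */

/* Simplified timeslice: target latency split across runnable tasks,
 * floored at the minimum granularity. */
static unsigned long long timeslice_ns(unsigned int nr_running)
{
    unsigned long long slice = sysctl_sched_latency / nr_running;
    if (slice < sysctl_sched_min_granularity)
        slice = sysctl_sched_min_granularity;
    return slice;
}

int main(void)
{
    for (unsigned int n = 1; n <= 8; n *= 2)
        printf("%u runnable task(s): timeslice = %llu ns\n",
               n, timeslice_ns(n));
    return 0;
}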
The vruntime variable is managed by update_curr(). update_curr() is invoked periodically by the system timer and also whenever a process becomes runnable or blocks, becoming unrunnable.
To drive preemption between tasks, hrtick() calls task_tick_fair() on each timer interrupt, which in turn calls entity_tick(). entity_tick() calls update_curr() to update the process's vruntime and then calls check_preempt_tick(). check_preempt_tick() checks whether the current runtime is greater than the ideal runtime (the timeslice); if so, it calls resched_task(), which sets the TIF_NEED_RESCHED flag.
When TIF_NEED_RESCHED is set, schedule() gets called at the nearest possible occasion.
So, with an increasing value of HZ, timer interrupts happen more frequently, causing more precise time accounting and allowing the scheduler to reschedule tasks more frequently.
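Putting the pieces together, here is a hedged sketch of that tick-driven check (simplified from the chain described above; the TIF_NEED_RESCHED flag is modeled as a plain bool, and the tick lengths correspond to HZ=100 and HZ=1000):

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of the tick path: each timer tick accounts runtime,
 * and once the task exceeds its ideal runtime (timeslice), a resched
 * flag is raised for schedule() to act on later. */
struct se_sim {
    unsigned long long runtime_ns;  /* runtime in the current slice */
    bool need_resched;              /* models TIF_NEED_RESCHED */
};

static void entity_tick(struct se_sim *se, unsigned long long tick_ns,
                        unsigned long long ideal_runtime_ns)
{
    se->runtime_ns += tick_ns;      /* update_curr(): account the tick */
    if (se->runtime_ns > ideal_runtime_ns)
        se->need_resched = true;    /* check_preempt_tick() -> resched_task() */
}

int main(void)
{
    unsigned long long ideal = 5000000; /* 5 ms timeslice */

    /* HZ = 100 -> 10 ms ticks: overshoots a 5 ms slice by a whole tick. */
    struct se_sim se = { 0, false };
    entity_tick(&se, 10000000, ideal);
    printf("HZ=100: resched after 1 tick? %s\n",
           se.need_resched ? "yes" : "no");

    /* HZ = 1000 -> 1 ms ticks: the 5 ms slice is noticed much sooner. */
    struct se_sim se2 = { 0, false };
    int ticks = 0;
    while (!se2.need_resched) {
        entity_tick(&se2, 1000000, ideal);
        ticks++;
    }
    printf("HZ=1000: resched raised after %d ticks (~%d ms)\n", ticks, ticks);
    return 0;
}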
I am a beginner to the Linux Kernel and I am trying to learn how Linux schedules processes.
I have read some books on the Linux Kernel and gone through the links from IBM http://www.ibm.com/developerworks/linux/library/l-cfs/ and all, but I am still left with some doubts.
How does the scheduler schedule all of the tasks within the sysctl_sched_latency time?
When a process wakes up what actually is done in the place_entity function?
When a process wakes up why is the vruntime adjusted by subtracting from sched_latency? Can't that lead to processes in the run queue with large differences in the vruntime value?
Firstly, the virtual runtime of a task:
in theory, is when the task would start its next time slice of execution on a theoretically perfect multi-threaded CPU;
in practice, is its actual runtime normalized to the total number of running tasks.
1. How does the scheduler schedule all of the tasks within the sysctl_sched_latency time?
It maintains a time-ordered red-black tree, where all the runnable tasks are sorted by their virtual runtime. Nodes on the left have run for the shortest amount of time. CFS picks the leftmost task and runs it until the task schedules or the scheduler tick fires; then the CPU time it spent running is added to its virtual runtime. When it is no longer the leftmost node, the new task with the shortest virtual runtime is run and the old task is preempted.
2. When a process wakes up, what actually is done in the place_entity function?
Short version:
When a process wakes up, the place_entity function either leaves the task's virtual runtime as it was or increases it.
Long version:
When a process wakes up, the place_entity function does the following things:
Initialise the temporary virtual runtime to the CFS run queue's smallest virtual runtime (min_vruntime).
As sleeps less than a single latency don't count, initialise a threshold variable to sysctl_sched_latency.
If the GENTLE_FAIR_SLEEPERS feature is enabled, then halve the value of this variable.
Decrement the previously initialised temporary virtual runtime by this threshold value.
Ensure that the temporary virtual runtime is at least equal to the task's virtual runtime, by setting the calculated virtual runtime to the maximum of itself and the task's virtual runtime.
Set the task's virtual runtime to the temporary runtime (these steps are condensed into the sketch below).
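Here is that condensed sketch (modeled on the wakeup branch of place_entity() in sched_fair.c, with surrounding details stripped; the 100 ms and 20 ms starting values are invented for the example):

#include <stdio.h>
#include <stdbool.h>

/* Condensed model of the wakeup branch of place_entity(); only the
 * fields needed for this illustration are included. */
static const unsigned long long sysctl_sched_latency = 5000000ULL; /* 5 ms */
static const bool gentle_fair_sleepers = true; /* GENTLE_FAIR_SLEEPERS */

struct se_sim     { unsigned long long vruntime; };
struct cfs_rq_sim { unsigned long long min_vruntime; };

static unsigned long long max_vruntime(unsigned long long a,
                                       unsigned long long b)
{
    return a > b ? a : b;
}

static void place_entity_wakeup(struct cfs_rq_sim *rq, struct se_sim *se)
{
    /* Step 1: start from the queue's smallest virtual runtime. */
    unsigned long long vruntime = rq->min_vruntime;

    /* Steps 2-4: credit at most one latency period; halve it when
     * GENTLE_FAIR_SLEEPERS is enabled, then subtract the threshold. */
    unsigned long long thresh = sysctl_sched_latency;
    if (gentle_fair_sleepers)
        thresh >>= 1;
    vruntime -= thresh;

    /* Steps 5-6: never move the task's vruntime backwards in time. */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

int main(void)
{
    struct cfs_rq_sim rq = { 100000000ULL };  /* min_vruntime: 100 ms */
    struct se_sim sleeper = { 20000000ULL };  /* slept for a long time */

    place_entity_wakeup(&rq, &sleeper);
    printf("sleeper placed at vruntime %llu ns "
           "(min_vruntime - latency/2)\n", sleeper.vruntime);
    return 0;
}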
3. When a process wakes up, why is the vruntime adjusted by subtracting from sched_latency?
The virtual runtime is decremented because sleeps less than a single latency don't count. E.g., the task shouldn't have its position in the red-black tree changed if it has only slept for a single scheduler latency.
4. Can't that lead to processes in the run queue with large differences in the vruntime value?
I believe that the logic described in Step 3 for Question 2 prevents, or at least minimises, that.
References
sched Linux Kernel Source
sched_fair.c Linux Kernel Source
Notes on the CFS Scheduler Design