To be specific, I am talking about the Linux kernel scheduling system after the CFS patch was merged.
Everywhere it is mentioned that in CFS (the Completely Fair Scheduler) there is no fixed timeslice per process; instead, the timeslice is calculated by dividing time equally among the runnable processes in the system, as if they were executing in parallel on the hardware. The figure explains this further.
Still, why do we define a scheduler timeslice in the kernel?
http://lxr.free-electrons.com/source/include/linux/sched/rt.h#L62
Like the comment in the link says, that is the default time slice. Each scheduler implementation may use a different time slice value where it makes sense.
For example, in the real-time scheduler with the SCHED_RR policy, you can see that a default time slice is used, whereas for the SCHED_FIFO policy the time slice is 0, because tasks with the SCHED_FIFO policy are not timesliced at all: they run until they block, yield, or are preempted by a higher priority task.
In the case of Completely Fair Scheduling, the time slice is computed in get_rr_interval_fair by calling sched_slice. It computes the slice based on the number of running tasks and their weights (which in turn are determined by each process's nice level).
Related
I have read that the Linux kernel contains many scheduler classes, each having its own priority. To select a new process to run, the process scheduler iterates from the highest priority class to the lowest. If a runnable process is found in a class, the highest priority process from that class is selected to run.
Extract from Linux Kernel Development by Robert Love:
The main entry point into the process scheduler is the function
schedule(), defined in kernel/sched.c. This is the function that the
rest of the kernel uses to invoke the process scheduler, deciding
which process to run and then running it. schedule() is generic with
respect to scheduler classes. That is, it finds the highest priority
scheduler class with a runnable process and asks it what to run next.
Given that, it should be no surprise that schedule() is simple. The
only important part of the function—which is otherwise too
uninteresting to reproduce here—is its invocation of pick_next_task(),
also defined in kernel/sched.c. The pick_next_task() function goes
through each scheduler class, starting with the highest priority, and
selects the highest priority process in the highest priority class.
Let's imagine the following scenario. There are some processes waiting in lower priority classes, while processes are continuously being added to higher priority classes. Won't the processes in the lower priority classes starve?
The Linux kernel implements the Completely Fair Scheduling algorithm, which is based on a virtual clock.
Each scheduling entity has a sched_entity structure associated with it, a snapshot of which looks like:
struct sched_entity {
...
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
...
};
The above four attributes are used to track the runtime of a process; using these attributes, along with update_curr() (where they are updated), the virtual clock is implemented.
When a process is assigned to a CPU, exec_start is updated to the current time, and the consumed CPU time is accumulated in sum_exec_runtime. When the process is taken off the CPU, the sum_exec_runtime value is preserved in prev_sum_exec_runtime. sum_exec_runtime is cumulative, meaning it grows monotonically.
vruntime stores the amount of time that has elapsed on the virtual clock while the process was executing.
How is vruntime calculated?
Ignoring all the complex corner cases, the core of the calculation is:
vruntime += delta_exec_weighted;
delta_exec_weighted = delta_exec * (NICE_0_LOAD/load.weight);
Here delta_exec is the time between the process being assigned to the CPU and being taken off it, whereas load.weight is the weight of the process, which depends on its priority (nice value). Usually an increase of 1 in a process's nice value means it gets about 10 percent less CPU time, resulting in a lower weight.
A process with nice value 0 has weight = 1024.
A process re-niced to value 1 has weight = 1024/1.25 = 820 (approx).
Points drawn from the above:
vruntime increases when a process gets the CPU.
vruntime increases more slowly for higher priority processes than for lower priority processes.
The runqueue is maintained as a red-black tree, and each runqueue has a min_vruntime variable associated with it that holds the smallest vruntime among all the processes in the runqueue. (min_vruntime can only increase, never decrease, as processes get scheduled.)
The key for a node in the red-black tree is process->vruntime - min_vruntime.
When the scheduler is invoked, the kernel basically picks the task with the smallest key (the leftmost node) and assigns it the CPU.
Elements with a smaller key are placed further to the left, and thus get scheduled sooner.
When a process runs, its vruntime steadily increases, so it gradually moves rightwards in the red-black tree.
Because vruntime increases more slowly for more important processes, they also move rightwards more slowly, so their chance of being scheduled is greater than that of a less important process - just as required.
If a process sleeps, its vruntime remains unchanged. Because the per-queue min_vruntime increases in the meantime, the sleeping process will be placed further to the left after waking up, since its key (mentioned above) got smaller.
Therefore there is no chance of starvation: if a lower priority process is deprived of the CPU, its vruntime will be the smallest, hence its key will be the smallest, so it moves to the left of the tree quickly and is therefore scheduled.
It would indeed starve.
There are many ways of dealing with such a scenario.
Aging: the longer a process stays in the system, the higher its priority gets.
Scheduling algorithms that give every process a time quantum on the CPU. The time quantum varies; usually, interactive processes are given a smaller quantum, as they spend more time doing I/O, while time-consuming/computational processes are given a bigger quantum.
After a process uses up its time quantum, it is put in an expired queue until there are no active processes left in the system.
Then the expired queue becomes the active queue and vice versa.
These are two ways of preventing starvation.
I am trying to implement a scheduler for the Linux kernel (version 2.4.27) and I am trying to find out whether a task is CPU bound or I/O bound. Are there any variables/functions I can use to get this information?
If we are talking about the O(1) scheduler:
A process can be determined as CPU bound or I/O bound based on the timeslice it runs on the CPU.
Every process has its default timeslice (100 ms) set before it is allowed to be scheduled on a processor.
A process is called CPU bound if it consumes its full time slice, i.e. runs on the processor for the entire time slice.
Conversely, a process that does not consume its entire timeslice, but calls sched_yield before its timeslice runs out, or waits/sleeps for some event to occur (in which case the scheduler moves it to a sleep queue), is waiting for I/O and is an I/O bound process.
Every such CPU bound process is penalized in priority, keeping the time slice the same, and every such I/O bound process is rewarded with a priority bonus, again keeping the time slice the same.
So, on a GPOS (General Purpose Operating System), it is the effective_priority, or dynamic_priority, that tells you whether a process is well-behaved (I/O bound) or ill-behaved (CPU bound), as the default priority of a new process will be 20 unless it is altered otherwise.
There are some parameters based on which, you can determine the same.
effective_prio: Returns the effective priority of a task (based on the static priority, but includes any rewards or penalties).
recalc_task_prio: Determines a task's bonus or penalty based on its idle time.
Ref: https://www.cs.columbia.edu/~smb/classes/s06-4118/l13.pdf
I have been reading about the Linux kernel and the CFS scheduler in it. I came across vruntime (virtual runtime), which is the core concept behind the CFS scheduler. I read "Linux Kernel Development" and also other blogs on the internet, but could not understand the basic calculations behind vruntime. Does vruntime belong to a particular process, or to a group of processes with the same nice value? What is the weighting factor and how is it calculated? I went through all these concepts but could not understand them. Also, what is the difference between vruntime and min_vruntime?
vruntime is per-thread; it is a member nested within the task_struct (via the embedded sched_entity).
Essentially, vruntime is a measure of the "runtime" of the thread - the amount of time it has spent on the processor. The whole point of CFS is to be fair to all; hence, the algorithm boils down to a simple rule: among the tasks on a given runqueue, the task with the lowest vruntime is the task that most deserves to run, so select it as 'next'. (The actual implementation uses an rbtree for efficiency.)
Taking into account various factors - like priority, nice value, cgroups, etc. - the calculation of vruntime is not as straightforward as a simple increment. I'd suggest reading the relevant section in "Professional Linux Kernel Architecture", Mauerer, Wrox Press - it's explained there in great detail.
Please see below for a quick attempt at summarizing some of this.
Other resource:
Documentation/scheduler/sched-design-CFS.txt
Quick summary - vruntime calculation:
(based on the book)
Most of the work is done in kernel/sched_fair.c:__update_curr()
Called on timer tick
Updates the physical and virtual time 'current' has just spent on the processor
For tasks that run at default priority, i.e., nice value 0, the physical and virtual time spent is identical
Not so for tasks at other priority (nice) levels; thus the calculation of vruntime is affected by the priority of current using a load weight factor
delta_exec = (unsigned long)(now - curr->exec_start);
// ...
delta_exec_weighted = calc_delta_fair(delta_exec, curr);
curr->vruntime += delta_exec_weighted;
Neglecting some rounding and overflow checking, what calc_delta_fair does is to
compute the value given by the following formula:
delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)
The thing is, more important tasks (those with a lower nice value) will have larger
weights; thus, by the above equations, the vruntime accounted to them will be smaller
(thus having them enqueued more to the left on the rbtree!).
The vruntime is the virtual runtime of a process, which helps track how long a process has run. vruntime is a member of the sched_entity structure defined in include/linux/sched.h.
The min_vruntime represents the minimum vruntime of a CFS runqueue, i.e. the minimum of the vruntime values of all processes scheduled on that CFS runqueue. min_vruntime is a member of the cfs_rq structure, defined in kernel/sched/sched.h.
The purpose of min_vruntime is to help select the next process to run on the CFS runqueue. To be fair to all processes, the CFS scheduler selects the process with the minimum vruntime to execute first.
The link to include/linux/sched.h is: https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h
When we write a program, we do not specify the nature of the process, e.g. whether it is realtime or interactive. I read that the Linux kernel schedules based on the nature of the process, but I couldn't find an article explaining how Linux decides that. It would be nice if someone could give some information on that. The question is of academic interest only.
I have read that I can use the system call sched_setscheduler to set the scheduler. But what happens when the call is not made?
Also, how does the scheduler decide whether a process is interactive or batch?
When sched_setscheduler is not called, the default scheduling policy is used, which is SCHED_OTHER. That means the scheduler is round robin/time sharing; in other words, threads are run in a round robin fashion, and the time sharing part means that tasks will sometimes be swapped out (preempted) if they do not give up the CPU voluntarily, in order to allow other threads execution time. Additionally, there is no notion of a static (realtime) priority with this scheduling policy; only the nice value influences how threads are favored.
http://linux.die.net/man/2/sched_setscheduler
Ok. Found the answer from this link.
Dynamic priority bonuses and penalties are based on interactivity
heuristics. This heuristic is implemented by keeping track of how much
time tasks spend sleeping (presumably blocked on I/O) as opposed to
running. Tasks that are I/O-bound tend to sleep quite a bit as they
block on I/O, whereas CPU-bound tasks rarely sleep as they rarely block
on I/O. Quite often, tasks are in the middle, and are not entirely
CPU-bound or I/O-bound, so the heuristic produces some sort of scale
instead of a simple binary label (I/O-bound or CPU-bound).
In sched_fair.c it has:
unsigned int sysctl_sched_latency = 5000000ULL // 5 ms
unsigned int sysctl_sched_min_granularity = 1000000ULL // 1 ms
I understand that the Linux fair timeslice varies depending on nr_running and the relative weight of the fair task, but from studying the code I figured out that the main idea is to keep the timeslice between 1 and 5 ms. Please correct me if I have understood this wrong. I must be wrong here but I just cannot figure out how!
Also, HZ, i.e. the number of system ticks per second (the number of timer interrupts every second), is normally 100 or 200 on ARM machines (and most non-desktop machines too), which gives a 5 to 10 ms tick period.
The timeslice is put into action by starting rq->hrtick_timer in set_next_entity() every time a fair task is scheduled to run, and by invoking resched_task() in the timeout callback function hrtick(). This timer is simply one of the queued timers that are processed by the timer irq handler on every tick, timer_tick()...run_local_timer(). There seems to be no other hidden secret.
Then how can we get a timeslice shorter than 5 ms? Please help me understand this. Thank you very much!
As stated in Robert Love's Linux Kernel Development, the only way to get a shorter timeslice is to increase the number of running processes (or of ones with lower priority than the others).
Increasing the number of running processes creates a need to shorten the timeslice in order to guarantee the target latency (but the timeslice is bounded below by the minimum granularity). However, there is no guarantee that a process will be preempted exactly at its timeslice, because time accounting is driven by timer interrupts.
Increasing the value of HZ makes timer interrupts happen more frequently, which makes time accounting more precise, so rescheduling may occur more frequently.
The vruntime variable stores the virtual runtime of a process, which is the actual runtime normalized by the number of runnable processes. On an ideal multitasking system, the vruntime of all processes would be identical—all tasks would have received an equal, fair share of the processor.
Typically the timeslice is the target latency divided by the number of running processes. But as the number of running processes approaches infinity, the timeslice approaches 0. Since this would eventually result in unacceptable switching costs, CFS imposes a floor on the timeslice assigned to each process. This floor is called the minimum granularity. So the timeslice is a value between sysctl_sched_latency and sysctl_sched_min_granularity. (See sched_slice().)
The vruntime variable is maintained by update_curr(), which is invoked periodically by the system timer and also whenever a process becomes runnable, or blocks and becomes unrunnable.
To drive preemption between tasks, hrtick() calls task_tick_fair() on each timer interrupt, which in turn calls entity_tick(). entity_tick() calls update_curr() to update the process's vruntime and then calls check_preempt_tick(). check_preempt_tick() checks whether the current runtime is greater than the ideal runtime (the timeslice); if so, it calls resched_task(), which sets the TIF_NEED_RESCHED flag.
When TIF_NEED_RESCHED is set, schedule() gets called on the nearest possible occasion.
So, with an increasing value of HZ, timer interrupts happen more frequently, making time accounting more precise and allowing the scheduler to reschedule tasks more frequently.