I am a beginner with the Linux kernel and I am trying to learn how Linux schedules processes.
I have read some books on the Linux kernel and gone through the IBM links, e.g. http://www.ibm.com/developerworks/linux/library/l-cfs/, but I am still left with some doubts.
How does the scheduler schedule all of the tasks within the sysctl_sched_latency time?
When a process wakes up what actually is done in the place_entity function?
When a process wakes up why is the vruntime adjusted by subtracting from sched_latency? Can't that lead to processes in the run queue with large differences in the vruntime value?
Firstly, the virtual runtime of a task:
in theory, is when the task would start its next time slice of execution on a theoretically perfect multithreaded CPU;
in practice, is its actual runtime normalized to the total number of running tasks.
1. How does the scheduler schedule all of the tasks within the sysctl_sched_latency time?
It maintains a time-ordered red-black tree, where all the runnable tasks are sorted by their virtual runtime. Nodes on the left have run for the shortest amount of time.
CFS picks the leftmost task and runs it until the task reschedules or the scheduler ticks; the CPU time it spent running is then added to its virtual runtime.
When it is no longer the leftmost node, the task with the shortest virtual runtime is run and the old task is preempted.
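As a rough illustration, picking the next task boils down to taking the leftmost node of that tree. The sketch below is modeled on older kernel/sched_fair.c (the tasks_timeline field and the helper name here are taken from old kernel sources; recent kernels cache the leftmost node rather than walking the tree):

#include <linux/rbtree.h>

/* Sketch only: return the entity with the smallest virtual runtime. */
static struct sched_entity *pick_leftmost(struct cfs_rq *cfs_rq)
{
        struct rb_node *left = rb_first(&cfs_rq->tasks_timeline); /* smallest key */

        if (!left)
                return NULL;  /* no runnable entities */

        return rb_entry(left, struct sched_entity, run_node);
}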
2. When a process wakes up what actually is done in the place_entity function?
Short version:
When a process wakes up the place_entity function either leaves the
task's virtual runtime as it was or increases it.
Long version:
When a process wakes up, the place_entity function does the following (see the sketch after this list):
Initialise a temporary virtual runtime to the CFS run queue's min_vruntime, the virtual runtime of the smallest task.
As sleeps shorter than a single latency don't count, initialise a threshold variable to sysctl_sched_latency.
If the GENTLE_FAIR_SLEEPERS feature is enabled, halve the value of this variable.
Decrement the previously initialised temporary virtual runtime by this threshold value.
Ensure that the temporary virtual runtime is at least equal to the task's virtual runtime, by setting it to the maximum of itself and the task's virtual runtime.
Set the task's virtual runtime to the temporary runtime.
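A simplified sketch of that wakeup path, modeled on the place_entity() function in older kernel/sched_fair.c (the stub types and the assumption that GENTLE_FAIR_SLEEPERS is enabled are mine; exact names vary between kernel versions):

/* Minimal stand-ins for the kernel types involved (sketch only). */
typedef unsigned long long u64;
struct sched_entity { u64 vruntime; };
struct cfs_rq { u64 min_vruntime; };

static u64 max_vruntime(u64 a, u64 b) { return a > b ? a : b; }

/* Wakeup path of place_entity(): give the sleeper up to half a
 * scheduling latency of credit, but never move it backwards. */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
                         u64 sysctl_sched_latency)
{
        u64 vruntime = cfs_rq->min_vruntime;    /* step 1: queue minimum */
        u64 thresh = sysctl_sched_latency / 2;  /* steps 2-3: halved for gentle sleepers */

        vruntime -= thresh;                     /* step 4: credit the sleeper */
        se->vruntime = max_vruntime(se->vruntime, vruntime); /* steps 5-6 */
}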
3. When a process wakes up why is the vruntime adjusted by subtracting from sched_latency?
The virtual runtime is decremented because sleeps shorter than a single latency don't count.
For example, the task shouldn't have its position in the red-black tree changed if it has only slept for a single scheduler latency.
4. Can't that lead to processes in the run queue with large differences in the vruntime value?
I believe that the logic described in Step 3 for Question 2, prevents or at least minimises that.
References
sched Linux Kernel Source
sched_fair.c Linux Kernel Source
Notes on the CFS Scheduler Design
The idle task (a.k.a. the swapper task) is chosen to run when there are no more runnable tasks in the run queue at the point of task scheduling. But what is the use of this special task? Another question: why can't I find this thread/process (PID 0) in the "ps aux" output from userland?
The reason is historical and programmatic. The idle task is the task that runs if no other task is runnable, like you said. It has the lowest possible priority, so that's why it runs when no other task is runnable.
Programmatic reason: this simplifies process scheduling a lot, because you don't have to care about the special case "What happens if no task is runnable?"; there is always at least one runnable task, the idle task. It also lets you account CPU time per task: without the idle task, which task would the otherwise-unused CPU time be accounted to?
Historical reason: before we had CPUs which are able to step down or go into power-saving modes, the CPU HAD to run at full speed at all times. It ran a series of NOP instructions if no tasks were runnable. Today the idle task usually steps down the CPU by using the HLT instruction (halt), so power is saved. So there is still some functionality in the idle task these days.
In Windows you can see the idle task in the process list, it's the idle process.
The Linux kernel maintains a wait list of processes which are "blocked" on IO/mutexes etc. If there is no runnable process, the idle process is placed onto the run queue until it is preempted by a task coming out of the wait queue.
The reason it is a task is so that you can measure (approximately) how much time the kernel is wasting due to blocks on IO/locks etc. Additionally, it makes the kernel code simpler, as the idle task is context-switched like every other task, rather than being a special case that could make changing kernel behaviour more difficult.
There is actually one idle task per CPU, but it's not held in the main task list; instead it's in the CPU's "struct rq" runqueue struct, as a struct task_struct *.
This gets activated by the scheduler whenever there is nothing better to do (on that CPU) and executes some architecture-specific code to idle the CPU in a low-power state.
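Schematically, that per-CPU idle loop looks something like the following (a sketch, not actual kernel code; need_resched() and schedule() are real kernel helpers, while arch_cpu_idle() stands in for the architecture-specific low-power instruction such as HLT):

/* Sketch of a per-CPU idle loop. */
void idle_loop(void)
{
        for (;;) {
                while (!need_resched())
                        arch_cpu_idle();  /* sleep the CPU until an interrupt arrives */

                schedule();               /* a task became runnable; switch to it */
        }
}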
You can use ps -ef to list the running processes. PID 0 (the swapper task) never appears as a process itself, but it shows up as the parent (the PPID column) of the very first entries in that list.
I have read that the Linux kernel contains many scheduler classes, each having its own priority. To select a new process to run, the process scheduler iterates from the highest priority class to the lowest priority class. If a runnable process is found in a class, the highest priority process is selected to run from that class.
Extract from Linux kernel development by Robert Love:
The main entry point into the process scheduler is the function schedule(), defined in kernel/sched.c. This is the function that the rest of the kernel uses to invoke the process scheduler, deciding which process to run and then running it. schedule() is generic with respect to scheduler classes. That is, it finds the highest priority scheduler class with a runnable process and asks it what to run next. Given that, it should be no surprise that schedule() is simple. The only important part of the function (which is otherwise too uninteresting to reproduce here) is its invocation of pick_next_task(), also defined in kernel/sched.c. The pick_next_task() function goes through each scheduler class, starting with the highest priority, and selects the highest priority process in the highest priority class.
Let's imagine the following scenario. There are some processes waiting in lower priority classes and processes are being added to higher priority classes continuously. Won't the processes in lower priority classes starve?
The Linux kernel implements the Completely Fair Scheduling algorithm, which is based on a virtual clock.
Each scheduling entity has a sched_entity structure associated with it, a snapshot of which looks like:
struct sched_entity {
    ...
    u64 exec_start;            /* timestamp when the entity last started running */
    u64 sum_exec_runtime;      /* total real CPU time consumed */
    u64 vruntime;              /* time elapsed on the virtual clock, weighted by load */
    u64 prev_sum_exec_runtime; /* sum_exec_runtime saved when taken off the CPU */
    ...
};
The above four attributes are used to track the runtime of a process; using them, along with some other methods (update_curr(), where they are updated), the virtual clock is implemented.
When a process is assigned to a CPU, exec_start is updated to the current time, and the consumed CPU time is recorded in sum_exec_runtime. When the process is taken off the CPU, the sum_exec_runtime value is preserved in prev_sum_exec_runtime. sum_exec_runtime accumulates cumulatively (meaning it grows monotonically).
vruntime stores the amount of time that has elapsed on the virtual clock during the process's execution.
How is vruntime calculated?
Ignoring all the complex calculations, the core concept of how it is calculated is:
vruntime += delta_exec_weighted;
delta_exec_weighted = delta_exec * (NICE_0_LOAD / load.weight);
Here delta_exec is the time difference between the process being assigned to the CPU and being taken off it, whereas load.weight is the weight of the process, which depends on its priority (nice value). Usually an increase of 1 in the nice value of a process means it gets 10 percent less CPU time, resulting in less weight.
Process with nice value 0: weight = 1024
Process re-niced with value 1: weight = 1024/1.25 ≈ 820
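To make the weights concrete, here is a small standalone program (an illustration, not kernel code; the 10 ms figure is arbitrary) that charges the same real runtime to a nice-0 and a nice-1 task:

#include <stdio.h>

int main(void)
{
        double delta_exec = 10.0;  /* ms of real CPU time, arbitrary */
        double nice0_weight = 1024.0;
        double nice1_weight = 820.0;

        /* delta_exec_weighted = delta_exec * (NICE_0_LOAD / load.weight) */
        printf("nice 0: vruntime += %.2f ms\n", delta_exec * 1024.0 / nice0_weight);
        printf("nice 1: vruntime += %.2f ms\n", delta_exec * 1024.0 / nice1_weight);
        return 0;
}

The nice-0 task is charged exactly 10 ms of vruntime, while the re-niced task is charged about 12.49 ms for the same real runtime, so it drifts rightwards in the tree faster and runs correspondingly less.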
Points drawn from the above:
So vruntime increases when a process gets the CPU.
And vruntime increases more slowly for higher priority processes than for lower priority processes.
The runqueue is maintained as a red-black tree, and each runqueue has a min_vruntime variable associated with it that holds the smallest vruntime among all the processes in the runqueue. (min_vruntime can only increase, not decrease, as processes are scheduled.)
The key for a node in the red-black tree is process->vruntime - min_vruntime.
When the scheduler is invoked, the kernel basically picks the task which has the smallest key (the leftmost node) and assigns it the CPU.
Elements with smaller keys are placed more to the left, and thus are scheduled more quickly.
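In older kernels this key appears literally as a small helper (a sketch reusing the kernel's types; recent kernels compare vruntime values directly instead):

/* Key used to place an entity in the red-black tree (old-style sketch). */
static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        return (s64)(se->vruntime - cfs_rq->min_vruntime);
}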
When a process is running, its vruntime will steadily increase, so it will gradually move rightwards in the red-black tree.
Because vruntime increases more slowly for more important processes, they also move rightwards more slowly, so their chance of being scheduled is bigger than for a less important process, just as required.
If a process sleeps, its vruntime remains unchanged. Because the per-queue min_vruntime increases in the meantime, the sleeping process will be placed more to the left after waking up, because its key (mentioned above) got smaller.
Therefore there is no chance of starvation: a lower priority process deprived of the CPU will end up with the smallest vruntime, and hence the smallest key, so it moves to the left of the tree quickly and is therefore scheduled.
It would indeed starve.
There are many ways of dealing with such a scenario.
Aging: the longer a process stays in the system, the more its priority is increased.
Scheduling algorithms that give every process a time quantum to use the CPU. The time quantum varies; usually, interactive processes are given a smaller time quantum as they spend more time doing I/O, while time-consuming/computational processes are given a bigger time quantum.
After a process runs out its time quantum, it is put in an expired queue until there are no active processes left in the system.
Then the expired queue becomes the active queue and vice versa (see the sketch below).
These are two ways of preventing starvation.
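The second mechanism can be sketched as follows (schematic C, not actual kernel code; the struct members are placeholders in the spirit of the old O(1) scheduler):

struct prio_array {
        int nr_active;  /* per-priority lists of runnable tasks would live here */
};

struct runqueue {
        struct prio_array *active;   /* tasks with time quantum left */
        struct prio_array *expired;  /* tasks that used up their quantum */
};

static void swap_arrays(struct runqueue *rq)
{
        /* Once the active array empties, the roles swap: every expired
         * task gets a fresh quantum and becomes runnable again. */
        struct prio_array *tmp = rq->active;

        rq->active = rq->expired;
        rq->expired = tmp;
}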
I read this book
http://www.amazon.com/Professional-Kernel-Architecture-Wolfgang-Mauerer/dp/0470343435
Now I am studying the scheduler; the Linux kernel now uses CFS for normal processes.
But this book sometimes says the scheduler will choose the process with the longest wait time in the runqueue to run, and sometimes says it will choose the process with the smallest vruntime in the runqueue.
Which is correct?
Both are correct - both say the same thing in different words.
To pick the next process, the scheduler selects the task that has the minimal vruntime, namely the process that has run the least.
A process accumulates vruntime only while it is running, so by picking the task that has the minimal vruntime we pick the task that has run the least.
On the other hand, a process that waits a lot does not accumulate vruntime; therefore its vruntime is low. And if its wait time is the longest, its vruntime will be the lowest, and it will be picked to run next.
Just different ways to say the same thing.
I have read here about situations where a scheduler is called. But what happens when a high priority task comes?
High priority tasks are scheduled more often than low priority tasks, but when a high priority task arrives it still has to wait until the quantum of the running task is over.
Priority changes and is adjusted based on past CPU usage.
The longer version
In Linux, process priority is dynamic. The scheduler keeps track of what processes are doing and adjusts their priorities periodically; in this way, processes that have been denied the use of the CPU for a long time interval are boosted by dynamically increasing their priority. Correspondingly, processes running for a long time are penalized by decreasing their priority.
Scheduler maintains a set of all tasks that are ready to run in the system. In a multi-priority system, the task set usually supports the notion of priority. When a high priority task arrives in the system, it is put into the set of tasks sorted by priority.
There are certain points in the kernel where we check if a better process is available to run than the currently running process. This can happen when the time slice expires, when an ISR finishes, when a lock is unlocked, etc. Look for calls to switch() or _switch() or something similar; this is the routine that checks the set of tasks and determines whether the current task is the highest priority.
If the current task is not the highest prio task, then the current task is switched out and the highest prio task is obtained from the task set and scheduled to run.
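A hypothetical sketch of such a check (all names and the lower-number-is-higher-priority convention here are assumptions for illustration):

struct task {
        int prio;          /* lower number = higher priority (assumed) */
        int need_resched;  /* flag examined at the next scheduling point */
};

/* Called when a task wakes up or becomes runnable: if it beats the
 * current task's priority, mark the current task for preemption. */
static void check_preempt(struct task *curr, struct task *woken)
{
        if (woken->prio < curr->prio)
                curr->need_resched = 1;  /* actual switch happens at a safe point */
}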
When we call taskSpawn, a task is created in VxWorks. What actually is a task? Is there any relation to a thread?
In my understanding VxWorks is a thread-based operating system.
Can someone please help me with the real difference between task/thread/process in a real scenario?
Somewhere I saw that a task is the execution of a set of instructions. If that is so, a thread also has a set of instructions, so can we call a thread a task?
Please help.
A thread is a concept typically used with an OS supporting a process model (Unix/Linux/Windows), where you run a process.
This process could have a single thread of execution (like a simple C program), or you could create multiple threads to perform certain operations in parallel within the current process's memory space.
With older vxWorks, there was no process model. Everything would run in the same memory space. A vxWorks task provides the context in which the system code executes. All code (with the exception of interrupt handlers) executes in the context of a task.
Tasks are independent execution units. They can share resources, have common memory, etc... but the scheduler executes the tasks based on very specific criteria. Typically, the highest priority task in the system is the task that will be executing at any given time.
Once a task is done/sleeps/blocked waiting for resources, then the next highest priority task in the system will run.
For your purpose, you can probably think of the task as a thread.
A task is an abstract concept in OS design: a single context of execution. A task has a memory space it operates in, where its data and code are stored; this memory space may or may not be shared with other tasks. A task has a state (e.g. running, stopped, killed...), it (usually) has a stack, and it has a priority over other tasks.
One example of such a task is a VxWorks task. Another is a Linux thread.
In Linux (and, I believe, also in the latest versions of VxWorks), there exists a concept of a related group of tasks. Tasks belonging to the same group share memory space and several other resources (e.g. file descriptors). A Linux process is such a group of tasks.
By and large, the OS scheduler schedules tasks, not processes. The process is a convenient abstraction for the programmer to think about a group of related threads together.
I hope that helped.
In vxWorks, the task is the runnable unit.
Each task has a TCB (Task Control Block) with its own stack space and a specific priority (as you define in the taskSpawn function).
The vxWorks scheduler can run only tasks; the task is the minimum runnable unit (apart from tasks, only the kernel itself and interrupt handlers can run in the system).
The decision of which task to run is based on the task state (it must be READY) and the task priority (in vxWorks, the lowest number is the highest priority).
Note that several tasks might have the same priority, and then the kernel will run the different tasks according to the scheme you configured (FIFO or round robin).
In vxWorks, all tasks share the same memory space (including the kernel memory space). This is the reason WindRiver added the "process-like" mechanism from vxWorks 6.x onwards. A process has its own virtual memory space, protected by the MMU.
Just to summarize it for you (a taskSpawn sketch follows the list):
Tasks share the same memory space across the system.
Threads share the same memory space within their process.
A process's memory space is protected by the MMU.
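For reference, a minimal taskSpawn() usage sketch (the task name, priority, and stack size are arbitrary; the return type is a plain int in classic VxWorks but TASK_ID in newer versions):

#include <vxWorks.h>
#include <taskLib.h>

void hello(int arg)
{
        /* The task body runs in its own context, with its own stack and TCB. */
}

void spawnExample(void)
{
        /* Priority 100, no options, 4 KB stack; taskSpawn passes up to
         * ten integer arguments through to the entry point. */
        int tid = taskSpawn("tHello", 100, 0, 4096,
                            (FUNCPTR)hello,
                            1, 0, 0, 0, 0, 0, 0, 0, 0, 0);

        if (tid == ERROR)
                return;  /* spawn failed */
}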
Tasks and threads are similar to processes, but threads don't have a separate memory space of their own; they run within the address space (and stack) of the process itself. A task, by contrast, has its own stack area and is a lightweight process, i.e. a TCB is much smaller than a PCB, so context switching (task switching) can happen faster.
Since vxWorks is an RTOS and switching latency should be very low, it deals with tasks.
In addition to the existing answers:
If you ever need to create POSIX threads on your VxWorks system (which is possible by including POSIX in the kernel configuration and calling pthread_create()), you will notice that those threads appear as tasks in your task list (type 'i' in the C shell).
Hence, tasks and threads are very much alike. VxWorks even wraps POSIX threads as tasks so they can be handled alongside existing native tasks.
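A minimal sketch of that, assuming POSIX support is configured into the kernel:

#include <pthread.h>

void *worker(void *arg)
{
        /* Shows up as a task when you type 'i' in the C shell. */
        return NULL;
}

void startWorker(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, worker, NULL);
        pthread_join(tid, NULL);
}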