I am new to the Linux kernel and right now I am studying process scheduling in the Linux kernel. There are three types of priorities in Linux:
Static priority
Dynamic priority
Real time priority
Now, what I have understood is:
Static priority and dynamic priority are defined only for conventional processes, and they can take values from 100 to 139 only.
Static priority is used to determine the base time slice of a process.
Dynamic priority is used to select the process to be executed next.
Real-time priorities are defined only for real-time processes, and their values can range from 0 to 99.
Now my questions are:
Correct me if I am wrong, and please also tell me: why are we using three types of priorities in Linux, and what are the differences among these priorities?
Are processes differentiated as real-time or conventional on the basis of priority, that is, if the priority is between 100 and 139 then the process is a conventional process, otherwise a real-time process?
How are priorities changed in Linux? I mean, we know that the priority of a process does not remain constant throughout its execution.
Disclaimer: The following is true for scheduling in Linux (I am not sure about Windows or other OSes). Thread and process have been used interchangeably here, although there is a difference between them.
Priorities & differences
1. Static priority: These are the default priorities (value 0 for conventional processes, aka non-real-time processes, i.e. when real-time scheduling is not used), set while creating a new thread. You can change them using:
`int pthread_setschedparam(pthread_t thread, int policy, const struct sched_param *param);`
where sched_param contains the priority:
struct sched_param
{
    int sched_priority; /* Scheduling priority */
};
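As an illustration, here is a minimal sketch of changing the calling thread's policy and priority with pthread_setschedparam(); the chosen policy, the example priority value of 10, and the error handling are my own additions, and switching to SCHED_FIFO typically requires root or CAP_SYS_NICE:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Ask for SCHED_FIFO with real-time priority 10 (an arbitrary
     * example value in the valid 1..99 range). */
    struct sched_param param = { .sched_priority = 10 };

    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: error %d\n", err);

    return 0;
}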
2. Dynamic priority: When threads start to starve because higher-priority threads are being scheduled all the time, there arises a need to raise the priority of such a thread using various mechanisms. This raised/lowered (yes, this happens too) priority is known as the dynamic priority because it keeps on changing. In Linux, even the fat kid gets to play.
3. Real-time priority: This comes into the picture only when threads (processes) are scheduled under one of the real-time policies (SCHED_FIFO, SCHED_RR) and have a sched_priority value in the range 1 (low) to 99 (high). This is always higher than the static/dynamic priorities of non-real-time processes.
More information: http://man7.org/linux/man-pages/man3/pthread_getschedparam.3.html
Now, to your questions:
Correct me if I am wrong, and please also tell me: why are we using three types of priorities in Linux, and what are the differences among these priorities?
So, for non-real-time scheduling policies, every process has a static priority; a higher priority gives the thread a kick-start, and later, to avoid any injustice, the priority is boosted/lowered, which becomes the dynamic priority.
Are processes differentiated as real-time or conventional on the basis of priority, that is, if the priority is between 100 and 139 then the process is a conventional process, otherwise a real-time process?
Not really, it depends upon the scheduling mechanism in place.
How are priorities changed in Linux? I mean, we know that the priority of a process does not remain constant throughout its execution.
That's where the dynamic part comes into the picture. Read about the "nice value" in the links given above.
Related
Suppose four user processes are running in my system, say P1, P2, P3, and P4. Can the user tell which process has the least priority among them? How does the kernel prioritize the processes? What parameters does it take into consideration while determining process priority?
I need this information since I'm trying to suspend the process which has the least priority compared to the others.
Process priority is not as simple as that; typically, unless you do something, all user-level processes have the same priority to begin with (as they are time-shared by the scheduler). However, you can instruct the kernel to prioritize or de-prioritize a process by using a per-process nice value.
For more details, take a look at http://man7.org/linux/man-pages/man7/sched.7.html
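As a rough sketch of that nice-value mechanism, a process can de-prioritize itself with setpriority(2); the value 10 below is just an arbitrary example (nice values run from -20, highest priority, to 19, lowest, with 0 the default):

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    /* Raise our own nice value to 10; pid 0 means "this process". */
    if (setpriority(PRIO_PROCESS, 0, 10) == -1)
        perror("setpriority");

    printf("current nice value: %d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}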
Why, in Linux, are SCHED_FIFO and SCHED_RR called real-time schedulers, while SCHED_OTHER is called a non-real-time scheduler?
The term "real-time" when it comes to schedulers is used to denote if that scheduler can mathematically guarantee that within certain parameters all the periodic tasks within the system will meet their deadlines. This is important in things like closed loop control.
The problem with the Linux kernel is that even if you are using a real-time scheduler, you do not have the guarantee that you will meet deadlines, because of things like spin locks in the kernel itself. If you want true hard real-time behavior, you have to not only use a real-time scheduler but also, as of this writing, apply the real-time (PREEMPT_RT) patches. Some in the community intend for those patches to be merged into the mainline kernel in the future.
The scheduler is the kernel component that decides which runnable thread will be executed by the CPU next. Each thread has an associated scheduling policy and a static scheduling priority, sched_priority. The scheduler makes its decisions based on knowledge of the scheduling policy and static priority of all threads on the system.
For threads scheduled under one of the normal scheduling policies (SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in scheduling decisions (it must be specified as 0).
Processes scheduled under one of the real-time policies (SCHED_FIFO, SCHED_RR) have a sched_priority value in the range 1 (low) to 99 (high). (As the numbers imply, real-time threads always have higher priority than normal threads.)
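Those ranges need not be memorized; the kernel will report them. A small sketch using sched_get_priority_min(2) and sched_get_priority_max(2):

#include <sched.h>
#include <stdio.h>

static void show(const char *name, int policy)
{
    printf("%-12s sched_priority range: %d..%d\n", name,
           sched_get_priority_min(policy),
           sched_get_priority_max(policy));
}

int main(void)
{
    show("SCHED_OTHER", SCHED_OTHER); /* 0..0: priority unused */
    show("SCHED_FIFO",  SCHED_FIFO);  /* 1..99 on Linux */
    show("SCHED_RR",    SCHED_RR);    /* 1..99 on Linux */
    return 0;
}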
I have read that the Linux kernel contains many scheduler classes, each having its own priority. To select a new process to run, the process scheduler iterates from the highest-priority class to the lowest-priority class. If a runnable process is found in a class, the highest-priority process is selected to run from that class.
Extract from Linux Kernel Development by Robert Love:
The main entry point into the process scheduler is the function schedule(), defined in kernel/sched.c. This is the function that the rest of the kernel uses to invoke the process scheduler, deciding which process to run and then running it. schedule() is generic with respect to scheduler classes. That is, it finds the highest-priority scheduler class with a runnable process and asks it what to run next. Given that, it should be no surprise that schedule() is simple. The only important part of the function (which is otherwise too uninteresting to reproduce here) is its invocation of pick_next_task(), also defined in kernel/sched.c. The pick_next_task() function goes through each scheduler class, starting with the highest priority, and selects the highest-priority process in the highest-priority class.
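To make the quoted logic concrete, here is a purely illustrative sketch of the pick_next_task() idea; all the types, names, and toy classes below are invented for the example and are much simpler than the real kernel code:

#include <stdio.h>
#include <stddef.h>

struct task { const char *name; };

struct sched_class {
    struct task *(*pick_next)(void);   /* returns NULL if nothing runnable */
    const struct sched_class *next;    /* next lower-priority class */
};

/* Walk the classes from highest to lowest priority and return the
 * first runnable task found, mirroring the behavior described above. */
static struct task *pick_next_task(const struct sched_class *highest)
{
    for (const struct sched_class *c = highest; c != NULL; c = c->next) {
        struct task *t = c->pick_next();
        if (t != NULL)
            return t;  /* highest-priority class with a runnable task wins */
    }
    return NULL;       /* nothing runnable anywhere: idle */
}

/* Two toy classes: the RT class has nothing runnable, the fair class does. */
static struct task *rt_pick(void)   { return NULL; }
static struct task *fair_pick(void) { static struct task t = { "fair-task" }; return &t; }

int main(void)
{
    const struct sched_class fair = { fair_pick, NULL };
    const struct sched_class rt   = { rt_pick, &fair };

    printf("picked: %s\n", pick_next_task(&rt)->name); /* prints fair-task */
    return 0;
}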
Let's imagine the following scenario: there are some processes waiting in lower-priority classes, and processes are being added to higher-priority classes continuously. Won't the processes in the lower-priority classes starve?
The Linux kernel implements the Completely Fair Scheduler (CFS) algorithm, which is based on a virtual clock.
Each scheduling entity has a sched_entity structure associated with it, a snapshot of which looks like:
struct sched_entity {
    ...
    u64 exec_start;            /* timestamp when the task was put on the CPU */
    u64 sum_exec_runtime;      /* total CPU time consumed, grows cumulatively */
    u64 vruntime;              /* time elapsed on the virtual clock */
    u64 prev_sum_exec_runtime; /* sum_exec_runtime saved when the task left the CPU */
    ...
};
The above four attributes are used to track the runtime of a process; using these attributes, along with some other methods (such as update_curr(), where they are updated), the virtual clock is implemented.
When a process is assigned to a CPU, exec_start is updated to the current time, and the consumed CPU time is recorded in sum_exec_runtime. When the process is taken off the CPU, the sum_exec_runtime value is preserved in prev_sum_exec_runtime. sum_exec_runtime is calculated cumulatively (meaning it grows monotonically).
vruntime stores the amount of time that has elapsed on the virtual clock during process execution.
How is vruntime calculated?
Ignoring all the complex calculations, the core concept of how it is calculated is:
vruntime += delta_exec_weighted;
delta_exec_weighted = delta_exec * (NICE_0_LOAD/load.weight);
Here delta_exec is the time difference between when the process was assigned to the CPU and when it was taken off, whereas load.weight is the weight of the process, which depends on its priority (nice value). Usually, an increase of 1 in the nice value of a process means it gets about 10 percent less CPU time, resulting in a smaller weight.
Process with nice value 0: weight = 1024
Process re-niced to value 1: weight = 1024/1.25 ≈ 820
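A toy sketch of that weighting, with NICE_0_LOAD = 1024 as above; the real kernel uses fixed-point arithmetic and a precomputed weight table, so this only captures the core idea:

#include <stdint.h>
#include <stdio.h>

#define NICE_0_LOAD 1024  /* weight of a nice-0 task */

/* vruntime advances by delta_exec scaled inversely by the weight. */
static uint64_t delta_exec_weighted(uint64_t delta_exec, uint64_t weight)
{
    return delta_exec * NICE_0_LOAD / weight;
}

int main(void)
{
    uint64_t delta = 1000000;  /* 1 ms of real execution, in ns */

    /* nice 0 (weight 1024): vruntime advances at wall-clock speed */
    printf("nice 0: vruntime += %llu ns\n",
           (unsigned long long)delta_exec_weighted(delta, 1024));

    /* nice 1 (weight ~820): vruntime advances ~25 percent faster,
     * so the task is picked less often */
    printf("nice 1: vruntime += %llu ns\n",
           (unsigned long long)delta_exec_weighted(delta, 820));
    return 0;
}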
Points drawn from the above:
So vruntime increases when a process gets the CPU.
And vruntime increases more slowly for higher-priority processes compared with lower-priority processes.
The runqueue is maintained as a red-black tree, and each runqueue has a min_vruntime variable associated with it that holds the smallest vruntime among all the processes in the runqueue. (min_vruntime can only increase, not decrease, as processes get scheduled.)
The key for a node in the red-black tree is process->vruntime - min_vruntime.
When the scheduler is invoked, the kernel basically picks the task which has the smallest key (the leftmost node) and assigns it the CPU.
Elements with smaller keys will be placed more to the left, and thus be scheduled more quickly.
When a process is running, its vruntime will steadily increase, so it will gradually move rightwards in the red-black tree.
Because vruntime increases more slowly for more important processes, they also move rightwards more slowly, so their chance of being scheduled is bigger than that of a less important process, just as required.
If a process sleeps, its vruntime will remain unchanged. Because the per-queue min_vruntime increases in the meantime, the sleeping process will be placed more to the left after waking up, because its key (mentioned above) has become smaller.
Therefore, there is no chance of starvation: if a lower-priority process is deprived of the CPU, its vruntime will be the smallest, hence its key will be the smallest, so it moves to the left of the tree quickly and is therefore scheduled.
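Here is a toy illustration of that key ordering; a plain array stands in for the red-black tree, and the task names and vruntime values are made up:

#include <stdint.h>
#include <stdio.h>

struct toy_task {
    const char *name;
    uint64_t vruntime;
};

int main(void)
{
    /* Three runnable tasks on one runqueue. */
    struct toy_task rq[] = { { "A", 5000 }, { "B", 3000 }, { "C", 9000 } };
    uint64_t min_vruntime = 3000;  /* smallest vruntime on this queue */

    /* "Leftmost node" = smallest key, key = vruntime - min_vruntime. */
    int leftmost = 0;
    for (int i = 1; i < 3; i++)
        if (rq[i].vruntime - min_vruntime <
            rq[leftmost].vruntime - min_vruntime)
            leftmost = i;

    printf("next task to run: %s\n", rq[leftmost].name); /* prints B */
    return 0;
}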
It would indeed starve.
There are many ways of dealing with such a scenario.
Aging: the longer a process stays in the system, the more its priority is increased.
Scheduling algorithms that give every process a time quantum to use the CPU. The time quantum varies: usually, interactive processes are given a smaller time quantum, as they spend more time doing I/O, while time-consuming/computational processes are given a bigger time quantum.
After a process runs for its time quantum, it is put in an expired queue until there are no active processes in the system.
Then the expired queue becomes the active queue, and vice versa (a toy sketch follows below).
These are two ways of preventing starvation.
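A minimal toy sketch of that active/expired swap; all names and numbers here are invented for illustration, and real schedulers are far more involved:

#include <stdio.h>

#define NTASKS 3

int main(void)
{
    int active[NTASKS]  = { 1, 2, 3 };  /* task ids */
    int expired[NTASKS];
    int n_active = NTASKS, n_expired = 0;

    for (int round = 0; round < 2; round++) {
        /* Run each active task for its quantum, then expire it. */
        for (int i = 0; i < n_active; i++) {
            printf("round %d: running task %d\n", round, active[i]);
            expired[n_expired++] = active[i];
        }
        /* Active queue drained: swap queues so nothing starves. */
        for (int i = 0; i < n_expired; i++)
            active[i] = expired[i];
        n_active  = n_expired;
        n_expired = 0;
    }
    return 0;
}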
I am testing my multithreaded application on a Linux RT multicore machine.
However, during testing we are observing that scheduling (of threads created with the SCHED_FIFO scheduling policy) in Linux RT is not happening according to the SCHED_FIFO policy.
We could see in multiple places that the execution of a higher-priority thread was getting preempted by a lower-priority thread.
Based on some research we did on the internet, we found that the following kernel parameters need to be changed from
/proc/sys/kernel/sched_rt_period_us containing 1000000
/proc/sys/kernel/sched_rt_runtime_us containing 950000
to
/proc/sys/kernel/sched_rt_period_us containing 1000000
/proc/sys/kernel/sched_rt_runtime_us containing 1000000
or
/proc/sys/kernel/sched_rt_period_us containing -1
/proc/sys/kernel/sched_rt_runtime_us containing -1
We tried both, but we are still facing the problem sometimes. We face the issue even when the higher-priority thread is not suspended by any system call.
It would be great if you could let us know whether you are aware of such problems in Linux RT scheduling and/or have any solutions to make Linux RT scheduling deterministic based on priority.
There are no printfs or any system calls in the higher-priority thread, but the higher-priority thread still gets preempted by the lower-priority thread.
Also, I have made sure all the threads in the process run on a single core, using the taskset command.
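For reference, that pinning can also be done programmatically; here is a sketch equivalent to taskset -c 0 for the calling process (the CPU number 0 is just an example):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  /* allow only CPU 0 */

    /* pid 0 = the calling process */
    if (sched_setaffinity(0, sizeof(set), &set) == -1)
        perror("sched_setaffinity");
    return 0;
}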
There could be two reasons:
CPU throttling: the scheduler is designed to reserve some CPU time for non-RT tasks; you have already disabled it by acting on the /proc/sys/kernel/ entries
Blocking: your high-priority task is blocking, either
on some synchronization mechanism (e.g., mutex, semaphore), or
on some blocking call (e.g., malloc, printf, read, write, etc.)
This is from the sched_setscheduler(2) Linux man page:
"Processes scheduled under one of the real-time policies (SCHED_FIFO, SCHED_RR) have a sched_priority value in the range 1 (low) to 99 (high)."
"A SCHED_FIFO process runs until either it is blocked by an I/O request, it is preempted by a higher priority process, or it calls sched_yield(2)."
I have the following code:
#include <sched.h>
#include <string.h>

struct sched_param sp;
memset( &sp, 0, sizeof(sp) );
sp.sched_priority = 99;                   /* highest real-time priority */
sched_setscheduler( 0, SCHED_FIFO, &sp ); /* pid 0 = calling process */
Now the process should be running under the highest possible priority (99)
and should never be preempted.
So, when it starts running the following loop:
while ( 1 ) ;
it should be running forever and no other process should be allowed to run.
In spite of this, when I start such a process, I can still use other processes too. The other processes run much slower, but they DO run.
My processor has 2 cores, so I started two copies of the process.
Usage of both cores jumped to 97%-100%. Both processes were running their infinite loop.
I could still type commands in a shell and watch their output. I could use GUI programs as well.
How's that possible since a SCHED_FIFO process with a priority of 99 should never be preempted?
If you haven't changed any other policy settings, then you're likely getting throttled. See this informative article on the real-time limits added to the scheduler a few years back.
The gist of it is: unprivileged users can use SCHED_FIFO and try to soak the CPU, but the RT limits code will force in a little bit of SCHED_OTHER anyway so you don't wedge the system. From the article:
Kernels shipped since 2.6.25 have set the rt_bandwidth value for the
default group to be 0.95 out of every 1.0 seconds. In other words, the
group scheduler is configured, by default, to reserve 5% of the CPU
for non-SCHED_FIFO tasks.
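Those two knobs can be inspected from code as well; a small sketch that just reads the current values, using the /proc paths given earlier:

#include <stdio.h>

int main(void)
{
    const char *knobs[] = {
        "/proc/sys/kernel/sched_rt_period_us",
        "/proc/sys/kernel/sched_rt_runtime_us",
    };

    for (int i = 0; i < 2; i++) {
        FILE *f = fopen(knobs[i], "r");
        long val;
        if (f != NULL && fscanf(f, "%ld", &val) == 1)
            printf("%s = %ld\n", knobs[i], val);
        if (f != NULL)
            fclose(f);
    }
    return 0;
}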