How "Real-Time" is Linux 2.6? - linux

I am looking at moving my product from an RTOS to embedded Linux. I don't have many real-time requirements, and the few RT requirements I have are on the order of 10s of milliseconds.
Can someone point me to a reference that will tell me how Real-Time the current version of Linux is?
Are there any other gotchas from moving to a commercial RTOS to Linux?

You can get most of your answers from the Real Time Linux wiki and FAQ
What are real-time capabilities of the stock 2.6 linux kernel?
Traditionally, the Linux kernel will only allow one process to preempt another only under certain circumstances:
When the CPU is running user-mode code
When kernel code returns from a system call or an interrupt back to user space
When kernel code code blocks on a mutex, or explicitly yields control to another process
If kernel code is executing when some event takes place that requires a high priority thread to start executing, the high priority thread can not preempt the running kernel code, until the kernel code explicitly yields control. In the worst case, the latency could potentially be hundreds milliseconds or more.
The Linux 2.6 configuration option CONFIG_PREEMPT_VOLUNTARY introduces checks to the most common causes of long latencies, so that the kernel can voluntarily yield control to a higher priority task waiting to execute. This can be helpful, but while it reduces the occurences of long latencies (hundreds of milliseconds to potentially seconds or more), it does not eliminate them. However unlike CONFIG_PREEMPT (discussed below), CONFIG_PREEMPT_VOLUNTARY has a much lower impact on the overall throughput of the system. (As always, there is a classical tradeoff between throughput --- the overall efficiency of the system --- and latency. With the faster CPU's of modern-day systems, it often makes sense to trade off throughput for lower latencies, but server class systems that do not need minimum latency guarantees may very well chose to use either CONFIG_PREEMPT_VOLUNTARY, or to stick with the traditional non-preemptible kernel design.)
The 2.6 Linux kernel has an additional configuration option, CONFIG_PREEMPT, which causes all kernel code outside of spinlock-protected regions and interrupt handlers to be eligible for non-voluntary preemption by higher priority kernel threads. With this option, worst case latency drops to (around) single digit milliseconds, although some device drivers can have interrupt handlers that will introduce latency much worse than that. If a real-time Linux application requires latencies smaller than single-digit milliseconds, use of the CONFIG_PREEMPT_RT patch is highly recommended.
They also have a list of "Gotcha's" as you called them in the FAQ.
What are important things to keep in
mind while writing realtime
applications?
Taking care of the following during
the initial startup phase:
Call mlockall() as soon as possible from main().
Create all threads at startup time of the application, and touch each page of the entire stack of each thread. Never start threads dynamically during RT show time, this will ruin RT behavior.
Never use system calls that are known to generate page faults, such as
fopen(). (Opening of files does the
mmap() system call, which generates a
page-fault).
If you use 'compile time global variables' and/or 'compile time global
arrays', then use mlockall() to
prevent page faults when accessing
them.
more information: HOWTO: Build an
RT-application
They also have a large publications page you might want to checkout.

Have you had a look at Xenomai? It will let you run "hard real time" processes above Linux, while still allowing you to access the regular Linux APIs for all the non-real-time needs.

There are two fundamentally different approaches to achieve real-time capabilities with Linux.
Patch the existing kernel with things like the rt-preempt patches. This will eventually lead to a fully preemptive kernel
Dual kernel approach (like xenomai, RTLinux, RTAI,...)
There are lots of gotchas moving from a RTOS to Linux.
Maybe you don't really need real-time?
I'm talking about real-time Linux in my training sessions:
https://rlbl.me/elisa
https://rlbl.me/elisa-en-pdf
https://rlbl.me/intely
https://rlbl.me/intely-en-pdf
https://rlbl.me/entirety-en-all-pdf

The answer is probably "good enough".
If you're running an embedded system, you probably have control of all or most of the software on the box.
Stock Linux 2.6 has several features suitable for low-latency tasks - chiefly these are:
Scheduling policies
Memory locking
Assuming you're using a single-core machine, if you have just one task which has set its scheduling policy to SCHED_FIFO or SCHED_RR (it doesn't matter which if you have just one task), AND locked all its memory in with mlockall(), then it WILL get scheduled as soon as it is ready to run.
Then the only thing you'd have to worry about was some non-preemptable part of the kernel taking longer than your acceptable latency to complete - which is unlikely to happen in an embedded system unless something bad happens, such as extreme memory pressure, or your drivers are dodgy.
I guess "try it and see" is a good answer, but that's probably rather complicated in your case (and might involve writing device drivers etc).
Look at the doc for sched_setscheduler for some good info.

Related

Would it makes the kernel level thread clearly preferable to user level thread if system calls is as fast as procedure calls?

Some web searching results told me that the only deficiency of kernel-level thread is the slow speed of its management(create, switch, terminate, etc.). It seems that if the operation on the kernel-level thread is all through system calls, the answer to my question will be true. However, I've searched a lot to find whether the management of kernel-level thread is all through system call but find nothing. And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time. So it seems impossible for programmers to write some explicit system calls to manage threads. I'm appreciative of any ideas.
Some web searching results told me that the only deficiency of kernel-level thread is the slow speed of its management(create, switch, terminate, etc.).
It's not that simple. To understand, think about what causes task switches. Here's a (partial) list:
a device told a device driver that an operation completed (some data arrived, etc) causing a thread that was waiting for the operation to unblock and then preempt the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
enough time passed; either causing an "end of time slice" task switch, or causing a sleeping thread to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the thread accessed virtual memory that isn't currently accessible, triggering the kernel's page fault handler which finds out that the current task has to wait while the kernel fetches data from from swap space or from a file (if the virtual memory is part of a memory mapped file), or has to wait for kernel to free up RAM by sending other pages to swap space (if virtual memory was involved in some kind of "copy on write"); causing a task switch because the currently running task can't continue. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
a new process is being created, and its initial thread preempts the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread asked kernel to do something with a file and kernel got "VFS cache miss" that prevents the request from being performed without any task switches. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to a different process to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to the same process to unblock and preempt. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster, but in practice it can just as easily be an indicator of poor design (using too many threads and/or far too much lock contention).
a new thread is being created for the same process; and the new thread preempts the currently running thread. For this case you're running user-space code when you find out that a task switch is needed, so in user-space task switching is faster; but only if kernel isn't informed (e.g. so that utilities like "top" can properly display details for threads) - if kernel is informed anyway then it doesn't make much difference where the task switch happens.
For most software (which doesn't use very many threads); doing task switches in the kernel is faster. Of course it's also (hopefully) fairly irrelevant for performance (because time spent switching tasks should be tiny compared to time spend doing other work).
And I always have an instinct that such management should be done by the OS automatically because only OS knows which thread would be suitable to run at a specific time.
Yes; but possibly not for the reason you think.
Another problem with user-space threading (besides making most task switches slower) is that it can't support global thread priorities without becoming a severe security disaster. Specifically; a process can't know if its own thread is higher or lower priority than a thread belonging to a different process (unless it has information about all threads for the entire OS, which is information that normal processes shouldn't be trusted to have); so user-space threading leads to wasting CPU time doing unimportant work (for one process) when there's important work to do (for a different process).
Another problem with user-space threading is that (for some CPUs - e.g. most 80x86 CPUs) the CPUs are not independent, and there may be power management decisions involved with scheduling. For examples; most 80x86 CPUs have hyper-threading (where a core is shared by 2 logical processors), where a smart scheduler may say "one logical processor in the core is running a high priority/important thread, so the other logical processor in the same core should not run a low priority/unimportant thread because that would make the important work slower"; most 80x86 CPUs have "turbo boost" (with similar "don't let low priority threads ruin the turbo-boost/performance of high priority thread" possibilities); and most CPUs have thermal management (where scheduler might say "Hey, these threads are all low priority, so let's underclock the CPU so that it cools down and can go faster later (has more thermal headroom) when there's high priority/more important work to do!").
Would it makes the kernel level thread clearly preferable to user level thread if system calls is as fast as procedure calls?
If system calls were as fast as normal procedure calls, then the performance differences between user-space threading and kernel threading would disappear (but all the other problems with user-space threading would remain). However, the reason why system calls are slower than normal procedure calls is that they pass through a kind of "isolation barrier" (that isolates kernel's code and data from malicious user-space code); so to make system calls as fast as normal procedure calls you'd have to get rid of the isolation (effectively turning the kernel into a kind of "global shared library" that can be dynamically linked) but without that isolation you'll have an extreme security disaster. In other words; to have any hope of achieving acceptable security, system calls must be slower than normal procedure calls.
Your basic premise is wrong. System calls are much slower than procedure calls in almost every interesting architecture.
The perceived cpu throughput is based on pipelining, speculative execution and fetching. The syscall stops the pipeline, invalidates the speculative execution and halts the speculative fetching, is a store and instruction barrier, and may flush the write fifo.
So, the processor slows down to its ‘spec’ speed around the syscall, accelerating back up until the syscall return, whereupon it does about the exact same thing.
Attempts to optimise this area have given rise to lots of papers named after fictional James Bond organizations, and not conciliatory enough apologies from not embarrassed enough cpu product managers. Google spectre as an example, then follow the associated links.
The other cost of syscall
A bit over 30 years ago, some smart guys wrote a paper about least privilege. Conceptually, it is a stunner. The basic premise is that whatever your program is doing, it should do it with the least privilege possible.
If your program is inverting arrays, according to the notion of least privilege, it should not be able to disable interrupts. Disabling interrupts can cause a very difficult to diagnose system failure. Simple user code should not have this ability.
The notion of user and kernel modes of execution evolved from early computer systems, and (with the possible exception of the iax32 / 80286 ) are increasingly showing their inadequacy in the connected computer environment. At one point in time you could say "this is a single user system"; but the IoT dweebs have made everything multi-user.
Least privilege insists that all code should execute with the minimum privilege required to complete the task at hand. Thus, nothing should be in the kernel that absolutely doesn't need to be. If you think that is a radical thought, in Ken Thompson's 1977(?) paper on the UNIX kernel he states exactly the same thing.
So no, putting your junk in the kernel just means you have increased the attack surface for no valid reason. Try to think in terms of exposing minimum risk, it leads to better software and better sleep.

What do we mean by "Non-preemptive Kernel"? [duplicate]

I read that Linux kernel is preemptive, which is different from most Unix kernels. So, what does it really mean for a kernal to be preemptive?
Some analogies or examples would be better than pure theoretical explanation.
ADD 1 -- 11:00 AM 12/7/2018
Preemptive is just one paradigm of multi-tasking. There are others like Cooperative Multi-tasking. A better understanding can be achieved by comparing them.
Prior to Linux kernel version 2.5.4, Linux Kernel was not preemptive which means a process running in kernel mode cannot be moved out of processor until it itself leaves the processor or it starts waiting for some input output operation to get complete.
Generally a process in user mode can enter into kernel mode using system calls. Previously when the kernel was non-preemptive, a lower priority process could priority invert a higher priority process by denying it access to the processor by repeatedly calling system calls and remaining in the kernel mode. Even if the lower priority process' timeslice expired, it would continue running until it completed its work in the kernel or voluntarily relinquished control. If the higher priority process waiting to run is a text editor in which the user is typing or an MP3 player ready to refill its audio buffer, the result is poor interactive performance. This way non-preemptive kernel was a major drawback at that time.
Imagine the simple view of preemptive multi-tasking. We have two user tasks, both of which are running all the time without using any I/O or performing kernel calls. Those two tasks don't have to do anything special to be able to run on a multi-tasking operating system. The kernel, typically based on a timer interrupt, simply decides that it's time for one task to pause to let another one run. The task in question is completely unaware that anything happened.
However, most tasks make occasional requests of the kernel via syscalls. When this happens, the same user context exists, but the CPU is running kernel code on behalf of that task.
Older Linux kernels would never allow preemption of a task while it was busy running kernel code. (Note that I/O operations always voluntarily re-schedule. I'm talking about a case where the kernel code has some CPU-intensive operation like sorting a list.)
If the system allows that task to be preempted while it is running kernel code, then we have what is called a "preemptive kernel." Such a system is immune to unpredictable delays that can be encountered during syscalls, so it might be better suited for embedded or real-time tasks.
For example, if on a particular CPU there are two tasks available, and one takes a syscall that takes 5ms to complete, and the other is an MP3 player application that needs to feed the audio pipe every 2ms, you might hear stuttering audio.
The argument against preemption is that all kernel code that might be called in task context must be able to survive preemption-- there's a lot of poor device driver code, for example, that might be better off if it's always able to complete an operation before allowing some other task to run on that processor. (With multi-processor systems the rule rather than the exception these days, all kernel code must be re-entrant, so that argument isn't as relevant today.) Additionally, if the same goal could be met by improving the syscalls with bad latency, perhaps preemption is unnecessary.
A compromise is CONFIG_PREEMPT_VOLUNTARY, which allows a task-switch at certain points inside the kernel, but not everywhere. If there are only a small number of places where kernel code might get bogged down, this is a cheap way of reducing latency while keeping the complexity manageable.
Traditional unix kernels had a single lock, which was held by a thread while kernel code was running. Therefore no other kernel code could interrupt that thread.
This made designing the kernel easier, since you knew that while one thread using kernel resources, no other thread was. Therefore the different threads cannot mess up each others work.
In single processor systems this doesn't cause too many problems.
However in multiprocessor systems, you could have a situation where several threads on different processors or cores all wanted to run kernel code at the same time. This means that depending on the type of workload, you could have lots of processors, but all of them spend most of their time waiting for each other.
In Linux 2.6, the kernel resources were divided up into much smaller units, protected by individual locks, and the kernel code was reviewed to make sure that locks were only held while the corresponding resources were in use. So now different processors only have to wait for each other if they want access to the same resource (for example hardware resource).
The preemption allows the kernel to give the IMPRESSION of parallelism: you've got only one processor (let's say a decade ago), but you feel like all your processes are running simulaneously. That's because the kernel preempts (ie, take the execution out of) the execution from one process to give it to the next one (maybe according to their priority).
EDIT Not preemptive kernels wait for processes to give back the hand (ie, during syscalls), so if your process computes a lot of data and doesn't call any kind of yield function, the other processes won't be able to execute to execute their calls. Such systems are said to be cooperative because they ask for the cooperation of the processes to ensure the equity of the execution time
EDIT 2 The main goal of preemption is to improve the reactivity of the system among multiple tasks, so that's good for end-users, whereas on the other-hand, servers want to achieve the highest througput, so they don't need it: (from the Linux kernel configuration)
Preemptible kernel (low-latency desktop)
Voluntary kernel preemption (desktop)
No forced preemption (server)
The linux kernel is monolithic and give a little computing timespan to all the running process sequentially. It means that the processes (eg. the programs) do not run concurrently, but they are given a give timespan regularly to execute their logic. The main problem is that some logic can take longer to terminate and prevent the kernel to allow time for the next process. This results in system "lags".
A preemtive kernel has the ability to switch context. It means that it can stop a "hanging" process even if it is not finished, and give the computing time to the next process as expected. The "hanging" process will continue to execute when its time has come without any problem.
Practically, it means that the kernel has the ability to achieve tasks in realtime, which is particularly interesting for audio recording and editing.
The ubuntu studio districution packages a preemptive kernel as well as a buch of quality free software devoted to audio and video edition.
It means that the operating system scheduler is free to suspend the execution of the running processes to give the CPU to another process whenever it wants; the normal way to do this is to give to each process that is waiting for the CPU a "quantum" of CPU time to run. After it has expired the scheduler takes back the control (and the running process cannot avoid this) to give another quantum to another process.
This method is often compared with the cooperative multitasking, in which processes keep the CPU for all the time they need, without being interrupted, and to let other applications run they have to call explicitly some kind of "yield" function; naturally, to avoid giving the feeling of the system being stuck, well-behaved applications will yield the CPU often. Still,if there's a bug in an application (e.g. an infinite loop without yield calls) the whole system will hang, since the CPU is completely kept by the faulty program.
Almost all recent desktop OSes use preemptive multitasking, that, even if it's more expensive in terms of resources, is in general more stable (it's more difficult for a sigle faulty app to hang the whole system, since the OS is always in control). On the other hand, when the resources are tight and the application are expected to be well-behaved, cooperative multitasking is used. Windows 3 was a cooperative multitasking OS; a more recent example can be RockBox, an opensource PMP firmware replacement.
I think everyone did a good job of explaining this but I'm just gonna add little more info. in context of Linux IRQ, interrupt and kernel scheduler.
Process scheduler is the component of the OS that is responsible for deciding if current running job/process should continue to run and if not which process should run next.
preemptive scheduler is a scheduler which allows to be interrupted and a running process then can change it's state and then let another process to run (since the current one was interrupted).
On the other hand, non-preemptive scheduler can't take away CPU away from a process (aka cooperative)
FYI, the name word "cooperative" can be confusing because the word's meaning does not clearly indicate what scheduler actually does.
For example, Older Windows like 3.1 had cooperative schedulers.
Full credit to wonderful article here
I think it became preemptive from 2.6. preemptive means when a new process is ready to run, the cpu will be allocated to the new process, it doesn't need the running process be co-operative and give up the cpu.
Linux kernel is preemptive means that The kernel supports preemption.
For example, there are two processes P1(higher priority) and P2(lower priority) which are doing read system calls and they are running in kernel mode. Suppose P2 is running and is in the kernel mode and P2 is scheduled to run.
If kernel preemption is available, then preemption can happen at the kernel level i.e P2 can get preempted and but to sleep and the P1 can continue to run.
If kernel preemption is not available, since P2 is in kernel mode, system simply waits till P2 is complete and then

What makes a kernel/OS real-time?

I was reading this article, but my question is on a generic level, I was thinking along the following lines:
Can a kernel be called real time just because it has a real time scheduler? Or in other words, say I have a linux kernel, and if I change the default scheduler from O(1) or CFS to a real time scheduler, will it become an RTOS?
Does it require any support from the hardware? Generally I have seen embedded devices having an RTOS (eg VxWorks, QNX), do these have any special provisions/hw to support them? I know RTOS process's running time is deterministic, but then one can use longjump/setjump to get the output in determined time.
I'd really appreciate some input/insight on it, if I am wrong about something, please correct me.
After doing some research, talking to poeple (Jamie Hanrahan, Juha Aaltonen #linkedIn Group - Device Driver Experts) and ofcourse the input from #Jim Garrison, this what I can conclude:
In Jamie Hanrahan's words-
What makes a kernel real time?
The sine qua non of a real time OS -
The ability to guarantee a maximum latency between an external interrupt and the start of the interrupt handler.
Note that the maximum latency need not be particularly short (e.g. microseconds), you could have a real time OS that guaranteed an absolute maximum latency of 137 milliseconds.
A real time scheduler is one that offers completely predictable (to the developer) behavior of thread scheduling - "which thread runs next".
This is generally separate from the issue of a guaranteed maximum latency to responding to an interrupt (since interrupt handlers are not necessarily scheduled like ordinary threads) but it is often necessary to implement a real-time application. Schedulers in real-time OSs generally implement a large number of priority levels. And they almost always implement priority inheritance, to avoid priority inversion situations.
So, it is good to have a guaranteed latency for an interrupt and predictability of thread scheduling, then why not make every OS real time?
Because an OS suited for general purpose use (servers and/or desktops) needs to have characteristics that are generally at odds with real-time latency guarantees.
For example, a real-time scheduler should have completely predictable behavior. That means, among other things, that whatever priorities have been assigned to the various tasks by the developer should be left alone by the OS. This might mean that some low-priority tasks end up being starved for long periods of time. But the RT OS has to shrug and say "that's what the dev wanted." Note that to get the correct behavior, the RT system developer has to worry a lot about things like task priorities and CPU affinities.
A general-purpose OS is just the opposite. You want to be able to just throw apps and services on it, almost always things written by many different vendors (instead of being one tightly integrated system as in most R-T systems), and get good performance. Perhaps not the absolute best possible performance, but good.
Note that "good performance" is not just measured in interrupt latency. In particular, you want CPU and other resource allocations that are often described as "fair", without the user or admin or even the app developers having to worry much if at all about things like thread priorities and CPU affinities and NUMA nodes. One job might be more important than another, but in a general-purpose OS, that doesn't mean that the second job should get no resources at all.
So the general purpose OS will usually implement time-slicing among threads of equal priority, and it may adjust the priorities of threads according to their past behavior (e.g. a CPU hog might have its priority reduced; an I/O bound thread might have its priority increased, so it can keep the I/O devices working; a CPU-starved thread might have its priority boosted so it can get a little bit of CPU time now and then).
Can a kernel be called real time just because it has a real time scheduler?
No, an RT scheduler is a necessary component of an RT OS, but you also need predictable behavior in other parts of the OS.
Does it require any support from the hardware?
In general, the simpler the hardware the more predictable its behavior is. So PCI-E is less predictable than PCI, and PCI is less predictable than ISA, etc. There are specific I/O buses that were designed for (among other things) easy predictability of e.g. interrupt latency, but a lot of R-T requirements can be met these days with commodity hardware.
The specific description of real-time is that processes have minimum response time guarantees. This is often not sufficient for the application, and even less important than determinism. This is especially hard to achieve with modern feature rich OS's. Consider:
If I want to command some hardware or a machine at precise points in time, I need to be able to generate command signals at those specific moments, often with far sub millisecond accuracy. Generally if you compile let's say a C-code that runs a loop that waits for "half a millisecond" and does something, the wait time is not exactly half a millisecond, it is a little bit more, since the way common OS's handle this, is that they put the process aside at least up until the correct time has passed, after which the scheduler might (at some point) pick it up again.
What is seriously problematic is not that the time t is not exactly half a second but that it cannot be known in advance how much more it is. This inaccuracy is not constant nor deterministic.
This has surprising consequences when doing physical automation. For example it is impossible to command a stepper motor accurately with any typical OS without using dedicated hardware through kernel interfaces and telling them how long time steps you really want. Because of this, a single AVR module can command several motors accurately, but a Raspberry Pi (that absolutely stomps the AVR in terms of clockspeed) cannot manage more than 2 with any typical OS.

What does it mean to say "linux kernel is preemptive"?

I read that Linux kernel is preemptive, which is different from most Unix kernels. So, what does it really mean for a kernal to be preemptive?
Some analogies or examples would be better than pure theoretical explanation.
ADD 1 -- 11:00 AM 12/7/2018
Preemptive is just one paradigm of multi-tasking. There are others like Cooperative Multi-tasking. A better understanding can be achieved by comparing them.
Prior to Linux kernel version 2.5.4, Linux Kernel was not preemptive which means a process running in kernel mode cannot be moved out of processor until it itself leaves the processor or it starts waiting for some input output operation to get complete.
Generally a process in user mode can enter into kernel mode using system calls. Previously when the kernel was non-preemptive, a lower priority process could priority invert a higher priority process by denying it access to the processor by repeatedly calling system calls and remaining in the kernel mode. Even if the lower priority process' timeslice expired, it would continue running until it completed its work in the kernel or voluntarily relinquished control. If the higher priority process waiting to run is a text editor in which the user is typing or an MP3 player ready to refill its audio buffer, the result is poor interactive performance. This way non-preemptive kernel was a major drawback at that time.
Imagine the simple view of preemptive multi-tasking. We have two user tasks, both of which are running all the time without using any I/O or performing kernel calls. Those two tasks don't have to do anything special to be able to run on a multi-tasking operating system. The kernel, typically based on a timer interrupt, simply decides that it's time for one task to pause to let another one run. The task in question is completely unaware that anything happened.
However, most tasks make occasional requests of the kernel via syscalls. When this happens, the same user context exists, but the CPU is running kernel code on behalf of that task.
Older Linux kernels would never allow preemption of a task while it was busy running kernel code. (Note that I/O operations always voluntarily re-schedule. I'm talking about a case where the kernel code has some CPU-intensive operation like sorting a list.)
If the system allows that task to be preempted while it is running kernel code, then we have what is called a "preemptive kernel." Such a system is immune to unpredictable delays that can be encountered during syscalls, so it might be better suited for embedded or real-time tasks.
For example, if on a particular CPU there are two tasks available, and one takes a syscall that takes 5ms to complete, and the other is an MP3 player application that needs to feed the audio pipe every 2ms, you might hear stuttering audio.
The argument against preemption is that all kernel code that might be called in task context must be able to survive preemption-- there's a lot of poor device driver code, for example, that might be better off if it's always able to complete an operation before allowing some other task to run on that processor. (With multi-processor systems the rule rather than the exception these days, all kernel code must be re-entrant, so that argument isn't as relevant today.) Additionally, if the same goal could be met by improving the syscalls with bad latency, perhaps preemption is unnecessary.
A compromise is CONFIG_PREEMPT_VOLUNTARY, which allows a task-switch at certain points inside the kernel, but not everywhere. If there are only a small number of places where kernel code might get bogged down, this is a cheap way of reducing latency while keeping the complexity manageable.
Traditional unix kernels had a single lock, which was held by a thread while kernel code was running. Therefore no other kernel code could interrupt that thread.
This made designing the kernel easier, since you knew that while one thread using kernel resources, no other thread was. Therefore the different threads cannot mess up each others work.
In single processor systems this doesn't cause too many problems.
However in multiprocessor systems, you could have a situation where several threads on different processors or cores all wanted to run kernel code at the same time. This means that depending on the type of workload, you could have lots of processors, but all of them spend most of their time waiting for each other.
In Linux 2.6, the kernel resources were divided up into much smaller units, protected by individual locks, and the kernel code was reviewed to make sure that locks were only held while the corresponding resources were in use. So now different processors only have to wait for each other if they want access to the same resource (for example hardware resource).
The preemption allows the kernel to give the IMPRESSION of parallelism: you've got only one processor (let's say a decade ago), but you feel like all your processes are running simulaneously. That's because the kernel preempts (ie, take the execution out of) the execution from one process to give it to the next one (maybe according to their priority).
EDIT Not preemptive kernels wait for processes to give back the hand (ie, during syscalls), so if your process computes a lot of data and doesn't call any kind of yield function, the other processes won't be able to execute to execute their calls. Such systems are said to be cooperative because they ask for the cooperation of the processes to ensure the equity of the execution time
EDIT 2 The main goal of preemption is to improve the reactivity of the system among multiple tasks, so that's good for end-users, whereas on the other-hand, servers want to achieve the highest througput, so they don't need it: (from the Linux kernel configuration)
Preemptible kernel (low-latency desktop)
Voluntary kernel preemption (desktop)
No forced preemption (server)
The linux kernel is monolithic and give a little computing timespan to all the running process sequentially. It means that the processes (eg. the programs) do not run concurrently, but they are given a give timespan regularly to execute their logic. The main problem is that some logic can take longer to terminate and prevent the kernel to allow time for the next process. This results in system "lags".
A preemtive kernel has the ability to switch context. It means that it can stop a "hanging" process even if it is not finished, and give the computing time to the next process as expected. The "hanging" process will continue to execute when its time has come without any problem.
Practically, it means that the kernel has the ability to achieve tasks in realtime, which is particularly interesting for audio recording and editing.
The ubuntu studio districution packages a preemptive kernel as well as a buch of quality free software devoted to audio and video edition.
It means that the operating system scheduler is free to suspend the execution of the running processes to give the CPU to another process whenever it wants; the normal way to do this is to give to each process that is waiting for the CPU a "quantum" of CPU time to run. After it has expired the scheduler takes back the control (and the running process cannot avoid this) to give another quantum to another process.
This method is often compared with the cooperative multitasking, in which processes keep the CPU for all the time they need, without being interrupted, and to let other applications run they have to call explicitly some kind of "yield" function; naturally, to avoid giving the feeling of the system being stuck, well-behaved applications will yield the CPU often. Still,if there's a bug in an application (e.g. an infinite loop without yield calls) the whole system will hang, since the CPU is completely kept by the faulty program.
Almost all recent desktop OSes use preemptive multitasking, that, even if it's more expensive in terms of resources, is in general more stable (it's more difficult for a sigle faulty app to hang the whole system, since the OS is always in control). On the other hand, when the resources are tight and the application are expected to be well-behaved, cooperative multitasking is used. Windows 3 was a cooperative multitasking OS; a more recent example can be RockBox, an opensource PMP firmware replacement.
I think everyone did a good job of explaining this but I'm just gonna add little more info. in context of Linux IRQ, interrupt and kernel scheduler.
Process scheduler is the component of the OS that is responsible for deciding if current running job/process should continue to run and if not which process should run next.
preemptive scheduler is a scheduler which allows to be interrupted and a running process then can change it's state and then let another process to run (since the current one was interrupted).
On the other hand, non-preemptive scheduler can't take away CPU away from a process (aka cooperative)
FYI, the name word "cooperative" can be confusing because the word's meaning does not clearly indicate what scheduler actually does.
For example, Older Windows like 3.1 had cooperative schedulers.
Full credit to wonderful article here
I think it became preemptive from 2.6. preemptive means when a new process is ready to run, the cpu will be allocated to the new process, it doesn't need the running process be co-operative and give up the cpu.
Linux kernel is preemptive means that The kernel supports preemption.
For example, there are two processes P1(higher priority) and P2(lower priority) which are doing read system calls and they are running in kernel mode. Suppose P2 is running and is in the kernel mode and P2 is scheduled to run.
If kernel preemption is available, then preemption can happen at the kernel level i.e P2 can get preempted and but to sleep and the P1 can continue to run.
If kernel preemption is not available, since P2 is in kernel mode, system simply waits till P2 is complete and then

Developing Kernels to support Multiple CPUs

I am looking to get into operating system kernel development and figured my contribution would be to extend the SANOS operating system in order to support multiple core machines. I have been reading books on operating systems (Tannenbaum) as well as studying how BSD and Linux have tackled this challenge but still am stuck on several concepts.
Does SANOS need to have more sophisticated scheduling algorithms when it runs on multiple CPUs or will what is currently in place work fine?
I know that it is a good idea for threads to have affinity to a core that they were started on, but is this handled via scheduling or by changing the implementation of how threads are created?
What would need to be considered such that SANOS could run on a machine with hundreds of cores? From what I can tell, BSD and Linux at best only support a maximum of a dozen of cores.
Your reading material is good. SO no problems there. Also take a peek at the CS downloadable lectures on operating system design from Stanford.
The scheduling algorithm may need to be more sophisticated. This depends on the types of applications running and how greedy they are. Do they yield themselves or are they forced to. That kind of thing. This is more a question of what your processes want, or expect. A RTOS will have more complex scheduling than a desktop.
Threads should have an affinity to one core, because 2 threads in one process can execute in parallel ... but not at the same real-time on the same core. Putting them on different cores allows them to really-run-in-parallel. Also caching can be optimized for core affinity. This is really a mix of your thread implementation and scheduler. The sched may want to ensure threads are started at the same time on cores, rather than ad-hoc to reduce the amount of time threads wait on eachother and things. If your thread library is user-space, maybe it assigns core, or lets the scheduler decide based on capacity or recent deaths.
Scalability is often a kernel limit (which can be arbitrary). In Linux, if I recall, the limits are due to static sizing of arrays that hold CPU information structs in the scheduler. Hence they are a fixed size. This can be changed by recompiling the kernel. Most good scheduling algorithms will support a very large number of cores. As your core or processor count gets higher, you need to be careful that you don't fragment a processes execution too much. If a program has 2 threads, try and schedule them in close-time-proximity because causation may exist (through shared data) between them.
You also need to decide how your threads are implemented, and how a process is represented (be it heavy or lightweight) in the kernel. Are threads kernel managed? user-space managed? These things all have an impact on scheduler design. Look at how POSIX threads are implemented in various operating systems. There is just so much for you to think about :)
in short there are not really any straight-cut answers to where the logic does, or should reside. It is all down to design, application expectation, time-constraints (on the programs) and so on.
Hope this helps, I am not an expert here however.

Resources