Why is kernel preemption safe only when preempt_count == 0?

Linux kernel 2.6 introduced a new per-thread field---preempt_count---which is incremented/decremented whenever a lock is acquired/released. This field is used to allow kernel preemption: "If need_resched is set and preempt_count is zero, then a more important task is runnable and it is safe to preempt."
According to the "Linux Kernel Development" book by Robert Love:
"So when is it safe to reschedule? The kernel is capable of preempting a task running in the kernel so long as it does not hold a lock."
My question is: why isn't it safe to preempt a task running in the kernel while this task holds a lock?
If another task is scheduled and tries to grab the lock, it will block (or spin until its time slice ends), so we wouldn't get two threads inside the same critical section simultaneously. Can anyone please outline a problematic scenario that can arise if we preempt a task holding a lock in kernel mode?
Thanks!

While this is an old question, the accepted answer isn't correct.
First of all, an earlier version of the title asked:
Why is kernel preemption safe only when preempt_count > 0?
This isn't correct; it's the opposite: kernel preemption is disabled when preempt_count > 0, and enabled when preempt_count == 0.
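To make the bookkeeping concrete, here is a minimal sketch of the idea (illustrative pseudo-C only, not the actual kernel implementation; all names below are invented for the example):

    struct task_sketch {
        int preempt_count;   /* > 0 means preemption is disabled */
        int need_resched;    /* set when a more important task is runnable */
    };

    static void on_lock_acquired(struct task_sketch *t)
    {
        t->preempt_count++;              /* e.g. inside spin_lock() */
    }

    static void on_lock_released(struct task_sketch *t)
    {
        t->preempt_count--;              /* e.g. inside spin_unlock() */
        if (t->preempt_count == 0 && t->need_resched) {
            /* No locks held and a higher-priority task is waiting:
             * this is the one point where calling schedule() is safe. */
        }
    }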
Furthermore, the claim:
If another task is scheduled and tries to grab the lock, it will block (or spin until its time slice ends),
is not always true.
Say you acquire a spin lock and preemption stays enabled. A process switch happens, and in the context of the new process some softirq runs. Preemption is disabled while softirqs run. If one of those softirqs attempts to acquire your lock, it will never stop spinning, because preemption is disabled. Thus you have a deadlock.
You have no control over whether the process that preempts yours will run softirqs or not. The preempt_count field through which softirq handling is disabled belongs to the current process. Softirqs have to run with preemption disabled to preserve the per-CPU serialization of softirqs.
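To see the deadlock in code, consider a hand-rolled spin loop that does not disable preemption (a hypothetical sketch; the real spin_lock() bumps preempt_count precisely to rule this out):

    #include <stdatomic.h>

    /* Hypothetical lock that does NOT disable preemption. */
    static atomic_flag my_lock = ATOMIC_FLAG_INIT;

    void process_context(void)            /* preemptible process context */
    {
        while (atomic_flag_test_and_set(&my_lock))
            ;                             /* spin until acquired */
        /* <-- preempted here, still holding my_lock */
        atomic_flag_clear(&my_lock);
    }

    void softirq_context(void)            /* preemption disabled here */
    {
        while (atomic_flag_test_and_set(&my_lock))
            ;                             /* spins forever on this CPU: the
                                           * preempted process_context() can
                                           * never run to release the lock */
    }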

With the help of @Tsyvarev, I think I can now answer my own question and describe a problematic scenario in which we preempt a task that holds a lock in kernel mode.
Thread #1 holds a spin-lock and gets preempted.
Thread #2 is then scheduled, and spins to grab the spin-lock.
Now, if thread #2 is a conventional process, it will eventually finish its time slice. In that case, thread #1 will be scheduled again, release the lock, and we are all good.
But if thread #2 is a real-time process with a higher priority, thread #1 will never get to run again, and we have a deadlock.
This answer is corroborated by another Stack Overflow thread which cites the FreeBSD documentation:
While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe. For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute.
although the above quote doesn't explicitly explain why the "interrupted thread may never get a chance to execute" again.
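The scenario can even be reproduced from user space with a plain test-and-set lock, one CPU, and a SCHED_FIFO thread. The demo below is a hypothetical sketch (it assumes root privileges for SCHED_FIFO and being pinned to a single CPU, e.g. with taskset -c 0):

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <unistd.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void *holder(void *arg)        /* normal SCHED_OTHER thread */
    {
        while (atomic_flag_test_and_set(&lock))
            ;
        sleep(1);                         /* long critical section */
        atomic_flag_clear(&lock);
        return NULL;
    }

    static void *rt_spinner(void *arg)    /* SCHED_FIFO, outranks holder */
    {
        while (atomic_flag_test_and_set(&lock))
            ;                             /* spins forever: holder never
                                           * gets the CPU again */
        atomic_flag_clear(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 1 };

        pthread_create(&t1, NULL, holder, NULL);
        usleep(100000);                   /* let holder grab the lock */

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);
        pthread_create(&t2, &attr, rt_spinner, NULL);

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }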

Related

How does the scheduler know that a thread is blocked waiting for input?

When a thread executing user code is waiting for input, how does the scheduler know to interrupt it, or how does the thread know to call the scheduler? The average programmer of a simple single-threaded application is unlikely to insert sched_yield() everywhere. Does the compiler insert sched_yield() during optimisation, does the thread just spin until the general timer interrupt set by the scheduler fires, or does the user have to explicitly call wait() or sleep() functions in order for the context to switch?
This question is especially relevant if the scheduler is not preemptive, because then the thread has to invoke the scheduler itself while waiting for input for throughput to be effective, but I'm not sure how it does this.
Be careful not to confuse preemption with the ability of a process to sleep. Processes can sleep even with a non-preempting scheduler. This is what happens when a process is waiting for I/O. The process makes a system call such as read() and the device determines no data is available. It then internally puts the process to sleep by updating a data structure used by the scheduler. The scheduler then executes other processes until an interrupt or some other event occurs that wakes the original process. The awoken process then becomes eligible again for scheduling.
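For example, this ordinary program sleeps inside the read() system call until input arrives; nowhere does the programmer call sched_yield():

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];
        /* If no input is available, the kernel marks this process as
         * sleeping and runs other processes; we resume when data arrives. */
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n > 0)
            printf("read %zd bytes\n", n);
        return 0;
    }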
On the other hand preemption is the ability of an architecture's scheduler to stop execution of a process without its cooperation. The interruption can occur anywhere in the program's instruction stream. Control returns to the scheduler which can then execute other processes and return to the interrupted (preempted) process later. Most schedulers allocate time slices where a process is allowed to run for up to a predetermined amount of time, after which it is preempted if higher-priority processes need time slices.
Unless you're writing drivers or kernel code, you don't need to worry about the underlying mechanisms too much. When writing user-space applications the key concepts are (1) that some system calls may block which means your process is put to sleep until an event occurs, and (2) on preemptible systems (all mainstream modern operating systems) your program may be preempted at any time so that other processes can run.
* Note that on some platforms, such as Linux, a thread is really just another process which shares its virtual address space with other processes. Processes and threads are therefore treated exactly the same by the scheduler.
It is not clear to me whether your question is about theory or practice. In practice, in every modern operating system, I/O operations are privileged, meaning that in order for a user process or thread to access files, devices and so on, it must issue a system call.
The kernel then has the opportunity to do whatever it considers appropriate. For example, it can check whether the I/O operation will block and, if so, switch out the running process (i.e. call the scheduler) after issuing the operation.
Note that this mechanism can work even when there is no timer interrupt handled by the kernel. In any case it will depend on your system. For example, in an embedded system with no OS (or a minimal one), it could be entirely the responsibility of the user's code to invoke the scheduler before issuing a blocking operation.
It is the kernel that can be preemptive, not the scheduler.
First, sched_yield() and wait() are forms of voluntary preemption, where the process itself gives up the CPU even if the kernel is non-preemptive.
If the kernel has the ability to switch to another process when the time quantum has expired or a higher-priority process becomes runnable, then we are talking about involuntary preemption, i.e. a preemptive kernel, and it can happen at the places explained below.
The difference is that with sched_yield() the process stays in the runnable TASK_RUNNING state but just goes to the end of the run queue for its static priority. The process must wait to get the CPU again.
On the other hand, wait() puts the process into a sleeping TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state on a wait queue, calls schedule(), and waits for an event to occur. When the event occurs, the process is moved to the run queue again. But that doesn't mean it will get the CPU immediately.
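A minimal illustration of the sched_yield() side (busy_wait_politely is a made-up example function):

    #include <sched.h>

    void busy_wait_politely(volatile int *flag)
    {
        /* The thread stays TASK_RUNNING, but on each iteration it moves
         * to the back of the run queue for its priority instead of
         * monopolizing the CPU. */
        while (!*flag)
            sched_yield();
    }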
Here is an explanation of when schedule() is called after a process is woken up:
Wakeups don't really cause entry into schedule(). They add a task to the run-queue and that's it.
If the new task added to the run-queue preempts the current task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets called on the nearest possible occasion:
- If the kernel is preemptible (CONFIG_PREEMPT=y):
  - in syscall or exception context, at the next outmost preempt_enable() (this might be as soon as the wake_up()'s spin_unlock()!)
  - in IRQ context, on return from the interrupt handler to preemptible context
- If the kernel is not preemptible (CONFIG_PREEMPT is not set), then at the next:
  - cond_resched() call
  - explicit schedule() call
  - return from syscall or exception to user-space
  - return from interrupt handler to user-space
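For the non-preemptible case, long-running kernel loops provide those reschedule points explicitly. A kernel-style sketch (long_running_work is a hypothetical function; cond_resched() is the real API):

    #include <linux/sched.h>

    static void long_running_work(void)
    {
        for (int i = 0; i < 1000000; i++) {
            /* ... process one chunk of work ... */
            cond_resched();   /* calls schedule() here if TIF_NEED_RESCHED
                               * is set; cheap no-op otherwise */
        }
    }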

What does it mean that code holding a semaphore can be preempted?

I was going through the Robert Love book and was a bit confused by this line. What does it mean that code holding a semaphore can be preempted?
If an interrupt occurs that accesses the same variable the user-space application is using while it is executing code in its critical section, can the user-space application be preempted?
If my understanding above is true, is there no alternative to spin-locks that disable interrupts whenever a user-space application is in a critical section?
So what is the use of a semaphore in the context of an OS? Interrupts might occur at any time while the user application is in a critical section, and in order to avoid interrupt intervention we would need to use spin-locks all the time.
What does it mean that code holding a semaphore can be preempted?
It means that a process currently running in its critical section, holding some lock for the purpose of synchronization, can be preempted. Interrupts generally have the highest priority, so unless you disable interrupts on that processor core, the running process can be preempted, and that might happen while the process is in its critical section.
While there are several spin_lock_XXX APIs that disable interrupts, you will usually want spin_lock_irqsave, as it saves the interrupt flags on that core and restores them when the lock is released.
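A kernel-style sketch of that pattern (dev_lock and count_event are made-up names; the locking APIs are real):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(dev_lock);
    static int dev_events;            /* shared with an interrupt handler */

    void count_event(void)            /* called from process context */
    {
        unsigned long flags;

        spin_lock_irqsave(&dev_lock, flags);      /* disables local IRQs and
                                                   * saves their prior state */
        dev_events++;                             /* critical section */
        spin_unlock_irqrestore(&dev_lock, flags); /* restores IRQ state */
    }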

How do spinlocked threads avoid the overhead of context switching?

In Wikipedia, the article about spinlocks says:
Because they avoid overhead from operating system process rescheduling or context switching, spinlocks are efficient if threads are likely to be blocked for only short periods.
I actually can't grasp this sentence.
I think that even if a thread has a spinlock, it's going to be rescheduled; am I wrong?
The context-switching overhead (saving the registers, PC, and scheduling-queue state) is constant for all switches, isn't it?
I actually can't grasp this sentence. I think that even if a thread has a spinlock, it's going to be rescheduled; am I wrong?
Eventually it would be... when its timeslice expired.
What a spinlock avoids is the chance of the thread being context-switched out immediately whenever it tries to acquire a lock that is already held by another thread.
(In the traditional mutex case, when the mutex is already locked, the thread would immediately be put to sleep, i.e. context-switched out, and it would not be reawoken until after the other thread had unlocked the mutex. In the spinlock case, on the other hand, the thread just keeps checking the spinlock's state in a tight loop until the spinlock is no longer locked, and then locks the spinlock for itself. Note that at no point during that process does the thread ask the kernel to put it to sleep, although if it spun for a long time it's possible the kernel would do so anyway. A program using spinlocks should never hold them for a long time, since spinning is really inefficient.)
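A minimal user-space sketch of the difference (the myspin_* names are made up; initialize the flag with ATOMIC_FLAG_INIT):

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } myspin_t;

    static void myspin_lock(myspin_t *l)
    {
        /* Never asks the kernel to put us to sleep; just retries. */
        while (atomic_flag_test_and_set_explicit(&l->held,
                                                 memory_order_acquire))
            ;   /* busy-wait */
    }

    static void myspin_unlock(myspin_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }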
The context-switching overhead (saving the registers, PC, and scheduling-queue state) is constant for all switches, isn't it?
Yes, I believe it is.
Generally an OS is only going to use spinlocks in interrupt service routines. These are designed to be of short duration.
I actually can't grasp this sentence. I think that even if a thread has a spinlock, it's going to be rescheduled; am I wrong?
Not while it is handling an interrupt (simplifying here by assuming there is only one IPL). That interrupt might be the timer interrupt, where a context switch may take place. However, in that situation, the spinlock wait would be for the resources necessary to process a context switch.

Lock Holder Preemption

Could you have the following scenario in concurrent programs?
Suppose a thread acquires a lock to execute a critical section. Then, before the critical section is executed, the processor preempts the thread. The new thread that comes up for execution needs the lock held by the old (preempted) thread, so the current thread can't proceed (it hangs until it gets preempted). Is there a mechanism in operating systems to keep threads from being preempted until the lock is released?
It is possible for a thread holding a mutex to be preempted while executing a critical section. If the thread that the OS switches to tries to acquire that mutex and finds that it is already locked, then that thread should be context switched out immediately. The thread scheduler should be smart enough to not switch back to that thread until it has switched back to the thread holding the mutex and the mutex is released.
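In user space this is essentially what a sleeping mutex gives you (a minimal sketch; worker is a made-up function):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg)
    {
        pthread_mutex_lock(&m);    /* sleeps (is switched out) if already held */
        /* ... critical section ... */
        pthread_mutex_unlock(&m);  /* wakes a waiter, making it runnable again */
        return NULL;
    }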
If you are writing kernel code then yes, there are mechanisms for preventing a thread from being preempted.
For standard code there is no such thing. Some operations are atomic and are guaranteed atomic by the compiler and kernel, but right after those operations the thread may be preempted, and it can remain preempted for an indeterminate amount of time (unless the system is a real-time system).

What guarantees that a thread spinning on a lock on a multiprocessor runs on a different processor?

I know a spin lock only works on a multiprocessor. But if two threads try to acquire the same resource and one is put on a spinlock, what prevents the other one from running on the same processor? If that happens, the one spinning on the lock will prevent the one holding the resource from proceeding. In this case it becomes a deadlock. How does the OS prevent this from happening?
Some background facts first:
Spin-locks (and locks generally) are not limited to multiprocessor systems. They work fine on a single processor, and even a single-threaded application can use them without any harm.
Spin-locks are not provided only by the OS; they have pure user-space implementations as well. For example, TBB provides tbb::spin_mutex.
By default, nothing prevents a thread from running on any available CPU (regardless of the locks it uses).
There are reentrant/recursive types of locks. This means that if a thread has acquired one once and tries to acquire it again without releasing it, it will succeed rather than deadlock as usual locks do. But that does not mean the same applies to different threads just because they are scheduled on the same CPU. With any type of lock, if one software thread has locked a mutex, other threads have to wait.
It is possible for one thread to acquire the lock and be preempted (i.e. interrupted by the OS timer) before it releases the lock. Another thread can be scheduled onto the same CPU, and it might want to acquire the same lock. In the case of pure spin-locks, this thread will uselessly spin until it exhausts the time slice allowed by the OS and is preempted. Finally, the first thread gets a chance to run and release its lock, so the other thread is able to acquire it.
As you can see, it is not very efficient to spend time on this hopeless waiting. Thus, more sophisticated implementations, after a number of attempts to acquire the spinlock, ask the OS for help and voluntarily give away the rest of their time slice to other threads, which may be able to release the lock the spinning thread is waiting for.
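A sketch of that spin-then-yield strategy (adaptive_lock and SPIN_TRIES are made-up names; the constant is an arbitrary tuning value):

    #include <sched.h>
    #include <stdatomic.h>

    #define SPIN_TRIES 1000    /* arbitrary number of spins before yielding */

    static void adaptive_lock(atomic_flag *lock)
    {
        for (;;) {
            for (int i = 0; i < SPIN_TRIES; i++)
                if (!atomic_flag_test_and_set_explicit(lock,
                                                       memory_order_acquire))
                    return;    /* acquired while spinning */
            sched_yield();     /* give the time slice to other threads */
        }
    }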
