The preempt_count variable keeps track of per-CPU statistics:
static __always_inline int preempt_count(void)
{
return current_thread_info()->preempt_count;
}
Bits 0-7 keep track of how many times kernel preemption has been disabled.
Bits 8-15, if non-zero, mean that softirqs have been disabled that many times.
Bits 16-27 specify how many calls to irq_enter() have happened, i.e. the number of nested interrupt handlers.
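For reference, here is a rough sketch of how that bit layout could be decoded. The widths and names below just mirror the description above and are assumptions; the real definitions live in <linux/preempt.h> and vary between kernel versions.

/* Assumed layout: 8 preempt bits, 8 softirq bits, 12 hardirq bits. */
#define PREEMPT_BITS     8
#define SOFTIRQ_BITS     8
#define HARDIRQ_BITS    12

#define PREEMPT_SHIFT    0
#define SOFTIRQ_SHIFT   (PREEMPT_SHIFT + PREEMPT_BITS)   /* 8 */
#define HARDIRQ_SHIFT   (SOFTIRQ_SHIFT + SOFTIRQ_BITS)   /* 16 */

#define PREEMPT_MASK    (((1 << PREEMPT_BITS) - 1) << PREEMPT_SHIFT)
#define SOFTIRQ_MASK    (((1 << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (((1 << HARDIRQ_BITS) - 1) << HARDIRQ_SHIFT)

/* Sketch helpers (made-up names): */
static inline int preempt_depth(int count) { return count & PREEMPT_MASK; }                      /* bits 0-7 */
static inline int softirq_depth(int count) { return (count & SOFTIRQ_MASK) >> SOFTIRQ_SHIFT; }   /* bits 8-15 */
static inline int hardirq_depth(int count) { return (count & HARDIRQ_MASK) >> HARDIRQ_SHIFT; }   /* bits 16-27 */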
I am not able to comprehend why it is sufficient for preempt_count to be
per thread.
When a new process is scheduled, of course bits 0-7 will be zero; otherwise it would mean preemption is disabled and a switch is not allowed. But what about bits 8-27? Will they be 0 too?
Does it mean that whenever schedule() is called, preempt_count should be 0 at that point, and hence its value does not need to be copied across the thread_info of different processes to keep track of the status of softirqs and irqs on a particular CPU?
Linux forbids a thread from being scheduled while in interrupt context. This is a convention; there is no code to enforce the restraint. So under this convention the preempt_count of the new thread must be zero, and there is no need to copy preempt_count. If somebody calls schedule() in interrupt context there may be other problems, for example new interrupts cannot be processed because the previous interrupt has disabled interrupts.
current_thread_info() is architecture dependent. It points to the end of the kernel stack. Before 2.6, the task_struct lived at the end of the kernel stack, but it was then moved to the slab allocator and replaced there by thread_info. Each process has a thread_info that lies at the end of its kernel stack, and current_thread_info() returns an architecture-dependent pointer to it (calculated from the stack pointer and stack size). thread_info holds preempt_count, which is a counter: if its value is zero, the kernel can be preempted (preempt_count += SOFTIRQ_OFFSET disables bottom halves). preempt_count is incremented when a lock is taken and decremented when the lock is released. When an IRQ handler finishes and the instruction pointer is about to return to kernel code, the kernel checks whether preempt_count is zero, meaning preemption of that code is safe, and whether need_resched in the same thread_info is set; if so, the scheduler should start the more important task. If preempt_count is non-zero, preemption is not safe and execution must return to the same task that was interrupted. Only when all locks have been released by the current task and preempt_count == 0 can the code be preempted safely, and then need_resched is checked again.
On the other hand, the preempt_count() function can be very helpful together with preempt_disable() and preempt_enable(), two functions that nest. Disabling preemption can protect a per-CPU variable from concurrent access by more than one task using the same per-CPU variable: it is important that only one task works with a shared per-CPU variable at a time (otherwise it could be changed by another task in the middle), so preemption should be disabled around the access. Because these functions nest, they can rely on preempt_count(). in_atomic() is based on preempt_count(), and that actually explains why it is not a very good function.
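To make that per-CPU protection pattern concrete, here is a hedged sketch. The variable name my_counter and the helper bump_my_counter() are made up; preempt_disable()/preempt_enable() and the per-CPU helpers are real kernel APIs.

#include <linux/percpu.h>
#include <linux/preempt.h>

/* Hypothetical per-CPU counter, used only for illustration. */
DEFINE_PER_CPU(int, my_counter);

static void bump_my_counter(void)
{
        preempt_disable();             /* increments preempt_count */
        __this_cpu_inc(my_counter);    /* safe: no migration or preemption here */
        preempt_enable();              /* decrements preempt_count, may reschedule */
}

Because preempt_disable()/preempt_enable() nest, bump_my_counter() can also be called from code that has already disabled preemption; only the outermost preempt_enable() re-enables it.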
With KLTs, each thread gets its own stack, right? And those details are maintained in a distinct PCB for each thread, along with the different page tables, right? How would this apply to user-level threads? Do all the threads in a ULT model have different stacks? If so, how is it implemented?
Edit: I have since figured out that this exact question has been asked here more than 10 years ago. Unfortunately, it hasn't been answered adequately there either.
In the Linux kernel, you'll see kernel threads when the bottom half of an interrupt handler hasn't completed and its work is deferred to another thread. For example, an interrupt occurs, the top half of the interrupt handler runs with interrupts disabled and then adds the bottom half to the queue of threads (in reality, it is more complex than that). This creates kernel threads. Kernel threads are given high priority so they run fast because, most likely, a user thread is waiting for their completion.
Kernel threads have their own stack, which is created along with them in the top half of an interrupt handler (when a kernel thread is created, its stack is created). As far as I know, each core has one interrupt stack for servicing interrupts. Kernel threads have their own task_struct but no address space. Most likely, they are basically a driver's servicing function that is supposed to do some work on behalf of a device that was queried by a user-mode thread. For example, let's say thread A makes a syscall to read from disk. The driver used for that disk will write some registers of the hard-disk controller to start a DMA operation from the disk. When the operation is done, the controller triggers an interrupt. During execution of the top half, the interrupt stack is used and further interrupts are disabled. The top half creates a new kernel thread, which is added to the queue of ready threads with a high priority. Eventually, that kernel thread runs (with its own task_struct and stack) and finishes. When it finishes, it will place the user-mode thread on behalf of which this operation was done back in the ready queue.
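One concrete shape of that top-half/bottom-half split is the threaded-IRQ API, sketched below under some assumptions: request_threaded_irq() is a real kernel function, while the handler names, MY_IRQ and my_dev are made up. The thread_fn runs in its own kernel thread, much like the bottom-half kernel thread described above.

#include <linux/interrupt.h>

/* Top half: runs in hard-IRQ context with the IRQ line masked.
 * Acknowledge the device quickly and defer the real work. */
static irqreturn_t my_top_half(int irq, void *dev_id)
{
        return IRQ_WAKE_THREAD;        /* wake the handler thread below */
}

/* Bottom half: runs in a dedicated kernel thread and may sleep,
 * e.g. before waking the user task waiting for the data. */
static irqreturn_t my_bottom_half(int irq, void *dev_id)
{
        return IRQ_HANDLED;
}

/* Registration, e.g. in a driver probe routine (MY_IRQ and my_dev are
 * hypothetical):
 *   request_threaded_irq(MY_IRQ, my_top_half, my_bottom_half,
 *                        IRQF_ONESHOT, "mydev", my_dev);
 */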
With the Linux kernel, user threads all have 2 stacks: one for their user mode operations and one for their kernel mode operations (during a syscall). Each user mode stack is given a fixed size (in virtual memory). Since you seem to have some misconceptions, you can read some of my answers for more details:
Process of State
What is paging exactly? OSDEV
Understanding how operating systems store/retrieve IO device input
If
CPU has mode (privilege level) (Added because not all processors have privilege levels according to here)
CPU is multi-core (CISC / x86-64 instruction set)
scheduling is round robin scheduling
thread is kernel managed thread
OS is windows if necessary
I want to know the simplified core execution flow, from the point of view of thread context switches and CPU mode switches, per time slice.
My understanding is as follows. Please correct me if I'm wrong.
In the case where the thread is a kernel-managed, user-mode thread not involving interrupts or anything else that requires kernel mode:
1. The thread context switch occurs.
2. The core executing the thread switches to kernel mode, because a context switch can only occur in kernel mode according to here, here and here, unless the thread is a user-managed thread.
3. The core executing the thread switches to user mode.
4. The core executes the sequence of instructions located in user space.
5. The time slice expires.
6. Repeat from 1.
The closest related diagram I could find is below.
Even a little clue toward an answer will be sincerely appreciated.
You said it yourself:
context switch can only occur in kernel mode
So the CPU must enter kernel mode before there can be a context switch. That can happen in either one of two ways in most operating systems:
The user-mode code makes a system call, or
An interrupt occurs.
If the thread enters kernel mode by making a system call, then there could be a context switch if the syscall causes the thread to no longer be runnable (e.g., a sleep() call), or there could be a context switch if the syscall causes some higher-priority thread to become runnable. (e.g., the syscall releases a mutex that the higher priority thread was awaiting.)
If the thread enters kernel mode because of an interrupt, then there could be a context switch because the interrupt handler made some higher-priority thread runnable (e.g., if the other thread was awaiting data from the disk), or there could be a context switch because it was a timer interrupt, and the current thread's time slice has expired.
The mechanism of context switching may be different on different hardware platforms. Here's how it could happen on some hypothetical CPU:
The current thread (threadA) enters scheduler code, which chooses some other thread (threadB) as the next to run on the current CPU.
It calls some switchContext(threadB) function.
The switchContext function copies values from the stack pointer register, and from other live registers into the current thread (threadA)'s saved context area.*
It then sets the "current thread" pointer to point to threadB's saved context area, and it restores threadB's context by copying all the same things in reverse.**
Finally, the switchContext function returns... IN threadB,... at exactly the place where threadB last called it.
Eventually, threadB returns from the interrupt or system call to application code running in user-mode.
* The author of switchContext may have to be careful, may have to do some tricky things, in order to save the entire context without trashing it. E.g., it had better not use any register that needs saving before it has actually saved it somewhere.
** The trickiest part is when restoring the stack pointer register. As soon as that happens, "the" stack suddenly is threadB's stack instead of threadA's stack.
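The kernel's real switch is architecture-specific assembly, precisely because of the register and stack juggling in the footnotes, but the same save/restore mechanism can be played with in user space via the POSIX ucontext API. A hedged, simplified sketch (the thread functions and stack size are made up):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctxA, ctxB;

static void threadB_func(void)
{
        printf("threadB: running, now switching back\n");
        swapcontext(&ctxB, &ctxA);   /* save B's registers/SP, restore A's */
}

int main(void)
{
        static char stackB[64 * 1024];   /* threadB gets its own stack */

        getcontext(&ctxB);
        ctxB.uc_stack.ss_sp = stackB;
        ctxB.uc_stack.ss_size = sizeof(stackB);
        ctxB.uc_link = &ctxA;
        makecontext(&ctxB, threadB_func, 0);

        printf("threadA: switching to threadB\n");
        swapcontext(&ctxA, &ctxB);   /* save A's context, restore B's */
        printf("threadA: back again, exactly where it called swapcontext\n");
        return 0;
}

As in the description above, the moment the stack pointer is restored, "the" stack is suddenly the other thread's stack, and each swapcontext() call "returns" in the thread being resumed.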
I was reading through the ARM kernel source code in order to better my understanding and came across something interesting.
Inside arch/arm/kernel/entry-armv.S there is a macro named vector_stub, that generates a small chunk of assembly followed by a jump table for various ARM modes. For instance, there is a call to vector_stub irq, IRQ_MODE, 4 which causes the macro to be expanded to a body with label vector_irq; and the same occurs for vector_dabt, vector_pabt, vector_und, and vector_fiq.
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
I'd like to confirm that my understanding is accurate, please see below.
Does this mean that labels with the _usr suffix are executed, only if the interrupt arises when the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs when the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs when the kernel thread is in userspace context, and so on.
If [1] is true, then which kernel threads are responsible for handling, say, irqs with a different suffix such as irq_svc? I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
If [2] is true, then at what point does the kernel thread finish processing the second interrupt and return to where it had left off (also in SVC mode)? Is it ret_from_intr?
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
This is correct. The table is indexed by the current mode. For instance, irq only has three entries: irq_usr, irq_svc, and irq_invalid. IRQs should be disabled during data aborts, FIQ, and the other modes. Linux will always transfer to SVC mode after this brief 'vector stub' code. It is accomplished with:
#
# Prepare for SVC32 mode. IRQs remain disabled.
#
mrs r0, cpsr
eor r0, r0, #(\mode ^ SVC_MODE | PSR_ISETSTATE)
msr spsr_cxsf, r0
### ... other unrelated code
movs pc, lr # branch to handler in SVC mode
This is why irq_invalid is used for all other modes. Exceptions should never happen when this vector stub code is executing.
Does this mean that labels with the _usr suffix are executed, only if the interrupt arises when the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs when the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs when the kernel thread is in userspace context, and so on.
Yes, the spsr holds the interrupted mode, and the table is indexed by those mode bits.
If 1 is true, then which kernel threads are responsible for handling, say irqs, with a different suffix such as irq_svc. I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
I think you have some misunderstanding here. There is a 'kernel thread' (a kernel-mode context and stack) for user space processes. The irq_usr code is responsible for storing the user-mode registers, as a reschedule might take place. The context is different for irq_svc, as a kernel stack was in use and it is the same one the IRQ code will use. What happens when a user task calls read()? It uses a system call and the code executes in a kernel context. Each process has both a user and an svc/kernel stack (and thread_info). A kernel thread is a process without any user space stack.
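To make that last point concrete, here is a hedged sketch of creating a kernel thread with the kthread API (the worker function and its name are made up; kthread_run(), kthread_should_stop() and kthread_stop() are real). The resulting task has its own task_struct and kernel stack, but no user-space stack or mm.

#include <linux/kthread.h>
#include <linux/delay.h>

/* Hypothetical worker; runs entirely in kernel context. */
static int my_worker(void *data)
{
        while (!kthread_should_stop()) {
                /* do some deferred work here, then sleep */
                msleep(100);
        }
        return 0;
}

/* Somewhere in init code:
 *   struct task_struct *t = kthread_run(my_worker, NULL, "my_worker");
 *   ...
 *   kthread_stop(t);
 */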
If 2 is true, then at what point does the kernel thread finish processing the second interrupt and return to where it had left off (also in SVC mode)? Is it ret_from_intr?
Generally, Linux returns to the kernel thread that was interrupted so it can finish its work. However, there is a configuration option for preempting svc threads/contexts. If the interrupt resulted in a reschedule event, then a process/context switch may result if CONFIG_PREEMPT is active. See svc_preempt for this code.
See also:
Linux kernel arm exception stack init
Arm specific irq initialization
Linux kernel 2.6 introduced a new per-thread field---preempt_count---which is incremented/decremented whenever a lock is acquired/released. This field is used to allow kernel preemption: "If need_resched is set and preempt_count is zero, then a more important task is runnable and it is safe to preempt."
According to the "Linux Kernel Development" book by Robert Love:
"So when is it safe to reschedule? The kernel is capable of preempting a task running in the kernel so long as it does not hold a lock."
My question is: why isn't it safe to preempt a task running in the kernel while this task holds a lock?
If another task is scheduled and tries to grab the lock, it will block (or spin until its time slice ends), so we wouldn't get the two threads simultaneously inside the same critical section. Can anyone please outline a problematic scenario in case we do preempt a task that holds a lock in kernel-mode?
Thanks!
While this is an old question, the accepted answer isn't correct.
First of all the title is asking:
Why kernel preemption is safe only when preempt_count > 0?
This isn't correct, it's the opposite. Kernel preemption is disabled when preempt_count > 0, and enabled when preempt_count == 0.
Furthermore, the claim:
If another task is scheduled and tries to grab the lock, it will block (or spin until its time slice ends),
Is not always true.
Say you acquire a spin lock. Preemption is enabled. A process switch happens, and in the context of the new process some softirq runs. Preemption is disabled while running softirqs. If one of those softirqs attempts to acquire your lock, it will never stop spinning because preemption is disabled. Thus you have a deadlock.
You have no control over whether the process that preempts yours will run softirqs or not. The preempt_count field where you disable softirqs is process-specific. Softirqs have to run with preemption disabled to preserve the per-cpu serialization of softirqs.
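A standard, closely related instance of this hazard is sketched below, under the assumption that a lock is shared between process context and a softirq on the same CPU. The names my_lock and the update functions are made up; spin_lock_bh() is the real API that also raises the softirq part of preempt_count.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);

/* Risky if my_lock is also taken from softirq context: a softirq can run
 * on this CPU while we hold the lock and then spin forever, since it
 * cannot be preempted. */
static void update_unsafe(void)
{
        spin_lock(&my_lock);
        /* ... touch data shared with the softirq ... */
        spin_unlock(&my_lock);
}

/* Safe: spin_lock_bh() also disables bottom halves (bumping the softirq
 * bits of preempt_count), so no softirq runs on this CPU while the lock
 * is held. */
static void update_safe(void)
{
        spin_lock_bh(&my_lock);
        /* ... touch data shared with the softirq ... */
        spin_unlock_bh(&my_lock);
}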
With the help of @Tsyvarev, I think I can now answer my own question and depict a problematic scenario in which we do preempt a task that holds a lock in kernel mode.
Thread #1 holds a spin-lock and gets preempted.
Thread #2 is then scheduled, and spins to grab the spin-lock.
Now, if thread #2 is a conventional process, it will eventually finish its time slice. In that case, thread #1 will be scheduled again, release the lock, and we are all good.
But if thread #2 is a real-time process of a higher priority, thread #1 will never get to run again and we have a deadlock.
This answer is corroborated by another stackoverflow thread which cites the FreeBSD documentation:
While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe. For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute.
although the above quote doesn't explicitly explain why the "interrupted thread may never get a chance to execute" again.
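One way to see why this scenario does not arise with mainline Linux spin locks: taking the lock itself bumps preempt_count, so the holder cannot be preempted in the first place. A simplified, hedged sketch of that shape is below; debugging, lockdep and RT variants are omitted, and arch_acquire()/arch_release() are placeholders for the architecture-specific lock loop.

static inline void sketch_spin_lock(spinlock_t *lock)
{
        preempt_disable();      /* preempt_count++: the holder cannot be preempted */
        arch_acquire(lock);     /* placeholder: spin until the lock is ours */
}

static inline void sketch_spin_unlock(spinlock_t *lock)
{
        arch_release(lock);     /* placeholder: release the hardware lock */
        preempt_enable();       /* preempt_count--; may reschedule if it reaches 0
                                   and need_resched is set */
}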
I have read the Intel document about memory ordering on x64: http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf. It says that locked instructions cause full barriers, which make processors see, for example, updates in a specified order. But there is nothing about the visibility guaranteed by barriers. Do barriers cause other processors to see updates of variables immediately, or do updates only propagate to other processors in a specified order but at no specified time?
E.g.
Thread1:
flag = true;
MemoryBarrier();
Thread 2:
MemoryBarrier();
tmp = flag;
Will Thread 2 always see flag == true if Thread 1 executes its code before Thread 2?
The barriers guarantee that other processors will see updates in the specified order, but not when that happens.
Which brings up the follow-up question: how do you define "immediately" in a multiprocessor system [1], or how do you ensure that Thread 1 executes before Thread 2? In this case, one answer would be that Thread 1 uses an atomic instruction such as xchg to do the store to the flag variable, and then Thread 2 spins on the flag and proceeds when it notices that the value changes (due to the way the x86 memory model works, Thread 2 can spin using normal load instructions; it is sufficient that the store is done with an atomic).
[1] One can think of it in terms of relativistic physics, each observer (thread) sees events through its own "light cone". Hence one must abandon concepts such as a single universal time for all observers.
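For what it's worth, here is a hedged sketch of the "atomic store plus spin on loads" pattern described above, using C11 atomics instead of hand-written xchg; the variable and function names are made up.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag;
static int data;

void thread1(void)                       /* the publisher */
{
        data = 42;                       /* ordinary store */
        atomic_exchange(&flag, true);    /* full barrier + store (a locked xchg on x86) */
}

void thread2(void)                       /* the consumer */
{
        while (!atomic_load(&flag))      /* plain atomic loads (seq_cst by default) suffice */
                ;                        /* spin */
        /* data == 42 is guaranteed to be visible here */
}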