A process's virtual address space contains 1 GB of kernel space.
I assume that this 1 GB of kernel space maps data and code related to the kernel, including the Interrupt Descriptor Table (IDT).
Now let's say that some process is being executed by the CPU and it makes a system call (fires interrupt 0x80 via int 0x80). The CPU will go to the IDT and execute the interrupt handler associated with interrupt number 0x80.
Will the CPU stay in the current process and execute the interrupt handler from the kernel space of the current process (so that no context switch occurs)?
Related
With KLT (kernel-level threads), each thread gets its own stack, right? And those details are maintained in a distinct PCB for each thread, along with the different page tables, right? How would this apply to user-level threads? Do all the threads in ULT have different stacks? If so, how is it implemented?
Edit: I have since figured out that this exact question has been asked here more than 10 years ago. Unfortunately, it hasn't been answered adequately there either.
On Linux, you'll see kernel threads when the bottom half of an interrupt handler is not completed in interrupt context but is deferred to another thread. For example, an interrupt occurs, the top half of the interrupt handler runs with interrupts disabled and then queues the bottom half to run in a thread (in reality, it is more complex than that). This is where such kernel threads come from. Kernel threads are given high priority so they run quickly because, most likely, a user thread is waiting for their completion.
Kernel threads have their own stack, which is created along with the thread (for example, in the top half of an interrupt handler). As far as I know, each core also has one interrupt stack for servicing interrupts. Kernel threads have their own task_struct but no user address space. Most likely, they are basically a driver's servicing function that is supposed to do some work on behalf of a device that was queried by a user-mode thread.

For example, let's say thread A makes a syscall to read from disk. The driver for that disk writes some registers of the hard-disk controller to start a DMA operation from the disk. When the operation is done, the controller triggers an interrupt. During execution of the top half, the interrupt stack is used and further interrupts are disabled. The top half creates a new kernel thread, which is added to the queue of ready threads with high priority. Eventually that kernel thread runs (with its own task_struct and stack) and finishes. When it finishes, it places the user-mode thread on whose behalf this operation was done back in the ready queue.
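The top-half/bottom-half split described above is what Linux's threaded-IRQ API packages up. Below is a minimal, hypothetical sketch of that pattern; the IRQ number, the "mydev" name and the cookie are invented for illustration, and a real driver would get them from its bus/device:

/*
 * Hypothetical sketch of a top half that defers the real work to a
 * per-IRQ kernel thread via request_threaded_irq().
 */
#include <linux/module.h>
#include <linux/interrupt.h>

#define MY_IRQ 42                       /* hypothetical interrupt line */

static int my_cookie;                   /* dev_id handed to the handlers */

/* Top half: runs in hard-IRQ context; do the minimum (e.g. acknowledge
 * the device) and ask the core to wake the handler's kernel thread. */
static irqreturn_t my_top_half(int irq, void *dev)
{
	return IRQ_WAKE_THREAD;
}

/* Bottom half: runs in a dedicated kernel thread (visible as irq/42-mydev),
 * so it may sleep, take mutexes, and hand data to waiting user threads. */
static irqreturn_t my_bottom_half(int irq, void *dev)
{
	return IRQ_HANDLED;
}

static int __init my_init(void)
{
	return request_threaded_irq(MY_IRQ, my_top_half, my_bottom_half,
				    0, "mydev", &my_cookie);
}

static void __exit my_exit(void)
{
	free_irq(MY_IRQ, &my_cookie);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");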
With the Linux kernel, user threads all have 2 stacks: one for their user mode operations and one for their kernel mode operations (during a syscall). Each user mode stack is given a fixed size (in virtual memory). Since you seem to have some misconceptions, you can read some of my answers for more details:
Process of State
What is paging exactly? OSDEV
Understanding how operating systems store/retrieve IO device input
If:
- the CPU has modes (privilege levels) (added because not all processors have privilege levels, according to here),
- the CPU is multi-core (CISC / x86-64 instruction set),
- scheduling is round-robin scheduling,
- the thread is a kernel-managed thread,
- the OS is Windows, if that matters,
then I want to know the simplified core execution flow with respect to thread context switches and CPU mode switches per time slice.
My understanding is as follows. Please correct me if I'm wrong.
In the case where the thread is a kernel-managed, user-mode thread and nothing else (an interrupt, etc.) requires kernel mode:
1. A thread context switch occurs.
2. The core executing the thread switches to kernel mode, because a context switch can only occur in kernel mode (according to here, here and here), unless the thread is a user-managed thread.
3. The core switches to user mode.
4. The core executes the sequence of instructions located in user space.
5. The time slice expires.
6. Repeat from step 1.
The closest related diagram I could find is below.
Even a little clue toward an answer would be sincerely appreciated.
You said it yourself:
context switch can only occur in kernel mode
So the CPU must enter kernel mode before there can be a context switch. That can happen in either one of two ways in most operating systems:
The user-mode code makes a system call, or
An interrupt occurs.
If the thread enters kernel mode by making a system call, then there could be a context switch if the syscall causes the thread to no longer be runnable (e.g., a sleep() call), or there could be a context switch if the syscall causes some higher-priority thread to become runnable. (e.g., the syscall releases a mutex that the higher priority thread was awaiting.)
If the thread enters kernel mode because of an interrupt, then there could be a context switch because the interrupt handler made some higher-priority thread runnable (e.g., if the other thread was awaiting data from the disk), or there could be a context switch because it was a timer interrupt, and the current thread's time slice has expired.
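As a concrete illustration of the time-slice case, here is a rough sketch in plain C, with invented names, of the bookkeeping a kernel typically does around a timer interrupt; it is not real kernel code:

/*
 * Rough sketch, with invented names, of the decision made on the return
 * path from a timer interrupt.  Not actual kernel code.
 */
#include <stdio.h>

struct thread {
	const char *name;
	int time_slice_left;
	int need_resched;          /* set when a switch should happen */
};

static struct thread thread_a = { "threadA", 3, 0 };
static struct thread *current_thread = &thread_a;

/* Stand-in for the scheduler: it would pick another runnable thread and
 * switch contexts to it. */
static void schedule(void)
{
	printf("%s: time slice expired, switching to another thread\n",
	       current_thread->name);
	current_thread->time_slice_left = 3;   /* pretend a fresh slice */
}

/* Called from the timer interrupt handler on every tick. */
static void timer_tick(void)
{
	if (--current_thread->time_slice_left <= 0)
		current_thread->need_resched = 1;
}

/* Called just before the interrupt returns to the interrupted thread. */
static void maybe_reschedule(void)
{
	if (current_thread->need_resched) {
		current_thread->need_resched = 0;
		schedule();        /* the context switch happens in here */
	}
}

int main(void)
{
	for (int tick = 0; tick < 7; tick++) {  /* pretend timer interrupts */
		timer_tick();
		maybe_reschedule();
	}
	return 0;
}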
The mechanism of context switching may be different on different hardware platforms. Here's how it could happen on some hypothetical CPU:
The current thread (threadA) enters scheduler code, which chooses some other thread (threadB) as the next to run on the current CPU.
It calls some switchContext(threadB) function.
The switchContext function copies values from the stack pointer register, and from other live registers into the current thread (threadA)'s saved context area.*
It then sets the "current thread" pointer to point to threadB's saved context area, and it restores threadB's context by copying all the same things in reverse.**
Finally, the switchContext function returns... IN threadB,... at exactly the place where threadB last called it.
Eventually, threadB returns from the interrupt or system call to application code running in user-mode.
* The author of switchContext may have to be careful and do some tricky things in order to save the entire context without trashing it. E.g., it had better not use any register that needs saving before it has actually saved it somewhere.
** The trickiest part is when restoring the stack pointer register. As soon as that happens, "the" stack suddenly is threadB's stack instead of threadA's stack.
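To make that save/restore dance concrete, here is a small user-space analogy using POSIX ucontext(3). It is only an analogy, since a real kernel does this in assembly on the two threads' kernel stacks, but the "returns in threadB exactly where threadB last switched away" effect is the same:

/*
 * User-space illustration of "save my registers, restore yours" using
 * POSIX <ucontext.h>.  swapcontext() saves the caller's context into its
 * first argument and resumes the context in its second argument.
 */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_a, ctx_b;

static void thread_b_body(void)
{
	printf("threadB: running, about to switch back to threadA\n");
	swapcontext(&ctx_b, &ctx_a);   /* save B, resume A */
	printf("threadB: resumed exactly where it switched away\n");
}

int main(void)
{
	static char stack_b[64 * 1024];

	/* Build a context for "threadB" with its own stack. */
	getcontext(&ctx_b);
	ctx_b.uc_stack.ss_sp = stack_b;
	ctx_b.uc_stack.ss_size = sizeof(stack_b);
	ctx_b.uc_link = &ctx_a;              /* resume A when B's body returns */
	makecontext(&ctx_b, thread_b_body, 0);

	printf("threadA: switching to threadB\n");
	swapcontext(&ctx_a, &ctx_b);         /* save A's registers, load B's */
	printf("threadA: back again, switching to threadB once more\n");
	swapcontext(&ctx_a, &ctx_b);
	printf("threadA: done\n");
	return 0;
}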
I was reading through the ARM kernel source code in order to better my understanding and came across something interesting.
Inside arch/arm/kernel/entry-armv.S there is a macro named vector_stub, that generates a small chunk of assembly followed by a jump table for various ARM modes. For instance, there is a call to vector_stub irq, IRQ_MODE, 4 which causes the macro to be expanded to a body with label vector_irq; and the same occurs for vector_dabt, vector_pabt, vector_und, and vector_fiq.
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
I'd like to confirm that my understanding is accurate, please see below.
Does this mean that labels with the _usr suffix are executed, only if the interrupt arises when the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs when the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs when the kernel thread is in userspace context, and so on.
If [1] is true, then which kernel threads are responsible for handling, say, IRQs with a different suffix such as irq_svc? I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles it? The kernel thread currently in SVC mode on whichever CPU receives the interrupt?
If [2] is true, then at what point does the kernel thread finish processing the second interrupt and return to where it had left off (also in SVC mode)? Is it ret_from_intr?
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
This is correct. The table is indexed by the current mode. For instance, irq only has three entries: irq_usr, irq_svc, and irq_invalid. IRQs should be disabled during data aborts, FIQ, and the other modes. Linux will always transfer to SVC mode after this brief 'vector stub' code. This is accomplished with:
	@
	@ Prepare for SVC32 mode.  IRQs remain disabled.
	@
	mrs	r0, cpsr
	eor	r0, r0, #(\mode ^ SVC_MODE | PSR_ISETSTATE)
	msr	spsr_cxsf, r0
	@ ... other unrelated code ...
	movs	pc, lr			@ branch to handler in SVC mode
This is why irq_invalid is used for all other modes. Exceptions should never happen when this vector stub code is executing.
Does this mean that labels with the _usr suffix are executed, only if the interrupt arises when the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs when the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs when the kernel thread is in userspace context, and so on.
Yes, the SPSR holds the interrupted mode, and the table is indexed by those mode bits.
If 1 is true, then which kernel threads are responsible for handling, say irqs, with a different suffix such as irq_svc. I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
I think you have some misunderstanding here. There is a 'kernel thread' (kernel-mode context) for each user-space process. irq_usr is responsible for storing the user-mode registers, as a reschedule might take place. The context is different for irq_svc because a kernel stack was already in use, and it is the same stack the IRQ code will use. What happens when a user task calls read()? It uses a system call, and code executes in a kernel context. Each process has both a user and an SVC/kernel stack (and thread_info). A kernel thread is a process without any user-space stack.
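As a small illustration of that last point, kernel code can recognize a kernel thread roughly like this (a sketch; the usual tell-tales are the PF_KTHREAD flag and the absence of a user address space):

/*
 * Sketch: distinguishing a kernel thread from a task with a user address
 * space.  Kernel threads have no mm of their own (current->mm is NULL;
 * they only borrow ->active_mm) and have PF_KTHREAD set in ->flags.
 */
#include <linux/sched.h>

static bool running_in_kernel_thread(void)
{
	return (current->flags & PF_KTHREAD) || current->mm == NULL;
}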
If 2 is true, then at what point does the kernel thread finish processing the second interrupt, and return to where it had left off(also in SVC mode)? Is it ret_from_intr?
Generally, Linux returns to the kernel thread that was interrupted so it can finish its work. However, there is a configuration option for preempting SVC threads/contexts. If the interrupt resulted in a reschedule event, then a process/context switch may result if CONFIG_PREEMPT is active. See svc_preempt for this code.
See also:
Linux kernel arm exception stack init
Arm specific irq initialization
What does the following phrase mean: "the kernel executes in the process context"?
Does it mean that if the CPU is executing some process and then some interrupt occurs (system call, key press, etc.), the CPU will keep the page table of the currently running process loaded and then execute the interrupt handler, which resides in the process's kernel space?
If that is what it means, then it seems like the interrupt handler is executed in process context, so what does interrupt context mean?
A process's context is basically its current state: what is in its registers. We need to save the context of the currently running process so that it can be resumed after the interrupt is handled. That includes esp, ss, eip, cs, and more.
We need to save the instruction pointer (EIP) and the CS (Code Segment) so that after the interrupt is handled we can continue running from where we were stopped.
The interrupt handler code resides in kernel memory. Once an interrupt occurs, we immediately switch from user mode to kernel mode. The state of the currently running process is saved, part of it on the user stack and the other part on the kernel stack (depending on the architecture). Assuming it's x86, the interrupt handler is run by loading the appropriate ss, cs, esp and eip from the TSS and the Interrupt Descriptor Table.
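A rough picture of what ends up saved on the kernel stack in that x86 case (a simplified sketch; the fields mirror, but are not exactly, the kernel's struct pt_regs):

/*
 * Simplified sketch of the state saved when a user-mode thread on 32-bit
 * x86 is interrupted.  The CPU pushes the last five fields itself after
 * loading SS:ESP from the TSS and CS:EIP from the IDT entry; the kernel's
 * interrupt entry stub pushes the rest before calling C code.
 */
#include <stdint.h>

struct trap_frame {
	/* pushed by the kernel's interrupt entry stub */
	uint32_t ebx, ecx, edx, esi, edi, ebp, eax;
	uint32_t ds, es, fs, gs;
	/* pushed automatically by the CPU on a user-to-kernel interrupt */
	uint32_t eip;      /* where the interrupted code will resume */
	uint32_t cs;
	uint32_t eflags;
	uint32_t esp;      /* the interrupted user-mode stack pointer */
	uint32_t ss;
};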
The preempt_count variable keeps track of per-CPU statistics:
static __always_inline int preempt_count(void)
{
return current_thread_info()->preempt_count;
}
Bits 0-7 keep track of how many times kernel preemption has been disabled.
Bits 8-15, if non-zero, mean that softirqs are disabled that number of times.
Bits 16-27 count how many calls to irq_enter have happened, i.e., the number of nested interrupt handlers.
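For reference, here is a small user-space illustration of how such a layered counter packs and decodes. The field widths follow the description above; they have varied between kernel versions, so treat them as an example rather than the current layout:

/*
 * Illustration of packing three counters into one integer, as preempt_count
 * does.  Widths (8 preempt bits, 8 softirq bits, 12 hardirq bits) follow
 * the layout described above and are illustrative only.
 */
#include <stdio.h>

#define PREEMPT_BITS   8
#define SOFTIRQ_BITS   8
#define HARDIRQ_BITS   12

#define PREEMPT_SHIFT  0
#define SOFTIRQ_SHIFT  (PREEMPT_SHIFT + PREEMPT_BITS)   /* 8  */
#define HARDIRQ_SHIFT  (SOFTIRQ_SHIFT + SOFTIRQ_BITS)   /* 16 */

#define FIELD(count, bits, shift) \
	(((count) >> (shift)) & ((1u << (bits)) - 1))

int main(void)
{
	/* Example: preemption disabled twice, one softirq running,
	 * two nested hard interrupt handlers. */
	unsigned int preempt_count = 2u
		| (1u << SOFTIRQ_SHIFT)
		| (2u << HARDIRQ_SHIFT);

	printf("preemption disable depth: %u\n",
	       FIELD(preempt_count, PREEMPT_BITS, PREEMPT_SHIFT));
	printf("softirq depth:            %u\n",
	       FIELD(preempt_count, SOFTIRQ_BITS, SOFTIRQ_SHIFT));
	printf("hardirq nesting depth:    %u\n",
	       FIELD(preempt_count, HARDIRQ_BITS, HARDIRQ_SHIFT));
	return 0;
}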
I am not able to comprehend why it is sufficient for preempt_count to be per-thread.
When a new process is scheduled, bits 0-7 will of course be zero; otherwise it would mean preemption is disabled and the switch would not be allowed. But what about bits 8-27? Will they be 0 too?
Does it mean that whenever schedule() is called, preempt_count must be 0 at that point, and hence its value does not need to be copied across the thread_info of different processes to keep track of the status of softirqs and IRQs on a particular CPU?
Linux forbids scheduling a thread while in an interrupt; this is a convention, and there is no code that enforces the restraint. Under this convention the preempt_count of a newly scheduled thread must be zero, so there is no need to copy preempt_count. If somebody calls schedule() in interrupt context, there may be other problems; for example, a new interrupt can't be processed because the previous interrupt has disabled interrupts.
current_thread_info() is architecture-dependent. It points to the end of the kernel stack. Before 2.6, the task_struct lived at the end of the kernel stack; afterwards, task_struct moved to the slab and thread_info took its place there. Each process has a thread_info lying at the end of the kernel stack for that task, and current_thread_info() returns an architecture-dependent pointer to it (calculated from the stack pointer and the stack size). thread_info contains preempt_count, which is a counter: if its value is zero, the kernel can be preempted (preempt_count += SOFTIRQ_OFFSET disables bottom halves). preempt_count is incremented when a lock is taken and decremented when the lock is released. When an IRQ handler finishes and execution is about to return to kernel code, the kernel checks whether preempt_count is zero, meaning preemption is safe, and whether need_resched in the same thread_info is set, meaning the scheduler should run a more important task. If preempt_count is non-zero, preemption is not safe and execution must resume in the same task that was interrupted. Only when all locks have been released by the current task and preempt_count == 0 can the code be safely preempted, and need_resched is checked again.
On the other hand, the preempt_count() function is very helpful together with preempt_disable() and preempt_enable(), two functions that nest. Disabling preemption can protect a per-CPU variable from concurrent access by more than one task using the same per_cpu variable. It is important that only one task works with a shared per-CPU variable at a time (otherwise it could be changed by another task in the middle), so preemption should be disabled around the access. Because these functions nest, preempt_count() can be used to track the depth. in_atomic() is based on preempt_count(), which actually explains why it is not a very good function.
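For example, the per-CPU pattern described above looks roughly like this in kernel code (a sketch; my_counter and the two helper functions are invented for illustration, while DEFINE_PER_CPU, preempt_disable()/preempt_enable() and get_cpu_var()/put_cpu_var() are the real interfaces):

/*
 * Sketch of protecting a per-CPU variable by disabling preemption, so the
 * task can neither migrate to another CPU nor be preempted by another task
 * on this CPU while it touches its CPU-local counter.
 */
#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, my_counter);

static void bump_my_counter(void)
{
	preempt_disable();              /* raises preempt_count */
	__this_cpu_inc(my_counter);     /* safe: we cannot move CPUs now */
	preempt_enable();               /* lowers it; may trigger a resched */
}

static void bump_my_counter_shorthand(void)
{
	get_cpu_var(my_counter)++;      /* implies preempt_disable() */
	put_cpu_var(my_counter);        /* implies preempt_enable() */
}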