There are two cases where the scheduler function schedule() is invoked:
When a process voluntarily calls schedule()
Timer interrupt calls schedule()
In case 2, I think schedule() runs in interrupt context, but what about the first case? Does it run in the context of the process which invoked it?
Also, are there any more scenarios that invoke schedule()?
schedule() always runs in process context. In the second case, when it is initiated by a timer interrupt, schedule() is called on the return path from the kernel back to the interrupted process.
__schedule() is the main scheduler function.
The main means of driving the scheduler and thus entering this function are:
Explicit blocking: mutex, semaphore, waitqueue, etc.
TIF_NEED_RESCHED flag is checked on interrupt and userspace return paths. For example, see arch/x86/entry_64.S. To drive preemption between tasks, the scheduler sets the flag in timer interrupt handler scheduler_tick().
Wakeups don't really cause entry into schedule(). They add a task to the run-queue and that's it. Now, if the new task added to the run-queue preempts the current task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets called on the nearest possible occasion:
If the kernel is preemptible (CONFIG_PREEMPT=y):
in syscall or exception context, at the next outmost preempt_enable(). (this might be as soon as the wake_up()'s spin_unlock()!)
in IRQ context, return from interrupt-handler to preemptible context
If the kernel is not preemptible (CONFIG_PREEMPT is not set) then at the next:
cond_resched() call
explicit schedule() call
return from syscall or exception to user-space
return from interrupt-handler to user-space
http://lxr.free-electrons.com/source/kernel/sched/core.c#L2389
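To make the "explicit blocking" entry point concrete, here is a minimal sketch of the classic kernel wait-loop pattern; the wait queue and event flag names are hypothetical:

    #include <linux/sched.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_wq); /* hypothetical wait queue */
    static int condition_flag;             /* hypothetical event flag */

    static void wait_for_event(void)
    {
        DEFINE_WAIT(wait);

        for (;;) {
            /* Mark ourselves sleeping before checking the condition,
             * to avoid a lost-wakeup race. */
            prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
            if (condition_flag)
                break;
            schedule(); /* the "explicit blocking" entry point */
        }
        finish_wait(&my_wq, &wait);
    }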
When a process calls schedule() itself, it does so from within a system call, so the call runs in the kernel in that process's context. In the second case a hardware timer interrupt sets the resched flag, and schedule() is then called on the return path from the interrupt, again in the context of the interrupted process. AFAIK those are the only times schedule() is called, because most manipulation of scheduling involves modifying the kernel's run queue of things to be scheduled; beyond that, a process can be told to yield via the resched flag, or it can yield itself.
Related
What is interrupt context? What is process context?
What are the advantages of interrupt context?
Why is a bottom half required? Why not do all the processing in the top half?
Process context is the state in which the kernel runs on behalf of the current process. Code in process context can go to sleep and can be preempted, so it can perform time-consuming work such as acquiring and releasing a mutex.
Interrupt context is entered when an interrupt occurs: control passes to the interrupt handler, and the current process is suspended until the handler completes. Code in interrupt context must not be time-consuming, is not preemptible, and cannot go to sleep.
Among the bottom-half mechanisms, softirqs and tasklets run in interrupt context; a workqueue handler can sleep, so it runs in process context instead.
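As a rough sketch of that difference (assuming the tasklet callback API used since kernel 5.9; all names are hypothetical):

    #include <linux/delay.h>
    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    /* Runs in interrupt (softirq) context: no msleep(), no mutex_lock(),
     * no GFP_KERNEL allocations allowed here. */
    static void my_tasklet_fn(struct tasklet_struct *t)
    {
    }
    static DECLARE_TASKLET(my_tasklet, my_tasklet_fn);

    /* Runs in process context via a kernel worker thread, so it may sleep. */
    static void my_work_fn(struct work_struct *work)
    {
        msleep(10);
    }
    static DECLARE_WORK(my_work, my_work_fn);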
Process Context
One of the most important parts of a process is its executing program code. This code is read in from an executable file and executed within the program's address space. Normal program execution occurs in user space. When a program executes a system call or triggers an exception, it enters kernel space. At this point, the kernel is said to be "executing on behalf of the process" and is in process context. When in process context, the current macro is valid. Upon exiting the kernel, the process resumes execution in user space, unless a higher-priority process has become runnable in the interim, in which case the scheduler is invoked to select the higher-priority process.
Interrupt Context
When executing an interrupt handler or bottom half, the kernel is in interrupt context. Recall that process context is the mode of operation the kernel is in while executing on behalf of a process, for example executing a system call or running a kernel thread. In process context, the current macro points to the associated task. Furthermore, because a process is coupled to the kernel in process context, process context can sleep or otherwise invoke the scheduler.
Interrupt context, on the other hand, is not associated with a process. The current macro is not relevant (although it points to the interrupted process). Without a backing process, interrupt context cannot sleep: how would it ever be rescheduled? Therefore, certain functions cannot be called from interrupt context. If a function sleeps, you cannot use it from your interrupt handler, which limits the set of functions one can call from an interrupt handler.
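One common place this limit shows up is memory allocation; a minimal sketch with hypothetical helper names:

    #include <linux/slab.h>

    static void *alloc_in_process_context(void)
    {
        /* GFP_KERNEL may sleep while memory is reclaimed, which is
         * only legal where sleeping is allowed. */
        return kmalloc(4096, GFP_KERNEL);
    }

    static void *alloc_in_interrupt_context(void)
    {
        /* GFP_ATOMIC never sleeps: it either succeeds from reserves
         * or fails, so it is the only safe choice in a handler. */
        return kmalloc(4096, GFP_ATOMIC);
    }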
When a thread executing user code is waiting for input, how does the scheduler know to interrupt it, or how does the thread know to call the scheduler? The average programmer of a simple single-threaded application is unlikely to insert sched_yield() everywhere. Does the compiler insert sched_yield() during optimisation, does the thread just spin until the general timer interrupt set by the scheduler fires, or does the user have to explicitly call wait() or sleep() functions in order for the context to switch?
This question is especially relevant if the scheduler is not preemptive, because then the thread has to call the scheduler while waiting for input for throughput to be effective, but I'm not sure how it does this.
Be careful not to confuse preemption with the ability of a process to sleep. Processes can sleep even with a non-preempting scheduler. This is what happens when a process is waiting for I/O. The process makes a system call such as read() and the device determines no data is available. It then internally puts the process to sleep by updating a data structure used by the scheduler. The scheduler then executes other processes until an interrupt or some other event occurs that wakes the original process. The awoken process then becomes eligible again for scheduling.
On the other hand preemption is the ability of an architecture's scheduler to stop execution of a process without its cooperation. The interruption can occur anywhere in the program's instruction stream. Control returns to the scheduler which can then execute other processes and return to the interrupted (preempted) process later. Most schedulers allocate time slices where a process is allowed to run for up to a predetermined amount of time, after which it is preempted if higher-priority processes need time slices.
Unless you're writing drivers or kernel code, you don't need to worry about the underlying mechanisms too much. When writing user-space applications the key concepts are (1) that some system calls may block which means your process is put to sleep until an event occurs, and (2) on preemptible systems (all mainstream modern operating systems) your program may be preempted at any time so that other processes can run.
* Note that on some platforms, such as Linux, a thread is really just another process that shares its virtual address space with another process. Processes and threads are therefore treated exactly the same by the scheduler.
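To see point (1) in action, here is a small self-contained user-space example; if no input is available, the read() call simply puts the process to sleep:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[128];

        /* If no input is available, read() puts this process to sleep:
         * the kernel moves it off the run queue and runs other tasks.
         * No explicit yield or timer handling is needed in user code. */
        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
        if (n > 0)
            fwrite(buf, 1, (size_t)n, stdout);
        return 0;
    }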
It is not clear to me whether your question is about theory or practice. In practice, in every modern operating system, I/O operations are privileged, meaning that in order for a user process or thread to access files, devices and so on, it must issue a system call.
The kernel then has the opportunity to do whatever it considers appropriate. For example, it can check whether the I/O operation will block and, if so, switch away from the running process (i.e. "call" the scheduler) after issuing the operation.
Note that this mechanism can work even when there is no timer interrupt handled by the kernel. In general it depends on your system: for example, in an embedded system with no OS (or only a minimal one), it could be entirely the responsibility of the user's code to invoke the scheduler before issuing a blocking operation.
It is the kernel that can be preemptive, not the scheduler.
First, sched_yield() and wait() are forms of voluntary preemption, where the process itself gives up the CPU even if the kernel is non-preemptive.
If the kernel has the ability to switch to another process when the time quantum has expired or a higher-priority process becomes runnable, then we are talking about involuntary preemption, i.e. a preemptive kernel, and it can happen at the different places explained below.
The difference is that with sched_yield() the process stays in the runnable TASK_RUNNING state but simply goes to the end of the run queue for its static priority. The process must wait to get the CPU again.
On the other hand, wait() puts the process to sleep in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state on a wait queue, calls schedule(), and waits for an event to occur. When the event occurs, the process is moved to the run queue again, but that doesn't mean it will get the CPU immediately.
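A minimal user-space sketch of the sched_yield() side of that contrast (the ready flag is a hypothetical variable set by another thread):

    #include <sched.h>
    #include <stdatomic.h>

    static atomic_int ready; /* hypothetical flag set by another thread */

    static void spin_until_ready(void)
    {
        while (!atomic_load(&ready))
            sched_yield(); /* stays TASK_RUNNING; just moves to the back
                            * of the run queue for its priority */
    }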
Here is an explanation of when schedule() can be called after a process is woken up:
Wakeups don't really cause entry into schedule(). They add a task to the run-queue and that's it.
If the new task added to the run-queue preempts the current task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets called on the nearest possible occasion:
If the kernel is preemptible (CONFIG_PREEMPT=y):
in syscall or exception context, at the next outmost preempt_enable(). (this might be as soon as the wake_up()'s spin_unlock()!)
in IRQ context, return from interrupt-handler to preemptible context
If the kernel is not preemptible (CONFIG_PREEMPT is not set) then at the next:
cond_resched() call
explicit schedule() call
return from syscall or exception to user-space
return from interrupt-handler to user-space
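For the non-preemptible case, a long-running kernel loop would typically give the scheduler a chance to run via cond_resched(); a rough sketch with hypothetical names:

    #include <linux/sched.h>

    struct item;                            /* hypothetical work item */
    static void handle_one(struct item *it) /* hypothetical helper */
    {
        /* ... per-item work ... */
    }

    static void process_many_items(struct item **items, int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            handle_one(items[i]);
            /* On a !CONFIG_PREEMPT kernel this is one of the listed
             * opportunities: schedule() runs here if TIF_NEED_RESCHED
             * was set in the meantime. */
            cond_resched();
        }
    }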
Say we have one CPU with only one core, and there are many threads that are running on this core.
Let's say that thread A has issued a system call, and the handler for the system call is now being executed.
Now, while the system call is being executed, say that the hardware timer interrupt (the one responsible for the scheduling of threads) fires. What will happen in this case: will the CPU stop running the system call and go execute the scheduler code, or must the CPU wait for the system call to be fully executed before switching to another thread?
In Linux the answer actually depends on the kernel's build-time preemption model, chosen by a configuration option. There are three options:
If CONFIG_PREEMPT_NONE is set, the interrupt handler will only mark a flag indicating that the scheduler needs to run. The flag will be checked on the return to user space when the system call terminates.
If CONFIG_PREEMPT_VOLUNTARY is set, the same will occur, except that the flag is also checked, and the scheduler run (possibly switching tasks if needed), at specific static points in the system call code.
If CONFIG_PREEMPT is set (full kernel preemption), the scheduler will run in most cases on the return path from the interrupt handler to the system call code, unless a preemption-disabled critical section is in force.
Unless the system call blocks interrupts, the interrupt handler will get invoked.
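The flag checking described above amounts to something like the following sketch; this is not the actual kernel code, which lives in the architecture entry paths:

    #include <linux/sched.h>

    /* Rough sketch of the flag check on the way back to user space. */
    static void exit_to_user_mode_sketch(void)
    {
        while (need_resched())
            schedule(); /* switch away while the flag is set */
    }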
I read some related posts:
(1) From Robert Love: http://permalink.gmane.org/gmane.linux.kernel.kernelnewbies/1791
You cannot sleep in an interrupt handler because interrupts do not have a backing process context, and thus there is nothing to reschedule back into. In other words, interrupt handlers are not associated with a task, so there is nothing to "put to sleep" and (more importantly) "nothing to wake up". They must run atomically.
(2) From Which context are softirq and tasklet in?
If sleep were allowed, then Linux could not schedule them and would eventually cause a kernel panic with a dequeue_task error. The interrupt context does not even have a data structure describing the register info, so it can never be scheduled by Linux. If it were designed to have that structure and could be scheduled, the performance of interrupt handling would be affected.
So in my understanding, interrupt handlers run in interrupt context and cannot sleep; that is to say, they cannot perform a context switch as normal processes do, with a backing mechanism.
But an interrupt handler can be interrupted by another interrupt. And when the second interrupt handler finishes its work, control flow jumps back to the first interrupt handler.
How is this "restoring" implemented without a normal context switch? Is it like normal function calls, with all the registers and other related state stored on a certain stack?
The short answer is that an interrupt handler, if it can be interrupted by an interrupt, is interrupted precisely the same way anything else is interrupted by an interrupt.
Say process X is running. If process X is interrupted, then the interrupt handler runs. To the extent there is a context, it's still process X, though it's now running interrupt code in the kernel (think of the state as X->interrupt if you like). If another interrupt occurs, then the interrupt is interrupted, but there is still no special process context. The state is now X->first_interrupt->second_interrupt. When the second interrupt finishes, the first interrupt will resume just as X will resume when the first interrupt finishes. Still, the only process context is process X.
You can describe these as context switches, but they aren't like process context switches. They're more analogous to entering and exiting the kernel -- the process context stays the same but the execution level and unit of code can change.
The interrupt entry code stores some CPU state and registers before entering the real interrupt handler, and restores this information before returning to the interrupted task. Normally, this kind of saving and restoring is not called a context switch, as the context of the interrupted process is not changed.
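As a rough illustration, the saved information is essentially a snapshot like the structure below; the real, architecture-specific layout is struct pt_regs:

    /* Much-simplified sketch of the per-interrupt save area. */
    struct saved_regs_sketch {
        unsigned long gp_regs[16]; /* general-purpose registers */
        unsigned long ip;          /* interrupted instruction pointer */
        unsigned long sp;          /* interrupted stack pointer */
        unsigned long flags;       /* CPU flags word */
    };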
As of 2020, interrupts (hard IRQs here) in Linux do not nest on a local CPU in general. This has been mentioned at least twice by groups and maintainers actively contributing to the Linux kernel:
From NAPI updates written by Jakub Kicinski in 2020:
…Because normal interrupts don't nest in Linux, the system can't service any new interrupt while it's already processing one.
And from Bootlin in 2022:
…Interrupt handlers are run with all interrupts disabled on the local CPU…
So this question is probably less relevant nowadays, at least for Linux kernel.
In the case of blocking I/O, say a driver read, we call wait_event_interruptible() with some condition. When the condition is met, the read will be done.
I looked into the wait_event_interruptible() function: it checks the condition and calls schedule(). schedule() looks for the next runnable process and does a context switch, and the other process runs. Does this mean that the next instruction to be executed for the current process, when it is woken up again, will be inside the schedule() function?
If yes, and if multiple processes voluntarily call schedule(), will each of those processes have its next instruction to execute after waking up sitting well inside schedule()?
In the case of ret_from_interrupt, schedule() is called. When will it return, given that iret is executed after that?
I think the answer to the first question is yes as that's a fairly typical way of implementing context switching. That's how OS161 works, for example.
If the scheduler is called from an ISR, everything should be the same. The scheduler should change the context and return to the ISR and the ISR should then return using IRET. It will return to a different process/thread if the scheduler chooses to switch to a different one and therefore loads its context and saves the old one.
Re point 2: ret_from_interrupt runs before the iret instruction (the actual return from the interrupt handler); it is on that path that Linux may pass control to the next task to run via schedule(), and iret is only executed once the task to resume has been chosen. One of the overriding considerations when writing interrupt handlers is that while they are executing, many other activities are inhibited (other, lower-priority interrupts being the prime example), so you want to get out of there as fast as possible. That is why most interrupt handlers just stash away work to be done before returning, and said work is then handled elsewhere (today in some special kernel thread).
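To tie this back to the original question, here is a minimal sketch of a blocking driver read built on wait_event_interruptible(); the device state and all names are hypothetical:

    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <linux/uaccess.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(read_wq); /* hypothetical device state */
    static int data_ready;
    static char data_byte;

    static ssize_t my_read(struct file *f, char __user *buf,
                           size_t len, loff_t *off)
    {
        /* Sleeps via schedule(); when woken, execution resumes inside
         * the wait loop, re-checks the condition, then returns here. */
        if (wait_event_interruptible(read_wq, data_ready))
            return -ERESTARTSYS; /* interrupted by a signal */
        data_ready = 0;
        if (len == 0)
            return 0;
        if (copy_to_user(buf, &data_byte, 1))
            return -EFAULT;
        return 1;
    }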