when schedule() returns? - linux

In case of blocking IO, say, driver read, we call wait_event_interruptible() with some condition. When the condition is met, read will be done.
I looked into wait_event_interruptible() function, it checks for condition and calls schedule(). schedule() will look for the next runnable process and does context switch and other process will run. Does it mean that, the next instruction to be executed for the current process will be inside schedule() function when this process is woken up again?
If yes, if multiple process voluntarily calls schedule, then all processes will have next instruction to be executed once after it gets woken up will be well inside schedule()?
In case of ret_from_interrupt, schedule() is called. When it will return? as iret is executed after that.

I think the answer to the first question is yes as that's a fairly typical way of implementing context switching. That's how OS161 works, for example.
If the scheduler is called from an ISR, everything should be the same. The scheduler should change the context and return to the ISR and the ISR should then return using IRET. It will return to a different process/thread if the scheduler chooses to switch to a different one and therefore loads its context and saves the old one.

Re point 2: The iret instruction (return from interrupt handler) is executed and that gets you into ret_from_interrupt. Then Linux passes control to the task next to run (schedule()). One of the overriding considerations when writing interrupt handlers is that while they are executing many other activities are inhibited (other, lower priority, interrupts are the prime example), so you want to get out of there as fast as possible. That is why most interrupt handlers just stash away work to be done before returning, and said work is then handled elsewhere (today in some special kernel thread).

Related

How does the scheduler know that a thread is blocked waiting for input?

When a thread executing user code is waiting for input, how does the scheduler know to interrupt it or how does the thread know to call the scheduler, seeing as the average programmer of a simple single threaded application is unlikely to insert sched_yield() everywhere. Does the compiler insert sched_yield() on optimisation or does the thread just spin lock until the general timer interrupt set by the scheduler fires, or does the user have to explicitly state wait(), sleep() functions in order for the context to switch?
This question is especially relevant if the scheduler is not preemptive because then it has to call the scheduler when it is waiting for input for throughput to be effective, but I'm not sure how it does this.
Be careful not to confuse preemption with the ability of a process to sleep. Processes can sleep even with a non-preempting scheduler. This is what happens when a process is waiting for I/O. The process makes a system call such as read() and the device determines no data is available. It then internally puts the process to sleep by updating a data structure used by the scheduler. The scheduler then executes other processes until an interrupt or some other event occurs that wakes the original process. The awoken process then becomes eligible again for scheduling.
On the other hand preemption is the ability of an architecture's scheduler to stop execution of a process without its cooperation. The interruption can occur anywhere in the program's instruction stream. Control returns to the scheduler which can then execute other processes and return to the interrupted (preempted) process later. Most schedulers allocate time slices where a process is allowed to run for up to a predetermined amount of time, after which it is preempted if higher-priority processes need time slices.
Unless you're writing drivers or kernel code, you don't need to worry about the underlying mechanisms too much. When writing user-space applications the key concepts are (1) that some system calls may block which means your process is put to sleep until an event occurs, and (2) on preemptible systems (all mainstream modern operating systems) your program may be preempted at any time so that other processes can run.
* Note that in some platforms, such as Linux, a thread is really just another process which shares its virtual address space with another process. Processes and threads are therefore treated exactly the same by the scheduler.
It is not clear to me whether your question is about theory or practice. In practice in every modern operating system, i/o operations are privileged. Meaning that in order for a user process or thread to access files, devices and so on it must issue a system call.
Then the kernel has the opportunity to do whatever it considers appropriate. For example it can check whether the I/o operation will block and, therefore switch the running (i.e. “call” the scheduler) process after issuing the operation.
Note that this mechanism can work even when there is no timer interruption handled by the kernel. Anyway in general it will depend upon your system. For example in an embedded system where no OS exits (or a minimal one) it could be the entire responsibility of the user’s code to invoke the scheduler before issueing a blocking operation.
Kernel can be preemptive, not scheduler.
First sched_yield() and wait() are types of voluntary preemption, when process itself gives out CPU even if kernel is non-preemptive.
If kernel has ability to switch to another process when time quantum has expired or higher priority process become runnable then we are talking about involuntary preemption, i.e preemptive kernel, and it can happen on different places explained below.
Difference is that insched_yield() process stays in runnable TASK_RUNNING state but just goes to the end of the run queue for it's static priority. Process must wait to get the CPU again.
On the other hand, wait() puts process to a sleep TASK_(UN)INTERRUPTABLE state, on a wait queue, calls schedule() and waits for an event to occur. When event occur, process are moved to run queue again. But that doesn't mean that they will get CPU immediately.
Here is explained when schedule() can be called after process is woken up:
Wakeups don't really cause entry into schedule(). They add a
task to the run-queue and that's it.
If the new task added to the run-queue preempts the current
task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
called on the nearest possible occasion:
If the kernel is preemptible (CONFIG_PREEMPT=y):
in syscall or exception context, at the next outmost
preempt_enable(). (this might be as soon as the wake_up()'s
spin_unlock()!)
in IRQ context, return from interrupt-handler to
preemptible context
If the kernel is not preemptible (CONFIG_PREEMPT is not set)
then at the next:
cond_resched() call
explicit schedule() call
return from syscall or exception to user-space
return from interrupt-handler to user-space

What happens if when a system call is being executed, the hardware timer interrupt fires?

Say we have one CPU with only one core, and there are many threads that are running on this core.
Let's say that Thread A has issued a system call, now the interrupt handler for the system call will be executed.
Now while the the system call is being executed, say that the hardware timer interrupt (the one responsible for the scheduling of threads) fires. What will happen in this case, will the CPU stop running the system call and go to execute the scheduler code, or does the CPU must wait for the system call to be fully executed before switching to another thread?
In Linux the answer is actually dependent on a kernel build time configuration option called CONFIG_PREEMPT. There are actually three options:
If CONFIG_PREEMPT is not set, the interrupt handler will mark a flag indicating that the scheduler needs to run. The flag will be checked on return to user space upon system call termination.
If CONFIG_PREEMPT_VOLUNTARY is set, the same will occur except the flag will be checked and the scheduler run (and task possibly switched if needed) at specific static code points in the system call code
If CONFIG_PREEMPT_FULL is set, the scheduler will run in most cases on the return code path from the interrupt handler to the system call code, unless a preempt critical section is in force.
Unless the system call blocks interrupts, the interrupt handler will get invoked.

How is interrupt context "restored" when a interrupt handler is interrupted by another interrupt?

I read some related posts:
(1) From Robert Love: http://permalink.gmane.org/gmane.linux.kernel.kernelnewbies/1791
You cannot sleep in an interrupt handler because interrupts do not have a backing
process context, and thus there is nothing to reschedule back into. In other
words, interrupt handlers are not associated with a task, so there is nothing to
"put to sleep" and (more importantly) "nothing to wake up". They must run
atomically.
(2) From Which context are softirq and tasklet in?
If sleep is allowed, then the linux cannot schedule them and finally cause a
kernel panic with a dequeue_task error. The interrupt context does not even
have a data structure describing the register info, so they can never be scheduled
by linux. If it is designed to have that structure and can be scheduled, the
performance for interrupt handling process will be effected.
So in my understanding, interrupt handlers run in interrupt context, and can not sleep, that is to say, can not perform the context switch as normal processes do with backing mechanism.
But a interrupt handler can be interrupted by another interrupt. And when the second interrupt handler finishes its work, control flow would jump back to the first interrupt handler.
How is this "restoring" implemented without normal context switch? Is it like normal function calls with all the registers and other related stuff stored in a certain stack?
The short answer is that an interrupt handler, if it can be interrupted by an interrupt, is interrupted precisely the same way anything else is interrupted by an interrupt.
Say process X is running. If process X is interrupted, then the interrupt handler runs. To the extent there is a context, it's still process X, though it's now running interrupt code in the kernel (think of the state as X->interrupt if you like). If another interrupt occurs, then the interrupt is interrupted, but there is still no special process context. The state is now X->first_interrupt->second_interrupt. When the second interrupt finishes, the first interrupt will resume just as X will resume when the first interrupt finishes. Still, the only process context is process X.
You can describe these as context switches, but they aren't like process context switches. They're more analogous to entering and exiting the kernel -- the process context stays the same but the execution level and unit of code can change.
The interrupt routine will store some CPU state and registers before enter real interrupt handler, and will restore these information before returning to interrupted task. Normally, this kind of storing and restoring is not called context-switch, as the context of interrupted process is not changed.
As of 2020, interrupts (hard IRQ here) in Linux do not nest on a local CPU in general. This is at least mentioned twice by group/maintainer actively contributing to Linux kernel:
From NAPI updates written by Jakub Kicinski in 2020:
…Because normal interrupts don't nest in Linux, the system can't service any new interrupt while it's already processing one.
And from Bootlin in 2022:
…Interrupt handlers are run with all interrupts disabled on the local CPU…
So this question is probably less relevant nowadays, at least for Linux kernel.

does kernel's panic() function completely freezes every other process?

I would like to be confirmed that kernel's panic() function and the others like kernel_halt() and machine_halt(), once triggered, guarantee complete freezing of the machine.
So, are all the kernel and user processes frozen? Is panic() interruptible by the scheduler? The interrupt handlers could still be executed?
Use case: in case of serious error, I need to be sure that the hardware watchdog resets the machine. To this end, I need to make sure that no other thread/process is keeping the watchdog alive. I need to trigger a complete halt of the system. Currently, inside my kernel module, I simply call panic() to freeze everything.
Also, the user-space halt command is guaranteed to freeze the system?
Thanks.
edit: According to: http://linux.die.net/man/2/reboot, I think the best way is to use reboot(LINUX_REBOOT_CMD_HALT): "Control is given to the ROM monitor, if there is one"
Thank you for the comments above. After some research, I am ready to give myself a more complete answer, below:
At least for the x86 architecture, the reboot(LINUX_REBOOT_CMD_HALT) is the way to go. This, in turn, calls the syscall reboot() (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L433). Then, for the LINUX_REBOOT_CMD_HALT flag (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L480), the syscall calls kernel_halt() (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L394). That function calls syscore_shutdown() to execute all the registered system core shutdown callbacks, displays the "System halted" message, then it dumps the kernel, AND, finally, it calls machine_halt(), that is a wrapper for native_machine_halt() (see: http://lxr.linux.no/linux+v3.6.6/arch/x86/kernel/reboot.c#L680). It is this function that stops the other CPUs (through machine_shutdown()), then calls stop_this_cpu() to disable the last remaining working processor. The first thing that this function does is to disable interrupts on the current processor, that is the scheduler is no more able to take control.
I am not sure why the syscall reboot() still calls do_exit(0), after calling kernel_halt(). I interpret it like that: now, with all processors marked as disabled, the syscall reboot() calls do_exit(0) and ends itself. Even if the scheduler is awoken, there are no more enabled processors on which it could schedule some task, nor interrupt: the system is halted. I am not sure about this explanation, as the stop_this_cpu() seems to not return (it enters an infinite loop). Maybe is just a safeguard, for the case when the stop_this_cpu() fails (and returns): in this case, do_exit() will end cleanly the current task, then the panic() function is called.
As for the panic() code (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/panic.c#L69), the function first disables the local interrupts, then it disables all the other processors, except the current one by calling smp_send_stop(). Finally, as the sole task executing on the current processor (which is the only processor still alive), with all local interrupts disabled (that is, the preemptible scheduler -- a timer interrupt, after all -- has no chance...), then the panic() function loops some time or it calls emergency_restart(), that is supposed to restart the processor.
If you have better insight, please contribute.

What context does the scheduler code run in?

There are two cases where the scheduler code schedule() is invoked-
When a process voluntarily calls schedule()
Timer interrupt calls schedule()
In case 2, I think schedule() runs in interrupt context, but what about the first case? Does it run in the context of the process which invoked it?
Also are there any more scenarios which invoke schedule()?
schedule() always runs in process context. In the second case, when it is initiated by a timer interrupt, it is in the return path back from the kernel to the interrupted process where schedule() is called.
__schedule() is the main scheduler function.
The main means of driving the scheduler and thus entering this function are:
Explicit blocking: mutex, semaphore, waitqueue, etc.
TIF_NEED_RESCHED flag is checked on interrupt and userspace return paths. For example, see arch/x86/entry_64.S. To drive preemption between tasks, the scheduler sets the flag in timer interrupt handler scheduler_tick().
Wakeups don't really cause entry into schedule(). They add a task to the run-queue and that's it. Now, if the new task added to the run-queue preempts the current task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets called on the nearest possible occasion:
If the kernel is preemptible (CONFIG_PREEMPT=y):
in syscall or exception context, at the next outmost preempt_enable(). (this might be as soon as the wake_up()'s spin_unlock()!)
in IRQ context, return from interrupt-handler to preemptible context
If the kernel is not preemptible (CONFIG_PREEMPT is not set) then at the next:
cond_resched() call
explicit schedule() call
return from syscall or exception to user-space
return from interrupt-handler to user-space
http://lxr.free-electrons.com/source/kernel/sched/core.c#L2389
When a process calls schedule() it runs in a system call context which is interrupt based. In the 2nd case a hardware interrupt triggers the schedule() call. In both cases it runs as an interrupt. AFAIK those are the only times that schedule() is called because most manipulation of scheduling involves modifying the kernel run queue of things to be scheduled although a process can be interrupted but that is usually done via an interrupt to tell the process to yield or the process yielding itself.

Resources