Why do we need to disable interrupts before WFI in ARM Linux cpu_idle

The Linux kernel for ARM basically does CPU_idle in a loop:
while (1) {
disable_irq
wfi
enable_irq
}
I can understand that this logic works because "wfi" wakes up the ARM core regardless of the IRQ/FIQ mask status. However, why does "wfi" have to be bracketed by disable_irq and enable_irq in the first place?
The source code arch/arm/kernel/process.c has the following comment:
* We need to disable interrupts here
* to ensure we don't miss a wakeup call.
But I can't make sense of it. Can someone enlighten me as to the scenario in which we would miss a wakeup call?

The whole 'going to sleep' sequence in the main loop is split into two steps:
1. Realize that you don't have work to do;
2. Try to sleep (i.e. WFI).
The WFI instruction will act as a NOP if an interrupt is already pending, which allows the main loop to go back to running the required tasks. So far so good.
There's a problem, though, if an interrupt occurs right after step 1 and before step 2. If that happens, the interrupt flags will be cleared upon exiting the ISR, and when control goes back to the main loop it will hit the WFI instruction with all interrupt flags cleared, causing the CPU to go to sleep before the main loop has had a chance to execute whatever tasks were required by the ISR.
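Concretely, the kernel closes that window by re-checking for work with interrupts masked. Below is a simplified model of the old ARM cpu_idle() pattern, assuming arch_idle() is a stand-in for the WFI call (an illustration, not the verbatim kernel source):

while (1) {
        while (!need_resched()) {
                local_irq_disable();    /* close the race window */
                if (!need_resched())
                        arch_idle();    /* WFI: a pending IRQ still wakes the core */
                local_irq_enable();     /* now actually take the interrupt */
        }
        schedule();                     /* run whatever the ISR made runnable */
}

Because the final need_resched() check runs with IRQs disabled, any interrupt that fires after the check stays pending and simply turns the WFI into a fall-through.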

<< Cortex-A Series Programmers Guide >>:
ARM recommends the use of a Data Synchronization Barrier (DSB) instruction
before WFI or WFE, to ensure that pending memory transactions complete before changing state.
If interrupts are enabled, we may see this sequence:
DSB
interrupt handler
WFI
But we cannot assume that the DSB still holds after the interrupt handler has run, so we need to disable interrupts to keep the DSB and the WFI back to back.
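A minimal sketch of that enter-idle sequence, using GCC inline assembly for ARMv7 (an illustration of the recommendation, not the kernel's actual implementation):

static inline void enter_idle(void)
{
        asm volatile("cpsid i");                /* mask IRQs on this core */
        asm volatile("dsb" ::: "memory");       /* complete pending memory transactions */
        asm volatile("wfi");                    /* sleep; a pending IRQ still wakes us */
        asm volatile("cpsie i");                /* unmask and take the interrupt */
}

With IRQs masked, no handler can run between the DSB and the WFI, so the barrier is still in effect when the state change happens.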

Related

kernel entry points on ARM

I was reading through the ARM kernel source code in order to better my understanding and came across something interesting.
Inside arch/arm/kernel/entry-armv.S there is a macro named vector_stub, that generates a small chunk of assembly followed by a jump table for various ARM modes. For instance, there is a call to vector_stub irq, IRQ_MODE, 4 which causes the macro to be expanded to a body with label vector_irq; and the same occurs for vector_dabt, vector_pabt, vector_und, and vector_fiq.
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
I'd like to confirm that my understanding is accurate, please see below.
Does this mean that labels with the _usr suffix are executed only if the interrupt arises when the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs when the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs when the kernel thread is in userspace context, and so on.
If [1] is true, then which kernel threads are responsible for handling, say irqs, with a different suffix such as irq_svc. I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
If [2] is true, then at what point does the kernel thread finish processing the second interrupt and return to where it had left off (also in SVC mode)? Is it ret_from_intr?
Inside each of these vector_* jump tables, there is exactly 1 DWORD with the address of a label with a _usr suffix.
This is correct. The table is indexed by the current mode. For instance, irq only has three valid entries: irq_usr, irq_svc, and irq_invalid. IRQs should be disabled during data aborts, FIQ, and the other modes. Linux will always transfer to SVC mode after this brief 'vector stub' code. It is accomplished with:
@
@ Prepare for SVC32 mode. IRQs remain disabled.
@
mrs r0, cpsr
eor r0, r0, #(\mode ^ SVC_MODE | PSR_ISETSTATE)
msr spsr_cxsf, r0
@ ... other unrelated code
movs pc, lr     @ branch to handler in SVC mode
This is why irq_invalid is used for all other modes. Exceptions should never happen when this vector stub code is executing.
Does this mean that labels with the _usr suffix are executed only if the interrupt arises when the kernel thread executing on that CPU is in userspace context? For instance, irq_usr is executed if the interrupt occurs when the kernel thread is in userspace context, dabt_usr is executed if the interrupt occurs when the kernel thread is in userspace context, and so on.
Yes, the SPSR holds the mode of the interrupted context, and the table is indexed by those mode bits.
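As an illustration, here is a hypothetical C rendering of what the stub's indexing does (the real code does this in assembly, via and lr, lr, #0x0f and ldr lr, [pc, lr, lsl #2] in entry-armv.S):

typedef void (*vec_handler_t)(void);

/* Model only: the low nibble of the interrupted SPSR's mode bits
 * selects the entry: 0x0 = USR, 0x2 = IRQ, 0x3 = SVC; everything
 * else lands on the _invalid handler. */
void vector_irq_model(unsigned long spsr, vec_handler_t table[16])
{
        table[spsr & 0x0f]();
}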
If 1 is true, then which kernel threads are responsible for handling, say irqs, with a different suffix such as irq_svc. I am assuming that this is the handler for an interrupt request that happens in SVC mode. If so, which kernel thread handles this? The kernel thread currently in SVC mode, on whichever CPU receives the interrupt?
I think you have some misunderstanding here. Every user-space process also has a kernel-mode context, so there is a 'kernel thread' side to each user-space process. irq_usr is responsible for storing the user-mode registers, as a reschedule might take place. The context is different for irq_svc: a kernel stack was already in use, and it is the same one the IRQ code will use. What happens when a user task calls read()? It makes a system call, and code executes in a kernel context. Each process has both a user stack and an SVC/kernel stack (and a thread_info). A kernel thread proper is a process without any user-space stack.
If 2 is true, then at what point does the kernel thread finish processing the second interrupt, and return to where it had left off(also in SVC mode)? Is it ret_from_intr?
Generally Linux returns to the kernel thread that was interrupted so it can finish its work. However, there is a configuration option for preempting SVC threads/contexts. If the interrupt resulted in a reschedule event, then a process/context switch may take place if CONFIG_PREEMPT is active. See svc_preempt for this code.
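As a rough pseudo-C model, the exit path of __irq_svc decides like this (the real code is assembly around svc_preempt in entry-armv.S; the helpers shown are real kernel APIs, but the control flow is simplified):

void irq_svc_exit_model(struct thread_info *ti)
{
        if (IS_ENABLED(CONFIG_PREEMPT) &&
            ti->preempt_count == 0 &&
            test_ti_thread_flag(ti, TIF_NEED_RESCHED))
                preempt_schedule_irq();  /* context switch before returning */
        /* otherwise: restore registers and resume the interrupted SVC context */
}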
See also:
Linux kernel arm exception stack init
Arm specific irq initialization

Why is interrupt disabled between spin_lock and spin_unlock in Linux?

I was reading the implementation of Linux semaphores. Due to atomicity, signal and wait (up and down in the source code) use spin locks. Then I saw that Linux disables interrupts in spin_lock_irqsave and re-enables them in spin_unlock_irqrestore. This confused me. In my opinion, there is really no point in disabling interrupts within a critical section.
For example, proc A (currently active) acquired the lock, proc B (blocked) is waiting for the lock and proc C is doing some unrelated stuff. It makes perfect sense to context switch to C within the critical section between A and B. Even if C also tries to acquire the lock, since the lock is already locked by A, the result would be C being blocked and A resuming execution.
Therefore, I don't know why Linux decided to disable interrupt within critical sections guarded by spin locks. It probably won't cause any problems but seems like a redundant operation to me.
Allow me to start off with a disclaimer that I am not a Linux expert, so my answer may not be the most accurate. Please point out any flaws and problems that you may find.
Imagine if some shared data is used by various parts of the kernel, including operations such as interrupt handlers that need to be fast and cannot block. Let's say system call foo is currently active and has acquired a lock to use/access shared data bar, and interrupts are not disabled when/before acquiring said lock.
Now a (hardware) interrupt handler, e.g. for the keyboard, kicks in and also needs access to bar (hardware interrupts have higher priority than system calls). Since bar is currently locked by syscall foo, the interrupt handler cannot do anything. Interrupt handlers do need to be fast and cannot block, though, so the handler just keeps spinning while trying to acquire the lock, which causes a deadlock (i.e. a system freeze), since syscall foo never gets a chance to finish and release its lock.
If you disable interrupts before trying to acquire the lock in foo, though, then foo will be able to finish whatever it's doing and ultimately release the lock (and restore interrupts). Any interrupts that arrive while foo holds the spinlock are left pending and will run once the lock is released and interrupts are restored. This way, you won't run into the problem described above. However, care must also be taken to hold the lock for bar for as short a time as possible, so that other higher-priority operations can take over whenever required.
The answer is very simple: there is no way for the thread that tries to acquire a lock to know whether the ISR that interrupts it will try to acquire the same lock. If that happens, the ISR will spin forever on that lock and the system will deadlock.
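To make the failure concrete, here is a hypothetical driver sketch (bar_lock, syscall_path_buggy, and keyboard_isr are invented names; the locking calls are the standard kernel API):

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(bar_lock);

void syscall_path_buggy(void)
{
        spin_lock(&bar_lock);          /* BUG: IRQs still enabled on this CPU */
        /* ... if the IRQ fires here, the handler below runs on THIS CPU ... */
        spin_unlock(&bar_lock);
}

irqreturn_t keyboard_isr(int irq, void *dev)
{
        spin_lock(&bar_lock);          /* spins forever: the owner can never run */
        spin_unlock(&bar_lock);
        return IRQ_HANDLED;
}

void syscall_path_fixed(void)
{
        unsigned long flags;

        spin_lock_irqsave(&bar_lock, flags);    /* the ISR cannot preempt us here */
        /* ... critical section ... */
        spin_unlock_irqrestore(&bar_lock, flags);
}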
But what if an interrupt wants to signal a waiting thread? Or wants to test the semaphore value? The IRQ disabling is not there to prevent a context switch between two processes, but to protect against IRQs. It's all in the comment at the beginning of the file:
/*
* Some notes on the implementation:
*
* The spinlock controls access to the other members of the semaphore.
* down_trylock() and up() can be called from interrupt context, so we
* have to disable interrupts when taking the lock. It turns out various
* parts of the kernel expect to be able to use down() on a semaphore in
* interrupt context when they know it will succeed, so we have to use
* irqsave variants for down(), down_interruptible() and down_killable()
* too.
*
* The ->count variable represents how many more tasks can acquire this
* semaphore. If it's zero, there may be tasks waiting on the wait_list.
*/
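That comment maps directly onto the code; the following is a close paraphrase of up() from kernel/semaphore.c (__up is the internal helper that wakes the first waiter):

#include <linux/semaphore.h>

void up_model(struct semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->lock, flags);  /* safe even from an ISR */
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;                      /* nobody waiting: just bump */
        else
                __up(sem);                         /* wake the first waiter */
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}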

request_irq to be handled by a single CPU

I would like to ask if there is a way to register the interrupt handler so that only one CPU will handle this interrupt line.
The problem is that we have a function that can be called in both normal context and interrupt context. In this function we use irqs_disabled() to check the caller context. If the caller context is interrupt, we switch the processing to polling mode (continuously checking the interrupt status register). Although irqs_disabled() tells us that the local interrupts of the current CPU are disabled, the interrupt handler can still be invoked on other CPUs, and hence the interrupt status register gets cleared in the interrupt handler. The polling code then reads the wrong value from the interrupt status register and does the wrong processing.
You're doing it wrong. Don't limit your interrupt to be handled by a single CPU - instead use a spin_lock_irqsave to protect the code path. This will work both on the same CPU and across CPUs.
See http://www.mjmwired.net/kernel/Documentation/spinlocks.txt for the relevant API and here is a nice article from Linux Journal that explain the usage: http://www.linuxjournal.com/article/5833
I've got no experience with ARM, but on x86 you can arrange for a particular interrupt to be delivered to only one processor via /proc/irq/<number>/smp_affinity - set from user space - replacing the number with the IRQ you care about - and this looks as if it's essentially generic. Note that the value you set it to is a bit mask, expressed in hex, without a leading 0x. That is, if you want CPU 0, set it to 1; for CPU 1, set it to 2; and so on. Beware of a process called irqbalance, which uses this mechanism and might well override whatever you have done.
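For illustration, a small user-space helper that performs that write (pin_irq_to_cpu0 is an invented name; the /proc path is the real interface described above):

#include <stdio.h>

int pin_irq_to_cpu0(int irq)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;      /* needs root, and the IRQ must exist */
        fprintf(f, "1\n");      /* hex bitmask, no 0x: bit 0 = CPU 0 */
        fclose(f);
        return 0;
}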
But why are you doing this? If you want to know whether you are called from an interrupt, there's an interface available named something like in_interrupt(). I've used it to avoid trying to call blocking functions from code that might be called from interrupt context.
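A typical use of that interface, sketched as a toy allocator (alloc_buf is an invented name; in_interrupt() and the GFP flags are the real APIs):

#include <linux/hardirq.h>      /* in_interrupt(); <linux/preempt.h> on newer kernels */
#include <linux/slab.h>

void *alloc_buf(size_t len)
{
        /* GFP_KERNEL may sleep, which is forbidden in interrupt
         * context; GFP_ATOMIC never sleeps. */
        return kmalloc(len, in_interrupt() ? GFP_ATOMIC : GFP_KERNEL);
}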

does kernel's panic() function completely freezes every other process?

I would like to be confirmed that kernel's panic() function and the others like kernel_halt() and machine_halt(), once triggered, guarantee complete freezing of the machine.
So, are all the kernel and user processes frozen? Is panic() interruptible by the scheduler? Could interrupt handlers still be executed?
Use case: in case of serious error, I need to be sure that the hardware watchdog resets the machine. To this end, I need to make sure that no other thread/process is keeping the watchdog alive. I need to trigger a complete halt of the system. Currently, inside my kernel module, I simply call panic() to freeze everything.
Also, is the user-space halt command guaranteed to freeze the system?
Thanks.
edit: According to: http://linux.die.net/man/2/reboot, I think the best way is to use reboot(LINUX_REBOOT_CMD_HALT): "Control is given to the ROM monitor, if there is one"
Thank you for the comments above. After some research, I am ready to give myself a more complete answer, below:
At least for the x86 architecture, reboot(LINUX_REBOOT_CMD_HALT) is the way to go. This, in turn, invokes the syscall reboot() (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L433). Then, for the LINUX_REBOOT_CMD_HALT flag (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L480), the syscall calls kernel_halt() (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L394). That function calls syscore_shutdown() to execute all the registered system-core shutdown callbacks, displays the "System halted" message, dumps the kernel messages (kmsg_dump), AND, finally, calls machine_halt(), which is a wrapper for native_machine_halt() (see: http://lxr.linux.no/linux+v3.6.6/arch/x86/kernel/reboot.c#L680). It is this function that stops the other CPUs (through machine_shutdown()), then calls stop_this_cpu() to disable the last remaining working processor. The first thing that this function does is disable interrupts on the current processor, so that the scheduler is no longer able to take control.
I am not sure why the syscall reboot() still calls do_exit(0) after calling kernel_halt(). I interpret it like this: now, with all processors marked as disabled, the syscall reboot() calls do_exit(0) and ends itself. Even if the scheduler were awoken, there would be no enabled processors on which it could schedule some task, nor any interrupts: the system is halted. I am not sure about this explanation, as stop_this_cpu() seems not to return (it enters an infinite loop). Maybe it is just a safeguard for the case when stop_this_cpu() fails (and returns): in this case, do_exit() will cleanly end the current task, and then the panic() function is called.
As for the panic() code (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/panic.c#L69), the function first disables local interrupts, then disables all the other processors except the current one by calling smp_send_stop(). Finally, as the sole task executing on the only processor still alive, with all local interrupts disabled (that is, the preemptive scheduler -- a timer interrupt, after all -- has no chance to run), the panic() function either loops for some time or calls emergency_restart(), which is supposed to restart the processor.
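Condensed into pseudo-C, the flow described above looks roughly like this (a paraphrase of kernel/panic.c around v3.6, not the verbatim source):

void panic_model(void)
{
        local_irq_disable();            /* no timer tick, so no scheduler */
        /* ... print the panic message, dump the stack ... */
        smp_send_stop();                /* park every other CPU */
        /* ... run panic notifiers, kmsg_dump(KMSG_DUMP_PANIC) ... */
        if (panic_timeout > 0) {
                /* delay panic_timeout seconds, then ... */
                emergency_restart();
        }
        for (;;)
                ;                       /* spin forever with IRQs off */
}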
If you have better insight, please contribute.

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $0x80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
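For concreteness, here is how a 32-bit x86 Linux program can invoke write(2) through that gate, sketched with GCC inline assembly (on i386, __NR_write is 4; EAX carries the syscall number in and the result out):

long sys_write_int80(int fd, const void *buf, unsigned long len)
{
        long ret;

        asm volatile("int $0x80"
                     : "=a"(ret)                           /* result (or -errno) in EAX */
                     : "a"(4), "b"(fd), "c"(buf), "d"(len) /* EAX=nr, EBX/ECX/EDX=args */
                     : "memory");
        return ret;
}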
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event, or wait_event_timeout or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep, which changes its state, and calls schedule() to relinquish the current CPU.
After this the task cannot be run again until it gets woken up, typically by another task (kernel task, etc) or interrupt handler calling a wake* function which will wake up the task(s) sleeping waiting for that particular event, which means the scheduler will soon schedule them again.
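Sketched with the standard wait-queue API (dev_wq, data_ready, and the function names are invented for illustration):

#include <linux/wait.h>
#include <linux/interrupt.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(dev_wq);
static int data_ready;

/* syscall path: sleep until the interrupt handler reports data */
long my_read_wait(void)
{
        /* changes the task state and calls schedule() internally */
        if (wait_event_interruptible(dev_wq, data_ready))
                return -ERESTARTSYS;    /* woken by a signal instead */
        data_ready = 0;
        return 0;
}

/* interrupt handler: record the event and wake the sleeper */
irqreturn_t my_isr(int irq, void *dev)
{
        data_ready = 1;
        wake_up_interruptible(&dev_wq);  /* the scheduler will soon rerun the task */
        return IRQ_HANDLED;
}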
It's worth noting that userspace tasks (i.e. threads) are only one type of task, and there are a few others internal to the kernel which can do work as well - kernel threads and bottom-half handlers / tasklets / task queues etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. Of these, kernel threads are allowed to go to sleep; interrupts and softirq-based bottom halves such as tasklets must not invoke the scheduler.
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help people who seek answers about what happens when the syscall instruction is executed, transferring control to the kernel (user mode to kernel mode). It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
