request_irq to be handled by a single CPU - linux

I would like to ask if there is a way to register an interrupt handler so that only one CPU will handle this interrupt line.
The problem is that we have a function that can be called in both normal context and interrupt context. In this function we use irqs_disabled() to check the caller's context. If the caller context is interrupt, we switch the processing to polling mode (continuously checking the interrupt status register). Although irqs_disabled() reports that local interrupts on the current CPU are disabled, the interrupt handler can still be invoked on other CPUs, and the interrupt status register is therefore cleared in the interrupt handler. The polling code then reads the wrong value of the interrupt status register and does the wrong processing.

You're doing it wrong. Don't limit your interrupt to being handled by a single CPU - instead use spin_lock_irqsave() to protect the code path. This works both on the same CPU and across CPUs.
See http://www.mjmwired.net/kernel/Documentation/spinlocks.txt for the relevant API, and here is a nice article from Linux Journal that explains the usage: http://www.linuxjournal.com/article/5833
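For illustration, a minimal sketch of that approach (my_lock and the placeholder status-register access are mine, not from your driver):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);

/* Callable from both process and interrupt context: the lock disables
 * local interrupts and excludes other CPUs for the critical section. */
static void process_status(void)
{
    unsigned long flags;

    spin_lock_irqsave(&my_lock, flags);
    /* ... read and act on the interrupt status register ... */
    spin_unlock_irqrestore(&my_lock, flags);
}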

I've got no experience with ARM, but on x86 you can arrange for a particular interrupt to be delivered to only one processor via /proc/irq/<number>/smp_affinity - set from user space, replacing the number with the IRQ you care about - and this mechanism looks essentially generic. Note that the value you set is a bit mask, expressed in hex, without a leading 0x: if you want CPU 0, set it to 1; for CPU 1, set it to 2, and so on. Beware of a daemon called irqbalance, which uses this mechanism and might well override whatever you have done.
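For example, a minimal userspace sketch that pins IRQ 19 (an arbitrary example number) to CPU 0:

#include <stdio.h>

int main(void)
{
    /* "1" = CPU 0, "2" = CPU 1, "3" = CPUs 0 and 1, ... (hex, no 0x) */
    FILE *f = fopen("/proc/irq/19/smp_affinity", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("1", f);
    return fclose(f) != 0;
}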
But why are you doing this? If you want to know whether you are called from an interrupt, there's an interface named in_interrupt(). I've used it to avoid calling blocking functions from code that might run in interrupt context.
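As a hedged illustration of that kind of check (my_alloc is a made-up helper):

#include <linux/hardirq.h>
#include <linux/slab.h>

/* GFP_KERNEL may sleep, which is illegal in interrupt context,
 * so fall back to the non-sleeping GFP_ATOMIC there. */
static void *my_alloc(size_t len)
{
    gfp_t flags = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;

    return kmalloc(len, flags);
}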

Related

Under what circumstances does control pass from userspace to the Linux kernel space?

I'm trying to understand which events can cause a transition from userspace to the Linux kernel. If it's relevant, the scope of this question can be limited to the x86/x86_64 architecture.
Here are some sources of transitions that I'm aware of:
System calls (which include accessing devices) cause a context switch from userspace to kernel space.
Interrupts will cause a context switch. As far as I know, this also includes scheduler preemptions, since a scheduler usually relies on a timer interrupt to do its work.
Signals. It seems like at least some signals are implemented using interrupts, but I don't know whether some are implemented differently, so I'm listing them separately.
I'm asking two things here:
Am I missing any userspace->kernel path?
What are the various code paths that are involved in these context switches?
One you are missing: exceptions (which can be further broken down into faults, traps, and aborts).
For example: a page fault, a breakpoint, division by zero, or a floating-point exception. Technically, one can view exceptions as interrupts, but not really in the way you have defined an interrupt in your question.
You can find a list of x86 exceptions at this osdev webpage.
With regard to your second question:
What are the various code paths that are involved in these context switches?
That really depends on the architecture and the OS, so you will need to be more specific. On x86, when an interrupt occurs you go through the IDT entry, and for SYSENTER you go to the address specified in the MSR. What happens after that is completely up to the OS.
No one wrote a complete answer so I will try to incorporate the comments and partial answers into an answer. Feel free to comment or edit the answer to improve it.
For the purposes of this question and answer, userspace-to-kernel transitions mean a change in processor state that allows access to kernel code and memory. In short, I will refer to these transitions as context switches.
When discussing events that can trigger userspace-to-kernel transitions, it is important to separate the OS constructs that we are used to (signals, system calls, scheduling), which require context switches, from the way these constructs are implemented using context switches.
In x86, there are two central ways for context switches to occur: interrupts and SYSENTER. Interrupts are a processor feature, which causes a context switch when certain events happen:
Hardware devices may request an interrupt; for example, a timer/clock can raise an interrupt when a certain amount of time has elapsed, and a keyboard can interrupt when keys are pressed. This type is called a hardware interrupt.
Userspace can initiate an interrupt. For example, the old way to perform a system call in Linux on x86 was to execute INT 0x80 with arguments passed through registers (sketched after this list). Debugging breakpoints are also implemented using interrupts, with the debugger replacing an instruction with INT 0x3. This type of interrupt is called a software interrupt.
The CPU itself generates interrupts in certain situations: when memory is accessed without permission, when a user divides by zero, or when one core must notify another core that it needs to do something. This type of interrupt is called an exception, and you can read more about them in #esm 's answer.
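As an illustration of the software-interrupt path mentioned above, here is a minimal sketch of a legacy INT 0x80 system call; it assumes a 32-bit x86 build (gcc -m32), where __NR_write is 4, and bypasses libc entirely:

/* write(1, msg, len) via the legacy int $0x80 gate (i386 only). */
int main(void)
{
    const char msg[] = "hello via int 0x80\n";
    long ret;

    asm volatile("int $0x80"
                 : "=a"(ret)             /* return value comes back in eax */
                 : "a"(4),               /* eax: __NR_write on i386 */
                   "b"(1),               /* ebx: fd, stdout */
                   "c"(msg),             /* ecx: buffer */
                   "d"(sizeof(msg) - 1)  /* edx: length */
                 : "memory");
    return ret != sizeof(msg) - 1;
}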
For a broader discussion of interrupts see here: http://wiki.osdev.org/Interrupt
SYSENTER is an instruction that provides the modern path to cause a context switch for the particular case of performing a system call.
The code that handles the context switching due to interrupts or SYSENTER in Linux can be found in arch/x86/kernel/entry_{32|64}.S.
There are many situations in which a higher-level Linux construct might cause a context switch. Here are a few examples:
If a system call gets to the int 0x80 or sysenter instruction, a context switch occurs. However, some system call routines can obtain the information the call was meant to get entirely from userspace (the vDSO works this way); in that case, no context switch occurs.
Many times scheduling doesn't require an interrupt: a thread performs a system call, and the return from the syscall is delayed until it is scheduled again. For processes that are in a section where no syscalls are performed, Linux relies on timer interrupts to regain control.
A virtual memory access to a location that was paged out will cause a page fault, and therefore a context switch.
Signals are usually delivered when a process is already "switched out" (see comments by #caf on the question), but sometimes an inter-processor interrupt is used to deliver the signal between two running processes.

Clarification about the behaviour of request_threaded_irq

I have scoured the web, but haven't found a convincing answer to a couple of related questions I have, with regard to the "request_threaded_irq" feature.
Question1:
Firstly, I was reading this article, regarding threaded IRQ's:
http://lwn.net/Articles/302043/
and there is this one line that isn't clear to me:
"Converting an interrupt to threaded makes only sense when the handler
code takes advantage of it by integrating tasklet/softirq
functionality and simplifying the locking."
I understand that had we gone ahead with a "traditional" top-half/bottom-half approach, we would have needed either spinlocks or disabling local IRQs to manipulate shared data. But what I don't understand is how threaded interrupts would simplify the locking by integrating tasklet/softirq functionality.
Question2:
Secondly, what advantage (if any) does a request_threaded_irq() approach have over a workqueue-based bottom-half approach? In both cases, it seems, the "work" is deferred to a dedicated thread. So what is the difference?
Question3:
Lastly, in the following prototype:
int request_threaded_irq(unsigned int irq, irq_handler_t handler, irq_handler_t thread_fn, unsigned long irqflags, const char *devname, void *dev_id)
Is it possible that the "handler" part of the IRQ is continuously triggered by the relevant IRQ (say a UART receiving characters at a high rate), even while the "thread_fn" (writing received bytes to a circular buffer) part of the interrupt handler is busy processing IRQs from previous wakeups? Wouldn't the handler then be trying to "wake up" an already-running "thread_fn"? How would the running thread_fn behave in that case?
I would really appreciate if someone can help me understand this.
Thanks,
vj
For Question 2,
An IRQ thread, on creation, is set up with a higher priority, unlike workqueues.
In kernel/irq/manage.c, you'll see some code like the following for creation of kernel threads for threaded IRQs:
static const struct sched_param param = {
    .sched_priority = MAX_USER_RT_PRIO/2,
};

t = kthread_create(irq_thread, new, "irq/%d-%s", irq,
                   new->name);
if (IS_ERR(t)) {
    ret = PTR_ERR(t);
    goto out_mput;
}
sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
Here you can see that the scheduling policy of the kernel thread is set to a real-time one (SCHED_FIFO) and the priority of the thread is set to MAX_USER_RT_PRIO/2, which is higher than that of regular processes.
For Question 3,
The situation you described can also occur with normal interrupts. Typically in the kernel, interrupts are disabled while an ISR executes. During the execution of the ISR, characters can keep filling the device's buffer and the device can and must continue to assert an interrupt even while interrupts are disabled.
It is the job of the device to make sure the IRQ line is kept asserted until all the characters are read and any processing by the ISR is complete. It is also important that the interrupt is level-triggered or, depending on the design, latched by the interrupt controller.
Lastly, the device/peripheral should have an adequately sized FIFO so that characters delivered at a high rate are not lost by a slow ISR. The ISR should also be designed to read as many characters as possible when it executes.
Generally speaking, what I've seen is that a controller has a FIFO of a certain size X, and when the FIFO is half full (X/2), it fires an interrupt that causes the ISR to grab as much data as possible. The ISR reads as much as possible and then clears the interrupt. Meanwhile, if the FIFO is still at least half full, the device keeps the interrupt line asserted, causing the ISR to execute again.
Previously, the bottom half was not a task and still could not block; the only difference from the top half was whether interrupts were disabled. Tasklets and softirqs allow different interlocking between the driver's ISR thread and the user API (ioctl(), read(), and write()).
I think the workqueue is nearly equivalent. However, the tasklet/ksoftirq has a high priority and is used by all ISR-based functionality on that processor. This may give better scheduling opportunities. Also, there is less for the driver to manage; everything is already built into the kernel's ISR handling code.
You must handle this. Typically, ping-pong buffers or a kfifo (like you suggest) can be used. The handler should be greedy and read all data from the UART before returning IRQ_WAKE_THREAD.
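A minimal sketch of that pattern (assuming a recent kernel's by-value kfifo_put(); uart_rx_pending(), uart_rx_byte() and process_rx_byte() are hypothetical helpers):

#include <linux/interrupt.h>
#include <linux/kfifo.h>

static DEFINE_KFIFO(rx_fifo, unsigned char, 256);

/* Hard handler: runs with the line masked; greedily drain the UART. */
static irqreturn_t uart_hard_irq(int irq, void *dev_id)
{
    while (uart_rx_pending(dev_id))
        kfifo_put(&rx_fifo, uart_rx_byte(dev_id));

    return IRQ_WAKE_THREAD;     /* schedule uart_thread_fn() */
}

/* Threaded handler: may sleep; consumes whatever was queued above. */
static irqreturn_t uart_thread_fn(int irq, void *dev_id)
{
    unsigned char c;

    while (kfifo_get(&rx_fifo, &c))
        process_rx_byte(c);

    return IRQ_HANDLED;
}

Registration would then be request_threaded_irq(irq, uart_hard_irq, uart_thread_fn, 0, "my-uart", dev).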
For Question 3,
When a threaded IRQ fires, the corresponding interrupt line is masked/disabled, and it is re-enabled only when the threaded handler completes. Hence the interrupt won't fire again while its threaded handler is running.
The original work of converting "hard"/"soft" handlers to threaded handlers was done by Thomas Gleixner & team when building the PREEMPT_RT Linux (aka Linux-as-an-RTOS) project (which is not fully part of mainline).
To truly have Linux run as an RTOS, we cannot tolerate a situation where an interrupt handler interrupts the most critical RT (app) thread. But how can we ensure that the app thread even overrides an interrupt? By making the interrupt threaded, schedulable (SCHED_FIFO), and lower in priority than the app thread (an interrupt thread's rtprio defaults to 50). So an "RT" SCHED_FIFO app thread with an rtprio of 60 can "preempt" (closely enough that it works) even an interrupt thread. That should answer your Question 2.
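For illustration, a minimal userspace sketch of giving an app thread such a priority (make_rt is a made-up helper; it requires CAP_SYS_NICE or root):

#include <pthread.h>
#include <sched.h>

/* SCHED_FIFO at rtprio 60: above the default 50 of threaded-IRQ handlers. */
static int make_rt(pthread_t t)
{
    struct sched_param sp = { .sched_priority = 60 };

    return pthread_setschedparam(t, SCHED_FIFO, &sp);
}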
With regard to Question 3:
As others have said, your code must handle this situation.
Having said that, please note that a key point of using a threaded handler is that it lets you do work that (possibly) blocks (sleeps). If your "bottom half" work is guaranteed to be non-blocking and must be fast, please use the traditional-style top-half/bottom-half handlers.
How can we do that? Simple: don't use request_threaded_irq(); just call request_irq() - the comment in the code clearly says (wrt the 3rd parameter):
@thread_fn: Function called from the irq handler thread. If NULL, no irq thread is created.
Alternatively, you can pass the IRQF_NO_THREAD flag to request_irq.
(BTW, a quick check with cscope on the 3.14.23 kernel source tree shows that request_irq() is called 1502 times [giving us non-threaded interrupt handling], and request_threaded_irq() [threaded interrupts] is explicitly called 204 times).

does kernel's panic() function completely freezes every other process?

I would like confirmation that the kernel's panic() function, and others like kernel_halt() and machine_halt(), once triggered, guarantee complete freezing of the machine.
So, are all kernel and user processes frozen? Is panic() interruptible by the scheduler? Could interrupt handlers still be executed?
Use case: in case of serious error, I need to be sure that the hardware watchdog resets the machine. To this end, I need to make sure that no other thread/process is keeping the watchdog alive. I need to trigger a complete halt of the system. Currently, inside my kernel module, I simply call panic() to freeze everything.
Also, is the user-space halt command guaranteed to freeze the system?
Thanks.
edit: According to http://linux.die.net/man/2/reboot, I think the best way is to use reboot(LINUX_REBOOT_CMD_HALT): "Control is given to the ROM monitor, if there is one".
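A minimal userspace sketch of that call, as I read the man page (it requires CAP_SYS_BOOT, i.e. root):

#include <stdio.h>
#include <unistd.h>
#include <sys/reboot.h>
#include <linux/reboot.h>

int main(void)
{
    sync();                                 /* flush filesystem buffers first */
    if (reboot(LINUX_REBOOT_CMD_HALT) < 0)  /* glibc wrapper takes only cmd */
        perror("reboot");
    return 1;                               /* reached only on failure */
}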
Thank you for the comments above. After some research, I am ready to give myself a more complete answer, below:
At least for the x86 architecture, reboot(LINUX_REBOOT_CMD_HALT) is the way to go. This, in turn, invokes the reboot() syscall (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L433). For the LINUX_REBOOT_CMD_HALT flag (see: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L480), the syscall calls kernel_halt() (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/sys.c#L394). That function calls syscore_shutdown() to execute all the registered system-core shutdown callbacks, displays the "System halted" message, dumps the kernel message buffer, AND, finally, calls machine_halt(), which is a wrapper for native_machine_halt() (see: http://lxr.linux.no/linux+v3.6.6/arch/x86/kernel/reboot.c#L680). It is this function that stops the other CPUs (through machine_shutdown()), then calls stop_this_cpu() to disable the last remaining working processor. The first thing stop_this_cpu() does is disable interrupts on the current processor, so the scheduler is no longer able to take control.
I am not sure why the reboot() syscall still calls do_exit(0) after calling kernel_halt(). I interpret it like this: with all processors marked as disabled, the syscall calls do_exit(0) and ends itself. Even if the scheduler were somehow awoken, there would be no enabled processors left on which it could schedule any task, and no interrupts either: the system is halted. I am not sure about this explanation, as stop_this_cpu() appears never to return (it enters an infinite loop). Maybe it is just a safeguard for the case where stop_this_cpu() fails (and returns): in that case, do_exit() will cleanly end the current task, and then the panic() function is called.
As for the panic() code (defined here: http://lxr.linux.no/linux+v3.6.6/kernel/panic.c#L69), the function first disables local interrupts, then disables all other processors except the current one by calling smp_send_stop(). Finally, as the sole task executing on the only processor still alive, with all local interrupts disabled (that is, the preemptive scheduler -- a timer interrupt, after all -- has no chance...), panic() either loops for some time or calls emergency_restart(), which is supposed to restart the processor.
If you have better insight, please contribute.

How shared IRQ races are avoided in Linux

I am considering an upcoming situation in an embedded Linux project (no hardware yet) where two external chips will need to share a single physical IRQ line. This line is capable in hardware of edge-triggered but not level-triggered interrupts.
Looking at the shared IRQ support in Linux, I understand that the way this would work with two separate drivers is that each would have its interrupt handler called, check its hardware, and handle the interrupt if appropriate.
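For reference, the pattern I'm describing looks roughly like this (the my_dev type and its status helpers are hypothetical):

#include <linux/interrupt.h>

static irqreturn_t my_isr(int irq, void *dev_id)
{
    struct my_dev *dev = dev_id;

    if (!my_dev_irq_pending(dev))   /* not ours: let the next handler run */
        return IRQ_NONE;

    my_dev_handle_and_clear(dev);   /* service and ack our device */
    return IRQ_HANDLED;
}

static int my_probe(struct my_dev *dev, int irq)
{
    /* IRQF_SHARED lets several drivers register on one line; dev_id
     * must be unique so free_irq() can find this handler later. */
    return request_irq(irq, my_isr, IRQF_SHARED, "my_dev", dev);
}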
However I imagine the following race condition and would like to know if I'm missing something or what might be done to work around this. Let's say there are two external interrupt sources, devices A and B:
device B interrupt occurs, IRQ goes active
IRQ edge causes Linux core interrupt handler to run
ISR for device A runs, finds no interrupt pending
device A interrupt occurs, IRQ stays active (wire-OR)
ISR for device B runs, finds interrupt pending, handles and clears it
core interrupt handler exits
IRQ stays active, no more edges are generated, IRQ is locked up
It seems that for this to be fixed, the core interrupt handler would have to check the IRQ level after running all handlers, and if still active, run them all again. Will Linux do this? I don't think the interrupt core knows how to check the level of an IRQ line.
Is this race something that can actually happen, and if so how do I deal with this?
Basically, with the hardware you've described, doing a wired-OR for the interrupts will NEVER work correctly on its own.
If you want to do wired-OR, you really need to be using level-sensitive IRQ inputs. If that's not feasible, then perhaps you can add some kind of interrupt controller. That device would take N level-sensitive inputs, have one output, and support some kind of 'clear' operation. When the interrupt controller gets a clear, it would lower its output, then re-assert the output if any of its inputs were still asserted.
On the software side, one thing you could look at is running the IRQ line to another processor input. This would allow you to at least check the state, but the Linux core ISR handling isn't going to know anything about it, so you'll have to patch in something to get it to check the line and cycle through the ISRs again. Also, this means that under heavy interrupt load you're NEVER going to get out of this ISR. Given that you're doing a wire-OR on the IRQs, I'm kind of assuming these devices won't be interrupting too often.
One other thing is to look really hard at the processor. There may be some kind of trick you can pull with the interrupt setup in order to get it to recognize the interrupt again.
I wouldn't try anything too tricky myself, I'd either separate the sources onto separate IRQ inputs, change to a level-sensitive input, or add an interrupt controller chip.

Internals of a Linux system call

What happens (in detail) when a thread makes a system call by raising interrupt 80? What work does Linux do to the thread's stack and other state? What changes are done to the processor to put it into kernel mode? After running the interrupt handler, how is control restored back to the calling process?
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded and how does it then obtain control again?
A crash course in kernel mode in one stack overflow answer
Good questions! (Interview questions?)
What happens (in detail) when a thread makes a system call by raising interrupt 80?
The int $80 operation is vaguely like a function call. The CPU "takes a trap" and restarts at a known address in kernel mode, typically with a different MMU mode as well. The kernel will save many of the registers, though it doesn't have to save the registers that a program would not expect an ordinary function call to save.
What work does Linux do to the thread's stack and other state?
Typically an OS will save registers that the ABI promises not to change during procedure calls. The stack will stay the same; the kernel will run on a per-thread kernel stack rather than the per-thread user stack. Naturally some state will change, otherwise there would be no reason to do the system call.
What changes are done to the processor to put it into kernel mode?
This is usually entirely automatic. The CPU has, generically, a software-interrupt instruction that is a bit like a functional-call operation. It will cause the switch to kernel mode under controlled conditions. Typically, the CPU will change some sort of PSW protection bit, save the old PSW and PC, start at a well-known trap vector address, and may also switch to a different memory management protection and mapping arrangement.
After running the interrupt handler, how is control restored back to the calling process?
There will be some sort of "return from interrupt" or "return from trap" instruction, typically, that will act a bit like a complicated function-return instruction. Some RISC processors did very little automatically and required specific code to do the return and some CISC processors like x86 have (never-really-used) instructions that would execute dozens of operations documented in pages of architecture-manual pseudo-code for capability adjustments.
What if the system call can't be completed quickly: e.g. a read from disk. How does the interrupt handler relinquish control so that the processor can do other stuff while data is being loaded, and how does it then obtain control again?
The kernel itself is threaded much like a threaded user program is. It just switches stacks (threads) and works on someone else's process for a while.
To answer the last part of the question - what does the kernel do if the system call needs to sleep -
After a system call, the kernel is still logically running in the context of the same task that made the system call - it's just in kernel mode rather than user mode - it is NOT a separate thread, and most system calls do not invoke logic from another task/thread. What happens is that the system call calls wait_event(), wait_event_timeout(), or some other wait function, which adds the task to a list of tasks waiting for something, then puts the task to sleep by changing its state and calling schedule() to relinquish the current CPU.
After this, the task cannot run again until it gets woken up, typically by another task (a kernel task, etc.) or an interrupt handler calling a wake_up* function, which wakes the task(s) sleeping on that particular event; the scheduler will then soon schedule them again.
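A minimal sketch of this sleep/wake pattern, with made-up names (my_wq, data_ready):

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int data_ready;

/* System-call path: sleeps (via schedule()) until data_ready is set. */
static int my_read_wait(void)
{
    return wait_event_interruptible(my_wq, data_ready);
}

/* Interrupt handler or another task: wakes any sleepers on my_wq. */
static void my_data_arrived(void)
{
    data_ready = 1;
    wake_up_interruptible(&my_wq);
}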
It's worth noting that userspace tasks (i.e. threads) are only one type of task; there are a few others internal to the kernel which can do work as well - kernel threads and bottom-half handlers / tasklets / task queues, etc. Work which doesn't belong to any particular userspace process (for example network handling, e.g. responding to pings) gets done in these. These tasks are allowed to go to sleep, unlike interrupt handlers (which should not invoke the scheduler).
http://tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
This should help those seeking answers to what happens when the syscall instruction is executed, transferring control from user mode to kernel mode. It is based on the x86_64 architecture.
https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html
