Now I encounter a problem about Linux NMI Watchdog.
I want to use Linux NMI watchdog to detect and recovery OS hang. So, I add "nmi_watchdog=1" into grub.cfg. And then check the /proc/interrupt, NMI were triggered per second. But after I load a module with deadlock (double-acquire spinlock), system were hanged totally, and nothing occurs (never panic!). It looks like that NMI watchdog did not work!
Then I read the Documentation/nmi_watchdog.txt, it says:
Be aware that when using local APIC, the frequency of NMI interrupts
it generates, depends on the system load. The local APIC NMI watchdog,
lacking a better source, uses the "cycles unhalted" event.
What's the "cycles unhalted" event?
It added:
but if your system locks up on anything but the "hlt" processor
instruction, the watchdog will trigger very soon as the "cycles
unhalted" event will happen every clock tick...If it locks up on
"hlt", then you are out of luck -- the event will not happen at all
and the watchdog won't trigger.
Seems like that watchdog won't trigger if processor executes "hlt" instruction, then I search "hlt" in "Intel 64 and IA-32 Architectures Software Developer's Manual, Volumn 2A", it describes it as follow:
Stops instruction execution and places the processor in a HALT state.
An enabled interrupt (including NMI and SMI), a debug exception, the
BINIT# signal, the INIT# signal, or the RESET# signal will resume
execution.
Then I am lost...
My question is:
How does Linux NMI watchdog work?
Who trigger the NMI?
My OS is Ubuntu 10.04 LTS, Linux-2.6.32.21, CPU Pentium 4 Dual-core 3.20 GHz.
I didn't read the whole source code about NMI watchdog(no time), if I couldn't understand how NMI watchdog work, I want use performance monitoring counter interrupt and inter-processor interrupt (be provided by APIC) to send NMI instead of NMI watchdog.
The answer depends on your hardware.
Non-maskable interrupts (NMI) can be triggered 2 ways: 1) when the kernel reaches a halting state that can't be interrupted by another method, and 2) by hardware -- using an NMI button.
On the front of some Dell servers, for example, you will see a small circle with a zig-zag line inside it. This is the NMI symbol. Nearby there is a hole. Insert a pin to trigger the interrupt. If your kernel is built to support it, this will dump a kernel panic trace to the console, then reboot the system.
This can happen very fast. So if you don't have a console attached to save the output to a file, it might look like only a reboot.
As I know, nmi_watchdog would only triggered for non-interruptible hangs. I found an code example by google: http://oslearn.blogspot.in/2011/04/use-nmi-watchdog.html
If your deadlock is not non-interruptiable, you can try enable sysRq to trigger some trace (Alt-printscreen-t) or crash (Alt-printscreen-c) to get more information.
Related
I try to stop interrupts from user space for a specific isolated core,
so I set CPU affinity:
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(2, &set);
assert(sched_setaffinity(getpid(),sizeof(set),&set)==0);
and useiopl(3) to execute privileged instruction cli/sti in user space:
iopl(3);
__asm__("cli;");
// busy looping for a while
__asm__("sti;");
and there are two phenomenons I can't explain:
1 cli can't actually stop interrupts (at least not all interrupts), and interrupt, such as LOC (Local Timer Interrupt) comes out every now and then;
I notice lasted kernel patches prevent cli in user space (reference) , but this result can be reproduced in kernel 4.19.0.
2 AFAIK, cli only clear interrupt flag of CPU on which the program is running, but in practice, my whole system is stuck, not responding to my mouse or keyboard.
(2): Many parts of the Linux kernel depend on communicating with other cores, including RCU depending on for each core: run_on(core) and stuff like that. (https://lwn.net/Articles/262464/). Any kernel code doing that will get stuck when this core doesn't respond to the IPI that other cores send to ask the kernel on this core to switch to a certain task, or perhaps to do TLB shootdowns.
I don't know what exact thing would tend to lead to getting stuck, but I don't find it surprising at all that other parts of the kernel are waiting for something that depends on hearing back from the kernel on this core, and that blocks progress of something involved in getting keyboard/mouse events to an X server and to user-space. (Or even to a text console? That might have more hope, fewer layers of software.)
Or it's always possible that some keyboard or mouse interrupts get distributed to this core, and ignored.
As for (1): do you leave the NMI watchdog enabled, or other source of NMIs? That could get the kernel running temporarily in a state where (other?) interrupts are enabled.
I use kernel/nmi_watchdog = 0 in /etc/sysctl.d/99-local.conf to free up an extra perf counter, but the default is enabled.
(cli doesn't stop Non-Maskable Interrupts, as you might guess from the name.)
Other than that guess, I don't know why you'd still be occasional LOCal timer interrupts; maybe someone more familiar with modern x86 interrupts would know.
When NMI watchdog has been "disabled" it is still chatty.
Does anyone know where the docs for these messages live? I'd like to see what is actually happening.
For example, verified that it is disabled:
$ cat /proc/sys/kernel/nmi_watchdog
0
YET, we still see messages like the following on shutdown or boot:
$ journalctl -xn 100000 | grep "NMI watchdog"
Oct 23 14:29:31 hostname-us kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
Oct 23 14:29:31 hostname-us kernel: NMI watchdog: Shutting down hard lockup detector on all cpus
Now I know that this isn't a RESET, it's something else and I'd like to have the documented answer, not a best guess.
Tried looking through kernel.org and debian.org, man pages with no luck, only archived bugzilla pages.
We'd like to know what these messages actually mean, not make assumptions. Does anyone know where the decoder ring lives ?
Should have known it wasn't an exact match but managed to finally find it on kernel.org
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html
From http://slacksite.com/slackware/nmi.html
The NMI watchdog is a kind of timer event handler, it checks Local APIC or IO-APIC interrupt counter register when it is called on every local timer event of each CPU. Generally speaking there could be hundreds of device and timer interrupts are received per second. If there are no interrupts received in a 5 second interval, the NMI watchdog assumes that the system has hung and initiates a kernel panic. This is very helpful when you need some data for investigating the issue , but occasionally it may have such undesirable effects.
This is for NetBSD on MIPS processor, but answer for Linux is also welcome.
I see that an interrupt occurred while receiving a network packet.
This hardware interrupt sees a TLB miss on store operation and kernel crashed.
When I see the core-dump, gdb points to LWP of a process (lets say ProcA).
I am assuming that, this hardware interrupt may have preempted the ProcA and started execution on ProcA's kernel stack.
Though in the stack frame I don't see anything from ProcA, what I don't understand is why gdb still points to ProcA.
I'm using the Luminary LM3S8962 micro-controller and its included Library Guide, but this should be relevant to any ARM Cortex-M3s that have Nested Vector Interrupts.
You can only register one interrupt service routine function with an entire GPIO Port. A GPIO port typically has 8 pins on it, each of which can be configured with an interrupt. For each pin, you can test whether or not an interrupt "happened" on it (is pending), right? and for each pin you can clear a pending interrupt, right?
If a pin on the GPIO port triggers the ISR then the processor is in the ISR. Then what happens if another pin on the same port triggers an interrupt while we're in the ISR? We assume the code detects what pins have pending interrupts.
- Is this ISR interrupted and a new one begins, with the same code, but an updated PinInterruptStatus register ? (I hope not)
- Is this ISR executed until completion, immediately executing the interrupt for the other pin right afterward? (I know ARM Cortex M3 implements tail-chaining of interrupts)
- Or must there be a while loop that loops until all the pins have been cleared, clearing a pin after it has been processed?
maybe this will help:
http://www.ti.com/lit/gpn/lm3s8962
As stated in the comment: generally ISRs should take steps to prevent reentrancy. In something like a PIC, this could be as simple as disabling the interrupt at the "top" of the ISR, and enabling the interrupt at the "bottom". The M3's NVIC is a bit more complicated. This white paper (http://www.arm.com/files/pdf/IntroToCortex-M3.pdf) states the following on p.7:
The NVIC supports nesting (stacking) of interrupts, allowing an
interrupt to be serviced earlier by exerting higher priority. It also
supports dynamic reprioritisation of interrupts. Priority levels can
be changed by software during run time. Interrupts that are being
serviced are blocked from further activation until the interrupt
service routine is completed, so their priority can be changed without
risk of accidental re-entry.
The above discussion directly addresses the possibility of same interrupt reentrancy, and it also introduces the concept of prioritization to handle interrupts of higher priority interrupting your ISR.
This reference is pretty good: http://infocenter.arm.com/help/topic/com.arm.doc.dui0552a/DUI0552A_cortex_m3_dgug.pdf. On p. 4-9, you'll find instructions to enable/disable interrupts. On page 4-6, you'll find a description of the Interrupt Clear-pending Registers. Using these, you can determine what interrupts are pending. If you really want to get fancy with interrupt enable/disable control, check out the BASEPRI and BASEPRO_MAX registers.
Having said that, I'm not sure I agree with your statement that your question is relevant to any Cortex-M3. Keil (my flavor of Cortex-M3) mentions that the EXTI (external interrupt controller) handles GPIO pin interrupts. Interestingly, the ARM documentation briefly discusses "EXTI", but does not refer to it as a "controller" like the Keil STM32 documentation. A quick google on "STM32 EXTI" yeilds lots of hits, a similar search on "Luminary EXTI" does not yield much. Given that, I'm guessing that this particular controller is one of the peripheral devices that ARM leaves to 3rd parties.
This document
bolsters that view: http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/REFERENCE_MANUAL/CD00171190.pdf. There are several AFIO_EXTI registers mentioned here. These permit the mapping of GPIO lines to interrupts. Unfortunately, I can't find anything similar in the Luminary documentation.
So...what does this mean? It looks like you only have port-level granularity for your interrupt. Thus, your ISR will have to determine which pin transitioned (assuming your are looking for edges). Good luck!
In Cortex-M3, if two interrupts are the same priority (for all GPIO pins), the former will not be interrupted. The interrupt comes later will be in pending state.
When a GPIO interrupt occurs you can check the GPIO Interrupt Status for Rising/Falling IO0IntEnR/IO0IntEnF (depending on ) for the corresponding bit to find the pin that causes the interrupt.
In Linux, what are the options for handling device interrupts in user space code rather than in kernel space?
Experience tells it is possible to write good and stable user-space drivers for almost any PCI adapter. It just requires some sophistication and a small proxying layer in the kernel. UIO is a step in that direction, but If you want to correctly handle interrupts in user-space then UIO might not be enough, for example if the device doesn't support the PCI-spec's interrupt disable bit which UIO relies on.
Notice that process wakeup latencies are a few microsecs so if your implementation requires very low latency then user-space might be a drag on it.
If I were to implement a user-space driver, I would reduce the kernel ISR to just a "disable & ack & wakeup-userpace" operation, handle the interrupt inside the waked-up process, and then re-enable the interrupt (of course, by writing to mapped PCI memory from the userspace process).
There is Userspace I/O system (UIO), but handling should still be done in kernelspace. OTOH, if you just need to notice the interrupt, you don't need the kernel part.
You may like to take a look at CHAPTER 10: Interrupt Handling from Linux Device Drivers, Third Edition book.
Have to trigger userland code indirectly.
Kernel ISR indicates interrupt by writing file / setting register / signalling. User space application polls this and goes on with the appropriate code.
Edge cases: more or less interrupts than expected (time out / too many interrupts per time interval)
Linux file abstraction is used to connect kernel and user space. This is performed by character devices and ioctl() calls. Some may prefer sysfs entries for this purpose.
This can look odd because event triggered device notifications (interrupts) are hooked with 'time triggered' polling, but it is actually asyncronous blocking (read/select). Anyway some questions are arising according to performance.
So interrupts cannot be directly handled outside the kernel.
E.g. shared memory can be in user space and with some I/O permission settings addresses can be mapped, so U-I/O works, but not for direct interrupt handling.
I have found only one 'minority report' in topic vfio (http://lxr.free-electrons.com/source/Documentation/vfio.txt):
https://stackoverflow.com/a/21197797/5349798
Similar questions:
Running user thread in context of an interrupt in linux
Is it possible in linux to register a interrupt handler from any user-space program?
Linux Kernel: invoke call back function in user space from kernel space
Linux Interrupt vs. Polling
Linux user space PCI driver
How do I inform a user space application that the driver has received an interrupt in linux?