Understanding kernel message 'nobody cared (try booting with the "irqpoll" option)' - linux

I'm trying to understand the meaning of the following message:
irq N:nobody cared (try booting with the "irqpoll" option)
Does this mean that the IRQ handler is not processing the interrupt even though it has received it? Or that the scheduler failed to call an IRQ handler?
Under what conditions does this happen?

It means that either no handler is registered for that IRQ,
or, in the case of shared interrupts, every registered handler returned a status indicating that the interrupt was not for it (i.e. not from the hardware it supports).
The cause is probably faulty HW/FW or a buggy driver.
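The bookkeeping behind the message can be modelled in a few lines of user-space C. This is only a simplified sketch of what kernel/irq/spurious.c does, not kernel code: the names irq_model and model_note_interrupt are made up for illustration, and the real code has extra cases (timing between unhandled interrupts, threaded handlers, polling) that are left out. The thresholds, more than 99,900 unhandled interrupts out of 100,000, are the ones found in mainline sources; treat them as an assumption for your particular kernel.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space model of one IRQ line's bookkeeping. */
struct irq_model {
    unsigned int irq;            /* IRQ number, used for the message only */
    unsigned int irq_count;      /* interrupts seen in the current window */
    unsigned int irqs_unhandled; /* of those, how many nobody claimed */
    bool disabled;               /* set once the line has been shut off */
};

/*
 * Record one interrupt. 'handled' is false when every registered handler
 * returned IRQ_NONE, or when no handler is registered at all.
 * Returns true if this call disabled the line.
 */
bool model_note_interrupt(struct irq_model *m, bool handled)
{
    if (m->disabled)
        return false;
    if (!handled)
        m->irqs_unhandled++;
    if (++m->irq_count < 100000)
        return false;

    /* End of a 100,000-interrupt window: evaluate it, then start over. */
    m->irq_count = 0;
    if (m->irqs_unhandled > 99900) {
        printf("irq %u: nobody cared (try booting with the \"irqpoll\" option)\n",
               m->irq);
        printf("Disabling IRQ #%u\n", m->irq);
        m->disabled = true;
    }
    m->irqs_unhandled = 0;
    return m->disabled;
}
```

The point of the model: a single unclaimed interrupt is harmless; the message only appears when almost every interrupt in a long run goes unclaimed, which is why it usually points at a stuck line, a missing handler, or a handler that wrongly returns IRQ_NONE.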

Ideally, the above message should be followed by a stack trace, which should help you determine which subsystem is causing the issue. The message means that the interrupt kept firing without any handler claiming it, so the kernel disabled IRQ#X to stop the flood. This is often seen with buggy firmware.
The irqpoll option needs to be added to the kernel command line in grub.conf. With it, when an interrupt is not handled, the kernel searches all known interrupt handlers for an appropriate one, and also checks all handlers on each timer interrupt. This is sometimes useful to get systems with broken firmware running. The kernel command line in grub.conf should look like the following:
kernel /vmlinuz-version ro root=/dev/sda1 quiet irqpoll
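After rebooting, it can be worth confirming that the option actually reached the kernel. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper names cmdline_has_option and irqpoll_enabled are hypothetical:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if 'opt' appears as a whole space-separated word in 'cmdline'. */
bool cmdline_has_option(const char *cmdline, const char *opt)
{
    size_t len = strlen(opt);
    const char *p = cmdline;

    if (len == 0)
        return false;
    while ((p = strstr(p, opt)) != NULL) {
        bool starts = (p == cmdline) || p[-1] == ' ';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return true;
        p += len;   /* partial match, keep searching after it */
    }
    return false;
}

/* Read /proc/cmdline: 1 if irqpoll is present, 0 if not, -1 on error. */
int irqpoll_enabled(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/cmdline", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cmdline_has_option(buf, "irqpoll") ? 1 : 0;
}
```

The whole-word match is deliberate, so that a longer parameter that merely contains the string irqpoll is not mistaken for the option.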

Minimal runnable QEMU example
QEMU has an educational device called edu that generates interrupts, and is perfect to explore this.
First, I have created a minimal Linux PCI device driver for it, which handles the interrupt correctly.
Now we can easily generate the error by commenting out request_irq and free_irq from the code.
Then, if we run the userland program that generates IRQs, we get:
irq 11: nobody cared (try booting with the "irqpoll" option)
followed by a stack trace.
So as others mentioned: unhandled IRQs.

In my case the message appeared after I reloaded the driver, which I did because the network card had reported billions of errors in a short period of time:
modprobe -r ixgbe && modprobe ixgbe
Afterwards, lspci showed an unknown device where the 'card' used to be,
and after a reboot the card disappeared, never to be seen again.
So the error might also indicate failing hardware.

See bad_action_ret() in kernel/irq/spurious.c:
static inline int bad_action_ret(irqreturn_t action_ret)
{
	if (likely(action_ret <= (IRQ_HANDLED | IRQ_WAKE_THREAD)))
		return 0;
	return 1;
}
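To see what that check accepts, here is a user-space copy that can be compiled and played with. The enum mirrors the kernel's irqreturn values (IRQ_NONE = 0, IRQ_HANDLED = 1, IRQ_WAKE_THREAD = 2); everything else is a sketch for illustration, not the kernel's code.

```c
/* User-space copy of the kernel's handler return codes, for illustration. */
enum irqreturn {
    IRQ_NONE        = 0,      /* interrupt was not from this device */
    IRQ_HANDLED     = 1 << 0, /* interrupt was handled by this device */
    IRQ_WAKE_THREAD = 1 << 1, /* handler asks to wake the handler thread */
};

/* Returns 1 for a value no well-behaved handler should return, else 0. */
int bad_action_ret(unsigned int action_ret)
{
    /* Anything above 3 has bits set outside the defined return codes. */
    return action_ret > (IRQ_HANDLED | IRQ_WAKE_THREAD);
}
```

Note that IRQ_NONE passes this check: returning it is legal for a handler. It is a long run of IRQ_NONE results, not a single one, that leads to "nobody cared".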

Related

How to send a simple message from kernel to user space?

I have a very simple (I think) problem.
I have a very simple kernel module which handles an interrupt coming from my hardware (it's all described in my device tree). I get the interrupt in the kernel. Now I want to send a message (just 64 bits, two uint32_t) to a program in user space. It would also be OK if I could just "wake up" my program (there are several threads in it, so one thread could sleep until it is woken up by the kernel module).
My problem is: what is the easiest and clearest solution? I have read about netlink and about using the proc filesystem, but
either I cannot find any clear examples out there,
or the messaging only goes from user space to kernel space,
or the examples are outdated for the kernel I use (4.4).
Does anybody have a very clear example or a how-to for such things?
P.S. I don't want to handle everything that follows the interrupt in kernel space. It's OK if some messages get lost.

IRQ Handling from User Space Linux

I'm writing a driver for a synthesized device in an FPGA. The device has several IRQs, and I have requested them in my driver:
irq = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
rc = request_irq(irq, &Custom_driver_handler,IRQF_TRIGGER_RISING , DRIVER_NAME, base_addr);
My problem is that I want the IRQ handler to call a function of a user-space application. Is there any way to call my user-space application from the driver's IRQ handler in kernel space?
I know I could set a flag in the driver and mmap its address from the user application in order to poll it, but what I want to know is whether there is a faster or more correct way.
Thank you in advance.
There are several ways of invoking user-space functions from kernel, usually named upcalls: http://lkml.iu.edu/hypermail/linux/kernel/9809.3/0922.html; check also https://lwn.net/Articles/127698/ "Handling interrupts in user space" and the http://wiki.tldp.org/kernel_user_space_howto overview from 2008, part "Sending Signals from the Kernel to the User Space".
To make writing drivers easier, there is now the UIO framework in the kernel: https://unix.stackexchange.com/questions/136274/can-i-achieve-functionality-similar-to-interrupts-in-linux-userspace https://lwn.net/Articles/232575/ https://yurovsky.github.io/2014/10/10/linux-uio-gpio-interrupt/ https://www.osadl.org/fileadmin/dam/rtlws/12/Koch.pdf http://www.hep.by/gnu/kernel/uio-howto/
With UIO you can block on or poll a special file descriptor to wait for an interrupt (block with the read() syscall; poll with the poll() syscall): https://lwn.net/Articles/232575/
On the user space side, the first UIO-handled device will show up as /dev/uio0 (assuming a normal udev setup). The user-space driver will open the device. Reading the device returns an int value which is the event count (number of interrupts) seen by the device; if no interrupts have come in since the last read, the operation will block until an interrupt happens (though non-blocking operation is supported in the usual way as well). The file descriptor can be passed to poll().
include/linux/uio_driver.h has been available in the Linux kernel for many years; it is present in both the 3.x and 4.x kernels.
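The user-space side of that read()/poll() pattern can be sketched as follows. This is a minimal example, not taken from any of the linked articles; the function name uio_wait_irq is hypothetical, and it relies only on the behaviour quoted above (each read returns the 32-bit event count, and the descriptor can be passed to poll()).

```c
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms (-1 = forever) for the next interrupt on a UIO
 * file descriptor and store the running event count in *count.
 * Returns 1 on an event, 0 on timeout, -1 on error.
 */
int uio_wait_irq(int fd, int timeout_ms, uint32_t *count)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready <= 0)
        return ready;   /* 0 = timed out, -1 = poll() failed */

    /* UIO hands back exactly one 32-bit counter per read(). */
    ssize_t n = read(fd, count, sizeof(*count));
    return n == (ssize_t)sizeof(*count) ? 1 : -1;
}
```

A thread in the application would open /dev/uio0 once, then call uio_wait_irq() in a loop and run its own "handler" each time the function returns 1. Comparing successive counts shows whether any interrupts were missed in between. Some UIO drivers additionally expect the interrupt to be re-enabled from user space after each event; check the driver you are using.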

i2c accessing at user space issue, How to solve this ..?

WARNING: at kernel/irq/manage.c:274 0xa01aa01b()
Unbalanced enable for IRQ 10
Modules linked in:
Backtrace: no frame pointer
---[ end trace 5cce32c8b5df3d34 ]---
When I run my application program it gives this error. What does it mean, and how do I solve it? Please guide me in detail.
Checking the kernel source (kernel/irq/manage.c:274), we can see that this warning is printed in enable_irq(). It happens when trying to enable an IRQ that is already enabled, i.e. without it having been disabled first.
If you are getting this warning as a result of running some user-space program, then you need to check the logic of the driver that this user-space program interacts with and fix the unnecessary enabling of IRQ 10 in that driver.
Apart from polluting the kernel logs, this warning is pretty much safe to ignore as it does not affect the immediate functionality. However it does indicate a deeper problem in your program's (or the underlying driver's) state machine logic.
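The rule being enforced is a simple counter: every disable_irq() increments a per-IRQ depth, every enable_irq() decrements it, and an enable at depth 0 triggers the warning. A user-space model of that counter, with hypothetical names (irq_line, model_disable_irq, model_enable_irq) and none of the kernel's locking or hardware access:

```c
#include <stdio.h>

/* Hypothetical model: depth counts outstanding disable_irq() calls. */
struct irq_line {
    unsigned int irq;
    unsigned int depth;   /* 0 = enabled, >0 = disabled that many times */
};

void model_disable_irq(struct irq_line *l)
{
    l->depth++;
}

/* Returns 0 on success, -1 if the enable had no matching disable. */
int model_enable_irq(struct irq_line *l)
{
    if (l->depth == 0) {
        fprintf(stderr, "Unbalanced enable for IRQ %u\n", l->irq);
        return -1;
    }
    l->depth--;
    return 0;
}
```

A common way to hit this in a real driver is calling enable_irq() right after request_irq(), which normally leaves the line enabled already, or calling enable_irq() on two code paths that share a single disable_irq().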

request_irq succeeds but interrupt is never detected

I am running embedded Linux 3.2.6 on an ARM processor. I am using a modified version of Atmel's serial driver to control the 4 USART ports on my device. When I use the driver compiled into the kernel, all works fine. But I want to run the driver as a kernel module instead. I made all of the necessary changes and disabled the internal driver, and everything seems fine. The 4 tty devices are registered successfully and I can see that all of my probe and initialization functions work correctly.
So here's the problem:
When I try to write to any of the devices, my "start transmit" function gets called but then waits for an interrupt from the USART which never occurs. So the write just hangs, and using a logic analyzer I can see that RTS gets asserted but no bytes show up on the TX line. I know that my call to request_irq succeeds, and yet I never see any of the IRQ entries in /proc/interrupts. In the driver, I have also tried using request_irq to register a separate interrupt handler for a GPIO line, and this works fine.
I know that this is a problem that is probably hard to diagnose, but I am looking for ANY possible suggestions that could lead me in the right direction to finding a solution. Let me know if you need any clarifications. Thank you
The symptoms read like a peripheral clock that has not been enabled (or has been turned off): the device can be initialized without errors and an I/O operation can be set up, but the device doesn't do anything; it plays dead. Since no I/O ever starts, you're never going to get an interrupt indicating completion!
The other thing to check is the conditional compilation directives for the HW configuration structures in your arch/arm/mach-xxx/zzz_devices.c file.
Make sure that the serial port structures have something like:
#if defined(CONFIG_SERIAL_ATMEL) || defined(CONFIG_SERIAL_ATMEL_MODULE)
and not just
#if defined(CONFIG_SERIAL_ATMEL)
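The difference matters because Kconfig defines CONFIG_FOO for a built-in option (=y) but CONFIG_FOO_MODULE, and not CONFIG_FOO, for a modular one (=m). The following stand-alone sketch simulates the =m case by defining the macro by hand; the two function names are made up, and in a real tree the macro comes from the generated configuration header rather than from a #define in the source.

```c
/* Simulate what Kconfig generates for CONFIG_SERIAL_ATMEL=m. */
#define CONFIG_SERIAL_ATMEL_MODULE 1

/* Stands in for platform data guarded by the built-in-only check. */
int visible_with_builtin_check(void)
{
#if defined(CONFIG_SERIAL_ATMEL)
    return 1;   /* compiled only when the driver is built in (=y) */
#else
    return 0;   /* =m lands here: the guarded structures are dropped */
#endif
}

/* Stands in for platform data guarded by the combined check. */
int visible_with_module_check(void)
{
#if defined(CONFIG_SERIAL_ATMEL) || defined(CONFIG_SERIAL_ATMEL_MODULE)
    return 1;   /* compiled for both =y and =m */
#else
    return 0;
#endif
}
```

With only the first form in the devices file, a modular build silently loses the serial port structures: nothing fails to compile, the driver loads, and the hardware is simply never described to it.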
Addendum
I could be wrong but the clock shouldn't have any effect on the CTS pin causing an interrupt, right?
Not right.
These digital circuits are synchronous state machines: without a clock, a change-of-state by an input cannot be processed.
Also, SoCs and modern uControllers use the peripheral clocks as on/off switches for those integrated peripherals. There is often way more functionality, i.e. peripherals, on the silicon chip than can actually be used, mostly due to insufficient quantity of pins to the board. So disabling the clocks to unused devices is employed to reduce power consumption.
You are far too focused on interrupts.
You do not have a solvable interrupt problem; those are secondary failures.
The lack of output when attempting to transmit is far more significant and revealing.
The root cause is probably a flawed configuration of the USART devices, since transmitting bits is an automatic operation for a configured & operational USART.
If the difference between not-working versus working is loadable module versus static linking, then the root cause is going to be something fundamental (and trivial) like my two suggestions.
Also your lack of acknowledgement regarding the #if defined(), e.g. you didn't respond with "Oh yeah, we already knew that", raises a gigantic red flag that says "Fix me first!"
Addendum 2
I'm tempted to delete this answer after discovering that the Atmel serial driver cannot be configured/built as a loadable module using make menuconfig (which is the premise for half of the answer). (Of course the Kconfig file could be hacked to make the config variable tristate instead of boolean to overcome the module restriction.) I've left a comment for the OP. But I also wanted to preserve the comment to Mr. Stratton pointing out how symbols in the .config file are (not) used.
So I did finally fix my problem. Thank you for the responses; none of them directly solved my problem, but they did prompt further examination of my code. After some trial and error I finally got it working. I had originally moved the platform_device structures for each USART from /mach-at91/xxx_devices.c to my loadable module. For some reason the structures weren't getting the correct data to map to the hardware, I suppose because the symbols from the kernel weren't being linked correctly (I never got an error message, though), and so some of the registration functions weren't even getting called. I ended up moving the structures and platform_device_register calls back into the devices file. I also decided to keep the driver for the console built in, using the original atmel_serial.c driver. I had to change the platform_device name for the console in both the devices file and the built-in atmel_serial.c file so that it would not conflict with my USART ports driver. I found that changing the platform_device and platform_driver name for the USARTs to anything but "atmel_usart" resulted in USART transmission failing. I really don't understand why, but I'm just leaving it as atmel_usart so it works.
Thanks again to everybody who responded to my problem.

How to record the system information before system hang? [duplicate]

I have an embedded board with a kernel module of thousands of lines which freezes at random times in random, complex use cases. What are my options for debugging it?
I have already tried the magic SysRq key, but it does not work. I guess the explanation is that I am stuck in a loop or a deadlock in code where hardware interrupts are disabled?
Thanks,
Eva.
Typically, embedded boards have a watchdog. You should enable this timer and use a watchdog user process to kick the watchdog hardware. Use nice on the watchdog process so that higher-priority tasks must relinquish the CPU for it to run. This gives clues as to the issue. If the device does not reset with the watchdog active, then it may be that only the network or serial port has stopped communicating, i.e. the kernel has not locked up; the issue is that there is no user-visible activity. The watchdog is also useful if/when this type of issue occurs in the field.
For a kernel lockup case, the kernel's lockup watchdog features may be useful. This will work if you have an infinite loop or deadlock as speculated. However, if this is custom hardware, it is also possible that SDRAM or a peripheral device latches up and causes abnormal bus activity. This will stop the CPU from fetching proper code; obviously, it is tough for Linux to recover from this.
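A watchdog user process can be very small. The sketch below assumes the standard Linux watchdog character device, where any write restarts the timer; the function names watchdog_kick and watchdog_feed are hypothetical, and the bounded loop is there only so the code can be exercised without real hardware.

```c
#include <unistd.h>

/* Kick the watchdog once: any write to the device restarts its timer. */
int watchdog_kick(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/*
 * Kick 'rounds' times, sleeping interval_s seconds between kicks.
 * A real daemon would loop forever; the bound keeps the sketch testable.
 */
int watchdog_feed(int fd, unsigned int rounds, unsigned int interval_s)
{
    for (unsigned int i = 0; i < rounds; i++) {
        if (watchdog_kick(fd) != 0)
            return -1;
        sleep(interval_s);
    }
    return 0;
}
```

In use, the process would open /dev/watchdog for writing and feed it at an interval comfortably shorter than the hardware timeout, run under nice as suggested above. Be aware that opening the device starts the timer, so a process that opens it and then stalls will cause a reset, which is exactly the behaviour wanted here.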
You can combine the watchdog with some fallow memory that is used as a trace buffer. memmap= and mem= can limit the memory used by the kernel. A driver/device using this memory can be written that saves trace points that survive a reboot. The fallow memory's ring buffer is dumped when a watchdog reset is detected on kernel boot.
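The trace buffer itself can be a plain ring buffer with a magic number in front, so that the boot code can tell surviving data from random RAM contents. This is a sketch of the data structure only, under the assumption that the reserved region is already mapped and that its contents survive the reset; the layout, the names (trace_buf, trace_attach, trace_point, trace_dump) and the sizes are all hypothetical, and a real driver would also have to deal with caching and concurrent writers.

```c
#include <stdint.h>
#include <string.h>

#define TRACE_MAGIC   0x54524143u  /* "TRAC": marks an initialised buffer */
#define TRACE_ENTRIES 64u          /* hypothetical capacity */

struct trace_buf {
    uint32_t magic;                 /* valid only if RAM survived the reset */
    uint32_t head;                  /* total number of entries ever written */
    uint32_t entry[TRACE_ENTRIES];  /* most recent trace point IDs */
};

/* Call at boot: returns 1 if old contents are valid and can be dumped. */
int trace_attach(struct trace_buf *t)
{
    if (t->magic == TRACE_MAGIC)
        return 1;
    memset(t, 0, sizeof(*t));
    t->magic = TRACE_MAGIC;
    return 0;
}

/* Record one trace point, overwriting the oldest entry when full. */
void trace_point(struct trace_buf *t, uint32_t id)
{
    t->entry[t->head % TRACE_ENTRIES] = id;
    t->head++;
}

/* Copy the surviving entries, oldest first, into out; returns the count. */
uint32_t trace_dump(const struct trace_buf *t, uint32_t *out)
{
    uint32_t n = t->head < TRACE_ENTRIES ? t->head : TRACE_ENTRIES;

    for (uint32_t i = 0; i < n; i++)
        out[i] = t->entry[(t->head - n + i) % TRACE_ENTRIES];
    return n;
}
```

On a boot that follows a watchdog reset, trace_attach() returning 1 means the last TRACE_ENTRIES trace points before the hang are still there, and trace_dump() gives them in order, ending with the last point the code reached.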
It is also useful to register thread notifiers that can do a printk on context switches, if the issue is repeatable or to discover how to make the event repeatable. Once you determine a sequence of events that leads to the lockup, you can use a scope or logic analyzer to do some final diagnosis. Or, it may be evident at this point which peripheral is the issue.
You may also set panic=-1 and reboot=... on the kernel command line. The kdump facilities are useful, if you only have a code problem.
Related: kernel trap (at web archive). This link may no longer be available, but it isn't important to this answer.
