I'm trying to read a physical address using mmap in an application. For some reason, that physical address has a hardware fault, and the acknowledgment on the bus never comes back when the address is read.
When we read this address, the application hangs immediately without printing any message, but it can still be cancelled or suspended, which means the OS is still alive and otherwise unaffected.
1) I'm curious what the application is doing and how the hang can happen.
My understanding is that the CPU should have timeout detection for the case where the ack does not come back within the specified time slot, so the application should not stall at the read instruction; instead, some exception should be triggered to inform the kernel.
2) We do a lot of hardware testing, so we want the application or the kernel to output something when the hang happens. Is there a way to add this?
Thanks a lot in advance!
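One way the second point could be approached from user space is a watchdog around the risky access. The following is only a minimal sketch under stated assumptions: the access is run in a child process, and the parent prints a message and kills the child if it does not finish in time. It relies on the observation above that the hung process can still be cancelled; if the child were stuck in an unkillable state, the final waitpid would block too. The names run_with_watchdog, probe_ok and probe_stuck are hypothetical, and the two probes are stand-ins for the real read through the mmap'd pointer.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical probe signature: performs the risky access, returns 0 on success. */
typedef int (*probe_fn)(void *arg);

/* Stand-in for a read that completes normally. */
int probe_ok(void *arg) { (void)arg; return 0; }

/* Stand-in for a read that never completes (simulated with pause()). */
int probe_stuck(void *arg) { (void)arg; for (;;) pause(); return 0; }

/*
 * Run probe(arg) in a child process and wait roughly timeout_ms for it.
 * Returns the child's exit status (0..255), -1 on an internal error or
 * abnormal child termination, or -2 if the child was killed on timeout.
 */
int run_with_watchdog(probe_fn probe, void *arg, int timeout_ms)
{
    assert(probe != NULL && timeout_ms >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(probe(arg) & 0xff);   /* child: do the access, report via exit code */

    const struct timespec tick = { 0, 1000000 };   /* 1 ms polling step */
    for (int waited = 0; ; waited++) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0 && errno != EINTR)
            return -1;
        if (waited >= timeout_ms)
            break;
        nanosleep(&tick, NULL);
    }

    /* This is the "output something when the hang happens" part. */
    fprintf(stderr, "watchdog: probe did not finish within %d ms, killing pid %d\n",
            timeout_ms, (int)pid);
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);          /* reap the child */
    return -2;
}
```

In a real test the probe would perform the volatile read of the mapped register, so that a stalled read shows up as a logged timeout in the parent instead of a silent hang.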
I've just found that the Linux kernel has NAPI, which can be used to poll events from a NIC instead of taking interrupts, with their context switches and other expensive overhead. It may be useful for high-load systems.
I found an article about it which says that I can enable polling mode for a NIC, but for some reason I can't get it to work. I'm trying to figure out the relationship between NAPI and the poll/epoll/etc. functions, to understand whether there is any way to enable that polling from user mode instead of setting kernel variables.
I know that there are functions like poll/epoll/etc. which can be used to read from resources (network sockets) when they are ready.
I have several questions and would be glad to get answers from people who know this area.
Is there any way to enable polling of the NIC, instead of interrupts, from user-mode code? Or is setting a kernel variable the only way to do it?
I found a question here on SO which says that the user-mode functions (poll/epoll/etc.) call, under the hood, a driver function named poll. So when I use poll/epoll in user mode, it looks like I'll be working with the driver's function. But I'm not sure about interrupts. Will they go away? As I understand it, they are there to tell the kernel about new packets to read. How will the kernel know about new packets if there are no interrupts?
Am I correct that setting kernel variables to poll the NIC instead of waiting for (and handling) interrupts is not the same as using the poll/epoll functions in user mode? In my head, the high-level algorithm works as follows:
With interrupts:
A packet arrives at the NIC.
The NIC raises a hardware interrupt to handle that packet.
A blocked .Read(fd) call in user mode is unblocked and the data can be read.
With kernel polling:
A packet arrives at the NIC.
Another packet arrives.
And one more.
The CPU polls the NIC for information at some interval.
A blocked .Read(fd) call in user mode is unblocked and the data can be read.
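For comparison with the two flows above, here is a toy model of how NAPI is generally described as working. This is a sketch, not kernel code: the first packet still raises an interrupt, the handler masks further receive interrupts and schedules the driver's poll routine, and that routine processes packets up to a budget. Once the ring is drained, interrupts are enabled again. So interrupts do not disappear, they are only suppressed while the kernel is already polling. All names here (toy_nic, toy_packet_arrives, toy_poll) are illustrative and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one NIC receive queue; every name here is illustrative. */
struct toy_nic {
    int  pending;        /* packets waiting in the RX ring */
    bool irq_enabled;    /* may the NIC raise an RX interrupt? */
    bool poll_scheduled; /* has the poll routine been scheduled? */
    int  irq_count;      /* interrupts actually taken */
    int  delivered;      /* packets handed to the network stack */
};

/* A packet arrives; it raises an interrupt only if interrupts are enabled. */
void toy_packet_arrives(struct toy_nic *nic)
{
    nic->pending++;
    if (nic->irq_enabled) {
        nic->irq_count++;
        nic->irq_enabled = false;    /* handler masks further RX interrupts */
        nic->poll_scheduled = true;  /* and schedules the poll routine */
    }
}

/* Poll routine: process at most `budget` packets and return how many were done. */
int toy_poll(struct toy_nic *nic, int budget)
{
    assert(budget > 0);
    if (!nic->poll_scheduled)
        return 0;

    int done = 0;
    while (done < budget && nic->pending > 0) {
        nic->pending--;
        nic->delivered++;
        done++;
    }
    if (done < budget) {
        /* Ring drained: stop polling and return to interrupt mode. */
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

In this model, five packets arriving back to back cost one interrupt and two poll calls with a budget of four. The user-mode poll/epoll calls are a separate mechanism: they wait for a file descriptor to become ready and do not change how the NIC signals the kernel.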
I have some specific hardware which runs on FreeBSD and Linux.
I have to write a user-space application which will work with the driver using memory shared between the kernel and the user-space application.
My application busy-polls the shared memory from user space.
Is there any way to use a mechanism such as select to sleep and be notified when the shared memory is changed (by the driver)?
I don't want to implement communication like netlink, because the idea with select is to sleep, wake up when something happens, and then stay awake and keep processing data without any further IPC with the kernel.
Then, when it is done, the application can call select again and wait again.
Thank you.
You are looking for the kqueue(2) interface on FreeBSD.
On Linux there is inotify/epoll.
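On the Linux side, the sleep/wake pattern from the question could look roughly like the sketch below. It assumes the driver makes a file descriptor readable whenever it has changed the shared memory (for a character device that would mean implementing the driver's poll file operation). Since no such driver is available here, an eventfd is used as a stand-in for that descriptor, and open_notifier, notify_change, wait_for_change and consume_changes are hypothetical helper names.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for opening the device: a non-blocking eventfd with counter 0. */
int open_notifier(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Stand-in for the driver signalling "shared memory changed". */
int notify_change(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/*
 * Sleep until fd becomes readable or timeout_ms elapses (-1 waits forever).
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_for_change(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int r = poll(&p, 1, timeout_ms);
    if (r < 0)
        return -1;
    return (r > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Clear pending notifications; returns how many were coalesced (0 if none). */
uint64_t consume_changes(int efd)
{
    uint64_t n = 0;
    if (read(efd, &n, sizeof n) != (ssize_t)sizeof n)
        return 0;
    return n;
}
```

The application would call wait_for_change, then consume_changes once, and process the shared memory until it is drained before sleeping again, so the only kernel interaction per burst of data is one wake-up and one read.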
I'm working on an embedded Linux platform.
When I run echo "mem" > /sys/power/state, the system suspends.
I know that the kernel and drivers are told that a suspend operation is coming. But is it possible for a user-space process or application to be notified that the system is about to suspend? How?
For example, I have an application that continually writes 'A' into a buffer whose start address is given by a device driver. Could this application be notified that the system is about to suspend, so that it can fill the whole buffer with 'B' and, when the driver is resumed, all the driver sees is 'B'?
Thanks a lot.
I've been searching for the same thing, but unfortunately I didn't find any user-space notification during suspend/resume. Applications are just refrigerated/frozen and never know they were suspended.
However, one possible approach would be to send a generic netlink message or a uevent from the suspend/resume function of any driver that you can modify. Still, the application may not get enough time to process it before it is frozen, which can lead to race conditions. Say it received the suspend message and was frozen before it could process it: once resumed, it would then be processing the suspend message.
IMO, it is better to handle the scenario in the driver and leave user space alone.
I'm not sure whether it's useful for you in particular, given the mention of "embedded", but systemd can notify you over D-Bus: https://www.freedesktop.org/wiki/Software/systemd/inhibit/
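As a partial complement to the answers above, an application can at least find out after the fact that a suspend happened, without any driver changes. On Linux, CLOCK_MONOTONIC does not advance while the system is suspended, whereas CLOCK_BOOTTIME does, so the gap between the two grows by the time spent suspended. The sketch below only detects a resume, it does not provide the advance warning asked for in the question, and the helper names (suspended_ns_total, resumed_since) are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current value of the given clock in nanoseconds, or -1 on error. */
static int64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Approximate total time spent suspended since boot, in nanoseconds,
 * or -1 on error. CLOCK_MONOTONIC is read first so the difference
 * does not go negative on a system that has never suspended.
 */
int64_t suspended_ns_total(void)
{
    int64_t mono = clock_ns(CLOCK_MONOTONIC);
    int64_t boot = clock_ns(CLOCK_BOOTTIME);
    if (mono < 0 || boot < 0)
        return -1;
    return boot - mono;
}

/*
 * Returns 1 if the suspended total grew by more than threshold_ns since
 * the value stored in *last, 0 if not, -1 on error. Updates *last.
 */
int resumed_since(int64_t *last, int64_t threshold_ns)
{
    assert(last != NULL);
    int64_t now = suspended_ns_total();
    if (now < 0)
        return -1;
    int grew = (now - *last) > threshold_ns;
    *last = now;
    return grew;
}
```

An application loop could call resumed_since periodically with a threshold well above scheduling jitter (for example 100 ms) and treat a return value of 1 as "the system was suspended since the last check".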
My inquiry is about block drivers. Suppose our device
encounters problems and all error handling has failed. The only remaining
option I see for the driver is to take the device offline. The intention is
for the system to be notified so that it stops sending requests to
the device and the device becomes inaccessible, so that any bad behaviour,
such as a system crash, is avoided. This should work whether the device is used
standalone or with RAID.
The question is how to do it cleanly. I've searched existing drivers
in the kernel source (like sd, skd, etc.), and all they do (as far as I
can see) is complete subsequent requests with -EIO
(__blk_end_request_all(rq, -EIO)). So far this works with fio,
because once that happens and all fio requests are completed with
-EIO, fio stops sending further requests. But when the device is used with
RAID and an unrecoverable fault is triggered, just completing requests with
an error does not seem to work. RAID detects the errors, but can't remove the
failing device from the array. Sometimes the kernel still sends
requests to the faulty RAID member (which should stop). The array
becomes degraded (in a not-so-good way) and eventually crashes the
system.
So how does a block driver properly inform the upper layers that
its device is faulty and must not be used again for that session?
While shutting down QNX Neutrino using phshutdown (either reboot or shutdown), the system hangs while killing message queues (mqueue). The message displayed on screen is
Shutting down service providers(mqueue)
What could be the reason for this?
This happens from time to time when you issue shutdown from the command line as well.
Some of the reasons I've seen on the web are:
Hardware issue
Driver issue
Kernel told to shut down when it didn't want to
From what I've cobbled together (and this is by no means definitive, but seems to be plausible), basically, any program that is waiting for the hardware or OS to reply has a chance of hanging the shutdown if the thing it is waiting on gets killed before it does.
A possible mitigation is to slay all your apps/servers (especially those touching hardware devices or shared memory queues) prior to issuing a shutdown, wait for a second or two, then go ahead with your shutdown.