I'm working on a Linux device driver (kernel version 2.6.32-37). I debug my code mostly by printing to the kernel log (using printk). Everything goes well until my computer suddenly stops responding. I've checked my code over and over again and it seems to be correct.
My question is:
Is it possible that too much printing to the kernel log might cause the computer to stop responding?
Thanks a lot!
Omer
I doubt the problem is caused by printk. Using printk certainly slows your code down, but it won't crash your system.
Here's a quote from Ubuntu Kernel Debugging Tricks:
The internal kernel console message buffer can sometimes be too small to capture all of the printk messages, especially when debug code generates a lot of printk messages. If the buffer fills up, it wraps around and one can lose valuable debug messages.
As the quote says, when you print too much data you simply start overwriting older entries that you wanted to see in the log; that's a problem because some debugging messages will disappear, but it isn't serious enough to crash the whole system.
I suggest you double-check your code again, try to trace when and where it's crashing, and if you can't fix the problem, post a question here or on a kernel hacking mailing list.
P.S. It's also worth mentioning that you need to be careful where you put your printk statements, as some code paths cannot tolerate the delay they introduce, and THAT might cause further problems leading to a freeze or crash.
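If you do suspect the sheer volume of output, one mitigation worth trying is to rate-limit the noisy messages. A minimal sketch (the function name and message are made up; printk_ratelimit() is a standard kernel helper):

    #include <linux/kernel.h>

    /* Rate-limit a noisy debug message so it cannot flood the log buffer.
     * printk_ratelimit() returns 0 while messages are arriving too fast. */
    static void my_debug_dump(int status)
    {
        if (printk_ratelimit())
            printk(KERN_DEBUG "mydriver: status=%d\n", status);
    }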
Related
For a user-mode application, an invalid page access doesn't cause much trouble beyond crashing the application, and that crash can be handled gracefully with exception handling. Why can't we do the same for a kernel crash? When a kernel module tries to access an invalid address, there is a page fault and the kernel crashes. Why can't it be handled gracefully, for example by unloading the faulty module?
More specifically, I'm interested in whether it is possible at all or completely impossible; I'm not asking about the difficulties it might pose in using the system. I understand that a driver crash leaves the device unusable, and I'm okay with that. The only question is whether it is possible to gracefully unload a faulty driver.
Since the other answer explains very well why it's not feasible to recover from kernel crashes, I'll try to add something else.
There is a lot of research in this area, most notably by Prof. Andrew Tanenbaum with his MINIX. While a crash of the kernel itself is still fatal in MINIX, the MINIX kernel is a very small microkernel, which narrows the space for bugs, and most other components (including drivers) run as user-mode processes. So in the case of, say, a network driver failure, since the driver runs in a separate address space, all the kernel needs to do is attempt to restart it.
Of course, there are areas where you can't recover (or still can't recover), such as a file system crash (see the recent discussion here).
There are several good papers on this topic, such as http://pages.cs.wisc.edu/~swami/papers/thesis.pdf, and I would highly recommend watching Tanenbaum's videos, such as this one (the title is "MINIX 3: A Reliable and Secure Operating System" in case it ever goes offline).
I think this addresses your comment:
We should be able to unload the faulty module. Why can't we? That is my question. Is it a design choice for security, or is it not possible at all? If it is a design choice, what factors forced us to make that choice?
You could live without a screen if the graphics driver module crashes. However, we can't unload the faulty module and continue, because it runs in the same address space as the kernel, and once it has crashed you don't know whether it has poisoned kernel memory - security is the prime factor here.
That's kind of like saying "if you wrap all your Java code in a try/catch block, you've eliminated all bugs!"
There are a number of "errors" that are caught, e.g. kmalloc returns NULL if it's out of memory, the USB code returns errors if there's no USB device, etc. But there's no try/catch for the entire operating system, because not all bugs can be repaired.
As pointed out, what happens if your filesystem module crashes? Keep running without files? How about your Ethernet driver? Now your box is cut off from the internet, you can't even ssh into it any more, and it doesn't even have the decency to reboot.
So even though it may be possible for the kernel to not "crash" when a module crashes, the state of the kernel could be arbitrarily broken. The kernel could stay alive without a screen, filesystem or internet connection, but what kind of existence is that?
Kernel modules and the kernel itself share the same address space. There is simply no protection if a module starts to misbehave and overwrite memory belonging to another subsystem.
So, when a driver crashes, it may or may not stay local to that driver. If you are lucky, you still have a somewhat functional kernel and can continue to work.
That doesn't happen in userspace, because each process has its own address space, so an erroneous memory access can be caught and the process stopped (this is a SEGFAULT).
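To see the contrast, here is a minimal userspace sketch (purely illustrative): a wild pointer access raises SIGSEGV, which is delivered only to the offending process, and the process can at least turn it into a controlled exit - something the kernel cannot do for faults in its own address space.

    #include <signal.h>
    #include <unistd.h>

    /* In userspace a wild pointer access becomes a SIGSEGV delivered to this
     * process only; the rest of the system keeps running. */
    static void on_segv(int sig)
    {
        (void)sig;
        /* Very little is safe inside a SIGSEGV handler, so just exit. */
        write(2, "caught SIGSEGV, exiting\n", 24);
        _exit(1);
    }

    int main(void)
    {
        signal(SIGSEGV, on_segv);
        int *p = 0;
        *p = 42;    /* invalid page access: kills (only) this process */
        return 0;
    }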
I am new to driver development. I have configured my Linux kernel according to Linux Device Drivers, chapter 4, and enabled a lot of debug options. When I try to test a driver I wrote, the kernel issues an oops. This oops, however, is immediately flushed away by chunks of other debug output. So where can I find the oops information that flashed by?
By the way, can anyone explain the meaning of the debug output below?
[ 1698.129712] evbug: Event. Dev: input0, Type: 0, Code: 0, Value: 0
Messages of this type scroll past on the screen and I can't even stop them.
To avoid a lot of information that is useless in your case, you should enable only what you really need to debug your module. I highly recommend disabling everything you enabled, and then enabling debug features case by case.
Next, there is a nice framework in the kernel called Dynamic Debug. It lets you enable or disable certain debug messages at runtime (make sure you have CONFIG_DYNAMIC_DEBUG=y in your kernel configuration). A more detailed description is available in Documentation/dynamic-debug-howto.txt.
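As a rough sketch (the module name and messages are made up), a module that uses pr_debug() instead of a raw printk() can have its messages toggled at runtime through the dynamic debug control file:

    #include <linux/module.h>
    #include <linux/kernel.h>

    /* With CONFIG_DYNAMIC_DEBUG=y, pr_debug() call sites can be switched on
     * at runtime, e.g.:
     *   echo 'module mydriver +p' > /sys/kernel/debug/dynamic_debug/control */
    static int __init mydriver_init(void)
    {
        pr_debug("mydriver: loaded\n");
        return 0;
    }

    static void __exit mydriver_exit(void)
    {
        pr_debug("mydriver: unloaded\n");
    }

    module_init(mydriver_init);
    module_exit(mydriver_exit);
    MODULE_LICENSE("GPL");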
evbug is a module that monitors input events in the kernel; the line above is one of the messages it can issue. It's a very simple module, you can check it at drivers/input/evbug.c. Unfortunately, it calls printk() directly, so you can't control its output through dynamic debug.
In the end, the answer to your actual question is: check the output of the dmesg command. But be aware that the kernel log buffer is fairly small, and if you produce a lot of logs you may miss some of them.
I've got a program that is running on a headless/embedded Linux box, and under certain circumstances that program seems to be using up quite a bit more memory (as reported by top, etc) than I would expect it to use.
Since the fault condition is difficult to reproduce outside of the actual working environment, and since the embedded box doesn't have niceties like valgrind or gdb installed, what I'd like to do is simply write out the process's heap-memory to a file, which I could then transfer to my development machine and look through at my leisure, to see if I can tell from the contents of the file what kind of data it is that is taking up the bulk of the heap. If I'm lucky there might be a smoking gun like a repeating string or magic-number that comes up a lot, that points me to the place in my code that is either leaking or perhaps just growing a data structure without bounds.
Is there a good way to do this? The only way I can think of would be to force the process to crash and then collect a core dump, but since the fault condition is rare it would be preferable if I could collect the information without crashing the process as a side effect.
You can read the entire memory space of the process via /proc/pid/mem, and you can read /proc/pid/maps to see what is where in that space (so you can find the bounds of the heap and read just that region). You can attempt to read the data while the process is running (in which case it might change while you are reading it), or you can stop the process with a SIGSTOP signal and later resume it with SIGCONT.
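Something along these lines could be cross-compiled for the box (an untested sketch with minimal error handling; the output file name is made up). It finds the [heap] range in /proc/<pid>/maps and copies it out of /proc/<pid>/mem; note that on many kernels /proc/<pid>/mem is only readable while you are ptrace-attached to the target, which also stops it while you copy:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid_t pid = atoi(argv[1]);
        char path[64], line[256];
        unsigned long start = 0, end = 0;

        /* Find the bounds of the heap in /proc/<pid>/maps. */
        snprintf(path, sizeof(path), "/proc/%d/maps", pid);
        FILE *maps = fopen(path, "r");
        while (maps && fgets(line, sizeof(line), maps)) {
            if (strstr(line, "[heap]")) {
                sscanf(line, "%lx-%lx", &start, &end);
                break;
            }
        }
        if (maps)
            fclose(maps);
        if (!start) {
            fprintf(stderr, "no [heap] segment found\n");
            return 1;
        }

        /* Attach (this also stops the target), copy the range, detach. */
        ptrace(PTRACE_ATTACH, pid, NULL, NULL);
        waitpid(pid, NULL, 0);

        snprintf(path, sizeof(path), "/proc/%d/mem", pid);
        int mem = open(path, O_RDONLY);
        int out = open("heap.dump", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[4096];
        for (unsigned long off = start; off < end; off += sizeof(buf)) {
            size_t chunk = end - off < sizeof(buf) ? end - off : sizeof(buf);
            ssize_t n = pread(mem, buf, chunk, off);
            if (n <= 0)
                break;      /* unreadable page: stop here */
            write(out, buf, n);
        }
        close(mem);
        close(out);
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;
    }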
I'm chasing a Heisenbug in a Linux x64 process. (Attaching to the process with a debugger or strace makes the problem never occur.) I've been able to put in an infinite loop when the code detects the fault and attach with gdb that way, but it just shows me that a file descriptor (fd) that should be working is no longer valid. I really want a history of that fd, hence trying strace, but of course that keeps the problem from reproducing.
Other factors indicate that the problem with gdb/strace is timing. I've tried running strace with -etrace=desc or even -eraw=open, writing the output to a ramdisk, to see if that would reduce the strace overhead enough to let the problem trigger, but no success. I tried strace+, but it is a good order of magnitude slower than strace.
The process I'm attaching to is partly a commercial binary that I don't have source access to, and partly code I preload into the process space, so printf-everywhere isn't 100% possible.
Do you have suggestions for how to trace the fd history?
Update: added note about strace+
I solved the tracing problem by:
Preloading wrapper stub functions around the relevant system calls: open(), close() and poll().
Logging the relevant information to a file on a ramdisk.
(The actual issue was a race, with the kernel's poll() trying to access pollfd memory and returning EFAULT.)
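For reference, the wrapper approach looks roughly like this (a sketch showing only close(); the log path and file names are made up). Build it as a shared object and start the target with LD_PRELOAD pointing at it:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Interpose close(): log the pid and fd, then call the real libc close().
     * Build:  gcc -shared -fPIC -o fdtrace.so fdtrace.c -ldl
     * Run:    LD_PRELOAD=./fdtrace.so ./target */
    int close(int fd)
    {
        static int (*real_close)(int);
        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        int log = open("/mnt/ramdisk/fdtrace.log",
                       O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (log >= 0) {
            char buf[64];
            int n = snprintf(buf, sizeof(buf), "pid %d: close(%d)\n",
                             getpid(), fd);
            write(log, buf, n);
            real_close(log);    /* close the log fd without logging it again */
        }
        return real_close(fd);
    }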
I've just spent two extra hours trying to find a bug in my modification of the Linux kernel. Every time I loaded my module into the kernel everything was fine, but when I unloaded it my mouse stopped working. Using printk I found an infinite loop. My question is: does anybody know good techniques for detecting such bugs? Sometimes it is difficult to find these loops and Linux becomes unpredictable, so how can I avoid infinite loops in the kernel? Thanks in advance.
There is some infrastructure in the kernel that allows you to detect some lockup conditions:
CONFIG_DETECT_SOFTLOCKUP
CONFIG_DETECT_HUNG_TASK
And the various lock-checking options you can find in the "Kernel hacking" section of the kernel config.
I've always found printk useful for that, as you did.
Another option would be to run your kernel in Bochs in debugging mode. And as I recall, there's a way to run the kernel under gdb. Google can help with those options.
Oh, you said "avoid" not "debug"... hmm, the best way to avoid is do not hack the kernel :^)
Seriously, when doing kernel-level programming you have to be extra careful. Add a main() to the code that stress-tests your routines in usermode before adding to the running kernel. And read over your code, especially after you've isolated the bug to a particular section. I once found an infinite loop in LynxOS's terminal driver when some ANSI art hung the operating system. Some junior programmer, apparently, had written that part, parsing the escape sequence options as text rather than numbers. The code was so bad, I got disgusted trying to locate the exact error that forced the loop, and just rewrote most of the driver. And tested it in usermode before adding to the kernel.
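One habit that helps avoid accidental infinite loops in kernel code is to bound every polling loop with a timeout instead of spinning until a condition that may never come true. A rough sketch (device_ready() is a made-up hardware check):

    #include <linux/jiffies.h>
    #include <linux/errno.h>
    #include <linux/types.h>
    #include <asm/processor.h>

    extern bool device_ready(void);     /* hypothetical hardware poll */

    /* Instead of "while (!device_ready()) ;", give up after 500 ms so a
     * wedged device cannot hang the CPU forever. */
    static int wait_for_device(void)
    {
        unsigned long timeout = jiffies + msecs_to_jiffies(500);

        while (!device_ready()) {
            if (time_after(jiffies, timeout))
                return -ETIMEDOUT;
            cpu_relax();
        }
        return 0;
    }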
You could also try enabling the NMI watchdog (for example by booting with nmi_watchdog=1), which can detect hard lockups where interrupts are disabled.