Linux device driver unsafe FXSAVE/FXRSTOR bug -- any precedents? - linux

There's a nasty problem that has temporarily stumped a number of engineers at my company trying to debug it.
The C++ program is normally run on a cluster of multicore computers with MPI.
It will run for a very long time -- perhaps days -- and then suddenly fail.
Most of the engineers working on it have eliminated any reasonable possibility of a bug in the program itself, so they're starting to assign blame to a possible hardware problem, but I suspect there must be a software problem in either a Linux kernel module or device driver.
What I suspect is that a kernel module or device driver, in order to do some floating-point calculations, is doing FXSAVE/FXRSTOR in a manner that is unsafe on SMP systems. It could be something as simple as doing the FXSAVE to a static buffer in a kernel routine that needed to be reentrant. That would create a race condition bug that would very rarely corrupt the floating-point context of a thread.
At the application level, what appears to be happening is that one or more bits of the MXCSR -- which is part of the FXSAVE/FXRSTOR context -- is suddenly changed, but there is no application code to change it.
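To pin down exactly which bits are flipping, a small helper along these lines might be useful. It is only a sketch: it assumes the application logs the MXCSR value at known points (for example via the `_mm_getcsr()` intrinsic), and the names `MXCSR_BITS`, `changed_mxcsr_bits`, `RC_LO` and `RC_HI` are made up for illustration. The bit positions follow the documented MXCSR layout.

```python
# Bit names follow the documented MXCSR layout; "RC_LO"/"RC_HI" are names
# invented here for the two rounding-control bits (13 and 14).
MXCSR_BITS = [
    "IE", "DE", "ZE", "OE", "UE", "PE",  # bits 0-5: sticky exception flags
    "DAZ",                               # bit 6: denormals-are-zero
    "IM", "DM", "ZM", "OM", "UM", "PM",  # bits 7-12: exception masks
    "RC_LO", "RC_HI",                    # bits 13-14: rounding control
    "FZ",                                # bit 15: flush-to-zero
]

MXCSR_DEFAULT = 0x1F80  # reset value: all exceptions masked, round-to-nearest


def changed_mxcsr_bits(expected, observed):
    """Return the names of the MXCSR bits that differ between two values."""
    diff = (expected ^ observed) & 0xFFFF
    return [name for bit, name in enumerate(MXCSR_BITS) if diff & (1 << bit)]
```

Note that the sticky exception flags (bits 0-5) change during ordinary floating-point work, so only differences in the mask, rounding-control, DAZ and FZ bits would point at corruption.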
I encountered something similar many years ago on Windows, which ultimately turned out to be a bug in a video driver, such that when the application code was preempted by the operating system, some MXCSR bits in that thread's context were corrupted.
I'm not an expert at Linux kernel hacking or device driver development, but I'm reading that the reentrancy rules have been changing a lot: between non-SMP and SMP (multi-core) systems, between kernel versions, etc. So the possibility of a race-condition bug seems reasonable.
So my question is: Are there any known Linux driver (or kernel) bugs that fit that description?
Any precedents that I could cite would be helpful, if they had similar symptoms. At this point, a lot of the people involved are (IMHO) wasting time thinking "well, there's no bug in my code, so it must be bad hardware." and I'd like to get them beyond that and looking for something more likely to be the true cause.

The source for your kernel is available, usually as a src.rpm. You can extract this (and the .tgz inside) and then grep everything for fxsave asm instructions and the like. I'd be very surprised if you find something, but who knows? If you are running any binary video drivers then see if the problem persists without them loaded.
# download kernel-2-whatever.src.rpm, then:
mkdir temp; cd temp
rpm2cpio ../kernel*rpm | cpio -id
tar xvf linux-*.tgz
grep -rniE 'fxsave|fxrstor' .
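If you would rather script the search (for example to run it over several kernel or driver trees), a rough Python equivalent might look like this. It is a sketch under assumptions: the function name `find_fpu_save_sites` and the list of file suffixes are mine, and I have added `kernel_fpu_begin`/`kernel_fpu_end` to the pattern because those are the kernel's usual wrappers around in-kernel FPU use.

```python
import os
import re

# fxsave/fxrstor are the instructions in question; kernel_fpu_begin/end are
# added here as an extra place worth looking at (an assumption of this sketch).
PATTERN = re.compile(r"fxsave|fxrstor|kernel_fpu_(begin|end)", re.IGNORECASE)
SOURCE_SUFFIXES = (".c", ".h", ".S")


def find_fpu_save_sites(root):
    """Yield (path, line_number, line) for every matching source line."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(SOURCE_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps odd encodings from aborting the scan.
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if PATTERN.search(line):
                        yield path, number, line.rstrip()
```

Calling `list(find_fpu_save_sites("."))` from the extracted source directory gives the hits with file names and line numbers.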

Related

When the linux kernel reports time as spent servicing "soft interrupts", precisely what does it mean?

I have top showing high activity in the %si column for several cpus. sar reports something similar.
Q1) What are the possibilities for what might be executing in the kernel? It appears that "softirqs" themselves are more or less outdated, and generally function as a mechanism to implement other interfaces, including tasklets, rculists, and I'm not sure what else. I'd like to get a comprehensive list.
Q2) How can I get more precise information on what's actually running as "soft interrupts" on my test system?
As it happens, I have strong suspicions that a particular device driver is involved, since the hard interrupt % is also high, and only on whatever cpu that's presently handling this particular device's interrupt :-) But I haven't so far found anything in the driver source code that looks to me as if it could result in softirq activity. I'm probably missing the obvious, so I'm asking for help ;-)
My kernel is rather outdated - 2.6.32 based (RHEL 6.1, I believe) but I doubt that matters too much for questions this general.
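For Q2, one possible starting point, assuming the kernel exposes /proc/softirqs (I believe 2.6.32 does), is to snapshot its per-type counters (NET_RX, TASKLET and so on) twice and see which ones are climbing. A minimal sketch, with `parse_softirqs` and `softirq_deltas` being names made up here:

```python
def parse_softirqs(text):
    """Parse /proc/softirqs-style text into {softirq_name: [count_per_cpu]}."""
    lines = text.strip().splitlines()
    counts = {}
    for line in lines[1:]:  # the first line is the CPU0 CPU1 ... header
        name, _, rest = line.partition(":")
        counts[name.strip()] = [int(field) for field in rest.split()]
    return counts


def softirq_deltas(before, after):
    """Total increase per softirq type, summed over CPUs, between snapshots."""
    return {
        name: sum(after[name]) - sum(before[name])
        for name in after
        if name in before
    }
```

Reading /proc/softirqs, sleeping for a second, reading it again and passing both parsed snapshots to `softirq_deltas` should show which softirq type accounts for the activity, which narrows down the subsystem if not the exact driver.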

programming my own kernel

I need some directions to start learning about programming my own operating system kernel.
Just for educational purposes.
How can I write my own Kernel?
I would first ask: why did you pick "writing a kernel?" Any answer other than "the idea of implementing my own task structures in memory to be swapped by a scheduler that I write and using memory that is managed by code that I wrote and is protected by abstractions of machine-level atomic instructions and is given I/O access through abstractions that sit atop actual hardware interfaces appeals to me" is probably a bad answer that indicates you haven't done any research whatsoever and are wasting your time.
If you answered similarly to the above, then you have a good starting point and you know what you need to research (that is, you are able to pinpoint to some degree what information you do not know but need to find out).
Either way, I don't think this question is worth asking. In one case, you have done no research of your own to discover if you can actually do this, and in the other case you asked an overly-broad question.
It isn't that hard, but you need to learn about proper resource management and low-level device I/O. If you're targeting a commodity x86 box, then you'll need to learn about how the BIOS works and how the disk is structured. For example, the BIOS will read the first block of the disk into memory at some fixed address and then jump to that address. Since there probably won't be enough space in one block to store your kernel, you'll need to write a boot loader to read your kernel off the disk and load it.
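To make the "first block" part concrete, here is a small sketch that builds such a block. The details are the conventional PC BIOS ones rather than anything specific to this answer: the block is 512 bytes, the BIOS expects it to end in the bytes 0x55 0xAA, and it is loaded at address 0x7C00. The function names are made up for illustration.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes a PC BIOS checks for
LOAD_ADDRESS = 0x7C00         # where the BIOS places the sector before jumping


def make_boot_sector(code):
    """Pad raw machine code to one sector and append the boot signature."""
    room = SECTOR_SIZE - len(BOOT_SIGNATURE)
    if len(code) > room:
        raise ValueError(f"boot code is {len(code)} bytes; only {room} fit")
    return code.ljust(room, b"\x00") + BOOT_SIGNATURE


def is_bootable(sector):
    """True if the sector has the right size and ends with the signature."""
    return len(sector) == SECTOR_SIZE and sector[-2:] == BOOT_SIGNATURE


# cli; hlt; jmp back to the hlt: a stub that simply parks the CPU.
PARK_STUB = bytes([0xFA, 0xF4, 0xEB, 0xFD])
```

Writing `make_boot_sector(PARK_STUB)` to a file gives an image that an emulator should accept as a boot disk. The 510-byte limit is exactly why a real kernel needs the boot loader stage described above.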
Writing a minimal kernel that does some simple multitasking and performs I/O using just the BIOS isn't too difficult, just don't expect to be throwing up any windows and mousing around any time soon. You'll be busy trying to implement a simple file system and getting read() and write() to work.
Maybe you can start by looking into OS/161, which is Harvard's simplified operating system for educational purposes. The OS runs on a simulator, so you don't need a new machine to run it. I used it for my operating system course, and it really did help a lot.
Also I think you may really want to consider taking an operating system course if you haven't done so.

How to test the kernel for kernel panics?

I am testing the Linux Kernel on an embedded device and would like to find situations / scenarios in which Linux Kernel would issue panics.
Can you suggest some test steps (manual or code automated) to create Kernel panics?
There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool to try to exercise the filesystem code extensively; it's good for testing drivers, block io and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed with crashing systems in particular, but it may go a long way towards helping you and your team keep everything working as planned. (Note that the LTP isn't prescriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it, using a variety of scanning options: -sV --version-all will try to find versions of all services running (this can be stressful), -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing the OS from its responses.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, and they'll spot bugs you'd never even think to look for. :)
You can try the following key combination:
Alt + SysRq + c
or
echo c >/proc/sysrq-trigger
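For an automated test run, the same trigger could be wrapped in a few lines of Python. This is just a sketch with a made-up function name; the `path` parameter exists only so the function can be pointed at an ordinary file while trying it out. With the default path it needs root and, for 'c', it really will crash the kernel.

```python
import os

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(command, path=SYSRQ_TRIGGER):
    """Write a single SysRq command character to the trigger file.

    'c' crashes the kernel immediately, so only use the default path on a
    machine you are prepared to lose.
    """
    if len(command) != 1:
        raise ValueError("SysRq commands are a single character")
    # Flush pending disk writes first; after a 'c' there is no second chance.
    os.sync()
    with open(path, "w") as handle:
        handle.write(command)
```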
Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine if sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and if you don't have the mprotect enabled in the source code crashme.c for your platform, then Crashme is impotent.
It seems that operating systems on x64 architectures tend to have execution turned off for data segments. Recently I have updated the crashme.c on http://crashme.codeplex.com/ to use mprotect in case of __APPLE__ and tested it on a MacBook Pro running Mac OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated CentOS and FreeBSD support soon.

How to test the kernel to identify problems in the kernel that could cause kernel panic

I have an embedded Linux device. I am trying to come up with some test cases that would exercise various subsystems, code paths, and system calls in the kernel to identify problems/loose ends in the kernel that lead to kernel panics. Can someone suggest some test ideas for this kind of testing?
Otherwise, can someone suggest some ideas for testing the kernel so that it could be made more stable, robust, efficient, fast, etc.? Can we write unit tests for the Linux kernel?
The Linux Test Project looks like it might have some of what you want. There seem to be some tools for fuzz testing parts of the kernel, but those are mostly related to filesystems and network protocols.
For system calls tests you could use the POSIX Test Suite.
The test suite divides tests into several categories:
Conformance, Functional, Stress, Performance, and Speculative.
The last three are probably of most importance to you.
You could also take a peek at the "Stress-testing the Linux kernel" article at IBM regarding the Linux Test Project.
It depends what kind of device it is. If it is a block device, you should try running lots of different filesystems on it (ideally all of them), and hammer them hard with loads of processes.
If the device supports only a single open, then that file descriptor can still be shared between processes (which can all hammer the device).

How to "hibernate" a process in Linux by storing its memory to disk and restoring it later?

Is it possible to 'hibernate' a process in Linux?
Just like 'hibernate' on a laptop, I would like to write all the memory used by a process to disk and free up the RAM. Then, later on, can I 'resume the process', i.e., read all the data back from disk into RAM and continue with my process?
I used to maintain CryoPID, which is a program that does exactly what you are talking about. It writes the contents of a program's address space, VDSO, file descriptor references and states to a file that can later be reconstructed. CryoPID started when there were no usable hooks in Linux itself and worked entirely from userspace (actually, it still does work, depending on your distro / kernel / security settings).
Problems were (indeed) sockets, pending RT signals, numerous X11 issues, and the glibc caching getpid() implementation, amongst many others. Randomization (especially VDSO) turned out to be insurmountable for the few of us working on it after Bernard walked away from it. However, it was fun and became the topic of several master's theses.
If you are just contemplating a program that can save its running state and restart directly into that state, it's far, far easier to just save that information from within the program itself, perhaps when servicing a signal.
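As a rough illustration of that last approach (not how CryoPID works), here is a minimal sketch in Python. The state is a plain dict, SIGUSR1 is an arbitrary choice of signal, and all the function names are made up. The handler only sets a flag, and the main loop does the actual file I/O at a point where the state is consistent.

```python
import pickle
import signal

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the main loop does the actual I/O."""
    global checkpoint_requested
    checkpoint_requested = True


def save_state(state, path):
    """Serialize the program's state to a file."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load_state(path):
    """Read back a state previously written by save_state."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(state, path, steps):
    """Toy work loop: counts upward and checkpoints when asked via SIGUSR1."""
    global checkpoint_requested
    signal.signal(signal.SIGUSR1, request_checkpoint)
    for _ in range(steps):
        state["counter"] += 1
        if checkpoint_requested:
            save_state(state, path)
            checkpoint_requested = False
    return state
```

On startup the program would call `load_state` if a checkpoint file exists and pass the result to `run`, otherwise start from a fresh state.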
I'd like to put a status update here, as of 2014.
The accepted answer suggests CryoPID as a tool to perform Checkpoint/Restore, but I found the project to be unmaintained and impossible to compile with recent kernels.
Now, I found two actively maintained projects providing the application checkpointing feature.
The first, the one I suggest because I have had better luck running it, is CRIU, which performs checkpoint/restore mainly in userspace and requires the kernel option CONFIG_CHECKPOINT_RESTORE to be enabled.
Checkpoint/Restore In Userspace, or CRIU (pronounced kree-oo, IPA: /krɪʊ/, Russian: криу), is a software tool for Linux operating system. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to restore and run the application from the point it was frozen at. The distinctive feature of the CRIU project is that it is mainly implemented in user space.
The latter is DMTCP; quoting from their main page:
DMTCP (Distributed MultiThreaded Checkpointing) is a tool to transparently checkpoint the state of multiple simultaneous applications, including multi-threaded and distributed applications. It operates directly on the user binary executable, without any Linux kernel modules or other kernel modifications.
There is also a nice Wikipedia page on the topic: Application_checkpointing
The answers mentioning ctrl-z are really talking about stopping the process with a signal, in this case SIGTSTP. You can issue a stop signal with kill:
kill -STOP <pid>
That will suspend execution of the process. It won't immediately free the memory used by it, but as memory is required for other processes the memory used by the stopped process will be gradually swapped out.
When you want to wake it up again, use
kill -CONT <pid>
The more complicated solutions, like CryoPID, are really only needed if you want the stopped process to be able to survive a system shutdown/restart - it doesn't sound like you need that.
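If you want to do the same thing from a script rather than the shell, the two kill commands map directly onto `os.kill` in Python. A minimal sketch, with made-up function names:

```python
import os
import signal


def suspend(pid):
    """Equivalent of `kill -STOP <pid>`: the process stops being scheduled."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Equivalent of `kill -CONT <pid>`: the process continues where it was."""
    os.kill(pid, signal.SIGCONT)
```

As with the shell commands, this needs permission to signal the target process, and it does nothing to force the memory out to swap.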
The Linux kernel has now partially implemented the checkpoint/restart features: https://ckpt.wiki.kernel.org/, the status is here.
Some useful information is on LWN (Linux Weekly News):
http://lwn.net/Articles/375855/ http://lwn.net/Articles/412749/ ......
So the answer is "YES"
The issue is restoring the streams - files and sockets - that the program has open.
When your whole OS hibernates, the local files and such can obviously be restored. Network connections can't, but code that accesses the internet typically does more error checking and survives the error conditions (or ought to).
If you did per-program hibernation (without application support), how would you handle open files? What if another process accesses those files in the interim? etc?
Maintaining state when the program is not loaded is going to be difficult.
Simply suspending the threads and letting it get swapped to disk would have much the same effect?
Or run the program in a virtual machine and let the VM handle suspension.
Short answer is "yes, but not always reliably". Check out CryoPID:
http://cryopid.berlios.de/
Open files will indeed be the most common problem. CryoPID states explicitly:
Open files and offsets are restored. Temporary files that have been unlinked and are not accessible on the filesystem are always saved in the image. Other files that do not exist on resume are not yet restored. Support for saving file contents for such situations is planned.
The same issues will also affect TCP connections, though CryoPID supports tcpcp for connection resuming.
I extended CryoPID, producing a package called Cryopid2, available from SourceForge. This can migrate a process as well as hibernate it (along with any open files and sockets: data in sockets/pipes is sucked into the process on hibernation and spat back into these when the process is restarted).
The reason I have not been active with this project is that I am not a kernel developer. Both this and the original CryoPID need to get someone on board who can get them running with the latest kernels (e.g. Linux 3.x).
The CryoPID method does work, and is probably the best solution to general-purpose process hibernation/migration in Linux I have come across.
The short answer is "yes." You might start by looking at this for some ideas: ELF executable reconstruction from a core image (http://vx.netlux.org/lib/vsc03.html)
As others have noted, it's difficult for the OS to provide this functionality, because the application needs to have some error checking built in to handle broken streams.
However, on a side note, some programming languages and tools that use virtual machines explicitly support this functionality, such as the Self programming language.
This is sort of the ultimate goal of a clustered operating system. Matthew Dillon has put a lot of effort into implementing something like this in his DragonFly BSD project.
Adding another workaround: you can use VirtualBox. Run your applications in a regular virtual machine and simply "save the machine state" whenever you want.
I know this is not an answer, but I thought it could be useful when there are no real options.
If for any reason you don't like VirtualBox, VMware and QEMU are just as good.
Ctrl-Z increases the chances the process's pages will be swapped, but it doesn't free the process's resources completely. The problem with freeing a process's resources completely is that things like file handles, sockets are kernel resources the process gets to use, but doesn't know how to persist on its own. So Ctrl-Z is as good as it gets.
There was some research on checkpoint/restore for Linux back in the 2.2 and 2.4 days, but it never made it past prototype. It is possible (with the caveats described in the other answers) for certain values of possible: if you can write a kernel module to do it, it is possible. But for the common value of possible (can I do it from the shell on a commercial Linux distribution), it is not yet possible.
There's Ctrl+Z in Linux, but I'm not sure it offers the features you specified. I suspect you asked this question because it doesn't.
