My system runs CentOS 6.3 (kernel version 2.6.32-279.el6.x86_64).
I have a loadable kernel module which is a driver that manages a PCIe card.
If I manually insert the driver using insmod while the OS is up and running, the driver loads successfully and is operational.
However, if I install the driver using rpm and then reboot, the OS gets stuck during startup, printing the following "soft lockup" message for all CPU cores except one; that remaining core is in a "soft lockup" in one of the threads created by my driver.
BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).
I searched the net quite a bit for info on this kernel message/bug. There are quite a few posts about it, but none explain what causes it or how to debug it. Any help with the following questions would really be appreciated:
1) I am not able to log into the system, I think because all the cores are in a "soft lockup" state, and hence I cannot trigger a kernel dump from a shell prompt. I enabled SysRq and tried to trigger a kernel dump with the SysRq key combo, but no luck. The system does not seem to respond to the keyboard at all (not even to the Caps Lock key). Any suggestions on how I can trigger a kernel dump in this circumstance?
2) I can imagine the possibility of my driver thread causing a "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?
3) From browsing the net, the "migration" thread is used to move tasks from one CPU to another. Can someone please help me understand what this thread exactly does, and how it can be affected by other threads, if at all?
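The lockup messages quoted above follow a fixed pattern, so a captured console log can be summarized per task to confirm the picture described here (many CPUs stuck in "migration", one stuck in a driver thread). This is only a minimal sketch: the regular expression is modeled on the two messages quoted in this question, other kernel versions may format the line differently, and the function name stuck_cpus_by_task is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modeled on the messages quoted in the question (assumption: other
# kernels may word this line differently).
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def stuck_cpus_by_task(log_text):
    """Group stuck CPU numbers by task name, dropping the per-CPU '/N' suffix."""
    by_task = defaultdict(list)
    for match in LOCKUP_RE.finditer(log_text):
        # "migration/8" and "migration/3" are both counted as "migration".
        base_name = match.group("task").split("/")[0]
        by_task[base_name].append(int(match.group("cpu")))
    return dict(by_task)
```

Feeding it the saved console output returns a dictionary such as one entry for "migration" listing most CPUs and one entry for the driver thread listing a single CPU, which makes the odd one out easy to spot.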

I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.
It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.
Check if CentOS offers an intel-microcode package, and install it. Make sure you configure grub to load it as the initial ramdisk before it loads initramfs.
Personally, I upgraded my microcode by booting into Windows and running a BIOS update. You can check whether the microcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo before and after the update.
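The before/after comparison can also be scripted. The following is a minimal sketch, assuming /proc/cpuinfo on your kernel reports one "microcode" line per "processor" entry; the function names are made up for illustration.

```python
def microcode_revisions(cpuinfo_text):
    """Map each logical processor number to its reported microcode revision."""
    revisions = {}
    current_cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current_cpu = int(value)
        elif key == "microcode" and current_cpu is not None:
            revisions[current_cpu] = value
    return revisions


def microcode_changed(before_text, after_text):
    """Return True if any processor reports a different revision after the update."""
    return microcode_revisions(before_text) != microcode_revisions(after_text)
```

Save a copy of /proc/cpuinfo before the update, read it again after rebooting, and pass both texts to microcode_changed. An empty result from microcode_revisions means the kernel does not expose the field in this form.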


CUDA performance penalty when running in Windows

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: For whatever reason, the Windows Nvidia driver (version 331.65) does not immediately dispatch a CUDA kernel when invoked via the runtime API.
To illustrate the problem I profiled the mergeSort application (from the examples that ship with CUDA 5.5).
Consider first the kernel launch time when running in Linux:
Next, consider the launch time when running in Windows:
This post suggests the problem might have something to do with the Windows driver batching the kernel launches. Is there any way I can disable this batching?
I am running with a GTX 690 GPU, Windows 7, and version 331.65 of the Nvidia driver.
There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.
As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.
The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but it is only supported on Tesla GPUs and certain members of the Quadro family of GPUs -- i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)
AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver. Unofficially, according to Greg#NV in this link, the command to issue after the CUDA kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to "flush" to the GPU.
As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.
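To see why flushing after every launch defeats the purpose of batching, here is a toy cost model. It is only a sketch: it assumes each submission through the driver stack pays one fixed overhead no matter how many launches the batch contains, and the numbers used below are hypothetical, not measurements of WDDM.

```python
def total_submit_overhead_us(num_launches, overhead_per_submit_us, batch_size):
    """Total submission overhead when launches are sent in batches.

    Toy model: one fixed overhead per batch, regardless of batch contents.
    """
    num_batches = -(-num_launches // batch_size)  # ceiling division
    return num_batches * overhead_per_submit_us


# Hypothetical numbers: 1000 launches, 40 us of overhead per submission.
batched = total_submit_overhead_us(1000, 40, batch_size=20)
flushed_every_launch = total_submit_overhead_us(1000, 40, batch_size=1)
```

Under these made-up numbers the batched case pays the overhead 50 times and the flush-every-launch case pays it 1000 times. The trade-off is latency: in the batched case an individual launch may sit in the queue until its batch is submitted.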
EDIT: moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
EDIT2: Using recent drivers on Windows, you should be able to place Titan family GPUs in TCC mode, assuming you have some other GPU set up for primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).
Additional info about the TCC driver model can be found in the Windows install guide, including that it may reduce the latency of kernel launches.
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.
Even though it's been almost 3 years since this issue was active, I still consider it necessary to share my findings.
I've been in the same situation: the same CUDA program took 5 ms on Ubuntu with CUDA 8.0 but over 30 ms on Windows 10 with CUDA 10.1. Both ran on a GTX 1080 Ti.
However, on Windows, when I switched from building in Visual Studio to building with nvcc from the command line, the program suddenly ran at the same speed as the Linux one.
This suggests that maybe the problem comes from Visual Studio.

How to simulate an interrupt storm or a live lock on Linux?

I am developing a tool which boots a custom build of Linux into a Qt-based desktop on x86 machines. My custom Linux runs from USB, and when it boots on a machine with a certain brand of sound card connected, my tool runs into a live lock situation with a lot of interrupts. I suspect it's some problem with the APIC driver, but the system is rendered useless and I have to power it off.
My Question:
I would like to simulate the same situation by using a kernel driver or module. I am not sure if I can cause an interrupt to fire from a module. I have experience with I2C and SPI, which cause interrupts on ARM-based Linux boards, but I don't know how to do it from a module.
Could anybody please suggest how to cause an interrupt from a driver?
Just create a module with an interrupt forkbomb in it. Google it. It'll only take a second for your vm to halt.

Should a bad USB device be able to crash a bug free Linux kernel?

My question is rather broad, I know, but I have been wondering about this for a long time.
A little background. I work in a physics lab where all the lab computers are running Debian (a mix of older versions and Lenny) or, more recently, Ubuntu 10.04 LTS. We have written a lot of custom software to interface with experiment hardware and other computers.
We have a lot of FPGA boards controlling various parts of the experiment; these are connected via USB to different computers. After upgrading a computer controlling an experiment, we started seeing crashes/lockups of the computer running all the lasers. This used to be completely stable.
My question is this: If the entire computer locks up because of an issue with
a) Python/GTK software gui
b) USB device driver
c) The actual device
can this be blamed on the Linux kernel (or other levels of the OS)?
Is it unfair to ask the Linux kernel not to panic even if I make mistakes in my implementation of software/hardware?
My own guess: Any user level applications should never be able to crash the entire system since they should only have access to their own stuff.
Any device driver becomes a part of the kernel itself and will therefore be able to crash it. Is my reasoning sound?
Bonus question: Is there a way to insulate device and kernel somehow, such that Linux will keep running happily no matter what stupid mistakes are made with the hardware? That would be very useful for two reasons:
1) debugging is easier with a running system,
2) For the purposes of the experiment we really need long uptimes and having only a part of the system crash is infinitely better than crashes in one part of the system propagating to the rest.
Any links and reading material on this subject would be appreciated. Thank you.
You are correct that unprivileged code should not be able to bring down the system, unless there's a kernel bug. The line between unprivileged and privileged isn't exactly the same as user-space vs kernel, however. A user-mode program can open /dev/kmem and trash the OS's internal data structures, if the user account has superuser privileges.
To insulate the main kernel from device driver problems, run the device driver inside a virtual machine.
Several popular VM systems, including VMWare Workstation, support forwarding an arbitrary USB device from the host to the guest without a device-specific driver on the host.

Can I use JTAG to debug my program on top of embedded Linux?

I am using an at91sam9260 for my developments. There is a Linux kernel running in it and I start my own software on top of it.
I was wondering if I could use a JTAG debugger to debug the software I am working on without seeing too much of what is going on in the Linux kernel?
I am asking because I think it might become very complex to debug my software while seeing the full Linux execution.
In other words, I would like to know if there could be some abstraction layer when debugging with a JTAG probe?
Probably not -- as far as I know, most JTAG debuggers assume the ability of setting breakpoints in the processor. Under a multitasking OS, that stops the OS kernel too.
Embedded OS's like QNX have debuggers that operate on top of the OS kernel and which communicate over Ethernet.
Generally yes, you can. JTAG as a debugger has absolutely nothing to do with what software you happen to be running on that processor. Where you can get into trouble is the cache. For example, suppose you stop the processor, change some instructions in RAM, and restart. Changing instructions in RAM is a data access, which goes through the data cache, not the instruction cache. If you have separate instruction and data caches, they are enabled, and some of the instructions you modified are at addresses that are in the instruction cache, you can get messed up pretty fast with a mix of new and stale instructions being fed to the processor. Linux likes to use the caches if they are there.
Second is the MMU. The processor/JTAG is likely operating on virtual addresses, on the processor side of the MMU, not on physical addresses. So, depending on how the hardware works, if for example you set a breakpoint by address in a debug unit in the processor and the operating system task-switches to another program/thread that uses the same address, you will break in the wrong program at the right address. If the debugger/processor sets breakpoints by modifying an instruction in RAM, then you run into the cache problem above. If the instruction is not cached, you will break on the right instruction in the right thread; otherwise you have that cache problem.
Bottom line: absolutely. If the processor supports JTAG-based debugging, that doesn't change based on whatever software you choose to run on that processor.
It depends on the JTAG device and its driver. Personally, I know of only one device that is capable of doing that: XDS560 + Code Composer Studio (CCS). But there may be others.
I suggest consulting the manufacturer of your device.
For ARM, the Asset Arium family is claimed to be able to debug application code. I haven't tried it, though.

GDB shows the wrong thread in postmortem analysis

I am experiencing strange behavior from GDB. When running a post-mortem analysis of a core dumped from a heavily multithreaded C++ application, the debugger commands
thread info
never tell me the thread in which the program actually crashed. It keeps showing me thread number 1. As I am used to seeing this work on other systems, I am curious whether it is a bug in GDB or whether they changed the behavior somehow. Can anyone point me to a solution? It is a PITA to search through 75 threads just to find out something the debugger already knows.
By the way, I am on Debian Squeeze (6.0.1), the version of GDB is 7.0.1-debian, the System is x86 and completely 32-Bit. On my older Debian (5.x) installation, debugging a core, dumped by the exact same source, delivers me a backtrace of the correct thread, as does GDB on a Ubuntu 10.04 installation.
GDB does not know which thread caused the crash, and simply shows the first thread that it sees in the core.
The Linux kernel usually dumps the faulting thread first, and that is why on most systems you end up in exactly the correct thread once you load core into GDB.
I've never seen a kernel where this was broken, but I've never used Debian 6 either.
My guess would be that this was broken, and then got fixed, and Debian 6 shipped with a broken kernel.
You could try upgrading the kernel on your Debian 6 machine to match e.g. your Ubuntu 10.04, and see if the problem disappears.
Alternatively, Google's user-space coredumper does it correctly. You can link it in and call it from a SIGSEGV handler.
