CUDA device seems to be blocked - Linux

I'm running a small CUDA application: the QuickSort benchmark algorithm (see here). I have a dual-GPU system with an NVIDIA GTX 660 (device 0) and an 8600 GTS (device 1).
Under Windows 8 and Visual Studio, the application compiles and runs flawlessly on device 0. Under Linux (Ubuntu 12.04 LTS), the app compiles with nvcc and gcc but suddenly stops in its tracks with an "unspecified launch failure".
I have two issues:
After this error, my GPU cannot perform some other operations. For example, running the SDK example bandwidthTest blocks at the first data transfer, while deviceQuery still runs fine. How can I reset my GPU? I've already tried cudaDeviceReset(), but it doesn't help.
How can I find out what is going wrong under Linux? Does anyone have a clue or has seen this before?
Thanks in advance for your help!

Using the nvidia-smi utility, you can reset the GPU if it supports this.
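For example (this needs root, the GPU must be idle, and not every board supports it; the device index 0 is a placeholder):
sudo nvidia-smi --gpu-reset -i 0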
In my knowledge and experience, an "unspecified launch failure" usually refers to a segmentation fault. Have you specified the right GPU to use? Try cuda-memcheck to see whether there is any out-of-bounds memory access.

From my experience, XID 31 was always caused by accessing a bad pointer (i.e., a memory access violation).
I'd pursue this trail first. Run your application under cuda-memcheck, like this: cuda-memcheck your_app args, and see if it finds any invalid memory accesses.
Also try stepping through the code with cuda-gdb or Nsight Eclipse Edition.
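Independent of the tools above, it helps to wrap every CUDA call and kernel launch in an error check so you can see exactly where the "unspecified launch failure" is first reported. A minimal sketch (the CUDA_CHECK macro and the dummy kernel are illustrative, not from the benchmark code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the file/line of the first failing CUDA call, then bail out.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s at %s:%d\n", \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

__global__ void dummyKernel(int *out) { *out = 42; }

int main() {
    int *d_out;
    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaMalloc(&d_out, sizeof(int)));
    dummyKernel<<<1, 1>>>(d_out);
    CUDA_CHECK(cudaGetLastError());      // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize()); // reports errors during execution
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}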

I've found that using
cuda-memcheck -b ...
prevents the device from locking up.

Related

Code working on Windows but launch failures on Linux

First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem but a configuration problem.
I have a piece of code that does some mathematics in CUDA kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and an Ubuntu 17.04 machine (2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. It is long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces everything to run on one GPU, the first one, selected with cudaSetDevice(0).
When I copy the same input data and code to the Linux machine (I am using git; it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
If I change the kernel so that, instead of doing the math, it just writes a number to the output, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that there is a problem outside the code: not with the code itself, nor with the general configuration of the drivers/environment variables.
I read that xorg.conf can affect the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from it, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still error out.
My question is: what else should I look at? What Linux-specific configuration should I examine to pinpoint the cause of the kernel halts?
The error did indeed turn out to be an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is machine specific: my Linux machine returns 8 while my Windows machine returns 4. Since this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines variable sizes in bits (e.g., uint32(1)), there was a size mismatch on the Linux machine when doing memcpys. It turned out that this happened in a variable used as an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, producing the illegal memory access.
Too specific? yeah.
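For anyone hitting the same class of bug: the portable fix is to use the fixed-width types from <cstdint> instead of unsigned long wherever a size is shared across languages or machines. A small sketch of the mismatch (hypothetical, not the asker's code):

#include <cstdint>
#include <cstdio>

int main() {
    // LP64 Linux: unsigned long is 8 bytes; LLP64 Windows: 4 bytes.
    printf("sizeof(unsigned long) = %zu\n", sizeof(unsigned long));
    // uint32_t is 4 bytes everywhere, matching MATLAB's uint32.
    printf("sizeof(uint32_t) = %zu\n", sizeof(uint32_t));
    return 0;
}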

Running perf record with the Intel-PT event on compiled binaries from SpecCPU2006 crashes the server machine

I am having a recurring problem when using perf with the Intel-PT event. I am profiling on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz machine, with x86_64 architecture and 32 hardware threads, with virtualization enabled. I specifically use programs/source codes from SpecCPU2006 for profiling.
I am observing that the first time I profile one of the compiled binaries from SpecCPU2006, everything works fine and the perf.data file gets generated, as expected with Intel-PT. As SpecCPU2006 programs are computationally intensive (they use 100% of a CPU at all times), the perf.data files are understandably large for most of the programs; I obtain roughly 7-10 GB perf.data files for most of the profiled programs.
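For reference, the recording step looks roughly like this (the binary name and arguments are placeholders; this assumes the kernel exposes Intel-PT, i.e. perf list shows an intel_pt// event):
perf record -e intel_pt// -o perf.data -- ./spec_binary args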
However, when I try to profile the same compiled binary a second time, after the first run finished successfully, my server machine freezes up. Sometimes this happens on the third or fourth attempt (after the second or third profiling completed successfully). The behavior is highly unpredictable. After that, I cannot profile any more binaries until I restart the machine.
I have also posted the server error logs that appear once the computer has stopped responding.
Server error logs
Clearly there is an error message saying Fixing recursive fault but reboot is needed!.
This happens particularly with the larger SpecCPU2006 binaries, the ones that take more than a minute to run without perf.
Is there any particular reason why this might happen? It should not be caused by high CPU usage, as running the programs without perf, or with perf using any other hardware event (as shown by perf list), completes successfully. It only seems to happen with Intel-PT.
Please guide me through the steps to solve this problem. Thanks.
It seems I have resolved this issue now, so I will post an answer.
The server crashed because of a NULL pointer dereference happening with a specific member of the perf_event structure; the member perf_event->handle was the culprit. This information, as suggested by @osgx, was obtained from the /var/log/syslog file. A portion of the error message was:
Apr 19 04:49:15 ###### kernel: [582411.404677] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ea
Apr 19 04:49:15 ###### kernel: [582411.404747] IP: [] perf_event_aux_event+0x2e/0xf0
One possible scenario where this structure member turns out to be NULL is if I start capturing packets before an earlier run of perf record has finished releasing all of its resources. This is properly handled in kernel version 4.10; I was using kernel version 4.4.
I upgraded my kernel to the newer version and it works fine now!

big double vectors in C++

I have vector<double> a(8000000000);
So 8000000000 * 8 / 1024 / 1024 / 1024 ≈ 59.6 GB, so I need roughly 60 GB of RAM for a program with this huge vector to work, and I have that much RAM in my computer, but it doesn't work. WHY?
It compiles, but when I run it, I get this error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Thanks in advance
Two things MUST be true for your program to work correctly:
The OS has to be 64-bit (which is presumably what you are running on; otherwise you'd be limited to roughly 3GB of usable address space)
Your program must be built as a 64-bit application so it can actually address that much memory
For a Windows MSVC solution/project, follow the steps in this article to set it up for 64 bits: https://msdn.microsoft.com/en-us/library/h2k70f3s.aspx
For SunOS follow this article: http://www.well.com/~jax/rcfb/solaris_tips/build_gcc_3.0_64bit.html
And also for any other platform/compiler you should be able to pull the documentation via Google, of course
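As a quick sanity check: a 64-bit build has 8-byte pointers, and the allocation can be guarded so that a failure is reported cleanly. A minimal sketch (assuming g++ on Linux; the file name big_vector.cpp is a placeholder; build with g++ -m64 big_vector.cpp):

#include <iostream>
#include <new>
#include <vector>

int main() {
    // 8 bytes on a 64-bit build, 4 bytes on a 32-bit build.
    std::cout << "pointer size: " << sizeof(void*) << " bytes\n";
    try {
        std::vector<double> a(8000000000ULL); // ~59.6 GB of doubles
        std::cout << "allocated " << a.size() << " doubles\n";
    } catch (const std::bad_alloc&) {
        std::cerr << "allocation failed: not enough addressable memory\n";
    }
    return 0;
}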

Intel CPU OpenCL in Mono killed by SIGXCPU (Ubuntu)

Some time ago I wrote a simple boids simulation using OpenCL (it was a school assignment), using C#, Cloo for OpenCL, and OpenTK for the OpenGL output. I tested it on Windows 7 with AMD's CPU implementation of OpenCL and on a friend's NVIDIA card.
Now I have tried it on Linux (Ubuntu 12.04). I installed the AMD APP SDK and the Intel SDK. It compiled fine, and the reference CPU implementation works, with graphical output. But when I try to run the OpenCL version, it runs for about one second (showing what seems like valid output in OpenGL) and then gets killed by SIGXCPU. I tried to google for a known issue but found nothing.
So I tried to catch and ignore that signal, but every time I try, the program hangs. When I set Mono to catch some different signal (e.g., SIGPIPE), it runs fine (apart from the SIGXCPU kill when using OpenCL).
In Mono, I tried Mono.UnixSignal as stated in the FAQ.
First I tried:
Mono.Unix.Native.Stdlib.SetSignalAction(Mono.Unix.Native.Signum.SIGXCPU, Mono.Unix.Native.SignalAction.Ignore);
then something which doesn't hang, but doesn't help either:
Mono.Unix.Native.Stdlib.SetSignalAction(Mono.Unix.Native.Signum.SIGXCPU, Mono.Unix.Native.SignalAction.Error);
and even
ignore(0);
Mono.Unix.Native.Stdlib.signal(Mono.Unix.Native.Signum.SIGXCPU, new Mono.Unix.Native.SignalHandler(ignore));
with
static void ignore(int signal) {
}
Even when I remove everything else from main, it still hangs at some point after "touching" that signal.
One more weird thing:
Mono.Unix.Native.Stdlib.SetSignalAction(Mono.Unix.Native.Signum.SIGXCPU, Mono.Unix.Native.SignalAction.Default);
kills the application with SIGXCPU somewhere after Application.EnableVisualStyles() when I set it right before that call, without even touching OpenCL this time.
Is there something I missed in Mono? Is it using this signal internally somewhere, so that it gets in the way of OpenCL?
Mono uses SIGXCPU internally for its own purposes, so if you ignore it (or it's raised for some other reason), things will break.
SIGXCPU means that you've exceeded a per-process CPU-time limit. Once you've exceeded it, you've exceeded it; the process is stuck. You need to use ulimit, or get help from an admin, to set a higher limit.
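For example, in a bash shell (raising the hard limit may require root or an admin):
ulimit -t
ulimit -t unlimited
The first command prints the current CPU-time soft limit in seconds; the second removes the limit for the current shell and the processes it starts.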

How to test the kernel for kernel panics?

I am testing the Linux kernel on an embedded device and would like to find situations/scenarios in which the Linux kernel would panic.
Can you suggest some test steps (manual or automated by code) to create kernel panics?
There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool that tries to exercise the filesystem code extensively; it's good for testing drivers, block I/O, and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed for crashing systems in particular, but it may go a long way toward helping you and your team keep everything working as planned. (Note that the LTP isn't proscriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it using a variety of scanning options: -sV --version-all will try to find the versions of all running services (this can be stressful), and -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing the OS from the responses. See the combined example after this list.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, they'll spot bugs you'd never even think to look for. :)
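Putting the nmap options mentioned above together (the target address is a placeholder, and OS detection generally requires root):
nmap -sV --version-all -O --osscan-guess 192.168.1.100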
You can try the following key combination:
Alt + SysRq + c
or:
echo c > /proc/sysrq-trigger
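Note that the magic SysRq key may be disabled by default; on most distributions you can enable all SysRq functions first with (as root):
echo 1 > /proc/sys/kernel/sysrq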
Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way, one that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine whether sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and you don't have mprotect enabled in the crashme.c source for your platform, then Crashme is impotent.
It seems that operating systems on x64 architectures tend to have execution turned off for data segments. Recently I updated the crashme.c on http://crashme.codeplex.com/ to use mprotect in the case of __APPLE__ and tested it on a MacBook Pro running Mac OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated CentOS and FreeBSD support soon.
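To illustrate the mprotect/NX point, here is a minimal sketch (assuming Linux/x86-64; it is not taken from crashme.c itself). Heap memory is normally mapped non-executable, so jumping into it faults unless the protection is changed first:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    long page = sysconf(_SC_PAGESIZE);
    // Allocate one whole page, aligned, so mprotect can operate on it.
    void *buf = aligned_alloc(page, page);
    if (!buf) return 1;
    unsigned char code[] = { 0xC3 };   // x86-64 "ret" instruction
    memcpy(buf, code, sizeof code);
    // Without this call, executing the buffer faults on systems that
    // map heap pages non-executable (NX).
    if (mprotect(buf, page, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        perror("mprotect");
        return 1;
    }
    ((void (*)())buf)();               // jump into the generated code
    puts("returned from generated code");
    free(buf);
    return 0;
}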
