nvidia-smi process hangs and can't be killed with SIGKILL either - linux

I'm on Ubuntu 14.04, CUDA toolkit 8, driver version 367.48.
When I run the nvidia-smi command, it hangs indefinitely.
When I log in again and try to kill that nvidia-smi process, with kill -9 <PID> for example, it isn't killed.
If I run nvidia-smi again (from another shell, because the first one is stuck as before), I find both processes running.
Can it be an issue related to the driver?
It's not the latest, but it is still quite new.

I solved this problem by running the following at every boot:
sudo nvidia-smi -pm 1
This command enables persistence mode. This issue has been affecting NVIDIA drivers for over two years, but they don't seem interested in fixing it. It seems to be related to power management: some time after the OS has booted, if the nvidia-persistenced service has the no-persistence-mode option enabled, the GPU goes into a power-saving state, and the nvidia-smi command hangs waiting for something to give it control of the device again.
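
Since the command has to be repeated at every boot, it can be wrapped in a small helper. This is only a minimal sketch: the function name enable_persistence and the idea of calling it from a root-run boot script are assumptions, and only nvidia-smi -pm 1 comes from the answer above.

```shell
#!/bin/sh
# Hypothetical helper: enable persistence mode if nvidia-smi is installed.
# Meant to be run as root (for example from a boot script), so no sudo here.
enable_persistence() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on PATH; skipping" >&2
        return 1
    fi
    # -pm 1 enables persistence mode, as in the answer above.
    nvidia-smi -pm 1
}
```

Calling enable_persistence from the boot script then does nothing harmful on a machine where the driver tools are missing.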

Given your peculiar situation, I would try to reinstall it, as bio proposed.
Have you tried sudo kill -9 <PID>? You probably have, but I'm still putting it out there. Or perhaps try sudo kill -15 <PID> to terminate it. From what you told us, it seems as if your driver is stuck on a signal 1 (hangup).
It seems odd that nvidia-smi would hang spontaneously when run, but the underlying issue may be that it was not installed correctly or is not being run with superuser access.
Have you tried to use:
service nvidia-smi status
pgrep nvidia-smi
ps -aux | grep nvidia-smi
to get its current state?
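
A process that survives SIGKILL is usually blocked inside the kernel (state D, uninterruptible sleep), where signals stay pending until the kernel call returns. A small sketch for checking this on Linux by reading /proc; the function name proc_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the one-letter state of a process,
# taken from the "State:" line of /proc/<pid>/status.
# D = uninterruptible sleep (signals, including SIGKILL, stay pending),
# Z = zombie (already dead, waiting for its parent to reap it).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Example usage for the oldest nvidia-smi process:
# proc_state "$(pgrep -o nvidia-smi)"
```

If this prints D, no signal will help; the process only goes away once the driver call it is stuck in returns, or after a reboot.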
Anyway, hope this helps. I would try to uninstall and reinstall, or use sudo apt --fix-broken install to try to fix broken packages/drivers.
Cheers!

Related

How to solve nvidia-smi command stuck and not showing anything?

My server does not respond to nvidia-smi after I use Ctrl+C to kill the process running my GPU training code.
Before today, when I pressed Ctrl+C, the process first showed a keyboard interrupt and was then killed by Linux.
But today, when I press Ctrl+C, it responds with a keyboard interrupt but is not killed.
After I kill this process with kill -9 <pid>, I cannot use nvidia-smi anymore and I can't train my code.
How can I solve this?

LeakSanitizer not working under gdb in Ubuntu 18.04?

I've upgraded my Linux development VM from Ubuntu 16.04 to 18.04 recently, and noticed one thing that has changed. This is on x86-64. With 16.04, I've always had this workflow where I'd build the project I'm working on with gcc (5.4, the stock version in 16.04) and -fsanitize=address and -O0 -g, and then run the executable through gdb (7.11.1, also the version that came with Ubuntu). This worked fine, and at the end, LeakSanitizer would produce a leak report if it detected memory leaks.
In 18.04, this doesn't seem to work anymore; LeakSanitizer complains about running under ptrace:
==5820==LeakSanitizer has encountered a fatal error.
==5820==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==5820==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
Then the program crashes:
Thread 1 "spyglass" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
I'm not sure what is causing the new behavior. On 18.04 I'm building with the default gcc shipped (7.3.0), using -fsanitize=address -O0 -g and debugging with the default gdb (8.1.0). Can the old behavior be somehow re-enabled? Or do I need to change my workflow and detach from the program before killing it to get a leak report?
LeakSanitizer internally uses ptrace, probably to suspend all threads so that it can scan for leaks without false positives (see issue 9). A process can have only one tracer attached at a time, so if you run your application under gdb or strace, LeakSanitizer won't be able to attach via ptrace.
If you are not interested in leak debugging, disable it:
export ASAN_OPTIONS=detect_leaks=0
If you do want to enable leak debugging, you must detach the debugger before LeakSanitizer starts scanning. To be able to attach a debugger shortly afterwards, sleep a bit (for example, 10 seconds):
export ASAN_OPTIONS=sleep_before_dying=10
./program
Then in another shell, attach to the application again:
gdb -q -p $(pidof program)
For a description of the above (and other) options, see https://github.com/google/sanitizers/wiki/AddressSanitizerFlags.
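
Note that a second export ASAN_OPTIONS=... replaces the earlier value rather than adding to it. Several options can be combined in one value, separated by colons, in the same way as the LSAN_OPTIONS hint printed above. A minimal sketch; the helper name add_asan_option is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append one option to ASAN_OPTIONS without
# discarding options that are already set.
add_asan_option() {
    if [ -n "${ASAN_OPTIONS:-}" ]; then
        ASAN_OPTIONS="${ASAN_OPTIONS}:$1"
    else
        ASAN_OPTIONS="$1"
    fi
    export ASAN_OPTIONS
}

add_asan_option detect_leaks=1
add_asan_option sleep_before_dying=10
echo "$ASAN_OPTIONS"
```

With ASAN_OPTIONS initially unset, this prints detect_leaks=1:sleep_before_dying=10.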

ctrl+c not killing a process

I have a process that responds perfectly well to CTRL+C on my local machine. And it appears to also be working.
But on an EC2 instance it freezes and becomes a defunct or zombie process.
kill -9 <PID> doesn't remove it and I have to reboot the EC2 instance to clean it up properly.
When it runs it also loads an in house developed shared library that I have no influence over and have no access to any source code in it to see what it's doing. This library also uses CUDA and appears to start multiple threads.
I tried installing a signal handler on the main thread and it does get installed but calling _exit doesn't shut the whole process down, it seems to still be waiting.
What might be happening here that prevents CTRL+C from exiting the process cleanly? Can I override or examine what the other threads could be doing?
Ah, I found the problem. I'll leave the question as it stands in case it helps someone else.
It turns out that on my PC, I have a GTX 680 and the drivers get installed when installing CUDA. On EC2 the card is a GRID K520, and the driver installed by CUDA doesn't work. I downloaded and installed the latest stable card specific driver and it then worked.
The discovery was made after running nvidia-smi and it wouldn't print any details about the card but rather would just show Killed. Run nvidia-smi again and it would lock up the console.
Unfortunately, I hadn't tested that CUDA apps were working; I relied on the driver printing a message in the log saying it was loaded and assumed it was working.
Updating the driver consisted of downloading the latest driver from nvidia (use the .run version). Then:
sudo modprobe -r nvidia_uvm
sudo modprobe -r nvidia
Finally install it with a command like:
sudo ./NVIDIA-Linux-x86_64-3xx.xx.xx.run
I then rebooted the instance and verified it with nvidia-smi
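
Before running the installer, it can help to confirm that the modprobe -r calls really unloaded the modules. A minimal sketch that lists loaded modules whose names start with nvidia by reading /proc/modules; the function name and the optional file argument (there only to make it testable) are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the names of loaded kernel modules that
# start with "nvidia". Reads /proc/modules by default; another file in
# the same format can be passed as the first argument.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }' "${1:-/proc/modules}"
}

if [ -n "$(loaded_nvidia_modules)" ]; then
    echo "NVIDIA modules still loaded:"
    loaded_nvidia_modules
else
    echo "No NVIDIA modules loaded."
fi
```

If a module is still listed, something (often a running CUDA process or X server) is probably still holding the driver open.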
This link was insightful - CUDA 7.5 unstable on EC2

Can't get started with Android Studio for Ubuntu ADB error

I keep getting this error when I start Android Studio (AS) (I am not running Eclipse). I am running Ubuntu; I did a fresh install of Ubuntu and AS and this happened upon start up:
ADB not responding. If you'd like to retry, then please manually kill
"adb" and click 'Restart'
I have tried this solution: ADB Not Responding - Wait More or Kill adb or Restart (Ubuntu 13) 64-bit
and this: Adb not responding with android studio on Ubuntu as well as the duplicate link that follows.
I tried making an AVD and it doesn't want to run on there. I double checked that ADB is added to my PATH.
Is there more information I can provide? Any response with information or questions is helpful.
In a terminal, type ps -u (your username), find the pid, and type kill -15 (pid). If that doesn't work, use kill -9 (pid), which the process cannot catch or ignore.
When this happens to me I usually do ps -ef to find out adb pid, and then kill -s 15 <adb_pid>. After that everything works fine.
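
Both answers come down to sending adb either signal 15 (SIGTERM, a polite request that the process can handle) or signal 9 (SIGKILL, which it cannot catch). The numbers are just identifiers, not strengths. A minimal sketch of the usual order, SIGTERM first and SIGKILL only as a fallback; the function name terminate and the 5-second default are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: ask a process to exit with SIGTERM, wait up to
# $2 seconds (default 5), then fall back to SIGKILL.
terminate() {
    pid=$1
    wait_secs=${2:-5}
    # If the signal cannot be sent, the process is gone (or not ours).
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$wait_secs" ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # exited after SIGTERM
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example usage: terminate every running adb process.
# for pid in $(pgrep -x adb); do terminate "$pid"; done
```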

Is CUDA installed correctly on my Ubuntu 10.04? Some samples don't run.

I am trying to install CUDA on a server running Ubuntu 10.04.
I followed the NVIDIA instructions and installed the "CUDA toolkit for Ubuntu Linux 10.04", "GPU Computing SDK code samples", and "Developer Drivers for Linux (260.19.26) (64 bit)"; my system is 64-bit. The installation seemed successful. Everything was downloaded from http://developer.nvidia.com/object/cuda_3_2_downloads.html#Linux
Following the messages of the installation packages, I added /usr/local/cuda/bin to PATH and /usr/local/cuda/lib64:/usr/local/cuda/lib to LD_LIBRARY_PATH.
Then I tried to run the sample programs. The strange thing is that some of them run and some of them don't, even though they all build with no problem.
For example:
- convolutionSeparable just stops there without any message; I can kill it with Ctrl+C.
- matrixMul outputs the line
  Device 0: "Quadro 5000" with Compute 2.0 capability
  and stops there; again it can be killed with Ctrl+C.
- clock works and outputs
  PASSED
  time = 12574
  Press ENTER to exit...
- simpleMultiCopy outputs PASSED
- MonteCarlo outputs PASSED
- simpleZeroCopy outputs PASSED
- bandwidthTest stops there with a blinking cursor forever.
What is wrong here? How can I check whether my CUDA installation was successful? Why do some of those programs not run? They don't even print an error message.
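
One quick thing to rule out is the environment setup described in the question. A minimal sketch that checks whether the CUDA directories are really present in PATH and LD_LIBRARY_PATH in the current shell; the helper name path_contains is hypothetical, and the directories are the ones named in the question.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $2 is an entry of the
# colon-separated list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

path_contains "$PATH" /usr/local/cuda/bin ||
    echo "/usr/local/cuda/bin is not on PATH"
path_contains "${LD_LIBRARY_PATH:-}" /usr/local/cuda/lib64 ||
    echo "/usr/local/cuda/lib64 is not on LD_LIBRARY_PATH"
```

This only confirms the variables in the shell that runs the samples; it says nothing about whether the driver itself is healthy.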
I would start by upgrading the driver to 260.19.36, which can be found here. Then I would suggest running nvidia-smi -a to see if the driver is happy. Then I second the suggestion to run deviceQuery to see if the CUDA Toolkit 3.2 is working.
If deviceQuery output appears nominal, then I would start adding printf's to see where things go awry in matrixMul.
What does deviceQuery say? Also check the output of dmesg right after you run that program to see if you can figure out what's up.
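
The dmesg output can be long, so it may help to keep only the lines from the NVIDIA kernel driver, which prefixes its messages with NVRM. A minimal sketch; the function name nvidia_kernel_messages is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that mention the
# NVIDIA driver (matched case-insensitively on "NVRM" or "nvidia").
nvidia_kernel_messages() {
    grep -i -E 'nvrm|nvidia'
}

# Example usage, right after running the sample program:
# dmesg | tail -n 50 | nvidia_kernel_messages
```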
Another tip, if you are still having issues, is to try running:
strace ./deviceQuery 2> out.txt
Then check out.txt to see if you can find any clues as to why this error is occurring.
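
The strace log is usually large, so a reasonable first pass is to look only at the system calls that failed. strace prints those as "= -1" followed by the error name. A minimal sketch; the function name strace_failures is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list failed system calls from an strace log.
# strace reports a failed call as "... = -1 ERRNAME (description)".
strace_failures() {
    grep -E '= -1 E[A-Z]+' "$1"
}

# Example usage after the strace run above:
# strace_failures out.txt | grep -i nvidia
```

Many failed calls are harmless (for example library lookups in directories that don't exist), so failures involving the /dev/nvidia* device files are the ones worth reading first.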
I had a similar problem but solved it by updating the kernel and drivers.
Install a newer kernel on 10.04:
linux-image-generic-pae-lts-backport-natty
linux-headers-generic-pae-lts-backport-natty
Download the latest NVIDIA driver
from http://www.nvidia.com/Download/index.aspx?lang=en-us
Install the latest CUDA (at the moment 4.0) from
http://developer.nvidia.com/cuda-toolkit-40
CUDA Toolkit for Ubuntu Linux 10.10 32-bit
CUDA Tools SDK 32-bit
GPU Computing SDK code samples
After that, all the SDK example tests passed.
ThinkPad w520 Quadro 1000 on Ubuntu 10.04
