How to fix an nvidia-smi command that is stuck and not showing anything? - linux

My server does not respond to nvidia-smi after I used Ctrl+C to kill the process running my GPU-training code.
Until today, when I pressed Ctrl+C, the process would first show a KeyboardInterrupt and then be killed by Linux.
But today, when I press Ctrl+C, it responds with a KeyboardInterrupt but is not killed.
After I kill the process with kill -9 <pid>, I can no longer use nvidia-smi and I can't train my code.
How can I solve this, please?
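Before reaching for a reboot, a few hedged diagnostics can narrow this down (these assume standard procps/lsof tools; the exact output depends on your driver):
ps -eo pid,stat,cmd | grep -i nvidia   # a 'D' in STAT means uninterruptible sleep inside the driver
sudo lsof /dev/nvidia*                 # see which processes still hold the GPU device nodes
dmesg | tail -n 20                     # recent kernel messages, e.g. Xid errors from the NVIDIA driver
A process stuck in the D (uninterruptible sleep) state cannot be killed, not even with kill -9; in that case only reloading the driver modules or rebooting clears it.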

Related

LeakSanitizer not working under gdb in Ubuntu 18.04?

I've upgraded my Linux development VM from Ubuntu 16.04 to 18.04 recently, and noticed one thing that has changed. This is on x86-64. With 16.04, I've always had this workflow where I'd build the project I'm working on with gcc (5.4, the stock version in 16.04) and -fsanitize=address and -O0 -g, and then run the executable through gdb (7.11.1, also the version that came with Ubuntu). This worked fine, and at the end, LeakSanitizer would produce a leak report if it detected memory leaks.
In 18.04, this doesn't seem to work anymore; LeakSanitizer complains about running under ptrace:
==5820==LeakSanitizer has encountered a fatal error.
==5820==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==5820==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
Then the program crashes:
Thread 1 "spyglass" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
I'm not sure what is causing the new behavior. On 18.04 I'm building with the default gcc shipped (7.3.0), using -fsanitize=address -O0 -g and debugging with the default gdb (8.1.0). Can the old behavior be somehow re-enabled? Or do I need to change my workflow and detach from the program before killing it to get a leak report?
LeakSanitizer internally uses ptrace, probably to suspend all threads so that it can scan for leaks without false positives (see issue 9). Only one tracer can attach to a process at a time, so if you run your application under gdb or strace, LeakSanitizer won't be able to attach via ptrace.
If you are not interested in leak debugging, disable it:
export ASAN_OPTIONS=detect_leaks=0
If you do want to enable leak debugging, you must detach the debugger before LeakSanitizer starts scanning. To be able to attach a debugger shortly afterwards, sleep a bit (for example, 10 seconds):
export ASAN_OPTIONS=sleep_before_dying=10
./program
Then in another shell, attach to the application again:
gdb -q -p $(pidof program)
For a description of the above (and other) options, see https://github.com/google/sanitizers/wiki/AddressSanitizerFlags.
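Putting the pieces together, here is a minimal sketch of the detach-then-reattach workflow (the binary name program is a placeholder; the build flags match the question):
gcc -fsanitize=address -O0 -g -o program program.c   # build with ASan, as in the question
export ASAN_OPTIONS=sleep_before_dying=10            # keep the process alive briefly after an error report
./program                                            # run outside gdb so LSan can use ptrace itself
# in another shell, while the process is sleeping:
gdb -q -p "$(pidof program)"                         # attach, inspect, then detach before exiting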

nvidia-smi process hangs and can't be killed with SIGKILL either

I'm on Ubuntu 14.04, CUDA toolkit 8, driver version 367.48.
When I give the nvidia-smi command, it just hangs indefinitely.
When I log in again and try to kill that nvidia-smi process, with kill -9 <PID> for example, it just isn't killed.
If I give another nvidia-smi command, I find both processes running; of course I have to check from yet another shell, because the new command gets stuck just like the first.
Can it be an issue related to the driver?
It's not the latest, but it is still quite new.
I solved this problem by running the following command at every boot:
sudo nvidia-smi -pm 1
The above command enables persistence mode. This issue has affected NVIDIA drivers for over two years, but they don't seem interested in fixing it. It appears to be related to power management: some time after booting into the OS, if the nvidia-persistenced service is running with the no-persistence-mode option enabled, the GPU powers down to save energy, and the nvidia-smi command hangs waiting for something to give it control of the device again.
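To avoid retyping the command at every boot, one hedged option is the nvidia-persistenced systemd unit shipped with recent drivers (this assumes a systemd-based release; on 14.04's upstart, the crontab fallback below still applies):
sudo systemctl enable nvidia-persistenced   # preferred, if your driver installs this unit
# fallback: run nvidia-smi -pm 1 once at every boot via root's crontab
( sudo crontab -l 2>/dev/null; echo '@reboot /usr/bin/nvidia-smi -pm 1' ) | sudo crontab -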
Given your peculiar situation, I would try reinstalling the driver, as bio proposed.
Have you tried sudo kill -9 <PID>? You probably have, but it's worth putting out there. Or perhaps sudo kill -15 <PID> to terminate it gracefully. From what you've told us, it seems as if your driver is stuck in a SIGHUP (signal 1) hangup.
It seems odd that nvidia-smi would hang spontaneously when run, but the issue may lie in the driver not being installed correctly or the command not being run with superuser access.
Have you tried to use:
service nvidia-persistenced status
pgrep nvidia-smi
ps aux | grep nvidia-smi
to get its current state?
Anyway, hope this helps. I would try uninstalling and reinstalling the driver, or running sudo apt --fix-broken install to repair broken packages/drivers.
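As a hedged sketch of the reinstall route (the package name nvidia-367 is an assumption matching the driver series in the question; it varies by Ubuntu release and repository):
sudo apt --fix-broken install   # repair any half-installed packages first
sudo apt purge 'nvidia-*'       # remove the current driver packages
sudo apt install nvidia-367     # reinstall the driver series mentioned in the question
sudo reboot                     # reload the kernel modules cleanly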
Cheers!

ctrl+c not killing a process

I have a process that responds perfectly well to CTRL+C on my local machine, where it also appears to work correctly.
But on an EC2 instance it freezes and becomes a defunct (zombie) process.
kill -9 <PID> doesn't remove it, and I have to reboot the EC2 instance to clean it up properly.
When it runs, it also loads an in-house-developed shared library that I have no influence over and no source access to, so I can't see what it's doing. This library also uses CUDA and appears to start multiple threads.
I tried installing a signal handler on the main thread, and it does get installed, but calling _exit doesn't shut the whole process down; it seems to still be waiting.
What might be happening here that prevents CTRL+C from exiting the process cleanly? Can I override or examine what the other threads are doing?
Ah, I found the problem. I'll leave the question as it stands in case it helps someone else.
It turns out that on my PC, I have a GTX 680 and the drivers get installed when installing CUDA. On EC2 the card is a GRID K520, and the driver installed by CUDA doesn't work. I downloaded and installed the latest stable card specific driver and it then worked.
The discovery was made after running nvidia-smi: it wouldn't print any details about the card but would just show Killed. Running nvidia-smi again would lock up the console.
Unfortunately, I hadn't tested that CUDA apps were working; I relied on the driver printing a message in the log saying it was loaded and assumed it was working.
Updating the driver consisted of downloading the latest driver from NVIDIA (use the .run version), then unloading the existing kernel modules:
sudo modprobe -r nvidia_uvm
sudo modprobe -r nvidia
Finally install it with a command like:
sudo ./NVIDIA-Linux-x86_64-3xx.xx.xx.run
I then rebooted the instance and verified it with nvidia-smi
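For completeness, a few hedged sanity checks after the reboot (these paths are the usual ones for the proprietary driver, but may vary):
nvidia-smi                        # should list the GRID K520, not hang or print Killed
cat /proc/driver/nvidia/version   # confirm the loaded driver version matches the .run file
lsmod | grep nvidia               # confirm the nvidia and nvidia_uvm modules are loaded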
This link was insightful - CUDA 7.5 unstable on EC2

Screen tool on Galileo

I have compiled and installed the screen tool on a Galileo board running Yocto.
http://www.gnu.org/software/screen/
When I run the tool, everything is OK and I can create many sessions. However, when I close the terminal, all my sessions are closed (when I run "screen -ls" from another terminal, there are no sockets). This does not happen on any other Linux distribution.
Regards,
Yevgeniy
Are you running screen from inside an ssh connection? There was a bug in earlier releases of the devkit where, on disconnect, systemd killed all processes started by the ssh daemon, which isn't what you want. This has been fixed, so upgrading your image should be sufficient.
If you can't get the upgrade, the fix is to add "KillMode=process" to the [Service] section of /lib/systemd/system/ssh@.service.
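As a sketch (assuming the image is writable and the unit file has a [Service] section, per standard systemd layout):
sudo sed -i '/^\[Service\]/a KillMode=process' /lib/systemd/system/ssh@.service
sudo systemctl daemon-reload   # make systemd pick up the changed unit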

Can't get started with Android Studio for Ubuntu ADB error

I keep getting this error when I start Android Studio (AS); I am not running Eclipse. I am running Ubuntu; I did a fresh install of Ubuntu and AS, and this happened upon start-up:
ADB not responding. If you'd like to retry, then please manually kill
"adb" and click 'Restart'
I have tried this solution: ADB Not Responding - Wait More or Kill adb or Restart (Ubuntu 13) 64-bit
and this: Adb not responding with android studio on Ubuntu as well as the duplicate link that follows.
I tried making an AVD, and the app doesn't want to run on it. I double-checked that ADB is added to my PATH.
Is there more information I can provide? Any response with information or questions is helpful.
In a terminal, type ps -u <your username>, find the adb PID, and type kill -9 <pid>. If that doesn't work, try a different signal, such as kill -15 <pid>.
When this happens to me I usually do ps -ef to find the adb PID, and then kill -s 15 <adb_pid>. After that everything works fine.
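For convenience, the same steps as a hedged one-liner (this assumes pkill from procps and adb on the PATH):
pkill -15 adb        # send SIGTERM to any running adb processes
adb start-server     # restart the adb daemon, then click 'Restart' in Android Studio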
