I have a process that responds perfectly well to CTRL+C on my local machine, and it otherwise appears to work there too.
But on an EC2 instance it freezes and becomes a defunct or zombie process.
kill -9 <PID> doesn't remove it and I have to reboot the EC2 instance to clean it up properly.
When it runs it also loads an in house developed shared library that I have no influence over and have no access to any source code in it to see what it's doing. This library also uses CUDA and appears to start multiple threads.
I tried installing a signal handler on the main thread, and it does get installed, but calling _exit doesn't shut the whole process down; it seems to still be waiting on something.
What might be happening here that prevents CTRL+C from exiting the process cleanly? Can I override or examine what the other threads are doing?
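One way to examine what the other threads are doing is to read their state from /proc. The sketch below assumes a Linux system; thread_states is a hypothetical helper name, not part of any tool mentioned in the question.

```shell
#!/bin/sh
# Sketch: print "<tid> <state> <wchan>" for every thread of a process.
# Assumes Linux /proc; thread_states is a hypothetical helper name.
thread_states() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        [ -d "$task" ] || continue
        tid=${task##*/}
        # The "State:" line of status looks like "State:  S (sleeping)".
        state=$(awk '/^State:/ {print $2}' "$task/status" 2>/dev/null)
        # wchan names the kernel function the thread sleeps in; it may be
        # unreadable or just "0" without sufficient privileges.
        wchan=$(cat "$task/wchan" 2>/dev/null)
        printf '%s %s %s\n' "$tid" "${state:-?}" "${wchan:-?}"
    done
}

# Usage: thread_states <PID>
```

A thread shown in state D is in uninterruptible sleep inside the kernel, and no signal, including SIGKILL, is acted on until it returns. That is one common reason kill -9 appears to do nothing, and a thread stuck that way in a driver call would be consistent with a process that lingers as defunct.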
Ah, I found the problem. I'll leave the question as it stands in case it helps someone else.
It turns out that on my PC, I have a GTX 680 and the drivers get installed when installing CUDA. On EC2 the card is a GRID K520, and the driver installed by CUDA doesn't work. I downloaded and installed the latest stable card specific driver and it then worked.
The discovery came after running nvidia-smi: instead of printing any details about the card, it just showed Killed. Running nvidia-smi a second time locked up the console.
Unfortunately, I hadn't tested that CUDA apps were working; I relied on the driver printing a message in the log saying it was loaded and assumed it was working.
Updating the driver consisted of downloading the latest driver from NVIDIA (use the .run version), then unloading the existing kernel modules:
sudo modprobe -r nvidia_uvm
sudo modprobe -r nvidia
Finally install it with a command like:
sudo ./NVIDIA-Linux-x86_64-3xx.xx.xx.run
I then rebooted the instance and verified the driver with nvidia-smi.
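Since a broken driver can make nvidia-smi lock up the console, it may be safer to run the check in the background and stop waiting after a few seconds. This is only a sketch: probe is a hypothetical helper, and the exit code 124 is borrowed from the convention of timeout(1).

```shell
#!/bin/sh
# Sketch: run a command in the background and stop waiting after N seconds.
# Returns the command's own exit status, or 124 if it is still running.
# probe is a hypothetical helper name.
probe() {
    limit=$1
    shift
    "$@" &
    pid=$!
    waited=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "still running after ${limit}s (pid $pid)" >&2
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
}

# Usage: probe 10 nvidia-smi
```

This does not kill a stuck process (if it is blocked inside the driver, nothing will), but it keeps the calling shell usable and reports the PID to inspect.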
This link was insightful - CUDA 7.5 unstable on EC2
Whenever I see the update manager glowing that I have an update I get annoyed and click it, so I'm almost always updating something and usually this has gone fine without any problems...
Recently it told me there was a new kernel update, so I clicked install like I usually do but it just got stuck, for hours. When I examined the terminal output it was hanging on a DKMS installation step, so I grabbed all the active DKMS processes and found that the specific thing it was hanging on was installing something called EVDI (which is related to the DisplayLink Ubuntu driver, I think). After letting it sit there doing nothing for more than a day I killed it and had to Timeshift back to before I had done the installation as it corrupted my kernel.
I examined the log file in /var/lib/dkms/evdi/5.2.14/build/make.log and found that it has many errors reported, and the one that starts the chain is:
make -f ./scripts/Makefile.build obj=scripts
make[1]: *** [arch/x86/Makefile:211: archscripts] Error 2
I can provide the full log file if you want, it's just long.
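For a long make.log, a small filter can pull out the first few error lines so the start of the chain is easy to spot. This is a sketch under the assumption that the relevant lines contain either make's "***" marker or a compiler-style "error:"; first_make_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: show the first few error lines of a build log, with line numbers.
# Matches make's "***" marker and compiler-style "error:" messages only.
# first_make_errors is a hypothetical helper name.
first_make_errors() {
    log=$1
    count=${2:-5}
    grep -n -E '\*\*\*|error:' "$log" | head -n "$count"
}

# Usage: first_make_errors /var/lib/dkms/evdi/5.2.14/build/make.log
```

The line numbers make it straightforward to open the log at the first failure and read the commands that ran just before it.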
I've tried to google around this and haven't been able to find anyone with this specific issue, so any help is much appreciated! I have also tried installing the DisplayLink driver from source (since it includes an install of EVDI) but it also hangs in the same place (for hours) -- it gets stuck at [[ Installing EVDI DKMS module ]].
I've thought about straight up removing all references to EVDI and hoping that it would then rebuild it, but I am not sure if this would cause further problems. In a different answer I saw that I could remove all DKMS instances of a package from all kernels by doing something like sudo dkms remove package --all but this is entirely new territory for me and I have decided I should wait for someone smarter than me to tell me whether that's a good idea or not before I end up irreparably breaking my installation.
I'm running Linux Mint 20.1 Cinnamon (Cinnamon v 4.8.6), Linux kernel 5.8.0-44-generic, on a Dell XPS 13 with an i7-1065G7 CPU (no GPU). Everything does work fine right now, I just would like to not be stuck on this version of the Linux kernel forever! Any help is very much appreciated :)
Ultimately fixed by booting into an old 5.4 kernel, purging DKMS + all of the 5.8 kernels and a troublesome 5.4 kernel (had to do some things by hand as apt would not remove some directories), then reinstalling everything and updating grub from the 5.4 kernel. Just tested an update via the update manager (now running on the latest 5.8 kernel) and it worked fine! Unclear what exactly was causing the problem but glad it's fixed and hope this helps others if they stumble into something like this.
I'm on Ubuntu 14.04, CUDA toolkit 8, driver version 367.48.
When I run the nvidia-smi command, it just hangs indefinitely.
When I log in again and try to kill that nvidia-smi process, with kill -9 <PID> for example, it isn't killed.
If I run nvidia-smi again, it gets stuck as before, and from another shell I can see both processes still running.
Could this be a driver issue?
The driver is not the latest, but it is still quite new.
I solved this problem by running the following at every boot:
sudo nvidia-smi -pm 1
The above command enables persistence mode. This issue has been affecting NVIDIA drivers for over two years, but they don't seem interested in fixing it. It seems to be related to power management: some time after boot, if the nvidia-persistenced service has the no-persistence-mode option enabled, the GPU powers down to save energy, and the nvidia-smi command then hangs waiting for something to give it control of the device again.
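To run this at every boot, the command can be wrapped in a small script and called from a boot hook. The sketch below assumes it runs as root, for example from /etc/rc.local on Ubuntu 14.04; ensure_persistence is a hypothetical helper name, and only the nvidia-smi -pm 1 call comes from the answer above.

```shell
#!/bin/sh
# Sketch: enable persistence mode if nvidia-smi is available.
# Intended to run as root from a boot hook such as /etc/rc.local (assumption).
# ensure_persistence is a hypothetical helper name.
ensure_persistence() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; is the driver installed?" >&2
        return 127
    fi
    # -pm 1 turns persistence mode on for all GPUs.
    nvidia-smi -pm 1
}

# Usage (as root): ensure_persistence
```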
Given your peculiar situation, I would try to reinstall it, as bio proposed.
Have you tried doing sudo kill -9 <PID>? You probably have but still putting it out there. Or, perhaps doing sudo kill -15 <PID> to terminate it. This seems as if your driver is stuck in a signal 1 hangup given what you told us.
It seems odd that nvidia-smi would hang spontaneously when run, but the issue may lie in it not being installed correctly or not being run with superuser access.
Have you tried to use:
service nvidia-smi status
pgrep nvidia-smi
ps -aux | grep nvidia-smi
to get its current state?
Anyway, hope this helps. I would try to uninstall and reinstall, or use sudo apt --fix-broken install to try to fix broken packages/drivers.
Cheers!
I have compiled and installed the screen tool on a Galileo running Yocto.
http://www.gnu.org/software/screen/
When I run the tool everything is OK and I can create many sessions. However, when I close the terminal all my sessions are closed (when I run "screen -ls" from another terminal there are no sockets). This does not happen on any other Linux distribution.
Regards,
Yevgeniy
Are you running screen from inside an ssh connection? There was a bug in earlier releases of the devkit where, on disconnect, systemd killed all processes started by the daemon, which isn't what you want. This has been fixed, so upgrading your image should be sufficient.
If you can't get the upgrade, the fix is to add "KillMode=process" to the end of /lib/systemd/system/ssh#.service.
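The edit described above can be scripted so that it is applied only once and keeps a backup. This is a sketch: add_killmode is a hypothetical helper, it assumes the [Service] section is the last section of the unit file, and it leaves any existing KillMode= line untouched.

```shell
#!/bin/sh
# Sketch: append KillMode=process to a systemd unit file, once, with a backup.
# Assumes [Service] is the last section of the file, so that appending at the
# end places the setting there. add_killmode is a hypothetical helper name.
add_killmode() {
    unit=$1
    [ -f "$unit" ] || { echo "no such unit file: $unit" >&2; return 1; }
    if grep -q '^KillMode=' "$unit"; then
        echo "KillMode is already set in $unit; leaving it alone" >&2
        return 0
    fi
    cp -p "$unit" "$unit.bak" || return 1
    # Make sure the file ends with a newline before appending.
    [ -z "$(tail -c 1 "$unit")" ] || printf '\n' >> "$unit"
    printf 'KillMode=process\n' >> "$unit"
}

# Usage (as root): add_killmode <path to the ssh unit file named above>
# Then reload systemd so it rereads the unit: systemctl daemon-reload
```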
I am trying to follow this tutorial for building and running an MPI application on an ARM based Ubuntu 11.10 system.
After installing the Open MPI environment on my PC, the sample program runs well. However, when I try the same on the ARM machine, the terminal hangs and I need to kill the MPI process from a second terminal in order to release it.
The MPI packages I installed using apt-get, on both machines, were mpi-default-dev and mpi-default-bin, so I assume that the packages are as updated as they can be.
The first sample program in the tutorial makes every process print a "hello" message with some info. On the PC I get messages from all 8 processes (although running on a single core) and then the program ends. On the ARM, I get no output at all. The program is stuck immediately after launch.
Any idea what's wrong? I am not even sure where to start debugging this.
Update: I tried removing the Open MPI package and installing the alternative MPICH2 package, but the result is just the same.
Ubuntu 11.10 did not ship with a functional Open MPI implementation for ARM (although it may have shipped with a nonfunctional one). Ubuntu 12.04 did.
I would recommend building your own Open MPI from source - available at http://www.open-mpi.org/software/ompi/v1.6/, unless you can update to a more recent version of Ubuntu.
Alternatively, you could rebuild the 11.10 package using the fixes pointed out in https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/949044.
I am trying to install CUDA on a server running Ubuntu 10.04.
I followed the NVIDIA instructions and installed the "CUDA toolkit for Ubuntu Linux 10.04", "GPU Computing SDK code samples", and "Developer Drivers for Linux (260.19.26) (64 bit)"; my system is 64 bit. The installation seemed successful. Everything was downloaded from http://developer.nvidia.com/object/cuda_3_2_downloads.html#Linux
According to the messages of the installation packages, I added /usr/local/cuda/bin to PATH, /usr/local/cuda/lib64:/usr/local/cuda/lib to LD_LIBRARY_PATH
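For reference, those two settings could be written as below, for example in ~/.bashrc. This is a sketch: CUDA_HOME is just a convenience variable introduced here, and putting the CUDA directories first in each list is an assumption, since the installer messages are not quoted.

```shell
#!/bin/sh
# Sketch: environment for a CUDA toolkit installed under /usr/local/cuda.
# CUDA_HOME is a convenience variable introduced for this example.
CUDA_HOME=/usr/local/cuda

# Prepend the toolkit's bin directory (assumption: it should come first).
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"

# Add both library directories; the ${VAR:+...} form avoids a stray trailing
# colon when LD_LIBRARY_PATH was previously empty.
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$CUDA_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Running command -v nvcc afterwards is a quick way to confirm that the shell picks up the compiler from that directory.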
Then I tried to run the sample programs. The strange thing is that some of them run and some of them don't, even though they all build with no problem.
For example,
- convolutionSeparable just stops there without any message; I can kill it with Ctrl+C.
- matrixMul outputs the line
Device 0: "Quadro 5000" with Compute 2.0 capability
and stops there; again it can be killed with Ctrl+C.
- clock works and outputs
PASSED
time = 12574
Press ENTER to exit...
- simpleMultiCopy outputs PASSED
- MonteCarlo outputs PASSED
- simpleZeroCopy outputs PASSED
- bandwidthTest stops there with a blinking cursor forever.
What is wrong here? How can I check whether my CUDA installation was successful? Why do some of those programs not run? They don't even print an error message.
I would start by upgrading the driver to 260.19.36, which can be found here. Then I would suggest running nvidia-smi -a to see if the driver is happy. Then I second the suggestion to run deviceQuery to see if the CUDA Toolkit 3.2 is working.
If deviceQuery output appears nominal, then I would start adding printf's to see where things go awry in matrixMul.
What does deviceQuery say? Also check the output of dmesg right after you run that program to see if you can figure out what's up.
Another tip, if you still are having issues, is try running:
strace ./deviceQuery 2> out.txt
Then check out.txt to see if you can find any clues as to why this error is occurring.
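A trace file like out.txt can be long, so two small filters may help: one for system calls that returned an error, and one for the last few calls before the program stopped. This is a sketch; failed_syscalls and trace_tail are hypothetical helper names, and the pattern relies on strace printing failures as "= -1 ENAME".

```shell
#!/bin/sh
# Sketch: quick filters for an strace output file such as out.txt.
# failed_syscalls and trace_tail are hypothetical helper names.

# List calls that failed; strace prints them as "... = -1 ENOENT (...)".
failed_syscalls() {
    grep -n -E ' = -1 E[A-Z0-9]+' "$1"
}

# Show the last N lines (default 15); for a hang, the final line is often
# the call the program never returned from.
trace_tail() {
    tail -n "${2:-15}" "$1"
}

# Usage:
#   failed_syscalls out.txt
#   trace_tail out.txt 20
```

Failed opens of the /dev/nvidia* device files, or a final call that never completes, would point toward the driver rather than the sample program itself.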
I had a similar problem and solved it by updating the kernel and drivers.
- Install a newer kernel on 10.04:
linux-image-generic-pae-lts-backport-natty
linux-headers-generic-pae-lts-backport-natty
- Download the latest NVIDIA driver from http://www.nvidia.com/Download/index.aspx?lang=en-us
- Install the latest CUDA (4.0 at the moment) from http://developer.nvidia.com/cuda-toolkit-40:
CUDA Toolkit for Ubuntu Linux 10.10 32-bit
CUDA Tools SDK 32-bit
GPU Computing SDK code samples
After that, all the SDK example tests passed.
Tested on a ThinkPad W520 with a Quadro 1000 running Ubuntu 10.04.