I am trying to run Caffe on Ubuntu Linux.
After installation, I ran Caffe in GPU mode and got this error:
I0910 13:28:13.606891 10629 caffe.cpp:296] Use GPU with device ID 0
modprobe: ERROR: could not insert 'nvidia_352': No such device
F0910 13:28:13.728612 10629 common.cpp:142] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
*** Check failure stack trace: ***
# 0x7ffd3b9a7daa (unknown)
# 0x7ffd3b9a7ce4 (unknown)
# 0x7ffd3b9a76e6 (unknown)
# 0x7ffd3b9aa687 (unknown)
# 0x7ffd3bf91cb5 caffe::Caffe::SetDevice()
# 0x40a5a7 time()
# 0x4080f8 main
# 0x7ffd3aeb9ec5 (unknown)
# 0x408618 (unknown)
# (nil) (unknown)
Aborted (core dumped)
My NVIDIA driver is 352.41.
I installed nvidia-352 and it is already the newest version:
$ sudo apt-get install nvidia-352
Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-352 is already the newest version.
The following packages were automatically installed and are no longer required:
account-plugin-windows-live libupstart1
Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.
My Ubuntu has NVIDIA driver 352, so why do I get an error like this?
I0910 13:28:13.606891 10629 caffe.cpp:296] Use GPU with device ID 0
modprobe: ERROR: could not insert 'nvidia_352': No such device
F0910 13:28:13.728612 10629 common.cpp:142] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
I checked whether I have a CUDA-capable device:
lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [Quadro K2000] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GK107 HDMI Audio Controller (rev a1)
I have a CUDA-capable device, so why do I get the error?
EDIT 1:
Yes, my test with ./deviceQuery also failed.
../NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
I checked in the /dev folder; I have nvidia0:
crwxrwxrwx 1 root root 195, 0 Sep 10 16:51 nvidia0
crw-rw-rw- 1 root root 195, 255 Sep 10 16:51 nvidiactl
My nvcc -V check gave me
li@li-HP-Z420-Workstation:/dev$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
Then my version check
li@li-HP-Z420-Workstation:/dev$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.41 Fri Aug 21 23:09:52 PDT 2015
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
What could be wrong?
Now the problem is solved.
I checked sudo dpkg --list | grep nvidia and found that my kernel module was 352.41, but the client was 304.12.
So I ran sudo apt-get remove --purge nvidia-*. It removed all the NVIDIA packages.
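A safer variant of that command, in case the shell expands the unquoted glob against files in the current directory:
$ sudo apt-get remove --purge 'nvidia-*'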
Then I installed 352.41:
$ sudo add-apt-repository ppa:xorg-edgers/ppa -y
$ sudo apt-get update
$ sudo apt-get install nvidia-352
After that
$ sudo dpkg --list | grep nvidia
rc nvidia-304 304.128-0ubuntu0~gpu14.04.2 amd64 NVIDIA legacy binary driver - version 304.128
rc nvidia-304-updates 304.125-0ubuntu0.0.2 amd64 NVIDIA legacy binary driver - version 304.125
ii nvidia-352 352.41-0ubuntu0~gpu14.04.1 amd64 NVIDIA binary driver - version 352.41
rc nvidia-opencl-icd-304 304.128-0ubuntu0~gpu14.04.2 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-304-updates 304.125-0ubuntu0.0.2 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-352 352.41-0ubuntu0~gpu14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 355.11-0ubuntu0~gpu14.04.1 amd64 Tool for configuring the NVIDIA graphics driver
Now the versions match.
Then ./deviceQuery and everything else worked as expected.
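For anyone verifying the same fix, a quick consistency check (both commands appear elsewhere in this thread, and they should report the same driver version):
$ cat /proc/driver/nvidia/version
$ nvidia-smi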
Thanks
I have this problem too, and re-installing the NVIDIA drivers didn't solve the issue.
Finally, I solved this problem by adding two kernel parameters via GRUB:
add pci=nocrs pci=realloc to GRUB_CMDLINE_LINUX_DEFAULT.
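A minimal sketch of that edit, assuming a standard Ubuntu GRUB setup (your existing GRUB_CMDLINE_LINUX_DEFAULT contents may differ from the example value):
$ sudoedit /etc/default/grub
# e.g.: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nocrs pci=realloc"
$ sudo update-grub
$ sudo reboot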
I think this is a collision between CUDA 7.5 and kernel 3.19.
Another option is to install the driver using the .run file.
That requires killing the X server first.
The X server is killed as follows:
Make sure you are logged out.
Hit Ctrl+Alt+F1 and log in using your credentials.
Kill your current X server session by typing sudo service lightdm stop or sudo stop lightdm.
Enter runlevel 3 (or 5) by typing sudo init 3 (or sudo init 5) and install your .run file.
You might be required to reboot when the installation finishes. If not, run sudo service lightdm start or sudo start lightdm to start your X server again.
Then run the .run file as sudo sh xxxxx.run.
You may get an error like The distribution-provided pre-install script failed! Are you sure you want to continue?. In that case, abort the installation and
disable the Nouveau kernel driver (blacklist it, then run sudo update-initramfs -u).
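A sketch of the usual blacklist step, assuming the conventional file location (the file name itself is just a convention):
$ sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
$ sudo update-initramfs -u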
Then reboot the system, stop the X server again, enter runlevel 3, and run sudo sh xxxx.run again.
This time you can ignore the pre-install script failure message and continue.
Then you will be able to install the NVIDIA driver from the .run file.
If you are displaying video from a non-NVIDIA device but have the driver installed, you have to install it with the --no-opengl-files flag for GNOME to work.
I suggest downloading a separate driver and installing it manually by logging in to the console:
1. Ctrl+Alt+F2/F3/F4/F5 to get to a console.
2. init 3 to kill the UI.
3. Log in again at the console if necessary.
4. wget http://us.download.nvidia.com/tesla/418.67/NVIDIA-Linux-x86_64-418.67.run
5. sh NVIDIA-Linux-x86_64-418.67.run --no-opengl-files
6. After the installation, reboot.
I also had this problem. The above answers didn't work for me. When I installed the latest driver (nvidia-364), it worked. Commands to run:
sudo add-apt-repository ppa:xorg-edgers/ppa
sudo apt-get update
sudo apt-get install nvidia-364
I think the problem occurs when different versions of gcc were used to compile the driver modules and the Linux kernel.
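A quick way to compare the two, assuming nothing beyond a stock install (cat /proc/version reports which gcc built the running kernel):
gcc --version
cat /proc/version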
I have Ubuntu 20.04. I installed nvidia-driver-460 and was using it. Recently my system updated, and then I got the following.
I tried to update
sudo apt upgrade
I got "0 packages upgraded, 0 newly installed, 0 to remove and 2 not upgraded."
I typed nvidia-smi, and then I got:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
I could not upgrade two packages, so I checked which ones they were with the following command:
sudo apt list --upgradable -a
I got the following nvidia-driver library conflict: "linux-modules-nvidia-460-generic-hwe-20.04-edge/focal-updates 5.8.0-49.55~20.04.1+1 amd64 [upgradable from: 5.8.0-48.54~20.04.1]".
I tried to upgrade it:
sudo apt upgrade linux-modules-nvidia-460-generic-hwe-20.04-edge
It showed there were library dependency issues, so it was not successful.
Solution:
Downgrade the NVIDIA driver to nvidia-driver-450 or a lower version you prefer: go to "Software & Updates", select the lower version, and press Apply Changes.
Reboot your system.
Upgrade the NVIDIA driver back to nvidia-driver-460 the same way.
Reboot your system again; it works perfectly.
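If you prefer the command line over the GUI, the equivalent is roughly this (package names taken from the question; check what is available with apt list first):
sudo apt install nvidia-driver-450
sudo reboot
# then, once the lower version works:
sudo apt install nvidia-driver-460
sudo reboot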
I am trying to follow the NVIDIA Driver Installation Quickstart Guide:
https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
The first instruction says:
The kernel headers and development packages for the currently running kernel can be installed with:
$ sudo apt-get install linux-headers-$(uname -r)
When I try this I get the error:
Unable to locate package linux-headers-4.9.140-tegra
Couldn't find any package by glob 'linux-headers-4.9.140-tegra'
Couldn't find any package by regex 'linux-headers-4.9.140-tegra'
I'm not sure how to proceed.
Your version of Ubuntu is running a tegra kernel. The headers for this kernel are not in the Ubuntu repositories (or any other repositories you may have enabled). You will probably need to obtain these some other way before proceeding with the driver installation.
However: NVIDIA Tegra is a small SoC (system on chip) processor, AFAIK, like a Jetson Nano or something. The instructions you linked are for NVIDIA Tesla GPUs, which are data center GPUs. Again, AFAIK. Check that you are following the right instructions. Also, in those instructions, look at 'Section 1.1 - Pre-installation requirements' and the pre-install checklist.
Here is a list of all the different kernel headers in the Ubuntu 20.04 repos (not the same release, I know); tegra is not there.
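To see why apt is looking for that exact package name: the shell expands $(uname -r) into the running kernel's release string before apt ever runs (the output below is taken from the question's error):
$ echo linux-headers-$(uname -r)
linux-headers-4.9.140-tegra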
Before you can install the appropriate kernel headers, update your package index with the update command:
sudo apt-get update
then run sudo apt-get install linux-headers-$(uname -r) again. If this doesn't work, try
sudo apt-get install linux-headers-generic
which should install the right version.
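To check what actually got installed afterwards (standard Ubuntu paths, nothing else assumed):
dpkg -l | grep linux-headers
ls /usr/src | grep $(uname -r)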
nvidia-smi is throwing this error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
I purged NVIDIA and installed it again, following the steps mentioned here.
My device specs are as follows:
Server with a Tesla M40
Running on Ubuntu 16.04
Kernel version Linux 4.4.0-116-generic x86_64
Driver: nvidia-384
Can someone please help in solving the error?
The issue might be due to a confirmed "bug" in the 4.4.0-116 kernel patch. I ran into the same issue with nvidia-390. If you still want to use a newer version of the NVIDIA driver, I followed the instructions here and managed to solve the problem. In general, use the following steps:
If you cannot log in to the desktop and fall into the login loop, press Ctrl+Alt+F1 to log in to command-line mode.
Check whether your gcc version is outdated with gcc --version; if so, update it.
If the gcc version is 5+, uninstall the nvidia driver first: sudo apt-get remove nvidia-390
Purge the 4.4.0-116 kernel: sudo apt-get purge linux-headers-4.4.0-116 linux-headers-4.4.0-116-generic linux-image-4.4.0-116-generic linux-image-extra-4.4.0-116-generic linux-signed-image-4.4.0-116-generic
Reinstall the kernel: sudo apt-get install linux-generic linux-signed-generic
Reinstall the nvidia-390: sudo apt-get install nvidia-390
Check whether the problem is solved with modinfo nvidia-390 -k 4.4.0-116-generic | grep vermagic; make sure retpoline shows up this time.
Reboot: sudo reboot
Hope this works for you and other people who run into the same issue. The post in the forum saved my weekend.
Note: this answer is from 2018 and works for Ubuntu 16.04, which is very much out-of-date. Don't try this on recent Ubuntu versions.
Try
Download the driver from here
sudo apt-get purge nvidia* - to remove your current installations
sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb - to install what you downloaded earlier
sudo apt-get update
sudo apt-get install cuda-drivers
After this, go on and reboot your computer.
When it's up again, the nvidia-smi command should run smoothly
To download the latest driver as of this answer:
sudo apt install libnvidia-compute-435
sudo apt install libnvidia-gl-435 nvidia-dkms-435 nvidia-kernel-source-435 nvidia-utils-435 xserver-xorg-video-nvidia-435 libnvidia-ifr1-435
sudo apt install nvidia-driver-435
sudo reboot
and then:
nvidia-smi
If you're running this on Google Colab, just go to Runtime > Change Runtime Type > select GPU. That worked for me.
I have installed VirtualBox (VirtualBox-5.1-5.1.22_115126_el6-1.x86_64.rpm) on Red Hat 6.5, but while starting VirtualBox I am getting a "Segmentation fault (core dumped)" error.
From /var/log/messages:
May 22 20:12:07 MSS-SWM abrtd: send-mail: error while loading shared libraries: libmysqlclient.so.16: cannot open shared object file: No such file or directory
May 22 20:12:07 MSS-SWM abrtd: Error running '/bin/mailx'
May 22 20:12:07 MSS-SWM abrtd: 'post-create' on '/var/spool/abrt/ccpp-2017-05-22-20:11:55-141657' exited with 1
May 22 20:12:07 MSS-SWM abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2017-05-22-20:11:55-141657'
May 22 20:21:24 MSS-SWM kernel: VirtualBox[143266]: segfault at 0 ip 00007fddf7914781 sp 00007fffa28a7b70 error 4 in libQt5CoreVBox.so.5[7fddf7850000+598000]
Could you please help with this?
The following is presented in the link I provided in my comments:
Add the VirtualBox repo:
# cd /etc/yum.repos.d/
# wget http://download.virtualbox.org/virtualbox/rpm/rhel/virtualbox.repo
Install dependencies:
# yum update
# yum install binutils qt gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel dkms
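With the repo and dependencies in place, installing the package itself might look like this; a sketch for the 5.1 series from the question (in 5.1+, /sbin/vboxconfig rebuilds the kernel modules):
# yum install VirtualBox-5.1
# /sbin/vboxconfig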
If you don't have access to the internet but do have the CentOS DVD, try
mounting the DVD on your system (to /media/cdrom/, for example), then
running yum to install the rpm packages from the DVD like so:
yum --disablerepo=* --enablerepo=c5-media install binutils qt gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel dkms
First of all, all of this is done as root. I've been trying to install the CUDA 7.5 drivers on a CentOS 7 SATA DOM. The issue I'm running into is the following:
Installing the NVIDIA display driver...
The driver installation is unable to locate the kernel source. Please make sure that the kernel source packages are installed and set up correctly.
If you know that the kernel source packages are installed and set up correctly, you may pass the location of the kernel source with the '--kernel-source-path' flag.
I have tried to point to the kernel source path (I may be pointing to the wrong path; I'm a new Linux user) with the following command:
$ ./cuda_7.5.18_linux.run --kernel-source-path=/usr/src/kernels/3.10.0-327.18.2.el7.x86_64
Same issue as before. I've read online that for other people this issue is due to a kernel version mismatch. That, however, is not the case here:
$ uname -r
3.10.0-327.18.2.el7.x86_64
$ rpm -q kernel-devel kernel-headers
kernel-devel-3.10.0-327.18.2.el7.x86_64
kernel-headers-3.10.0-327.18.2.el7.x86_64
$ ls /usr/src/kernels
3.10.0-327.18.2.el7.x86_64
$ ls /usr/src/kernels/3.10.0-327.18.2.el7.x86_64/
arch block crypto drivers firmware fs include init ipc Kconfig kernel lib Makefile mm Module.symvers net samples scripts security sound System.map tools usr virt vmlinux.id
I've also tried installing different versions of gcc, and still no dice.
Any help would be appreciated.
Thanks.
I figured it out. It turns out I needed to install DKMS from the EPEL repository. Here are the commands I used:
sudo yum install epel-release
sudo yum install --enablerepo=epel dkms
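With DKMS in place, rerunning the installer from the question should get past the kernel-source check (same command and path as in the question):
$ sudo ./cuda_7.5.18_linux.run --kernel-source-path=/usr/src/kernels/3.10.0-327.18.2.el7.x86_64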