Fail to run tensorflow on GPU - linux

I cannot run the TF-CUDA tutorials_example_trainer as described in the installation guide (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#installing-from-sources)
I've had problems with the CUDA libs before, but that was with graphics-related demos.
All details are below.
Thank you in advance for any help.
Environment info
Operating System: Debian Stretch
Installed version of CUDA and cuDNN: 8.0, 5.0
If installed from source, provide
554ddd9ad2d4abad5a9a31f2d245f0b1012f0d10
Build label: 0.3.0
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jun 10 11:38:23 2016 (1465558703)
Steps to reproduce
Build from source with 367.35 driver
Run bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
Logs or other output that would be helpful
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
modprobe: ERROR: ../libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367_uvm'
modprobe: ERROR: could not insert 'nvidia_367_uvm': Unknown symbol in module, or unknown parameter (see dmesg)
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: debian
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: debian
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.35.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.35 Mon Jul 11 23:14:21 PDT 2016
GCC version: gcc version 5.4.0 20160609 (Debian 5.4.0-6)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.35.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.35.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])

The error messages indicate that your GPU driver is not set up correctly: modprobe cannot load the nvidia_367_uvm kernel module, so cuInit fails and TensorFlow finds no GPU devices. Run the following command to check whether the driver is installed correctly:
$ nvidia-smi
If it fails, follow the instructions on the official CUDA site and reinstall CUDA. Because your OS is not officially supported, you may want to switch to a supported one.
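The same check can be scripted, for example before launching a long build. This is a minimal sketch, assuming only that nvidia-smi exits with status 0 when it can talk to the driver; the function name driver_responds is made up for illustration.

```python
import shutil
import subprocess


def driver_responds(tool="nvidia-smi", timeout=30):
    """Return True if `tool` is on PATH and exits with status 0."""
    path = shutil.which(tool)
    if path is None:
        # The tool is not installed (or not on PATH) at all.
        return False
    try:
        result = subprocess.run(
            [path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("driver OK" if driver_responds() else "driver not responding")
```

A False result only says that the tool is missing or failed; dmesg and the modprobe errors in the log above are still the place to look for the cause.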

Related

Using CUDA 11.x but getting error: Unknown CUDA arch (8.6) or GPU not supported

I'm setting up a conda environment to use pytorch 1.4.0 (on Ubuntu 20.04.2), but getting the error message:
ValueError: Unknown CUDA arch (8.6) or GPU not supported
I know this has been asked before, but no answer fits my case. An existing answer suggests that the CUDA version is too old. However, I updated CUDA to the most recent version and still get the same error message.
nvcc -V says I have CUDA 11 installed, and when I run nvidia-smi I get this info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84 Driver Version: 460.84 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
which, according to the NVIDIA docs, should be compatible.
An auxiliary question: what does the "8.6" in CUDA arch (8.6) represent?
Specific versions of PyTorch work only with specific versions of CUDA.
If you are using CUDA 11.x, you need a fairly recent version of PyTorch, so either upgrade PyTorch or downgrade CUDA. (The "8.6" is the compute capability of your GPU, which PyTorch 1.4.0 predates and therefore does not recognize.)
It seems you can grab PyTorch v1.4 for CUDA 10.0 from here:
pip install torch==1.4.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html
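To see which CUDA version a wheel like this targets, the "+cuNNN" suffix can be decoded. This is a rough sketch, not part of pip or PyTorch: the helper name cuda_from_wheel_tag is hypothetical, and it assumes the suffix convention visible in the command above (cu100 meaning CUDA 10.0), with the last digit taken as the minor version.

```python
import re


def cuda_from_wheel_tag(spec):
    """Map a spec such as 'torch==1.4.0+cu100' to the CUDA version '10.0'.

    Returns None when the spec carries no '+cuNNN' local version tag.
    """
    match = re.search(r"\+cu(\d{2,})$", spec)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: the last digit is the minor version,
    # everything before it is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    print(cuda_from_wheel_tag("torch==1.4.0+cu100"))
```

Comparing that value with what nvcc -V reports is a quick way to spot a mismatch between the installed toolkit and the wheel.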

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

I have installed CUDA 8.0 and cuDNN 5.1 on CentOS. When I import tensorflow (Python 3.6), I get the error above.
I have already set the following environment variables in /etc/profile. Has anyone else run into this problem?
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
What also confuses me is that when I run nvcc -V, it shows
Cuda compilation tools, release 8.0, V8.0.61
However, when I run ./deviceQuery in folder /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery, on device 0: "Tesla M40", it shows
CUDA Driver Version / Runtime Version 9.1 / 8.0
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla M40
Check your TensorFlow version with "pip3 list | grep tensorflow". If it is tensorflow-gpu 1.5.0, the required versions are CUDA 9.0 and cuDNN v7.
Look into the following link for more details:
https://github.com/tensorflow/tensorflow/releases
The TensorFlow installation guide needs to be updated.
I had the same problem. TensorFlow 1.5.0 is prebuilt against CUDA 9.0 (released Sept. 2017 and no longer the newest version).
The newest CUDA version is CUDA 9.1 (Dec. 2017), and a tensorflow-gpu installed with sudo pip install tensorflow-gpu will not work with it. There are two solutions to the problem:
1.) Install CUDA 9.0 next to CUDA 9.1 (this worked for me)
2.) Build Tensorflow by yourself from the git source code
Either way, do not forget to add the CUDA directories to PATH and LD_LIBRARY_PATH (as in the exports shown in the question); otherwise the Python interpreter reports the error message stated in the question.
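Whether the library named in the error is reachable through LD_LIBRARY_PATH can be checked with a few lines of Python. This is only a sketch: the helper find_shared_library is made up for illustration, and it looks at LD_LIBRARY_PATH alone, whereas the dynamic loader also consults the ldconfig cache and the default system directories.

```python
import os


def find_shared_library(name, search_path=None):
    """Return the first directory in search_path that contains `name`, else None.

    search_path is a colon-separated string; it defaults to $LD_LIBRARY_PATH.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(":"):
        # Skip empty entries produced by leading, trailing, or doubled colons.
        if directory and os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


if __name__ == "__main__":
    where = find_shared_library("libcublas.so.9.0")
    print(where or "libcublas.so.9.0 not found on LD_LIBRARY_PATH")
```

With only /usr/local/cuda-8.0/lib64 on the path, as in the question, the lookup for libcublas.so.9.0 is expected to come back empty, which matches the ImportError.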

How to install CUDA 8.0 in the latest version of Tensorflow (1.0) in AWS p2.xlarge instance, AMI ami-edb11e8d and nvidia drivers up to date (375.39)

I have upgraded to TensorFlow 1.0 and installed CUDA 8.0 with cuDNN 5.1 and the current NVIDIA driver (375.39). My NVIDIA hardware is the Tesla K80 of an Amazon Web Services p2.xlarge instance. My OS is 64-bit Linux.
I get the following error message every time I call tf.Session():
[ec2-user@ip-172-31-7-96 CUDA]$ python
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
>>> sess = tf.Session()
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-7-96
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-7-96
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Invalid argument: expected %d.%d or %d.%d.%d form for driver version; got "1"
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.39 Tue Jan 31 20:47:00 PST 2017
GCC version: gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.39.0
I'm completely clueless about how to fix this.
I have tried different versions of Nvidia drivers and CUDA but still it does not work.
Any hints will be appreciated.
You need to install an NVIDIA driver and run the CUDA 8.0 installer.
# Requirements
# - NVIDIA Driver - NVIDIA-Linux-x86_64-375.39.run - http://www.nvidia.fr/Download/index.aspx
# - CUDA runfile (local) - cuda_8.0.61_375.26_linux.run - https://developer.nvidia.com/cuda-downloads
# - cudnn-8.0-linux-x64-v5.0-ga.tgz
sudo apt update -y && sudo apt upgrade -y
sudo apt install build-essential linux-image-extra-`uname -r` -y
chmod +x NVIDIA-Linux-x86_64-375.39.run
sudo ./NVIDIA-Linux-x86_64-375.39.run
chmod +x cuda_8.0.61_375.26_linux.run
./cuda_8.0.61_375.26_linux.run --extract=`pwd`/extracts
sudo ./extracts/cuda-linux64-rel-8.0.61-21551265.run
echo -e "export CUDA_HOME=/usr/local/cuda\nexport PATH=\$PATH:\$CUDA_HOME/bin\nexport LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$CUDA_HOME/lib64" >> ~/.bashrc
source ~/.bashrc
tar xf cudnn-8.0-linux-x64-v5.0-ga.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/
Uninstall drivers & cuda, then follow the official guide to reinstall.
Run deviceQuery to check that the device is installed properly.
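The summary line that deviceQuery prints can also be checked automatically. The sketch below assumes the line format quoted in the libcublas question above ("CUDA Driver Version = 9.1, CUDA Runtime Version = 8.0"); the function names are hypothetical.

```python
import re


def parse_device_query_summary(line):
    """Extract (driver, runtime) version tuples from a deviceQuery summary line."""
    driver = re.search(r"CUDA Driver Version = (\d+)\.(\d+)", line)
    runtime = re.search(r"CUDA Runtime Version = (\d+)\.(\d+)", line)
    if driver is None or runtime is None:
        raise ValueError("not a deviceQuery summary line")
    return (
        (int(driver.group(1)), int(driver.group(2))),
        (int(runtime.group(1)), int(runtime.group(2))),
    )


def driver_supports_runtime(line):
    """True when the driver's CUDA version is at least the runtime's version."""
    driver, runtime = parse_device_query_summary(line)
    # Tuples compare element by element, so (9, 1) >= (8, 0) holds.
    return driver >= runtime
```

A driver version lower than the runtime version is the situation in which CUDA reports that the driver is insufficient for the runtime.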
You can also try "NVIDIA Volta Deep Learning AMI" with p3 (v100 GPU) instance.
Sign up on https://www.nvidia.com/en-us/gpu-cloud/?ncid=van-gpu-cloud and get your "API Key" to use the AMI free of charge.
EC2/GPU config info: https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/
The AWS Deep Learning AMI has CUDA 8, 9 and 10 pre-installed and so you should not have to do this installation now.
Reference: https://docs.aws.amazon.com/dlami/latest/devguide/overview-cuda.html

VAAPI Compatibility issue with "Intel Corporation 3rd Gen Core processor Graphics Controller"

I am getting the error below while running vainfo:
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva error: /opt/intel/mediasdk/lib64/iHD_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit
This started after I installed Intel Media Server Studio 2017. Before that, vainfo worked fine with the following packages and drivers installed:
sudo apt-get install i965-va-driver libva-intel-vaapi-driver vainfo
Machine configuration: Ubuntu 14.04 LTS
Processor: Intel i7 (i7-3720QM)
Graphics: Intel Corporation 3rd Gen Core processor Graphics Controller
Is this just because the Intel SDK does not support 3rd-generation processors?
The problem is the version of Media SDK used: Media SDK 2017 supports only 4th- and 5th-generation processors.
For a 3rd-generation machine, Media SDK 2015 R1 is the compatible version. Using the version that matches the hardware solves this problem.

Cuda - compile local and run remote

I want to compile my program locally and then run it on a server, because I don't have a CUDA-capable graphics card.
My computer:
Kubuntu 12.04 x32
Nvidia display driver - none
Nvcc - v6.01
Gcc - 4.6.3
Server:
Ubuntu 13.10 x64
Graphics card - GF GTX 480
Nvidia display driver - 337.xx
Nvcc - v6.01
Gcc - 4.8.1
Compilation on local computer:
nvcc kernel.cu
Running on server:
./a.out
But I get the following error: "Cuda driver version is insufficient for cuda runtime version."
What's wrong? When I compile my code on the server, it works without problems.
The problem might be caused by the fact that you compile on a 32-bit (x32) system but run on a 64-bit (x64) one.
This problem is also described here: https://devtalk.nvidia.com/default/topic/555955/32-bit-executable-fails-with-insufficient-driver-version-on-64-bit-linux-os/
The solution provided there is to install the missing 32-bit gcc libraries, which in your case (Ubuntu) should be possible with:
sudo apt-get install lib32stdc++6
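To confirm that a.out really is a 32-bit executable, its ELF header can be inspected. This is a small sketch with made-up helper names; it relies only on the ELF identification bytes (the magic number followed by the class byte, where 1 means 32-bit and 2 means 64-bit).

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: 32, 2: 64}  # EI_CLASS byte -> word size in bits


def elf_bitness(header):
    """Return 32 or 64 for the leading bytes of an ELF file, or None otherwise."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4])


def file_bitness(path):
    """Read the first five bytes of `path` and report its ELF class."""
    with open(path, "rb") as handle:
        return elf_bitness(handle.read(5))


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, file_bitness(name))
```

Running it on the binary built on the local machine and on one built on the server should show 32 for the former and 64 for the latter, if the architecture mismatch is indeed the cause.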
