GCP AI Platform Notebook driver too old? - pytorch

I am trying to run the following Hugging Face Transformers tutorial on GCP's AI Platform Notebook with 32 vCPUs, 208 GB RAM, and 2 NVIDIA Tesla T4s.
However, when I try to run the part
model = DistillBERTClass()
model.to(device)
I get the following AssertionError:
AssertionError: The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
However, when I run
!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 38C P0 22W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 39C P8 10W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
The version of the NVIDIA driver is compatible with the latest PyTorch version, which I am using.
Has anyone else run into this error, and is there a way around it?
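Note: the integer in the assertion is the driver's CUDA API level, so 10010 corresponds to CUDA 10.1. A minimal sketch (not from the original post) for comparing that against the CUDA version the installed PyTorch wheel was built with:
import torch

# Sketch: compare the CUDA version PyTorch was compiled against with what the
# driver supports. A wheel built for a newer CUDA than the driver provides
# raises the "driver is too old" assertion above.
print("PyTorch:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)
print("CUDA usable:", torch.cuda.is_available())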

You can either:
update your GPU driver by downloading and installing a new version from http://www.nvidia.com/Download/index.aspx, or
go to https://pytorch.org and install a PyTorch version that has been compiled with your version of the CUDA driver.
You can try a newer NVIDIA driver version; we support the latest CUDA 11 driver. Create an instance from a CUDA 11 image family and then install PyTorch on top of it:
gcloud beta notebooks instances create cuda11 \
--vm-image-project=deeplearning-platform-release \
--vm-image-family=common-cu110-notebooks-debian-9 \
--machine-type=n1-standard-1 \
--location=us-west1-a \
--format=json
Image family:
common-cu110-notebooks-debian-9
common-cu110-notebooks-debian-10
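Once the new instance is up, a quick sanity check (a minimal sketch, assuming a PyTorch build compiled against CUDA 11.x is installed) confirms that the driver and the PyTorch build now match:
import torch

# Sketch: verify the CUDA 11 driver and the installed PyTorch build agree.
print(torch.version.cuda)          # CUDA version the wheel was built with, e.g. '11.0'
print(torch.cuda.is_available())   # True once driver and build are compatible
print(torch.cuda.device_count())   # 2 for the dual-T4 setup in the question
print(torch.cuda.get_device_name(0))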

Related

CUDA 11.3 not being detected by PyTorch [Anaconda]

I am running Ubuntu 20.04 on GTX 1050TI. I have installed CUDA 11.3.
nvidia-smi output:
Wed Apr 6 18:27:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 44C P8 N/A / N/A | 11MiB / 4040MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3060 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 4270 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
nvcc --version output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
Anaconda PyTorch isn't detecting CUDA:
> import torch
> torch.cuda.is_available()
> False
Any ideas how to solve the issue?
The solution:
Conda in my case installed the CPU build. You can easily identify your build type by running torch.version.cuda, which should return a version string if you have the CUDA build; if you get None, then you are running the CPU build and it will not detect CUDA.
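As a concrete sketch of that check:
import torch

# Sketch of the check described above: a CUDA build reports a version string,
# a CPU-only build reports None and will never detect CUDA.
print(torch.version.cuda)          # e.g. '11.3' for a CUDA build, None for CPU-only
print(torch.cuda.is_available())   # always False with the CPU-only build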
To fix that, I installed torch using pip instead:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

I am trying to use GPU with Tensorflow. My Tensorflow version is 2.4.1 and I am using Cuda version 11.2. Here is the output of nvidia-smi.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce MX110 Off | 00000000:01:00.0 Off | N/A |
| N/A 52C P0 N/A / N/A | 254MiB / 2004MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1151 G /usr/lib/xorg/Xorg 37MiB |
| 0 N/A N/A 1654 G /usr/lib/xorg/Xorg 136MiB |
| 0 N/A N/A 1830 G /usr/bin/gnome-shell 68MiB |
| 0 N/A N/A 5443 G /usr/lib/firefox/firefox 0MiB |
| 0 N/A N/A 5659 G /usr/lib/firefox/firefox 0MiB |
+-----------------------------------------------------------------------------+
I am facing a strange issue. Previously, when I was trying to list all the physical devices using tf.config.list_physical_devices(), it was identifying one CPU and one GPU. After that I tried to do a simple matrix multiplication on the GPU. It failed with this error: failed to synchronize cuda stream CUDA_LAUNCH_ERROR (the error code was something like that, I forgot to note it). But after that, when I again tried the same thing from another terminal, it failed to recognise any GPU. This time, listing physical devices produces this:
>>> tf.config.list_physical_devices()
2021-04-11 18:56:47.504776: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-11 18:56:47.507646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534244: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
2021-04-11 18:56:47.534404: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.39.0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
My OS is Ubuntu 20.04, Python version 3.8.5 and TensorFlow, as mentioned before, 2.4.1 with CUDA version 11.2. I installed CUDA from these instructions. One additional piece of information: when I import tensorflow, it shows the following output:
import tensorflow as tf
2021-04-11 18:56:07.716683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
What am I missing? Why is it failing to recognise the GPU even though it recognised it previously?
tl;dr: Disable Secure Boot before installing the NVIDIA driver.
I had the exact same error, and I spent a ton of time trying to figure out if I had installed TensorFlow-related stuff incorrectly. After many hours of problem solving, I found that my NVIDIA driver was having some problems because I never disabled Secure Boot in my BIOS when setting up Ubuntu 20.04. Here's what I suggest (I opted for using Docker w/ TensorFlow, which avoids having to install all the CUDA-related stuff) - I hope it works for you!
Disable Secure Boot in your BIOS
Make a fresh install of Ubuntu 20.04
Install Docker according to nvidia-container-toolkit's page.
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
Install nvidia-container-toolkit from the same page.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
Test to make sure that's working with
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Finally, use Tensorflow with Docker w/ GPU support!
docker run --gpus all -u $(id -u):$(id -g) -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --ip=0.0.0.0
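Once inside the container's Jupyter notebook, a quick check (a sketch, not from the original answer) confirms the GPU is visible to TensorFlow:
import tensorflow as tf

# Sketch: confirm this TensorFlow build has CUDA support and can see the GPU.
print(tf.test.is_built_with_cuda())              # True for the -gpu images
print(tf.config.list_physical_devices("GPU"))    # should list at least one GPU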
I just made an account to say that Nate's answer worked for me.
I have the exact same setting as you and have been trying for two days.
What I did in the end was
Reboot - F10 to the settings - Security - BIOS Secure Boot (or something like that, I don't remember exactly) - Disabled
Then there were some extra steps with the confirmation, but it worked fine. I did not reinstall the whole Ubuntu; that was a bit too technically risky for me.
Then I tried the tf.config line and I got this:
2021-06-14 17:12:19.546509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-06-14 17:12:26.754680: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-14 17:12:26.909679: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3593460000 Hz
2021-06-14 17:12:26.910016: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a8352501c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-14 17:12:26.910040: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-14 17:12:26.972350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-14 17:12:27.074861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-14 17:12:27.075289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:0c:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.665GHz coreCount: 14 deviceMemorySize: 3.81GiB deviceMemoryBandwidth: 119.24GiB/s
There are more red lines about device properties towards the end, but I got
Default GPU Device: /device:GPU:0
Don't know why it works, but it works. Just change the Secure Boot setting.
I don't have enough reputation points to upvote Nate's answer; I will come back later. But they really offer a good solution.
Disabling Secure Boot solved the problem immediately. No need to reinstall anything.
> import tensorflow as tf
> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Python & Tensorflow & CUDA Environment Setup Problems

I had tensorflow 2.2 working with Python 3.7.4 on Windows 10 Enterprise 64-bit yesterday, including using the GPU. This morning, the same system no longer sees the GPU. I have uninstalled/reinstalled CUDA and the other requirements based on the tensorflow docs, but it just refuses to work.
PC specs: i7 CPU 3.70GHz, 64GB RAM, NVidia GeForce GTX 780 Ti video card installed (driver 26.21.14.4122).
https://www.tensorflow.org/install/gpu says tensorflow requires NVidia CUDA Toolkit 10.1 specifically (not 10.0, not 10.2).
Naturally, that version refuses to install on my PC. These components fail during install:
Visual Studio Integration
NSight Systems
NSight Compute
So, I installed 10.2 which installs properly, but things don't run (which is not a surprise, given the tensorflow docs).
What's installed:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.22 Driver Version: 441.22 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 780 Ti WDDM | 00000000:01:00.0 N/A | N/A |
| 27% 41C P8 N/A / N/A | 458MiB / 3072MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89
I know the nvcc output of 10.2.89 is not what I need, but it simply won't install 10.1, so I don't know what I can do. Is this a common problem? Is there a diagnostic I can run to ensure the card did not die? Should I downgrade my version of tensorflow? Should I abandon this environment altogether? If so, what is a stable environment to learn ML?
Below is how I got it working. Tensorflow 2.2.0, Windows 10, Python 3.7 (64-bit). Thanks again to Yahya for the gentle nudge towards this solution.
Uninstall every bit of NVIDIA software.
Install CUDA Toolkit 10.1. I did the Express Install of the package cuda_10.1.243_win10_network.exe. No other variant of CUDA 10.1 installed correctly for me.
Install cuDNN 7.6. Extract all files from cudnn-10.1-windows10-x64-v7.6.5.32 into the CUDA file structure (i.e. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1).
Add these directories to your path variables (assuming that you did not alter the path during installation):
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\libnvvp
Reboot to initialize the Path variables.
Uninstall all tensorflow variants via PIP.
Install tensorflow 2.2 via PIP.
Then you can run the Python code below to confirm that tensorflow is able to access your video card:
# Check if tensorflow detects the GPU
import tensorflow as tf

# Query tensorflow to see if it recognizes your GPU; the result is printed to the console.
physical_devices = tf.config.list_physical_devices()
GPU_devices = tf.config.list_physical_devices('GPU')
print("physical_devices:", physical_devices)
print("Num GPUs:", len(GPU_devices))

How to check if dlib is using GPU or not?

My machine has Geforce 940mx GDDR5 GPU.
I have installed all requirements to run GPU accelerated dlib (with GPU support):
CUDA 9.0 toolkit with all 3 patches updates from https://developer.nvidia.com/cuda-90-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal
cuDNN 7.1.4
Then I executed all of the commands below, after cloning the davisking/dlib repository from GitHub, to compile dlib with GPU support:
$ git clone https://github.com/davisking/dlib.git
$ cd dlib
$ mkdir build
$ cd build
$ cmake .. -DDLIB_USE_CUDA=1 -DUSE_AVX_INSTRUCTIONS=1
$ cmake --build .
$ cd ..
$ python setup.py install --yes USE_AVX_INSTRUCTIONS --yes DLIB_USE_CUDA
Now how can I check/confirm whether dlib (or other libraries that depend on dlib, like Adam Geitgey's face_recognition) is using the GPU from inside a Python shell/Anaconda (Jupyter Notebook)?
In addition to the previous answer using the command
dlib.DLIB_USE_CUDA
there are some alternative ways to make sure dlib is actually using your GPU.
The easiest way is to check whether dlib recognizes your GPU:
import dlib.cuda as cuda
print(cuda.get_num_devices())
If the number of devices is >= 1 then dlib can use your device.
Another useful trick is to run your dlib code and at the same time run
$ nvidia-smi
This should give you full GPU utilization information, where you can see the total utilization together with the memory usage of each process separately.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 0% 52C P2 36W / 151W | 763MiB / 8117MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1042 G /usr/lib/xorg/Xorg 18MiB |
| 0 1073 G /usr/bin/gnome-shell 51MiB |
| 0 1428 G /usr/lib/xorg/Xorg 167MiB |
| 0 1558 G /usr/bin/gnome-shell 102MiB |
| 0 2113 G ...-token=24AA922604256065B682BE6D9A74C3E1 33MiB |
| 0 3878 C python 385MiB |
+-----------------------------------------------------------------------------+
In some cases the Processes box might say something like "processes are not supported"; this does not mean your GPU cannot run code, it just does not support this kind of logging.
If dlib.DLIB_USE_CUDA is true then it's using cuda, if it's false then it isn't.
As an aside, these steps do nothing and are not needed to use python:
$ mkdir build
$ cd build
$ cmake .. -DDLIB_USE_CUDA=1 -DUSE_AVX_INSTRUCTIONS=1
$ cmake --build .
Just running setup.py is all you need to do.
The following snippets have been simplified to check whether dlib is using the GPU or not.
First, check whether dlib identifies your GPU:
import dlib.cuda as cuda
print(cuda.get_num_devices())
Secondly, check dlib.DLIB_USE_CUDA; if it's false, simply allow it to use GPU support by setting
dlib.DLIB_USE_CUDA = True
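Putting the checks from these answers together, a minimal sketch:
import dlib
import dlib.cuda as cuda

# Sketch combining the checks above: was this dlib build compiled with CUDA,
# and does it see any CUDA-capable devices?
print("DLIB_USE_CUDA:", dlib.DLIB_USE_CUDA)
print("CUDA devices:", cuda.get_num_devices())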

Theano Not Able To Find Gpu - Ubuntu 16.04

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: cuda unavailable)
I get this error when trying to run any sample Theano program.
I have tried all the suggested fixes provided in this thread.
nvcc --version output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
nvidia-smi output:
Sat Dec 10 00:46:14 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 0000:01:00.0 Off | N/A |
| 0% 37C P0 33W / 151W | 0MiB / 8112MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
gcc version:
(venv) rgalbo@blueberry:~$ gcc --version
gcc (Ubuntu 4.9.3-13ubuntu2) 4.9.3
I have been trying to get this to work for a while now and would like someone to point me in the right direction.
So I was finally able to get Theano to find the GPU. I went through the steps provided here to clean up any corrupt installation that may have occurred from my initial installation of CUDA.
After this I ran sudo apt-get install cuda, which installed the right driver packages for my NVIDIA graphics card. I then proceeded to install CUDA 8.0 from the deb, and this was able to overwrite the 7.5 version that was giving me issues.
This is the output I am now able to get from theano_test.py:
(venv) rgalbo@blueberry:~$ python theano_test.py
Using gpu device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5103)
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.185949 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
and here is my ~/.theanorc file:
(venv) rgalbo@blueberry:~$ cat ~/.theanorc
[global]
floatX = float32
device = gpu
[nvcc]
flags=-D_FORCE_INLINE
[cuda]
root = /usr/local/cuda-8.0
After each separate install I updated and rebooted the server just for good luck, which I found to be helpful.
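For reference, a GPU smoke test along the lines of the theano_test.py above (a sketch adapted from the classic example in the Theano documentation, not necessarily the exact script used here):
from theano import function, config, shared
import theano.tensor as T
import numpy
import time

# Sketch adapted from the Theano docs' GPU test: time an elementwise exp and
# report whether it ran on the GPU or fell back to the CPU.
vlen = 10 * 30 * 768  # 10 x #cores x #threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
# If any op in the compiled graph is a plain (CPU) Elemwise, the GPU was not used.
if numpy.any([isinstance(node.op, T.Elemwise) for node in f.maker.fgraph.toposort()]):
    print("Used the cpu")
else:
    print("Used the gpu")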
