I successfully ran PyTorch, but after a system reboot I get the following error when calling torch.cuda.is_available():
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1616554782469/work/c10/cuda/CUDAFunctions.cpp:109.)
Output of nvidia-smi:
nvidia-smi
Thu Jun 24 09:11:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Environment information:
python collect_env.py
Collecting environment information...
/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1616554782469/work/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28
Python version: 3.9 (64-bit runtime)
Python platform: Linux-4.19.0-17-cloud-amd64-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 455.23.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.0a0+e4e171a
[pip3] torchmetrics==0.3.2
[pip3] torchvision==0.9.1
[conda] _tflow_select 2.3.0 mkl
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py39he8ac12f_0
[conda] mkl_fft 1.3.0 py39h54f3939_0
[conda] mkl_random 1.0.2 py39h63df603_0
[conda] numpy 1.19.2 py39h89c1606_0
[conda] numpy-base 1.19.2 py39h2ae0177_0
[conda] pytorch 1.8.1 py3.9_cuda11.1_cudnn8.0.5_0 pytorch
[conda] tensorflow 2.4.1 mkl_py39h4683426_0
[conda] tensorflow-base 2.4.1 mkl_py39h43e0292_0
[conda] torchaudio 0.8.1 py39 pytorch
[conda] torchmetrics 0.3.2 pyhd8ed1ab_0 conda-forge
[conda] torchvision 0.9.1 py39_cu111 pytorch
I recently ran into this error when migrating my GPU containers from nvidia-docker to Podman. The root cause for me was that the /dev/nvidia-uvm* device files, which CUDA apparently needs to work, were missing. Check that you have them:
# ls -ld /dev/nvidia*
drwxr-x--- 2 root root 80 Oct 6 21:11 /dev/nvidia-caps
crw-rw-rw- 1 root root 195, 254 Oct 6 21:08 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237, 0 Oct 6 21:13 /dev/nvidia-uvm <-IMPORTANT
crw-rw-rw- 1 root root 237, 1 Oct 6 21:13 /dev/nvidia-uvm-tools <-IMPORTANT
crw-rw-rw- 1 root root 195, 0 Oct 6 21:08 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 6 21:08 /dev/nvidiactl
If you don't see them, sudo nvidia-modprobe -c 0 -u should load the kernel modules and create these device files. Alternatively, look for the /sbin/create-uvm-dev-node script that Ubuntu ships to fix the same issue.
If you use the GPU inside a container/VM, these device files also need to be present inside the container. Normally the NVIDIA runtime scripts take care of passing them through; if that doesn't happen, you can try passing them explicitly with --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools to docker/podman run.
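As a quick sanity check inside the container, a minimal sketch (assuming PyTorch is installed) can list the device files Python sees before touching CUDA:
import glob
import os
import torch
# List the NVIDIA device files visible to this process; /dev/nvidia-uvm and
# /dev/nvidia-uvm-tools must be present for CUDA to initialize.
for path in sorted(glob.glob("/dev/nvidia*")):
    print(path)
print("uvm present:", os.path.exists("/dev/nvidia-uvm"))
print("cuda available:", torch.cuda.is_available())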
I am using an Nvidia V100 with the following specs:
(pytorch) [s.1915438#cl1 aneurysm]$ srun nvidia-smi
Sun Jul 17 16:17:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:D8:00.0 Off | 0 |
| N/A 31C P0 25W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The Python, PyTorch and CUDA versions are as follows:
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.0+cu113'
When I run a Python file containing a machine learning model, I get the following error:
(pytorch) [s.1915438#cl1 aneurysm]$ srun python aneurysm.py
terminate called after throwing an instance of 'std::runtime_error'
what(): the provided PTX was compiled with an unsupported toolchain.
srun: error: ccs2114: task 0: Aborted
Is it some kind of compatibility issue? Should I fall back to CUDA 10.2, as the V100 is a very old GPU?
Anyone using an old GPU on an HPC cluster is probably out of luck. In my case I had NVIDIA driver 495, which is not very old; in fact, for CUDA 11.5 they recommend NVIDIA driver 470.
This is the official reply from NVIDIA for a similar problem; they also recommend updating the driver. And most of the time HPC centres won't update the driver on personal request.
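If updating the driver isn't an option, a common workaround is to install a PyTorch wheel built against a CUDA version the installed driver already supports. A minimal diagnostic sketch (assuming nvidia-smi is on the PATH) to compare the two:
import re
import subprocess
import torch
# CUDA runtime this PyTorch wheel was built against (e.g. "11.3")
print("PyTorch built with CUDA:", torch.version.cuda)
# Highest CUDA version the installed driver supports, parsed from nvidia-smi output
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", smi)
print("Driver supports CUDA up to:", match.group(1) if match else "unknown")
If the wheel targets a newer CUDA than the driver reports, installing a wheel built for an older CUDA release (for example the CUDA 10.2 builds mentioned in the question) may avoid the PTX error without touching the driver.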
1. Running the code below reproduces the error:
import torch
@torch.jit.script
def rotate_points_export(points):
    return points
def xxx(input):
    outputs = {}
    outputs['a'] = input
    outputs['b'] = input
    outputs['b'] = rotate_points_export(outputs['b'])
    return outputs
points = torch.rand((1, 2, 3, 4))
model = torch.jit.trace(xxx, points, strict=False)
torch.jit.save(model, 'xx.pt')
The error messages are shown below:
model = torch.jit.trace(xxx, points, strict=False)
File "/home/intsig/codes/depedencies/hhdetection/lib/python3.8/site-packages/torch/jit/_trace.py", line 778, in trace
traced = torch._C._create_function_from_trace(
RuntimeError: values[i]->type()->isSubtypeOf(value_type) INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/jit/ir/ir.cpp":1650, please report a bug to PyTorch.
The environment info is shown below:
Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-52-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.1.74
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 470.129.06
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.1
[pip3] torch==1.7.1
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.3.0 py38h27cfd23_1
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.1 py38ha9443f7_2
[conda] numpy 1.20.1 py38h93e21f0_0
[conda] numpy-base 1.20.1 py38h7d8b39e_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
I have CUDA 9.2 installed, as shown by nvcc:
(base) c:\>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:30_Central_Daylight_Time_2018
Cuda compilation tools, release 9.2, V9.2.88
I installed PyTorch on Windows 10 using:
conda install pytorch cuda92 -c pytorch
pip3 install torchvision
I ran the test script:
(base) c:\>python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.7041, 0.5685, 0.4036],
[0.3089, 0.5286, 0.3245],
[0.3504, 0.8638, 0.1118],
[0.6517, 0.9209, 0.6801],
[0.0315, 0.1923, 0.8720]])
>>> quit()
So far, so good. Then I ran:
(base) c:\>python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>>
Why did PyTorch say CUDA was not available?
The GPU is a compute capability 3.0 Quadro K3000M:
(base) C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Mon Oct 01 16:36:47 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.54 Driver Version: 385.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K3000M WDDM | 00000000:01:00.0 Off | N/A |
| N/A 35C P0 N/A / N/A | 29MiB / 2048MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Ever since the v0.3.1 release (https://github.com/pytorch/pytorch/releases/tag/v0.3.1), PyTorch binary releases have dropped support for old GPUs with CUDA compute capability 3.0. According to https://en.wikipedia.org/wiki/CUDA, the compute capability of the Quadro K3000M is 3.0.
Therefore, you might have to build PyTorch from source or try other packages. Please refer to this thread for more information: https://discuss.pytorch.org/t/pytorch-no-longer-supports-this-gpu-because-it-is-too-old/13803.
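On recent PyTorch builds you can also check which GPU architectures the installed binary was compiled for and compare them against your card; a minimal sketch, assuming a build new enough to expose torch.cuda.get_arch_list():
import torch
# GPU architectures the installed binary was compiled for, e.g. ['sm_37', 'sm_50', ...]
print("binary supports:", torch.cuda.get_arch_list())
# If CUDA initializes at all, report the card's compute capability for comparison
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("device compute capability: sm_%d%d" % (major, minor))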
PyTorch officially calls for CUDA 9.0, and I would suggest the same. In other cases there are sometimes build issues which lead to 'CUDA not detected'. So, when using PyTorch it's best to use CUDA 9.0 and cuDNN 7. Here is a link where you can easily install CUDA 9.0 and cuDNN 7:
https://yangcha.github.io/CUDA90/
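Once installed, a minimal sketch to confirm which CUDA and cuDNN versions your PyTorch build actually uses (nothing here is specific to CUDA 9.0):
import torch
# Versions the installed PyTorch binary was built against
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())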
I had a similar problem; check in the NVIDIA Control Panel that your card is selected by default.
Is there a way to check whether I installed the GPU version of TensorFlow?
!nvidia-smi
Mon Dec 18 23:58:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| N/A 53C P0 31W / N/A | 1093MiB / 8105MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1068 G /usr/lib/xorg/Xorg 599MiB |
| 0 2925 G compiz 290MiB |
| 0 3611 G ...-token=11A9F5872A56620B72D1D5DF707CF1FC 200MiB |
| 0 5786 G /usr/bin/nvidia-settings 0MiB |
+-----------------------------------------------------------------------------+
But when I try to list the local devices, only the CPU gets detected.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3303842605833347443
]
Do I have to set something else up to use the GPU with Keras or TensorFlow?
Use pip install tensorflow-gpu or conda install tensorflow-gpu to get the GPU version of TensorFlow. If you are using Keras, conda install -c anaconda keras-gpu will automatically install the tensorflow-gpu version. Before running any of these commands, make sure you have uninstalled the plain tensorflow package.
You may also need a shell script to configure your tensorflow-gpu installation.
You can run this if you want to check tensorflow-gpu:
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
The official documents: Using GPUs.
The simple way using TensorFlow is:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
With Keras:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
I had the same problem, but nothing on this page solved it. I decided to update my display adapter driver. Follow this path:
Control Panel > Device Manager > Display adapters > Right click > Update Driver
After that, you must restart your computer. But keep in mind that this may not be the only source of your problem.
I ran into this issue when using a proper tensorflow-gpu Docker container with tensorflow-gpu installed into a virtualenv inside the container. Most likely this combination hides the GPU capabilities, which are otherwise available when running Python directly in the container without the virtualenv.
I think that you need to install CUDA (it must show up in your nvidia-smi output).
Did you check compatibility between your CUDA/cuDNN versions and tensorflow-gpu?
This may help you:
https://punndeeplearningblog.com/development/tensorflow-cuda-cudnn-compatibility/
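As a quick way to see which CUDA and cuDNN versions your TensorFlow build expects, here is a minimal sketch (assuming TensorFlow 2.3 or newer, which exposes tf.sysconfig.get_build_info()):
import tensorflow as tf
# CUDA/cuDNN versions the installed TensorFlow binary was built against
info = tf.sysconfig.get_build_info()
print("built for CUDA:", info.get("cuda_version"))
print("built for cuDNN:", info.get("cudnn_version"))
# GPUs TensorFlow can actually see at runtime
print("visible GPUs:", tf.config.list_physical_devices("GPU"))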
So I installed TensorFlow, and the CPU version works fine, but I can't seem to get the GPU to work.
I installed CUDA by downloading the .deb from NVIDIA.
I copied the cuDNN files into the CUDA directory:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 20
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
I added the paths to ~/.profile:
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=$PATH:$CUDA_ROOT/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64
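For reference, a minimal sketch to check whether these paths are actually visible to a Python process (the library names are assumed from a default .deb install and may differ):
import ctypes
import os
# LD_LIBRARY_PATH as seen by this Python process (empty if ~/.profile wasn't sourced)
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<not set>"))
# Raises OSError if the CUDA runtime / cuDNN libraries can't be found by the loader
ctypes.CDLL("libcudart.so")
ctypes.CDLL("libcudnn.so")
print("CUDA runtime and cuDNN libraries resolved")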
Oh and Nvidia-smi shows:
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 0000:01:00.0 On | N/A |
| 10% 54C P0 42W / 200W | 591MiB / 8105MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1368 G /usr/bin/X 344MiB |
| 0 3078 G cinnamon 129MiB |
| 0 6549 G /usr/lib/virtualbox/VirtualBox 20MiB |
| 0 15491 G ...bleH2AndQuicRequests/Enabled/*NetworkTime 96MiB |
Yet when using TensorFlow I still get:
$ python
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Device mapping: no known devices.
I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
I hope you can tell me what else I can do.
Best regards and thanks in advance
I think what you've installed is just the plain (CPU-only) version of TensorFlow. You have to install the GPU version of TensorFlow. The easiest way to do this is using the Anaconda distribution of Python and TensorFlow.
kmario23 ❯ conda install -c anaconda tensorflow-gpu
Fetching package metadata .........
Solving package specifications: ..........
Package plan for installation in environment /home/kmario23/anaconda3:
The following packages will be downloaded:
package | build
---------------------------|-----------------
conda-env-2.6.0 | 0 502 B anaconda
cudatoolkit-7.5 | 0 217.2 MB anaconda
cudnn-5.1 | 0 77.2 MB anaconda
tensorflow-gpu-1.0.1 | py35_4 77.4 MB anaconda
conda-4.3.16 | py35_0 510 KB anaconda
------------------------------------------------------------
Total: 372.4 MB
The following NEW packages will be INSTALLED:
cudatoolkit: 7.5-0 anaconda
cudnn: 5.1-0 anaconda
tensorflow-gpu: 1.0.1-py35_4 anaconda
The following packages will be UPDATED:
conda: 4.2.13-py35_0 conda-forge --> 4.3.16-py35_0 anaconda
Proceed ([y]/n)?
Go ahead and install it right away.
Also, note that a simple package search lists a separate TensorFlow package for the GPU:
kmario23 ❯ anaconda search -t conda tensorflow
Using Anaconda API: https://api.anaconda.org
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
Name | Version | Package Types | Platforms
------------------------- | ------ | --------------- | ---------------
HCC/tensorflow | 1.0.0 | conda | linux-64
HCC/tensorflow-cpucompat | 1.0.0 | conda | linux-64
HCC/tensorflow-fma | 1.0.0 | conda | linux-64
SentientPrime/tensorflow | 0.6.0 | conda | osx-64
: TensorFlow helps the tensors flow
acellera/tensorflow-cuda | 0.12.1 | conda | linux-64
anaconda/tensorflow | 1.0.1 | conda | linux-64
anaconda/tensorflow-gpu | 1.0.1 | conda | linux-64
conda-forge/tensorflow | 1.0.0 | conda | linux-64, win-64, osx-64
: TensorFlow helps the tensors flow
I would recommend installing tensorflow-gpu from the anaconda channel (anaconda/tensorflow-gpu).