Run tensorflow on gpu - linux

This is my GPU info:
[root@happy mytflayer]# lspci | grep -i vga
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350] (rev 87)
I just installed TensorFlow on CentOS 7. How can I run it on the GPU?

AMD cards generally aren't supported by TensorFlow's GPU builds, which target NVIDIA CUDA, or by machine-learning frameworks in general. Caffe has some OpenCL support, but everyone I talked to said it was really slow.
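A quick way to confirm what TensorFlow itself can see is to ask it for its GPU list. The sketch below assumes TensorFlow 2.x, where `tf.config.list_physical_devices` is available; the helper name `visible_gpus` is made up for illustration.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf  # assumption: TensorFlow 2.x is installed
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("TensorFlow sees no GPU; it will run on the CPU")
    else:
        print("TensorFlow GPUs:", gpus)
```

An empty list means TensorFlow will fall back to the CPU, which is the likely outcome with the card shown in the question.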

Related

Tensorflow GPU crashing kernel/shell

Whenever I try to run any training model, the shell or kernel restarts. [Spyder 5.3.2 kernel screenshot]
This is the kernel output:
Segmentation Models: using keras framework.
2022-08-31 12:48:32.675519: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-31 12:48:33.019498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3491 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
Epoch 1/10
2022-08-31 12:48:37.509234: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8500
Restarting kernel...
This started happening after the GPU setup for TensorFlow 2.9.1 (Python 3.9).
I have tried rebuilding it with Bazel to enable GPU support, following the tensorflow.org website.
I have installed supported CUDA and cuDNN files.
The GPU is detected here:
[GPU screenshot]

ttyACM0 does not exist in jetson racecar TX2

I have a Jetson racecar TX2, and these are its details:
NVIDIA Jetson TX2
L4T 32.2.1 [ JetPack 4.2.2 ]
Ubuntu 18.04.2 LTS
Kernel Version: 4.9.140-tegra
CUDA 10.0.326
CUDA Architecture: 6.2
OpenCV version: 3.4.0
OpenCV Cuda: YES
CUDNN: 7.5.0.56
TensorRT: 5.1.6.1
Vision Works: 1.6.0.500n
VPI: NOT_INSTALLED
Vulcan: 1.1.70
When I try to start teleport, it gives me this error message:
Device not found: IMU or VESC not found -> /dev/ttyACM1 /dev/ttyACM0
I checked, and /dev/ttyACM0 does not exist. I tried to install the ttyACM modules, but the installation fails with a message that the module and kernel versions are different.
The USB device list is shown in the figure: [USB list]
Can anyone help me, please?
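As a first diagnostic, it can help to confirm which serial device nodes actually exist. This is a minimal sketch using only the Python standard library; the two paths come from the error message above, and the helper name `missing_devices` is hypothetical.

```python
import glob
import os


def missing_devices(paths):
    """Return the subset of device paths that do not exist."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    # Paths taken from the "Device not found" error message.
    expected = ["/dev/ttyACM0", "/dev/ttyACM1"]
    print("Missing:", missing_devices(expected))
    # List whatever ACM serial devices the kernel did create, if any.
    print("Present:", sorted(glob.glob("/dev/ttyACM*")))
```

If both lists show that no ttyACM node exists at all, the problem is at the kernel or driver level rather than in the launch configuration.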

Running vulkaninfo returns error: vulkaninfo.h:477: failed with ERROR_INITIALIZATION_FAILED

I'm trying to get Vulkan to work, but I get the following error:
vulkaninfo
ERROR: [Loader Message] Code 0 : /usr/lib/i386-linux-gnu/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
ERROR: [Loader Message] Code 0 : /usr/lib/i386-linux-gnu/libvulkan_intel.so: wrong ELF class: ELFCLASS32
/build/vulkan-tools-KEbD_A/vulkan-tools-1.2.131.1+dfsg1/vulkaninfo/vulkaninfo.h:477: failed with ERROR_INITIALIZATION_FAILED
Output of the following command:
lspci -nnk | grep -iA2 vga
00:02.0 VGA compatible controller [0300]: Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02)
Subsystem: Dell Core Processor Integrated Graphics Controller [1028:0410]
Kernel driver in use: i915
I have added the following to my GRUB config and applied it,
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.si_support=1 radeon.si_support=0 amdgpu.cik_support=1 radeon.cik_support=0"
followed by a reboot. The result is the same error :(
What am I doing wrong? Can anyone help me?
Before I forget: I installed Vulkan and the Mesa Vulkan drivers, and I am running Ubuntu 20.04 LTS on a Dell Latitude E4310. Please help, I just want to play some Windows (DirectX 11) games with Wine.
This kind of cryptic error message can happen because vulkaninfo doesn't find any supported GPU.
It is likely that your GPU is too old to be supported by Vulkan, so you won't be able to use DXVK (DirectX to Vulkan). You may still be able to run games without Vulkan by forcing Wine to use WineD3D (DirectX to OpenGL) instead. See Xaero_Vincent's answer in this Reddit thread:
In Lutris you can easily disable DXVK as an option and on Steam you can
force OpenGL-based WineD3D:
PROTON_USE_WINED3D=1 %command%
You'll notice though that DirectX 10/11 games will generally run
slower under OpenGL and some games will likely have graphics
artifacts, since DXVK is more mature and further developed.
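The "wrong ELF class: ELFCLASS32" loader messages in the question indicate that a 64-bit vulkaninfo tried to load 32-bit driver libraries. To check which class a given library file is, you can read the fifth byte of its ELF header. This is a minimal sketch using only the standard library; the function name `elf_class` is an illustration, and the library paths to inspect are passed on the command line rather than assumed.

```python
import os
import sys

ELF_MAGIC = b"\x7fELF"
# EI_CLASS byte values defined by the ELF format.
ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}


def elf_class(path):
    """Return 'ELFCLASS32', 'ELFCLASS64', or None if the file is not a valid ELF file."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4])


if __name__ == "__main__":
    # Usage: python elf_class.py /path/to/libvulkan_intel.so ...
    for library in sys.argv[1:]:
        if os.path.isfile(library):
            print(library, elf_class(library))
```

A 64-bit process can only load libraries that report ELFCLASS64, so the 32-bit messages on their own are expected noise on a multiarch system; the failure itself comes from no usable driver being found.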

PyTorch says that CUDA is not available (on Ubuntu)

I'm trying to run PyTorch on a laptop that I have. It's an older model, but it does have an Nvidia graphics card. I realize it is probably not going to be sufficient for real machine learning, but I am trying to do it so I can learn the process of getting CUDA installed.
I have followed the steps on the installation guide for Ubuntu 18.04 (my specific distribution is Xubuntu).
My graphics card is a GeForce 845M, verified by lspci | grep nvidia:
01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce 845M] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
I also have gcc 7.5 installed, verified by gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
And I have the correct headers installed, verified by trying to install them with sudo apt-get install linux-headers-$(uname -r):
Reading package lists... Done
Building dependency tree
Reading state information... Done
linux-headers-4.15.0-106-generic is already the newest version (4.15.0-106.107).
I then followed the installation instructions using a local .deb for version 10.1.
Now, when I run nvidia-smi, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 845M On | 00000000:01:00.0 Off | N/A |
| N/A 40C P0 N/A / N/A | 88MiB / 2004MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 982 G /usr/lib/xorg/Xorg 87MiB |
+-----------------------------------------------------------------------------+
and when I run nvcc -V I get:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
I then performed the post-installation instructions from section 6.1, and so as a result, echo $PATH looks like this:
/home/isaek/anaconda3/envs/stylegan2_pytorch/bin:/home/isaek/anaconda3/bin:/home/isaek/anaconda3/condabin:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
echo $LD_LIBRARY_PATH looks like this:
/usr/local/cuda-10.1/lib64
and my /etc/udev/rules.d/40-vm-hotadd.rules file looks like this:
# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}=="Microsoft Corporation", ATTR{[dmi/id]product_name}=="Virtual Machine", GOTO="vm_hotadd_apply"
ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply"
GOTO="vm_hotadd_end"
LABEL="vm_hotadd_apply"
# Memory hotadd request
# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"
LABEL="vm_hotadd_end"
After all of this, I even compiled and ran the samples. ./deviceQuery returns:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 845M"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2004 MBytes (2101870592 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 863 MHz (0.86 GHz)
Memory Clock rate: 1001 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
and ./bandwidthTest returns:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce 845M
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 11.7
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 11.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 14.5
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
But after all of this, this Python snippet (in a conda environment with all dependencies installed):
import torch
torch.cuda.is_available()
returns False
Does anybody have any idea how to resolve this? I've tried adding /usr/local/cuda-10.1/bin to /etc/environment like this:
PATH=$PATH:/usr/local/cuda-10.1/bin
And restarting the terminal, but that didn't fix it. I really don't know what else to try.
EDIT - Results of collect_env for @kHarshit
Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect
Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce 845M
Nvidia driver version: 418.87.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.18.5
[pip] pytorch-ranger==0.1.1
[pip] stylegan2-pytorch==0.12.0
[pip] torch==1.5.0
[pip] torch-optimizer==0.0.1a12
[pip] torchvision==0.6.0
[pip] vector-quantize-pytorch==0.0.2
[conda] numpy 1.18.5 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] stylegan2-pytorch 0.12.0 pypi_0 pypi
[conda] torch 1.5.0 pypi_0 pypi
[conda] torch-optimizer 0.0.1a12 pypi_0 pypi
[conda] torchvision 0.6.0 pypi_0 pypi
[conda] vector-quantize-pytorch 0.0.2 pypi_0 pypi
PyTorch doesn't use the system's CUDA library. When you install PyTorch from the precompiled binaries, using either pip or conda, it ships with its own copy of the specified version of the CUDA library, which is installed locally. In fact, you don't even need to install CUDA on your system to use PyTorch with CUDA support.
There are two scenarios which could have caused your issue.
1. You installed the CPU-only version of PyTorch. In this case PyTorch wasn't compiled with CUDA support, so it can't use CUDA.
2. You installed the CUDA 10.2 version of PyTorch. In this case the problem is that your graphics card currently uses the 418.87 drivers, which only support up to CUDA 10.1. The two potential fixes in this case would be to either install updated drivers (version >= 440.33 according to Table 2) or to install a version of PyTorch compiled against CUDA 10.1.
To determine the appropriate command to use when installing PyTorch you can use the handy widget in the "Install PyTorch" section at pytorch.org. Just select the appropriate operating system, package manager, and CUDA version then run the recommended command.
In your case one solution was to use
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
which explicitly specifies to conda that you want to install the version of PyTorch compiled against CUDA 10.1.
For more information about PyTorch CUDA compatibility with respect to drivers and hardware, see this answer.
Edit: After you added the output of collect_env, we can see that the problem was that you had the CUDA 10.2 version of PyTorch installed. Based on that, an alternative solution would have been to update the graphics driver, as elaborated in item 2 and the linked answer.
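The driver check described in item 2 can be sketched in Python. The 440.33 minimum for CUDA 10.2 comes from the answer above; the 418.39 minimum for CUDA 10.1 is an assumption taken from NVIDIA's CUDA release notes and should be verified there. The table and function names are made up for illustration.

```python
# Minimum Linux driver per CUDA version.
# 10.2 -> 440.33 is quoted in the answer; 10.1 -> 418.39 is an assumption
# based on NVIDIA's release notes.
MIN_LINUX_DRIVER = {"10.1": (418, 39), "10.2": (440, 33)}


def parse_version(text):
    """Turn a dotted version string such as '418.87.00' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_version):
    """True if the driver meets the minimum for the given CUDA version."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_version]


if __name__ == "__main__":
    driver = "418.87.00"  # from the nvidia-smi output in the question
    for cuda in ("10.1", "10.2"):
        print(f"driver {driver} supports CUDA {cuda}: {driver_supports(driver, cuda)}")
```

With the 418.87.00 driver from the question, the check passes for CUDA 10.1 and fails for CUDA 10.2, which matches the diagnosis.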
TL;DR
1. Install the NVIDIA Toolkit provided by Canonical or by the NVIDIA third-party PPA.
2. Reboot your workstation.
3. Create a clean Python virtual environment (or reinstall all CUDA-dependent packages).
Description
First install NVIDIA CUDA Toolkit provided by Canonical:
sudo apt install -y nvidia-cuda-toolkit
or follow the NVIDIA developer instructions:
# ENVARS ADDED **ONLY FOR READABILITY**
NVIDIA_CUDA_PPA=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/
NVIDIA_CUDA_PREFERENCES=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
NVIDIA_CUDA_PUBKEY=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
# Add NVIDIA Developers 3rd-Party PPA
sudo wget ${NVIDIA_CUDA_PREFERENCES} -O /etc/apt/preferences.d/nvidia-cuda
sudo apt-key adv --fetch-keys ${NVIDIA_CUDA_PUBKEY}
echo "deb ${NVIDIA_CUDA_PPA} /" | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
# Install development tools
sudo apt update
sudo apt install -y cuda
Then reboot so the OS loads the kernel with the NVIDIA drivers.
Create an environment using your favorite manager (conda, venv, etc.):
conda create -n stack-overflow pytorch torchvision
conda activate stack-overflow
or reinstall pytorch and torchvision into the existing one:
conda activate stack-overflow
conda install --force-reinstall pytorch torchvision
otherwise NVIDIA CUDA C/C++ bindings may not be correctly detected.
Finally ensure CUDA is correctly detected:
(stack-overflow)$ python3 -c 'import torch; print(torch.cuda.is_available())'
True
Versions
NVIDIA CUDA Toolkit v11.6
Ubuntu LTS 20.04.x
Ubuntu LTS 22.04 (prior official release)
In my case, just restarting my machine made the GPU active again. The initial message I got was that the GPU is currently in use by another application. But when I looked at nvidia-smi, there was nothing that I saw. So, no changes to dependencies, and it just started working again.
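Before rebooting, it may be worth listing the compute processes the driver reports. The sketch below shells out to `nvidia-smi --query-compute-apps`, a query option of the standard nvidia-smi tool, and parses its CSV output; the helper name `parse_compute_apps` is hypothetical, and an empty result does not rule out a stale driver state like the one described above.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV rows into (pid, process_name, used_memory) tuples."""
    rows = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 3:
            rows.append((int(fields[0]), fields[1], fields[2]))
    return rows


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        output = subprocess.run(QUERY, capture_output=True, text=True).stdout
        print(parse_compute_apps(output) or "no compute processes reported")
```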
Another possible scenario is that the environment variable CUDA_VISIBLE_DEVICES is not set correctly before installing PyTorch.
In my case, the following worked.
Remove the CUDA drivers:
sudo apt-get remove --purge nvidia*
Then get the exact installation script for the drivers, based on your distro and system, from https://developer.nvidia.com/cuda-downloads?target_os=Linux
In my case it was Debian on x64, so I did:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda
And now nvidia-smi works as intended!
I hope that helps
If your CUDA version does not match what PyTorch expects, you will see this issue.
On Arch / Manjaro:
Get Pytorch from here: https://pytorch.org/get-started/locally/
Note what CUDA version you are getting PyTorch for
Get the same CUDA version from here: https://archive.archlinux.org/packages/c/cuda/
Install CUDA using (e.g.) sudo pacman -U --noconfirm cuda-11.6.2-1-x86_64.pkg.tar.zst
Do not update to a newer version of CUDA than PyTorch expects. If PyTorch wants 11.6 and you have updated to 11.7, you will get the error message.
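One way to check for such a mismatch is to read the release number from `nvcc -V` and compare it with the CUDA version your PyTorch build expects. This is a minimal sketch; the function names are made up, and the sample line is the nvcc output quoted earlier on this page.

```python
import re


def nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from `nvcc -V` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_matches(nvcc_output, expected):
    """True when the installed toolkit release equals the expected version."""
    return nvcc_release(nvcc_output) == expected


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 10.1, V10.1.243"
    print(nvcc_release(sample))             # 10.1
    print(toolkit_matches(sample, "10.1"))  # True
    print(toolkit_matches(sample, "10.2"))  # False
```

In practice the expected version would come from the PyTorch install you chose, for example the value reported by `torch.version.cuda`.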
Make sure that os.environ['CUDA_VISIBLE_DEVICES'] = '0' is set after if __name__ == "__main__":. So your code should look like this:
import torch
import os
if __name__ == "__main__":
    # Expose only GPU 0 to CUDA; this must run before CUDA is first initialized.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    print(torch.cuda.is_available())  # True
    ...

I get "Runtime API error : invalid device ordinal." when I run cuda code on Ubuntu 10.04 using a GTX 590

I am trying to run a CUDA application on an Ubuntu 10.04 system with a GeForce GTX 590.
I'm using the 295.41 drivers. I have set up 3 other systems with this code and all have worked. Two of them had GT 640s and one had a GTX 480 (or 460, I can't quite remember). I have run CUDA code on this machine with the same hardware before, but it has since been formatted.
I get the invalid device ordinal error when I run my code and also when I run the SDK examples. I first set up this machine with Gentoo and got this error. I thought it could have something to do with the OS, so I installed Ubuntu, and I have the same problem. I can't think of what else to try. Does anyone have any suggestions?
Below is some output that could be handy.
user@pchan1:~$ lspci | grep nVidia
02:00.0 PCI bridge: nVidia Corporation Device 05b1 (rev a3)
03:00.0 PCI bridge: nVidia Corporation Device 05b1 (rev a3)
03:02.0 PCI bridge: nVidia Corporation Device 05b1 (rev a3)
06:00.0 PCI bridge: nVidia Corporation Device 05b9 (rev a3)
07:00.0 PCI bridge: nVidia Corporation Device 05b9 (rev a3)
07:02.0 PCI bridge: nVidia Corporation Device 05b9 (rev a3)
08:00.0 3D controller: nVidia Corporation Device 1088 (rev a1)
08:00.1 Audio device: nVidia Corporation Device 0e09 (rev a1)
09:00.0 VGA compatible controller: nVidia Corporation Device 1088 (rev a1)
09:00.1 Audio device: nVidia Corporation Device 0e09 (rev a1)
user@pchan1:~$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 2012-10-30 10:22 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 2012-10-30 10:22 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 2012-10-30 10:22 /dev/nvidiactl
Edit: When I had this system working, I was using a 64-bit OS and the 64-bit drivers. I am now using a 32-bit OS and 32-bit drivers.
Another Edit:
Thanks very much Przemyslaw Zych. You helped me solve the problem.
I had to blacklist nouveau by doing the following.
Add a file in /etc/modprobe.d called blacklist-nouveau.conf (only the .conf extension matters) and put the following two lines in it.
blacklist nouveau
options nouveau modeset=0
This is as instructed in the following guide:
ftp://download.nvidia.com/XFree86/Linux-x86_64/256.44/README/commonproblems.html
Problem solved :)
As Przemyslaw Zych suggested there was another driver using the GPU (in this case nouveau).
To use the nvidia driver nouveau must be disabled. The procedure is listed here - ftp://download.nvidia.com/XFree86/Linux-x86_64/256.44/README/commonproblems.html - and I will summarise it below.
Create a file in /etc/modprobe.d called blacklist-nouveau.conf
Add the following two lines:
blacklist nouveau
options nouveau modeset=0
Then reboot the PC. This should prevent nouveau from loading and allow the nvidia drivers to be used.
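After the reboot, you can confirm that the blacklist took effect by checking which kernel modules are loaded. The sketch below reads /proc/modules, where the first field of each line is the module name; the helper names are made up for illustration.

```python
import os


def loaded_modules(modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def is_loaded(name, modules_text):
    """True if the named kernel module appears in the given module listing."""
    return name in loaded_modules(modules_text)


if __name__ == "__main__":
    if os.path.exists("/proc/modules"):
        with open("/proc/modules") as handle:
            text = handle.read()
        print("nouveau loaded:", is_loaded("nouveau", text))
        print("nvidia loaded:", is_loaded("nvidia", text))
    else:
        print("/proc/modules not available on this system")
```

If nouveau still shows as loaded, the blacklist file was likely not picked up at boot.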
