Google Colab notebooks come packaged with PyTorch and GPU support (see this related SO answer); you then need to change the notebook runtime from CPU to GPU in the top Colab menu bar.
Yet training a GPyTorch model always fails with:
/usr/local/lib/python3.6/dist-packages/gpytorch/utils/cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
19 """
20 try:
---> 21 L = torch.cholesky(A, upper=upper, out=out)
22 # TODO: Remove once fixed in pytorch (#16780)
23 if A.dim() > 2 and A.is_cuda:
RuntimeError: CUDA error: invalid device function
Why?
Relevant info:
import torch
import gpytorch
torch.cuda.is_available()
True
print(torch.__version__)
print(gpytorch.__version__)
print(torch.backends.cudnn.version())
1.3.0+cu100
0.3.6
7603
I wrote an LSTM NLP classifier with PyTorch in Google Colab and it worked well. Now I am running it on Google Colab Pro, but I get this error:
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 3, 2) but found runtime version (8, 0, 5). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.
I have no idea how to fix this. I'm using GPU on colab pro.
I've tried this link and it didn't work.
How I declared device:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Fixed by upgrading cuDNN to 8.4.
Reference: https://github.com/JaidedAI/EasyOCR/issues/716
If you are using Google Colab, use this command:
!pip install --upgrade torch torchvision
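The error message itself suggests that a conflicting cuDNN may be sitting in LD_LIBRARY_PATH. A minimal sketch for checking that, using only the standard library. It assumes the conflicting copy is a file whose name starts with libcudnn; the helper name find_cudnn_libraries is made up for illustration:

```python
import glob
import os


def find_cudnn_libraries(search_path=None):
    """Return libcudnn* files found in the directories of a PATH-style string.

    search_path defaults to the LD_LIBRARY_PATH environment variable.
    The helper name and the 'libcudnn*' pattern are illustrative assumptions.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        found.extend(sorted(glob.glob(os.path.join(directory, "libcudnn*"))))
    return found


# Print every cuDNN library visible through LD_LIBRARY_PATH.
for path in find_cudnn_libraries():
    print(path)
```

The version numbers in the printed file names can be compared against the two versions quoted in the error message.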
I created an AWS GPU instance with g4dn.xlarge instance type.
I installed Python and Jupyter-notebook as well.
When I try to list the GPU details in the Jupyter notebook with the code below:
import tensorflow as tf
tf.config.list_physical_devices()
Output:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
I tried a few other methods as well:
1
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, " Type:", gpu.device_type)
2
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
3
from tensorflow.python.client import device_lib
def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
get_available_gpus()
None of these snippets reports a GPU.
What are the possible options for accessing the GPU?
Since you said you installed Python yourself, you probably didn't start from a deep learning AMI with all the drivers preinstalled, so you would have to install the NVIDIA drivers, CUDA, and cuDNN yourself. But installing NVIDIA drivers on an AWS EC2 instance can be tough...
Solution: start with the deep learning AMIs:
https://aws.amazon.com/machine-learning/amis/
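Before rebuilding the instance, it may help to confirm whether the driver is present at all. A minimal sketch, assuming the driver ships the usual nvidia-smi command-line tool; the function name nvidia_driver_status is hypothetical:

```python
import shutil
import subprocess


def nvidia_driver_status():
    """Return (ok, message) describing whether nvidia-smi can see a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False, "nvidia-smi not found: the NVIDIA driver is probably not installed"
    try:
        result = subprocess.run(
            [exe, "-L"],  # -L lists the GPUs the driver can see
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as err:
        return False, f"nvidia-smi could not be run: {err}"
    if result.returncode != 0:
        return False, (result.stderr or result.stdout).strip()
    return True, result.stdout.strip()


ok, message = nvidia_driver_status()
print("driver OK" if ok else "driver problem", "-", message)
```

If this reports a driver problem, TensorFlow has no way to see the GPU, whichever Python-side call is used to list devices.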
SOLUTION at the bottom!
I want to do Object Detection with this tutorial:
https://towardsdatascience.com/building-your-own-object-detector-pytorch-vs-tensorflow-and-how-to-even-get-started-1d314691d4ae
Although I have compatible versions of PyTorch, torchvision and CUDA:
conda list torch gives me:
I get the following RuntimeError at the bottom:
RuntimeError: Couldn't load custom C++ ops. This can happen if your
PyTorch and torchvision versions are incompatible, or if you had
errors while compiling torchvision from source. For further
information on the compatible versions, check
https://github.com/pytorch/vision#installation for the compatibility
matrix. Please check your PyTorch version with torch.__version__ and
your torchvision version with torchvision.__version__ and verify if
they are compatible, and if not please reinstall torchvision so that
it matches your PyTorch install.
when running:
num_epochs = 10
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)  # .to_fp16()
    lr_scheduler.step()
    evaluate(model, data_loader_test, device=device)
Is it really an error resulting from an incompatibility between PyTorch and torchvision?
Thank you very much.
SOLUTION:
I was importing torchvision from the wrong directory. I found this out using the following:
import torchvision
print(torchvision.__path__)
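The same check can be done without actually importing the packages, which is handy when the import itself fails. A minimal sketch using the standard library; module_origin is a hypothetical helper name:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Show where Python would load each package from.
for name in ("torch", "torchvision"):
    print(name, "->", module_origin(name))
```

A path that points into the working directory or a stale source checkout, rather than the environment's site-packages, suggests that the wrong copy is being picked up.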
I am facing a problem while training neural networks using tensorflow-keras. I am getting this error:
F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch
value instead of handling error Internal: failed to get device
attribute 13 for device 0: CUDA_ERROR_UNKNOWN: unknown error
I was getting this error initially. Then I followed the solution in failed-to-get-device-attribute-13-for-device-0 and updated the graphics driver. This worked for 3-4 runs, and now I am getting the same error again.
Following are the details of my environment:
Python 3.7 (Anaconda)
Tensorflow 2.1
Nvidia GeForce RTX 2060, 6GB Graphics
Windows 10 Version 1809
In my case, enabling GPU memory growth worked, so that TensorFlow allocates GPU memory on demand instead of reserving all of it at startup.
Add the following code segment at the beginning of your code:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
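The snippet above assumes that at least one GPU is visible, since gpus[0] raises an IndexError otherwise. A slightly more defensive sketch, assuming TensorFlow 2.1 or later; the function name enable_memory_growth is hypothetical, and the import is placed inside it only so the script does not crash where TensorFlow is missing:

```python
def enable_memory_growth():
    """Enable memory growth on every visible GPU; return how many were configured."""
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed")
        return 0

    configured = 0
    for gpu in tf.config.list_physical_devices('GPU'):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print("Could not set memory growth for", gpu.name, "-", err)
    return configured


print("GPUs configured:", enable_memory_growth())
```

Memory growth has to be set before TensorFlow initializes the GPU, so this call belongs at the very top of the program, before any model or tensor is created.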
I prototyped a piece of Python deep learning code that works on Windows, but I can't make it work on Linux. I identified that the problem comes from load_model.
Here is the piece of Python code that behaves differently on Windows and on Linux.
Both Keras installations were made from the Keras Team GitHub source repository, because the standard Keras package does not recognize the model format; a patch for the characters format was made very recently in the GitHub source code.
Do you have an idea of what's going on?
The code:
from keras.models import load_model, Model
import sys
import keras
import tensorflow as tf
import os
import platform
print("----------------------------------------------")
print("Operating system:")
print(os.name)
print(platform.system())
print(platform.release())
print("----------------------------------------------")
print("Python version:")
print(sys.version)
print("----------------------------------------------")
print("Tensorflow version: ", tf.__version__)
print("----------------------------------------------")
print("Keras version : ", keras.__version__)
print("----------------------------------------------")
yolo_model = load_model("model.h5")
Windows output:
Using TensorFlow backend.
----------------------------------------------
Operating system:
nt
Windows
7
----------------------------------------------
Python version:
3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)]
----------------------------------------------
Tensorflow version: 1.4.0
----------------------------------------------
Keras version : 2.1.2
----------------------------------------------
2018-01-06 21:54:37.700794: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
C:\Users\David\AppData\Local\Programs\Python\Python36\lib\site-packages\keras-2.1.2-py3.6.egg\keras\models.py:252: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
Linux output:
Using TensorFlow backend.
----------------------------------------------
Operating system:
posix
Linux
4.9.0-5-amd64
----------------------------------------------
Python version:
3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118]
----------------------------------------------
Tensorflow version: 1.4.1
----------------------------------------------
Keras version : 2.1.2
----------------------------------------------
----------------------------------------------
2018-01-06 21:47:58.099715: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
Erreur de segmentation
The French "Erreur de segmentation" means "Segmentation fault".
Thank you for your help!
Glassfrog
I only found a workaround.
As the model file was a data conversion from another weights file in another format, I went and regenerated the Keras model for the latest version of Keras.
Now it works.
But I still don't know what caused the segmentation fault.
From what I can tell, the segfault happens at the model creation, but I have no idea why.
I could debug this by saving the model and weights independently:
from keras.models import load_model
x = load_model('combined_model.h5') # runs only on the source machine
with open('model.json', 'w') as fp:
    fp.write(x.to_json())
x.save_weights('weights.h5')
On the other machine I tried to load the model from the JSON file, but got the segmentation fault as well:
from keras.models import model_from_json
with open('model.json', 'r') as fp:
    model = model_from_json(fp.read())  # segfaults here
If you can re-create the model on the target machine by building the Sequential model again in code, you can then load the saved weights into it:
from keras.models import Sequential
# ...
new_model = Sequential()
# [...] run your model creation here...
new_model.load_weights('weights.h5')
new_model.predict(...) # this should work now