Theano GPU Installation Issue

I have installed Anaconda, Theano, and GPU Toolkit version 8, and I am getting this error:
ERROR: refusing to load cuda driver library because the version is blacklisted. Versions 373.06 and below are known to be ok.
If you want to bypass this check and force the driver load define GPUARRAY_FORCE_CUDA_DRIVER_LOAD in your environement.
ERROR (theano.gpuarray): Could not initialize pygpu, support disabled

See discussion here.
I reinstalled libgpuarray and it worked for me.

Your CUDA driver is faulty; install another one, preferably 373.06. Using your current driver will result in wrong computations, so DO NOT force the driver load.
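Once a non-blacklisted driver is installed, a quick way to confirm that pygpu initializes is to run a tiny Theano function on the GPU. This is only a sketch using the gpuarray backend; the flags and device name below are the usual ones, not anything specific to this setup.
# Sketch: verify the gpuarray backend loads after installing a supported driver.
# Run with: THEANO_FLAGS="device=cuda,floatX=float32" python gpu_check.py
# If pygpu still fails to initialize, the same ERROR lines appear at import time.
import numpy as np
import theano
import theano.tensor as T
x = T.matrix('x')
f = theano.function([x], T.exp(x))
f(np.random.rand(4, 4).astype('float32'))
print(theano.config.device)  # reflects the device flag, e.g. 'cuda'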

Related

UserWarning: CUDA initialization:

I have installed PyTorch 1.8.1+cu102 in a virtual environment on an HPC cluster.
torch.cuda.is_available()
is giving me the output below:
UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
What could be wrong? I am not sure how I can update the driver. My requirements are:
torch==1.8.1+cu102
torch-cluster==1.5.9
torch-geometric==1.7.0
First, check which CUDA version the PyTorch release you need requires. You can find the CUDA version corresponding to each PyTorch version at the link below.
https://pytorch.org/get-started/previous-versions/
Once you know the version, check whether it is supported by your GPU device. You can find the list at the link below.
https://developer.nvidia.com/cuda-gpus
If there is no match, you need to change either the PyTorch requirement or your GPU device.
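As a quick sanity check (a sketch, not specific to this cluster), you can compare the CUDA version the installed wheel was built against with what the warning reports. Here the wheel targets CUDA 10.2, while the driver reports 10010, which corresponds to CUDA 10.1, so the driver is the mismatch.
# Sketch: check which CUDA version the installed PyTorch wheel expects.
import torch
print(torch.__version__)          # e.g. 1.8.1+cu102 -> built for CUDA 10.2
print(torch.version.cuda)         # CUDA version the wheel was compiled against
print(torch.cuda.is_available())  # False when the driver is older than that version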

Coral Edge TPU USB Accelerator driver fails to install

I'm trying to install and test my Coral Edge TPU. I'm following the instructions here: https://coral.ai/docs/accelerator/get-started/
The first step is to install drivers from the coral website, but I'm getting the following errors. I've tried running with and without admin, and uninstalling and installing again, but I get the same errors.
Has anyone else run into this issue? I'm on Windows 10.
Installing UsbDk
Installing Windows drivers
Microsoft PnP Utility
Adding driver package: coral.inf
Driver package added successfully.
Published Name: oem69.inf
Adding driver package: Coral_USB_Accelerator.inf
Failed to add driver package: The hash for the file is not present in the specified catalog file. The file is likely corrupt or the victim of tampering.
Adding driver package: Coral_USB_Accelerator_(DFU).inf
Failed to add driver package: The hash for the file is not present in the specified catalog file. The file is likely corrupt or the victim of tampering.
Total driver packages: 3
Added driver packages: 1
Installing performance counters
Info: Provider {aaa5bf9e-c44b-4177-af65-d3a06ba45fe7} defined in C:\Users\User\Downloads\edgetpu_runtime_20201204\edgetpu_runtime\third_party\coral_accelerator_windows\coral.man is already installed in system repository.
Info: Successfully installed performance counters in C:\Users\User\Downloads\edgetpu_runtime_20201204\edgetpu_runtime\third_party\coral_accelerator_windows\coral.man
Copying edgetpu and libusb to System32
1 file(s) copied.
1 file(s) copied.
Install complete
Press any key to continue . . .
This is a bug in the Coral software. According to this thread, https://github.com/google-coral/edgetpu/issues/260, a few things were broken in the newest version (2.5.0 at the time). Starting with a fresh virtual environment and using release 2.1.0 and the corresponding driver for Python 3.7 (3.8 and 3.9 are not supported as of 2.1.0) fixed the issue.
From that thread:
For now, I suggest rolling back to the older driver:
https://dl.google.com/coral/edgetpu_api/edgetpu_runtime_20200728.zip
And you also need to remove your current tflite_runtime and downgrade it to an older version (make sure to change to the right python version):
pip3 install https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp36-cp36m-win_amd64.whl
pip3 install https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp37-cp37m-win_amd64.whl
Apologies, we are working to get this fixed ASAP
We recently uploaded a new package; this should fix it:
https://dl.google.com/coral/edgetpu_api/edgetpu_runtime_20210119.zip
source: coral.ai/software
You need to disable driver signature enforcement using the advanced boot menu; after that you can install it. To do this, open the Settings app, go to the “Update & Security -> Recovery” page, and click the “Restart now” button under the “Advanced startup” section.
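Whichever route fixes the installer errors, a quick way to confirm the runtime actually works from Python afterwards is to try loading the Edge TPU delegate. This is a sketch that assumes tflite_runtime is installed; the delegate library is edgetpu.dll on Windows and libedgetpu.so.1 on Linux.
# Sketch: confirm the Edge TPU delegate can be loaded by tflite_runtime.
from tflite_runtime.interpreter import load_delegate
try:
    delegate = load_delegate('edgetpu.dll')  # use 'libedgetpu.so.1' on Linux
    print('Edge TPU delegate loaded:', delegate)
except ValueError as e:
    print('Could not load the Edge TPU delegate:', e)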

PyTorch can't see GPU (torch.cuda.is_available() returns False)

I have a problem where
import torch
print(torch.cuda.is_available())
will print False, and I can't use the GPU available. I've tried it on conda environment, where I've installed the PyTorch version corresponding to the NVIDIA driver I have. I've also tried it in docker container, where I've done the same. I've tried both of these options on a remote server, but they both failed. I know that I've installed the correct driver versions because I've checked the version with nvcc --version before installing PyTorch, and I've checked the GPU connection with nvidia-smi which displays the GPUs on the machines correctly.
Also, I've checked this post and tried exporting CUDA_VISIBLE_DEVICES, but had no luck.
On the server I have NVIDIA V100 GPUs with CUDA version 10.0 (for conda environment) and version 10.2 on a docker container I've built. Any help or push in the right direction would be greatly appreciated. Thanks!
For anyone else having this problem, it turned out my server manager had not updated the drivers for the server.
I switched to a different server, installed Anaconda, and things started working as they should, i.e., torch.cuda.is_available() returns True after setting up a fresh environment.
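For anyone debugging a similar setup, a short diagnostic (a sketch, not specific to the machines above) helps separate a CPU-only wheel from a driver or runtime problem:
# Sketch: distinguish a CPU-only PyTorch build from a driver/runtime problem.
import os
import torch
print(torch.__version__)                        # a CUDA build carries a suffix such as +cu102
print(torch.version.cuda)                       # None means a CPU-only wheel was installed
print(os.environ.get('CUDA_VISIBLE_DEVICES'))   # an empty string hides every GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))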

Keras Illegal Instruction 4

I have been trying to get Keras working on my laptop running El Capitan, but when I attempt to import it I get the following message:
Using TensorFlow backend.
Illegal instruction: 4
I've looked for solutions, and have tried updating theano, installing mxnet-mkl, and running an older version of numpy (1.13) to avoid a FutureWarning error. None of these has seemed to fix the issue, though. I feel like I must be missing something somewhere.
Try installing an older version of TensorFlow (1.5 seems to work). AVX2 instructions are now enabled by default in the latest TF pip wheels, meaning that if your processor doesn't support these instructions, you will get an illegal instruction error.
Another solution is to install TensorFlow from source, which should detect your processor and configure it accordingly.
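If you want to confirm that missing AVX support is the cause before downgrading, a rough check from Python on an Intel Mac reads the CPU feature flags via sysctl. This is only a sketch; machdep.cpu.features is the usual sysctl key on Intel Macs.
# Rough check (sketch): does the CPU advertise AVX? Intel Macs only.
import subprocess
features = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.features'])
print('AVX supported:', b'AVX' in features.upper())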

nvidia error on Azure DSVM/DLVM

I have been creating a few Ubuntu DSVMs and DLVMs on Azure with GPUs and I keep getting intermittent errors. These manifest as nvidia-smi being really slow or as the following error:
2018/01/11 19:42:33 Error: nvml: Driver/library version mismatch
This will appear if I try to run nvidia-smi or nvidia-docker. A reboot usually fixes it but it can reappear.
Does this sound like an intermittent error? Is there something that I can do to mitigate this?
NVIDIA just released a new version of the GPU driver for the GPUs used in Azure. The Ubuntu DSVM is configured to automatically install updates, so these will be installed for you in the background. The issue, though, is that the driver is compiled into the kernel, so you must reboot to get the new driver. The message Driver/library version mismatch means that the version in the kernel can’t use the installed libraries (because they were upgraded). This is why rebooting usually fixes it.
There is a second issue you might be facing: Azure released a new kernel a few days ago that is incompatible with the 387 version of the GPU driver. You won’t get this driver by default on the DSVM, but you might if you installed other packages. This error is different – something like nvidia-smi could not communicate with the nvidia module. The only way to fix it is to (1) get the very latest kernel with apt update and apt upgrade, then reboot, and (2) install a different driver with apt install nvidia-384.
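Condensed into commands, the recovery sequence described in that second case looks roughly like this, run on the DSVM itself; nvidia-384 is the driver version named above, so adjust it if your setup differs.
sudo apt update
sudo apt upgrade
sudo reboot
# after the machine comes back up
sudo apt install nvidia-384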

Resources