Porting Torch GPU code with Numba CUDA kernels to work on Apple silicon - pytorch

I have a simulation written in Python which utilizes the GPU mainly through PyTorch operations, but in a couple of places I had to write (relatively simple) custom kernels via Numba's cuda library (using as_cuda_array() on the torch tensors to get a DeviceNDArray handle).
I've now moved to an Apple machine with an M1 processor. It seems the torch code can easily be edited to run on the Apple GPU, but Numba has no such option.
What would be the easiest way to rewrite the code to work on Apple silicon?
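For reference, the pattern described above looks roughly like the sketch below (the kernel and sizes are made-up placeholders, not the asker's code): a Numba CUDA kernel launched on a zero-copy view of a torch CUDA tensor, plus the plain-Torch part moved to the MPS backend on Apple silicon, which has no Numba counterpart for the custom kernel.
import torch
from numba import cuda

# Toy elementwise kernel standing in for the "relatively simple" custom kernels
@cuda.jit
def scale_kernel(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

# CUDA path (the old machine): launch the kernel on a DeviceNDArray view of the tensor
t = torch.ones(1024, device='cuda')
d_t = cuda.as_cuda_array(t)                           # zero-copy handle, no data movement
scale_kernel[(t.numel() + 255) // 256, 256](d_t, 2.0)

# Apple-silicon path: plain Torch ops can move to MPS, but the Numba kernel cannot;
# it has to be re-expressed in Torch ops (or a Metal kernel) instead.
if torch.backends.mps.is_available():
    t_mps = torch.ones(1024, device='mps')
    t_mps *= 2.0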

Related

NVIDIA vs PyTorch versions of cuDNN

After installing PyTorch as per the official command:
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch, the cuDNN version shown in conda list is pytorch 1.7.1 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch, whereas my system has cuDNN 8.5.0.
Does this have an effect on how we train models?
TL;DR: Probably not, but it depends on the difference between versions.
Explanation
In practice, upgrades like the one you have (conda cudnn7.6.5_0 -> system cudnn8.5.0) usually don't harm training, because versions stay backward compatible for quite a while. Things do eventually get deprecated (after years, typically), so you should avoid letting the gap grow so large that one side relies on operations the other doesn't implement. You should be more interested in the minimum version your GPU requires (for example, Nvidia's 3090 requires CUDA 11.1 or above).
Furthermore, torch relies on cuDNN for implementations of fairly standard operations, and those usually don't change much between versions. As Nvidia describes cuDNN:
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
If you want to check the actual support matrix from Nvidia, open the cuDNN documentation site and click the support matrix for a specific version; there you can see that CUDA 10.2 is still supported by cuDNN 8.5.0. As for what py3.8_cuda10.2.89_cudnn7.6.5_0 means, it says that your build for CUDA 10.2.89 was compiled against the primitives available in cuDNN 7.6.5.
Also keep in mind that you have to have the required version of the Nvidia driver.
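If you want to confirm which CUDA and cuDNN versions your PyTorch build actually links against (as opposed to what is installed system-wide), PyTorch exposes them directly; a minimal check might look like this (the output values in the comments are just examples):
import torch

# Versions the installed PyTorch package was built against
print(torch.__version__)                    # e.g. 1.7.1
print(torch.version.cuda)                   # e.g. 10.2
print(torch.backends.cudnn.version())       # e.g. 7605, i.e. cuDNN 7.6.5
print(torch.backends.cudnn.is_available())  # True if cuDNN can be used at all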

Google Colab + Pytorch: RuntimeError: No CUDA GPUs are available

Screenshot of error:
Hello, I am trying to run this Pytorch application, which is a CNN for classifying dog and cat pics.
I am using Google Colab for the GPU, but for some reason I get RuntimeError: No CUDA GPUs are available. This is weird because I specifically enabled the GPU in the Colab settings and then tested whether it was available with torch.cuda.is_available(), which returned True.
The weirdest thing is that this error doesn't appear until about 1.5 minutes after I run the code. You would think that if it couldn't detect the GPU, it would notify me sooner.
I've had no problems using the Colab GPU when running other Pytorch applications using the exact same notebook. I can only imagine it's a problem with this specific code, but the returned error is so bizarre that I had to ask on StackOverflow to make sure.
Try again; this is usually a transient issue when no CUDA GPUs are available.
Recently I had a similar problem, where print(torch.cuda.is_available()) was True in Colab but False in one specific project. Both of our projects had a line setting os.environ["CUDA_VISIBLE_DEVICES"]; after commenting that line out, I found that the GPU could be used.
(My English is poor; I used Google Translate.)
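In other words, the likely culprit is a line that restricts CUDA_VISIBLE_DEVICES before CUDA is initialized. A small illustration of that failure mode (the empty value here is hypothetical; the original project may set an invalid index instead):
import os

# If this runs before the first CUDA call and hides the only GPU,
# torch.cuda later raises "RuntimeError: No CUDA GPUs are available".
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # hypothetical value for illustration

import torch
print(torch.cuda.is_available())          # False with the line above; True once it is commented out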

opencv doesn't use all GPU memory

I'm trying to use the cvlib package, which uses the yolov3 model, to recognize objects in images, on Windows 10.
Let's take an easy example:
import cv2
import cvlib as cv
import time
from cvlib.object_detection import draw_bbox

img = cv2.imread('image.jpg')  # placeholder: loading the image isn't shown in the original snippet
inittimer = time.time()
bbox, label, conf = cv.detect_common_objects(img, confidence=0.5, model='yolov3-worker', enable_gpu=True)
print('The process took %.3f s' % (time.time() - inittimer))
output_image = draw_bbox(img, bbox, label, conf)
The results give ~60ms.
cvlib uses opencv to compute this CNN part.
If I now check how much GPU memory tensorflow uses (via subprocess), it takes only 824 MiB.
While the program runs, if I launch nvidia-smi it gives me this result:
As you can see, there is much more memory available. My question is simple: why doesn't cvlib (and therefore tensorflow) use all of it to improve detection time?
EDIT:
As far as I understand, cvlib uses tensorflow, but it also uses an opencv detector. I installed opencv with cmake and CUDA 10.2.
I don't understand why, but nvidia-smi reports CUDA Version: 11.0, which is not what I installed. Maybe that's part of the problem?
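For reference, the memory figure mentioned above can be read programmatically with a subprocess call to nvidia-smi; a minimal sketch (the query fields are an assumption, not the asker's actual code):
import subprocess

# Ask nvidia-smi only for the memory counters, in machine-readable CSV form
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"]
)
print(out.decode().strip())   # e.g. "824 MiB, 8192 MiB"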
You can verify whether opencv is using CUDA or not. This can be done with the following:
import cv2
print(cv2.cuda.getCudaEnabledDeviceCount())
This should give you the number of CUDA-enabled devices on your machine. You should also check the build information using the following:
import cv2
print(cv2.getBuildInformation())
The output in both cases indicates whether your opencv build can access the GPU. If it can't, you may consider reinstalling it.
I got it! The problem came from the fact that I created a new Net object for each iteration.
Here is the related issue on github where you can follow it: https://github.com/opencv/opencv/issues/16348
With a custom function, it now runs at ~60 fps. Be aware that cvlib is perhaps not designed for real-time computation.
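The asker's custom function is not shown, but the general fix is to build the network once and reuse it across frames instead of re-creating it on every call. A sketch with OpenCV's dnn module, assuming a CUDA-enabled OpenCV 4.x build (the file names are placeholders):
import cv2

# Build the network once (weights/config paths are placeholders)
net = cv2.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

def detect(frame):
    # Reuse the same net for every frame; only the input blob changes
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    return net.forward(net.getUnconnectedOutLayersNames())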
workon opencv_cuda
cd opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE
Run the above and share the result. It should show whether CUDA support was detected (the expected output is not shown here).

Compile and use opencv on GPU for image processing with Python

I want to do some image processing with Python using the OpenCV library.
Actually, I want to read a lot of images from an object storage, do some image processing on each image, and do it as fast as possible.
I want to compile and use OpenCV on the GPU instead of the CPU to get the most speed.
Is there any way to use the GPU for this purpose?
(I know that it is possible with C++.)
Is there a GPU module in OpenCV for Python? I didn't find any wrapper for Python.
Yes, opencv does have a GPU module in Python 3. You can visit https://docs.opencv.org/2.4/modules/gpu/doc/introduction.html for more information. There are plenty of examples.
There is an example of GPU acceleration at https://answers.opencv.org/question/203403/opencv-gpu-accelerated-using-python/
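Assuming an OpenCV build compiled with CUDA support, the cv2.cuda module can be used from Python roughly like this (the image path is a placeholder):
import cv2

img = cv2.imread('input.jpg')          # placeholder path

# Upload to the GPU, run an operation there, download the result
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)
gpu_gray = cv2.cuda.cvtColor(gpu_img, cv2.COLOR_BGR2GRAY)
gray = gpu_gray.download()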

Object detection slow and does not use GPU

I need to use the Tensorflow Object Detection API for some classification connected with recognition.
My problem is that using the API for detection with a pretrained COCO model takes too much time and clearly does not use the GPU. I checked my tensorflow-gpu installation with several other scripts and it works fine, but when I use this model for detection I only see an increase in CPU usage.
I checked different versions of tensorflow (1.12, 1.14) and different combinations of CUDA Toolkit (9.0, 10.0) and cuDNN (7.4.2, 7.5.1, 7.6.1), but it is all the same; I also tried it on both Windows 7 and Ubuntu 16.04 with no difference. My project, however, requires much faster detection times.
System information:
System: Windows 7, Ubuntu 16.04
Tensorflow: 1.12, 1.14
GPU: GTX 970
Run the following Python code; if it detects the GPU then you can use the GPU for training, otherwise there is some problem:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
One more thing: just because your CPU is being utilized does not mean the GPU is not at work. The CPU will always be busy, but the GPU should also spike when you are training.
Paste the output of the above code in a comment if you are not sure about it.
Edit: After chatting with the OP in the comments and looking at the suggested code, it is using a pretrained model, so no training is happening here. You are using an existing model, not training a new one, so no GPU is being used.
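To go beyond guessing from CPU/GPU utilization, TF 1.x can also log where each op is placed; a minimal sketch for the versions mentioned (1.12/1.14), using a toy computation rather than the detection code:
import tensorflow as tf

# Log every op's placement (CPU vs GPU) when the session runs
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])
    print(sess.run(a + b))   # the placement log appears on the console/stderr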
