How to know which GPU a TensorFlow model is training on - python-3.x

I have installed tensorflow-gpu to train my models on the GPU and have confirmed the installation as shown below.
import tensorflow as tf
tf.config.list_physical_devices()
#[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I started training an image classification model and I expect it to run on the GPU automatically unless I explicitly place it on a device. But while training the model I could see in Task Manager that there were two GPUs: the Intel integrated graphics was GPU 0 and the NVIDIA GeForce GTX 1660 Ti was GPU 1. Does that mean TensorFlow didn't detect my NVIDIA card, or is the NVIDIA card the GPU that was actually detected?
While training the model I could see that my NVIDIA GPU utilization was very low, so I am not sure which device my model was trained on.
Can someone clarify, please?
Version details: tf.__version__ 2.6.0, Python 3.7, CUDA 11.4, cuDNN 8.2

Try enabling device placement logging:
tf.debugging.set_log_device_placement(True)
Your Intel GPU is ignored by tf.config.list_physical_devices() because TensorFlow only enumerates CUDA-capable devices, so the GPU:0 it reports is your NVIDIA GTX 1660 Ti even though Task Manager labels it GPU 1.
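If you want to confirm which physical card TensorFlow's GPU:0 maps to, a minimal sketch (using tf.config.experimental.get_device_details, available in TF 2.x) could look like this:
import tensorflow as tf

# Log the device every op is placed on.
tf.debugging.set_log_device_placement(True)

# Print the name of each GPU TensorFlow actually sees.
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, '->', details.get('device_name'))
# On this setup the expected output is something like:
# /physical_device:GPU:0 -> NVIDIA GeForce GTX 1660 Ti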

Related

PyTorch multi-machine multi-GPU distributed training hangs

I used two servers, one with a 3070 GPU and the other with a 2080 Ti, both on CUDA 11.7. Training still gets stuck on this line:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False, output_device=None, device_ids=None)
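DDP's constructor blocks until every rank has joined the process group, so a hang there usually points to the rendezvous or NCCL networking rather than the call itself. A minimal per-rank setup sketch, assuming one GPU per process, the NCCL backend, and a launcher such as torchrun that sets the rendezvous environment variables (the Linear layer is only a toy stand-in for the real model):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the launcher (e.g. torchrun) sets RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR and MASTER_PORT on every machine.
dist.init_process_group(backend='nccl', init_method='env://')

local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)   # toy model as a stand-in
# The DDP constructor broadcasts parameters, so it waits until every rank
# reaches this line; a hang here usually means a rank never started or
# NCCL cannot reach the other machine.
model = DDP(model, device_ids=[local_rank], output_device=local_rank)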

Why can't I use the GPU when training yolov5 & yolov6 on my own dataset?

As described in the title, I cannot use the GPU to accelerate my training process in either yolov5 or yolov6.
My torch version is 1.12 and the output of torch.cuda.is_available() is True.
Every configuration step I performed follows the tutorials in their official GitHub repositories.
Oddly, a while ago I could train yolov5 with GPU acceleration successfully.
The training config and GPU usage snapshots are attached.
[screenshots: GPU usage, training config]
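It may help to rule out the model silently falling back to the CPU. A minimal check, assuming a yolov5-style repo where train.py accepts a --device argument (the dataset and weights paths below are only illustrative):
import torch

# Confirm the CUDA build of PyTorch sees the NVIDIA card.
print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # should name the NVIDIA GPU

# yolov5's train.py lets you pin the device explicitly, e.g.:
#   python train.py --data my_dataset.yaml --weights yolov5s.pt --device 0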

Pytorch Model set GPU to run on Nvidia gpu

I am learning ML and trying to run a PyTorch model on my Nvidia GTX 1650.
torch.cuda.is_available() => True
model.to(device)
I implemented the above lines to run the model on the GPU, but Task Manager shows two GPUs:
1. Intel Graphics
2. Nvidia GTX 1650
The utilization fluctuation shows up under the Intel GPU and not under the Nvidia one.
How can I run it on the Nvidia GPU?
NOTE: The code works fine and appears to execute on the Intel one, with an epoch time of around 90-100 s.
Just do
device = torch.device("cuda:0")
Try this, hope it works
device = 'cuda'
check = torch.cuda.current_device()          # index of the currently selected CUDA device
print(torch.cuda.get_device_name(check))     # should print the Nvidia card's name
model.to(device)                             # move the model's parameters onto the GPU
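Moving only the model is not enough: each input batch has to be sent to the same device, otherwise the work quietly stays on the CPU. A minimal self-contained sketch (the Linear layer and random tensors are toy stand-ins, not from the original post):
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)                 # toy model standing in for the real one
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 10).to(device)             # the batch must be moved too
targets = torch.randint(0, 2, (32,)).to(device)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

# Confirms the parameters really live on cuda:0 when a GPU is available.
print(loss.item(), next(model.parameters()).device)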

TF2 Model: How to run training on GPU and evaluation on CPU

I run the training phase of a TF2 model (based on object detection pre-trained models from the TF2 Model Zoo) on a GPU (Nvidia 3070).
Is there a way to run the evaluation phase (for checkpoints created by training) on the CPU?
Because the training phase allocates almost all of the GPU memory, I can't run training and evaluation on the GPU at the same time.
OS - Ubuntu 20.04
GPU - Nvidia 3070 (driver 460)
TF - 2.4.1
Python - 3.8.5
Thank you.
In my case, the solution was to set, inside the evaluation script:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
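Note that hiding the GPU this way only takes effect if the variable is set before TensorFlow initializes its devices, so it belongs at the very top of the evaluation script. A minimal sketch:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'        # must run before TensorFlow touches the GPU

import tensorflow as tf

# From here on TensorFlow only sees the CPU.
print(tf.config.list_physical_devices('GPU'))    # -> []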

Keras/TensorFlow reserving all GPU memory on model build

My GPU is an NVIDIA RTX 2080 Ti
Keras 2.2.4
Tensorflow-gpu 1.12.0
CUDA 10.0
Once I build a model (before compilation), I find that the GPU memory is fully allocated:
[0] GeForce RTX 2080 Ti | 50'C, 15 % | 10759 / 10989 MB | issd/8067(10749M)
What could be the reason, and how can I debug it?
I don't have spare memory to load the data, even when loading via generators.
I monitored the GPU memory usage and found it is full just after building the layers (before compiling the model).
I met a similar problem when loading a pre-trained ResNet50: GPU memory usage surged to 11 GB, while ResNet50 usually consumes less than 150 MB.
The problem in my case was that I also imported PyTorch without actually using it in my code. After commenting it out, everything worked fine.
But I have another PC where the same code works just fine, so I uninstalled and reinstalled TensorFlow and PyTorch with the correct versions; after that everything works even with PyTorch imported.
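Separately from the PyTorch-import issue, TensorFlow 1.x reserves nearly all GPU memory by default as soon as a session is created; if the goal is just to stop that, a minimal sketch for the versions in the question (Keras 2.2.4, tensorflow-gpu 1.12) is:
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True        # allocate GPU memory on demand instead of all at once
K.set_session(tf.Session(config=config))      # make Keras use this session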
