I run the training phase of a TF2 model (based on the object detection pre-trained models from the TF2 Model Zoo) on a GPU (NVIDIA 3070).
Is there a way to run the evaluation phase (for the checkpoints created by training) on the CPU?
Because the training phase allocates almost all of the GPU's memory, I can't run both of them (train and eval) on the GPU.
OS - Ubuntu 20.04
GPU - Nvidia 3070 (driver 460)
TF - 2.4.1
Python - 3.8.5
Thank you.
In my case, the solution was to define the following inside the evaluation function:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
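For context, a minimal sketch of an evaluation entry point pinned to the CPU (the variable has to be set before TensorFlow initializes any GPU context, so the very top of the eval script/process is the safest place):
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'   # hide all GPUs from this process

import tensorflow as tf                     # imported after the variable is set

print(tf.config.list_physical_devices('GPU'))  # expected: [] -> evaluation runs on the CPU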
Related
I used two servers, one with a 3070 GPU and the other with a 2080 Ti, both with CUDA version 11.7, and training still gets stuck on this line:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False, output_device=None, device_ids=None)
As I describe in the title, I cannot use the GPU to accelerate my training, in either yolov5 or yolov6.
My torch version is 1.12, and the output of torch.cuda.is_available() is True.
Every configuration I have done follows the tutorials in their official GitHub repositories.
By the way, a while ago I was able to train yolov5 with GPU acceleration successfully, so this is odd.
The training config and a GPU usage snapshot are attached.
[screenshot: gpu usage]
[screenshot: training config]
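For reference, and purely as a generic sketch (this is not the yolov5/yolov6 code), DistributedDataParallel is usually initialized per process roughly like this when launched with torchrun; device_ids=None / output_device=None is normally reserved for CPU or multi-device modules:
# Generic single-node DDP setup; launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module) -> DDP:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)           # pin this process to one GPU
    dist.init_process_group(backend="nccl")     # reads rank/world size from the env
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)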
I have installed tensorflow-gpu to train my models on the GPU and have confirmed the installation as shown below.
import tensorflow as tf
tf.config.list_physical_devices()
#[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I started training an image classification model, and I expect it to run on the GPU automatically unless it is explicitly pinned to a device. While training the model, I could see in Task Manager that there were two GPUs: the Intel graphics card was GPU 0 and the NVIDIA GeForce GTX 1660 Ti was GPU 1. Does that mean TensorFlow didn't detect my NVIDIA card, or is that the GPU that was actually detected?
While training the model, I could see that my NVIDIA GPU utilization was very low, so I am not sure on which device my model was trained.
Can someone clarify, please?
Further version details: tf.__version__ 2.6.0, Python 3.7, CUDA 11.4, cuDNN 8.2.
Try enabling device placement logging:
tf.debugging.set_log_device_placement(True)
I think your Intel GPU is ignored by tf.config.list_physical_devices().
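As a quick check (not part of the original answer, just a sketch): with placement logging enabled, running any small op prints the device it executes on, so you can confirm whether ops land on the NVIDIA GPU that tf.config.list_physical_devices() reported:
import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log the device chosen for every op

# Any small op will do; the log (and .device) should point at /device:GPU:0,
# which is the NVIDIA card from the list_physical_devices() output above.
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
print(tf.matmul(a, b).device)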
I am training a pretty intensive ML model on a GPU. What often happens is that I start training the model, let it run for a couple of epochs, notice that my changes have not made a significant difference in the loss/accuracy, make edits, re-initialize the model, and restart training from epoch 0. In this case, I often get OOM errors.
My guess is that, despite me overriding all the model variables, something is still taking up space in GPU memory.
Is there a way to clear the GPU's memory in TensorFlow 1.15 so that I don't have to restart the kernel each time I want to start training from scratch?
It depends on exactly which GPUs you're using. I'm assuming you're using NVIDIA, but even then, depending on the exact GPU, there are three ways to do this:
nvidia-smi -r works on Tesla and other modern variants.
nvidia-smi --gpu-reset works on a variety of older GPUs.
Rebooting is the only option for the rest, unfortunately.
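If resetting the device is not an option (for example on a shared machine), an in-process alternative that is often enough in TF 1.15 is to drop the old graph and session before rebuilding the model. This is a sketch of that alternative, not the reset approach above, and with the caveat that TF's allocator keeps the reserved memory for reuse inside the same process rather than returning it to the driver:
import tensorflow as tf

def reset_before_retraining():
    # Discard the current Keras session and the TF 1.x default graph so the
    # old model's variables no longer hold GPU memory inside this process.
    tf.keras.backend.clear_session()    # closes the Keras-managed session
    tf.compat.v1.reset_default_graph()  # start from an empty graph (TF 1.x)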
My GPU is an NVIDIA RTX 2080 Ti
Keras 2.2.4
Tensorflow-gpu 1.12.0
CUDA 10.0
Once I build a model (before compilation), I find that GPU memory is fully allocated:
[0] GeForce RTX 2080 Ti | 50'C, 15 % | 10759 / 10989 MB | issd/8067(10749M)
What could be the reason, and how can I debug it?
I don't have spare memory to load the data, even if I load it via generators.
I monitored the GPU memory usage and found that it is full right after building the layers (before compiling the model).
I hit a similar problem when loading pre-trained ResNet50: GPU memory usage surged to 11 GB, while ResNet50 usually consumes less than 150 MB.
The problem in my case was that I also imported PyTorch without actually using it in my code. After commenting out the import, everything worked fine.
But I had another PC where the same code worked just fine, so I uninstalled and reinstalled TensorFlow and PyTorch with the correct versions. After that, everything works fine even when I import PyTorch.
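Separately from the PyTorch-import point above, TF 1.x reserves nearly all GPU memory by default as soon as a session touches the GPU, so a full-looking nvidia-smi does not necessarily mean the model itself needs that much. A minimal sketch of enabling on-demand allocation instead, assuming TF 1.12 with standalone Keras 2.2.4 as in the question:
import tensorflow as tf
from keras import backend as K   # standalone Keras 2.2.4, as in the question

# Allocate GPU memory on demand instead of reserving (almost) all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))  # must run before the model is built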