PyTorch multi-machine, multi-GPU distributed training hangs - pytorch

I used two servers, one with a 3070 GPU and the other with a 2080 Ti GPU, both on CUDA version 11.7, and training is still stuck on this line:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False, output_device=None, device_ids=None)
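For reference, below is a minimal sketch of the multi-node setup this call usually sits in (the torchrun launch, the LOCAL_RANK handling, and MyModel are assumptions for illustration, not taken from the question). Hangs at the DistributedDataParallel constructor are commonly caused by the process group not being initialized consistently on both machines, or by the nodes being unable to reach each other on the rendezvous/NCCL ports, since the constructor waits while parameters are broadcast from rank 0.

import os
import torch
import torch.distributed as dist

# Assumed launch via torchrun, which sets RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR and MASTER_PORT on every node.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)  # MyModel is a hypothetical model class
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    find_unused_parameters=False,
)

With device_ids=None and output_device=None, as in the line above, DDP expects the module to already sit on the single device intended for that process, so the device placement and the rank-to-GPU mapping have to be handled before the constructor runs.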

Related

How to know which GPU a TensorFlow model is training on

I have installed tensorflow-gpu to train my models on the GPU and have confirmed the installation as shown below.
import tensorflow as tf
tf.config.list_physical_devices()
#[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I started training an image classification model, and I expect it to run on the GPU automatically unless I explicitly place it on another device. But while training I could see in Task Manager that there were two GPUs: GPU 0 was the Intel integrated graphics and GPU 1 was the NVIDIA GeForce GTX 1660 Ti. Does that mean TensorFlow didn't detect my NVIDIA card, or is it the actual GPU that was detected?
While training the model I could see that my NVIDIA GPU utilization was very low, so I am not sure on which device the model was trained.
Can someone clarify please?
Further version details: tf.__version__ 2.6.0, Python 3.7, CUDA 11.4, cuDNN 8.2.
Try enabling device placement logging:
tf.debugging.set_log_device_placement(True)
I think your Intel GPU is ignored by tf.config.list_physical_devices().
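A minimal sketch of how the logging is typically used (the random matrices here are just placeholders, not the asker's model): enable it before any ops are created, run something small, and the log plus the tensor's .device attribute show where TensorFlow actually placed the work.

import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # must run before any ops are created

a = tf.random.uniform((1000, 1000))
b = tf.random.uniform((1000, 1000))
c = tf.matmul(a, b)
print(c.device)  # e.g. /job:localhost/replica:0/task:0/device:GPU:0 if the NVIDIA card is used

Note that the GPU numbers shown in Windows Task Manager are unrelated to TensorFlow's device indices; tf.config.list_physical_devices() only lists CUDA-capable devices, which is why the Intel GPU does not appear there.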

Clearing memory when training Machine Learning models with Tensorflow 1.15 on GPU

I am training a fairly intensive ML model on a GPU. What often happens is that I start training, let it run for a couple of epochs, notice that my changes have not made a significant difference in the loss/accuracy, make edits, re-initialize the model, and restart training from epoch 0. At that point I often get OOM errors.
My guess is that even though I override all the model variables, something is still taking up GPU memory.
Is there a way to clear the memory of the GPU in Tensorflow 1.15 so that I don't have to keep restarting the kernel each time I want to start training from scratch?
It depends on exactly which GPUs you're using. I'm assuming you're using NVIDIA, but even then, depending on the exact GPU, there are three ways to do this:
nvidia-smi -r works on TESLA and other modern variants.
nvidia-smi --gpu-reset works on a variety of older GPUs.
Rebooting is the only option for the rest, unfortunately.
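If resetting the device from outside is not an option, a lighter in-process step that is sometimes enough between runs (my assumption, not part of the answer above) is to discard the Keras session and the default graph before rebuilding the model:

import gc
import tensorflow as tf

def reset_tf_state():
    # Drop the current Keras session and its graph, then garbage-collect.
    tf.keras.backend.clear_session()
    tf.reset_default_graph()
    gc.collect()

reset_tf_state()
# model = build_model()  # hypothetical rebuild, then train from epoch 0 again

Note that the TF 1.x allocator does not return memory to the driver, so this mainly lets the same process reuse what it already holds; a full reset still requires the nvidia-smi route above or restarting the process.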

TF2 Model: How to run training on GPU and evaluation on CPU

I run the training phase of a TF2 model (based on the pre-trained object detection models from the TF2 Model Zoo) on a GPU (Nvidia 3070).
Is there some way to run the evaluation phase (for the checkpoints created by training) on the CPU?
Because the training phase allocates almost all of the GPU memory, I can't run both of them (train and eval) on the GPU.
OS - Ubuntu 20.04
GPU - Nvidia 3070 (driver 460)
TF - 2.4.1
Python - 3.8.5
Thank you.
In my case, the solution was to define, inside the evaluation function:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
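For this to take effect it has to be applied before TensorFlow initializes CUDA in the evaluation process, so in practice the evaluation is usually launched as a separate process with the variable set at the very top of the script (a generic sketch, not the asker's actual evaluation code):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide all GPUs from this process

import tensorflow as tf  # imported after the env var, so TF only sees the CPU

print(tf.config.list_physical_devices('GPU'))  # expected: []
# ...load the training checkpoints and run the evaluation loop here...

Setting the variable on the command line (CUDA_VISIBLE_DEVICES=-1 python eval_script.py, where eval_script.py stands for whatever runs the evaluation) achieves the same thing without touching the code.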

Keras and Tensorflow reserving all GPU memory on model build

my GPU is NVIDIA RTX 2080 TI
Keras 2.2.4
Tensorflow-gpu 1.12.0
CUDA 10.0
Once I build a model (before compilation), I find that GPU memory is fully allocated:
[0] GeForce RTX 2080 Ti | 50'C, 15 % | 10759 / 10989 MB | issd/8067(10749M)
What could be the reason, and how can I debug it?
I don't have spare memory left to load the data, even when loading via generators.
I have monitored the GPU memory usage and found that it is full just after building the layers (before compiling the model).
I met a similar problem when loading a pre-trained ResNet50: the GPU memory usage surged to 11 GB, while ResNet50 usually consumes less than 150 MB.
The problem in my case was that I also imported PyTorch without actually using it in my code. After commenting it out, everything worked fine.
But I had another PC where the same code worked just fine, so I uninstalled and reinstalled TensorFlow and PyTorch with the correct versions. After that everything worked, even with PyTorch imported.
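Separately from the PyTorch-import issue described above, TF 1.x reserves nearly all GPU memory by default as soon as a session is created; if the goal is just to keep the allocation proportional to actual use, the usual knob for Keras 2.2.4 on tensorflow-gpu 1.12 is allow_growth (a generic sketch, not the fix that resolved the answer above):

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True      # allocate GPU memory on demand
K.set_session(tf.Session(config=config))    # make Keras use this session

# build the Keras model afterwards; nvidia-smi should now show only what is actually used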

sklearn and Tensorflow with dual CPU machine

I am thinking about building a dual-CPU machine for machine learning. I already have a fast GPU in my current rig, but I am limited to 32 GB of DDR3 with an i7-4790K, and I am planning to upgrade to dual E5-2683 v3s.
I need CPU computing power for sklearn and grid search. Does sklearn work on two CPUs the same way it does on one? Will it use all the cores on both CPUs when n_jobs=-1?
Will TensorFlow only use the one CPU when training on my GPU? If I just copied and pasted the MNIST-for-experts tutorial from the TF website, would it use both CPUs and my GPU without specifying the devices?
I chose not to put this on the Super User forum because it is more about the software than the hardware.
From what I have read, even if you add an instruction like
with tf.Session() as sess:
    with tf.device("/cpu:0"):
        ...
TensorFlow treats it as a recommendation and might still use the GPU when it sees fit.
I guess it might use the other CPU as well.
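On the sklearn half of the question: n_jobs=-1 hands the work to joblib, which uses every logical core the OS exposes, so a dual-socket machine is driven the same way a single-socket one is. A minimal sketch (the dataset and parameter grid are made up for illustration):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    n_jobs=-1,  # use all logical cores visible to the OS, across both sockets
    cv=5,
)
search.fit(X, y)
print(search.best_params_)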
