I have multiple GPU devices and want to run a PyTorch model on them. I have already followed the Multi-GPU Examples and Data Parallelism tutorials by adding
device = torch.device("cuda:0,1,2")
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
model.to(device)
to my code. But the training still runs on only one GPU (cuda:0). In the shell, I also selected the GPUs with export CUDA_VISIBLE_DEVICES=0,1,2 before running the script.
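One way I confirm that only one of the GPUs is actually active during training (a quick sketch, separate from my training script; allocated memory is just a rough proxy for activity):

import torch

# list every visible GPU and how much memory PyTorch has allocated on it
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    mb = torch.cuda.memory_allocated(i) / 1e6
    print(f"cuda:{i} ({name}): {mb:.1f} MB allocated")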
Could anyone help me to solve this issue please?
I am learning ML and want to retrain an AI model for lane detection.
I want to become familiar with the ML training process. Accuracy is not my primary goal, and I do not need the best lane detection model.
I found this AI model and want to try it out, but I have run into a problem:
I do not have a GPU, so I would like to train this model on my CPU. Unfortunately, parts of the code are written for CUDA. Is there a way I can convert this GPU code to CPU-only code?
Or should I look for another AI model that supports CPU-only training?
You can use the tensor.to(device) method to move a tensor to a device.
The .to() method is also used to move a whole model to a device, as in the post you linked to.
Another possibility is to set the device of a tensor at creation time with the device= keyword argument, as in t = torch.tensor(some_list, device=device).
To set the device dynamically in your code, you can use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to select CUDA as your device whenever it is available.
The above covers how to put the code on CUDA. To go the other way and run on CPU only, use Ctrl+F to find everything that forces the code onto the GPU, such as .cuda() calls and .to() calls with a CUDA device, and change them to target the CPU.
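Putting the pieces above together, a minimal device-agnostic sketch might look like this (the linear model and random tensor are placeholders for whatever the lane-detection code defines):

import torch
import torch.nn as nn

# pick CUDA when it is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(16, 2).to(device)      # placeholder model
t = torch.randn(8, 16, device=device)    # tensor created directly on the chosen device

out = model(t)                           # model and input are on the same device
print(out.device)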
I ran into a strange problem. I trained my model on one GPU (an RTX Titan) and it does not converge. However, it works well on two identical GPUs with the same settings. It has nothing to do with the batch size. The model uses torch.fft and the Transformer layer. I use Python 3.8, PyTorch 1.7.1 and CUDA 10.1.
I currently train my model on GPUs with PyTorch Lightning:
trainer = pl.Trainer( gpus=[0,1],
distributed_backend='ddp',
resume_from_checkpoint=hparams["resume_from_checkpoint"])
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
The instructions are also clear on how to run test samples with a trainer defined to use the GPU:
trainer.test(test_dataloader=test_dataloader)
and also how to load a model and use it interactively
model = transformer.Model.load_from_checkpoint('/checkpoints/run_300_epoch_217.ckpt')
results = model(in_data)
I use the latter to interface with an interactive system via sockets in a Docker container.
Is there a proper way to make this PyTorch Lightning model run on the GPU?
The Lightning instructions say not to use model.to(device), but it appears to work just like plain PyTorch. Is that instruction just there to avoid a side effect?
I started reading about ONNX, but I would rather just have an easy way to specify the GPU, since the interactive setup already works perfectly on the CPU.
My understanding is that "remove any .cuda() or .to(device) calls" only applies when you use the Lightning Trainer, because the Trainer handles device placement itself.
If you don't use the Trainer, a LightningModule is basically just a regular PyTorch model with some naming conventions, so model.to(device) is the way to run it on the GPU.
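A minimal sketch of that interactive path, reusing the checkpoint call from the question (transformer.Model, the checkpoint path, and in_data are the question's own names, used here as placeholders):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the LightningModule from its checkpoint, then treat it as a plain nn.Module
model = transformer.Model.load_from_checkpoint('/checkpoints/run_300_epoch_217.ckpt')
model.to(device)
model.eval()

with torch.no_grad():
    results = model(in_data.to(device))  # the inputs have to be moved to the same device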
Since I am fairly new to this field, I tried following the official TensorFlow tutorial on time-series prediction: https://www.tensorflow.org/tutorials/structured_data/time_series
The following problem occurs:
When training a multivariate model, the kernel dies and restarts after 2 or 3 epochs.
However, this doesn't happen with a simpler univariate model, which has only one LSTM layer (I am not sure whether that makes a difference).
Also, this problem only started today; yesterday the training of the multivariate model ran without errors.
As can be seen in the tutorial linked above, the model looks like this:
multi_step_model = tf.keras.models.Sequential()
multi_step_model.add(tf.keras.layers.LSTM(32,return_sequences=True,input_shape=x_train_multi.shape[-2:]))
multi_step_model.add(tf.keras.layers.LSTM(16, activation='relu'))
multi_step_model.add(tf.keras.layers.Dense(72))
multi_step_model.compile(optimizer=tf.keras.optimizers.RMSprop(clipvalue=1.0), loss='mae')
And the kernel dies after executing the following cell (usually after 2 or 3 epochs).
multi_step_history = multi_step_model.fit(train_data_multi, epochs=10,
steps_per_epoch=300,
validation_data=val_data_multi,
validation_steps=50)
I have uninstalled and reinstalled TensorFlow and restarted my laptop, but nothing seems to work.
Any ideas?
OS: Windows 10
Surface Book 1
The problem was a batch size that was too large. Reducing it from 1024 to 256 stopped the crashes.
Solution taken from the comment by rbwendt on this thread on GitHub.
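In the tutorial the batch size is set once, before the tf.data pipelines are built, so the fix is a one-line change. A sketch of the relevant lines (BATCH_SIZE, BUFFER_SIZE and the dataset variables are the tutorial's names, not new code):

import tensorflow as tf

BATCH_SIZE = 256    # was 1024; the larger batches exhausted memory and killed the kernel
BUFFER_SIZE = 10000

train_data_multi = tf.data.Dataset.from_tensor_slices((x_train_multi, y_train_multi))
train_data_multi = train_data_multi.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()

val_data_multi = tf.data.Dataset.from_tensor_slices((x_val_multi, y_val_multi))
val_data_multi = val_data_multi.batch(BATCH_SIZE).repeat()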
I need to use the TensorFlow Object Detection API to do some classification combined with recognition.
My problem is that running detection with a pretrained COCO model takes too much time and clearly does not use the GPU. I checked my tensorflow-gpu installation with different scripts and it works fine, but when I use this model for detection I only see an increase in CPU usage.
I tried different versions of TensorFlow (1.12, 1.14) and different combinations of the CUDA Toolkit (9.0, 10.0) and cuDNN (7.4.2, 7.5.1, 7.6.1), but it is always the same. I also tried both Windows 7 and Ubuntu 16.04, with no difference. My project, however, requires much faster detection times.
System information:
System: Windows 7, Ubuntu 16.04
Tensorflow: 1.12, 1.14
GPU: GTX 970
Run the following Python code. If it detects the GPU, you can use the GPU for training; otherwise there is some problem:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
One more thing: just because your CPU is being utilized does not mean the GPU is not at work. The CPU will always be busy, but the GPU usage should also spike when you are training.
Paste the output of the above code in a comment if you are not sure about it.
Edit: After chatting with the OP in the comments, I have seen the suggested code. It uses a pretrained model, so no training is happening here. You are using the model, not training a new one, so no GPU is being used.
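If you want to confirm where the detection graph's ops actually end up, TF 1.x can log device placement when the session is created. A generic sketch (with the Object Detection API you would load the frozen detection graph instead of these toy ops):

import tensorflow as tf

# any small graph works for the check
a = tf.constant([1.0, 2.0], name='a')
b = tf.constant([3.0, 4.0], name='b')
c = a * b

config = tf.ConfigProto(log_device_placement=True)  # print which device each op is assigned to
with tf.Session(config=config) as sess:
    print(sess.run(c))  # the log should list ops on /device:GPU:0 if the GPU build is picked up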