Will switching GPU device affect the gradient in PyTorch back propagation? - pytorch

I am using PyTorch. During the computation, I move some data and operator A to the GPU. At an intermediate step, I move the data and operator B to the CPU and continue the forward pass.
My question is:
Operator B is very memory-consuming and cannot be run on the GPU. Will splitting the computation this way (some parts computed on the GPU, the others on the CPU) affect backpropagation?

PyTorch keeps track of the device each tensor lives on. If you move tensors with PyTorch's native .cpu() or .to('cpu') calls, you should be okay.
See, e.g., this model parallel tutorial, in which the computation is split between two different GPU devices.

If your model fits into the GPU memory, you can let PyTorch handle the parallel distribution for you with the DataParallel (one process, multiple threads) or DistributedDataParallel (multiple processes, multiple threads, single or multiple nodes) frameworks.
The code below checks whether more than one GPU is available (torch.cuda.device_count() > 1) and, if so, wraps the model with model = nn.DataParallel(model):
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)
DataParallel replicates the same model on all GPUs, and each GPU consumes a different partition of the input data. This can significantly accelerate training, but it does not work when the model is too large to fit on a single GPU.
To solve this problem, you might resort to a model parallel approach, which splits a single model onto different GPUs, rather than replicating the entire model on each GPU.
(e.g. a model m contains 10 layers: when using DataParallel, each GPU
will have a replica of each of these 10 layers, whereas when using
model parallel on two GPUs, each GPU could host 5 layers)
In the example below, .to('cuda:0') and .to('cuda:1') specify the device on which each layer is placed.
import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))
backward() then automatically takes location into consideration.
model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

This snippet suggests that the gradient is preserved when computation goes through different devices.
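import torch
import torch.nn as nn

def change_device():
    a = torch.rand((4, 32))
    m1 = nn.Linear(32, 32)        # m1 stays on the CPU
    cpu = m1(a)
    gpu = cpu.to(0)               # move the activation to GPU 0
    m2 = nn.Linear(32, 32).to(0)  # m2 lives on GPU 0
    out = m2(gpu)
    loss = out.sum()
    loss.backward()
    print(m1.weight.grad)         # the gradient reaches the CPU-side layer
    # works like magic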
"""
tensor([[ 0.7746, 1.0342, 0.8706, ..., 1.0993, 0.7975, 0.3915],
[-0.5369, -0.7169, -0.6034, ..., -0.7619, -0.5527, -0.2713],
[ 0.3607, 0.4815, 0.4053, ..., 0.5118, 0.3713, 0.1823],
...,
[ 1.1200, 1.4955, 1.2588, ..., 1.5895, 1.1531, 0.5660],
[-0.1582, -0.2112, -0.1778, ..., -0.2245, -0.1629, -0.0799],
[-0.4531, -0.6050, -0.5092, ..., -0.6430, -0.4665, -0.2290]])
"""
Modifying this snippet shows that the gradient is also preserved when a tensor moves from the GPU to the CPU.
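A minimal sketch of that reverse direction, matching the layout in the question (operator A on the GPU, operator B on the CPU). The names first_device, hidden, and grad_is_nonzero are illustrative and not from the original posts, and the CPU fallback is an assumption added only so the sketch runs on a machine without CUDA (the transfer is then a no-op).

```python
import torch
import torch.nn as nn

# Assumption: use GPU 0 when present, otherwise fall back to the CPU.
first_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
a = torch.rand(4, 32, device=first_device)
m1 = nn.Linear(32, 32).to(first_device)  # stands in for operator A
m2 = nn.Linear(32, 32)                   # stands in for operator B, kept on the CPU

hidden = m1(a)
out = m2(hidden.cpu())   # .cpu() is recorded by autograd like any other op
out.sum().backward()     # gradients flow back across the device boundary

grad_is_nonzero = bool(m1.weight.grad.abs().sum() > 0)
print(m1.weight.grad.device, grad_is_nonzero)
```

The gradient of m1 ends up on the same device as m1 itself, even though the loss was computed on the CPU.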

Related

Memory bottleneck with autoregressive transformer decoding

I am trying to train a transformer model for sequence modeling. Below is a standalone example:
import torch
import torch.nn as nn
criterion = nn.MSELoss()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=12)
memory = torch.rand(10, 32, 512)
y = torch.rand(20, 32, 512)
start_token = torch.ones((1,32,512))
tgt_input = torch.cat((start_token,y[:-1,:]),axis=0)
optimizer = torch.optim.Adam(transformer_decoder.parameters())
###################Teacher forced
while(True):
    optimizer.zero_grad()
    out = transformer_decoder(tgt_input, memory, nn.Transformer.generate_square_subsequent_mask(20,20))
    loss = criterion(out,y)
    print("loss: ", loss.item())
    loss.backward()
    optimizer.step()
For a 12 layer decoder, the model works fine on a personal machine with 8GB memory. The model is autoregressive and works with shifted targets. Given we provide targets above, I refer to this setting as "teacher forced".
However, at inference stage, we will not have targets fed as above, and one would need to condition on targets generated on the go. This setting is as follows:
###################Non Teacher forced
while(True):
    optimizer.zero_grad()
    predictions = torch.ones((1,32,512))
    for i in range(1,21):
        predictions = torch.cat((predictions, transformer_decoder(tgt_input[:i], memory, nn.Transformer.generate_square_subsequent_mask(i,i))[-1].unsqueeze(0)),axis=0)
        print("i: ", i, "predictions.shape: ", predictions.shape)
    loss = criterion(predictions[1:],y)
    print("loss: ", loss.item())
    loss.backward()
    optimizer.step()
I wish to train the model with a hybrid strategy, both with and without teacher forcing. However, the non-teacher-forced strategy causes an out-of-memory exception and doesn't work. For final inference (testing), it usually works under torch.no_grad(), but not in training. Can anyone explain exactly why this causes a memory bottleneck?
This is because the computational graph is unrolled. In the teacher-forced model, gradients are not propagated past the true values. In the non-teacher-forced model they are backpropagated through every decoding step, so the graph and the intermediate activations of each step accumulate in memory until backward() is called (similar to an RNN).
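A rough back-of-the-envelope count for the two loops in the question can make the difference concrete. This is only a sketch: it counts decoder positions and self-attention score entries per layer and ignores constant factors, and the variable names are illustrative.

```python
# Count decoder positions (and self-attention score entries) whose
# activations must be retained until backward() in the two training loops.
SEQ_LEN = 20  # target length used in the question

# Teacher forced: a single forward pass over all 20 positions.
tf_positions = SEQ_LEN
tf_attention = SEQ_LEN * SEQ_LEN

# Non-teacher forced: one forward pass per prefix length i = 1..20, and
# every pass stays in the graph until backward() is called.
ar_positions = sum(i for i in range(1, SEQ_LEN + 1))
ar_attention = sum(i * i for i in range(1, SEQ_LEN + 1))

print(tf_positions, ar_positions)   # 20 210
print(tf_attention, ar_attention)   # 400 2870
```

Under these assumptions the step-by-step loop keeps roughly ten times as many position activations alive as the teacher-forced pass, which is consistent with it running out of memory during training while working under torch.no_grad().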

increasing batch size during inference

I believed that the inference time per batch was independent of the batch size when using a GPU, but this minimal example tells me that this doesn't seem true:
import torch
from torch import nn
from tqdm import tqdm
BATCH_SIZE = 32
N_ITER = 10000
class NN(nn.Module):
    def __init__(self):
        super(NN, self).__init__()
        self.layer = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=3, bias=False)

    def forward(self, input):
        out = self.layer(input)
        return out

cnn = NN().cuda()
cnn.eval()
tensor = torch.rand(BATCH_SIZE, 3, 999, 999).cuda()
with torch.no_grad():
    for _ in tqdm(range(N_ITER), mininterval=0.1):
        out = cnn(tensor)
When increasing BATCH_SIZE, the time per iteration shown by tqdm increases proportionally:
[Figure: plot of inference time vs. batch size]
It was my belief that the GPU can process the entire tensor simultaneously, as long as it doesn't use all the memory. Maybe I don't understand something about how GPUs process data in parallel, so I would appreciate some insights here.
I am using a NVIDIA GeForce 2080 Ti, pytorch 1.6.0 and CUDA 10.2.
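That assumption is incorrect. GPUs have a lot of cores, but that does not mean they can process all the data at the same time. For instance, the RTX 2080 Ti has only 4352 CUDA cores, so once a batch contains more work than those cores can execute concurrently, the rest has to wait and the time per batch grows with the batch size.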
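To get a feel for the numbers, here is a small sketch that counts the multiply-accumulate operations (MACs) for the convolution in the question and compares them with the core count. It is a deliberately simplified model (it ignores memory bandwidth, scheduling, and the actual kernels cuDNN picks), and the helper conv_output_size is introduced here for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

CUDA_CORES = 4352  # RTX 2080 Ti, as stated above

# Conv2d(3, 32, kernel_size=5, stride=1, padding=3) on a 999 x 999 image.
side = conv_output_size(999, kernel=5, stride=1, padding=3)  # 1001
outputs_per_image = 32 * side * side        # 32 output channels
macs_per_output = 3 * 5 * 5                 # input channels * kernel area
macs_per_image = outputs_per_image * macs_per_output

for batch_size in (1, 8, 32):
    total_macs = batch_size * macs_per_image
    # Work per core if the MACs were spread perfectly evenly.
    print(batch_size, total_macs, total_macs // CUDA_CORES)
```

Even a single image requires billions of MACs, far more than the number of cores, so the work per core (and hence the time) scales with the batch size.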

How to compute gradient of the error with respect to the model input?

Given a simple 2 layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t the input. Are there existing Pytorch methods that can allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F
class NeuralNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim
model = NeuralNet(n_features=args.n_features,
                  n_hidden=args.n_hidden,
                  n_classes=args.n_classes,
                  dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
import time

def train(epoch):
    t = time.time()
    model.train()
    optimizer_w.zero_grad()  # the optimizer defined above is named optimizer_w
    output = model(features)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer_w.step()
    # grad_features = loss_train.backward() w.r.t to features
    # features -= 0.001 * grad_features

for epoch in range(args.epochs):
    train(epoch)
It is possible: just set input.requires_grad = True for each input batch you're feeding in, and after loss.backward() you should see that input.grad holds the expected gradient. In other words, if the input to your model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. In my comments below, I use i as a generalized index: if your features tensor has, for instance, three dimensions, replace it with features.grad[i, j, k], etc.
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they are describing, which is then used for differentiation. For instance c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters" and of all other variables as of functions of those. This message tells you that you can only set requires_grad of leaf variables.
Your problem is that at the first iteration, features is random (or however else you initialize) and is therefore a valid leaf. After your first iteration, features is no longer a leaf, since it becomes an expression calculated based on the previous ones. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
To deal with that, you need to use detach, which breaks the links in the tree and makes autograd treat a tensor as if it were a constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.
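Putting the two pieces together (requires_grad on the input, then detach before the next iteration), a minimal self-contained sketch might look like the following. The model, features, and labels here are hypothetical stand-ins rather than the asker's data, and F.cross_entropy replaces the log_softmax plus nll_loss pair for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-ins for the question's model, features, and labels.
model = nn.Sequential(nn.Linear(8, 16), nn.Sigmoid(), nn.Linear(16, 3))
features = torch.randn(5, 8, requires_grad=True)  # a leaf tensor
labels = torch.tensor([0, 1, 2, 1, 0])

for step in range(3):
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    # features.grad has the same shape as features.
    assert features.grad.shape == features.shape
    # Gradient step on the input; detach() makes the result a fresh leaf,
    # and requires_grad_() re-enables gradient tracking for the next step.
    features = (features.detach() - 0.001 * features.grad).requires_grad_()
    model.zero_grad()  # avoid accumulating parameter gradients across steps

print(features.is_leaf, features.requires_grad)
```

Because each update produces a new leaf tensor, requires_grad can be set on it without triggering the leaf-variable error described above.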

How can I change the padded input size per channel in Pytorch?

I am trying to set up an image classifier using Pytorch. My sample images have 4 channels and are 28x28 pixels in size. I am trying to use the built-in torchvision.models.inception_v3() as my model. Whenever I try to run my code, I get this error:
RuntimeError: Calculated padded input size per channel: (1 x 1).
Kernel size: (3 x 3). Kernel size can't greater than actual input size
at
/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THNN/generic/SpatialConvolutionMM.c:48
I can't find how to change the padded input size per channel or quite figure out what the error means. I figure that I must modify the padded input size per channel since I can't edit the Kernel size in the pre-made model.
I have tried padding, but it didn't help.
Here is a shortened part of my code that throws the error when I call train():
import torch
import torchvision as tv
import torch.optim as optim
from torch import nn
from torch.utils.data import DataLoader
model = tv.models.inception_v3()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001, weight_decay=0)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.9)
trn_dataset = tv.datasets.ImageFolder(
    "D:/tests/classification_test_data/trn",
    transform=tv.transforms.Compose([tv.transforms.RandomRotation((0,275)), tv.transforms.RandomHorizontalFlip(),
                                     tv.transforms.ToTensor()]))
trn_dataloader = DataLoader(trn_dataset, batch_size=32, num_workers=4, shuffle=True)

for epoch in range(0, 10):
    train(trn_dataloader, model, criterion, optimizer, lr_scheduler, 6, 32)
print("End of training")

def train(train_loader, model, criterion, optimizer, scheduler, num_classes, batch_size):
    model.train()
    scheduler.step()
    for index, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        outputs_flatten = flatten_outputs(outputs, num_classes)
        loss = criterion(outputs_flatten, labels)
        loss.backward()
        optimizer.step()

def flatten_outputs(predictions, number_of_classes):
    logits_permuted = predictions.permute(0, 2, 3, 1)
    logits_permuted_cont = logits_permuted.contiguous()
    outputs_flatten = logits_permuted_cont.view(-1, number_of_classes)
    return outputs_flatten
It could be due to the following. The PyTorch documentation for the inception_v3 model notes that the model expects input of shape N x 3 x 299 x 299. This is because the architecture contains a fully connected layer with a fixed input shape.
Important: In contrast to the other models the inception_v3 expects tensors with a size of N x 3 x 299 x 299, so ensure your images are sized accordingly.
https://pytorch.org/docs/stable/torchvision/models.html#inception-v3
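To see why a 28 x 28 image ends up as a 1 x 1 feature map, one can trace the spatial size through the first few layers. The sketch below uses the standard convolution output-size formula; the layer list in STEM is an assumption based on my recollection of the torchvision Inception v3 stem, not something stated in the thread, so check it against the model you actually load.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed stem layers: (description, kernel, stride, padding)
STEM = [
    ("conv 3x3, stride 2", 3, 2, 0),
    ("conv 3x3", 3, 1, 0),
    ("conv 3x3, padding 1", 3, 1, 1),
    ("max pool 3x3, stride 2", 3, 2, 0),
    ("conv 1x1", 1, 1, 0),
    ("conv 3x3", 3, 1, 0),
    ("max pool 3x3, stride 2", 3, 2, 0),
]

def stem_sizes(size):
    """Return the spatial size before the stem and after each stem layer."""
    sizes = [size]
    for _description, kernel, stride, padding in STEM:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

print(stem_sizes(299))  # [299, 149, 147, 147, 73, 73, 71, 35]
print(stem_sizes(28))   # [28, 13, 11, 11, 5, 5, 3, 1]
```

Under this assumption a 28 x 28 input has already shrunk to 1 x 1 after the stem, so any later unpadded 3 x 3 convolution fails with the "Kernel size can't greater than actual input size" error. Resizing the images to 299 x 299 avoids it.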
This may be a late post, but I sorted this out with a simple technique.
I got this kind of error while using a custom conv2d module: I had somehow missed passing the padding argument to my nn.Conv2d.
I found the bug as follows:
In my conv2d implementation, I printed out the shape of the output variable, which pointed me to the exact bug in my code.
model = VGG_BNN_ReLU('VGG11',10)
import torch
x = torch.randn(1,3,32,32)
model.forward(x)
Hope this helps. Happy learning.

model selection with Keras and Theano takes a very long time

I am performing nested cross-validation for model selection and performance estimation for a set of recurrent neural networks with different architectures and parameters using Keras and Theano, which are set up to run on an AWS P2 instance that has a Tesla K80 GPU with CUDA and cuDNN installed and enabled.
To perform model selection, I compare 30 models sampled from the parameter space using
param_grid = {
    'nb_hidden_layers': [1, 2, 3],
    'dropout_frac': [0.15, 0.20],
    'output_activation': ['sigmoid', 'softmax'],
    'optimization': ['Adedelta', 'RMSprop', 'Adam'],
    'learning_rate': [0.001, 0.005, 0.010],
    'batch_size': [64, 100, 150, 200],
    'nb_epoch': [10, 15, 20],
    'perform_batchnormalization': [True, False]
}
params_list = list(ParameterSampler(param_grid, n_iter = 30))
I then construct an RNN model using the function NeuralNetworkClassifier() defined below
def NeuralNetworkClassifier(params, units_in_hidden_layer = [50, 75, 100, 125, 150]):
    nb_units_in_hidden_layers = np.random.choice(units_in_hidden_layer, size = params['nb_hidden_layers'], replace = False)
    layers = [8] # number of features in every week
    layers.extend(nb_units_in_hidden_layers)
    layers.extend([1]) # node identifying quit/stay
    model = Sequential()
    # constructing all layers up to, but not including, the penultimate one
    layer_idx = -1 # this ensures proper generalization nb_hidden_layers = 1 (for which the loop below will never run)
    for layer_idx in range(len(layers) - 3):
        model.add(LSTM(input_dim = layers[layer_idx], output_dim = layers[layer_idx + 1], init = 'he_uniform', return_sequences = True)) # all LSTM layers, up to and including the penultimate one, need return_sequences = True
        if params['perform_batchnormalization'] == True:
            model.add(BatchNormalization())
            model.add(Activation('relu'))
        model.add(Dropout(params['dropout_frac']))
    # constructing the penultimate layer
    model.add(LSTM(input_dim = layers[layer_idx + 1], output_dim = layers[(layer_idx + 1) + 1], init = 'he_uniform', return_sequences = False)) # the last LSTM layer needs return_sequences = False
    if params['perform_batchnormalization'] == True:
        model.add(BatchNormalization())
        model.add(Activation('relu'))
    model.add(Dropout(params['dropout_frac']))
    # constructing the final layer
    model.add(Dense(output_dim = layers[-1], init = 'he_normal'))
    model.add(Activation(params['output_activation']))
    if params['optimization'] == 'SGD':
        optim = SGD()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'RMSprop':
        optim = RMSprop()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'Adam':
        optim = Adam()
    elif params['optimization'] == 'Adedelta':
        optim = Adadelta()
    model.compile(loss = 'binary_crossentropy', optimizer = optim, metrics = ['precision'])
    return model
which constructs an RNN whose number of hidden layers is given by the parameter 'nb_hidden_layers' in param_grid and whose number of hidden units in each layer is randomly sampled from the list [50, 75, 100, 125, 150]. At the end, this function compiles the model and returns it.
During the nested cross-validation (CV), the inner loop (which runs IN times) compares the performance of the 30 randomly selected models. After this step, I pick the best-performing model in the outer loop and estimate its performance on a hold-out dataset; this scheme is repeated OUT times. Therefore, I am compiling an RNN model OUTxINx30 times, and this takes an extremely long time; for example, when OUT=4 and IN=3, my method takes between 6 and 7 hours to finish.
I see that the GPU is being used sporadically (but the GPU usage never goes above 40%); however, most of the time, it is the CPU that is being used. My (uneducated) guess is that compile is being done on the CPU many, many times and takes the bulk of the computing time, whereas model fitting and predicting are done on the GPU and take a short time.
My questions:
Is there a way to remedy this situation?
Is compile actually done on the CPU?
How do people do nested CV to select the best RNN architecture?
Is it reasonable for me to perform this scheme on the production server? Do you suggest I do one big nested CV, that might take 24 hours, to select the best performing model and just use that one model afterwards on the production server?
Thank you all.
I can't answer all your questions, still hope it helps.
Compilation is done on the CPU because it mainly consists of symbolic graph operations and code generation. To make things worse, Theano's graph optimization is written in pure Python, which adds overhead compared to a C/C++ implementation.
To improve theano compilation time (at the cost of runtime performance):
Use less aggressive optimization
In /home/ec2-user/.theanorc add line:
optimizer = fast_compile
Or totally disable optimization with:
optimizer = None
Precompile some blocks
If there are common blocks shared among your models, you can precompile them with theano.OpFromGraph.
You can't do this in Keras alone, though.
Switch framework
Keras also supports a TensorFlow backend. Compared to Theano, TensorFlow works more like a VM than a compiler. Typically TF runs slower than Theano but compiles much faster.
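For a sense of scale, the figures from the question can be turned into an average time budget per model. This is plain arithmetic on the numbers given (OUT=4, IN=3, 30 sampled models, 6 to 7 hours); it ignores the extra fit of the best model in each outer fold, and the per-model figure lumps compilation, fitting, and prediction together.

```python
OUT, IN, N_MODELS = 4, 3, 30  # values from the question

n_builds = OUT * IN * N_MODELS  # number of compile + fit cycles
print(n_builds)                 # 360

for hours in (6, 7):
    seconds_per_model = hours * 3600 / n_builds
    print(hours, seconds_per_model)  # 6 -> 60.0, 7 -> 70.0
```

So each model gets roughly a minute in total. If compilation alone accounts for a large share of that minute, the options listed above (a cheaper optimizer setting or a faster-compiling backend) attack the dominant cost directly.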
