Custom backward/optimization steps in pytorch-lightning

I would like to implement the training loop below in pytorch-lightning (to be read as pseudo-code). The peculiarity is that the backward and optimization steps are not performed for every batch.
(Background: I am trying to implement a few-shot learning algorithm; although I need to make predictions at every step -- the forward method -- I only need to perform the gradient updates at random -- the if block.)
for batch in batches:
    x, y = batch
    loss = forward(x, y)
    optimizer.zero_grad()
    if np.random.rand() > 0.5:
        loss.backward()
        optimizer.step()
My proposed solution entails implementing the backward and the optimizer_step methods as follows:
def backward(self, use_amp, loss, optimizer):
    self.compute_grads = False
    if np.random.rand() > 0.5:
        loss.backward()
        nn.utils.clip_grad_value_(self.enc.parameters(), 1)
        nn.utils.clip_grad_value_(self.dec.parameters(), 1)
        self.compute_grads = True
    return
def optimizer_step(self, current_epoch, batch_nb, optimizer, optimizer_i, second_order_closure=None):
    if self.compute_grads:
        optimizer.step()
        optimizer.zero_grad()
    return
Note: In this way I need to store a compute_grads attribute at the class level.
What is the "best-practice" way to implement it in pytorch-lightning? Is there a better way to use the hooks?

This is a good way to do it! That's what the hooks are for.
There is a new Callbacks module that might also be helpful:
https://pytorch-lightning.readthedocs.io/en/0.7.1/callbacks.html
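For reference, a minimal sketch of how these hooks might sit together in a LightningModule, reusing the hook signatures from the question; the class name, the enc/dec modules, and the training_step body are placeholders:

import numpy as np
import torch.nn as nn
import pytorch_lightning as pl

class RandomUpdateModule(pl.LightningModule):
    def __init__(self, enc, dec):
        super().__init__()
        self.enc = enc
        self.dec = dec
        self.compute_grads = False

    def training_step(self, batch, batch_nb):
        x, y = batch
        loss = self.forward(x, y)  # placeholder for the few-shot forward pass
        return {'loss': loss}

    def backward(self, use_amp, loss, optimizer):
        # randomly skip the backward pass, remembering whether grads exist
        self.compute_grads = False
        if np.random.rand() > 0.5:
            loss.backward()
            nn.utils.clip_grad_value_(self.enc.parameters(), 1)
            nn.utils.clip_grad_value_(self.dec.parameters(), 1)
            self.compute_grads = True

    def optimizer_step(self, current_epoch, batch_nb, optimizer, optimizer_i,
                       second_order_closure=None):
        # only step (and clear grads) when backward actually ran for this batch
        if self.compute_grads:
            optimizer.step()
            optimizer.zero_grad()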

Related

Why my cross entropy loss function does not converge?

I tried to write a cross entropy loss function by myself. My loss function gives the same loss value as the official one, but when I use my loss function in the code instead of the official cross entropy loss function, the code does not converge. When I use the official cross entropy loss function, the code converges. Here is my code; please give me some suggestions. Thanks very much.
The input 'out' is a tensor (B * C) and 'label' contains class indices (1 * B).
class MylossFunc(nn.Module):
    def __init__(self):
        super(MylossFunc, self).__init__()

    def forward(self, out, label):
        out = torch.nn.functional.softmax(out, dim=1)
        n = len(label)
        loss = torch.FloatTensor([0])
        loss = Variable(loss, requires_grad=True)
        tmp = torch.log(out)
        for i in range(n):
            loss = loss - torch.max(tmp[i][label[i]], torch.scalar_tensor(-100)) / n
        loss = torch.sum(loss)
        return loss
Instead of using torch.softmax and torch.log, you should use torch.log_softmax; otherwise your training will become unstable, with nan values everywhere.
This happens because when you take the softmax of your logits using the following line:
out = torch.nn.functional.softmax(out, dim=1)
you might get a zero in one of the components of out, and when you follow that by applying torch.log it will result in nan (since log(0) is undefined). That is why torch (and other common libraries) provide a single stable operation, log_softmax, to avoid the numerical instabilities that occur when you use torch.softmax and torch.log individually.
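As an illustration, a sketch of the custom loss rewritten along these lines, keeping the clamp at -100 from the original code (the class name is just illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyStableLossFunc(nn.Module):
    def forward(self, out, label):
        # one numerically stable op instead of softmax followed by log
        log_probs = F.log_softmax(out, dim=1)
        n = len(label)
        loss = torch.tensor(0.0)
        for i in range(n):
            # clamp each log-probability at -100, as in the original code
            loss = loss - torch.max(log_probs[i][label[i]], torch.tensor(-100.0)) / n
        return loss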

Trying to accumulate gradients in Pytorch, but getting RuntimeError when calling loss.backward

I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why, since my code looks like all these other examples I've seen (unless I'm just missing something major):
https://stackoverflow.com/a/62076913/1227353
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different size, so I can't send the output batch and the label batch into the loss function; I have to iterate over them together. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)

    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...)  # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        loss.backward()

    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)

    # CHANGE: we're gonna accumulate losses manually
    batch_loss = 0

    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...)  # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        batch_loss += loss  # CHANGE: accumulate!

    # CHANGE: do backprop outside for loop
    batch_loss.backward()

    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh, and as a side question: does where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
EDIT:
The second code segment was getting an error due to the fact that I was doing torch.set_grad_enabled(phase == 'train') but I had forgotten to wrap the call to batch_loss.backward() with an if phase == 'train'... my bad
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feels equivalent to setting BATCH_SIZE to 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on entirely different graphs?
It seems you have two issues here: you said you couldn't have batch_size=8 because of memory limitations, but later state that your labels are not of the same size. The latter seems much more important than the former. Anyway, I will try to answer your questions as best I can.
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
You want to call .backward() on every loop cycle; otherwise the batch will have no effect on the training. You can then call step() and zero_grad() only when batch_idx % 2 is 1 (i.e. for every other batch).
Here's an example which accumulates the gradient, not the loss:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

model = nn.Linear(10, 3)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

ds = TensorDataset(torch.rand(100, 10), torch.rand(100, 3))
dl = DataLoader(ds, batch_size=4)

for i, (x, y) in enumerate(dl):
    y_hat = model(x)
    loss = F.l1_loss(y_hat, y) / 2
    loss.backward()

    if i % 2:
        optim.step()
        optim.zero_grad()
Note this approach is different from accumulating the loss and backpropagating only once all batches (or some of them) have gone through the network. In the example above we backpropagate every 4 data points and update the model every 8 data points.
Oh, and as a side question: does where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
Usually torch's built-in losses have reduction='mean' set as default. This means the loss gets averaged over all batch elements that contributed to it, so the answer depends on your loss implementation.
However, if you are using gradient accumulation, then yes, you will need to divide your loss by the number of accumulation steps (here loss = F.l1_loss(y_hat, y) / 2), since your gradients will be accumulated twice.
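For instance, a sketch of how the normalization interacts with the reduction mode, reusing y_hat and y from the example above (the divisors follow the batch_size=4, two-step setup):

import torch.nn.functional as F

# reduction='mean' (the default) already averages over the 4 batch elements,
# so dividing by the 2 accumulation steps reproduces one batch-of-8 update
loss = F.l1_loss(y_hat, y, reduction='mean') / 2

# with reduction='sum' you would instead divide by the total number of
# data points accumulated before the optimizer step
loss = F.l1_loss(y_hat, y, reduction='sum') / (4 * 2)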
To read more about this, I recommend taking a look at this other SO post.

Pytorch: Custom Loss involving Norm of End-to-End Jacobian

Cross posting from Pytorch discussion boards
I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).
I’ve implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss term (by doing two backward passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian but don’t include it in the loss function, the model also never learns. At a high level, my code does the following:
1. Forward pass to get predicted classes, yhat, from inputs x
2. Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
3. Jacobian norm = x.grad.data.norm(2)
4. Set loss equal to classification loss + scalar coefficient * Jacobian norm
5. Run loss.backward()
I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.
Too much is required to produce a working example, so I’ve tried to extract the relevant code:
def train_model(model, train_dataloader, optimizer, loss_fn, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    train_loss = 0
    correct = 0
    for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
        batch_input, batch_target = batch_input.to(device), batch_target.to(device)
        optimizer.zero_grad()
        batch_input.requires_grad_(True)
        model_batch_output = model(batch_input)
        loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
        train_loss += loss.item()  # sum up batch loss
        loss.backward()
        optimizer.step()
and
def end_to_end_jacobian_loss(model_output, model_input):
    model_output.backward(
        torch.ones(*model_output.shape),
        retain_graph=True)
    jacobian = model_input.grad.data
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
Edit 1: I swapped my previous .backward() implementation for autograd.grad, and it apparently works! What's the difference?
def end_to_end_jacobian_loss(model_output, model_input):
    jacobian = autograd.grad(
        outputs=model_output['penultimate_layer'],
        inputs=model_input,
        grad_outputs=torch.ones(*model_output['penultimate_layer'].shape),
        retain_graph=True,
        only_inputs=True)[0]
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
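A minimal, self-contained sketch of the difference as I understand it: .backward() accumulates into the .grad of every leaf tensor it reaches (including model parameters), while autograd.grad returns the gradient for the requested inputs only and leaves all .grad attributes untouched:

import torch

x = torch.ones(3, requires_grad=True)
w = torch.ones(3, requires_grad=True)  # stands in for a model parameter
y = (w * x).sum()

# backward() writes gradients into x.grad AND w.grad, so a later
# loss.backward() adds on top of them and corrupts the parameter update
y.backward(retain_graph=True)
print(x.grad, w.grad)  # both are populated

x2 = torch.ones(3, requires_grad=True)
y2 = (w * x2).sum()
# autograd.grad returns d(y2)/d(x2) directly and touches no .grad attribute
g, = torch.autograd.grad(y2, x2, retain_graph=True)
print(g)        # tensor([1., 1., 1.])
print(x2.grad)  # None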

Pytorch - For loop LSTM and Trying to backward through the graph a second time, but the buffers have already been freed

I'm a pytorch beginner. I have a for-loop LSTM in my model for some reason. When I train the model, in the second epoch I get a RuntimeError like this:
Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
I tried to use retain_graph=True in the backward call, but this makes the training process really slow.
My network architecture is as follows: an LSTM plus fully connected layers.
def __init__(self, embedding_dim, hidden_size, batch_size, compare_num):
    super(MultiLayerClassifer, self).__init__()
    self.studentPath_layers = compare_num
    self.embedding_dim = embedding_dim
    self.hidden_size = hidden_size
    self.batch_size = batch_size
    self.lstm = nn.LSTM(self.embedding_dim * 3, hidden_size, batch_first=True)
    self.l1 = nn.Linear(hidden_size, int(hidden_size / 2))
    self.l2 = nn.Linear(int(hidden_size / 2), 1)
    self.allLinear1 = nn.Linear(self.studentPath_layers, class_num)
And forward:
def forward(self, inputs):
    sList = []
    for i in range(self.studentPath_layers):
        if i >= inputs.size()[1]:
            break
        sentence = inputs[:, i]
        sentence_embedding = sentence.view(sentence.shape[0], sentence.shape[1], -1)
        self.hidden = self.init_hidden(sentence.shape[0])
        output, self.hidden = self.lstm(sentence_embedding, self.hidden)
        s = self.l1(output[:, -1, :])
        s = self.l2(s)
        s = torch.tanh(s)
        sList.append(s)
    sVariable = torch.stack(sList, -1)
    y = self.allLinear1(sVariable)
    return y
The training process within an epoch is as follows:
for studentIdList, courseIdList, scoreList, sentenceList, embeddingList, typeList in t:
    model.train()
    embeddings_in = embeddingList.to(device)
    target = torch.tensor([tag_to_ix[s] for s in scoreList], dtype=torch.long).to(device)
    output = model(embeddings_in)
    output = output.reshape((output.shape[0], 3)).to(device)
    train_acc += (torch.max(output, 1)[1] == target).sum().item()
    loss = loss_function(output, target)
    if weight_decay > 0:
        loss = loss + reg_loss(model)
    train_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_num += embeddings_in.shape[0]
    step += 1
Can anyone help me? I think the for loop causes the problem, but I'm still confused after trying several methods.

Resume training with different loss function

I want to implement a two-step learning process where:
1. pre-train a model for a few epochs using the loss function loss_1
2. change the loss function to loss_2 and continue the training for fine-tuning
Currently, my approach is:
model.compile(optimizer=opt, loss=loss_1, metrics=['accuracy'])
model.fit_generator(…)
model.compile(optimizer=opt, loss=loss_2, metrics=['accuracy'])
model.fit_generator(…)
Note that the optimizer remains the same; only the loss function changes. I'd like to smoothly continue training, but with a different loss function. According to this post, re-compiling the model loses the optimizer state. Questions:
a) Will I lose the optimizer state even if I use the same optimizer, e.g. Adam?
b) If the answer to a) is yes, any suggestions on how to change the loss function to a new one without resetting the optimizer state?
EDIT:
As suggested by Simon Caby and based on this thread, I created a custom loss function with two loss computations that depend on the epoch number. However, it does not work for me. My approach:
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        c_epoch = K.get_value(current_epoch)
        if c_epoch < t_change:
            # compute loss_1
        else:
            # compute loss_2
    return custom_loss
And I compile as follows, after initializing current_epoch:
current_epoch = K.variable(0.)
model.compile(optimizer=opt, loss=loss_wrapper(5, current_epoch), metrics=...)
To update the current_epoch, I create the following callback:
class NewCallback(Callback):
    def __init__(self, current_epoch):
        self.current_epoch = current_epoch

    def on_epoch_end(self, epoch, logs={}):
        K.set_value(self.current_epoch, epoch)
model.fit_generator(..., callbacks=[NewCallback(current_epoch)])
The callback updates self.current_epoch correctly at every epoch, but the update never reaches the custom loss function: current_epoch keeps its initialization value forever, and loss_2 is never executed.
Any suggestion is welcome, thanks!
My answers:
a) Yes, and you should probably make your own learning rate scheduler in order to keep control of it:
keras.callbacks.LearningRateScheduler(schedule, verbose=0)
b) Yes, you can create your own loss function, including one that fluctuates between two different loss methods. See "Advanced Keras — Constructing Complex Custom Losses and Metrics":
https://towardsdatascience.com/advanced-keras-constructing-complex-custom-losses-and-metrics-c07ca130a618
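For (a), a sketch of such a scheduler callback; the decay schedule here is only a placeholder:

from keras.callbacks import LearningRateScheduler

def schedule(epoch):
    # placeholder: reproduce whatever schedule the pre-training run used
    return 0.001 * (0.1 ** (epoch // 10))

model.fit_generator(..., callbacks=[LearningRateScheduler(schedule, verbose=0)])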
If you change:
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        c_epoch = K.get_value(current_epoch)
        if c_epoch < t_change:
            # compute loss_1
        else:
            # compute loss_2
    return custom_loss
to:
def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        # compute loss_1 and loss_2
        bool_case_1 = K.less(current_epoch, t_change)
        num_case_1 = K.cast(bool_case_1, "float32")
        loss = num_case_1 * loss_1 + (1 - num_case_1) * loss_2
        return loss
    return custom_loss
it works.
We are essentially required to turn Python control flow into compositions of backend functions for the loss to work without re-compiling via model.compile(...). I am not satisfied with these hacks, and wish it were possible to set model.loss in a callback without re-compiling model.compile(...) afterwards (since then the optimizer states are reset).
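Putting the pieces together, a sketch of the full working setup; the two K.mean expressions are placeholders for the real loss_1 and loss_2:

from keras import backend as K

def loss_wrapper(t_change, current_epoch):
    def custom_loss(y_true, y_pred):
        loss_1 = K.mean(K.square(y_true - y_pred))  # placeholder pre-training loss
        loss_2 = K.mean(K.abs(y_true - y_pred))     # placeholder fine-tuning loss
        num_case_1 = K.cast(K.less(current_epoch, t_change), "float32")
        return num_case_1 * loss_1 + (1 - num_case_1) * loss_2
    return custom_loss

current_epoch = K.variable(0.)
model.compile(optimizer=opt, loss=loss_wrapper(5, current_epoch), metrics=['accuracy'])
model.fit_generator(..., callbacks=[NewCallback(current_epoch)])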
