pytorch - difference between using ignore_index and not

Good morning.
I'm training an image captioning model and wondering whether there is any difference between the two pieces of code below.
I'm training with the first one, and I found that for some samples the model keeps producing random values after the end of the sequence (where ideally there should be padding).
Is there any difference between the two?
# First version: index 0 is the pad token
criterion = nn.CrossEntropyLoss(ignore_index=0)
# computing loss
loss = criterion(pred, target)
loss.backward()
optimizer.step()
# Second version: default criterion, pad mask applied by hand
criterion = nn.CrossEntropyLoss()
# computing loss
pad_location = torch.ne(target, 0)
loss = criterion(pred, target)
loss *= pad_location
loss.backward()
optimizer.step()
Thanks
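As a rough way to compare the two approaches, the sketch below uses made-up logits and targets (the shapes and values are assumptions, not taken from the question). With the default mean reduction, ignore_index=0 averages the loss over the non-pad targets only. A hand-made mask gives the same number only if the criterion is built with reduction="none" and the masked losses are divided by the number of non-pad tokens. Multiplying the already-averaged scalar loss by the mask, as in the second version of the question, yields a tensor with one entry per target, which loss.backward() cannot be called on without extra arguments.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(6, 5)                     # (N, C) logits, toy values
target = torch.tensor([3, 1, 4, 0, 0, 0])    # index 0 is the pad token

# Version 1: let the criterion skip pad targets.
loss_ignore = nn.CrossEntropyLoss(ignore_index=0)(pred, target)

# Hand-made equivalent: per-token losses, masked, averaged over non-pad tokens.
mask = torch.ne(target, 0)
per_token = nn.CrossEntropyLoss(reduction="none")(pred, target)
loss_masked = (per_token * mask).sum() / mask.sum()

# Version 2 as written in the question: scalar mean loss times the mask.
broken = nn.CrossEntropyLoss()(pred, target) * mask

print(loss_ignore.item(), loss_masked.item(), broken.shape)
```

In both working variants the pad positions contribute nothing to the loss, so the model is never trained on what to output there. That would be consistent with seeing arbitrary values after the end of the sequence.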

Related

I am fairly new to PyTorch and am struggling with a classification problem.

I built a very simple model:
class classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.classify = nn.Sequential(
            nn.Linear(166,80),
            nn.Tanh(),
            nn.Linear(80,40),
            nn.Tanh(),
            nn.Linear(40,1),
            nn.Softmax()
        )
    def forward(self, x):
        pred = self.classify(x)
        return pred
model = classifier()
The loss function and optimizer are defined as
criteria = nn.BCEWithLogitsLoss()
iteration = 1000
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
and here is the training and evaluation section
for epoch in range(iteration):
    model.train()
    y_pred = model(x_train)
    loss = criteria(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.eval()
    with torch.inference_mode():
        test_pred = model(x_test)
        test_loss = criteria(test_pred, y_test)
    if epoch % 100 == 0:
        print(loss)
        print(test_loss)
I received the same loss values, and by debugging, I found that the weights were not being updated.
The problem is in the network architecture: you are using a Softmax layer on a single-valued output at the end. By the definition of the softmax function, for an output vector x and index i we have:
softmax(x_i) = e^{x_i} / sum_j (e^{x_j})
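Here, you only have a single-valued output, so the sum in the denominator has one term and the output of your neural network is always 1, irrespective of the inputs or the weights. A constant output also means the gradient with respect to every weight is zero, which is why nothing updates. To fix this, remove the Softmax layer at the end and return the raw logit. Sigmoid is the appropriate activation for a single-output binary classifier, but BCEWithLogitsLoss already applies it internally, so you should not add one to the model.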
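The effect is easy to check in isolation. The sketch below applies softmax to a toy batch of single-valued outputs (the tensor shape is an assumption for illustration) and inspects both the output and the gradient that flows back through it.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 1, requires_grad=True)   # batch of 4, one output each

# Softmax over a dimension of size 1: e^x / e^x for every row.
out = torch.softmax(logits, dim=1)

# Backpropagate through the softmax and look at the gradient.
out.sum().backward()

print(out.squeeze(1))          # every entry is 1
print(logits.grad.squeeze(1))  # every entry is 0
```

Because the output does not depend on the logits, no loss placed on top of it can produce a non-zero gradient for the layers underneath.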
Another point is the order of calls here:
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Here optimizer.zero_grad() is called after the loss is calculated. This is not what stops the weights from updating, because gradients are only written by loss.backward(), and zero_grad() still runs before it. The conventional order, which is easier to read, is:
optimizer.zero_grad()
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
loss.backward()
optimizer.step()
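For reference, gradients are stored on the parameters only when loss.backward() runs, so what matters is that zero_grad() comes before backward(), not whether it comes before or after the forward pass. The sketch below checks this on a toy linear model; the model, data, and the helper name gradients_with are all hypothetical.

```python
import torch
import torch.nn as nn

def gradients_with(zero_first):
    """Return parameter gradients for one step with the chosen call order."""
    torch.manual_seed(0)
    model = nn.Linear(3, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(5, 3), torch.randn(5, 1)

    # Leave stale gradients behind, as a previous iteration would.
    model(x).sum().backward()

    if zero_first:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
    else:
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

grads_a = gradients_with(zero_first=True)
grads_b = gradients_with(zero_first=False)
same = all(torch.allclose(a, b) for a, b in zip(grads_a, grads_b))
print(same)
```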

PyTorch repeating loss and AUC when using cumulative loss

I am using PyTorch to accumulate losses and then run backpropagation (loss.backward()) once at the end.
When I do this, the loss barely changes and the AUC repeats exactly the same value. Is there anything I have overlooked when using cumulative losses?
Thank you so much for any reply. :)
Below is the loss calculation that occurs in one batch.
opt.zero_grad()
for s in range(len(qshft)):
    for a in range(len(qshft[0])):
        if(m[s][a]):
            y_pred = (y[s][a] * one_hot(qshft[s].long(), self.num_q)).sum(-1)
            y_pred = torch.masked_select(y_pred, m[s])
            t = torch.masked_select(rshft[s], m[s])
            loss += binary_cross_entropy(y_pred, t).clone().detach().requires_grad_(True)
            count += 1
loss = torch.tensor(loss/count,requires_grad=True)
loss.backward()
opt.step()
loss_mean.append(loss.detach().cpu().numpy())
The detach() call in the line below cuts the loss off from the computation graph, so loss.backward() produces no gradients for the model parameters and opt.step() leaves the weights unchanged. That is why the loss and AUC repeat.
loss += binary_cross_entropy(y_pred, t).clone().detach().requires_grad_(True)
You can do
loss += binary_cross_entropy(y_pred, t)
and change
loss = torch.tensor(loss/count,requires_grad=True)
(which also creates a new tensor that is disconnected from the graph) to
loss = loss/count
Make sure you reset count and loss to 0 every time you enter this part.
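A minimal sketch of the difference, using a single toy parameter w in place of the model (all names and values here are assumptions for illustration):

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

# Detached: the loss becomes a fresh leaf tensor with no link back to w.
detached_loss = (w * x).sum().clone().detach().requires_grad_(True)
detached_loss.backward()
grad_after_detached = w.grad            # still None, nothing reached w

# Graph kept: accumulate the tensors themselves, then average.
total, count = 0.0, 0
for _ in range(2):
    total = total + (w * x).sum()
    count += 1
loss = total / count
loss.backward()
grad_after_kept = w.grad.clone()        # equals x

print(grad_after_detached, grad_after_kept)
```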

loss.backward() with minibatch in pytorch

I came across this code online and I was wondering whether I interpreted it correctly. Below is part of a gradient descent process. The full code is available through the link https://jovian.ml/aakashns/03-logistic-regression. My question is as follows: during the training step, I guess the author is trying to minimize the loss for each batch by updating the parameters. However, how can we be sure the total loss over all training samples is minimized if loss.backward() is only applied to the batch loss?
def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history
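One way to look at the question: when the batches have equal size and the loss is a mean over samples, the average of the per-batch gradients at a fixed set of weights equals the gradient of the loss over the whole dataset. Each batch step therefore follows a noisy estimate of the full gradient rather than an unrelated objective. The sketch below checks this on a toy linear regression; the data, the mse helper, and the batch size are assumptions and not part of the linked notebook.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3)
y = torch.randn(8)
w = torch.zeros(3, requires_grad=True)

def mse(Xb, yb):
    # Mean squared error of a linear model on one batch.
    return ((Xb @ w - yb) ** 2).mean()

# Gradient of the loss over all 8 samples.
full_grad = torch.autograd.grad(mse(X, y), w)[0]

# Gradients of the two mini-batches of 4, evaluated at the same w.
batch_grads = [
    torch.autograd.grad(mse(X[i:i + 4], y[i:i + 4]), w)[0] for i in (0, 4)
]
mean_batch_grad = torch.stack(batch_grads).mean(dim=0)

print(full_grad, mean_batch_grad)
```

In actual training the weights change between batches, so the match is only approximate, and there is no guarantee of reaching the exact minimum of the total loss.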

How to train a CNN model?

While trying to train a CNN model, I came across the code shown below:
def train(n_epochs, loaders, model, optimizer, criterion):
    for epoch in range(1,n_epochs):
        train_loss = 0
        valid_loss = 0
        model.train()
        for i, (data,target) in enumerate(loaders['train']):
            # zero the parameter (weight) gradients
            optimizer.zero_grad()
            # forward pass to get outputs
            output = model(data)
            # calculate the loss
            loss = criterion(output, target)
            # backward pass to calculate the parameter gradients
            loss.backward()
            # update the parameters
            optimizer.step()
Can someone please tell me why the second for loop is used?
i.e. for i, (data,target) in enumerate(loaders['train']):
And why are optimizer.zero_grad() and optimizer.step() used?
torch.utils.data.DataLoader comes in handy when you need to prepare data batches (and perhaps shuffle them before every run).
data_train_loader = DataLoader(data_train, batch_size=64, shuffle=True)
In the code from the question, the first for loop iterates over the epochs, while the second loop iterates over the training dataset, split into batches by a DataLoader like the one above. For example:
for batch_idx, samples in enumerate(data_train_loader):
    # samples will be a 64 x D dimensional tensor
    # batch_idx is each batch index
    pass
Learn more in the torch.utils.data.DataLoader documentation.
optimizer.zero_grad(): Before the backward pass, use the optimizer object to zero all of the gradients for the tensors it will update (which are the learnable weights of the model).
optimizer.step(): We generally use optimizer.step() to make the gradient descent step. Calling the step function on an Optimizer makes an update to its parameters.
Learn more about these from here.
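The reason zero_grad() is needed on every iteration is that backward() adds to any gradient already stored on a parameter. A small sketch with a single toy parameter (the values and learning rate are assumptions for illustration):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

(w * 2).sum().backward()
first = w.grad.item()       # gradient of 2*w with respect to w

(w * 2).sum().backward()
second = w.grad.item()      # accumulated on top of the first call

opt.zero_grad()             # clear before the step we actually want
(w * 2).sum().backward()
opt.step()                  # w <- w - lr * grad

print(first, second, w.item())
```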
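The optimizer is first constructed with the model's parameters, like this (missing from your code). Note that Adam takes no momentum argument; use betas for Adam, or switch to optim.SGD if you want momentum=0.9:
optimizer = optim.Adam(model.parameters(), lr=0.001)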
This line
loss = criterion(output, target)
calculates the loss for a single batch, where target comes from the (data, target) tuple and data is the model input that produced output.
This step:
optimizer.zero_grad()
zeroes the gradients of all parameters held by the optimizer. It matters on every iteration, not only at initialization, because PyTorch accumulates gradients across backward() calls.
The part
loss.backward()
calculates the gradients, and optimizer.step() updates the model's weights and biases (parameters).
In PyTorch you typically use the DataLoader class to load the training and validation sets.
loaders['train']
is probably a DataLoader over the full training set; one complete pass over it is a single epoch.

How to deal with mini-batch loss in Pytorch?

I feed mini-batch data to the model, and I want to know how to deal with the loss. Could I accumulate the loss and then call backward, like this:
...
def neg_log_likelihood(self, sentences, tags, length):
    self.batch_size = sentences.size(0)
    logits = self.__get_lstm_features(sentences, length)
    real_path_score = torch.zeros(1)
    total_score = torch.zeros(1)
    if USE_GPU:
        real_path_score = real_path_score.cuda()
        total_score = total_score.cuda()
    for logit, tag, leng in zip(logits, tags, length):
        logit = logit[:leng]
        tag = tag[:leng]
        real_path_score += self.real_path_score(logit, tag)
        total_score += self.total_score(logit, tag)
    return total_score - real_path_score
...
loss = model.neg_log_likelihood(sentences, tags, length)
loss.backward()
optimizer.step()
I wonder whether accumulating the loss like this could lead to gradient explosion.
So, should I call backward inside the loop:
for sentence, tag, leng in zip(sentences, tags, length):
    loss = model.neg_log_likelihood(sentence, tag, leng)
    loss.backward()
    optimizer.step()
Or should I use the mean loss, as with reduce_mean in TensorFlow?
loss = reduce_mean(losses)
loss.backward()
The loss should be averaged over the mini-batch. The built-in PyTorch loss functions such as CrossEntropyLoss have a separate reduction parameter for exactly this, and its default value, 'mean', averages the loss over the mini-batch.
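To illustrate the reduction parameter, the sketch below compares the default 'mean' with 'sum' on random toy logits (the shapes are assumptions). The summed loss is the mean loss times the batch size, so its gradients grow with the batch size as well, which is the scaling the question worries about.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 8
logits = torch.randn(batch_size, 5)
target = torch.randint(0, 5, (batch_size,))

mean_loss = nn.CrossEntropyLoss()(logits, target)                 # default: reduction="mean"
sum_loss = nn.CrossEntropyLoss(reduction="sum")(logits, target)   # accumulated loss

print(mean_loss.item(), sum_loss.item() / batch_size)
```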
We usually:
1. get the loss from the loss function
2. (if necessary) manipulate the loss, for example by applying class weighting
3. calculate the mean loss of the mini-batch
4. calculate the gradients with loss.backward()
5. (if necessary) manipulate the gradients, for example by applying gradient clipping for some RNN models to avoid gradient explosion
6. update the weights using the optimizer.step() function
So in your case, you can first get the mean loss of the mini-batch and then calculate the gradient using the loss.backward() function and then utilize the optimizer.step() function for the weight updating.
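The steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the linear model, the random data, the class_weight values, and max_norm=1.0 are all made up, and a CRF-style loss like the one in the question would take the place of the cross-entropy criterion.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                               # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss(reduction="none")     # keep per-sample losses

x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))
class_weight = torch.tensor([1.0, 2.0, 0.5])          # hypothetical weights

optimizer.zero_grad()
per_sample = criterion(model(x), y)                   # 1. get the loss
per_sample = per_sample * class_weight[y]             # 2. manipulate the loss
loss = per_sample.mean()                              # 3. mean over the mini-batch
loss.backward()                                       # 4. compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 5. clip gradients
optimizer.step()                                      # 6. update the weights

# Total gradient norm after clipping, for inspection.
clipped_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
print(loss.item(), clipped_norm.item())
```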
