PyTorch: changing learning rate during training

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

x=np.linspace(0,20,100)
g=1+0.2*np.exp(-0.1*(x-7)**2)
y=np.sin(g*x)
plt.plot(x,y)
plt.show()
x=torch.from_numpy(x)
y=torch.from_numpy(y)
x=x.reshape((100,1))
y=y.reshape((100,1))
MM=nn.Sequential()
MM.add_module('L1',nn.Linear(1,128))
MM.add_module('R1',nn.ReLU())
MM.add_module('L2',nn.Linear(128,128))
MM.add_module('R2',nn.ReLU())
MM.add_module('L3',nn.Linear(128,128))
MM.add_module('R3',nn.ReLU())
MM.add_module('L4',nn.Linear(128,128))
MM.add_module('R5',nn.ReLU())
MM.add_module('L5',nn.Linear(128,1))
MM.double()
L=nn.MSELoss()
lr=3e-05 ######
opt=torch.optim.Adam(MM.parameters(),lr) #########
Epo=[]
COST=[]
for epoch in range(8000):
    opt.zero_grad()
    err=L(torch.sin(MM(x)),y)
    Epo.append(epoch)
    COST.append(err.item())  # store a plain float, not a tensor attached to the graph
    err.backward()
    if epoch%100==0:
        print(err)
    opt.step()
Epo=np.array(Epo)/1000.
COST=np.array(COST)
pred=torch.sin(MM(x)).detach().numpy()
Trans=MM(x).detach().numpy()
x=x.reshape((100))
pred=pred.reshape((100))
Trans=Trans.reshape((100))
fig = plt.figure(figsize=(10,10))
#ax = fig.gca(projection='3d')
ax = fig.add_subplot(2,2,1)
surf = ax.plot(x,y,'r')
#ax.plot_surface(x_dat,y_dat,z_pred)
#ax.plot_wireframe(x_dat,y_dat,z_pred,linewidth=0.1)
fig.tight_layout()
#plt.show()
ax = fig.add_subplot(2,2,2)
surf = ax.plot(x,pred,'g')
fig.tight_layout()
ax = fig.add_subplot(2,2,3)
surff=ax.plot(Epo,COST,'y+')
plt.ylim(0,1100)
ax = fig.add_subplot(2,2,4)
surf = ax.plot(x,Trans,'b')
fig.tight_layout()
plt.show()
This is the original code (code 1).
To change the learning rate during training, I tried moving the definition of 'opt' into the loop:
Epo=[]
COST=[]
for epoch in range(8000):
    lr=3e-05 ######
    opt=torch.optim.Adam(MM.parameters(),lr) #########
    opt.zero_grad()
    err=L(torch.sin(MM(x)),y)
    Epo.append(epoch)
    COST.append(err.item())
    err.backward()
    if epoch%100==0:
        print(err)
    opt.step()
This is code 2.
Code 2 also runs, but the result is quite different from that of code 1.
What is the difference, and what should I do to change the learning rate during training (like lr=(1-epoch/10000*0.99))?

You shouldn't move the optimizer definition into the training loop, because the optimizer keeps other information related to the training history. In the case of Adam, for example, running averages of the gradients (and of their squares) are stored and updated dynamically in the optimizer's internal state.
So instantiating a new optimizer on each iteration makes you lose this history.
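A minimal sketch of what that lost history looks like, assuming a tiny stand-in model (the nn.Linear(1, 1) below is hypothetical, not the network from the question). After one step, Adam holds per-parameter running averages, while a freshly constructed optimizer holds none:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)  # hypothetical stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=3e-5)

# One ordinary training step so that Adam populates its internal state.
loss = model(torch.ones(4, 1)).pow(2).mean()
loss.backward()
opt.step()

# Per-parameter state kept by Adam (running averages and a step counter).
state_keys = sorted(opt.state[model.weight].keys())
print(state_keys)

# A newly created optimizer starts with no state at all.
fresh_opt = torch.optim.Adam(model.parameters(), lr=3e-5)
fresh_state_size = len(fresh_opt.state)
print(fresh_state_size)
```

Re-creating the optimizer inside the loop therefore means every update is computed from empty running averages, which is why code 2 behaves differently from code 1.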
To update the learning rate dynamically, PyTorch provides many scheduler classes (exponential decay, cyclical decay, cosine annealing, ...). You can check the documentation for the full list of schedulers, or implement your own if needed: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
Example from the documentation: to decay the learning rate by multiplying it by 0.5 every 10 epochs, you can use the StepLR scheduler as follows:
opt = torch.optim.Adam(MM.parameters(), lr)
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
And in your original code 1 you can do:
for epoch in range(8000):
    opt.zero_grad()
    err=L(torch.sin(MM(x)),y)
    Epo.append(epoch)
    COST.append(err.item())
    err.backward()
    if epoch%100==0:
        print(err)
    opt.step()
    scheduler.step()  # update the learning rate once per epoch, after opt.step()
As I said, there are many other types of lr schedulers, so you can choose one from the documentation or implement your own.
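For a custom rule like the one in the question, LambdaLR is one option. The sketch below assumes the intended rule is "base learning rate times (1 - 0.99 * epoch / 10000)", which is only one reading of the formula in the question; the small model and random data are hypothetical placeholders.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)          # hypothetical stand-in model
x, y = torch.randn(8, 1), torch.randn(8, 1)

base_lr = 3e-5
opt = torch.optim.Adam(model.parameters(), lr=base_lr)

# LambdaLR multiplies the base learning rate by the factor returned here.
def decay(epoch):
    return 1.0 - 0.99 * epoch / 10000

scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=decay)

lrs = []
for epoch in range(5):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    scheduler.step()                       # the optimizer (and its state) is kept
    lrs.append(opt.param_groups[0]["lr"])  # learning rate used in the next epoch
```

The optimizer is created once, so Adam's running averages survive while only the learning rate changes.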

Related

Pytorch - Repeating Loss

I am new to PyTorch and I found a problem when displaying the loss of my model.
Pytorch Adam Optimizer - Model Loss Figure
Pytorch SGD Optimizer - Model Loss Figure
As you can see, the loss seems to go up and down multiple times, with a recurrent pattern (the pattern starts to repeat at the beginning of every epoch).
The full code can be found at: https://github.com/19valentin99/Kaggle/tree/main/Iris%20Flowers
in main_test.py (the # lines are the ones that I used to debug the code and the answer should be below).
When we just take the loss of the last element (or the loss over the whole epoch) we will see a smooth decrease in loss
The reason your loss is smooth is that you are looking at the loss of the exact same batch on every epoch. Indeed, your train data loader isn't shuffling your instances:
train2 = DataLoader(flowers_data_train, batch_size=BATCH_SIZE)
This means the same batch will appear last on every epoch. That's all there is to it: this doesn't mean the learning is different, it means you are looking at one part of the complete dataset loss.
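A small sketch of this effect, using a hypothetical ten-element dataset rather than the Iris data from the question. Without shuffle=True the last batch of every epoch contains the same samples; with shuffling, each epoch still covers the whole dataset but in a different order.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: the values 0..9.
ds = TensorDataset(torch.arange(10).float().unsqueeze(1))

# Default DataLoader: no shuffling, so batches are identical across epochs.
fixed = DataLoader(ds, batch_size=4)
last_batches = []
for epoch in range(3):
    for (xb,) in fixed:
        pass                       # xb ends up holding the last batch of the epoch
    last_batches.append(xb.clone())
same_last_batch = all(torch.equal(last_batches[0], b) for b in last_batches)

# With shuffle=True the composition of each batch changes between epochs,
# but one full pass still visits every sample exactly once.
torch.manual_seed(0)
shuffled = DataLoader(ds, batch_size=4, shuffle=True)
seen = sorted(v for (xb,) in shuffled for v in xb.flatten().tolist())
```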
The difference between "not working" and "working" is based on when the loss is recorded.
The idea is that the loss converges overall, but until it converges it jumps up and down.
While it jumps up and down, we might see a pattern if we sample too often. The pattern comes from the data we use for training (the data, split into batches, is the same every epoch).
As a result:
For the not-working version: I was recording the loss every epoch, after every batch.
For the working version: I was recording only the latest loss in the epoch.
Pytorch Adam Optimizer - Model Loss (working)
Pytorch SGD Optimizer - Model Loss (working)
Furthermore, I will attach the code which generates the non working version:
loss_list = []
for epoch in range(EPOCHS):
    for idx, (x, y) in enumerate(train_load):
        x, y = x.to(device), y.to(device)
        #Compute Error
        prediction = model(x)
        #print(prediction, y)
        loss = loss_fn(prediction, y)
        #debugging
        loss_list.append(loss.item())
        ##Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
plt.plot(loss_list)
plt.show()
The working code:
loss_list2 = np.zeros((EPOCHS,))
for epoch in range(EPOCHS):
    for batch, (x, y) in enumerate(train_load):
        x = x.to(device=device)
        y = y.to(device=device)
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        # Overwritten on every batch, so only the last batch's loss is kept per epoch
        loss_list2[epoch] = loss.item()
        # Zero gradients
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
plt.plot(loss_list2)
plt.show()
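The two recordings can also be kept side by side. The sketch below is an illustration on hypothetical random data (not the Iris code from the repository): it stores every batch loss, and additionally a sample-weighted mean per epoch, which is the "loss over the whole epoch" mentioned above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
EPOCHS = 3  # hypothetical value
ds = TensorDataset(torch.rand(20, 4), torch.randint(0, 3, (20,)))
train_load = DataLoader(ds, batch_size=8, shuffle=True)
model = nn.Linear(4, 3)  # hypothetical stand-in model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_losses, epoch_losses = [], []
for epoch in range(EPOCHS):
    running, seen = 0.0, 0
    for x, y in train_load:
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch_losses.append(loss.item())   # noisy, one point per batch
        running += loss.item() * len(x)    # weight by batch size (last batch is smaller)
        seen += len(x)
    epoch_losses.append(running / seen)    # one smoothed point per epoch
```

Plotting epoch_losses gives one point per epoch, while batch_losses shows the per-batch fluctuation.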
In the end, I would like to mention that I know that there are a couple of other threads out there that say how to solve this problem (like: clip the gradients, remove the last batch, model is too simple to capture the data) but in the end, what I discovered is that it wasn't actually a problem but more "when the recording of the data is done".
I hope that this will help other people as well.

Trying to accumulate gradients in Pytorch, but getting RuntimeError when calling loss.backward

I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why since my code looks like all these other examples I've seen (unless I'm just missing something major):
https://stackoverflow.com/a/62076913/1227353
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different size, so I can't send the output batch and the label batch into the loss function; I have to iterate over them together. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)
    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...) # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        loss.backward()
    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
    outputs_batch = model(inputs_batch)
    # CHANGE: we're gonna accumulate losses manually
    batch_loss = 0
    # have to do this because labels can't be stacked into a tensor
    for output, label in zip(outputs_batch, labels_batch):
        output_scaled = interpolate(...) # make output match label size
        loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
        batch_loss += loss # CHANGE: accumulate!
    # CHANGE: do backprop outside for loop
    batch_loss.backward()
    if batch_idx % 2 == 1:
        optimizer.step()
        optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
EDIT:
The second code segment was getting an error due to the fact that I was doing torch.set_grad_enabled(phase == 'train') but I had forgotten to wrap the call to batch_loss.backward() with an if phase == 'train'... my bad
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feels equivalent to setting BATCH_SIZE to 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on different graphs entirely?
It seems you have two issues here: you said you couldn't use batch_size=8 because of memory limitations, but later state that your labels are not all the same size. The latter seems much more important than the former. Anyway, I will try to answer your questions as best I can.
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
You want to call .backward() on every loop cycle, otherwise that batch will have no effect on the training. You can then call step() and zero_grad() only when batch_idx % 2 is nonzero (i.e. on every other batch).
Here's an example which accumulates the gradient, not the loss:
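import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 3)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
ds = TensorDataset(torch.rand(100, 10), torch.rand(100, 3))
dl = DataLoader(ds, batch_size=4)
for i, (x, y) in enumerate(dl):
    y_hat = model(x)
    loss = F.l1_loss(y_hat, y) / 2  # halve the loss: two batches are accumulated
    loss.backward()
    if i % 2:  # every second batch: apply the accumulated gradients
        optim.step()
        optim.zero_grad()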
Note that this approach is different from accumulating the loss and back-propagating only after all batches (or part of the batches) have gone through the network. In the example above we backpropagate every 4 datapoints and update the model every 8 datapoints.
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
Usually torch's built-in losses have reduction='mean' set as default. This means the loss gets averaged over all batch elements that contributed to calculating the loss. So this will depend on your loss implementation.
However, if you are using gradient accumulation, then yes, you will need to divide your loss by the number of accumulation steps (here loss = F.l1_loss(y_hat, y) / 2), since your gradients will be accumulated twice.
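The division by the number of accumulation steps can be checked numerically. The following sketch assumes a loss with reduction='mean' and two equally sized micro-batches; the model and data are hypothetical. It compares the gradient of one batch of 8 with the gradient accumulated over two batches of 4.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(10, 3)                      # hypothetical model
x, y = torch.rand(8, 10), torch.rand(8, 3)    # hypothetical data

# Reference: a single backward pass over the full batch of 8.
model.zero_grad()
F.l1_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 4, each loss divided by 2.
model.zero_grad()
for xb, yb in zip(x.split(4), y.split(4)):
    (F.l1_loss(model(xb), yb) / 2).backward()  # gradients add up in .grad
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(grads_match)
```

Without the division by 2, the accumulated gradient would be twice the full-batch gradient. If the micro-batches had different sizes, each loss would need to be weighted by its share of the samples instead.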
To read more about this, I recommend taking a look at this other SO post.

TensorFlow 2.0 learning rate scheduler with tf.GradientTape

I am using TensorFlow 2.0 and Python 3.8 and I want to use a learning rate scheduler for which I have a function. I have to train a neural network for 160 epochs, where the initial learning rate is 0.01 and the learning rate is to be decreased by a factor of 10 at epochs 80 and 120.
def scheduler(epoch, current_learning_rate):
    if epoch == 79 or epoch == 119:
        return current_learning_rate / 10
    else:
        return min(current_learning_rate, 0.001)
How can I use this learning rate scheduler function with 'tf.GradientTape()'? I know how to use this using "model.fit()" as a callback:
callback = tf.keras.callbacks.LearningRateScheduler(scheduler)
How do I use this while using custom training loops with "tf.GradientTape()"?
Thanks!
The learning rate for different epochs can be set using the lr attribute of the TensorFlow Keras optimizer. The lr attribute of the optimizer still exists because TensorFlow 2 keeps backward compatibility with Keras (for more details, refer to the source code here).
Below is a small snippet of how the learning rate can be varied across different epochs. self._train_step is similar to the train_step function defined here.
# These are methods of a trainer class (hence the self. calls below).
def set_learning_rate(self, epoch):
    if epoch > 180:
        optimizer.lr = 0.5e-6
    elif epoch > 160:
        optimizer.lr = 1e-6
    elif epoch > 120:
        optimizer.lr = 1e-5
    elif epoch > 3:
        optimizer.lr = 1e-4
def train(self, epochs, train_data, val_data):
    prev_val_loss = float('inf')
    for epoch in range(epochs):
        self.set_learning_rate(epoch)
        for images, labels in train_data:
            self._train_step(images, labels)
        for images, labels in val_data:
            self._test_step(images, labels)
Another alternative would be to use tf.keras.optimizers.schedules:
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [80*num_steps, 120*num_steps, 160*num_steps, 180*num_steps],
    [1e-3, 1e-4, 1e-5, 1e-6, 5e-6]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
Note that here one can't provide the epochs directly; instead the boundaries have to be given in steps, where num_steps is the number of steps per epoch, i.e. len(train_data)/batch_size.
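To make the boundaries/values idea concrete, here is a plain-Python sketch (not the Keras class itself) of the schedule described in the question: 0.01 initially, divided by 10 at epochs 80 and 120. The function name and the zero-based epoch convention are assumptions made for illustration.

```python
def piecewise_lr(epoch, boundaries=(80, 120), values=(1e-2, 1e-3, 1e-4)):
    """Return the learning rate for a zero-based epoch index.

    values has one more entry than boundaries: values[i] applies until
    boundaries[i] is reached, and the last value applies afterwards.
    """
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]


# Learning rate at a few selected epochs of a 160-epoch run.
for epoch in (0, 79, 80, 119, 120, 159):
    print(epoch, piecewise_lr(epoch))
```

In a custom training loop, a value computed this way could be assigned to the optimizer's learning rate at the start of each epoch.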
A learning rate schedule needs a step value, which cannot be specified when using GradientTape followed by optimizer.apply_gradients().
So you should not pass the schedule directly as the learning_rate of the optimizer.
Instead, you can first call the schedule function to get the value for the current step and then update the learning rate value in the optimizer:
import tensorflow as tf

w = tf.Variable(1.0)  # hypothetical trainable variable, for illustration only
optim = tf.keras.optimizers.SGD()
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(1e-2, 1000, 0.9)
for step in range(1000):
    # evaluate the schedule at the current step and push it into the optimizer
    optim.learning_rate = lr_schedule(step)
    with tf.GradientTape() as tape:
        loss = w ** 2  # placeholder for the function to differentiate
    grads = tape.gradient(loss, [w])
    optim.apply_gradients(zip(grads, [w]))

Need very different learning rate for manual updates vs. using model

I am currently just trying to write some pedagogical material, in which I borrow from some common examples that have been reworked numerous times on the web.
I have a simple bit of code where I manually create tensors for layers, and update them within a loop. E.g.:
w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)
learning_rate = 1e-6
for t in range(501):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    # update in place without recording the update in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
This works great. Then I construct similar code using actual modules:
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(501):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    for param in model.parameters():
        param.data -= learning_rate * param.grad
This also works great.
BUT there is a difference here. If I use a 1e-4 LR in the manual case, the loss explodes: it becomes large, then inf, then nan. So that's no good. If I use a 1e-6 LR in the model case, the loss decreases far too slowly.
Basically I'm just trying to understand why learning rate means something very different in these two snippets which are otherwise equivalent.
The crucial difference is the initialization of the weights. The weight matrix in an nn.Linear is initialized in a smart way: it is scaled down according to the number of input features rather than drawn from a unit normal as with torch.randn. I'm pretty sure that if you construct both models and copy the weight matrices from one to the other, you'll get consistent behavior.
Additionally, please note that the two models are not equivalent, as your handcrafted model lacks biases, which matters.
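The scale gap between the two initializations can be measured directly. This sketch assumes hypothetical layer sizes (D_in = 1000, H = 100) and relies on the documented default of nn.Linear, which draws weights uniformly from [-1/sqrt(in_features), 1/sqrt(in_features)].

```python
import math
import torch

torch.manual_seed(0)
D_in, H = 1000, 100  # hypothetical sizes, chosen for illustration

w_manual = torch.randn(D_in, H)     # unit-normal weights, as in the manual snippet
linear = torch.nn.Linear(D_in, H)   # default nn.Linear initialization

manual_std = w_manual.std().item()
linear_std = linear.weight.std().item()
linear_max = linear.weight.abs().max().item()
bound = 1 / math.sqrt(D_in)

print(f"manual std: {manual_std:.4f}")
print(f"linear std: {linear_std:.4f}, max |w|: {linear_max:.4f}, bound: {bound:.4f}")
```

With weights this much larger, the outputs and gradients of the manual network are also much larger, so the same learning rate produces far bigger updates. That is consistent with needing 1e-6 in one case and 1e-4 in the other.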

How to properly update the weights in PyTorch?

I'm trying to implement the gradient descent with PyTorch according to this schema but can't figure out how to properly update the weights. It is just a toy example with 2 linear layers with 2 nodes in hidden layer and one output.
Learning rate = 0.05;
target output = 1
https://hmkcode.github.io/ai/backpropagation-step-by-step/
Forward
Backward
My code is as following:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class MyNet(nn.Module):
    def __init__(self):
        super(MyNet, self).__init__()
        self.linear1 = nn.Linear(2, 2, bias=None)
        self.linear1.weight = torch.nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
        self.linear2 = nn.Linear(2, 1, bias=None)
        self.linear2.weight = torch.nn.Parameter(torch.tensor([[0.14, 0.15]]))

    def forward(self, inputs):
        out = self.linear1(inputs)
        out = self.linear2(out)
        return out

losses = []
loss_function = nn.L1Loss()
model = MyNet()
optimizer = optim.SGD(model.parameters(), lr=0.05)
input = torch.tensor([2.0,3.0])
print('weights before backpropagation = ', list(model.parameters()))
for epoch in range(1):
    result = model(input )
    loss = loss_function(result , torch.tensor([1.00],dtype=torch.float))
    print('result = ', result)
    print("loss = ", loss)
    model.zero_grad()
    loss.backward()
    print('gradients =', [x.grad.data for x in model.parameters()] )
    optimizer.step()
    print('weights after backpropagation = ', list(model.parameters()))
The result is following :
weights before backpropagation = [Parameter containing:
tensor([[0.1100, 0.2100],
[0.1200, 0.0800]], requires_grad=True), Parameter containing:
tensor([[0.1400, 0.1500]], requires_grad=True)]
result = tensor([0.1910], grad_fn=<SqueezeBackward3>)
loss = tensor(0.8090, grad_fn=<L1LossBackward>)
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
weights after backpropagation = [Parameter containing:
tensor([[0.1240, 0.2310],
[0.1350, 0.1025]], requires_grad=True), Parameter containing:
tensor([[0.1825, 0.1740]], requires_grad=True)]
Forward pass values:
2x0.11 + 3*0.21=0.85 ->
2x0.12 + 3*0.08=0.48 -> 0.85x0.14 + 0.48*0.15=0.191 -> loss =0.191-1 = -0.809
Backward pass: let's calculate w5 and w6 (output node weights)
w = w - (prediction-target)x(gradient)x(output of previous node)x(learning rate)
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
w6= 0.15 -(0.191-1)*1*0.48*0.05= 0.15 + 0.019= 0.169
In my example Torch doesn't multiply the loss by derivative so we get wrong weights after updating. For the output node we got new weights w5,w6 [0.1825, 0.1740] , when it should be [0.174, 0.169]
Moving backward to update the first weight of the output node (w5) we need to calculate: (prediction-target)x(gradient)x(output of previous node)x(learning rate)=-0.809*1*0.85*0.05=-0.034. Updated weight w5 = 0.14-(-0.034)=0.174. But instead pytorch calculated new weight = 0.1825. It forgot to multiply by (prediction-target)=-0.809. For the output node we got gradients -0.8500 and -0.4800. But we still need to multiply them by loss 0.809 and learning rate 0.05 before we can update the weights.
What is the proper way of doing this?
Should we pass 'loss' as an argument to backward() as following: loss.backward(loss) .
That seems to fix it. But I couldn't find any example on this in documentation.
You should call .zero_grad() on the optimizer, i.e. optimizer.zero_grad(), not on the loss or the model as suggested in the comments (calling it on the model works, but it is less clear and readable IMO).
Apart from that, your parameters are updated correctly, so the error is not on PyTorch's side.
Based on gradient values you provided:
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
Let's multiply all of them by your learning rate (0.05):
gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]),
tensor([[-0.0425, -0.024]])]
Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:
parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
tensor([[0.1825, 0.1740]])]
What you have done is take the gradients calculated by PyTorch and multiply them again by the output of the previous node, and that's not how it works!
What you've done:
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
What should have been done (using PyTorch's results):
w5 = 0.14 - (-0.85*0.05) = 0.1825
No multiplication by the previous node's output is needed here; it is done behind the scenes (that's what .backward() does: it calculates the correct gradients for all of the nodes).
If you want to calculate them manually, you have to start at the loss (with delta being one) and backprop all the way down (do not use learning rate here, it's a different story!).
After all of them are calculated, you can multiply each gradient by the optimizer's learning rate (or apply any other update formula, e.g. momentum), and that gives you the correct update.
How to calculate backprop
The learning rate is not part of backpropagation; leave it alone until you have calculated all of the gradients (mixing it in confuses two separate algorithms: the optimization procedure and backpropagation).
1. Derivative of total error w.r.t. output
Well, I don't know why you are using Mean Absolute Error (while in the tutorial it is Mean Squared Error), and that's why the two results differ. But let's go with your choice.
The derivative of | y_true - y_pred | w.r.t. y_pred is sign(y_pred - y_true), i.e. ±1 (here -1, because y_pred = 0.191 is below y_true = 1), so IT IS NOT the same as the loss value. Change to MSE to get results matching the tutorial (there, the derivative of (y_pred - y_true)^2 is 2 * (y_pred - y_true), and the squared error is usually multiplied by 1/2 so that the factor of 2 cancels).
In the MSE case you would multiply by the error (y_pred - y_true), but this depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).
2. Derivative of total error w.r.t. w5
You could probably go from here, but... The derivative of the output w.r.t. w5 is the output of h1 (0.85 in this case). We multiply it by the derivative of the total error w.r.t. the output (magnitude 1; here -1) and obtain -0.85, as done in PyTorch. The same idea goes for w6.
I seriously advise you not to confuse learning rate with backprop, you are making your life harder (and it's not easy with backprop IMO, quite counterintuitive), and those are two separate things (can't stress that one enough).
This source is nice, more step-by-step, with a little more complicated network idea (activations included), so you can get a better grasp if you go through all of it.
Furthermore, if you are really keen (and you seem to be), to know more ins and outs of this, calculate the weight corrections for other optimizers (say, nesterov), so you know why we should keep those ideas separated.
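The numbers in this thread can be reproduced with a short sketch. It rebuilds the two weight matrices from the question as plain tensors and applies one SGD step to the output weights (w5, w6), once with the absolute error used in the question and once with half the squared error, which is assumed here to be the tutorial's loss. The helper name updated_w2 is hypothetical.

```python
import torch

x = torch.tensor([2.0, 3.0])
target = torch.tensor([1.0])
lr = 0.05

def updated_w2(loss_name):
    """Run one forward/backward pass and return the SGD-updated output weights."""
    w1 = torch.tensor([[0.11, 0.21], [0.12, 0.08]], requires_grad=True)
    w2 = torch.tensor([[0.14, 0.15]], requires_grad=True)
    pred = w2 @ (w1 @ x)                      # hidden = [0.85, 0.48], pred = 0.191
    if loss_name == "l1":
        loss = (pred - target).abs().sum()    # d loss / d pred = sign(pred - target) = -1
    else:
        loss = 0.5 * (pred - target).pow(2).sum()  # d loss / d pred = pred - target = -0.809
    loss.backward()
    with torch.no_grad():
        return w2 - lr * w2.grad              # plain SGD: theta -= lr * gradient

l1_w2 = updated_w2("l1")
mse_w2 = updated_w2("mse")
print(l1_w2)   # matches PyTorch's output in the question
print(mse_w2)  # matches the tutorial's hand calculation
```

The only difference between the two results is the derivative of the loss with respect to the prediction; the learning rate enters in the same way in both cases.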

Resources