I would like to update the learning rates corresponding to each weight matrix and each bias in PyTorch during training. The answers here and here and many other answers I found online talk about doing this using the model's param_groups, which to the best of my knowledge applies learning rates per group of parameters, not per layer weight/bias. I also want to update the learning rates during training, not pre-set them with torch.optim.
Any help is appreciated.
Updates to model parameters are handled by an optimizer in PyTorch. When you define the optimizer, you have the option of partitioning the model parameters into groups, called param groups. Each param group can have different optimizer settings; for example, one group of parameters could have a learning rate of 0.1 and another a learning rate of 0.01.
To do what you're asking, you can just make every parameter belong to a different param group. You'll need some way to keep track of which param group corresponds to which parameter. Once you've defined the optimizer with different groups you can update the learning rate whenever you want, including at training time.
For example, say we have the following simple linear model
import torch
import torch.nn as nn
import torch.optim as optim

class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 20)
        self.layer2 = nn.Linear(20, 1)

    def forward(self, x):
        return self.layer2(self.layer1(x))

model = LinearModel()
and suppose we want learning rates for each trainable parameter initialized according to the following:
learning_rates = {
    'layer1.weight': 0.01,
    'layer1.bias': 0.1,
    'layer2.weight': 0.001,
    'layer2.bias': 1.0,
}
We can use this dictionary to define a different learning rate for each parameter when we initialize the optimizer.
# Build param_groups where each group consists of a single parameter.
# `param_group_names` is created so we can keep track of which param group
# corresponds to which parameter.
param_groups = []
param_group_names = []
for name, parameter in model.named_parameters():
    param_groups.append({'params': [parameter], 'lr': learning_rates[name]})
    param_group_names.append(name)

# The optimizer requires a default learning rate, even if it's overridden
# by every param group.
optimizer = optim.SGD(param_groups, lr=10)
Alternatively, we could omit the 'lr' entry and each param group would be initialized with the default learning rate (lr=10 in this case).
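For instance, a minimal sketch of that variant (each group then starts from the default lr=10 until we change it during training):

param_groups = [{'params': [p]} for p in model.parameters()]
optimizer = optim.SGD(param_groups, lr=10)  # every group inherits the default lr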
At training time, if we want to update the learning rates, we can do so by iterating over the optimizer.param_groups and updating the 'lr' entry of each. For example, in the following simplified training loop, we update the learning rates before each step.
for i in range(10):
    output = model(torch.zeros(1, 10))
    loss = output.sum()
    optimizer.zero_grad()
    loss.backward()

    # We can change the learning rate whenever we want, for each param group.
    print(f'step {i} learning rates')
    for name, param_group in zip(param_group_names, optimizer.param_groups):
        param_group['lr'] = learning_rates[name] / (i + 1)
        print(f'    {name}: {param_group["lr"]}')

    optimizer.step()
which prints
step 0 learning rates
    layer1.weight: 0.01
    layer1.bias: 0.1
    layer2.weight: 0.001
    layer2.bias: 1.0
step 1 learning rates
    layer1.weight: 0.005
    layer1.bias: 0.05
    layer2.weight: 0.0005
    layer2.bias: 0.5
step 2 learning rates
    layer1.weight: 0.0033333333333333335
    layer1.bias: 0.03333333333333333
    layer2.weight: 0.0003333333333333333
    layer2.bias: 0.3333333333333333
step 3 learning rates
    layer1.weight: 0.0025
    layer1.bias: 0.025
    layer2.weight: 0.00025
    layer2.bias: 0.25
step 4 learning rates
    layer1.weight: 0.002
    layer1.bias: 0.02
    layer2.weight: 0.0002
    layer2.bias: 0.2
step 5 learning rates
    layer1.weight: 0.0016666666666666668
    layer1.bias: 0.016666666666666666
    layer2.weight: 0.00016666666666666666
    layer2.bias: 0.16666666666666666
step 6 learning rates
    layer1.weight: 0.0014285714285714286
    layer1.bias: 0.014285714285714287
    layer2.weight: 0.00014285714285714287
    layer2.bias: 0.14285714285714285
step 7 learning rates
    layer1.weight: 0.00125
    layer1.bias: 0.0125
    layer2.weight: 0.000125
    layer2.bias: 0.125
step 8 learning rates
    layer1.weight: 0.0011111111111111111
    layer1.bias: 0.011111111111111112
    layer2.weight: 0.00011111111111111112
    layer2.bias: 0.1111111111111111
step 9 learning rates
    layer1.weight: 0.001
    layer1.bias: 0.01
    layer2.weight: 0.0001
    layer2.bias: 0.1
Related
I have a problem with imbalanced labels: for example, 90% of the data have the label 0 and the remaining 10% have the label 1.
I want to train the network with minibatches, so I want the optimizer to give the examples labeled 1 a learning rate (or somehow scale the gradients to be) 9 times greater than the examples labeled 0.
Is there any way of doing that?
The problem is that the whole training process is done in this line:
history = model.fit(trainX, trainY, epochs=1, batch_size=minibatch_size, validation_data=(valX, valY), verbose=0)
Is there a way to modify the fit method at a lower level?
You can try using the class_weight parameter of Keras.
From the Keras docs:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only).
Example of using it in imbalance data:
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#class_weights
class_weight = {0: 1, 1: 10}
history = model.fit(trainX, trainY, epochs=1, batch_size=minibatch_size, validation_data=(valX, valY), verbose=0, class_weight=class_weight)
Full example:
# Examine the class label imbalance.
# You can use your_df['label_class_column'] or just the trainY values.
import numpy as np

neg, pos = np.bincount(your_df['label_class_column'])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

# Scaling by total / 2 helps keep the loss at a similar magnitude.
# The sum of the weights of all examples stays the same.
weight_for_0 = (1 / neg) * total / 2.0
weight_for_1 = (1 / pos) * total / 2.0

class_weight = {0: weight_for_0, 1: weight_for_1}
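The computed weights can then be passed to fit exactly as in the snippet above (trainX, trainY, etc. are the question's placeholders):

history = model.fit(trainX, trainY, epochs=1, batch_size=minibatch_size,
                    validation_data=(valX, valY), verbose=0,
                    class_weight=class_weight)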
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

x = np.linspace(0, 20, 100)
g = 1 + 0.2 * np.exp(-0.1 * (x - 7)**2)
y = np.sin(g * x)
plt.plot(x, y)
plt.show()

x = torch.from_numpy(x)
y = torch.from_numpy(y)
x = x.reshape((100, 1))
y = y.reshape((100, 1))

MM = nn.Sequential()
MM.add_module('L1', nn.Linear(1, 128))
MM.add_module('R1', nn.ReLU())
MM.add_module('L2', nn.Linear(128, 128))
MM.add_module('R2', nn.ReLU())
MM.add_module('L3', nn.Linear(128, 128))
MM.add_module('R3', nn.ReLU())
MM.add_module('L4', nn.Linear(128, 128))
MM.add_module('R5', nn.ReLU())
MM.add_module('L5', nn.Linear(128, 1))
MM.double()

L = nn.MSELoss()
lr = 3e-05                                   ######
opt = torch.optim.Adam(MM.parameters(), lr)  #########

Epo = []
COST = []
for epoch in range(8000):
    opt.zero_grad()
    err = L(torch.sin(MM(x)), y)
    Epo.append(epoch)
    COST.append(err.item())
    err.backward()
    if epoch % 100 == 0:
        print(err)
    opt.step()

Epo = np.array(Epo) / 1000.
COST = np.array(COST)

pred = torch.sin(MM(x)).detach().numpy()
Trans = MM(x).detach().numpy()
x = x.reshape((100))
pred = pred.reshape((100))
Trans = Trans.reshape((100))

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(2, 2, 1)
ax.plot(x, y, 'r')
fig.tight_layout()
ax = fig.add_subplot(2, 2, 2)
ax.plot(x, pred, 'g')
fig.tight_layout()
ax = fig.add_subplot(2, 2, 3)
ax.plot(Epo, COST, 'y+')
plt.ylim(0, 1100)
ax = fig.add_subplot(2, 2, 4)
ax.plot(x, Trans, 'b')
fig.tight_layout()
plt.show()
This is the original code 1.
To change the learning rate during training, I tried moving the definition of 'opt' into the loop:
Epo = []
COST = []
for epoch in range(8000):
    lr = 3e-05                                   ######
    opt = torch.optim.Adam(MM.parameters(), lr)  #########
    opt.zero_grad()
    err = L(torch.sin(MM(x)), y)
    Epo.append(epoch)
    COST.append(err.item())
    err.backward()
    if epoch % 100 == 0:
        print(err)
    opt.step()
This is code 2.
Code 2 also runs, but the result is quite different from code 1.
What is the difference? And for changing the learning rate during training (like lr = (1 - epoch/10000 * 0.99)), what should I do?
You shouldn't move the optimizer definition into the training loop, because the optimizer keeps a lot of information related to training history; in the case of Adam, for example, running averages of the gradients are stored and updated dynamically by the optimizer's internal mechanism.
So instantiating a new optimizer each iteration makes you lose this history.
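For example, here is a quick, self-contained sketch of the state Adam accumulates (a fresh optimizer starts with an empty state):

import torch

p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.Adam([p], lr=3e-05)
p.sum().backward()
opt.step()
# The running averages Adam keeps per parameter; recreating the optimizer discards them.
print(opt.state[p].keys())  # dict_keys(['step', 'exp_avg', 'exp_avg_sq'])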
To update the learning rate dynamically, there are lots of scheduler classes provided in PyTorch (exponential decay, cyclical decay, cosine annealing, ...). You can check the documentation for the full list of schedulers, or you can implement your own if needed: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
Example from the documentation: to decay the learning rate by a factor of 0.5 every 10 epochs, you can use the StepLR scheduler as follows:
opt = torch.optim.Adam(MM.parameters(), lr)
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
And in your original code 1 you can do:
for epoch in range(8000):
    opt.zero_grad()
    err = L(torch.sin(MM(x)), y)
    Epo.append(epoch)
    COST.append(err.item())
    err.backward()
    if epoch % 100 == 0:
        print(err)
    opt.step()
    scheduler.step()
As I said, there are many other types of lr schedulers, so you can choose one from the documentation or implement your own.
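For your specific formula, a minimal LambdaLR sketch (assuming the intent is a linear decay of the initial learning rate; LambdaLR multiplies the initial lr by whatever the lambda returns):

opt = torch.optim.Adam(MM.parameters(), lr=3e-05)
# lr(epoch) = 3e-05 * (1 - epoch / 10000 * 0.99), read directly from your formula
scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda epoch: 1 - epoch / 10000 * 0.99)
for epoch in range(8000):
    # ... forward / backward / opt.step() as in code 1 ...
    scheduler.step()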
For example, I want to set lr = 0.01 for the first 100 epochs, lr = 0.001 from epoch 101 to epoch 1000, and lr = 0.0005 for epochs 1001-4000. Basically, my learning rate plan is not to let it decay exponentially with a fixed number of steps. I know this can be achieved with self-defined functions; I am just curious whether there are already developed functions to do that.
torch.optim.lr_scheduler.LambdaLR is what you are looking for. It returns a multiplier of the initial learning rate, so you can specify any value for any given epoch. For your example it would be:
from torch.optim.lr_scheduler import LambdaLR

def lr_lambda(epoch: int):
    if epoch < 100:
        return 1.0   # lr = 0.01 * 1.0  = 0.01
    if epoch < 1000:
        return 0.1   # lr = 0.01 * 0.1  = 0.001
    return 0.05      # lr = 0.01 * 0.05 = 0.0005

# Optimizer has lr set to 0.01
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)
for epoch in range(4000):
    train(...)
    validate(...)
    optimizer.step()
    scheduler.step()
In PyTorch there are common schedulers (like MultiStepLR or ExponentialLR), but for a custom use case (such as yours), LambdaLR is the easiest.
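For comparison, a minimal MultiStepLR sketch; because it reuses a single gamma at every milestone, it cannot express the 0.01 -> 0.001 -> 0.0005 plan exactly (with gamma=0.1 the second drop would land on 0.0001 instead of 0.0005):

scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 1000], gamma=0.1)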
I am using TensorFlow 2.0 and Python 3.8, and I want to use a learning rate scheduler for which I have a function. I have to train a neural network for 160 epochs, where the learning rate is to be decreased by a factor of 10 at epochs 80 and 120, and the initial learning rate is 0.01.
def scheduler(epoch, current_learning_rate):
    if epoch == 79 or epoch == 119:
        return current_learning_rate / 10
    else:
        return min(current_learning_rate, 0.001)
How can I use this learning rate scheduler function with tf.GradientTape()? I know how to use it with model.fit() as a callback:
callback = tf.keras.callbacks.LearningRateScheduler(scheduler)
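This callback would then be passed to fit, e.g. (a sketch; the data arguments are placeholders):

model.fit(trainX, trainY, epochs=160, callbacks=[callback])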
How do I use it in a custom training loop with tf.GradientTape()?
Thanks!
The learning rate for different epochs can be set using the lr attribute of the TensorFlow Keras optimizer. The lr attribute still exists since TensorFlow 2 has backward compatibility for Keras (for more details, refer to the source code here).
Below is a small snippet of how the learning rate can be varied across epochs. self._train_step is similar to the train_step function defined here.
# Assuming, as in the linked train_step example, that these are methods of a
# trainer class that holds the optimizer as self.optimizer.
def set_learning_rate(self, epoch):
    if epoch > 180:
        self.optimizer.lr = 0.5e-6
    elif epoch > 160:
        self.optimizer.lr = 1e-6
    elif epoch > 120:
        self.optimizer.lr = 1e-5
    elif epoch > 3:
        self.optimizer.lr = 1e-4

def train(self, epochs, train_data, val_data):
    prev_val_loss = float('inf')
    for epoch in range(epochs):
        self.set_learning_rate(epoch)
        for images, labels in train_data:
            self._train_step(images, labels)
        for images, labels in val_data:
            self._test_step(images, labels)
Another alternative would be to use tf.keras.optimizers.schedules:
learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [80 * num_steps, 120 * num_steps, 160 * num_steps, 180 * num_steps],
    [1e-3, 1e-4, 1e-5, 1e-6, 5e-6]
)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
Note that here one can't directly provide epochs; instead, the boundaries have to be given in steps, where the number of steps per epoch is len(train_data) / batch_size.
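For instance, assuming train_data is an in-memory array of examples (rather than an already-batched dataset), num_steps above could be computed as:

num_steps = len(train_data) // batch_size  # optimizer steps per epoch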
A learning rate schedule needs a step value, which cannot be specified when using GradientTape followed by optimizer.apply_gradients().
So you should not pass the schedule directly as the learning_rate of the optimizer.
Instead, you can first call the schedule to get the value for the current step and then update the learning rate value in the optimizer:
optim = tf.keras.optimizers.SGD()
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(1e-2, 1000, 0.9)

for step in range(1000):
    # Evaluate the schedule at the current step and push it into the optimizer.
    optim.learning_rate = lr_schedule(step)
    with tf.GradientTape() as tape:
        loss = compute_loss(...)  # your forward pass / loss goes here
    grads = tape.gradient(loss, model.trainable_variables)
    optim.apply_gradients(zip(grads, model.trainable_variables))
I'm trying to implement gradient descent with PyTorch according to this schema, but I can't figure out how to properly update the weights. It is just a toy example with 2 linear layers, 2 nodes in the hidden layer, and one output.
Learning rate = 0.05;
target output = 1
https://hmkcode.github.io/ai/backpropagation-step-by-step/
(Forward and backward pass diagrams from the linked tutorial.)
My code is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class MyNet(nn.Module):
    def __init__(self):
        super(MyNet, self).__init__()
        self.linear1 = nn.Linear(2, 2, bias=None)
        self.linear1.weight = torch.nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
        self.linear2 = nn.Linear(2, 1, bias=None)
        self.linear2.weight = torch.nn.Parameter(torch.tensor([[0.14, 0.15]]))

    def forward(self, inputs):
        out = self.linear1(inputs)
        out = self.linear2(out)
        return out

losses = []
loss_function = nn.L1Loss()
model = MyNet()
optimizer = optim.SGD(model.parameters(), lr=0.05)
input = torch.tensor([2.0, 3.0])
print('weights before backpropagation = ', list(model.parameters()))

for epoch in range(1):
    result = model(input)
    loss = loss_function(result, torch.tensor([1.00], dtype=torch.float))
    print('result = ', result)
    print('loss = ', loss)
    model.zero_grad()
    loss.backward()
    print('gradients =', [x.grad.data for x in model.parameters()])
    optimizer.step()

print('weights after backpropagation = ', list(model.parameters()))
The result is the following:
weights before backpropagation = [Parameter containing:
tensor([[0.1100, 0.2100],
[0.1200, 0.0800]], requires_grad=True), Parameter containing:
tensor([[0.1400, 0.1500]], requires_grad=True)]
result = tensor([0.1910], grad_fn=<SqueezeBackward3>)
loss = tensor(0.8090, grad_fn=<L1LossBackward>)
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
weights after backpropagation = [Parameter containing:
tensor([[0.1240, 0.2310],
[0.1350, 0.1025]], requires_grad=True), Parameter containing:
tensor([[0.1825, 0.1740]], requires_grad=True)]
Forward pass values:
2*0.11 + 3*0.21 = 0.85
2*0.12 + 3*0.08 = 0.48
0.85*0.14 + 0.48*0.15 = 0.191
loss = 0.191 - 1 = -0.809
Backward pass: let's calculate w5 and w6 (the output node weights).
w = w - (prediction - target) * (gradient) * (output of previous node) * (learning rate)
w5 = 0.14 - (0.191 - 1) * 1 * 0.85 * 0.05 = 0.14 + 0.034 = 0.174
w6 = 0.15 - (0.191 - 1) * 1 * 0.48 * 0.05 = 0.15 + 0.019 = 0.169
In my example PyTorch doesn't multiply the loss by the derivative, so we get the wrong weights after updating. For the output node we got new weights w5, w6 = [0.1825, 0.1740], when they should be [0.174, 0.169].
Moving backward to update the first weight of the output node (w5), we need to calculate (prediction - target) * (gradient) * (output of previous node) * (learning rate) = -0.809 * 1 * 0.85 * 0.05 = -0.034, so the updated weight should be w5 = 0.14 - (-0.034) = 0.174. But instead PyTorch calculated the new weight as 0.1825; it forgot to multiply by (prediction - target) = -0.809. For the output node we got gradients -0.8500 and -0.4800, but we still need to multiply them by the loss 0.809 and the learning rate 0.05 before we can update the weights.
What is the proper way of doing this?
Should we pass 'loss' as an argument to backward(), i.e. loss.backward(loss)?
That seems to fix it, but I couldn't find any example of this in the documentation.
You should use .zero_grad() with the optimizer, i.e. optimizer.zero_grad(), not on loss or model as suggested in the comments (model is fine too, but it is not as clear or readable, IMO).
Apart from that, your parameters are updated fine, so the error is not on PyTorch's side.
Based on gradient values you provided:
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
Let's multiply all of them by your learning rate (0.05):
gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]),
tensor([[-0.0425, -0.024]])]
Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:
parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
tensor([[0.1825, 0.1740]])]
What you have done is take the gradients calculated by PyTorch and multiply them by the output of the previous node, and that's not how it works!
What you did:
w5 = 0.14 - (0.191 - 1) * 1 * 0.85 * 0.05 = 0.14 + 0.034 = 0.174
What should have been done (using PyTorch's results):
w5 = 0.14 - (-0.85 * 0.05) = 0.1825
No multiplication by the previous node; it's done behind the scenes (that's what .backward() does: it calculates the correct gradients for all of the nodes), so there is no need to multiply them by previous ones.
If you want to calculate them by hand, you have to start at the loss (with a delta of one) and backprop all the way down (do not use the learning rate here, it's a different story!).
After all of them are calculated, you can multiply each gradient by the optimizer's learning rate (or any other formula, e.g. momentum), and after this you have your correct update.
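As a concrete check, here is a minimal autograd sketch of the toy output layer from the question; .backward() already applies the chain rule, and a plain SGD step then reproduces PyTorch's numbers:

import torch

w = torch.tensor([[0.14, 0.15]], requires_grad=True)  # w5, w6
h = torch.tensor([0.85, 0.48])                        # outputs of the hidden nodes
loss = torch.abs((w * h).sum() - 1.0)                 # L1 loss against target 1
loss.backward()
print(w.grad)   # tensor([[-0.8500, -0.4800]]) -- matches the question's gradients
with torch.no_grad():
    w -= 0.05 * w.grad                                # ordinary SGD with lr=0.05
print(w)        # tensor([[0.1825, 0.1740]]) -- matches PyTorch's update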
How to calculate backprop
The learning rate is not part of backpropagation; leave it alone until you have calculated all of the gradients (mixing them together confuses separate things: optimization procedures and backpropagation).
1. Derivative of total error w.r.t. output
Well, I don't know why you are using Mean Absolute Error (while in the tutorial it is Mean Squared Error), and that's why those results differ. But let's go with your choice.
The derivative of |y_true - y_pred| w.r.t. y_pred is sign(y_pred - y_true), i.e. ±1, so IT IS NOT the same as the loss. Change to MSE to get results equal to the tutorial's (there, the derivative is 2 * (y_pred - y_true), which is why MSE is usually multiplied by 1/2: to cancel that factor of 2).
In the MSE case you would multiply by the error (y_pred - y_true), but it depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).
2. Derivative of total error w.r.t. w5
You could probably go from here, but... the derivative of the total error w.r.t. w5 is the output of h1 (0.85 in this case) multiplied by the derivative of the total error w.r.t. the output (which is -1 here, since the prediction is below the target), giving -0.85, exactly as PyTorch reports. The same idea goes for w6.
I seriously advise you not to mix the learning rate into backprop; you are making your life harder (and backprop is not easy IMO, quite counterintuitive), and those are two separate things (I can't stress that enough).
This source is nice and more step-by-step, with a slightly more complicated network (activations included), so you can get a better grasp if you go through all of it.
Furthermore, if you are really keen (and you seem to be) to learn more of the ins and outs of this, calculate the weight corrections for other optimizers (say, Nesterov), so you see why we should keep those ideas separate.