With SGD, the learning rate should not change during epochs, but it does. Please help me understand why this happens and how to prevent the learning rate from changing.
import torch
params = [torch.nn.Parameter(torch.randn(1, 1))]
optimizer = torch.optim.SGD(params, lr=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
for epoch in range(5):
    print(scheduler.get_lr())
    scheduler.step()
Output is:
[0.9]
[0.7290000000000001]
[0.6561000000000001]
[0.5904900000000002]
[0.5314410000000002]
My torch version is 1.4.0
Since you are using torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9) (which is actually torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)), you are multiplying the learning rate by gamma=0.9 every step_size=1 step:
0.9 = 0.9
0.729 = 0.9*0.9*0.9
0.6561 = 0.9*0.9*0.9*0.9
0.59049 = 0.9*0.9*0.9*0.9*0.9
The only "strange" point is that it is missing 0.81 = 0.9*0.9 at the second step (UPDATE: see Szymon Maszke's answer for an explanation).
To prevent the learning rate from decreasing too early: if you have N samples in your dataset and the batch size is D, set torch.optim.lr_scheduler.StepLR(optimizer, step_size=N/D, gamma=0.9) to decay once per epoch (assuming scheduler.step() is called after every batch). To decay every E epochs, set torch.optim.lr_scheduler.StepLR(optimizer, step_size=E*N/D, gamma=0.9).
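For example, with hypothetical numbers (a sketch assuming scheduler.step() is called once per batch, reusing the optimizer from above):
N = 50_000   # hypothetical number of training samples
D = 100      # hypothetical batch size
E = 3        # decay every E epochs

# With one scheduler.step() per batch, the LR is multiplied by gamma
# every E * N // D scheduler steps, i.e. once every E epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=E * N // D, gamma=0.9)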
This is just what torch.optim.lr_scheduler.StepLR is supposed to do: it changes the learning rate. From the PyTorch documentation:
Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr
If you are trying to optimize params, your code should look more like this (just a toy example; the precise form of the loss will depend on your application):
for epoch in range(5):
    optimizer.zero_grad()
    loss = (params[0]**2).sum()
    loss.backward()
    optimizer.step()
To expand upon xiawi's answer about the "strange" behavior (0.81 is missing): this has been PyTorch's default behavior since the 1.1.0 release; check the documentation, namely this part:
[...] If you use the learning rate scheduler (calling
scheduler.step()) before the optimizer’s update (calling
optimizer.step()), this will skip the first value of the learning rate
schedule.
Additionally, you should get a UserWarning after the first scheduler.step() call, as you do not call optimizer.step() at all.
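A minimal sketch of the recommended ordering applied to the snippet from the question (the toy loss is just the squared parameter, as in the answer above):
import torch

params = [torch.nn.Parameter(torch.randn(1, 1))]
optimizer = torch.optim.SGD(params, lr=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

for epoch in range(5):
    optimizer.zero_grad()
    loss = (params[0] ** 2).sum()   # toy loss
    loss.backward()
    optimizer.step()                # optimizer update first ...
    print(scheduler.get_last_lr())  # LR actually used this epoch
    scheduler.step()                # ... then advance the schedule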
Related
I want to train a model in two stages. The first one is a pre-training with teacher forcing, and the second one is a regular training (without teacher forcing). The difference here is that the model is instantiated with use_teacher_forcing=True in the first case and use_teacher_forcing=False in the latter.
To do so, I currently run two trainings, where the second training resumes from the first training's checkpoint by passing the last checkpoint to the Lightning trainer.
Regarding the learning rate, I want to decay it over several milestones in pre-training as well as in regular training. For instance, if I use 5 epochs of pre-training and 5 epochs of regular training, I want the learning rate to be as follows:
epoch:  0     1     2     3     4     5     6     7     8     9
lr:     1e-4  1e-4  1e-5  1e-5  1e-6  1e-4  1e-4  1e-5  1e-5  1e-6
However, I cannot find a way to reset the learning rate to its initial value at the beginning of the regular training, since the scheduler is also loaded from the checkpoint.
Is there a way to do this?
I am using torch 1.9.0 and pytorch-lightning 1.3.8 and am not able to upgrade to later versions.
I came across the following solution.
Apparently, it's not that hard to implement and use a custom learning rate scheduler. I'll leave the code here in case anybody stumbles upon the same problem.
import warnings
from collections import Counter

from torch.optim.lr_scheduler import _LRScheduler


class MultiStepLRWithReset(_LRScheduler):
    def __init__(self, optimizer, milestones, reset_epochs, reset_lr_to=None,
                 gamma=0.1, last_epoch=-1, verbose=False):
        # Copy the milestones so the caller's list is not mutated, then repeat
        # them after each reset epoch.
        full_milestones = list(milestones)
        for reset_epoch in reset_epochs:
            full_milestones += [m + reset_epoch for m in milestones]
        self.milestones = Counter(full_milestones)
        self.reset_epochs = reset_epochs
        self.reset_lr_to = reset_lr_to
        self.gamma = gamma
        super(MultiStepLRWithReset, self).__init__(optimizer, last_epoch, verbose)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.", UserWarning)
        # At a reset epoch, jump back to the initial (or an explicitly given) LR.
        if self.last_epoch in self.reset_epochs:
            if self.reset_lr_to is None:
                return [group['initial_lr'] for group in self.optimizer.param_groups]
            else:
                return [self.reset_lr_to for _ in self.optimizer.param_groups]
        # Otherwise behave like MultiStepLR: decay only at milestone epochs.
        if self.last_epoch not in self.milestones:
            return [group['lr'] for group in self.optimizer.param_groups]
        return [group['lr'] * self.gamma ** self.milestones[self.last_epoch]
                for group in self.optimizer.param_groups]
You will have to create an LRScheduler for the entire training, as it will not be re-instantiated for the second training stage if all the PyTorch training components are loaded from their last checkpoint.
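For instance, a quick sanity-check sketch (the model and optimizer are placeholders; milestones [2, 4] with a reset at epoch 5 reproduce the schedule from the question):
import torch

model = torch.nn.Linear(4, 1)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

scheduler = MultiStepLRWithReset(optimizer, milestones=[2, 4],
                                 reset_epochs=[5], gamma=0.1)

for epoch in range(10):
    # ... training steps with optimizer.step() would go here ...
    print(epoch, scheduler.get_last_lr())  # 1e-4, 1e-4, 1e-5, 1e-5, 1e-6, then reset
    scheduler.step()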
We have been trying for a while to get this to work. This is probably the easiest example to create, and we now need help. We've been changing the number of epochs in the fit function, which gives us different results, but never anything good, and when we increase it too much the outputs always converge to 0.5.
#%%
import numpy
import tensorflow
from tensorflow import keras   # imports inferred from the usage below

inputValues = numpy.array([[0, 0], [0, 1], [1, 0], [1, 1]])
inputResults = numpy.array([[0], [1], [1], [0]])
print(inputValues)
print(inputResults)
#%%
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(2,)),
    keras.layers.Dense(units=2, activation="relu"),
    keras.layers.Dense(units=2, activation="softmax")
])
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tensorflow.optimizers.Adam(),
              metrics=['accuracy'])
model.fit(inputValues, inputResults, epochs=2500)
model.summary()
print(model.weights)
#%%
print(model.predict_proba(inputValues))
print("End of file.")
From my understanding of ANNs, we should have 2 inputs in the first layer, specifically for the XOR example, and two outputs in the output layer (either a 0 or a 1). I assume that since it is not required to say what these outputs are (0 or 1), TensorFlow handles this automatically by comparing the results in the fit function? Lastly, we have tried both with a hidden layer (of 2) and without, and still don't seem to get any better results.
Could someone let us know what we have done wrong?
Your problem is essentially a binary classification problem, because the output can be either 0 or 1. For this you don't need two output neurons; one will do, with a sigmoid activation that outputs a value between 0 and 1 (sigmoid generally works well for binary classification, because its characteristic S-shape pushes values close to either 0 or 1).
Another adjustment you need to make is to set the loss function to binary crossentropy; your choice, sparse categorical crossentropy, is suitable for classification into more than 2 categories.
So the code that I tried is:
from tensorflow import keras            # imports assumed from the usage in the question
from tensorflow.keras import optimizers

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(2,)),
    keras.layers.Dense(units=4, activation="sigmoid"),
    keras.layers.Dense(units=1, activation="sigmoid")
])
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False),
              optimizer=optimizers.Adam(),
              metrics=['accuracy'])
model.fit(inputValues, inputResults, epochs=2500)
With these settings I got the training accuracy to 1.0000. It took a while to get there, and I suppose it could be sped up by playing around with the learning rate, but it should be enough to get the job done.
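For completeness, a small sketch of how the sigmoid outputs can be turned into hard 0/1 predictions (thresholding at 0.5):
probs = model.predict(inputValues)   # probabilities in (0, 1)
preds = (probs > 0.5).astype(int)    # hard 0/1 predictions
print(probs)
print(preds)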
I have a strange issue already mentioned here: LinearSVC Feature Selection returns different coef_ in Python,
but I cannot really relate it to my case.
I have a regularised L1 logistic regression that I am using for feature selection.
When I simply rerun the code, the number of features selected changes.
The target variable is binary (1, 0). The number of features is 709. There are 435 training observations, so there are more features than observations. The penalty C was obtained through TimeSeriesSplit CV and never changes when I rerun; I verified that.
Below is the code for the feature selection part:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X = df_training_features
y = df_training_targets

lr_l1 = LogisticRegression(C=LR_penalty.C, max_iter=10000, class_weight=None, dual=False,
                           fit_intercept=True, intercept_scaling=1, l1_ratio=None, n_jobs=None,
                           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
                           verbose=0, warm_start=False).fit(X, y)

model = SelectFromModel(lr_l1, threshold=1e-5, prefit=True)
feature_idx = model.get_support()
feature_name = X.columns[feature_idx]
X_new = model.transform(X)

# Plot
importance = lr_l1.coef_[0]
for i, v in enumerate(importance):
    if np.abs(v) >= 1e-5:
        print('Feature: %0d, Score: %.5f' % (i, v))
sel = importance[np.abs(importance) >= 1e-5]

# plot feature importance
plt.figure(figsize=(12, 10))
plt.bar([x for x in feature_name], sel)
plt.xticks(fontsize=10, rotation=70)
plt.ylabel('Feature Importance', fontsize=14)
plt.show()
As seen above, the result sometimes gives me 22 features selected (first plot), other times 24 (second plot), or 23. I am not sure what is happening. I thought the issue was in SelectFromModel, so I decided to explicitly state the threshold 1e-5 (which is the default for L1 regularisation), but nothing changes.
It is always the same features that are sometimes in and sometimes out, so I checked their coefficients, thinking they might be close to that threshold; instead, they are 1 or 2 orders of magnitude higher.
Can anybody please help? I have been struggling with this for more than a day.
You used solver=liblinear. From the documentation:
random_state : int, RandomState instance, default=None
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.
So try setting a fixed value for random_state and you should converge to the same results.
After a very quick search, I found that liblinear uses coordinate descent to minimize the cost function (source). This means it chooses a random set of coefficients and minimizes the cost function one step at a time. I suppose your results are slightly different because each run started from a different point.
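A minimal sketch of the suggested change (the value 0 for random_state is arbitrary; any fixed integer will do):
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Fixing random_state makes liblinear's internal data shuffling deterministic,
# so reruns select the same features.
lr_l1 = LogisticRegression(C=LR_penalty.C, penalty='l1', solver='liblinear',
                           max_iter=10000, tol=0.0001,
                           random_state=0).fit(X, y)
model = SelectFromModel(lr_l1, threshold=1e-5, prefit=True)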
I have tried several methods to display the learning rate effectively used by a Keras model at the last epoch.
Some research has shown it was possible to change the learning rate using callbacks, or to display the learning rate with a custom metric.
But the displayed learning rate was always the ORIGINAL learning rate, whatever method I tried.
Some answers suggest re-calculating what the rate should be, based on the decay formula. But what I want is simply to get the learning rate that was actually used for backpropagation, without re-computing it from the algorithm.
Here is some code I used:
callback_list = []
metric_list = ['accuracy']

# Add checkpoints to save weights in case the test set acc improved
# ...
if show_learn_param:
    learn_param = Callback_show_learn_param()
    callback_list.append(learn_param)

# Add metric if needed
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr  # K.eval(optimizer.lr)
    return lr

lr_metric = get_lr_metric(optimizer)
metric_list.append(lr_metric)
Here is the definition of the callback:
import keras.backend as K
from keras.callbacks import Callback   # imports inferred from the usage

class Callback_show_learn_param(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        decay = self.model.optimizer.decay
        iterations = self.model.optimizer.iterations
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        # Beta values
        beta_1 = self.model.optimizer.beta_1
        beta_2 = self.model.optimizer.beta_2
        print("lr", K.eval(lr), "decay", K.eval(decay), "lr_with_decay", K.eval(lr_with_decay),
              "beta_1", K.eval(beta_1), "beta_2", K.eval(beta_2))
Basically, the displayed values are constant and do not change. That makes sense for the beta values and the decay, but the learning rate that is shown seems to be the initial one. I could not find a way to display the one simple value I am after: the effective learning rate that was actually used.
There is BTW an easier way to display this initial learning rate:
import keras.backend as K
print(K.eval(model.optimizer.lr))
You need to use K.get_value to obtain the learning rate. Have a look at LearningRateScheduler and how that callback obtains the learning rate from the model. In your case you should be able to print the learning rate:
def on_epoch_end(self, epoch, logs=None):
    lr = float(K.get_value(self.model.optimizer.lr))
    print("Learning rate:", lr)
To be clear, by weights I mean the entries in the matrices (Ws) of the affine transformation in a node of a neural net.
I start with categorical_crossentropy as my loss function. And I want to add an additional term to penalize negative weights.
To this end I want to introduce a term of the form
theano.tensor.sum(theano.tensor.exp(-10 * ws))
Where "ws" are the weights.
If I follow the source code of categorical_crossentropy:
if true_dist.ndim == coding_dist.ndim:
    return -tensor.sum(true_dist * tensor.log(coding_dist),
                       axis=coding_dist.ndim - 1)
elif true_dist.ndim == coding_dist.ndim - 1:
    return crossentropy_categorical_1hot(coding_dist, true_dist)
else:
    raise TypeError('rank mismatch between coding and true distributions')
Seems like I should update the third line (from the bottom) to read
crossentropy_categorical_1hot(coding_dist, true_dist) + theano.tensor.sum(theano.tensor.exp(- 10 * ws))
And change the declaration of the function to be
my_categorical_crossentropy(coding_dist, true_dist, ws). Then, when calling my_categorical_crossentropy, I write
loss = my_categorical_crossentropy(net_output, true_output, l_layers[1].W)
with, for a start, l_layers[1].W being the weights coming from the first layer of my neural net.
With those updates, I go on writing:
loss = aggregate(loss, mode = 'mean')
updates = sgd(loss, all_params, learning_rate = 0.005)
train = theano.function([l_input.input_var, true_output], loss, updates = updates)
[...]
This passes the compiler and everything runs smoothly; the training of the network completes. However, for some reason the additional term theano.tensor.sum(theano.tensor.exp(-10 * ws)) is ignored; it seems not to affect the loss value.
I tried looking into the Theano documentation, but so far I could not figure out what might be wrong. The weights l_layers[1].W are shared variables, so I could not pass them as
train = theano.function([l_input.input_var, true_output, l_layers[1].W], loss, updates = updates)
Any comments are welcome. Thanks!
Solution
Though I did not find out why my original approach did not work, adding the penalty term outside categorical_crossentropy, as suggested in the comments, did solve the problem:
loss = aggregate(categorical_crossentropy(net_output, true_output) + theano.tensor.sum(theano.tensor.exp(-10 * l_layers[1].W)))
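Putting the pieces together (a sketch reusing the names from the question; the Lasagne import paths for aggregate, sgd, and categorical_crossentropy are my assumption, based on the calls used above):
import theano
import theano.tensor as T
# Import paths assumed; adjust to wherever these helpers come from in your setup.
from lasagne.objectives import categorical_crossentropy, aggregate
from lasagne.updates import sgd

# Penalty on negative weights of the first layer, added outside the
# cross-entropy, as in the fix above.
penalty = T.sum(T.exp(-10 * l_layers[1].W))

loss = aggregate(categorical_crossentropy(net_output, true_output) + penalty,
                 mode='mean')
updates = sgd(loss, all_params, learning_rate=0.005)
train = theano.function([l_input.input_var, true_output], loss, updates=updates)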