AdamW and Adam with weight decay - pytorch

Is there any difference between torch.optim.Adam(weight_decay=0.01) and torch.optim.AdamW(weight_decay=0.01)?
Link to the docs: torch.optim

Yes, Adam and AdamW weight decay are different.
Hutter pointed out in their paper (Decoupled Weight Decay Regularization) that the way weight decay is implemented in Adam in every library seems to be wrong, and proposed a simple way (which they call AdamW) to fix it.
In Adam, the weight decay is usually implemented by adding wd*w (wd is weight decay here) to the gradients (Ist case), rather than actually subtracting from weights (IInd case).
# Ist: Adam weight decay implementation (L2 regularization)
final_loss = loss + wd * all_weights.pow(2).sum() / 2
# IInd: equivalent to this in SGD
w = w - lr * w.grad - lr * wd * w
These methods are same for vanilla SGD, but as soon as we add momentum, or use a more sophisticated optimizer like Adam, L2 regularization (first equation) and weight decay (second equation) become different.
AdamW follows the second equation for weight decay.
In Adam
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
In AdamW
weight_decay (float, optional) – weight decay coefficient (default: 1e-2)
Read more on the fastai blog.

Related

Different loss values and accuracies of MLP regressor in keras and scikit-learn

I have a neural network with one hidden layer implemented in both Keras and scikit-learn for solving a regression problem. In scikit-learn I used the MLPregressor class with mostly default parameters and in Keras I have a hidden Dense layer with parameters set to the same defaults as scikit-learn (which uses Adam with same learning rate and epsilon and a batch_size of 200). When I train the networks the scikit-learn model has a loss value that is about half of keras and its accuracy (measured in mean absolute error) is also better. Shouldn't the loss values be similar if not identical and the accuracies also be similar? Has anyone experienced something similar and able to make the Keras model more accurate?
Scikit-learn model:
clf = MLPRegressor(hidden_layer_sizes=(1600,), max_iter=1000, verbose=True, learning_rate_init=.001)
Keras model:
inputs = keras.Input(shape=(cols,))
x = keras.layers.Dense(1600, activation='relu', kernel_initializer="glorot_uniform", bias_initializer="glorot_uniform", kernel_regularizer=keras.regularizers.L2(.0001))(inputs)
outputs = keras.layers.Dense(1,kernel_initializer="glorot_uniform", bias_initializer="glorot_uniform", kernel_regularizer=keras.regularizers.L2(.0001))(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(epsilon=1e-8, learning_rate=.001),loss="mse")
model.fit(x=X, y=y, epochs=1000, batch_size=200)
It is because the formula of mean squared loss(MSE) from scikit-learn is different from that of tensorflow.
From the source code of scikit-learn:
def squared_loss(y_true, y_pred):
return ((y_true - y_pred) ** 2).mean() / 2
while MSE from tensorflow:
backend.mean(math_ops.squared_difference(y_pred, y_true), axis=-1)
As you can see the scikit-learn one is divided by 2, coherent with what you said:
the scikit-learn model has a loss value that is about half of keras
That implied the models from keras and scikit-learn actually achieved similar performance. That also implied learning rate 0.001 in scikit-learn is not equivalent to the same learning rate in tensorflow.
Also, another smaller but significant difference is the formula of L2 regularization.
From the source code of scikit-learn,
# Add L2 regularization term to loss
values = 0
for s in self.coefs_:
s = s.ravel()
values += np.dot(s, s)
loss += (0.5 * self.alpha) * values / n_samples
while that of tensorflow is loss = l2 * reduce_sum(square(x)).
Therefore, with the same l2 regularization parameter, tensorflow one has stronger regularization, which will result in poorer fit to the training data.

How to properly update the weights in PyTorch?

I'm trying to implement the gradient descent with PyTorch according to this schema but can't figure out how to properly update the weights. It is just a toy example with 2 linear layers with 2 nodes in hidden layer and one output.
Learning rate = 0.05;
target output = 1
https://hmkcode.github.io/ai/backpropagation-step-by-step/
Forward
Backward
My code is as following:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class MyNet(nn.Module):
def __init__(self):
super(MyNet, self).__init__()
self.linear1 = nn.Linear(2, 2, bias=None)
self.linear1.weight = torch.nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
self.linear2 = nn.Linear(2, 1, bias=None)
self.linear2.weight = torch.nn.Parameter(torch.tensor([[0.14, 0.15]]))
def forward(self, inputs):
out = self.linear1(inputs)
out = self.linear2(out)
return out
losses = []
loss_function = nn.L1Loss()
model = MyNet()
optimizer = optim.SGD(model.parameters(), lr=0.05)
input = torch.tensor([2.0,3.0])
print('weights before backpropagation = ', list(model.parameters()))
for epoch in range(1):
result = model(input )
loss = loss_function(result , torch.tensor([1.00],dtype=torch.float))
print('result = ', result)
print("loss = ", loss)
model.zero_grad()
loss.backward()
print('gradients =', [x.grad.data for x in model.parameters()] )
optimizer.step()
print('weights after backpropagation = ', list(model.parameters()))
The result is following :
weights before backpropagation = [Parameter containing:
tensor([[0.1100, 0.2100],
[0.1200, 0.0800]], requires_grad=True), Parameter containing:
tensor([[0.1400, 0.1500]], requires_grad=True)]
result = tensor([0.1910], grad_fn=<SqueezeBackward3>)
loss = tensor(0.8090, grad_fn=<L1LossBackward>)
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
weights after backpropagation = [Parameter containing:
tensor([[0.1240, 0.2310],
[0.1350, 0.1025]], requires_grad=True), Parameter containing:
tensor([[0.1825, 0.1740]], requires_grad=True)]
Forward pass values:
2x0.11 + 3*0.21=0.85 ->
2x0.12 + 3*0.08=0.48 -> 0.85x0.14 + 0.48*0.15=0.191 -> loss =0.191-1 = -0.809
Backward pass: let's calculate w5 and w6 (output node weights)
w = w - (prediction-target)x(gradient)x(output of previous node)x(learning rate)
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
w6= 0.15 -(0.191-1)*1*0.48*0.05= 0.15 + 0.019= 0.169
In my example Torch doesn't multiply the loss by derivative so we get wrong weights after updating. For the output node we got new weights w5,w6 [0.1825, 0.1740] , when it should be [0.174, 0.169]
Moving backward to update the first weight of the output node (w5) we need to calculate: (prediction-target)x(gradient)x(output of previous node)x(learning rate)=-0.809*1*0.85*0.05=-0.034. Updated weight w5 = 0.14-(-0.034)=0.174. But instead pytorch calculated new weight = 0.1825. It forgot to multiply by (prediction-target)=-0.809. For the output node we got gradients -0.8500 and -0.4800. But we still need to multiply them by loss 0.809 and learning rate 0.05 before we can update the weights.
What is the proper way of doing this?
Should we pass 'loss' as an argument to backward() as following: loss.backward(loss) .
That seems to fix it. But I couldn't find any example on this in documentation.
You should use .zero_grad() with optimizer, so optimizer.zero_grad(), not loss or model as suggested in the comments (though model is fine, but it is not clear or readable IMO).
Except that your parameters are updated fine, so the error is not on PyTorch's side.
Based on gradient values you provided:
gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]),
tensor([[-0.8500, -0.4800]])]
Let's multiply all of them by your learning rate (0.05):
gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]),
tensor([[-0.0425, -0.024]])]
Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:
parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
tensor([[0.1825, 0.1740]])]
What you have done is taken the gradients calculated by PyTorch and multiplied them with the output of previous node and that's not how it works!.
What you've done:
w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174
What should of been done (using PyTorch's results):
w5 = 0.14 - (-0.85*0.05) = 0.1825
No multiplication of previous node, it's done behind the scenes (that's what .backprop() does - calculates correct gradients for all of the nodes), no need to multiply them by previous ones.
If you want to calculate them manually, you have to start at the loss (with delta being one) and backprop all the way down (do not use learning rate here, it's a different story!).
After all of them are calculated, you can multiply each weight by optimizers learning rate (or any other formula for that matter, e.g. Momentum) and after this you have your correct update.
How to calculate backprop
Learning rate is not part of backpropagation, leave it alone until you calculate all of the gradients (it confuses separate algorithms together, optimization procedures and backpropagation).
1. Derivative of total error w.r.t. output
Well, I don't know why you are using Mean Absolute Error (while in the tutorial it is Mean Squared Error), and that's why both those results vary. But let's go with your choice.
Derivative of | y_true - y_pred | w.r.t. to y_pred is 1, so IT IS NOT the same as loss. Change to MSE to get equal results (here, the derivative will be (1/2 * y_pred - y_true), but we usually multiply MSE by two in order to remove the first multiplication).
In MSE case you would multiply by the loss value, but it depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).
2. Derivative of total error w.r.t. w5
You could probably go from here, but... Derivative of total error w.r.t to w5 is the output of h1 (0.85 in this case). We multiply it by derivative of total error w.r.t. output (it is 1!) and obtain 0.85, as done in PyTorch. Same idea goes for w6.
I seriously advise you not to confuse learning rate with backprop, you are making your life harder (and it's not easy with backprop IMO, quite counterintuitive), and those are two separate things (can't stress that one enough).
This source is nice, more step-by-step, with a little more complicated network idea (activations included), so you can get a better grasp if you go through all of it.
Furthermore, if you are really keen (and you seem to be), to know more ins and outs of this, calculate the weight corrections for other optimizers (say, nesterov), so you know why we should keep those ideas separated.

How can I calculate the loss without the weight decay in Keras?

I defined a convolutional layer and also use the L2 weight decay in Keras.
When I define the loss in the model.fit(), has all the weight decay loss been included in this loss? If the weight decay loss has been included in the total loss, how can I get the loss without this weight decay during the training?
I want to investigate the loss without the weight decay, while I want this weight decay to attend this training.
Yes, weight decay losses are included in the loss value printed on the screen.
The value you want to monitor is the total loss minus the sum of regularization losses.
The total loss is just model.total_loss
.
The regularization losses are collected in the list model.losses.
The following lines can be found in the source code of model.compile():
# Add regularization penalties
# and other layer-specific losses.
for loss_tensor in self.losses:
total_loss += loss_tensor
To get the loss without weight decay, you can reverse the above operations. I.e., the value to be monitored is model.total_loss - sum(model.losses).
Now, how to monitor this value is a bit tricky. Fortunately, the list of metrics used by a Keras model is not fixed until model.fit() is called. So you can append this value to the list, and it'll be printed on the screen during model fitting.
Here's a simple example:
input_tensor = Input(shape=(64, 64, 3))
hidden = Conv2D(32, 1, kernel_regularizer=l2(0.01))(input_tensor)
hidden = GlobalAveragePooling2D()(hidden)
out = Dense(1)(hidden)
model = Model(input_tensor, out)
model.compile(loss='mse', optimizer='adam')
loss_no_weight_decay = model.total_loss - sum(model.losses)
model.metrics_tensors.append(loss_no_weight_decay)
model.metrics_names.append('loss_no_weight_decay')
When you run model.fit(), something like this will be printed to the screen:
Epoch 1/1
100/100 [==================] - 0s - loss: 0.5764 - loss_no_weight_decay: 0.5178
You can also verify whether this value is correct by computing the L2 regularization manually:
conv_kernel = model.layers[1].get_weights()[0]
print(np.sum(0.01 * np.square(conv_kernel)))
In my case, the printed value is 0.0585, which is indeed the difference between loss and loss_no_weight_decay (with some rounding error).

Effect of class_weight and sample_weight in Keras

Can someone tell me mathematically how sample_weight and class_weight are used in Keras in the calculation of loss function and metrics? A simple mathematical express will be great.
It is a simple multiplication. The loss contributed by the sample is magnified by its sample weight. Assuming i = 1 to n samples, a weight vector of sample weights w of length n, and that the loss for sample i is denoted L_i:
In Keras in particular, the product of each sample's loss with its weight is divided by the fraction of weights that are not 0 such that the loss per batch is proportional to the number of weight > 0 samples. Let p be the proportion of non-zero weights.
Here's the relevant snippet of code from the Keras repo:
score_array = loss_fn(y_true, y_pred)
if weights is not None:
score_array *= weights
score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
return K.mean(score_array)
class_weight is used in the same way as sample_weight; it is just provided as a convenience to specify certain weights across entire classes.
The sample weights are currently not applied to metrics, only loss.

LASSO fit scikit-learn - obtain likelihood

I'm using LASSO from scikit-learn package to optimize the parameters of a penalized linear regression problem. I'm not only interested in the optimal choice of parameters, but also in the likelihood of the data with respect to the optimized parameters. Is there an easy way to get the full likelihood after fitting?
It is slightly deceiving to consider the lasso in a maximum likelihood framework. The prior distribution on the coefficients is then a laplacian distribution exp(-np.prod(np.abs(coef))), which yields sparsity only as an "artifact" at its optimum. The probability of obtaining a sparse sample from this distribution is 0 (it happens "almost never").
This disclaimer out of the way, you can write
import numpy as np
from sklearn.linear_model import Lasso
est = Lasso(alpha=10.)
est.fit(X, y)
coef = est.coef_
data_loss = 0.5 * ((X.dot(coef) - y) ** 2).sum()
n_samples, n_features = X.shape
penalty = n_samples * est.alpha * np.abs(coef).sum()
likelihood = np.exp(-(data_loss + penalty))

Resources