Why unused embedding vector changed after backward? - pytorch

I'm partially fine-tuning a embedding matrix (just for some particular index). But in backward phase, other embeddings which is not expected to be modified are changed.
Here is an example:
import torch
from torch import nn
embeds = nn.Embedding(10, 3)
optim = torch.optim.Adam(params=embeds.parameters(), lr=0.001, weight_decay=1e-6)
indexes = torch.tensor([1,2,3]) # expected to finetune
bce_loss = nn.BCELoss()
optim.zero_grad()
loss = bce_loss(embeds(indexes).sum(dim=-1), torch.zeros(3))
loss.backward()
optim.step()
I only want to finetune 1st, 2nd, 3rd embeddings, but after backward, the entire embedding matrix is changed. Why? And how to solve this problem? THX!

The change in embedding matrix after optim.step() is not entirely due to the loss you defined (i.e. BCELoss). You are overlooking the other part of the loss that is being implicitely defined due to the weight_decay parameter in your optimizer. When you supply a non-zero weight decay, it adds a second commponent to the loss that is over all parameters of the model. Just turn that off
optim = torch.optim.Adam(..., weight_decay=0.)
and you will see expected result.

Related

Tensorflow 1.15 / Keras 2.3.1 Model.train_on_batch() returns more values than there are outputs/loss functions

I am trying to train a model that has more than one output and as a result, also has more than one loss function attached to it when I compile it.
I haven't done something similar in the past (not from scratch at least).
Here's some code I am using to figure out how this works.
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
batch_size = 50
input_size = 10
i = Input(shape=(input_size,))
x = Dense(100)(i)
x_1 = Dense(output_size)(x)
x_2 = Dense(output_size)(x)
model = Model(i, [x_1, x_2])
model.compile(optimizer = 'adam', loss = ["mse", "mse"])
# Data creation
x = np.random.random_sample([batch_size, input_size]).astype('float32')
y = np.random.random_sample([batch_size, output_size]).astype('float32')
loss = model.train_on_batch(x, [y,y])
print(loss) # sample output [0.8311912, 0.3519104, 0.47928077]
I would expect the variable loss to have two entries (one for each loss function), however, I get back three. I thought maybe one of them is the weighted average but that does not look to be the case.
Could anyone explain how passing in multiple loss functions works, because obviously, I am misunderstanding something.
I believe the three outputs are the sum of all the losses, followed by the individual losses on each output.
For example, if you look at the sample output you've printed there:
0.3519104 + 0.47928077 = 0.83119117 ≈ 0.8311912
Your assumption that there should be two losses in incorrect. You have a model with two outputs, and you specified one loss for each output, but the model has to be trained on a single loss, so Keras trains the model on a new loss that is the sum of the per-output losses.
You can control how these losses are mixed using the loss_weights parameter in model.compile. I think by default it takes weights values equal to 1.0.
So in the end what train_on_batch returns is the loss, output one mse, and output two mse. That is why you get three values.

Pytorch Categorical Cross Entropy loss function behaviour

I have question regarding the computation made by the Categorical Cross Entropy Loss from Pytorch.
I have made this easy code snippet and because I use the argmax of the output tensor as the targets, I cannot understand why the loss is still high.
import torch
import torch.nn as nn
ce_loss = nn.CrossEntropyLoss()
output = torch.randn(3, 5, requires_grad=True)
targets = torch.argmax(output, dim=1)
loss = ce_loss(outputs, targets)
print(loss)
Thanks for the help understanding it.
Best regards
Jerome
So here is a sample data from your code with the output, label and loss having the following values
outputs = tensor([[ 0.5968, -0.8249, 1.5018, 2.7888, -0.6125],
[-1.1534, -0.4921, 1.0688, 0.2241, -0.0257],
[ 0.3747, 0.8957, 0.0816, 0.0745, 0.2695]], requires_grad=True)requires_grad=True)
labels = tensor([3, 2, 1])
loss = tensor(0.7354, grad_fn=<NllLossBackward>)
So let's examine the values,
If you compute the softmax output of your logits (outputs), using something like this torch.softmax(outputs,axis=1) you will get
probs = tensor([[0.0771, 0.0186, 0.1907, 0.6906, 0.0230],
[0.0520, 0.1008, 0.4801, 0.2063, 0.1607],
[0.1972, 0.3321, 0.1471, 0.1461, 0.1775]], grad_fn=<SoftmaxBackward>)
So these will be your prediction probabilities.
Now cross-entropy loss is nothing but a combination of softmax and negative log likelihood loss. Hence, your loss can simply be computed using
loss = (torch.log(1/probs[0,3]) + torch.log(1/probs[1,2]) + torch.log(1/probs[2,1])) / 3
, which is the average of the negative log of the probabilities of your true labels. The above equation evaluates to 0.7354, which is equivalent to the value returned from the nn.CrossEntropyLoss module.

Lack of Sparse Solution with L1 Regularization in Pytorch

I am trying to implement L1 regularization onto the first layer of a simple neural network (1 hidden layer). I looked into some other posts on StackOverflow that apply l1 regularization using Pytorch to figure out how it should be done (references: Adding L1/L2 regularization in PyTorch?, In Pytorch, how to add L1 regularizer to activations?). No matter how high I increase lambda (the l1 regularization strength parameter) I do not get true zeros in the first weight matrix. Why would this be? (Code is below)
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
class Network(nn.Module):
def __init__(self,nf,nh,nc):
super(Network,self).__init__()
self.lin1=nn.Linear(nf,nh)
self.lin2=nn.Linear(nh,nc)
def forward(self,x):
l1out=F.relu(self.lin1(x))
out=F.softmax(self.lin2(l1out))
return out, l1out
def l1loss(layer):
return torch.norm(layer.weight.data, p=1)
nf=10
nc=2
nh=6
learningrate=0.02
lmbda=10.
batchsize=50
net=Network(nf,nh,nc)
crit=nn.MSELoss()
optimizer=torch.optim.Adagrad(net.parameters(),lr=learningrate)
xtr=torch.Tensor(xtr)
ytr=torch.Tensor(ytr)
#ytr=torch.LongTensor(ytr)
xte=torch.Tensor(xte)
yte=torch.LongTensor(yte)
#cyte=torch.Tensor(yte)
it=200
for epoch in range(it):
per=torch.randperm(len(xtr))
for i in range(0,len(xtr),batchsize):
ind=per[i:i+batchsize]
bx,by=xtr[ind],ytr[ind]
optimizer.zero_grad()
output, l1out=net(bx)
# l1reg=l1loss(net.lin1)
loss=crit(output,by)+lmbda*l1loss(net.lin1)
loss.backward()
optimizer.step()
print('Epoch [%i/%i], Loss: %.4f' %(epoch+1,it, np.float32(loss.data.numpy())))
corr=0
tot=0
for x,y in list(zip(xte,yte)):
output,_=net(x)
_,pred=torch.max(output,-1)
tot+=1 #y.size(0)
corr+=(pred==y).sum()
print(corr)
Note: The data has 10 features (2 classes and 800 training samples) and only the first 2 are relevant (by design) so one would assume true zeros should be easy enough to learn.
Your usage of layer.weight.data removes the parameter (which is a PyTorch variable) from its automatic differentiation context, making it a constant when the optimiser takes the gradients. This results in zero gradients and that the L1 loss is not computed.
If you remove the .data, the norm is computed of the PyTorch variable and the gradients should be correct.
For more information on PyTorch's automatic differentiation mechanics, see this docs article or this tutorial.

what exactly does 'tf.contrib.rnn.DropoutWrapper'' in tensorflow do? ( three citical questions)

As I know, DropoutWrapper is used as follows
__init__(
cell,
input_keep_prob=1.0,
output_keep_prob=1.0,
state_keep_prob=1.0,
variational_recurrent=False,
input_size=None,
dtype=None,
seed=None
)
.
cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
the only thing I know is that it is use for dropout while training.
Here are my three questions
What are input_keep_prob,output_keep_prob and state_keep_prob respectively?
(I guess they define dropout probability of each part of RNN, but exactly
where?)
Is dropout in this context applied to RNN not only when training but also prediction process? If it's true, is there any way to decide whether I do or don't use dropout at prediction process?
As API documents in tensorflow web page, if variational_recurrent=True dropout works according to the method on a paper
"Y. Gal, Z Ghahramani. "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks". https://arxiv.org/abs/1512.05287 " I understood this paper roughly. When I train RNN, I use 'batch' not single time-series. In this case, tensorflow automatically assign different dropout mask to different time-series in a batch?
input_keep_prob is for the dropout level (inclusion probability) added when fitting feature weights. output_keep_prob is for the dropout level added for each RNN unit output. state_keep_prob is for the hidden state that is fed to the next layer.
You can initialize each of the above mentioned parameters as follows:
import tensorflow as tf
dropout_placeholder = tf.placeholder_with_default(tf.cast(1.0, tf.float32))
tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicRNNCell(n_hidden_rnn),
input_keep_prob = dropout_placeholder, output_keep_prob = dropout_placeholder,
state_keep_prob = dropout_placeholder)
The default dropout level will be 1 during prediction or anything else that we can feed during training.
The masking is done for the fitted weights rather than for the sequences that are included in the batch. As far as I know, it's done for the entire batch.
keep_prob = tf.cond(dropOut,lambda:tf.constant(0.9), lambda:tf.constant(1.0))
cells = rnn.DropoutWrapper(cells, output_keep_prob=keep_prob)

Strange behaviour sequence to sequence learning for variable length sequences

I am training a sequence to sequence model for variable length sequences with Keras, but I am running into some unexpected problems. It is unclear to me whether the behaviour I am observing is the desired behaviour of the library and why it would be.
Model Creation
I've made a recurrent model with an embeddings layer and a GRU recurrent layer that illustrates the problem. I used mask_zero=0.0 for the embeddings layer instead of a masking layer, but changing this doesn't seem to make a difference (nor does adding a masking layer before the output):
import numpy
from keras.layers import Embedding, GRU, TimeDistributed, Dense, Input
from keras.models import Model
import keras.preprocessing.sequence
numpy.random.seed(0)
input_layer = Input(shape=(3,), dtype='int32', name='input')
embeddings = Embedding(input_dim=20, output_dim=2, input_length=3, mask_zero=True, name='embeddings')(input_layer)
recurrent = GRU(5, return_sequences=True, name='GRU')(embeddings)
output_layer = TimeDistributed(Dense(1), name='output')(recurrent)
model = Model(input=input_layer, output=output_layer)
output_weights = model.layers[-1].get_weights()
output_weights[1] = numpy.array([0.2])
model.layers[-1].set_weights(output_weights)
model.compile(loss='mse', metrics=['mse'], optimizer='adam', sample_weight_mode='temporal')
I use masking and the sample_weight parameter to exclude the padding values from the training/evaluation. I will test this model on one input/output sequence which I pad using the Keras padding function:
X = [[1, 2]]
X_padded = keras.preprocessing.sequence.pad_sequences(X, dtype='float32', maxlen=3)
Y = [[[1], [2]]]
Y_padded = keras.preprocessing.sequence.pad_sequences(Y, maxlen=3, dtype='float32')
Output Shape
Why the output is expected to be formatted in this way. Why can I not use input/output sequences that have exactly the same dimensionality? model.evaluate(X_padded, Y_padded) gives me a dimensionality error.
Then, when I run model.predict(X_padded) I get the following output (with numpy.random.seed(0) before generating the model):
[[[ 0.2 ]
[ 0.19946882]
[ 0.19175649]]]
Why isn't the first input masked for the output layer? Is the output_value computed anyways (and equal to the bias, as the hidden layer values are 0? This does not seem desirable. Adding a Masking layer before the output layer does not solve this problem.
MSE calculation
Then, when I evaluate the model (model.evaluate(X_padded, Y_padded)), this returns the Mean Squared Error (MSE) of the entire sequence (1.3168) including this first value, which I suppose is to be expected when it isn't masked, but not what I would want.
From the Keras documentation I understand I should use the sample_weight parameter to solve this problem, which I tried:
sample_weight = numpy.array([[0, 1, 1]])
model_evaluation = model.evaluate(X_padded, Y_padded, sample_weight=sample_weight)
print model.metrics_names, model_evaluation
The output I get is
['loss', 'mean_squared_error'] [2.9329459667205811, 1.3168648481369019]
This leaves the metric (MSE) unaltered, it is still the MSE over all values, including the one that I wanted masked. Why? This is not what I want when I evaluate my model. It does cause a change in the loss value, which appears to be the MSE over the last two values normalised to not give more weight to longer sequences.
Am I doing something wrong with the sample weights? Also, I can really not figure out how this loss value came about. What should I do to exclude the padded values from both training and evaluation (I assume the sample_weight parameter works the same in the fit function).
It was indeed a bug in the library, in Keras 2 this issue is resolved.

Resources