I have some losses computed in a loop and stored in a tensor. Now I want to multiply the loss tensor by a weight tensor to get the final loss, but after torch.dot() the resulting scalar, ll_new, has requires_grad=False. The following is my code.
loss_vector = torch.FloatTensor(total_loss_q)
w_norm = F.softmax(loss_vector, dim=0)
ll_new = torch.dot(loss_vector, w_norm)
How can I get requires_grad=True for ll_new after doing the above?
I think the issue is in the line loss_vector = torch.FloatTensor(total_loss_q), since requires_grad for loss_vector is False (the default). Note that torch.FloatTensor does not accept a requires_grad argument, so you should build the tensor with torch.tensor instead:
loss_vector = torch.tensor(total_loss_q, requires_grad=True)
The issue most likely lies within this part:
I have some losses in a loop storing them in a tensor loss
You are most likely losing requires_grad somewhere in the process before torch.dot, e.g. if you call something like .item() on the individual losses when constructing the total_loss_q tensor.
What type is your total_loss_q? If it is a list of plain Python numbers, there is no way your gradients will propagate through it. You need to construct total_loss_q in such a way that it is a tensor which knows how each individual loss was computed (i.e. can propagate gradients back to your trainable weights), as in the sketch below.
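A minimal sketch of that idea (the x tensors here are hypothetical stand-ins for whatever produces your individual losses): collecting each loss as a tensor and stacking them keeps the graph intact, so the final dot product has requires_grad=True.
import torch
import torch.nn.functional as F

losses = []
for _ in range(5):
    x = torch.randn(3, requires_grad=True)  # stand-in for a model output
    losses.append((x ** 2).sum())           # each loss keeps its graph history

loss_vector = torch.stack(losses)           # still attached to the graph
w_norm = F.softmax(loss_vector, dim=0)
ll_new = torch.dot(loss_vector, w_norm)
print(ll_new.requires_grad)                 # True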
Related question:
fc = nn.Linear(n, 3)
I want to freeze the parameters of the third output of fc when I train this layer.
According to @PlainRavioli, it's not possible yet, but you can set the gradient to zero so the current weights do not change.
But you have to do this after calling loss.backward() and before calling optimizer.step().
So, with fc = nn.Linear(n, 3), to freeze the parameters of the third output:
loss.backward()
# zero out the gradient of the third output's weights and bias
fc.weight.grad[2, :] = torch.zeros_like(fc.weight.grad[2, :])
fc.bias.grad[2] = torch.zeros_like(fc.bias.grad[2])
optimizer.step()
Calling loss.backward() computes dloss/dx for every tensor x with requires_grad=True and accumulates it into x.grad (x.grad += dloss/dx). Before the first backward call, these gradients are None.
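A tiny demonstration of this accumulation behaviour:
import torch

x = torch.tensor([1.0], requires_grad=True)
print(x.grad)              # None: no backward call yet
(x ** 2).sum().backward()
print(x.grad)              # tensor([2.])
(x ** 2).sum().backward()
print(x.grad)              # tensor([4.]): gradients accumulate until zeroed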
With a monolithic layer you can't. But you can split the layer, making a separate Linear for each output channel:
fc1 = nn.Linear(n, 1)
fc2 = nn.Linear(n, 1)
fc3 = nn.Linear(n, 1)
Then you can freeze fc3, as sketched below.
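A short sketch of how the split could fit together (a hypothetical example; n is your input size, and the concatenated output has the same shape as nn.Linear(n, 3)):
import torch
import torch.nn as nn

n = 8
fc1, fc2, fc3 = nn.Linear(n, 1), nn.Linear(n, 1), nn.Linear(n, 1)

# Freeze fc3: its parameters receive no gradients at all.
for p in fc3.parameters():
    p.requires_grad = False

x = torch.randn(4, n)
out = torch.cat([fc1(x), fc2(x), fc3(x)], dim=1)  # shape (4, 3)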
I believe it is not possible yet; the common solution, however, is to set the gradient to zero before backpropagation so that the current weights do not change.
With param being the parameter of the layer you want to freeze:
param.grad[2, :] = torch.zeros_like(param.grad[2, :])
before calling loss.backward()
Don't forget to zero the corresponding bias gradient as well!
Make sure the weights you are targeting did not change afterwards.
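For example, a quick sanity check (a sketch reusing fc, loss and optimizer from the snippet above; note that an optimizer with momentum or weight decay can move a weight even when its gradient is zero, which is exactly why this check is worth running):
before = fc.weight[2, :].detach().clone()
loss.backward()
fc.weight.grad[2, :] = torch.zeros_like(fc.weight.grad[2, :])
fc.bias.grad[2] = torch.zeros_like(fc.bias.grad[2])
optimizer.step()
assert torch.equal(fc.weight[2, :], before)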
What's the correct way to do gradient descent on an arbitrary function with no input using PyTorch?
x = torch.tensor(x_init, requires_grad=True)
opt = torch.optim.Adam([x])
cost_fnx = cost(x)
for iteration_count in range(100):
    opt.zero_grad()
    cost_fnx.backward()
    opt.step()
When I tried the above, I got this error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.
The error occurs because you are trying to backpropagate through the same graph multiple times. You most likely need to recompute the cost value (your regularizer function, since it only takes the model's parameters as input) to backpropagate again. Something like:
x = x_init.requires_grad_(True)
opt = torch.optim.Adam([x])
for iteration_count in range(2):
    cost_fnx = cost(x)
    opt.zero_grad()
    cost_fnx.backward()
    opt.step()
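For concreteness, a runnable toy version of that loop, using a hypothetical quadratic cost; x should converge toward 3:
import torch

def cost(x):
    return ((x - 3.0) ** 2).sum()

x = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = cost(x)   # recomputed every iteration, so each backward gets a fresh graph
    loss.backward()
    opt.step()
print(x)  # close to tensor([3.])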
I use Keras and I want to combine SSIM and some other functions in a loss function (link).
To see the value and effect of each component, I define each one as a metric as well. But that computes each component twice, once in the loss function and again in the metric function.
Is there a way to compute each component once and use its value both as a metric and in the loss function?
Thanks
Try to define your network as a multi-output Model, like:
Model(inputs=your_input, outputs=[your_output, your_output, ...])
Then you can define a different loss function for each of the (identical) outputs, like this:
model.compile(..., loss=[my_loss_1, my_loss_2, ...], loss_weights=[w1, w2, ...])
To make it a bit nicer, you can use "identity" lambda layers to define separate named output layers (and set the losses by name).
out_loss1 = Lambda(lambda x: x, name='loss1')(my_output)
out_loss2 = Lambda(lambda x: x, name='loss2')(my_output)
...
model.compile(..., loss={'loss1': my_loss_1, 'loss2': my_loss_2, ...}, loss_weights={'loss1':w1, 'loss2':w2, ...})
This way your overall loss is a combination of several loss functions, you get the value of each one, and there are no redundant computations.
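Putting it together, a minimal sketch of the pattern (the architecture is a placeholder, and the built-in mse and mae losses stand in for your SSIM and other components):
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inp = Input(shape=(32,))
out = Dense(10)(Dense(64, activation='relu')(inp))

# Identity layers give each loss component its own named output.
out_loss1 = Lambda(lambda x: x, name='loss1')(out)
out_loss2 = Lambda(lambda x: x, name='loss2')(out)

model = Model(inputs=inp, outputs=[out_loss1, out_loss2])
model.compile(optimizer='adam',
              loss={'loss1': 'mse', 'loss2': 'mae'},
              loss_weights={'loss1': 1.0, 'loss2': 0.5})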
This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed with pack_padded_sequence, we get a T x B x N tensor of outputs, where T is the max number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent outputs are all zeros.
Here are my questions.
For a single-output task where one needs the last output of every sequence, a simple outputs[-1] will give a wrong result, since this tensor contains lots of zeros for the short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output of every sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O, and computes the cross-entropy loss against the true targets TB (usually integers in a language model). In this situation, do the zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        return unpacked.gather(1, idx).squeeze()

    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch
        packed = pack_padded_sequence(embs, list(lengths.data),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero-padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In the older versions of PyTorch, masking was not supported, so you had to implement your own workaround. The solution I used was masked_cross_entropy.py by jihunchoi. You may also be interested in this discussion.
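Tying this back to the reshape in Question 2, a minimal sketch with made-up shapes: flatten the T x B x O logits to (T*B) x O and let ignore_index skip the zero-padded targets.
import torch
import torch.nn as nn

T, B, O = 5, 3, 10
logits = torch.randn(T, B, O, requires_grad=True)
targets = torch.randint(1, O, (T, B))   # real classes are 1..O-1
targets[3:, 1] = 0                      # pretend sequence 1 is padded after step 3

loss_function = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_function(logits.view(-1, O), targets.view(-1))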
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis uses above (also recommended here), and the results are the same in my case.
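A tiny self-contained check of the one-liner (made-up shapes, batch-first like above):
import numpy as np
import torch

unpacked_out = torch.arange(24.0).view(2, 4, 3)  # [batch, time, features]
lengths = torch.tensor([4, 2])
last = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
print(last.shape)  # torch.Size([2, 3]): one last-timestep vector per sequence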
I've been trying to implement a custom objective function in Keras (the negative log-likelihood of the normal distribution).
Keras expects one argument for the ground-truth tensor and one for the predictions tensor; for y_pred, I'm passing a tensor that should represent an n x 2 matrix where the first column is the mean of the distribution and the second the precision.
My problem is that I haven't been able to work out how to properly slice y_pred before passing it into the likelihood function without triggering the error
'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?'
While I understand that I'm feeding l_func arguments of the variable type when it expects an array, I don't seem to be able to grok how to properly split the input y_pred variable into its mean and precision components to plug into the likelihood function. Here are some attempts; if someone could enlighten me about how to proceed, I would greatly appreciate it.
import numpy as np
import theano.tensor as T
from theano import function

def log_likelihood(y_true, y_pred):
    mu = T.vector('mu')
    beta = T.vector('beta')
    x = T.vector('x')
    likelihood = .5*(beta*(x - mu)**2) - T.log(beta/(2*np.pi))
    l_func = function([mu, beta, x], likelihood)
    return l_func(y_pred[:, 0], y_pred[:, 1], y_true)

def log_likelihood(y_true, y_pred):
    likelihood = .5*(y_pred[:, 1]*(y_true - y_pred[:, 0])**2) - T.log(y_pred[:, 1]/(2*np.pi))
    l_func = function([y_true, y_pred], likelihood)
    return l_func(y_true, y_pred)

def log_likelihood(y_true, y_pred):
    mu = y_pred[:, 0]
    beta = y_pred[:, 1]
    x = y_true
    mu_function = function([y_pred], mu)
    beta_function = function([y_pred], beta)
    id_function = function([y_true], x)
    likelihood = .5*(beta_function(y_pred)*(id_function(y_true) - mu_function(y_pred))**2) - T.log(beta_function(y_pred)/(2*np.pi))
    l_func = function([y_true, y_pred], likelihood)
    return l_func(y_true, y_pred)
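A sketch of the usual fix (not from the original thread): a Keras loss must build and return a symbolic expression on y_true and y_pred directly; compiling and calling a theano function inside the loss is what raises the "Expected an array-like object" error. Keeping the likelihood expression as written above, and assuming y_true is a column of targets:
import numpy as np
import theano.tensor as T

def log_likelihood(y_true, y_pred):
    mu = y_pred[:, 0]       # first column: mean
    beta = y_pred[:, 1]     # second column: precision
    x = y_true[:, 0]
    likelihood = .5*(beta*(x - mu)**2) - T.log(beta/(2*np.pi))
    return T.mean(likelihood)  # symbolic scalar; Keras differentiates it itself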