I have a loss based on 2 things:
MSE loss
A custom loss term based on the network weights.
I have this code:
net = CustomNet()
mse_loss = torch.nn.MSELoss()
def custom_loss(output, target):
    weights =
    return mse_loss(output, target) + torch.linalg.norm(weights # weights.T -
When I try to remove the MSE loss (so my loss is only based on the weights):
def custom_loss(output, target):
    weights =
    return torch.linalg.norm(weights # weights.T -
I am getting the error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I notice that the MSE loss has grad_fn=<MseLossBackward object at 0x14908c450>
What am I doing wrong? Why can't I use only the second loss?

You can't use the second term alone because, unlike the first term, it has no grad_fn, so there is nothing for backward() to differentiate. It also means that when you use both terms, backpropagation only goes through the first one (the MSE loss) and ignores the second: a tensor without a grad_fn is treated as a constant w.r.t. the inputs and parameters and has no effect on the gradient.
The tensors you use to compute the second term do not require a gradient. More specifically, any tensor that you obtain through the data attribute won't require a gradient, and in your case that is how weights is read.
Instead, you should access the parameter tensor directly:
>>> weights = net.linear_layer.weight
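To see the difference, here is a minimal sketch. It uses a stand-in torch.nn.Linear layer in place of net.linear_layer, and it assumes the weight term is the norm of W @ W.T minus the identity. The expression in the question is cut off, so that form is only a guess used for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for net.linear_layer from the question.
layer = torch.nn.Linear(4, 4)

def weight_penalty(weights):
    # Assumed form of the weight-based term: Frobenius norm of W @ W.T - I.
    identity = torch.eye(weights.shape[0])
    return torch.linalg.norm(weights @ weights.T - identity)

# Through .data: the result is cut off from the autograd graph.
detached_loss = weight_penalty(layer.weight.data)
print(detached_loss.requires_grad, detached_loss.grad_fn)

# Through the parameter itself: the result carries a grad_fn.
attached_loss = weight_penalty(layer.weight)
print(attached_loss.requires_grad, attached_loss.grad_fn is not None)

# backward() now works and fills in the gradient of the weights.
attached_loss.backward()
print(layer.weight.grad.shape)
```

Calling detached_loss.backward() instead would raise the RuntimeError quoted in the question.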


Why does my cross entropy loss function not converge?

I tried to write a cross entropy loss function myself. My loss function gives the same loss value as the official one, but when I use it in place of the official cross entropy loss function, training does not converge. With the official cross entropy loss function, it converges. Here is my code; please give me some suggestions. Thanks very much.
The input 'out' is a tensor (B*C) and 'label' contains class indices (1 * B)
class MylossFunc(nn.Module):
    def __init__(self):
        super(MylossFunc, self).__init__()

    def forward(self, out, label):
        out = torch.nn.functional.softmax(out, dim=1)
        n = len(label)
        loss = torch.FloatTensor([0])
        loss = Variable(loss, requires_grad=True)
        tmp = torch.log(out)
        for i in range(n):
            loss = loss - torch.max(tmp[i][label[i]], torch.scalar_tensor(-100) )/n
        loss = torch.sum(loss)
        return loss
Instead of applying torch.softmax followed by torch.log, you should use torch.log_softmax; otherwise your training will become unstable, with nan values spreading everywhere.
This happens because when you take the softmax of your logits using the following line:
out = torch.nn.functional.softmax(out, dim=1)
one of the components of out can underflow to exactly zero. torch.log then returns -inf for that component, and the gradients flowing back through it become inf or nan. That is why torch (and other common libraries) provide a single stable operation, log_softmax, which avoids the numerical instability of calling torch.softmax and torch.log separately.
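The effect can be reproduced with a small sketch. The logits are chosen by hand so that softmax underflows in float32, and my_cross_entropy is a hypothetical vectorized rewrite of the loss, not the asker's loop. It also drops the -100 clamp, which log_softmax makes unnecessary here.

```python
import torch
import torch.nn.functional as F

# Logits with a large spread, so that softmax underflows in float32.
logits = torch.tensor([[200.0, 0.0, -200.0]])
label = torch.tensor([1])

naive_log_probs = torch.log(torch.softmax(logits, dim=1))
stable_log_probs = torch.log_softmax(logits, dim=1)

print(naive_log_probs)   # contains -inf where softmax underflowed to 0
print(stable_log_probs)  # stays finite

def my_cross_entropy(out, label):
    # Pick the log-probability of the true class in each row, then average.
    log_probs = torch.log_softmax(out, dim=1)
    rows = torch.arange(out.shape[0])
    return -log_probs[rows, label].mean()

loss = my_cross_entropy(logits, label)
reference = F.cross_entropy(logits, label)
print(loss.item(), reference.item())
```

Indexing the log-probabilities directly also removes the Python loop over samples and the manually created loss tensor.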

torch.nn.CrossEntropyLoss over Multiple Batches

I am currently working with torch.nn.CrossEntropyLoss. As far as I know, it is common to compute the loss batch-wise. However, is there a possibility to compute the loss over multiple batches?
More concretely, assume we are given the data
import torch
features = torch.randn(no_of_batches, batch_size, feature_dim)
targets = torch.randint(low=0, high=10, size=(no_of_batches, batch_size))
loss_function = torch.nn.CrossEntropyLoss()
Is there a way to compute in one line
loss = loss_function(features, targets) # raises RuntimeError: Expected target size [no_of_batches, feature_dim], got [no_of_batches, batch_size]
Thank you in advance!
You can compute multiple cross-entropy losses at once, but you'll need to do your own reduction. Since CrossEntropyLoss expects the class dimension (here feature_dim, which has to equal the number of classes) to be the second dimension of the features tensor, you will also need to permute it first.
loss_function = torch.nn.CrossEntropyLoss(reduction='none')
loss = loss_function(features.permute(0,2,1), targets).mean(dim=1)
which will result in a loss tensor with no_of_batches entries.
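As a quick check, the sketch below compares this one-line version with the usual batch-by-batch loss. The sizes are illustrative, and num_classes plays the role of feature_dim.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
no_of_batches, batch_size, num_classes = 3, 5, 10  # illustrative sizes

features = torch.randn(no_of_batches, batch_size, num_classes)
targets = torch.randint(low=0, high=num_classes, size=(no_of_batches, batch_size))

loss_function = torch.nn.CrossEntropyLoss(reduction='none')
# Move the class dimension to position 1: (batches, classes, samples).
per_sample = loss_function(features.permute(0, 2, 1), targets)
per_batch = per_sample.mean(dim=1)

# Reference: the usual batch-wise loss, computed one batch at a time.
reference = torch.stack(
    [F.cross_entropy(features[i], targets[i]) for i in range(no_of_batches)]
)
print(per_batch.shape, torch.allclose(per_batch, reference))
```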

keras and shape of input and losses

In all the Keras code examples I see, the input shape is passed directly and the batch dimension is assumed to come first, e.g.:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)
However when it comes to custom losses I see that the last axis (axis=-1) is used.
def loss(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)
When writing the loss should one think of y_true and y_pred as batches or singular samples?
I'm assuming it's the former, but in that case I can't understand why it specifies the last axis.
In your custom loss function, y_true and y_pred are batches, and so is the value the function returns: one loss value per sample. If you only calculate one loss for your network, you could also drop the axis argument, since you only want a single value for your loss in the end.
But if you have multiple outputs in your network and you want to calculate the total loss, where each output might use its own loss functions, things begin to change.
Look at where Keras calls _prepare_total_loss, the function that calculates the total loss.
In this function, the following code is executed:
output_loss = loss_fn(y_true, y_pred, sample_weight=sample_weight)
which returns the loss for a single output of your network. This is also where your custom loss function gets called. If there are multiple outputs, all of them are calculated, weighted and added to the total loss: total_loss += loss_weight * output_loss
In the end, _prepare_total_loss returns K.mean(total_loss). So in the simplest case, if your custom loss function returned a vector with its length equal to the batch size, and there is only one output with loss in your network, the final loss will be the mean of the output-vector returned by your custom loss.
With multiple outputs and multiple losses, however, you first calculate a per-sample loss vector for each output (and hence for each loss function), take their weighted sum, and then obtain the final loss as the mean of the resulting vector.
If each of your loss functions returned a single value instead of a batch-sized vector, the final loss would be a mean of already-averaged values, which can differ from the mean loss over the whole batch, for instance once per-sample weights or masks are applied.
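A small NumPy sketch shows where the two approaches diverge. It is a simplification rather than the Keras implementation: it applies hypothetical per-sample weights by plain multiplication and leaves out Keras's normalization.

```python
import numpy as np

y_true = np.array([[0.0, 0.0], [0.0, 0.0]])
y_pred = np.array([[1.0, 1.0], [3.0, 3.0]])
sample_weight = np.array([1.0, 0.5])  # hypothetical per-sample weights

# Same reduction as the custom loss: mean over the last axis only.
per_sample = np.mean(np.square(y_pred - y_true), axis=-1)

# Loss returns a vector: the weights act on each sample before the final mean.
vector_first = np.mean(per_sample * sample_weight)

# Loss returns a scalar: every sample is weighted on the same averaged value.
scalar_first = np.mean(np.mean(per_sample) * sample_weight)

print(per_sample, vector_first, scalar_first)
```

Reducing over axis=-1 only keeps one value per sample, which is what lets the weights act on individual samples.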

gradients of custom loss

Suppose a model as in:
model = Model(inputs=[A, B], outputs=C)
With custom loss:
def actor_loss(y_true, y_pred):
    log_lik = y_true * K.log(y_pred)
    loss = -K.sum(log_lik * K.stop_gradient(B))
    return loss
Now I'm trying to define a function that returns the gradients of the loss w.r.t. the weights for a given pair of input and target output, and to expose it as such.
Here is an idea of what I mean in pseudocode
def _get_grads(inputs, targets):
    loss = model.loss(targets, model.output)
    weights = model.trainable_weights
    grads = K.gradients(loss, weights)
    model.input[0] (aka 'A') <----inputs[0]
    model.input[1] (aka 'B') <----inputs[1]
    return K.function(model.input, grads)
self.get_grads = _get_grads
My question is how do I feed inputs argument to the graph inside said function.
(So far I've only worked with .fit and not with .gradients and I can't find any decent documentation with custom loss or multiple inputs)
If you call K.function, you get an actual callable function, so you should just call it with some parameter values. The format matches the inputs the function was built with; in your case that is a list of two arrays of values, including the batch dimension:
self.get_grads = _get_grads(inputs, targets)
grad_value = self.get_grads([input1, input2])
Where input1 and input2 are numpy arrays that include the batch dimension.
My understanding of K.function, K.gradients and custom losses was fundamentally wrong. You use the function to construct a mini-graph that computes the gradients of the loss w.r.t. the weights. There is no need for the function itself to take arguments.
def _get_grads():
    targets = Input(shape=...)
    loss = model.loss(targets, model.output)
    weights = model.trainable_weights
    grads = K.gradients(loss, weights)
    return K.function(model.input + [targets], grads)
I was under the impression that _get_grads was itself the K.function, but that was wrong: _get_grads() returns the K.function, which you then use as
f = _get_grads() # constructs the mini-graph that gives gradients
grads = f(inputs + [labels]) # inputs is the list of arrays for model.input
The arrays in inputs are fed to model.input, labels is fed to targets, and the call returns the gradients.

Keras ImageDataGenerator sample_weight with data augmentation

I have a question about the use of the sample_weight parameter in the context of data augmentation in Keras with the ImageDataGenerator. Let's say I have a series of simple images with just one class of objects. So, for each image, I will have a corresponding mask with pixels = 0 for the background and 1 for where the object is labeled.
However, this dataset is unbalanced because a significant number of these images are empty, meaning their masks contain only 0.
If I understood correctly, the 'sample_weight' parameter of the flow method of ImageDataGenerator is meant to put the focus on the samples of my dataset that I find more interesting, i.e. those where my object is present.
My question is: what is the concrete influence of this sample_weight parameter on the training of my model. Does it influence the data augmentation? If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
Here is the part of my code my question refers to:
data_gen_args = dict(rotation_range=90,
rescale=1. / 255,
image_datagen = ImageDataGenerator(**data_gen_args)
imf = image_datagen.flow(
sample_weight = sample_weight,
save_to_dir = 'traindir',
save_prefix = 'train_'
valf = image_datagen.flow(
sample_weight = sample_weight,
save_to_dir = 'valdir',
save_prefix = 'val_'
model = unet.UNet2(numberOfClasses, imshape, '', learningRate, depth=4)
history = model.fit_generator(generator=imf,
Thank you in advance for your attention.
As of Keras 2.2.5 with Keras Preprocessing 1.1.0, the sample_weight is passed along with the samples and applied during processing. When calling .fit_generator, the model is trained on batches, each batch using its sample weights:
model.train_on_batch(x, y,
In the source code of .train_on_batch, the documentation states: "sample_weight: Optional array of the same length as x, containing weights to apply to the model's loss for each sample. (...)". The actual application of weights happens when calculating loss on each batch. When compiling a model, Keras generates a "weighted loss" function out of the desired loss function. The weighted computation is stated in the code as:
def weighted(y_true, y_pred, weights, mask=None):
    """Wrapper function.

    # Arguments
        y_true: `y_true` argument of `fn`.
        y_pred: `y_pred` argument of `fn`.
        weights: Weights tensor.
        mask: Mask tensor.

    # Returns
        Scalar tensor.
    """
    # score_array has ndim >= 2
    score_array = fn(y_true, y_pred)
    if mask is not None:
        # Cast the mask to floatX to avoid float64 upcasting in Theano
        mask = K.cast(mask, K.floatx())
        # mask should have the same shape as score_array
        score_array *= mask
        # the loss per batch should be proportional
        # to the number of unmasked samples.
        score_array /= K.mean(mask) + K.epsilon()

    # apply sample weighting
    if weights is not None:
        # reduce score_array to same ndim as weight array
        ndim = K.ndim(score_array)
        weight_ndim = K.ndim(weights)
        score_array = K.mean(score_array,
                             axis=list(range(weight_ndim, ndim)))
        score_array *= weights
        score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
    return K.mean(score_array)
This wrapper first calculates the desired loss (the call to fn(y_true, y_pred)), then applies weighting if weights were passed (either through sample_weight or class_weight).
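The weighting branch can be restated in NumPy to see what it does to concrete numbers. This is a sketch for one-dimensional inputs only, with a hypothetical helper name and made-up loss values. It skips the mask handling and the axis reduction.

```python
import numpy as np

def weighted_mean_loss(per_sample_loss, weights):
    # NumPy restatement of the weighting branch, for 1-D inputs only.
    score_array = per_sample_loss * weights
    # Normalize by the fraction of samples with a nonzero weight.
    score_array = score_array / np.mean((weights != 0).astype(float))
    return np.mean(score_array)

per_sample_loss = np.array([1.0, 9.0, 4.0])

uniform = weighted_mean_loss(per_sample_loss, np.array([1.0, 1.0, 1.0]))
skewed = weighted_mean_loss(per_sample_loss, np.array([1.0, 0.0, 2.0]))
print(uniform, skewed)
```

With the second weight vector, the middle sample drops out entirely and the last one counts double, so the result is the weighted sum divided by the number of samples that have a nonzero weight.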
With this context in mind:
what is the concrete influence of this sample_weight parameter on the training of my model.
The weights are multiplied into the per-sample loss (and the result is normalized). Samples with "heavy" weights (greater than 1) contribute more loss and therefore larger gradients. "Light" weights reduce the importance of a sample and lead to smaller gradients.
Does it influence the data augmentation?
It depends on what you mean. Here is what I can say from experience, where I perform augmentation before feeding a Keras data generator (I do so because there were issues in preprocessing which, as far as I know, still exist in Preprocessing 1.1.0):
When feeding already augmented data to the generator, the .flow call requires a sample weight list as long as the input data. So the influence of weighting on augmentation depends on how the weights are chosen. For a data point augmented N times, you may give each augmentation the same weight as the original, or 1/N of it, depending on the intent.
The default behaviour in Keras seems to be to assign the same weight to each augmentation (transform) that Keras performs. The code looks pretty clear, although I have never relied on it.
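For data that is augmented before it reaches the generator, the two weighting choices can be sketched as follows. The weights and the number of copies are hypothetical, and the sketch assumes the augmented copies of each source image are stored next to each other.

```python
import numpy as np

base_weights = np.array([1.0, 3.0])  # hypothetical weights, one per original image
n_aug = 4                            # hypothetical number of augmented copies per image

# Option 1: every augmented copy inherits the weight of its source image.
same_weight = np.repeat(base_weights, n_aug)

# Option 2: each source image's weight is split evenly across its copies.
split_weight = np.repeat(base_weights / n_aug, n_aug)

print(same_weight)
print(split_weight)
```

Either array has one entry per augmented sample, which is the length .flow expects for sample_weight.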
If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
The sample_weight parameter does not seem to interfere with validation_split. I have not looked into the code specifically, but splitting basically takes the input data and sets aside a portion for validation, whatever the data is. Adding sample_weight only changes what each data point looks like: without weights, a data point is (x, y); with weights, it becomes (x, y, weight).
