torch.nn.CrossEntropyLoss over Multiple Batches - pytorch

I am currently working with torch.nn.CrossEntropyLoss. As far as I know, it is common to compute the loss batch-wise. However, is there a possibility to compute the loss over multiple batches?
More concretely, assume we are given the data
import torch
features = torch.randn(no_of_batches, batch_size, feature_dim)
targets = torch.randint(low=0, high=10, size=(no_of_batches, batch_size))
loss_function = torch.nn.CrossEntropyLoss()
Is there a way to compute in one line
loss = loss_function(features, targets) # raises RuntimeError: Expected target size [no_of_batches, feature_dim], got [no_of_batches, batch_size]
?
Thank you in advance!

You can compute multiple cross-entropy losses, but you'll need to do your own reduction. Since nn.CrossEntropyLoss expects the class dimension (here feature_dim) to be the second dimension of the input tensor, you will also need to permute the features first.
loss_function = torch.nn.CrossEntropyLoss(reduction='none')
loss = loss_function(features.permute(0,2,1), targets).mean(dim=1)
which will result in a loss tensor with no_of_batches entries.
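For completeness, here is a minimal runnable sketch of the same idea; the concrete values for no_of_batches, batch_size and feature_dim are illustrative, not taken from the question:
import torch

no_of_batches, batch_size, feature_dim = 4, 8, 10   # illustrative values; feature_dim = number of classes
features = torch.randn(no_of_batches, batch_size, feature_dim)
targets = torch.randint(low=0, high=10, size=(no_of_batches, batch_size))

# reduction='none' keeps one loss value per sample instead of a single scalar
loss_function = torch.nn.CrossEntropyLoss(reduction='none')

# CrossEntropyLoss expects the class dimension in position 1, so move feature_dim there
per_sample = loss_function(features.permute(0, 2, 1), targets)   # shape: (no_of_batches, batch_size)
loss = per_sample.mean(dim=1)                                    # shape: (no_of_batches,)
print(loss.shape)   # torch.Size([4])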

Related

Why my cross entropy loss function does not converge?

I tried to write a cross-entropy loss function by myself. My loss function gives the same loss value as the official one, but when I use my loss function in the code instead of the official cross-entropy loss function, the code does not converge. When I use the official cross-entropy loss function, the code converges. Here is my code, please give me some suggestions. Thanks very much.
The input 'out' is a tensor (B × C) and 'label' contains class indices (1 × B).
import torch
import torch.nn as nn
from torch.autograd import Variable

class MylossFunc(nn.Module):
    def __init__(self):
        super(MylossFunc, self).__init__()

    def forward(self, out, label):
        out = torch.nn.functional.softmax(out, dim=1)
        n = len(label)
        loss = torch.FloatTensor([0])
        loss = Variable(loss, requires_grad=True)
        tmp = torch.log(out)
        #print(out)
        torch.scalar_tensor(-100)   # this line has no effect (the result is discarded)
        for i in range(n):
            loss = loss - torch.max(tmp[i][label[i]], torch.scalar_tensor(-100)) / n
        loss = torch.sum(loss)
        return loss
Instead of using torch.softmax and torch.log, you should use torch.log_softmax, otherwise your training will become unstable with nan values everywhere.
This happens because when you take the softmax of your logits using the following line:
out = torch.nn.functional.softmax(out, dim=1)
you might get a zero in one of the components of out, and applying torch.log to it then yields -inf (log(0) is undefined), which typically propagates into nan values during training. That is why torch (and other common libraries) provide a single stable operation, log_softmax, to avoid the numerical instabilities that occur when you use torch.softmax and torch.log individually.
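As a hedged sketch (not the asker's exact code), the same loss can be written with log_softmax so the log never sees a hard zero:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyStableLoss(nn.Module):
    # illustrative rewrite using log_softmax for numerical stability
    def forward(self, out, label):
        log_probs = F.log_softmax(out, dim=1)                   # stable log(softmax(out))
        picked = log_probs[torch.arange(len(label)), label]     # log-probability of the true class
        return -picked.mean()                                   # negative log likelihood

# usage sketch with random data
logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
loss = MyStableLoss()(logits, labels)
loss.backward()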

Pytorch Categorical Cross Entropy loss function behaviour

I have a question regarding the computation made by the categorical cross-entropy loss in PyTorch.
I have made this simple code snippet, and because I use the argmax of the output tensor as the targets, I cannot understand why the loss is still high.
import torch
import torch.nn as nn
ce_loss = nn.CrossEntropyLoss()
output = torch.randn(3, 5, requires_grad=True)
targets = torch.argmax(output, dim=1)
loss = ce_loss(output, targets)
print(loss)
Thanks for the help understanding it.
Best regards
Jerome
So here is sample data from your code, with the output, label and loss having the following values:
outputs = tensor([[ 0.5968, -0.8249,  1.5018,  2.7888, -0.6125],
                  [-1.1534, -0.4921,  1.0688,  0.2241, -0.0257],
                  [ 0.3747,  0.8957,  0.0816,  0.0745,  0.2695]], requires_grad=True)
labels = tensor([3, 2, 1])
loss = tensor(0.7354, grad_fn=<NllLossBackward>)
So let's examine the values,
If you compute the softmax output of your logits (outputs), using something like torch.softmax(outputs, dim=1), you will get
probs = tensor([[0.0771, 0.0186, 0.1907, 0.6906, 0.0230],
                [0.0520, 0.1008, 0.4801, 0.2063, 0.1607],
                [0.1972, 0.3321, 0.1471, 0.1461, 0.1775]], grad_fn=<SoftmaxBackward>)
So these will be your prediction probabilities.
Now cross-entropy loss is nothing but a combination of softmax and negative log likelihood loss. Hence, your loss can simply be computed using
loss = (torch.log(1/probs[0,3]) + torch.log(1/probs[1,2]) + torch.log(1/probs[2,1])) / 3
which is the average of the negative log of the probabilities of your true labels. The above expression evaluates to 0.7354, which is equal to the value returned by the nn.CrossEntropyLoss module.
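A small self-contained sketch (random data, illustrative shapes) that checks this equivalence:
import torch
import torch.nn as nn

outputs = torch.randn(3, 5, requires_grad=True)
labels = torch.randint(0, 5, (3,))

ce = nn.CrossEntropyLoss()(outputs, labels)

# manual computation: average negative log-probability of the true class
probs = torch.softmax(outputs, dim=1)
manual = -torch.log(probs[torch.arange(3), labels]).mean()

print(torch.allclose(ce, manual))   # True (up to floating-point error)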

Pytorch: Custom Loss involving Norm of End-to-End Jacobian

Cross-posting from the PyTorch discussion boards.
I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) as well as a penalty on the Frobenius norm of the end-to-end Jacobian (i.e. if f(x) is the output of the network, \nabla_x f(x)).
I've implemented a model that can successfully learn using nn.CrossEntropyLoss. However, when I try adding the second loss function (by doing two backward passes), my training loop runs, but the model never learns. Furthermore, if I calculate the end-to-end Jacobian but don't include it in the loss function, the model also never learns. At a high level, my code does the following:
1. Forward pass to get predicted classes, yhat, from inputs x
2. Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
3. Jacobian norm = x.grad.data.norm(2)
4. Set loss equal to classification loss + scalar coefficient * Jacobian norm
5. Run loss.backward()
I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.
Too much is required to produce a working example, so I’ve tried to extract the relevant code:
def train_model(model, train_dataloader, optimizer, loss_fn, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    train_loss = 0
    correct = 0
    for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
        batch_input, batch_target = batch_input.to(device), batch_target.to(device)
        optimizer.zero_grad()
        batch_input.requires_grad_(True)
        model_batch_output = model(batch_input)
        loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
        train_loss += loss.item()  # sum up batch loss
        loss.backward()
        optimizer.step()
and
def end_to_end_jacobian_loss(model_output, model_input):
    model_output.backward(
        torch.ones(*model_output.shape),
        retain_graph=True)
    jacobian = model_input.grad.data
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
Edit 1: I swapped my previous .backward()-based implementation for autograd.grad and it apparently works! What's the difference?
def end_to_end_jacobian_loss(model_output, model_input):
    jacobian = autograd.grad(
        outputs=model_output['penultimate_layer'],
        inputs=model_input,
        grad_outputs=torch.ones(*model_output['penultimate_layer'].shape),
        retain_graph=True,
        only_inputs=True)[0]
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
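For reference, one difference worth noting: .backward() accumulates gradients into the .grad buffers of every leaf tensor that requires grad (including the model parameters), whereas autograd.grad returns the requested gradient directly and leaves those buffers untouched; create_graph=True is needed if the Jacobian penalty itself should be differentiated. A minimal sketch with a toy model (illustrative shapes, not the poster's network):
import torch
from torch import autograd, nn

model = nn.Linear(4, 3)                       # toy model, illustrative only
x = torch.randn(2, 4, requires_grad=True)
out = model(x)

jacobian = autograd.grad(outputs=out,
                         inputs=x,
                         grad_outputs=torch.ones_like(out),
                         retain_graph=True,
                         create_graph=True)[0]   # create_graph so the penalty can be backpropagated
penalty = jacobian.norm(2)

print(model.weight.grad)   # None: autograd.grad did not touch the parameter .grad buffers
print(x.grad)              # None as well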

Keras ImageDataGenerator sample_weight with data augmentation

I have a question about the use of the sample_weight parameter in the context of data augmentation in Keras with the ImageDataGenerator. Let's say I have a series of simple images with just one class of objects. So, for each image, I will have a corresponding mask with pixels = 0 for the background and 1 for where the object is labeled.
However, this dataset is imbalanced, because a significant number of these images are empty, meaning their masks contain only 0s.
If I understood well, the 'sample_weight' parameter of the flow method of ImageDataGenerator is there to put the focus on the samples of my dataset that I find more interesting, i.e. where my object is present.
My question is: what is the concrete influence of this sample_weight parameter on the training of my model? Does it influence the data augmentation? If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
Here is the part of my code my question refers to:
data_gen_args = dict(rotation_range=90,
                     width_shift_range=0.4,
                     height_shift_range=0.4,
                     zoom_range=0.4,
                     horizontal_flip=True,
                     fill_mode='reflect',
                     rescale=1. / 255,
                     validation_split=0.2,
                     data_format='channels_last')

image_datagen = ImageDataGenerator(**data_gen_args)

imf = image_datagen.flow(
    x=stacked_images_channel,
    y=stacked_masks_channel,
    batch_size=batch_size,
    shuffle=False,
    seed=seed,
    subset='training',
    sample_weight=sample_weight,
    save_to_dir='traindir',
    save_prefix='train_'
)

valf = image_datagen.flow(
    x=stacked_images_channel,
    y=stacked_masks_channel,
    batch_size=batch_size,
    shuffle=False,
    seed=seed,
    subset='validation',
    sample_weight=sample_weight,
    save_to_dir='valdir',
    save_prefix='val_'
)

STEP_SIZE_TRAIN = imf.n // imf.batch_size
STEP_SIZE_VALID = valf.n // valf.batch_size

model = unet.UNet2(numberOfClasses, imshape, '', learningRate, depth=4)
history = model.fit_generator(generator=imf,
                              steps_per_epoch=STEP_SIZE_TRAIN,
                              epochs=epochs,
                              validation_data=valf,
                              validation_steps=STEP_SIZE_VALID,
                              verbose=2)
Thank you in advance for your attention.
As of Keras 2.2.5 with Keras Preprocessing 1.1.0, the sample_weight is passed along with the samples and applied during processing. When calling .fit_generator, the model is trained on batches, with each batch using the sample weights:
model.train_on_batch(x, y,
                     sample_weight=sample_weight,
                     class_weight=class_weight)
In the source code of .train_on_batch, the documentation states: "sample_weight: Optional array of the same length as x, containing weights to apply to the model's loss for each sample. (...)". The actual application of weights happens when calculating loss on each batch. When compiling a model, Keras generates a "weighted loss" function out of the desired loss function. The weighted computation is stated in the code as:
def weighted(y_true, y_pred, weights, mask=None):
    """Wrapper function.

    # Arguments
        y_true: `y_true` argument of `fn`.
        y_pred: `y_pred` argument of `fn`.
        weights: Weights tensor.
        mask: Mask tensor.

    # Returns
        Scalar tensor.
    """
    # score_array has ndim >= 2
    score_array = fn(y_true, y_pred)
    if mask is not None:
        # Cast the mask to floatX to avoid float64 upcasting in Theano
        mask = K.cast(mask, K.floatx())
        # mask should have the same shape as score_array
        score_array *= mask
        # the loss per batch should be proportional
        # to the number of unmasked samples.
        score_array /= K.mean(mask) + K.epsilon()
    # apply sample weighting
    if weights is not None:
        # reduce score_array to same ndim as weight array
        ndim = K.ndim(score_array)
        weight_ndim = K.ndim(weights)
        score_array = K.mean(score_array,
                             axis=list(range(weight_ndim, ndim)))
        score_array *= weights
        score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
    return K.mean(score_array)
This wrapper shows that the desired loss is first calculated (the call to fn(y_true, y_pred)), and then weighting is applied if weights were passed (either via sample_weight or class_weight).
With this context in mind:
what is the concrete influence of this sample_weight parameter on the training of my model.
Weights are basically multiplied into the loss (and normalized). So samples with "heavy" weights (greater than 1) contribute more to the loss and produce larger gradients, while "light" weights reduce the importance of a sample and lead to smaller gradients.
Does it influence the data augmentation?
It depends on what you mean. Here is what I can say from experience, where I perform augmentation before feeding the data to a Keras generator (I do this because of issues in the preprocessing package which, as far as I know, still exist in Preprocessing 1.1.0):
When feeding already-augmented data to the generator, the .flow call will require a sample weights list as long as the input data. So the influence of weighting on augmentation depends on how the weights are chosen: a data point augmented N times may be given the same weight for each augmented copy, or a weight of 1/N, depending on the intent.
The default behaviour in Keras seems to assign the same weight to each augmentation (transform) performed by Keras. The code looks pretty clear, although I have never relied on it.
If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
The sample_weight parameter does not seem to interfere with validation_split. I have not looked into the code specifically, but splitting basically takes the input data and keeps a split for validation, whatever the data is. When sample_weight is added, what changes is each data point: without weight, data is (x, y); with weight, data becomes (x, y, weight).
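As a hedged illustration of how such a sample_weight array might be built before being passed to .flow (dummy mask array standing in for stacked_masks_channel; the weight factor is arbitrary):
import numpy as np

# illustrative dummy masks of shape (n_samples, H, W, 1)
masks = np.zeros((100, 64, 64, 1), dtype='float32')
masks[:30, 20:40, 20:40, 0] = 1.0            # 30 images contain the object, 70 are empty

is_empty = masks.reshape(len(masks), -1).max(axis=1) == 0
sample_weight = np.ones(len(masks), dtype='float32')
sample_weight[~is_empty] = 5.0               # arbitrary up-weighting of non-empty images

# sample_weight can then be passed to image_datagen.flow(..., sample_weight=sample_weight)
print(sample_weight[:5], sample_weight.mean())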

Multi-label classification with class weights in Keras

I have 1000 classes in the network and they have multi-label outputs. For each training example, the number of positive outputs is the same (i.e. 10), but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and the remaining 990 have output 0.
For the multi-label classification, I am using 'binary cross-entropy' as the cost function and 'sigmoid' as the activation function. When I applied the rule of 0.5 as the cut-off for 1 or 0, all of the outputs were 0. I understand this is a class imbalance problem. From this link, I understand that I might have to create extra output labels. Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in Keras.
nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam=keras.optimizers.adam(lr=0.00001)
model.compile('adam', 'binary_crossentropy')
history = model.fit(
    X_train, Y_train, batch_size=32, epochs=50, verbose=0, shuffle=False)
Could anyone help me with the code here? I would also highly appreciate it if you could suggest a good 'accuracy' metric for this problem.
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions. Especially the class imbalance problem.
In terms of metrics there are several possibilities: in my case I use the top 1/2/3/4/5 results and check if one of them is right. Because in your case you always have the same number of labels equal to 1, you could take your top 10 results, see what percentage of them are correct, and average this result over your batch size. I didn't find a way to include this algorithm as a Keras metric. Instead, I wrote a callback which calculates the metric at epoch end on my validation data set.
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter class is really convenient for this purpose.
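As a rough sketch of the top-N check and class-count idea described above (hypothetical helper names; it assumes y_true is a binary matrix and y_pred contains per-class scores):
import numpy as np
from collections import Counter

def top_n_hit_rate(y_true, y_pred, n=10):
    # fraction of the top-n predicted classes that are actually positive, averaged over samples
    top_n = np.argsort(y_pred, axis=1)[:, -n:]
    hits = np.take_along_axis(y_true, top_n, axis=1)
    return hits.mean()

def top_n_class_counts(y_pred, n=10):
    # how often each class appears among the top-n predictions
    top_n = np.argsort(y_pred, axis=1)[:, -n:]
    return Counter(top_n.ravel().tolist())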
Edit: I found a method to include class weights without splitting the output.
You need a NumPy 2D array containing the weights, with shape [number of classes to predict, 2] (background and signal).
Such an array could be calculated with this function:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def calculating_class_weights(y_true):
    number_dim = np.shape(y_true)[1]
    weights = np.empty([number_dim, 2])
    for i in range(number_dim):
        weights[i] = compute_class_weight('balanced', [0., 1.], y_true[:, i])
    return weights
The solution is now to build your own binary crossentropy loss function in which you multiply your weights yourself:
from keras import backend as K

def get_weighted_loss(weights):
    def weighted_loss(y_true, y_pred):
        return K.mean((weights[:, 0] ** (1 - y_true)) * (weights[:, 1] ** y_true) * K.binary_crossentropy(y_true, y_pred), axis=-1)
    return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))
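As a hedged usage sketch tying the two pieces together (it assumes Y_train is the binary label matrix from the question and that the two functions above are in scope):
from keras.optimizers import Adam

# compute per-class background/signal weights from the training labels
class_weights = calculating_class_weights(Y_train)
model.compile(optimizer=Adam(lr=0.00001), loss=get_weighted_loss(class_weights))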
