Sigmoid vs Binary Cross Entropy Loss - pytorch

In my torch model, the last layer is a torch.nn.Sigmoid() and the loss is the torch.nn.BCELoss.
During the training step, the following error occurred:
RuntimeError: torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.
Many models use a sigmoid layer right before the binary cross entropy layer.
In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits
or torch.nn.BCEWithLogitsLoss. binary_cross_entropy_with_logits and BCEWithLogits are
safe to autocast.
However, when I try to reproduce this error by computing the loss and running backpropagation, everything works correctly:
import torch
from torch import nn
# last layer
sigmoid = nn.Sigmoid()
# loss
bce_loss = nn.BCELoss()
# the true classes
true_cls = torch.tensor([
    [0.],
    [1.]])
# model prediction classes
pred_cls = sigmoid(
    torch.tensor([
        [0.4949],
        [0.4824]], requires_grad=True)
)
pred_cls
# tensor([[0.6213],
#         [0.6183]], grad_fn=<SigmoidBackward>)
out = bce_loss(pred_cls, true_cls)
out
# tensor(0.7258, grad_fn=<BinaryCrossEntropyBackward>)
out.backward()
What am I missing?
I appreciate any help you can provide.

You have to move everything to CUDA first and enable autocast, like this:
import torch
from torch import nn
from torch.cuda.amp import autocast
# last layer
sigmoid = nn.Sigmoid().cuda()
# loss
bce_loss = nn.BCELoss().cuda()
# the true classes
true_cls = torch.tensor([
    [0.],
    [1.]]).cuda()
with autocast():
    # model prediction classes
    pred_cls = sigmoid(
        torch.tensor([
            [0.4949],
            [0.4824]], requires_grad=True
        ).cuda()
    )
    pred_cls
    # tensor([[0.6213],
    #         [0.6183]], grad_fn=<SigmoidBackward>)
    out = bce_loss(pred_cls, true_cls)
    out
    # tensor(0.7258, grad_fn=<BinaryCrossEntropyBackward>)
    out.backward()
RuntimeError: torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.
Many models use a sigmoid layer right before the binary cross entropy layer.
In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits
or torch.nn.BCEWithLogitsLoss. binary_cross_entropy_with_logits and BCEWithLogits are
safe to autocast.
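For completeness, here is a minimal sketch of the fix the error message suggests (not part of the original answer): drop the final Sigmoid and BCELoss and feed the raw logits to torch.nn.BCEWithLogitsLoss, which applies the sigmoid internally and is safe to autocast. The tensor values are reused from the example above:
import torch
from torch import nn
from torch.cuda.amp import autocast
# combined sigmoid + binary cross entropy, safe to autocast
bce_logits_loss = nn.BCEWithLogitsLoss().cuda()
true_cls = torch.tensor([[0.], [1.]]).cuda()
# raw model outputs (logits), i.e. the values before the Sigmoid layer
logits = torch.tensor([[0.4949], [0.4824]], requires_grad=True).cuda()
with autocast():
    out = bce_logits_loss(logits, true_cls)
out.backward()
out
# matches the sigmoid + BCELoss value (~0.7258), without the autocast error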

Related

Intermediate Layer loss calculation for conditional Computation

I want to create an MLP-based custom CNN model (multi-scaled) that consists of several parallel small networks (capsules). These simple small networks are instantiated as a custom layer (conv2d->Flatten->Dense) for each convolution scale, i.e. 3x3, 5x5. The purpose of these capsule networks is to produce intermediate losses that help reduce the overall global loss of the CNN model. I have written some sketchy code, but I'm not able to write the correct code for computing the local loss using these capsules. Here's the code:
from tensorflow.keras import layers
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Layer
class capsule(tf.keras.layers.Layer):
    def __init__(self):
        super(capsule, self).__init__()
        self.loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        self.Flatten = tf.keras.layers.Flatten()
        self.conv2D = tf.keras.layers.Conv2D(3, 3, (1, 1), padding='same', activation='relu', name="LocalLoss3x3")
        self.classifier = tf.keras.layers.Dense(10, activation='softmax', name='capsule3Output')
    def call(self, inputs):
        x = self.conv2D(inputs)
        x = self.Flatten(x)
        x = self.classifier(x)
        pred = self(x_train)
        loss = self.loss_fn(pred, y_train)
        #self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return loss, x
(x_train, y_train), (x_test, y_test) = mnist.load_data()
from tensorflow.keras import layers
class SparseMLP(tf.keras.models.Model):
    def __init__(self, output_dim):
        super(SparseMLP, self).__init__()
        self.dense_1 = layers.Dense(1, activation=tf.nn.relu)
        self.capsule = capsule()
        self.dense_2 = layers.Dense(output_dim)
    def call(self, inputs):
        x = self.dense_1(inputs)
        loss, x = self.capsule(inputs)
        return self.dense_2(x)
mlp = SparseMLP(10)
#x_train=x_train.reshape(-1,28,28,1)
y = mlp(x_train)
To include a loss within a layer, you can use the add_loss function of the tf.keras.layers.Layer class. This function takes a loss value and adds it to the global loss defined in the compile() function.
You can call self.add_loss(loss_value) from inside the call method of a custom layer. Losses added in this way get added to the "main" loss during training (the one passed to compile()).
So to make your model consider the losses from the intermediate layer, you should uncomment the add_loss call and then train the model the usual way.
Please note that it is totally fine not to declare a "main" loss in the compile function, as there already is a loss being defined in your layer class.
Note that when you pass losses via add_loss(), it becomes possible to call compile() without a loss function, since the model already has a loss to minimize.
Please note that the call function of the SparseMLP model should look like this:
def call(self, inputs):
    x = self.dense_1(inputs)
    # I don't know whether you intend to pass `inputs` (rather than `x`) into
    # the capsule. Currently the output from dense_1 is not used at all, so
    # make sure you are passing the proper inputs to each layer.
    # You also do not have to return the loss here, as it will be tracked
    # internally by Keras.
    x = self.capsule(inputs)
    return self.dense_2(x)
So running your model like below should do the trick:
model.compile(loss="<your main loss, if there is one>", metrics=["<your metrics>"])
model.fit(x=train_inst, y=train_targets)
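For illustration, here is a minimal sketch of the capsule layer with the add_loss() call uncommented, using the squared-activation term from your commented-out line as the intermediate loss; the rate weighting factor is hypothetical, since it is not defined in your original __init__:
import tensorflow as tf

class Capsule(tf.keras.layers.Layer):
    def __init__(self, rate=1e-3):
        super(Capsule, self).__init__()
        self.rate = rate  # hypothetical weighting factor for the intermediate loss
        self.conv2D = tf.keras.layers.Conv2D(3, 3, (1, 1), padding='same',
                                             activation='relu', name="LocalLoss3x3")
        self.flatten = tf.keras.layers.Flatten()
        self.classifier = tf.keras.layers.Dense(10, activation='softmax',
                                                name='capsule3Output')
    def call(self, inputs):
        x = self.conv2D(inputs)
        x = self.flatten(x)
        x = self.classifier(x)
        # register the intermediate loss; Keras adds it to the "main" loss
        # passed to compile() (if any) during training
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return x  # only the activations are returned; the loss is tracked internally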

How to specify the loss function when finetuning a model using the Huggingface TFTrainer Class?

I have followed the basic example as given below, from: https://huggingface.co/transformers/training.html
from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments
model = TFBertForSequenceClassification.from_pretrained("bert-large-uncased")
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)
trainer = TFTrainer(
    model=model,                      # the instantiated 🤗 Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=tfds_train_dataset, # tensorflow_datasets training dataset
    eval_dataset=tfds_test_dataset    # tensorflow_datasets evaluation dataset
)
trainer.train()
But there seems to be no way to specify the loss function for the classifier. For example, if I fine-tune on a binary classification problem, I would use
tf.keras.losses.BinaryCrossentropy(from_logits=True)
else I would use
tf.keras.losses.CategoricalCrossentropy(from_logits=True)
My setup is as follows:
transformers==4.3.2
tensorflow==2.3.1
python==3.6.12
Trainer has the capability to customize the loss via compute_loss.
For more, you can look into the documentation:
https://huggingface.co/docs/transformers/main_classes/trainer#:~:text=passed%20at%20init.-,compute_loss,-%2D%20Computes%20the%20loss
Here is an example of how to customize Trainer to use a weighted loss (useful when you have an unbalanced training set):
import torch
from torch import nn
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
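Usage would then look roughly like this (argument names are illustrative; note that this is the PyTorch Trainer/TrainingArguments rather than the TF classes from the question):
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="./results", num_train_epochs=3)
trainer = CustomTrainer(
    model=model,                  # your instantiated Transformers model
    args=training_args,
    train_dataset=train_dataset,  # your training dataset
    eval_dataset=eval_dataset,    # your evaluation dataset
)
trainer.train()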
Alternatively, create a class which inherits from PreTrainedModel and compute your respective loss function inside its forward method.
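A minimal sketch of that idea, assuming a PyTorch BertForSequenceClassification backbone; the class name, class weights, and keyword handling are illustrative, not part of the original answer:
import torch
from torch import nn
from transformers import BertForSequenceClassification
from transformers.modeling_outputs import SequenceClassifierOutput

class WeightedLossBert(BertForSequenceClassification):
    # Hypothetical subclass: computes a weighted loss inside forward,
    # so Trainer can pick the loss up from the model output directly.
    def forward(self, labels=None, **kwargs):
        # run the usual forward pass without labels so the default loss is skipped
        outputs = super().forward(**kwargs)
        logits = outputs.logits
        loss = None
        if labels is not None:
            weights = torch.tensor([1.0, 2.0], device=logits.device)  # illustrative weights
            loss_fct = nn.CrossEntropyLoss(weight=weights)
            loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )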

mse loss function not compatible with regularization loss (add_loss) on hidden layer output

I would like to code in tf.keras a neural network with a couple of loss functions. One is a standard MSE (mean squared error) scaled by a factor, while the other is basically a regularization term on the output of a hidden layer. This second loss is added through self.add_loss() in a user-defined class inheriting from tf.keras.layers.Layer. I have a couple of questions (the first is the more important one).
1) The error I get when trying to combine the two losses together is the following:
ValueError: Shapes must be equal rank, but are 0 and 1
From merging shape 0 with other shapes. for '{{node AddN}} = AddN[N=2, T=DT_FLOAT](loss/weighted_loss/value, model/new_layer/mul_1)' with input shapes: [], [100].
So it comes from the fact that the tensors which should add up to make one unique loss value have different shapes (and ranks). Still, when I try to print the losses during training, I clearly see that the vectors returned as losses have shape batch_size and rank 1. Could it be that when the two losses are summed I have to provide them (or at least the one passed to add_loss) as scalars? I know the MSE is usually returned as a vector where each entry is the MSE of one sample in the batch, hence having batch_size as its shape. I think I tried to do the same with the "regularization" loss. Do you have an explanation for this behavio(u)r?
The sample code which gives me error is the following:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input

def rate_mse(rate=1e5):
    # @tf.function # also needed for printing
    def loss(y_true, y_pred):
        tmp = rate*K.mean(K.square(y_pred - y_true), axis=-1)
        # tf.print('shape %s and rank %s output in mse'%(K.shape(tmp), tf.rank(tmp)))
        tf.print('shape and rank output in mse', [K.shape(tmp), tf.rank(tmp)])
        tf.print('mse loss:', tmp)  # print when I put tf.function
        return tmp
    return loss

class newLayer(tf.keras.layers.Layer):
    def __init__(self, rate=5e-2, **kwargs):
        super(newLayer, self).__init__(**kwargs)
        self.rate = rate

    # @tf.function # to be commented out for NN training
    def call(self, inputs):
        tmp = self.rate*K.mean(inputs*inputs, axis=-1)
        tf.print('shape and rank output in regularizer', [K.shape(tmp), tf.rank(tmp)])
        tf.print('regularizer loss:', tmp)
        self.add_loss(tmp, inputs=True)
        return inputs

tot_n = 10000
xx = np.random.rand(tot_n, 1)
yy = np.pi*xx
train_size = int(0.9*tot_n)
xx_train = xx[:train_size]; xx_val = xx[train_size:]
yy_train = yy[:train_size]; yy_val = yy[train_size:]

reg_layer = newLayer()
input_layer = Input(shape=(1,))  # input
hidden = Dense(20, activation='relu', input_shape=(2,))(input_layer)  # hidden layer
hidden = reg_layer(hidden)
output_layer = Dense(1, activation='linear')(hidden)

model = Model(inputs=[input_layer], outputs=[output_layer])
model.compile(optimizer='Adam', loss=rate_mse(), experimental_run_tf_function=False)
# model.compile(optimizer='Adam', loss=None, experimental_run_tf_function=False)
model.fit(xx_train, yy_train, epochs=100, batch_size=100,
          validation_data=(xx_val, yy_val), verbose=1)
# new_xx = np.random.rand(10,1); new_yy = np.pi*new_xx
# model.evaluate(new_xx, new_yy)
print(model.predict(np.array([[1]])))
2) I also have a secondary question related to this code. I noticed that printing with tf.print inside the function rate_mse only works with tf.function. Similarly, the call method of newLayer is only taken into account if the same decorator is commented out during training. Can someone explain why this is the case, or point me to a possible solution?
Thanks in advance to whoever can help. I am currently using TensorFlow 2.2.0 and the Keras version is 2.3.0-tf.
I was stuck with the same problem for a few days. The "standard" loss is already a scalar at the moment when we add it to the loss from add_loss. The only way I got it working is to add one more axis while calculating the mean, so that we also get a scalar and the two can be summed:
tmp = self.rate*K.mean(inputs*inputs, axis=[0, -1])
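A quick standalone check (not from the original post) that shows the shape difference between the two reductions:
import tensorflow as tf
from tensorflow.keras import backend as K

x = tf.random.uniform((100, 20))         # stand-in for a batch of hidden activations
per_sample = K.mean(x * x, axis=-1)      # shape (100,), rank 1 -> clashes with the scalar mse
as_scalar = K.mean(x * x, axis=[0, -1])  # shape (), rank 0 -> can be added to the scalar mse
print(per_sample.shape, as_scalar.shape)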

How to use the input gradients as variables within a custom loss function in Keras?

I am using the input gradient as feature importance and want to compare the feature importance of a training datapoint with the human-annotated feature importance. I would like to make this comparison differentiable such that it can be learned through backpropagation. For that, I am writing a custom loss function that, in addition to the regular loss (e.g. MSE on the predictions vs. the true labels), also checks whether the input gradient is correct (e.g. MSE of the input gradient vs. the human-annotated feature importance).
With the following code I am able to get the input gradient:
from keras import backend as K
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense
def normalize(x):
    # utility function to normalize a tensor by its L2 norm
    return x / (K.sqrt(K.mean(K.square(x))) + 1e-5)
# Amount of training samples
N = 1000
input_dim = 10
# Generate training set make the 1st and 2nd feature same as the target feature
X = np.random.standard_normal(size=(N, input_dim))
y = np.random.randint(low=0, high=2, size=(N, 1))
X[:, 1] = y[:, 0]
X[:, 2] = y[:, 0]
# Create simple model
inputs = Input(shape=(input_dim,))
x = Dense(10, name="dense1")(inputs)
output = Dense(1, activation='sigmoid')(x)
model = Model(input=[inputs], output=output)
# Compile and fit model
model.compile(optimizer='adam', loss="mse", metrics=['accuracy'])
model.fit([X], y, epochs=100, batch_size=64)
# Get function to get input gradients
gradients = K.gradients(model.output, model.input)[0]
gradient_function = K.function([model.input], [normalize(gradients)])
# Get input gradient values of the training-set
grads_val = gradient_function([X])[0]
print(grads_val[:2])
This prints the following (you can see that the 1st and the 2nd features have the highest importance):
[[ 1.2629046e-02 2.2765596e+00 2.1479919e+00 2.1558853e-02
4.5277486e-03 2.9851785e-03 9.5279224e-04 -1.0903150e-02
-1.2230731e-02 2.1960819e-02]
[ 1.1318034e-02 2.0402350e+00 1.9250139e+00 1.9320872e-02
4.0577268e-03 2.6752844e-03 8.5390132e-04 -9.7713526e-03
-1.0961102e-02 1.9681118e-02]]
How can I write a custom loss function in which the input gradients are differentiable?
I started with the following loss function.
from keras.losses import mean_squared_error

def custom_loss():
    # human annotated feature importance
    # Let's say that it says to only look at the second feature
    human_feature_importance = []
    for i in range(N):
        human_feature_importance.append([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

    def loss(y_true, y_pred):
        # Get regular loss
        regular_loss_value = mean_squared_error(y_true, y_pred)
        # Somehow get the input gradient of each training sample as a tensor
        # It should be differentiable w.r.t. all of the weights
        gradients = ??
        feature_importance_loss_value = mean_squared_error(gradients, human_feature_importance)
        # Combine both losses
        return regular_loss_value + feature_importance_loss_value

    return loss
I also found an implementation in TensorFlow to make the input gradient differentiable: https://github.com/dtak/rrr/blob/master/rrr/tensorflow_perceptron.py#L18
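A minimal TF2-style sketch of the idea described above, using nested tf.GradientTape so the input gradient stays differentiable with respect to the weights; the names model, optimizer, lam and human_importance are assumptions, not from the original post:
import tensorflow as tf

def train_step(model, optimizer, x, y, human_importance, lam=1.0):
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    human_importance = tf.convert_to_tensor(human_importance, dtype=tf.float32)
    with tf.GradientTape() as weight_tape:
        with tf.GradientTape() as input_tape:
            input_tape.watch(x)
            y_pred = model(x, training=True)
            pred_loss = tf.reduce_mean(tf.square(tf.cast(y, tf.float32) - y_pred))
        # gradient of the prediction loss w.r.t. the inputs, used as feature importance;
        # computed inside the outer tape so it stays differentiable w.r.t. the weights
        input_grad = input_tape.gradient(pred_loss, x)
        importance_loss = tf.reduce_mean(tf.square(input_grad - human_importance))
        total_loss = pred_loss + lam * importance_loss
    grads = weight_tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return total_loss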

Question about enabling/disabling dropout with keras functional API

I am using the Keras functional API to build a classifier, and I am using the training flag in the dropout layer to enable dropout when predicting new instances (in order to get an estimate of the uncertainty). To get the expected response one needs to repeat this prediction several times, with Keras randomly (de)activating links in the dense layer, and of course it is computationally expensive. Therefore, I would also like to have the option of not using dropout at the prediction phase, i.e., using all the network links. Does anyone know how I can do this? Following is a sample of what I am doing. I checked whether predict has any relevant parameter, but it does not seem to (?). I can technically train the same model without the training flag at the dropout layer, but I do not want to do this (or rather, I want a cleaner solution than having two different models).
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.utils import to_categorical
from keras.layers import Dense
from keras.layers import Dropout
import numpy as np
import keras
# generate a 2d classification sample dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
trainy = to_categorical(trainy)
testy = to_categorical(testy)
inputlayer = keras.layers.Input((2,))
d = keras.layers.Dense(500, activation = 'relu')(inputlayer)
d1 = keras.layers.Dropout(rate = .3)(d,training = True)
out = keras.layers.Dense(2, activation = 'softmax')(d1)
model = keras.Model(inputs = inputlayer, outputs = out)
model.compile(loss = 'categorical_crossentropy',metrics = ['accuracy'],optimizer='adam')
model.fit(x = trainX, y = trainy, validation_data=(testX, testy),epochs=1000, verbose=1)
# another prediction on a specific sample
print(model.predict(testX[0:1,:]))
# another prediction on the same sample
print(model.predict(testX[0:1,:]))
Running the above example I get the following output:
[[0.9230819 0.07691813]]
[[0.8222245 0.17777553]]
which is as expected, different class probabilities for the same input, since there is a random (de)activation of the links from the dropout layer.
Any suggestions on how I can enable/disable dropout at the prediction phase with the functional API?
Sure, you do not need to set the training flag when building the Dropout layer. After training your model you define this function:
from keras import backend as K

mc_func = K.function([model.input, K.learning_phase()],
                     [model.output])
Then you call mc_func with your input and flag 1 to enable dropout, or 0 to disable it:
stochastic_pred = mc_func([some_input, 1])
deterministic_pred = mc_func([some_input, 0])
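As a follow-up usage sketch (some_input and the number of passes T are illustrative), you can average several stochastic passes to get the uncertainty estimate mentioned in the question:
import numpy as np

T = 20  # number of stochastic forward passes with dropout enabled
mc_samples = np.stack([mc_func([some_input, 1])[0] for _ in range(T)])
mean_pred = mc_samples.mean(axis=0)  # Monte Carlo predictive mean
std_pred = mc_samples.std(axis=0)    # spread across passes as an uncertainty proxy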
