weighted_masked_objective in keras - keras

According to Keras Code, Keras computes the loss value considering optional weights and masks.
for i in range(len(self.outputs)):
if i in skip_target_indices:
continue
y_true = self.targets[i]
y_pred = self.outputs[i]
weighted_loss = weighted_losses[i]
sample_weight = sample_weights[i]
mask = masks[i]
loss_weight = loss_weights_list[i]
with K.name_scope(self.output_names[i] + '_loss'):
output_loss = weighted_loss(y_true, y_pred,
sample_weight, mask)
if len(self.outputs) > 1:
self.metrics_tensors.append(output_loss)
self.metrics_names.append(self.output_names[i] + '_loss')
if total_loss is None:
total_loss = loss_weight * output_loss
else:
total_loss += loss_weight * output_loss
On the other hand, in Keras documentation I see the basic loss function is introduced in compile function, and then sample or class weights can be introduced in fit command.
I am not sure how to relate 'weights and masks' in the first code to 'sample and class weights' in the second document. Can anybody give me more explanation?
My application is actually a Convolutional LSTM network, where I input a series of images and want the network to produce an output map (with the same size of input maps) of pixel classes, but some pixels don't have valid labels during training. Should I use weight or mask, sample or class?

Related

Why is the output of a convolutional network so exponentially large?

I am trying to reproduce a unet result on Carvana dataset using Ternausnet in PyTorch using Lightning.
I am using DiceLoss for that with sigmoid activation function. I think I am running into an issue of a vanishing gradient, because all gradients of weights are 0, and I see the output of the network with min value of order 10^8.
What could be the issue here? How can I address the vanishing gradient? Also, if I use a different criterion, I see a problem of loss going into negative values without stopping (for BCE with logits, for instance).
Here is the code for my Dice loss:
class DiceLoss(nn.Module):
def __init__(self):
super().__init__()
def forward(self, logits, targets, eps=0, threshold=None):
# comment out if your model contains a sigmoid or
# equivalent activation layer
proba = torch.sigmoid(logits)
proba = proba.view(proba.shape[0], 1, -1)
targets = targets.view(targets.shape[0], 1, -1)
if threshold:
proba = (proba > threshold).float()
# flatten label and prediction tensors
intersection = torch.sum(proba * targets, dim=1)
summation = torch.sum(proba, dim=1) + torch.sum(targets, dim=1)
dice = (2.0 * intersection + eps) / (summation + eps)
# print(intersection, summation, dice)
return (1 - dice).mean()

mse loss function not compatible with regularization loss (add_loss) on hidden layer output

I would like to code in tf.Keras a Neural Network with a couple of loss functions. One is a standard mse (mean squared error) with a factor loading, while the other is basically a regularization term on the output of a hidden layer. This second loss is added through self.add_loss() in a user-defined class inheriting from tf.keras.layers.Layer. I have a couple of questions (the first is more important though).
1) The error I get when trying to combine the two losses together is the following:
ValueError: Shapes must be equal rank, but are 0 and 1
From merging shape 0 with other shapes. for '{{node AddN}} = AddN[N=2, T=DT_FLOAT](loss/weighted_loss/value, model/new_layer/mul_1)' with input shapes: [], [100].
So it comes from the fact that the tensors which should add up to make one unique loss value have different shapes (and ranks). Still, when I try to print the losses during the training, I clearly see that the vectors returned as losses have shape batch_size and rank 1. Could it be that when the 2 losses are summed I have to provide them (or at least the loss of add_loss) as scalar? I know the mse is usually returned as a vector where each entry is the mse from one sample in the batch, hence having batch_size as shape. I think I tried to do the same with the "regularization" loss. Do you have an explanation for this behavio(u)r?
The sample code which gives me error is the following:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input
def rate_mse(rate=1e5):
#tf.function # also needed for printing
def loss(y_true, y_pred):
tmp = rate*K.mean(K.square(y_pred - y_true), axis=-1)
# tf.print('shape %s and rank %s output in mse'%(K.shape(tmp), tf.rank(tmp)))
tf.print('shape and rank output in mse',[K.shape(tmp), tf.rank(tmp)])
tf.print('mse loss:',tmp) # print when I put tf.function
return tmp
return loss
class newLayer(tf.keras.layers.Layer):
def __init__(self, rate=5e-2, **kwargs):
super(newLayer, self).__init__(**kwargs)
self.rate = rate
# #tf.function # to be commented for NN training
def call(self, inputs):
tmp = self.rate*K.mean(inputs*inputs, axis=-1)
tf.print('shape and rank output in regularizer',[K.shape(tmp), tf.rank(tmp)])
tf.print('regularizer loss:',tmp)
self.add_loss(tmp, inputs=True)
return inputs
tot_n = 10000
xx = np.random.rand(tot_n,1)
yy = np.pi*xx
train_size = int(0.9*tot_n)
xx_train = xx[:train_size]; xx_val = xx[train_size:]
yy_train = yy[:train_size]; yy_val = yy[train_size:]
reg_layer = newLayer()
input_layer = Input(shape=(1,)) # input
hidden = Dense(20, activation='relu', input_shape=(2,))(input_layer) # hidden layer
hidden = reg_layer(hidden)
output_layer = Dense(1, activation='linear')(hidden)
model = Model(inputs=[input_layer], outputs=[output_layer])
model.compile(optimizer='Adam', loss=rate_mse(), experimental_run_tf_function=False)
#model.compile(optimizer='Adam', loss=None, experimental_run_tf_function=False)
model.fit(xx_train, yy_train, epochs=100, batch_size = 100,
validation_data=(xx_val,yy_val), verbose=1)
#new_xx = np.random.rand(10,1); new_yy = np.pi*new_xx
#model.evaluate(new_xx,new_yy)
print(model.predict(np.array([[1]])))
2) I would also have a secondary question related to this code. I noticed that printing with tf.print inside the function rate_mse only works with tf.function. Similarly, the call method of newLayer is only taken into consideration if the same decorator is commented during training. Can someone explain why this is the case or reference me to a possible solution?
Thanks in advance to whoever can provide me help. I am currently using Tensorflow 2.2.0 and keras version is 2.3.0-tf.
I stuck with the same problem for a few days. "Standard" loss is going to be a scalar at the moment when we add it to the loss from add_loss. The only way how I get it working is to add one more axis while calculating mean. So we will get a scalar, and it will work.
tmp = self.rate*K.mean(inputs*inputs, axis=[0, -1])

Keras ImageDataGenerator sample_weight with data augmentation

I have a question about the use of the sample_weight parameter in the context of data augmentation in Keras with the ImageDataGenerator. Let's say I have a series of simple images with just one class of objects. So, for each image, I will have a corresponding mask with pixels = 0 for the background and 1 for where the object is labeled.
However, this dataset is unbalanced because a significant amount of these images are empty, which mean with masks just containing 0.
If I understood well, the 'sample_weight' parameter of the flow method of ImageDataGenerator is here to put the focus on the the samples of my dataset that I find more interesting, i.e. where my object is present.
My question is: what is the concrete influence of this sample_weight parameter on the training of my model. Does it influence the data augmentation? If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
Here is the part of my code my question refers to:
data_gen_args = dict(rotation_range=90,
width_shift_range=0.4,
height_shift_range=0.4,
zoom_range=0.4,
horizontal_flip=True,
fill_mode='reflect',
rescale=1. / 255,
validation_split=0.2,
data_format='channels_last'
)
image_datagen = ImageDataGenerator(**data_gen_args)
imf = image_datagen.flow(
x=stacked_images_channel,
y=stacked_masks_channel,
batch_size=batch_size,
shuffle=False,
seed=seed,subset='training',
sample_weight = sample_weight,
save_to_dir = 'traindir',
save_prefix = 'train_'
)
valf = image_datagen.flow(
x=stacked_images_channel,
y=stacked_masks_channel,
batch_size=batch_size,
shuffle=False,
seed=seed,subset='validation',
sample_weight = sample_weight,
save_to_dir = 'valdir',
save_prefix = 'val_'
)
STEP_SIZE_TRAIN=imf.n//imf.batch_size
STEP_SIZE_VALID=valf.n//valf.batch_size
model = unet.UNet2(numberOfClasses, imshape, '', learningRate, depth=4)
history = model.fit_generator(generator=imf,
steps_per_epoch=STEP_SIZE_TRAIN,
epochs=epochs,
validation_data=valf,
validation_steps=STEP_SIZE_VALID,
verbose=2
)
Thank you in advance for your attention.
As for Keras 2.2.5 with preprocessing at 1.1.0, the sample_weight is passed along with the samples and applied during processing. When calling .fit_generator, the model is trained on batches, each batch using sample weights:
model.train_on_batch(x, y,
sample_weight=sample_weight,
class_weight=class_weight)
In the source code of .train_on_batch, the documentation states: "sample_weight: Optional array of the same length as x, containing weights to apply to the model's loss for each sample. (...)". The actual application of weights happens when calculating loss on each batch. When compiling a model, Keras generates a "weighted loss" function out of the desired loss function. The weighted computation is stated in the code as:
def weighted(y_true, y_pred, weights, mask=None):
"""Wrapper function.
# Arguments
y_true: `y_true` argument of `fn`.
y_pred: `y_pred` argument of `fn`.
weights: Weights tensor.
mask: Mask tensor.
# Returns
Scalar tensor.
"""
# score_array has ndim >= 2
score_array = fn(y_true, y_pred)
if mask is not None:
# Cast the mask to floatX to avoid float64 upcasting in Theano
mask = K.cast(mask, K.floatx())
# mask should have the same shape as score_array
score_array *= mask
# the loss per batch should be proportional
# to the number of unmasked samples.
score_array /= K.mean(mask) + K.epsilon()
# apply sample weighting
if weights is not None:
# reduce score_array to same ndim as weight array
ndim = K.ndim(score_array)
weight_ndim = K.ndim(weights)
score_array = K.mean(score_array,
axis=list(range(weight_ndim, ndim)))
score_array *= weights
score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
return K.mean(score_array)
This wrapper shows it first calculates the desired loss (call to fn(y_true, y_pred)), then applies weighing if weights where passed (either with sample_weight or class_weight).
With this context in mind:
what is the concrete influence of this sample_weight parameter on the training of my model.
Weights are basically multiplied to the loss (and normalized). So "heavy" weights (more than 1) samples cause more loss, so larger gradients. "Light" weights reduce the importance of the sample and lead to smaller gradients.
Does it influence the data augmentation?
It depends on what you mean. Here is what I can say from experience, where I perform augmentation before feeding a Keras data generator (doing so as there were issues in preprocessing, as far as I know still existing in Preprocessing 1.1.0):
When feeding already augmented data to the generator, the .flow call will require a sample weights list as long as the input data. So the influence of weighing on augmentation depends on how the weights are chosen. A data point augmented N times may assign the same weight to each augmentation, or 1/N depending on the intent.
The default behaviour in Keras seems to assign the same weight to each augmentation (transform) performed by Keras. The code looks pretty clear, although I have never relied on it.
If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
The sample_weight parameter does not seem to interfere with validation_split. I have not looked into the code specifically, but splitting basically gets the input data, and keeps a split for validation---whatever the data is. When sample_weight is added, what changes is each data point: Without weight, data is (x, y); with weight, data becomes (x, y, weight).

How to regularize a layer's kernel weights bias weights in a single regularization function?

The Keras documentation introduces separate classes for weight regularization and bias regularization. These can be subclasses to add a custom regularizer. An example from the Keras docs:
def my_regularizer(x):
return 1e-3 * tf.reduce_sum(tf.square(x))
where x can be either the kernel weights or the bias weights. I however want to regularize my layer with a function that include both the layer weights and the layer bias. Is there a way that incorporates both of these into a single function?
For example I would like to have as regularizer:
def l1_special_reg(weight_matrix, bias_vector):
return 0.01 * K.sum(K.abs(weight_matrix)-K.abs(bias_vector))
Thanks,
You can call layer[idx].trainable_weights, it will return both weights and bias. After that you can manually add that regularization loss in model loss function as follows:
model.layers[-1].trainable_weights
[<tf.Variable 'dense_2/kernel:0' shape=(100, 10) dtype=float32_ref>,
<tf.Variable 'dense_2/bias:0' shape=(10,) dtype=float32_ref>]
Complete example with loss function:
# define model
def l1_reg(weight_matrix):
return 0.01 * K.sum(K.abs(weight_matrix))
wts = model.layers[-1].trainable_weights # -1 for last dense layer.
reg_loss = l1_reg(wts[0]) + l1_reg(wts[1])
def custom_loss(reg_loss):
def orig_loss(y_true, y_pred):
return K.categorical_crossentropy(y_true, y_pred) + reg_loss
return orig_loss
model.compile(loss=custom_loss(reg_loss),
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
In TensorFlow 2 this can be achieved with the model.add_loss() function. Say you have a weights and a bias tensor of some layer:
w, b = layer.trainable_weights()
Then you can regularize this layer by adding the regularization function a loss term to the model object as follows:
def l1_special_reg(weight_matrix, bias_vector):
return 0.01 * K.sum(K.abs(weight_matrix)-K.abs(bias_vector))
model.add_loss(l1_special_reg(w, b))
Naturally, you can do this for each layer independently.

Udacity Deep Learning, Assignment 3, Part 3: Tensorflow dropout function

I am now on assignment 3 of the Udacity Deep Learning class. I have most of it completed and it's working but I noticed that problem 3, which is about using 'dropout' with tensorflow, seems to degrade my performance rather than improve it.
So I think I'm doing something wrong. I'll put my full code here. If someone can explain to me how to properly use dropout, I'd appreciate it. (Or confirm I'm using it correctly and it's just not helping in this case.). It drops accuracy from over 94% (without dropout) down to 91.5%. If you aren't using L2 regularization, the degradation is even larger.
def create_nn(dataset, weights_hidden, biases_hidden, weights_out, biases_out):
# Original layer
logits = tf.add(tf.matmul(tf_train_dataset, weights_hidden), biases_hidden)
# Drop Out layer 1
logits = tf.nn.dropout(logits, 0.5)
# Hidden Relu layer
logits = tf.nn.relu(logits)
# Drop Out layer 2
logits = tf.nn.dropout(logits, 0.5)
# Output: Connect hidden layer to a node for each class
logits = tf.add(tf.matmul(logits, weights_out), biases_out)
return logits
# Create model
batch_size = 128
hidden_layer_size = 1024
beta = 1e-3
graph = tf.Graph()
with graph.as_default():
# Input data. For the training data, we use a placeholder that will be fed
# at run time with a training minibatch.
tf_train_dataset = tf.placeholder(tf.float32,
shape=(batch_size, image_size * image_size))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
# Variables.
weights_hidden = tf.Variable(
#tf.truncated_normal([image_size * image_size, num_labels]))
tf.truncated_normal([image_size * image_size, hidden_layer_size]))
#biases = tf.Variable(tf.zeros([num_labels]))
biases_hidden = tf.Variable(tf.zeros([hidden_layer_size]))
weights_out = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels]))
biases_out = tf.Variable(tf.zeros([num_labels]))
# Training computation.
#logits = tf.matmul(tf_train_dataset, weights_out) + biases_out
logits = create_nn(tf_train_dataset, weights_hidden, biases_hidden, weights_out, biases_out)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
loss += beta * (tf.nn.l2_loss(weights_hidden) + tf.nn.l2_loss(weights_out))
# Optimizer.
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(logits)
#valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights_out) + biases_out)
#test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights_out) + biases_out)
valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights_hidden) + biases_hidden), weights_out) + biases_out)
num_steps = 10000
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
# Note: we could use better randomization across epochs.
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
#offset = (step * batch_size) % (3*128 - batch_size)
#print(offset)
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
# The key of the dictionary is the placeholder node of the graph to be fed,
# and the value is the numpy array to feed to it.
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
_, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 500 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
You would need to turn off dropout during inference. It may not be obvious at first, but the fact that dropout is hardcoded in the NN architecture means it will affect the test data during inference. You can avoid this by creating a placeholder keep_prob, rather than providing the value 0.5 directly. For example:
keep_prob = tf.placeholder(tf.float32)
logits = tf.nn.dropout(logits, keep_prob)
To turn on dropout during training, set the keep_prob value to 0.5:
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
During inference/evaluation, you should be able to do something like this to set keep_prob to 1.0 in eval:
accuracy.eval(feed_dict={x: test_prediction, y_: test_labels, keep_prob: 1.0}
EDIT:
Since the issue does not seem to be that dropout is used at inference, the next culprit would be that the dropout is too high for this network size. You can potentially try decreasing the dropout to 20% (i.e. keep_prob=0.8), or increasing the size of the network to give the model an opportunity to learn the representations.
I actually gave it a try with your code, and I'm getting around ~93.5% with 20% dropout with this network size. I have added some additional resources below, including the original Dropout paper to help clarify the intuition behind it, and expands on more tips when using dropout such as increasing the learning rate.
References:
Deep MNIST for Experts: has an example on the above (dropout on/off) using MNIST
Dropout Regularization in Deep Learning Models With Keras
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
2 things I think can cause the problem.
First of all I would not recommend using dropout in first layer (that too 50%, use lower, in range 10-25% if you have to)) as when you use such a high dropout even higher level features are not learnt and propagated to deeper layers. Also try a range of dropouts from 10% to 50% and see how accuracy changes. There is no way to know beforehand what value will work
Secondly, you do not usually use dropout at inference. To fix that pass in keep_prob parameter of dropout as a placeholder and set it to 1 when inferencing.
Also, if the accuracy values you state are training accuracy then there may not even be much of a problem in first place as dropout will usually decrease training accuracy by small amounts as you are not overfitting, its the test/validation accuracy that needs to be closely monitored

Resources