Nan loss in keras with triplet loss - keras

I'm trying to learn an embedding for Paris6k images by combining VGG with Adrian Ung's triplet loss. The problem is that after a small number of iterations, in the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy grow to 1.
I've already tried lowering the learning rate, increasing the batch size (only to 16 because of memory), changing the optimizer (Adam and RMSprop), checking whether there are None values in my dataset, changing the data format from 'float32' to 'float64', adding a little bias to the data, and simplifying the model.
Here is my code:
base_model = VGG16(include_top = False, input_shape = (512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')
embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])
model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)
batch_size = 16
epochs = 2
embedding_size = 64
opt = Adam(lr=0.0001)
model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])
label_list = np.vstack(label_list)
x_train = image_list[:2500]
x_val = image_list[2500:]
y_train = label_list[:2500]
y_val = label_list[2500:]
dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))
H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)
There are 3366 images, with values scaled to the range [0, 1].
The network is given dummy target values because it learns embeddings such that images of the same class are close together and images of different classes are far apart; the real class label is passed in as an input, so it is part of the training.
I noticed that I had previously been splitting the classes incorrectly (and keeping images that should have been discarded), and at that time I did not get the NaN loss.
What should I try to do?
Thanks in advance and sorry for my english.

In some cases, a seemingly random NaN loss is caused by your data: if there are no positive pairs in a batch, you will get a NaN loss.
As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss; it's the same code):
semi_hard_triplet_loss_distance = math_ops.truediv(
    math_ops.reduce_sum(
        math_ops.maximum(
            math_ops.multiply(loss_mat, mask_positives), 0.0)),
    num_positives,
    name='triplet_semihard_loss')
There is a division by the number of positive pairs (num_positives). If a batch contains no positive pairs, both the numerator and num_positives are zero, and 0/0 evaluates to NaN.
I suggest you try to inspect your data pipeline in order to ensure there is at least one positive pair in each of your batches. (You can for example adapt some of the code in the triplet_loss_adapted_from_tf to get the num_positives of your batch, and check if it is greater than 0).
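The check suggested above can be sketched outside the loss function with NumPy. This is a minimal sketch, not code from the notebook: `count_positive_pairs` is a hypothetical helper, and it assumes that num_positives counts ordered pairs of distinct samples sharing a label.

```python
import numpy as np

def count_positive_pairs(labels):
    """Count ordered (anchor, positive) pairs in one batch.

    A pair (i, j) counts when i != j and labels[i] == labels[j].
    """
    labels = np.asarray(labels).reshape(-1)
    # Pairwise label equality, shape (batch, batch).
    same = labels[:, None] == labels[None, :]
    # A sample is not a positive for itself, so clear the diagonal.
    np.fill_diagonal(same, False)
    return int(same.sum())

good_batch = [0, 3, 3, 7]   # class 3 appears twice
bad_batch = [0, 1, 2, 3]    # every class appears only once

print(count_positive_pairs(good_batch))
print(count_positive_pairs(bad_batch))

# Dividing zero by a zero count is what produces the NaN.
with np.errstate(invalid="ignore"):
    print(np.float32(0.0) / np.float32(0.0))
```

Running this over the label column of every batch before training should show whether any batch has a count of zero.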

Try increasing your batch size. It happened to me as well. As mentioned in the previous answer, the loss cannot find any positives in the batch (num_positives is 0). I had 250 classes and was getting a NaN loss initially. I increased the batch size to 128/256 and then there was no issue.
I saw that Paris6k has 15 or 12 classes. Increase your batch size to 32, and if you run out of GPU memory, try a model with fewer parameters. You could start with EfficientNet-B0, which has 5.3M parameters compared with VGG16's 138M.

I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.
https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)
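For illustration only, the idea of guaranteeing positive pairs can be sketched in plain Python. This is not the API of the package above; `balanced_batches` and its parameters are hypothetical names, and the sketch assumes it is enough to draw a fixed number of samples from each of a few randomly chosen classes.

```python
import random
from collections import defaultdict

def balanced_batches(labels, classes_per_batch, samples_per_class, seed=0):
    """Yield lists of sample indices, endlessly.

    Each batch holds `samples_per_class` indices from each of
    `classes_per_batch` classes, so with samples_per_class >= 2 every
    batch contains positive pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Classes with too few samples cannot supply a full group.
    eligible = [c for c, idx in by_class.items() if len(idx) >= samples_per_class]
    if len(eligible) < classes_per_batch:
        raise ValueError("not enough classes with enough samples")
    while True:
        batch = []
        for c in rng.sample(eligible, classes_per_batch):
            batch.extend(rng.sample(by_class[c], samples_per_class))
        yield batch

labels = [0, 0, 0, 1, 1, 2, 2, 2, 3]   # class 3 has a single sample
sampler = balanced_batches(labels, classes_per_batch=2, samples_per_class=2)
batch = next(sampler)
print([labels[i] for i in batch])
```

The indices in each batch can then be used to slice the image and label arrays before they are passed to the model.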

Related

Anomalies have similar error values to normal data

I have inertial measurement unit (IMU) data for which I am building an anomaly detection autoencoder neural net. I have about 5k training samples of which I am using 10% for validation. I also have about 50 (though I can make more) samples to test anomaly detection. My dataset has 12 IMU features. I train for about 10,000 epochs and I attain mean squared errors for reconstruction (MSE) of about 0.004 during training. After training, I perform an MSE calculation on the test data and I get values very similar to those in the train data (0.003) and I do not know why!
I am making my test set by slicing 50 samples from the overall data (not part of X_train) and changing one of the features to all zeros. I have also tried adding noise to one of the features as well as making multiple features zero.
np.random.seed(404)
np.random.shuffle(all_imu_data)
norm_imu_data = all_imu_data[:len_slice]
anom_imu_data = all_imu_data[len_slice:]
anom_imu_data[:,6] = 0
scaler = MinMaxScaler()
norm_data = scaler.fit_transform(norm_imu_data)
anom_data = scaler.transform(anom_imu_data)
X_train = pd.DataFrame(norm_data)
X_test = pd.DataFrame(anom_data)
I have tried many different network sizes by varying the number of hidden layers and the number of hidden nodes per layer. As an example, here is a [12-7-4-7-12] topology:
input_dim = num_features
input_layer = Input(shape=(input_dim, ))
encoder = Dense(int(7), activation="tanh", activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(int(4), activation="tanh")(encoder)
decoder = Dense(int(7), activation="tanh")(encoder)
decoder = Dense(int(input_dim), activation="tanh")(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam', loss='mse', metrics=['mse'])
history = autoencoder.fit(X_train, X_train,
                          epochs=nb_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_split=0.1,
                          verbose=1,
                          callbacks=[checkpointer, tensorboard]).history
pred_train = autoencoder.predict(X_train)
pred_test = autoencoder.predict(X_test)
mse_train = np.mean(np.power(X_train - pred_train, 2), axis=1)
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
print('MSE mean() - X_train:', np.mean(mse_train))
print('MSE mean() - X_test:', np.mean(mse_test))
After doing this, I get MSE mean numbers of 0.004 for Train and 0.003 for Test. Therefore, I cannot select a good threshold for anomalous data, as there are a lot of normal points that have larger MSE scores than the 'anomalous' data.
Any thoughts as to why this network is unable to detect these anomalies?
This is expected. You train your autoencoder on a subsample of your whole data, so anomalies also contaminate your training data. The autoencoder's objective is to reconstruct its input as closely as possible, and it does so for the anomalies too. It is a very powerful tool, so if you show it anomalies in the training data, it will learn to reconstruct them easily.
You need to remove the anomalous 5% of your data with another anomaly detection algorithm (for example, isolation forest) and do the subsampling on the remaining data (without outliers).
After that, you can find your outliers easily.
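Once the autoencoder has been trained on cleaned data, one common way to pick the threshold is a high percentile of the reconstruction errors on that clean data. The sketch below assumes this approach; the function names and the 99th percentile are illustrative choices, not part of the answer, and the error values are made up.

```python
import numpy as np

def reconstruction_errors(x, x_reconstructed):
    """Per-sample mean squared reconstruction error."""
    x = np.asarray(x, dtype=float)
    x_reconstructed = np.asarray(x_reconstructed, dtype=float)
    return np.mean((x - x_reconstructed) ** 2, axis=1)

def flag_anomalies(errors_clean, errors_new, percentile=99.0):
    """Flag new samples whose error exceeds a percentile of the clean errors."""
    threshold = np.percentile(errors_clean, percentile)
    return errors_new > threshold, threshold

# Hypothetical error values: clean data in [0.001, 0.01], two new samples.
errors_clean = np.linspace(0.001, 0.01, 100)
errors_new = np.array([0.002, 0.05])

flags, threshold = flag_anomalies(errors_clean, errors_new)
print(threshold, flags)
```

In the question's setup, `errors_clean` would come from `mse_train` and `errors_new` from `mse_test`.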

Tensorflow - The prediction from Neural Network usually goes for all-0s in classification

I'm new to deep learning and am currently working on a classification problem. I've implemented the last fully connected layer and activation in TensorFlow as follows:
predictions = tf.layers.dense(attention_layer_output, nb_classes, name="Output_Layer")
predictions = tf.reduce_sum(predictions, axis = 0)
targets_raw_ = tf.nn.sigmoid(predictions)
targets_ = tf.round(targets_raw_)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels = self._targets, logits = predictions)
is_correct = tf.equal(targets_, self._targets)
self.accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
tf.summary.scalar('Accuracy', self.accuracy)
adam_opt = tf.train.AdamOptimizer(self._learning_rate)
self.optimizer = adam_opt.minimize(cross_entropy)
I've trained for the first 30 epochs, and the model seems to be overfitting to one of the classes. When I debugged the code above with tf.Print, I found that the prediction is usually [0 0 0 0 0 0] when nb_classes = 6.
So the training accuracy usually stays around 83.33%, which means 5 of the 6 class outputs are correct.
Do I need to change anything in the code above, or should I just keep training for more epochs?
What I understand from your description is that you have an imbalanced training set, i.e. not all classes are represented equally. In that case, accuracy is not a good metric: if one class appears most often and the model always predicts that class, accuracy still looks good even though the model is useless. Have a look at AUC instead.
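The effect can be reproduced with a few lines of NumPy. This sketch assumes one-hot targets with six classes, as in the question; the four sample labels are made up. Recall is used here as a simple stand-in for a metric that exposes the problem.

```python
import numpy as np

nb_classes = 6
# Hypothetical one-hot targets for 4 samples (one positive per row).
targets = np.eye(nb_classes)[[0, 2, 2, 5]]
predictions = np.zeros_like(targets)  # the model always outputs all zeros

# Element-wise accuracy, analogous to tf.equal followed by tf.reduce_mean.
accuracy = np.mean(predictions == targets)
# Recall: fraction of the true positives that were actually predicted.
recall = np.sum((predictions == 1) & (targets == 1)) / np.sum(targets == 1)

print(f"accuracy = {accuracy:.4f}")  # 20 of the 24 entries match
print(f"recall   = {recall:.4f}")    # none of the 4 positives is found
```

The all-zero prediction matches 5 of every 6 entries, which is where the 83.33% comes from, while not identifying a single class correctly.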

I don't understand the code for training a classifier in pytorch

I don't understand the line labels.size(0). I'm new to PyTorch and have been quite confused about the data structures.
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
labels is a Tensor with dimensions [N], where N is the number of samples in the batch (it has to be one-dimensional for the comparison predicted == labels to be element-wise). .size() returns a subclass of tuple (torch.Size) with the dimensions of the Tensor, and .size(0) returns an integer with the size of the first (0-based) dimension (i.e., N).
To answer your question:
In PyTorch, tensor.size() lets you check the shape of a tensor.
In your code,
images, labels = data
images and labels each contain N examples, where N depends on your batch size. If you check the shape of labels, it should be [N], where N is the number of examples in the mini-batch.
Some background for those who are new to training neural networks.
When training a neural network, practitioners pass the dataset forward through the network, compute the gradients of the loss, and use them to update the weights.
Say your training dataset contains 1 million images, and your training script passes all 1 million images through the network in one go before updating. The problem with this approach is that it takes a really long time to receive any feedback from your neural network. This is where mini-batch training comes in.
In PyTorch, the DataLoader class lets us split the dataset into multiple batches. If your training set contains 1 million examples and the batch size is 1000, each epoch iterates over 1000 mini-batches. This way, you can observe and optimize the training performance better.
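The bookkeeping in the evaluation loop can be mimicked in plain Python, without PyTorch. In this sketch, `batches` is a hypothetical stand-in for a DataLoader, `len(label_batch)` plays the role of `labels.size(0)`, and the labels and predictions are made-up values.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

labels = [0, 1, 1, 0, 2, 2, 1, 0, 0, 1]       # hypothetical ground truth
predicted = [0, 1, 0, 0, 2, 1, 1, 0, 2, 1]    # hypothetical model output
batch_size = 4

total = 0
correct = 0
for label_batch, pred_batch in zip(batches(labels, batch_size),
                                   batches(predicted, batch_size)):
    total += len(label_batch)   # number of samples in this batch
    correct += sum(p == t for p, t in zip(pred_batch, label_batch))

steps_per_epoch = math.ceil(len(labels) / batch_size)
print(total, correct, steps_per_epoch)
print(math.ceil(1_000_000 / 1000))   # mini-batches per epoch in the example above
```

Adding the batch size on every iteration matters because the last batch can be smaller than the others, as it is here (4, 4, then 2 samples).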

VGG bottleneck features + LSTM in keras

I have pre-stored bottleneck features (.npy files) obtained from VGG16 for around 10k images. Training a SVM classifier (3-class classification) on these features gave me an accuracy of 90% on the test set. These images are obtained from videos. I want to train an LSTM in keras on top of these features. My code snippet can be found below. The issue is that the training accuracy is not going above 43%, which is unexpected. Please help me in debugging the issue. I have tried with different learning rates.
# Assume all necessary imports are done
classes = 3
frames = 5
channels = 3
img_height = 224
img_width = 224
epochs = 20
#Model definition
model = Sequential()
model.add(TimeDistributed(Flatten(),input_shape=(frames,7,7,512)))
model.add(LSTM(256,return_sequences=False))
model.add(Dense(1024,activation="relu"))
model.add(Dense(3,activation="softmax"))
optimizer = Adam(lr=0.1,beta_1=0.9,beta_2=0.999,epsilon=None,decay=0.0)
model.compile (loss="categorical_crossentropy",optimizer=optimizer,metrics=["accuracy"])
model.summary()
train_data = np.load(open('bottleneck_features_train.npy','rb'))
#final_img_data shape --> 2342,5,7,7,512
#one_hot_labels shape --> 2342,3
model.fit(final_img_data,one_hot_labels,epochs=epochs,batch_size=2)
You are probably overshooting the minimum because the learning rate is too high. Try decreasing the learning rate to somewhere between 0.01 and 0.001 and increasing the number of epochs. Also, halve the number of neurons in the Dense layer from 1024 to 512; otherwise you may overfit.
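Why a large learning rate can prevent convergence is easiest to see on a toy problem. The sketch below uses plain gradient descent on f(x) = x**2, which is an assumption made for illustration; Adam behaves differently in detail, and the learning rates here are not recommendations for the model above.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad
    return x

# With lr = 1.1 each step multiplies x by -1.2, so |x| keeps growing.
print(abs(gradient_descent(lr=1.1)))
# With lr = 0.1 each step multiplies x by 0.8, so x approaches the minimum at 0.
print(abs(gradient_descent(lr=0.1)))
```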

Multi-label classification with class weights in Keras

I have 1000 classes in the network, and the outputs are multi-label. For each training example, the number of positive outputs is the same (i.e., 10), but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and the remaining 990 have output 0.
For the multi-label classification, I am using binary cross-entropy as the cost function and sigmoid as the activation function. When I used 0.5 as the cut-off between 1 and 0, all of the outputs were 0. I understand this is a class imbalance problem. From this link, I understand that I might have to create extra output labels. Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in Keras.
nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam=keras.optimizers.adam(lr=0.00001)
model.compile('adam', 'binary_crossentropy')
history = model.fit(
    X_train, Y_train, batch_size=32, epochs=50, verbose=0, shuffle=False)
Could anyone help me with the code here and I would also highly appreciate if you could suggest a good 'accuracy' metric for this problem?
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions. Especially the class imbalance problem.
In terms of metrics there are several possibilities. In my case I take the top 1/2/3/4/5 results and check whether one of them is right. Because in your case you always have the same number of labels equal to 1, you could take your top 10 results, see what percentage of them are right, and average this over your batch. I didn't find a way to include this algorithm as a Keras metric. Instead, I wrote a callback that calculates the metric on my validation data set at the end of each epoch.
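The top-k idea described above could be computed with NumPy on the predictions of a validation set, for example inside such a callback. This is a minimal sketch, not the answerer's callback; `top_k_hit_rate` is a hypothetical name and the tiny arrays use k=2 instead of 10 to keep the example readable.

```python
import numpy as np

def top_k_hit_rate(y_true, y_pred, k=10):
    """Mean fraction of the k highest-scoring classes that are true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Indices of the k largest scores per row (their order does not matter).
    top_k = np.argpartition(y_pred, -k, axis=1)[:, -k:]
    # Ground truth at those indices: 1 where a top-k class is correct.
    hits = np.take_along_axis(y_true, top_k, axis=1)
    return float(hits.mean())

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.7, 0.6, 0.1, 0.2]])

# Row 0: both top-2 classes are correct; row 1: one of the two is correct.
print(top_k_hit_rate(y_true, y_pred, k=2))
```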
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter Class is really convenient for this purpose.
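A short sketch of that counting step, assuming the predictions are available as a 2-D array of scores; `predicted_class_counts` is a hypothetical helper name and the scores are made up.

```python
from collections import Counter

import numpy as np

def predicted_class_counts(y_pred, n=3):
    """Count how often each class index appears among the top-n predictions."""
    y_pred = np.asarray(y_pred)
    # argsort is ascending, so the last n columns hold the n highest scores.
    top_n = np.argsort(y_pred, axis=1)[:, -n:]
    return Counter(top_n.ravel().tolist())

y_pred = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.7, 0.6, 0.1, 0.2],
                   [0.3, 0.2, 0.9, 0.1]])

counts = predicted_class_counts(y_pred, n=2)
print(counts.most_common())
```

A class that never shows up in the counts, or one that dominates them, hints at the imbalance problem discussed above.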
Edit: I found a method to include class weights without splitting the output.
You need a numpy 2d array containing weights with shape [number classes to predict, 2 (background and signal)].
Such an array could be calculated with this function:
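import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def calculating_class_weights(y_true):
    # One row per output label: [weight for class 0, weight for class 1].
    number_dim = np.shape(y_true)[1]
    weights = np.empty([number_dim, 2])
    for i in range(number_dim):
        # Recent scikit-learn versions require `classes` and `y` as keyword arguments.
        weights[i] = compute_class_weight('balanced', classes=np.array([0., 1.]), y=y_true[:, i])
    return weights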
The solution is now to build your own binary crossentropy loss function in which you multiply your weights yourself:
def get_weighted_loss(weights):
    def weighted_loss(y_true, y_pred):
        return K.mean((weights[:,0]**(1-y_true))*(weights[:,1]**(y_true))*K.binary_crossentropy(y_true, y_pred), axis=-1)
    return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))
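To see what the weighting expression does, here is a NumPy re-implementation of the same formula. It is only a sketch for checking the arithmetic by hand, not a replacement for the Keras loss; the clipping constant `eps` and the example weights are assumptions.

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """NumPy version of the weighted binary cross-entropy above."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log() never receives 0.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    bce = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    # The exponents are 0 or 1, so this selects weights[:, 1] where
    # y_true == 1 and weights[:, 0] where y_true == 0.
    per_label_weight = weights[:, 0] ** (1.0 - y_true) * weights[:, 1] ** y_true
    return np.mean(per_label_weight * bce, axis=-1)

# Two output labels: [background weight, signal weight] per label.
weights = np.array([[0.5, 5.0],
                    [0.6, 3.0]])
y_true = np.array([[1.0, 0.0]])
y_pred = np.array([[0.5, 0.5]])

# Label 0 is a signal (weight 5.0), label 1 is background (weight 0.6).
print(weighted_bce(y_true, y_pred, weights))
```

Because each exponent is either 0 or 1, the product of the two powers is simply a branch-free way of choosing between the background and the signal weight for every label.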
