run_eagerly=True makes the training result different in TensorFlow 2.3.2 - tensorflow2.x

Recently I came across a strange issue while running neural network code on TensorFlow 2.3.2. The issue is that when I only changed run_eagerly=True to run_eagerly=False in the config
model.compile(
    loss={"label": "binary_crossentropy"},
    metrics=tf.keras.metrics.AUC(name="auc"),
    run_eagerly=True
)
the model gets quite different results, mainly in the AUC and the loss.
The AUC and loss with run_eagerly=True are:
INFO:root:batch (99), speed (67.83 qps/s), loss (7.76), auc (0.50)
INFO:root:batch (199), speed (77.42 qps/s), loss (7.70), auc (0.50)
INFO:root:batch (299), speed (77.81 qps/s), loss (7.69), auc (0.50)
INFO:root:batch (399), speed (75.01 qps/s), loss (7.64), auc (0.50)
INFO:root:batch (499), speed (70.51 qps/s), loss (7.68), auc (0.50)
INFO:root:batch (599), speed (77.87 qps/s), loss (7.70), auc (0.50)
INFO:root:batch (699), speed (75.42 qps/s), loss (7.70), auc (0.50)
while after I change the config to run_eagerly=False the result is:
INFO:root:batch (199), speed (107.17 qps/s), loss (1.12), auc (0.51)
INFO:root:batch (299), speed (100.84 qps/s), loss (1.00), auc (0.52)
INFO:root:batch (399), speed (98.40 qps/s), loss (0.93), auc (0.53)
INFO:root:batch (499), speed (101.34 qps/s), loss (0.89), auc (0.55)
INFO:root:batch (599), speed (102.09 qps/s), loss (0.86), auc (0.56)
INFO:root:batch (699), speed (94.13 qps/s), loss (0.83), auc (0.57)
My model is defined as follows:
import logging

import tensorflow as tf

from models.core.fully_connected_layers import MultiHeadLayer, get_fc_layers
from .base_net import BaseNet


class SingleTowerNet(BaseNet):
    def __init__(self, features, net_conf, **kwargs):
        super(SingleTowerNet, self).__init__(**kwargs)
        self.features = features
        self.net_conf = net_conf
        self.user_hidden_num_list = [int(i) for i in self.net_conf["User"].split(",")]
        self.train_features = features.train_feature_names
        self.user_feature_names = features.user_feature_names
        self.label_name = features.label_feature_names[0]
        assert len(features.label_feature_names) == 1, "must have only one label name"
        self.preprocess_layers = self.get_deep_process_layer_map(self.train_features, features)
        self.user_fc_layers = get_fc_layers(self.user_hidden_num_list)

    def get_user_embedding(self, user_features):
        user_emb_list = [self.preprocess_layers[name](user_features[name])
                         for name in self.user_feature_names]
        user_concat = tf.concat(user_emb_list, axis=1)
        user_embedding = self.user_fc_layers(user_concat)
        user_embedding_norm = tf.math.l2_normalize(user_embedding, axis=1)
        return user_embedding_norm

    def call(self, train_data, training=False):
        user_feature = {name: train_data[name] for name in self.user_feature_names}
        user_embedding_norm = self.get_user_embedding(user_feature)
        d = tf.math.reduce_sum(user_embedding_norm * 10, axis=1)
        p = tf.math.sigmoid(d)
        if training:
            return {self.label_name: p}
        else:
            label = tf.squeeze(train_data[self.label_name])
            ad_group_id = tf.squeeze(train_data["ad_group_id"])
            return {"predict": p,
                    "label": label,
                    "ad_group_id": ad_group_id}
Does anyone know what is happening here?

Related

I am kinda new to PyTorch and am currently struggling with a classification problem.

I built a very simple structure
class classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.classify = nn.Sequential(
            nn.Linear(166, 80),
            nn.Tanh(),
            nn.Linear(80, 40),
            nn.Tanh(),
            nn.Linear(40, 1),
            nn.Softmax()
        )

    def forward(self, x):
        pred = self.classify(x)
        return pred

model = classifier()
The loss function and optimizer are defined as
criteria = nn.BCEWithLogitsLoss()
iteration = 1000
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
and here is the training and evaluation section
for epoch in range(iteration):
    model.train()
    y_pred = model(x_train)
    loss = criteria(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.inference_mode():
        test_pred = model(x_test)
        test_loss = criteria(test_pred, y_test)

    if epoch % 100 == 0:
        print(loss)
        print(test_loss)
I received the same loss values, and by debugging, I found that the weights were not being updated.
The problem is in the network architecture: you are using a Softmax layer on a single-valued output at the end. As per the definition of the softmax function, for an output vector x we have, for index i:
softmax(x_i) = e^{x_i} / sum_j (e^{x_j})
Here, you only have a single-valued output. Because of this, the output of your neural network is always 1, irrespective of the inputs or the weights. To fix this, remove the Softmax layer at the end. An activation like Sigmoid would be more appropriate for binary classification, and it is in fact already applied internally by BCEWithLogitsLoss, so no final activation layer is needed.
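A minimal sketch of the corrected architecture (same layer sizes as above, Softmax removed; BCEWithLogitsLoss then applies the sigmoid internally):

import torch
import torch.nn as nn

class classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.classify = nn.Sequential(
            nn.Linear(166, 80),
            nn.Tanh(),
            nn.Linear(80, 40),
            nn.Tanh(),
            nn.Linear(40, 1),  # raw logits; no Softmax/Sigmoid here
        )

    def forward(self, x):
        # BCEWithLogitsLoss expects raw logits, so return them directly.
        return self.classify(x)

model = classifier()
criteria = nn.BCEWithLogitsLoss()

At inference time, apply torch.sigmoid to the logits if you need probabilities.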
The problem lies here
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
After the loss is calculated, you are clearing the gradients by calling optimizer.zero_grad().
The usual ordering is:
optimizer.zero_grad()
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
loss.backward()
optimizer.step()

Why is testing accuracy so low, could there be a bug in my code?

I've been training an image classification model: I run object detection first and then apply image classification to the detected images. I have 87 custom classes in my data (not ImageNet classes) and just over 7000 images altogether (around 60 images per class). I am happy with my object detection code and I think it works quite well; however, for classification I have been using ResNet and AlexNet. I have tried AlexNet, ResNet18, ResNet50 and ResNet101 for training, but I am getting very low testing accuracies (around 10%), while my training accuracies are high for all models. I've also attempted regularisation and changing the learning rates, but I am not getting the higher accuracies (>80%) that I require. I wonder if there is a bug in my code, although I haven't been able to figure it out.
Here is my training code; I have also preprocessed the images in the way that PyTorch pretrained models expect:
import torch
import torch.nn as nn
import torch.optim as optim
from typing import Callable
import numpy as np

EPOCHS = 100

resnet = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50')
resnet.fc = nn.Linear(2048, 87)  # replace the final FC layer for 87 custom classes

res_loss = nn.CrossEntropyLoss()
res_optimiser = optim.SGD(resnet.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

def train_model(model, loss_fn, optimiser, modelsavepath):
    for j in range(EPOCHS):
        running_loss = 0.0
        correct = 0
        total = 0
        model.train()
        for i, data in enumerate(training_generator, 0):
            inputs, labels, paths = data
            total += labels.size(0)
            optimiser.zero_grad()
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimiser.step()
            running_loss += loss.item()
        train_loss = running_loss / len(training_generator)
        train_acc = 100.0 * correct / total
        print("Epoch:{}/{} AVG Training Loss:{:.3f} AVG Training Acc {:.2f}% ".format(j + 1, EPOCHS, train_loss, train_acc))
    torch.save(model, modelsavepath)

train_model(resnet, res_loss, res_optimiser, 'resnet.pth')
Here is the testing code used for a single image; it is part of a class:
self.model.eval()
outputs = self.model(img[None, ...])  # models expect batches, so give it a singleton batch
scores, predictions = torch.max(outputs, 1)
predictions = predictions.numpy()[0]
possible_scores = np.argmax(scores.detach().numpy())
Is there a bug in my code, either testing or training, or is my model just overfitting? Additionally, is there a better image classification model that I could try?
Your dataset is very small, so you're most likely overfitting. Try:
decrease the learning rate (try 0.001, 0.0001, 0.00001)
increase weight_decay (try 1e-4, 1e-3, 1e-2)
if you don't already, use image augmentations (at least the default ones, like random crop and flip); see the sketch after this list.
Watch the train/test loss curves when fine-tuning your model and stop training as soon as you see test accuracy going down while train accuracy goes up.
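A minimal augmentation sketch for the training set (assuming torchvision is available; the normalization constants are the standard ImageNet ones that pretrained ResNets expect):

import torchvision.transforms as T

# Augmentations for training only; keep the test transform deterministic.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

test_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Pass train_transform to whatever dataset feeds training_generator, and combine it with the lower learning rate and higher weight_decay suggested above.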

How to understand a periodicity in the training loss using a pre-trained model of PyTorch?

I'm using a pre-trained model from PyTorch (ResNet 18, 34, 50) in order to classify images. During training, a weird periodicity appears in the training loss, as you can see in the image below. Did somebody already have a similar issue? In order to deal with overfitting, I'm using data augmentation in the preprocessing.
When using SGD as an optimizer with the following parameters, we obtain this sort of graph:
criterion: NLLLoss()
learning rate: 0.0001
epochs: 40
print every 40 iterations
We also tried Adam and AdaBound as optimizers, but the same periodicity was observed.
Thanks in advance for your answer!
Here is the code:
def train_classifier():
    start = 0
    stop = 0
    start = timeit.default_timer()
    epochs = 40
    steps = 0
    print_every = 40
    model.to('cuda')
    epo = []
    train = []
    valid = []
    acc_valid = []
    for e in range(epochs):
        print('Currently running epoch', e, ':')
        model.train()
        running_loss = 0
        for images, labels in iter(train_loader):
            steps += 1
            images, labels = images.to('cuda'), labels.to('cuda')
            optimizer.zero_grad()
            output = model.forward(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if steps % print_every == 0:
                model.eval()
                # Turn off gradients for validation, saves memory and computations
                with torch.no_grad():
                    validation_loss, accuracy = validation(model, val_loader, criterion)
                print("Epoch: {}/{}.. ".format(e+1, epochs),
                      "Training Loss: {:.3f}.. ".format(running_loss/print_every),
                      "Validation Loss: {:.3f}.. ".format(validation_loss/len(val_loader)),
                      "Validation Accuracy: {:.3f}".format(accuracy/len(val_loader)))
                stop = timeit.default_timer()
                print('Time: ', stop - start)
                acc_valid.append(accuracy/len(val_loader))
                train.append(running_loss/print_every)
                valid.append(validation_loss/len(val_loader))
                epo.append(e+1)
                running_loss = 0
                model.train()
    return train, epo, valid, acc_valid

Adding second hidden layer in Tensorflow breaks loss calculation

I'm working on assignment three of the Udacity Deep Learning course. I have a working neural network with one hidden layer. However, when I add a second one, the loss becomes nan.
This is the graph code:
num_nodes_layer_1 = 1024
num_nodes_layer_2 = 128
num_inputs = 28 * 28
num_labels = 10
batch_size = 128

graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, num_inputs))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # variables
    # hidden layer 1
    hidden_weights_1 = tf.Variable(tf.truncated_normal([num_inputs, num_nodes_layer_1]))
    hidden_biases_1 = tf.Variable(tf.zeros([num_nodes_layer_1]))

    # hidden layer 2
    hidden_weights_2 = tf.Variable(tf.truncated_normal([num_nodes_layer_1, num_nodes_layer_2]))
    hidden_biases_2 = tf.Variable(tf.zeros([num_nodes_layer_2]))

    # linear layer
    weights = tf.Variable(tf.truncated_normal([num_nodes_layer_2, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    y1 = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights_1) + hidden_biases_1)
    y2 = tf.nn.relu(tf.matmul(y1, hidden_weights_2) + hidden_biases_2)
    logits = tf.matmul(y2, weights) + biases

    # Calc loss
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=tf_train_labels, logits=logits))

    # Optimizer.
    # We are going to find the minimum of this loss using gradient descent.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # These are not part of training, but merely here so that we can report
    # accuracy figures as we train.
    train_prediction = tf.nn.softmax(logits)
    y1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights_1) + hidden_biases_1)
    y2_valid = tf.nn.relu(tf.matmul(y1_valid, hidden_weights_2) + hidden_biases_2)
    valid_prediction = tf.nn.softmax(tf.matmul(y2_valid, weights) + biases)
    y1_test = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights_1) + hidden_biases_1)
    y2_test = tf.nn.relu(tf.matmul(y1_test, hidden_weights_2) + hidden_biases_2)
    test_prediction = tf.nn.softmax(tf.matmul(y2_test, weights) + biases)
It does not give an error, but after the first step the loss becomes nan and the model doesn't learn:
Initialized
Minibatch loss at step 0: 2133.468750
Minibatch accuracy: 8.6%
Validation accuracy: 10.0%
Minibatch loss at step 400: nan
Minibatch accuracy: 9.4%
Validation accuracy: 10.0%
Minibatch loss at step 800: nan
Minibatch accuracy: 11.7%
Validation accuracy: 10.0%
Minibatch loss at step 1200: nan
Minibatch accuracy: 4.7%
Validation accuracy: 10.0%
Minibatch loss at step 1600: nan
Minibatch accuracy: 7.8%
Validation accuracy: 10.0%
Minibatch loss at step 2000: nan
Minibatch accuracy: 6.2%
Validation accuracy: 10.0%
Test accuracy: 10.0%
When I remove the second layer it trains and I get an accuracy of about 85%. With a second layer I would expect the score to be between 80% and 90%.
Am I using the wrong optimizer? Is it just something stupid I missed?
This is the session code:
num_steps = 2001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {
            tf_train_dataset: batch_data,
            tf_train_labels: batch_labels,
        }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 400 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    acc = accuracy(test_prediction.eval(), test_labels)
    print("Test accuracy: %.1f%%" % acc)
Your learning rate of 0.5 is too high; set it to 0.05 and it'll converge.
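Concretely, the only change in the graph code above is the optimizer's learning rate (everything else unchanged):

# 0.5 made the loss blow up to nan; 0.05 converges.
optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

With that single change, the loss trace looks like this: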
Minibatch loss at step 0: 1506.469238
Minibatch loss at step 400: 7796.088867
Minibatch loss at step 800: 9893.363281
Minibatch loss at step 1200: 5089.553711
Minibatch loss at step 1600: 6148.481445
Minibatch loss at step 2000: 5257.598145
Minibatch loss at step 2400: 1716.116455
Minibatch loss at step 2800: 1600.826538
Minibatch loss at step 3200: 941.884766
Minibatch loss at step 3600: 1033.936768
Minibatch loss at step 4000: 1808.775757
Minibatch loss at step 4400: 113.909866
Minibatch loss at step 4800: 49.800560
Minibatch loss at step 5200: 20.392700
Minibatch loss at step 5600: 6.253595
Minibatch loss at step 6000: 4.372780
Minibatch loss at step 6400: 6.862935
Minibatch loss at step 6800: 6.951239
Minibatch loss at step 7200: 3.528607
Minibatch loss at step 7600: 2.968611
Minibatch loss at step 8000: 3.164592
...
Minibatch loss at step 19200: 2.141401
Also a couple of pointers:
tf_train_dataset and tf_train_labels should be tf.placeholders with a leading dimension of None (e.g. shape [None, 784] for the images). The None dimension allows you to vary the batch size during training, instead of being limited to a fixed number such as 128 (see the sketch after these pointers).
Instead of using tf_valid_dataset and tf_test_dataset as tf.constant, just pass your validation and test datasets in the respective feed_dicts; this will allow you to get rid of the extra ops at the end of your graph for validation and test accuracy.
I'd recommend sampling a different batch of validation and test data rather than using the same batch each time you check the val/test accuracy.
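A minimal sketch of the first two pointers (TF1-style, using the same variable names as the question; the label placeholder gets a [None, 10] shape):

# Placeholders with a flexible batch dimension.
tf_train_dataset = tf.placeholder(tf.float32, shape=(None, num_inputs))
tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

# ... build y1, y2, logits, loss, optimizer and train_prediction exactly as before ...

# Reuse the same graph for validation/test by feeding different data,
# instead of building separate tf.constant branches.
val_loss, val_predictions = session.run(
    [loss, train_prediction],
    feed_dict={tf_train_dataset: valid_dataset, tf_train_labels: valid_labels})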

Why is accuracy different between Keras model.fit and model.evaluate?

I am trying to fit a Keras model and use both the history object and evaluate
function to see how well the model performs. The code to do so is below:
optimizer = Adam(lr=learning_rate)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

for epoch in range(start_epochs, start_epochs + epochs):
    history = model.fit(X_train, y_train, verbose=0, epochs=1,
                        batch_size=batch_size,
                        validation_data=(X_val, y_val))
    print(history.history)
    score = model.evaluate(X_train, y_train, verbose=0)
    print('Training accuracy', model.metrics_names, score)
    score = model.evaluate(X_val, y_val, verbose=0)
    print('Validation accuracy', model.metrics_names, score)
To my surprise, the accuracy and loss results for the training set differ between history and evaluate. Since the results for the validation set are equal, it seems like a blunder on my side, but I cannot find anything. I have given the output for the first four epochs below. I got the same behaviour with the 'mse' metric: the training set differs, the validation set is equal. Does anybody have any idea?
{'val_loss': [13.354823187591416], 'loss': [2.7036468725265874], 'val_acc': [0.11738484422572477], 'acc': [0.21768202061048531]}
Training accuracy ['loss', 'acc'] [13.265716915499048, 0.1270430906536911]
Validation accuracy ['loss', 'acc'] [13.354821096026349, 0.11738484398216939]
{'val_loss': [11.733116257598105], 'loss': [1.8158155931229045], 'val_acc': [0.26745913783295899], 'acc': [0.34522040671733062]}
Training accuracy ['loss', 'acc'] [11.772184015560292, 0.26721149086656992]
Validation accuracy ['loss', 'acc'] [11.733116155570542, 0.26745913818722139]
{'val_loss': [7.1503656643815061], 'loss': [1.5667824202566349], 'val_acc': [0.26597325444044367], 'acc': [0.44378405117114739]}
Training accuracy ['loss', 'acc'] [7.0615554528994506, 0.26250619121327617]
Validation accuracy ['loss', 'acc'] [7.1503659895943672, 0.26597325408618128]
{'val_loss': [4.2865109046890693], 'loss': [1.4087548087645783], 'val_acc': [0.13893016366866509], 'acc': [0.49232293093422957]}
Training accuracy ['loss', 'acc'] [4.1341019072350802, 0.14338781575775195]
Validation accuracy ['loss', 'acc'] [4.2865103747125541, 0.13893016344725112]
There is nothing to be surprised about: the training-set metrics in history are just the mean over all batches seen during training, and the weights change after each batch.
Using model.evaluate keeps the model weights fixed and computes the loss/accuracy over the whole dataset you pass in. If you want the loss/accuracy on the training set, you have to use model.evaluate and pass the training set to it. The history object does not contain the true loss/accuracy on the training set.
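If you want the post-epoch training metrics logged automatically inside a single fit call, a small callback sketch can do it (using tf.keras here; assuming model, X_train and y_train as above):

import tensorflow as tf

class TrueTrainMetrics(tf.keras.callbacks.Callback):
    """Evaluate on the training set with fixed weights at the end of each epoch."""
    def __init__(self, X_train, y_train):
        super().__init__()
        self.X_train = X_train
        self.y_train = y_train

    def on_epoch_end(self, epoch, logs=None):
        loss, acc = self.model.evaluate(self.X_train, self.y_train, verbose=0)
        print('epoch {}: true train loss {:.4f}, true train acc {:.4f}'.format(epoch, loss, acc))

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
          validation_data=(X_val, y_val),
          callbacks=[TrueTrainMetrics(X_train, y_train)])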
