The gradient cannot be calculated automatically - pytorch

I am a beginner in deep learning and am trying to build a classifier that distinguishes cats from non-cats.
When I run the following code, a runtime error occurs.
I know that "requires_grad" must be set to True in order to calculate the gradient automatically, but since X_train and Y_train are read-only inputs, I set it to False for them.
I would be grateful if you could help me fix this code.
X_train = torch.tensor(train_set_x, dtype=dtype, requires_grad=False)
Y_train = torch.tensor(train_set_y, dtype=dtype, requires_grad=False)

def train_model(X_train, Y_train, X_test, Y_test, n_h, num_iterations=10000, learning_rate=0.5, print_cost=False):
    """
    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    n_h -- size of the hidden layer
    num_iterations -- number of iterations in gradient descent loop
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- if True, print the cost every 200 iterations
    Returns:
    d -- dictionary containing information about the model.
    """
    n_x = X.size(1)
    n_y = Y.size(1)
    # Create model
    model = nn.Sequential(
        nn.Linear(n_x,n_h),
        nn.ReLU(),
        nn.Linear(n_h,n_y),
        nn.ReLU()
    )
    # Initialize parameters
    for name, param in model.named_parameters():
        if name.find('weight') != -1:
            torch.nn.init.orthogonal_(param)
        elif name.find('bias') != -1:
            torch.nn.init.constant_(param, 0)
    # Cost function
    cost_fn = nn.BCELoss()
    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation: compute predicted labels by passing input data to the model.
        Y_predicted = model(X_train)
        A2 = (Y_predicted > 0.5).float()
        # Cost function. Inputs: predicted and true values. Outputs: "cost".
        cost = cost_fn(A2, Y_train)
        # Print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost.item()))
        # Zero the gradients before running the backward pass. See hint in problem description
        model.zero_grad()
        # Backpropagation. Compute gradient of the cost function with respect to all the
        # learnable parameters of the model. Use autograd to compute the backward pass.
        cost.backward()
        # Gradient descent parameter update.
        with torch.no_grad():
            for param in model.parameters():
                # Your code here !!
                param -= learning_rate * param.grad
    d = {"model": model,
         "learning_rate": learning_rate,
         "num_iterations": num_iterations}
    return d
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I believe your problem is that you are mixing numpy arrays and torch tensors. PyTorch tensors are a bit like numpy arrays, but they are also tracked in a computational graph that is responsible for the backward pass.
The docstring for the arguments X_train, Y_train, X_test, Y_test says they are numpy arrays. You should convert them all to torch tensors:
x = torch.tensor(x)
I also noticed that you are performing the gradient updates manually. Unless that was your intention, I would recommend using one of PyTorch's optimizers.
from torch.optim import SGD
model = nn.Sequential(
    nn.Linear(n_x,n_h),
    nn.ReLU(),
    nn.Linear(n_h,n_y),
    nn.Sigmoid() # You are using BCELoss, you should give it an input from 0 to 1
)
optimizer = SGD(model.parameters(), lr=learning_rate)
cost_fn = nn.BCELoss()
optimizer.zero_grad()
y = model(x)
cost = cost_fn(y, target)
cost.backward()
optimizer.step() # << updates the parameters of your model using the computed gradients
Note that it is recommended to use torch.nn.BCEWithLogitsLoss instead of BCELoss. It combines the sigmoid and the binary cross entropy in a single step, using some math tricks to make the computation more numerically stable. With it, the model outputs raw logits and should look something like:
model = nn.Sequential(
    nn.Linear(n_x,n_h),
    nn.ReLU(),
    nn.Linear(n_h,n_y)
)
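One more point that may be worth checking: in the question's code the cost is computed on A2 = (Y_predicted > 0.5).float(). A comparison is not differentiable, so A2 carries no grad_fn, and calling backward() on a cost built from it raises the RuntimeError quoted in the question. The sketch below is a minimal illustration, not the asker's actual setup: the data is made up, samples are assumed to be rows, and the layer sizes are arbitrary. It computes the loss on the raw logits and uses the thresholded labels only for reporting accuracy.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in data: 8 samples (rows) with 4 features, binary labels.
X = torch.randn(8, 4)
Y = torch.randint(0, 2, (8, 1)).float()

# No final activation: the model outputs raw logits.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
logits = model(X)

# Thresholding is a comparison, which is not differentiable: the result is
# cut off from the graph, so a loss built on it cannot be backpropagated.
hard_labels = (torch.sigmoid(logits) > 0.5).float()
hard_labels_track_grad = hard_labels.requires_grad

# Compute the loss on the logits instead; BCEWithLogitsLoss applies the
# sigmoid internally.
cost_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

optimizer.zero_grad()
cost = cost_fn(logits, Y)
cost.backward()
optimizer.step()

# The hard labels are still useful for reporting accuracy.
accuracy = (hard_labels == Y).float().mean().item()
print(f"cost: {cost.item():.4f}, accuracy: {accuracy:.2f}")
```

Keeping the thresholding out of the loss computation is the design choice that matters here; the optimizer and the choice of loss are interchangeable with the variants shown in the answer.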

Related

Results from PyTorch tutorial using Google Colab not matching results in PyCharm

I'm following a PyTorch tutorial on YouTube which uses torch.manual_seed to ensure results are the same, or at least in the same ballpark.
Admittedly I'm no expert, but on running the code in chapter 2 of the tutorial, the resulting graph from my model seems way off from what it should be.
I've gone through the code line by line for the last 3 days, but I can't find any differences between my code and the code used in the tutorial (other than variable names, which I changed for my own clarity and so that I'm not just mindlessly copying).
I work a pretty busy menial job with variable work days, so I don't get a lot of time off, but I've spent 10 'off days' across a month trying to solve this and I just can't see it. Any help would genuinely be appreciated. Even if it's an error on my part, I would be fine with that simply being stated without saying what it is; I just want to know whether I've done anything wrong at all.
Here's a link to the doc file for the tutorial if that helps:
https://www.learnpytorch.io/02_pytorch_classification/#31-going-from-raw-model-outputs-to-predicted-labels-logits-prediction-probabilities-prediction-labels
Here's my code:
`from sklearn.datasets import make_circles
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split as tts
import torch
from torch import nn
from helper_functions import plot_predictions, plot_decision_boundary
import os
from pathlib import Path
# Generate 1000 samples
sample_number = 1000
# Create circles
Features, labels = make_circles(sample_number,
noise=0.03, # <- adds a little noise to the dots
random_state=42) # <- sets random state seed for consistency
# View the first 5 values for both parameters
print(f'First 5 Features (X):\n{Features[:5]}\n'
f'First 5 labels (y):\n{labels[:5]}')
# Make a DataFrame of circle data
circles = pd.DataFrame({"inputType1": Features[:, 0], # <- everything in the 0th index is type 1
"inputType2": Features[:, 1], # <- everything in the 1st index is type 2
"output": labels
})
print(f'Created dataframe:\n'
f'{circles.head(10)}')
# Check the different labels
print(f'Number of value per class:\n'
f'{circles.output.value_counts()}')
# Visualise the dataframe
plt.scatter(x=Features[:, 0],
y=Features[:, 1],
c=labels,
cmap=plt.cm.RdYlBu)
# Display plot
# plt.show()
# Check the shapes of the features and labels
# ML deals with numerical representation
# Ensuring the input and output shapes are compatible is crucial
print(f'Circle shapes: {Features.shape, labels.shape}')
# View the first example of features and labels
Features_samples = Features[0]
labels_samples = labels[0]
print(f'Values for one sample of X: {Features_samples} and the same for y: {labels_samples}')
print(f'Shapes for one sample of X: {Features_samples.shape}'
f'\nand the same for y: {labels_samples.shape}')
# ^ Features dataset has 1000 samples with two feature classes
# ^ labels dataset has 1000 samples with no feature classes since it's a scalar
# Turning datasets into tensors
Features = torch.from_numpy(Features).type(torch.float)
labels = torch.from_numpy(labels).type(torch.float)
# View the first five samples
print(f'First 5 Features:\n'
f'{Features[:5]}\n'
f'First 5 labels:\n'
f'{labels[:5]}\n')
# Split data into train and test sets
input_data_train, input_data_test, model_output_train, model_output_test = tts(Features,
labels,
test_size=0.2,
random_state=42)
# Check that splits follow this pattern:
# 80% train, 20% test
print(f'Number of samples for input train:\n'
f'{len(input_data_train)}\n'
f'Number of samples for input test:\n'
f'{len(input_data_test)}\n'
f'Number of samples for output train:\n'
f'{len(model_output_train)}\n'
f'Number of samples for output test:\n'
f'{len(model_output_test)}\n')
# Begin building learning model
# Make device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f'Learning model processing on: {device}\n')
"""# Assign parameters to the device
input_data_train = input_data_train.to(device)
input_data_test = input_data_test.to(device)
model_output_train = model_output_train.to(device)
model_output_test = model_output_test.to(device)"""
# 1. Construct a model class that subclasses nn.Module
class CircleClassificationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers for handling Feature and labels, input and output shapes
        self.layer_1 = nn.Linear(in_features=2, out_features=5) # <- inputs 2 Features, outputs 5
        self.layer_2 = nn.Linear(in_features=5, out_features=1) # <- inputs 5 Features, outputs 1 label
    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
        # Return the output of layer_2, a single feature, which is the same shape as the label
        return self.layer_2(self.layer_1(x))
        # Computation goes through layer_1 first
        # The output of layer_1 goes through layer_2
# 4. Create an instance of the model and send it to target device
classification_model_0 = CircleClassificationModel().to(device)
# Display model parameters
print(f'Model parameters for self defined model:\n'
f'{classification_model_0}\n')
# The above code can be written more succinctly using nn.Sequential
# Implements two layers of nn.Linear()
# Which calls the following equation
# y = ( x * (Weights).transposed ) + bias
classification_model_0 = nn.Sequential(
nn.Linear(in_features=2, out_features=5),
nn.Linear(in_features=5, out_features=1)
).to(device)
# Display model parameters
print(f'Model (nn.Sequential) parameters:\n'
f'{classification_model_0}\n\n')
# Make predictions with the model
untrained_predictions = classification_model_0(input_data_test.to(device))
print(f'Length of predictions: {len(untrained_predictions)}, Shape: {untrained_predictions.shape}')
print(f'Length of test samples: {len(model_output_test)}, Shape: {model_output_test.shape}')
print(f'\nFirst 10 predictions:\n'
f'{untrained_predictions[:10]}')
print(f'\nFirst 10 test labels:\n'
f'{model_output_test[:10]}')
# Create a loss function
# Unlike the regression model, the classification model uses a different loss type
# Binary Cross Entropy will be used for this task
# torch.nn.BCELoss() - measures the BCE between the target(labels) and the input(features)
# Another version may be used:
# torch.nn.BCEWithLogitsLoss() - same, except it has a built-in Sigmoid function
# loss_fn = nn.BCELoss() # <- BCELoss = no sigmoid built-in
loss_function = nn.BCEWithLogitsLoss() # <- BCEWithLogitsLoss = sigmoid built-in
# Create an optimiser
optimiser = torch.optim.SGD(params=classification_model_0.parameters(),
lr=0.1)
# Calculate accuracy (a classification metric)
# This acts as an evaluation metric
# Offers perspective into how the model is going
# The loss function measures how wrong the model but
# Evaluation metrics measure how right it is
# Accuracy will be the first metric to be utilised
# Accuracy can be measured by dividing the total number of correct predictions
# By the total number of overall predictions
def accuracy_function(label_actual, label_predicted):
    # Calculates where 2 tensors are equal
    correct = torch.eq(label_actual, label_predicted).sum().item()
    accuracy = (correct / len(label_predicted)) * 100
    return accuracy
# View the first 5 results of the forward pass on test data
# labels_logits represents the output of the forward pass method above
# Which utilises two nn.Linear() layers
labels_logits = classification_model_0(input_data_test.to(device))[:5]
print(f'First 5 outputs of the forward pass:\n'
f'{labels_logits}')
# Use the sigmoid function on the model labels_logits
# Turns the output of the forward pass into prediction probabilities
# Measures the odds the model classifies a data point into one class or the other
# In the case of this problem the classes are either 0 or 1
# It uses the logic:
# If labels_prediction_probabilities >= 0.5 then assign the label class (1)
# If labels_prediction_probabilities < 0.5 then assign the label class (0)
labels_prediction_probabilities = torch.sigmoid(labels_logits)
print(f'Output of the sigmoid-ed forward pass:\n'
f'{labels_prediction_probabilities}')
# Find the predicted labels (round the prediction probabilities as well)
label_predictions = torch.round(labels_prediction_probabilities)
# In full
labels_predictions_classes = \
torch.round(torch.sigmoid(classification_model_0(input_data_test.to(device))[:5]))
# Check for equality
print(torch.eq(label_predictions.squeeze(), labels_predictions_classes.squeeze()))
# Get rid of the extra dimensions
label_predictions.squeeze()
# Display model predictions
print(f'Model Predictions:\n'
f'{label_predictions}')
# Display test labels for comparison with model predictions
print(f'\nFirst five test data:\n'
f'{model_output_test[:5]}')
# Building the training loop
torch.manual_seed(42)
# Set the number of epochs
epochs = 100
# Process data on the target devices
input_data_train, model_output_train = input_data_train.to(device),\
model_output_train.to(device)
input_data_test, model_output_test = input_data_test.to(device),\
model_output_test.to(device)
# Build the training and evaluation loop
for epoch in range(epochs):
    # Training
    classification_model_0.train()
    # todo: Do the Forward pass
    # Model outputs raw labels_logits
    train_labels_logits = classification_model_0(input_data_train).squeeze()
    # ^ .squeeze() removes the extra dimensions, won't work if model and data on diff devices
    train_label_prediction = torch.round(torch.sigmoid(train_labels_logits))
    # ^ turns logits -> prediction probabilities -> prediction label classes
    # todo: Calculate the loss
    # 2. Calculates loss/accuracy
    """ train_loss = loss_function(torch.sigmoid(train_labels_logits),
    model_output_train) # <- nn.BCELoss needs torch.sigmoid() """
    train_loss = loss_function(train_labels_logits,
                               model_output_train)
    train_accuracy = accuracy_function(label_actual=model_output_train,
                                       label_predicted=train_label_prediction)
    # todo: Optimiser zero grad
    optimiser.zero_grad()
    # todo: Loss backwards
    train_loss.backward()
    # todo: optimiser step step step
    optimiser.step()
    # Testing
    # todo: evaluate the model
    classification_model_0.eval()
    with torch.inference_mode():
        # todo: Do the forward pass
        test_logits = classification_model_0(input_data_test).squeeze()
        test_predictions = torch.round(torch.sigmoid(test_logits))
        # todo: calculate the loss
        test_loss = loss_function(test_logits,
                                  model_output_test)
        test_accuracy = accuracy_function(label_actual=model_output_test,
                                          label_predicted=test_predictions)
    # todo: print model statistics every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch: {epoch} | Loss: {train_loss:.5f}, | Train Accuracy: {train_accuracy:.2f}%'
              f'Test Loss: {test_loss:.5f}, | Test accuracy: {test_accuracy:.2f}%')
# Plot decision boundary for training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("My Train")
plot_decision_boundary(classification_model_0, input_data_train, model_output_train)
plt.subplot(1, 2, 2)
plt.title("My Test")
plot_decision_boundary(classification_model_0, input_data_test, model_output_test)
plt.show()`
And here's the tutorial code:
from sklearn.datasets import make_circles
# Make 1000 samples
n_samples = 1000
# Create circles
X, y = make_circles(n_samples,
noise=0.03, # a little bit of noise to the dots
random_state=42) # keep random state so we get the same values
print(f"First 5 X features:\n{X[:5]}")
print(f"\nFirst 5 y labels:\n{y[:5]}")
# Make DataFrame of circle data
import pandas as pd
circles = pd.DataFrame({"X1": X[:, 0],
"X2": X[:, 1],
"label": y
})
circles.head(10)
# Check different labels
circles.label.value_counts()
# Visualize with a plot
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],
y=X[:, 1],
c=y,
cmap=plt.cm.RdYlBu);
# Check the shapes of our features and labels
X.shape, y.shape
# View the first example of features and labels
X_sample = X[0]
y_sample = y[0]
print(f"Values for one sample of X: {X_sample} and the same for y: {y_sample}")
print(f"Shapes for one sample of X: {X_sample.shape} and the same for y: {y_sample.shape}")
# Turn data into tensors
# Otherwise this causes issues with computations later on
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)
# View the first five samples
X[:5], y[:5]
# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2, # 20% test, 80% train
random_state=42) # make the random split reproducible
len(X_train), len(X_test), len(y_train), len(y_test)
# Standard PyTorch imports
import torch
from torch import nn
# Make device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device
# 1. Construct a model class that subclasses nn.Module
class CircleModelV0(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers capable of handling X and y input and output shapes
        self.layer_1 = nn.Linear(in_features=2, out_features=5) # takes in 2 features (X), produces 5 features
        self.layer_2 = nn.Linear(in_features=5, out_features=1) # takes in 5 features, produces 1 feature (y)
    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
        # Return the output of layer_2, a single feature, the same shape as y
        # computation goes through layer_1 first then the output of layer_1 goes through layer_2
        return self.layer_2(self.layer_1(x))
# 4. Create an instance of the model and send it to target device
model_0 = CircleModelV0().to(device)
model_0
# Replicate CircleModelV0 with nn.Sequential
model_0 = nn.Sequential(
nn.Linear(in_features=2, out_features=5),
nn.Linear(in_features=5, out_features=1)
).to(device)
model_0
# Make predictions with the model
untrained_preds = model_0(X_test.to(device))
print(f"Length of predictions: {len(untrained_preds)}, Shape: {untrained_preds.shape}")
print(f"Length of test samples: {len(y_test)}, Shape: {y_test.shape}")
print(f"\nFirst 10 predictions:\n{untrained_preds[:10]}")
print(f"\nFirst 10 test labels:\n{y_test[:10]}")
# Create a loss function
# loss_fn = nn.BCELoss() # BCELoss = no sigmoid built-in
loss_fn = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss = sigmoid built-in
# Create an optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
lr=0.1)
# Calculate accuracy (a classification metric)
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item() # torch.eq() calculates where two tensors are equal
    acc = (correct / len(y_pred)) * 100
    return acc
# View the first 5 outputs of the forward pass on the test data
y_logits = model_0(X_test.to(device))[:5]
y_logits
# Use sigmoid on model logits
y_pred_probs = torch.sigmoid(y_logits)
y_pred_probs
# Find the predicted labels (round the prediction probabilities)
y_preds = torch.round(y_pred_probs)
# In full
y_pred_labels = torch.round(torch.sigmoid(model_0(X_test.to(device))[:5]))
# Check for equality
print(torch.eq(y_preds.squeeze(), y_pred_labels.squeeze()))
# Get rid of extra dimension
y_preds.squeeze()
y_test[:5]
torch.manual_seed(42)
# Set the number of epochs
epochs = 100
# Put data to target device
X_train, y_train = X_train.to(device), y_train.to(device)
X_test, y_test = X_test.to(device), y_test.to(device)
# Build training and evaluation loop
for epoch in range(epochs):
    ### Training
    model_0.train()
    # 1. Forward pass (model outputs raw logits)
    # squeeze to remove extra `1` dimensions, this won't work unless model and data are on same device
    y_logits = model_0(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits)) # turn logits -> pred probs -> pred labels
    # 2. Calculate loss/accuracy
    # loss = loss_fn(torch.sigmoid(y_logits), # Using nn.BCELoss you need torch.sigmoid()
    #                y_train)
    loss = loss_fn(y_logits, # Using nn.BCEWithLogitsLoss works with raw logits
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)
    # 3. Optimizer zero grad
    optimizer.zero_grad()
    # 4. Loss backwards
    loss.backward()
    # 5. Optimizer step
    optimizer.step()
    ### Testing
    model_0.eval()
    with torch.inference_mode():
        # 1. Forward pass
        test_logits = model_0(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))
        # 2. Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out what's happening every 10 epochs
    if epoch % 10 == 0:
        print(
            f"Epoch: {epoch} | Loss: {loss:.5f}, Accuracy: {acc:.2f}% | Test loss: {test_loss:.5f}, Test acc: {test_acc:.2f}%")
from helper_functions import plot_predictions, plot_decision_boundary
# Plot decision boundaries for training and test sets
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Tut Train")
plot_decision_boundary(model_0, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Tut Test")
plot_decision_boundary(model_0, X_test, y_test)
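
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in both listings torch.manual_seed(42) is called just before the training loop, after the model has already been constructed. A seed only affects random numbers drawn after the call, so the initial weights of the model are not controlled by it and can differ between runs and between environments. The sketch below uses the same two-layer architecture; build_model is a hypothetical helper introduced only for this illustration.

```python
import torch
from torch import nn


def build_model():
    # Same architecture as in both listings: two stacked linear layers.
    return nn.Sequential(
        nn.Linear(in_features=2, out_features=5),
        nn.Linear(in_features=5, out_features=1),
    )


def same_parameters(model_a, model_b):
    # True only if every pair of parameter tensors is element-wise identical.
    return all(
        torch.equal(p, q)
        for p, q in zip(model_a.parameters(), model_b.parameters())
    )


# Seeding immediately before construction fixes the initial weights.
torch.manual_seed(42)
model_a = build_model()
torch.manual_seed(42)
model_b = build_model()
same_init_when_seeded = same_parameters(model_a, model_b)

# Without reseeding, the next model draws fresh random numbers.
model_c = build_model()
same_init_without_reseed = same_parameters(model_a, model_c)

print(same_init_when_seeded, same_init_without_reseed)
```

Even with identical seeding, small numerical differences between machines or library versions are possible, so matching the tutorial's numbers exactly is not guaranteed.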

Regression Model with 3 Hidden DenseVariational Layers in Tensorflow-Probability returns nan as loss during training

I am getting acquainted with Tensorflow-Probability and have run into a problem. During training, the model returns nan as the loss (possibly meaning a huge loss that overflows). Since the functional form of the synthetic data is not overly complicated, and the ratio of data points to parameters is not frightening (at first glance, at least), I wonder what the problem is and how it could be corrected.
The code is the following, accompanied by some possibly helpful images:
# Create and plot 5000 data points
x_train = np.linspace(-1, 2, 5000)[:, np.newaxis]
y_train = np.power(x_train, 3) + 0.1*(2+x_train)*np.random.randn(5000)[:, np.newaxis]
plt.scatter(x_train, y_train, alpha=0.1)
plt.show()
# Define the prior weight distribution -- all N(0, 1) -- and not trainable
def prior(kernel_size, bias_size, dtype = None):
    n = kernel_size + bias_size
    prior_model = Sequential([
        tfpl.DistributionLambda(
            lambda t: tfd.MultivariateNormalDiag(loc = tf.zeros(n) , scale_diag = tf.ones(n)
        ))
    ])
    return(prior_model)
# Define variational posterior weight distribution -- multivariate Gaussian
def posterior(kernel_size, bias_size, dtype = None):
    n = kernel_size + bias_size
    posterior_model = Sequential([
        tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n) , dtype = dtype), # The parameters of the model are declared Variables that are trainable
        tfpl.MultivariateNormalTriL(n) # The posterior function will return to the Variational layer that will call it a MultivariateNormalTriL object that will have as many dimensions
        # as the parameters of the Variational Dense Layer. That means that each parameter will be generated by a distinct Normal Gaussian shifted and scaled
        # by a mu and sigma learned from the data, independently of all the other weights. The output of this VariableLayer will become the input to the
        # MultivariateNormalTriL object.
        # The shape of the VariableLayer object will be defined by the number of parameters needed to create the MultivariateNormalTriL object given
        # that it will live in a Space of n dimensions (event_size = n). This number is returned by the tfpl.MultivariateNormalTriL.params_size(n)
    ])
    return(posterior_model)
x_in = Input(shape = (1,))
x = tfpl.DenseVariational(units= 2**4,
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0],
activation='relu')(x_in)
x = tfpl.DenseVariational(units= 2**4,
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0],
activation='relu')(x)
x = tfpl.DenseVariational(units=tfpl.IndependentNormal.params_size(1),
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0])(x)
y_out = tfpl.IndependentNormal(1)(x)
model = Model(inputs = x_in, outputs = y_out)
def nll(y_true, y_pred):
return -y_pred.log_prob(y_true)
model.compile(loss=nll, optimizer= 'Adam')
model.summary()
# Train the model
history = model.fit(x_train1, y_train1, epochs=500)
The problem seems to be in the loss function: the negative log-likelihood of an IndependentNormal whose location and scale are both left free lets the variance go unconstrained, which makes the final loss value blow up. Since you're experimenting with variational layers, you are presumably interested in estimating the epistemic uncertainty; to that end, I'd recommend using a constant variance.
I made a couple of slight changes to your code, as follows:
first of all, the final output y_out comes directly from the final variational layer, without any IndependentNormal distribution layer:
y_out = tfpl.DenseVariational(units=1,
make_prior_fn=prior,
make_posterior_fn=posterior,
kl_weight=1/x_train.shape[0])(x)
second, the loss function now performs the necessary calculation with the normal distribution, but with a fixed variance, in order to keep the loss from blowing up during training:
def nll(y_true, y_pred):
    dist = tfp.distributions.Normal(loc=y_pred, scale=1.0)
    return tf.reduce_sum(-dist.log_prob(y_true))
then the model is compiled and trained in the same way as before:
model.compile(loss=nll, optimizer= 'Adam')
history = model.fit(x_train, y_train, epochs=3000)
and finally let's sample 100 different predictions from the trained model and plot these values to visualize the epistemic uncertainty of the model:
predicted = [model(x_train) for _ in range(100)]
for i, res in enumerate(predicted):
    plt.plot(x_train, res, alpha=0.1)
plt.scatter(x_train, y_train, alpha=0.1)
plt.show()
After 3000 epochs the result looks like this (with the number of training points reduced from 5000 to 3000 to speed up training):
The model has 38,589 trainable parameters but you have only 5,000 points as data; so, effective training is impossible with so many parameters.
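The parameter count quoted above can be reproduced by hand. The sketch below assumes that each DenseVariational layer holds a single MultivariateNormalTriL posterior over all of its kernel and bias weights, as in the question's posterior function, and that the prior contributes no trainable variables. A lower-triangular parameterisation of an n-dimensional Gaussian needs n values for the mean plus n(n+1)/2 for the scale matrix. The helper names are hypothetical, and no TensorFlow import is needed.

```python
def tril_params_size(n):
    # Mean vector (n values) plus lower-triangular scale matrix (n * (n + 1) / 2 values).
    return n + n * (n + 1) // 2


def dense_variational_params(in_features, units):
    # One joint posterior over all kernel and bias weights of the layer.
    n = in_features * units + units
    return tril_params_size(n)


# (in_features, units) for the three layers in the question; the last layer
# has 2 units because IndependentNormal(1) needs a mean and a scale.
layer_shapes = [(1, 16), (16, 16), (16, 2)]
per_layer = [dense_variational_params(i, u) for i, u in layer_shapes]
total = sum(per_layer)

print(per_layer, total)  # [560, 37400, 629] 38589
```

Almost all of the parameters come from the 16-to-16 layer, because the full-covariance posterior grows quadratically with the number of weights in a layer.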

Measuring uncertainty using MC Dropout on pytorch

I am trying to implement a Bayesian CNN using MC Dropout in PyTorch.
The main idea is that by applying dropout at test time and running many forward passes, you get predictions from a variety of different models.
I've found an implementation of MC Dropout, but I did not really understand how it applies the method and how exactly it chooses the final prediction from the list of predictions.
Here is the code:
def mcdropout_test(model):
    model.train()
    test_loss = 0
    correct = 0
    T = 100
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output_list = []
        for i in xrange(T):
            output_list.append(torch.unsqueeze(model(data), 0))
        output_mean = torch.cat(output_list, 0).mean(0)
        test_loss += F.nll_loss(F.log_softmax(output_mean), target, size_average=False).data[0] # sum up batch loss
        pred = output_mean.data.max(1, keepdim=True)[1] # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()
    test_loss /= len(test_loader.dataset)
    print('\nMC Dropout Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

train()
mcdropout_test()
I have replaced
data, target = Variable(data, volatile=True), Variable(target)
by adding
with torch.no_grad():
at the beginning.
And this is how I have defined my CNN
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.3)
        nn.init.xavier_uniform_(self.conv1.weight)
        nn.init.constant_(self.conv1.bias, 0.0)
        nn.init.xavier_uniform_(self.conv2.weight)
        nn.init.constant_(self.conv2.bias, 0.0)
        nn.init.xavier_uniform_(self.fc1.weight)
        nn.init.constant_(self.fc1.bias, 0.0)
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.constant_(self.fc2.bias, 0.0)
        nn.init.xavier_uniform_(self.fc3.weight)
        nn.init.constant_(self.fc3.bias, 0.0)

    def forward(self, x):
        x = self.pool(F.relu(self.dropout(self.conv1(x)))) # recommended to add the relu
        x = self.pool(F.relu(self.dropout(self.conv2(x)))) # recommended to add the relu
        x = x.view(-1, 192 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(self.dropout(x)))
        x = self.fc3(self.dropout(x)) # no activation function needed for the last layer
        return x
Can anyone help me to get the right implementation of the Monte Carlo Dropout method on CNN?
Implementing MC Dropout in PyTorch is easy. All that needs to be done is to set the dropout layers of your model to train mode. This allows a different dropout mask to be used in each forward pass. Below is an implementation of MC Dropout in PyTorch illustrating how the predictions from the individual forward passes are stacked together and used to compute different uncertainty metrics.
import sys
import numpy as np
import torch
import torch.nn as nn

def enable_dropout(model):
    """ Function to enable the dropout layers during test-time """
    for m in model.modules():
        if m.__class__.__name__.startswith('Dropout'):
            m.train()

def get_monte_carlo_predictions(data_loader,
                                forward_passes,
                                model,
                                n_classes,
                                n_samples):
    """ Function to get the monte-carlo samples and uncertainty estimates
    through multiple forward passes
    Parameters
    ----------
    data_loader : object
        data loader object from the data loader module
    forward_passes : int
        number of monte-carlo samples/forward passes
    model : object
        PyTorch model
    n_classes : int
        number of classes in the dataset
    n_samples : int
        number of samples in the test set
    """
    dropout_predictions = np.empty((0, n_samples, n_classes))
    softmax = nn.Softmax(dim=1)
    for _ in range(forward_passes):
        predictions = np.empty((0, n_classes))
        # Keep every layer in eval mode except the dropout layers
        model.eval()
        enable_dropout(model)
        for image, label in data_loader:
            image = image.to(torch.device('cuda'))
            with torch.no_grad():
                output = model(image)
                output = softmax(output) # shape (batch_size, n_classes)
            predictions = np.vstack((predictions, output.cpu().numpy()))
        dropout_predictions = np.vstack((dropout_predictions,
                                         predictions[np.newaxis, :, :]))
        # dropout predictions - shape (forward_passes, n_samples, n_classes)
    # Calculating mean across multiple MCD forward passes
    mean = np.mean(dropout_predictions, axis=0) # shape (n_samples, n_classes)
    # Calculating variance across multiple MCD forward passes
    variance = np.var(dropout_predictions, axis=0) # shape (n_samples, n_classes)
    epsilon = sys.float_info.min
    # Calculating entropy across multiple MCD forward passes
    entropy = -np.sum(mean*np.log(mean + epsilon), axis=-1) # shape (n_samples,)
    # Calculating mutual information across multiple MCD forward passes
    mutual_info = entropy - np.mean(np.sum(-dropout_predictions*np.log(dropout_predictions + epsilon),
                                           axis=-1), axis=0) # shape (n_samples,)
    # Return the metrics so that the caller can use them
    return mean, variance, entropy, mutual_info
Moving on to the implementation posted in the question above: it obtains multiple predictions from T different forward passes by first setting the whole model to train mode (model.train()). Note that this is not desirable, because layers other than dropout, such as batch-norm, would also switch to their training behavior and introduce unwanted stochasticity into the predictions. Hence the best way is to set only the dropout layers to train mode, as shown in the snippet above.
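To make the difference concrete, here is a small CPU-only sketch. The toy model, its layer sizes and the random input batch are made up for illustration and are not the asker's CNN. It keeps batch-norm in eval mode, re-enables only the dropout layer, stacks T stochastic forward passes, and takes the final prediction as the argmax of the averaged class probabilities. The question's code averages the raw outputs instead of the probabilities, but it picks the class in the same way.

```python
import torch
import torch.nn as nn


def enable_dropout(model):
    # Switch only the dropout layers to train mode; all other layers stay in eval.
    for m in model.modules():
        if m.__class__.__name__.startswith('Dropout'):
            m.train()


torch.manual_seed(0)

# Hypothetical toy classifier containing both batch-norm and dropout.
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(16, 3),
)
x = torch.randn(5, 4)  # made-up batch: 5 samples, 4 features

model.eval()
enable_dropout(model)

T = 50  # number of stochastic forward passes
with torch.no_grad():
    # Stack the per-pass class probabilities: shape (T, n_samples, n_classes).
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(T)])

mean_probs = probs.mean(dim=0)              # shape (n_samples, n_classes)
predicted_class = mean_probs.argmax(dim=1)  # final prediction per sample
# Predictive entropy as a simple per-sample uncertainty score.
entropy = -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=1)

print(predicted_class, entropy)
```

Samples whose entropy is high are those on which the stochastic passes disagree most, which is the usual way to read the uncertainty estimate.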

How to use the input gradients as variables within a custom loss function in Keras?

I am using the input gradient as a measure of feature importance and want to compare the feature importance of a training data point with the human-annotated feature importance. I would like to make this comparison differentiable so that it can be learned through backpropagation. For that, I am writing a custom loss function that, in addition to the regular loss (e.g. m.s.e. of the prediction vs. the true labels), also checks whether the input gradient is correct (e.g. m.s.e. of the input gradient vs. the human-annotated feature importance).
With the following code I am able to get the input gradient:
from keras import backend as K
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense
def normalize(x):
    # utility function to normalize a tensor by its root mean square (RMS)
    return x / (K.sqrt(K.mean(K.square(x))) + 1e-5)
# Amount of training samples
N = 1000
input_dim = 10
# Generate training set; make features 1 and 2 (zero-based indices) identical to the target
X = np.random.standard_normal(size=(N, input_dim))
y = np.random.randint(low=0, high=2, size=(N, 1))
X[:, 1] = y[:, 0]
X[:, 2] = y[:, 0]
# Create simple model
inputs = Input(shape=(input_dim,))
x = Dense(10, name="dense1")(inputs)
output = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[inputs], outputs=output)
# Compile and fit model
model.compile(optimizer='adam', loss="mse", metrics=['accuracy'])
model.fit([X], y, epochs=100, batch_size=64)
# Get function to get input gradients
gradients = K.gradients(model.output, model.input)[0]
gradient_function = K.function([model.input], [normalize(gradients)])
# Get input gradient values of the training-set
grads_val = gradient_function([X])[0]
print(grads_val[:2])
This prints the following (you can see that the 1st and 2nd features, i.e. indices 1 and 2 counting from zero, have the highest importance):
[[ 1.2629046e-02 2.2765596e+00 2.1479919e+00 2.1558853e-02
4.5277486e-03 2.9851785e-03 9.5279224e-04 -1.0903150e-02
-1.2230731e-02 2.1960819e-02]
[ 1.1318034e-02 2.0402350e+00 1.9250139e+00 1.9320872e-02
4.0577268e-03 2.6752844e-03 8.5390132e-04 -9.7713526e-03
-1.0961102e-02 1.9681118e-02]]
How can I write a custom loss function in which the input gradients are differentiable?
I started with the following loss function.
from keras.losses import mean_squared_error
def custom_loss():
    # human annotated feature importance
    # Let's say that it says to only look at the second feature (index 2, counting from zero)
    human_feature_importance = []
    for i in range(N):
        human_feature_importance.append([0,0,1,0,0,0,0,0,0,0])
    def loss(y_true, y_pred):
        # Get regular loss
        regular_loss_value = mean_squared_error(y_true, y_pred)
        # Somehow get the input gradient of each training sample as a tensor
        # It should be differentiable w.r.t. all of the weights
        gradients = ??
        feature_importance_loss_value = mean_squared_error(gradients, human_feature_importance)
        # Combine both losses
        return regular_loss_value + feature_importance_loss_value
    return loss
I also found an implementation in TensorFlow that makes the input gradient differentiable: https://github.com/dtak/rrr/blob/master/rrr/tensorflow_perceptron.py#L18
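As an illustration of why such a loss can be differentiable at all, here is a minimal NumPy sketch. It assumes a single sigmoid unit rather than the two-layer Keras model above, and the names input_gradients, combined_loss, w and b as well as the toy data are hypothetical. For this simplified model the input gradient has the closed form s * (1 - s) * w, so the feature-importance term is an ordinary expression in the weights that an automatic differentiation framework could differentiate like any other loss term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradients(w, b, X):
    # d sigmoid(x.w + b) / dx = s * (1 - s) * w, one row per sample
    s = sigmoid(X @ w + b)
    return (s * (1.0 - s))[:, None] * w[None, :]

def combined_loss(w, b, X, y, importance):
    s = sigmoid(X @ w + b)
    regular = np.mean((s - y) ** 2)                                   # MSE on predictions
    feature = np.mean((input_gradients(w, b, X) - importance) ** 2)   # MSE on input gradients
    return regular + feature

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 10))
y = rng.integers(0, 2, size=8).astype(float)
importance = np.zeros((8, 10))
importance[:, 2] = 1.0          # annotation: only feature index 2 matters
w = 0.1 * rng.standard_normal(10)
b = 0.0
print(combined_loss(w, b, X, y, importance))
```

This only shows the structure of the combined loss. In a real Keras or TensorFlow model the input gradient would have to be built as a tensor inside the graph, as the linked TensorFlow implementation does.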

How to create an autoencoder where each layer of encoder should represent the same as a layer of the decoder

I want to build an autoencoder where each layer in the encoder has the same meaning as the corresponding layer in the decoder. So if the autoencoder is perfectly trained, the values of those layers should be roughly the same.
So let's say the autoencoder consists of e1 -> e2 -> e3 -> d2 -> d1, where e1 is the input and d1 the output (d stands for decoder, e for encoder). A normal autoencoder is trained so that d1 reproduces e1, but I want the additional constraint that e2 and d2 are the same. Therefore I want an additional backpropagation path which leads from d2 to e2 and is trained at the same time as the normal path from d1 to e1.
I tried to use the error between e2 and d2 as a regularization term with the CustomRegularization layer from the first answer at this link: https://github.com/keras-team/keras/issues/5563. I also use this for the error between e1 and d1 instead of the normal path.
The following code is written so that more than one intermediate layer can be handled, and it uses 4 layers.
The commented-out code (inside the triple-quoted string) is a normal autoencoder which only propagates from start to end.
from keras.layers import Dense
import numpy as np
from keras.datasets import mnist
from keras.models import Model
from keras.engine.topology import Layer
from keras import objectives
from keras.layers import Input
import keras
import matplotlib.pyplot as plt
#A layer which can be given as an output to force a regularization term between two layers
class CustomRegularization(Layer):
    def __init__(self, **kwargs):
        super(CustomRegularization, self).__init__(**kwargs)
    def call(self, x, mask=None):
        ld=x[0]
        rd=x[1]
        bce = objectives.binary_crossentropy(ld, rd)
        loss2 = keras.backend.sum(bce)
        self.add_loss(loss2, x)
        return bce
    def get_output_shape_for(self, input_shape):
        return (input_shape[0][0],1)
def zero_loss(y_true, y_pred):
    return keras.backend.zeros_like(y_pred)
#Create regularization layer between two corresponding layers of encoder and decoder
def buildUpDownRegularization(layerNo, input, up_layers, down_layers):
    for i in range(0, layerNo):
        input = up_layers[i](input)
    start = input
    for i in range(layerNo, len(up_layers)):
        input = up_layers[i](input)
    for j in range(0, len(down_layers) - layerNo):
        input = down_layers[j](input)
    end = input
    cr = CustomRegularization()([start, end])
    return cr
# Define shape of the network, layers, some hyperparameters and training data
sizes = [784, 400, 200, 100, 50]
up_layers = []
down_layers = []
for i in range(1, len(sizes)):
    layer = Dense(units=sizes[i], activation='sigmoid', input_dim=sizes[i-1])
    up_layers.append(layer)
for i in range(len(sizes)-2, -1, -1):
    layer = Dense(units=sizes[i], activation='sigmoid', input_dim=sizes[i+1])
    down_layers.append(layer)
batch_size = 128
num_classes = 10
epochs = 100
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
x_train = x_train.reshape([x_train.shape[0], 28*28])
x_test = x_test.reshape([x_test.shape[0], 28*28])
y_train = x_train
y_test = x_test
optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
"""
### Normal autoencoder like in base mnist example
model = keras.models.Sequential()
for layer in up_layers:
    model.add(layer)
for layer in down_layers:
    model.add(layer)
model.compile(optimizer=optimizer, loss=keras.backend.binary_crossentropy)
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
score = model.evaluate(x_test, y_test, verbose=0)
#print('Test loss:', score[0])
#print('Test accuracy:', score[1])
decoded_imgs = model.predict(x_test)
n = 10 # how many digits we will display
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
"""
### My autoencoder where each subpart is also an autoencoder
#This part is only because the model needs a path from start to end, contentwise this should do nothing
output = input = Input(shape=(sizes[0],))
for i in range(0, len(up_layers)):
    output = up_layers[i](output)
for i in range(0, len(down_layers)):
    output = down_layers[i](output)
crs = [output]
losses = [zero_loss]
#Build the regularization layer
for i in range(len(up_layers)):
    crs.append(buildUpDownRegularization(i, input, up_layers, down_layers))
    losses.append(zero_loss)
#Create and train model with adapted training data
network = Model([input], crs)
optimizer = keras.optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
network.compile(loss=losses, optimizer=optimizer)
dummy_train = np.zeros([y_train.shape[0], 1])
dummy_test = np.zeros([y_test.shape[0], 1])
training_data = [y_train]
test_data = [y_test]
for i in range(len(network.outputs)-1):
    training_data.append(dummy_train)
    test_data.append(dummy_test)
network.fit(x_train, training_data, batch_size=batch_size, epochs=epochs,verbose=1, validation_data=(x_test, test_data))
score = network.evaluate(x_test, test_data, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
decoded_imgs = network.predict(x_test)
n = 10 # how many digits we will display
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[0][i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
If you run the code as is, it shows that my version can no longer reproduce its input.
I expect behavior similar to the commented-out code, which is a normal autoencoder.
Edit: As mentioned in the answers this works well with MSE instead of crossentropy and a lr of .01. 100 epochs with that setting produce really good results.
Edit 2: I would like the backpropagation to work as in this image: https://imgur.com/OOo757x. So the backpropagation of the loss of a certain layer stops at the corresponding layer. I think I didn't make this clear before, and I don't know if the code currently does that.
Edit 3: Although this code runs and returns a good-looking solution, the CustomRegularization layer is not doing what I thought it would do, so the code does not do what the description says.
It seems like the main issue is the use of binary cross-entropy to minimize the difference between encoder and decoder. The internal representation in the network is not going to be a single class probability like the output might be if you were classifying MNIST digits. I was able to get your network to output some reasonable-looking reconstructions with these simple changes:
1. Using objectives.mean_squared_error instead of objectives.binary_crossentropy in the CustomRegularization class
2. Changing the number of epochs to 5
3. Changing the learning rate to .01
Changes 2 and 3 were simply made to speed up the testing. Change 1 is the key here. Cross entropy is designed for problems where there is a binary "ground truth" variable and an estimate of that variable. However, you do not have a binary truth value in the middle of your network, only at the output layer. Thus a cross entropy loss function in the middle of the network doesn't make much sense (at least to me) -- it will be trying to measure entropy for a variable that isn't binary. Mean squared error, on the other hand, is a bit more generic and should work for this case since you are simply minimizing the difference between two real values. In essence, the middle of the network is performing regression (difference between activations in two continuous values, i.e. layers), not classification, so it needs a loss function that is appropriate for regression.
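A small numeric sketch of the point about change 1 (the helper names binary_crossentropy and squared_error are written out by hand here for illustration and are not the Keras functions): when the "target" is itself a non-binary activation, cross-entropy is not zero even when the two values match exactly, whereas squared error is.

```python
import math

def binary_crossentropy(target, pred):
    # Element-wise binary cross-entropy; target is allowed to be non-binary here
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def squared_error(target, pred):
    return (target - pred) ** 2

# Compare the two losses when both activations are exactly equal
for a in (0.5, 0.9, 0.99):
    print(a, binary_crossentropy(a, a), squared_error(a, a))
```

With matching activations, squared error is always 0, while the cross-entropy value depends on the activation itself and shrinks as it approaches 0 or 1. One plausible reading, offered here as an assumption rather than something the answer states, is that a cross-entropy term between two trainable layers can be lowered by saturating the activations instead of only making them agree.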
I also want to suggest that there may be a better approach to accomplish what you want. If you really want the encoder and decoder to be exactly the same, you can share weights between them. Then they will be identical, not just highly similar, and your model will have fewer parameters to train. There is a decent explanation of shared (tied) weights autoencoders with Keras here if you're curious.
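The idea of tied weights can be sketched in plain NumPy as follows. This is a minimal forward pass only, with one hidden layer and randomly initialized weights; the sizes, names (W, b_enc, b_dec, encode, decode) and data are assumptions for illustration, not the Keras implementation referred to above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 784, 50                            # sizes chosen for illustration
W = rng.normal(scale=0.01, size=(n_in, n_hidden))   # the single shared weight matrix
b_enc = np.zeros(n_hidden)                          # biases are not shared
b_dec = np.zeros(n_in)

def encode(x):
    return sigmoid(x @ W + b_enc)

def decode(h):
    # The decoder reuses the transpose of the encoder weights
    return sigmoid(h @ W.T + b_dec)

x = rng.random((4, n_in))                           # hypothetical batch of 4 inputs
reconstruction = decode(encode(x))
print(reconstruction.shape)
```

Note that tying the weights makes the encoder and decoder parameters identical by construction; it does not by itself force the intermediate activations of the two halves to match.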
Reading your code it does seem like it is doing what you want in your illustration, but I am not really sure how to verify that.

Resources