Measuring uncertainty using MC Dropout on pytorch - pytorch

I am trying to implement Bayesian CNN using Mc Dropout on Pytorch,
the main idea is that by applying dropout at test time and running over many forward passes , you get predictions from a variety of different models.
I’ve found an application of the Mc Dropout and I really did not get how they applied this method and how exactly they did choose the correct prediction from the list of predictions
here is the code
def mcdropout_test(model):
model.train()
test_loss = 0
correct = 0
T = 100
for data, target in test_loader:
if args.cuda:
data, target = data.cuda(), target.cuda()
data, target = Variable(data, volatile=True), Variable(target)
output_list = []
for i in xrange(T):
output_list.append(torch.unsqueeze(model(data), 0))
output_mean = torch.cat(output_list, 0).mean(0)
test_loss += F.nll_loss(F.log_softmax(output_mean), target, size_average=False).data[0] # sum up batch loss
pred = output_mean.data.max(1, keepdim=True)[1] # get the index of the max log-probability
correct += pred.eq(target.data.view_as(pred)).cpu().sum()
test_loss /= len(test_loader.dataset)
print('\nMC Dropout Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
train()
mcdropout_test()
I have replaced
data, target = Variable(data, volatile=True), Variable(target)
by adding
with torch.no_grad(): at the beginning
And this is how I have defined my CNN
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
self.fc1 = nn.Linear(192 * 8 * 8, 1024)
self.fc2 = nn.Linear(1024, 256)
self.fc3 = nn.Linear(256, 10)
self.dropout = nn.Dropout(p=0.3)
nn.init.xavier_uniform_(self.conv1.weight)
nn.init.constant_(self.conv1.bias, 0.0)
nn.init.xavier_uniform_(self.conv2.weight)
nn.init.constant_(self.conv2.bias, 0.0)
nn.init.xavier_uniform_(self.fc1.weight)
nn.init.constant_(self.fc1.bias, 0.0)
nn.init.xavier_uniform_(self.fc2.weight)
nn.init.constant_(self.fc2.bias, 0.0)
nn.init.xavier_uniform_(self.fc3.weight)
nn.init.constant_(self.fc3.bias, 0.0)
def forward(self, x):
x = self.pool(F.relu(self.dropout(self.conv1(x)))) # recommended to add the relu
x = self.pool(F.relu(self.dropout(self.conv2(x)))) # recommended to add the relu
x = x.view(-1, 192 * 8 * 8)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(self.dropout(x)))
x = self.fc3(self.dropout(x)) # no activation function needed for the last layer
return x
Can anyone help me to get the right implementation of the Monte Carlo Dropout method on CNN?

Implementing MC Dropout in Pytorch is easy. All that is needed to be done is to set the dropout layers of your model to train mode. This allows for different dropout masks to be used during the different various forward passes. Below is an implementation of MC Dropout in Pytorch illustrating how multiple predictions from the various forward passes are stacked together and used for computing different uncertainty metrics.
import sys
import numpy as np
import torch
import torch.nn as nn
def enable_dropout(model):
""" Function to enable the dropout layers during test-time """
for m in model.modules():
if m.__class__.__name__.startswith('Dropout'):
m.train()
def get_monte_carlo_predictions(data_loader,
forward_passes,
model,
n_classes,
n_samples):
""" Function to get the monte-carlo samples and uncertainty estimates
through multiple forward passes
Parameters
----------
data_loader : object
data loader object from the data loader module
forward_passes : int
number of monte-carlo samples/forward passes
model : object
keras model
n_classes : int
number of classes in the dataset
n_samples : int
number of samples in the test set
"""
dropout_predictions = np.empty((0, n_samples, n_classes))
softmax = nn.Softmax(dim=1)
for i in range(forward_passes):
predictions = np.empty((0, n_classes))
model.eval()
enable_dropout(model)
for i, (image, label) in enumerate(data_loader):
image = image.to(torch.device('cuda'))
with torch.no_grad():
output = model(image)
output = softmax(output) # shape (n_samples, n_classes)
predictions = np.vstack((predictions, output.cpu().numpy()))
dropout_predictions = np.vstack((dropout_predictions,
predictions[np.newaxis, :, :]))
# dropout predictions - shape (forward_passes, n_samples, n_classes)
# Calculating mean across multiple MCD forward passes
mean = np.mean(dropout_predictions, axis=0) # shape (n_samples, n_classes)
# Calculating variance across multiple MCD forward passes
variance = np.var(dropout_predictions, axis=0) # shape (n_samples, n_classes)
epsilon = sys.float_info.min
# Calculating entropy across multiple MCD forward passes
entropy = -np.sum(mean*np.log(mean + epsilon), axis=-1) # shape (n_samples,)
# Calculating mutual information across multiple MCD forward passes
mutual_info = entropy - np.mean(np.sum(-dropout_predictions*np.log(dropout_predictions + epsilon),
axis=-1), axis=0) # shape (n_samples,)
Moving on to the implementation which is posted in the question above, multiple predictions from T different forward passes are obtained by first setting the model to train mode (model.train()). Note that this is not desirable because unwanted stochasticity will be introduced in the predictions if there are layers other than dropout such as batch-norm in the model. Hence the best way is to just set the dropout layers to train mode as shown in the snippet above.

Related

Pytorch - vanilla Gradient Visualization

I trained a neural network on MNIST using PyTorch:
class MnistCNN(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d( 1, 16, 3, stride = 1, padding = 2)
self.pool1 = nn.MaxPool2d(kernel_size = 2)
self.conv2 = nn.Conv2d(16, 32, 3, stride = 1, padding = 2)
self.pool2 = nn.MaxPool2d(kernel_size = 2)
self.dropout = nn.Dropout(0.5)
self.lin = nn.Linear(32 * 8 * 8, 10)
def forward(self, x):
# conv block
x = F.relu(self.conv1(x))
x = self.pool1(x)
# conv block
x = F.relu(self.conv2(x))
x = self.pool2(x)
# dense block
x = x.view(x.size(0), -1)
x = self.dropout(x)
return self.lin(x)
I would like to implement vanilla Gradient Visualization (see reference below) on my model.
Simonyan, K., Vedaldi, A., Zisserman, A.
Deep inside convolutional networks: Visualising image classification models and saliency maps.
arXiv preprint arXiv:1312.6034 (2013)
Question: How can I implement this method in PyTorch?
If I understand correctly, vanilla gradient visualization consists in computing the partial derivatives of the loss of my model w.r.t all the pixels in my input image. So to make it short, I need to tweek my self.conv1 layer so that it computes the gradient over its input pixels instead of the gradient over its weights.
Please correct me if I'm wrong.
You do not need to change anything about your conv layer. Each layer computes gradients both w.r.t. parameters (for updates) and w.r.t. inputs (for "downstream" gradients by the chain rule). Therefore, all you need is to set your input image's x gradient property to true:
x, y = ... # get one image from MNIST
x.requires_grad_(True) # indicate to pytorch that you would like to look at these gradients
pred = model(x)
loss = criterion(pred, y)
loss.backward() # propagate gradients
x.grad # <- here you should have the gradients of the loss w.r.t pixels

Pytorch and batches

I'm having trouble understanding how batches play a role into the Pytorch framework.
In this model:
class MyModel(nn.Module):
def __init__(self):
super(MyModel, self).__init__()
# 28x28x1 => 26x26x32
self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
self.d1 = nn.Linear(26 * 26 * 32, 128)
self.d2 = nn.Linear(128, 10)
def forward(self, x):
# 32x1x28x28 => 32x32x26x26
x = self.conv1(x)
x = F.relu(x)
# flatten => 32 x (32*26*26)
x = x.flatten(start_dim = 1)
#x = x.view(32, -1)
# 32 x (32*26*26) => 32x128
x = self.d1(x)
x = F.relu(x)
# logits => 32x10
logits = self.d2(x)
out = F.softmax(logits, dim=1)
return out
In the forward definition, we pass in some x, ie. aggregated images for a batch from a DataLoader. Here, the 32x1x28x28 dimension indicates that there are 32 images in a batch. Do we just ignore this fact and Pytorch handles applying Conv2d to each sample? The forward propagation seems to be just relative to a single image.
Indeed, the network is agnostic to batches: The model is designed to classify a single image.
So why do we need batches for?
Each model has weights (aka parameters) and one needs to optimize the weights using the training images so that the model will classify images as correctly as possible.
This optimization process is usually carried out using Stochastic Gradient Descent (SGD): we are using the current values of the weights to classify a batch of images. Using the prediction the current model made, and the expected predictions we know should be (the "labels") we can compute a gradient of the weights and improve the model.

Why is the output of a convolutional network so exponentially large?

I am trying to reproduce a unet result on Carvana dataset using Ternausnet in PyTorch using Lightning.
I am using DiceLoss for that with sigmoid activation function. I think I am running into an issue of a vanishing gradient, because all gradients of weights are 0, and I see the output of the network with min value of order 10^8.
What could be the issue here? How can I address the vanishing gradient? Also, if I use a different criterion, I see a problem of loss going into negative values without stopping (for BCE with logits, for instance).
Here is the code for my Dice loss:
class DiceLoss(nn.Module):
def __init__(self):
super().__init__()
def forward(self, logits, targets, eps=0, threshold=None):
# comment out if your model contains a sigmoid or
# equivalent activation layer
proba = torch.sigmoid(logits)
proba = proba.view(proba.shape[0], 1, -1)
targets = targets.view(targets.shape[0], 1, -1)
if threshold:
proba = (proba > threshold).float()
# flatten label and prediction tensors
intersection = torch.sum(proba * targets, dim=1)
summation = torch.sum(proba, dim=1) + torch.sum(targets, dim=1)
dice = (2.0 * intersection + eps) / (summation + eps)
# print(intersection, summation, dice)
return (1 - dice).mean()

The gradient cannot be calculated automatically

I am a beginner of Deep Learning and trying to making discriminator that judge cats/non-cats.
But When I run the code following, runtime error occured.
I know that "requires_grad" must be set to True in order to calculate the gradient automatically, but since X_train and Y_train are variables for reading, they are set to False.
I would be grateful if you could modify this code.
X_train = torch.tensor(train_set_x, dtype=dtype,requires_grad=False)
Y_train = torch.tensor(train_set_y, dtype=dtype,requires_grad=False)
def train_model(X_train, Y_train, X_test, Y_test, n_h, num_iterations=10000,learning_rate=0.5, print_cost=False):
"""
Arguments:
X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
n_h -- size of the hidden layer
num_iterations -- number of iterations in gradient descent loop
learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
print_cost -- if True, print the cost every 200 iterations
Returns:
d -- dictionary containing information about the model.
"""
n_x = X.size(1)
n_y = Y.size(1)
# Create model
model = nn.Sequential(
nn.Linear(n_x,n_h),
nn.ReLU(),
nn.Linear(n_h,n_y),
nn.ReLU()
)
# Initialize parameters
for name, param in model.named_parameters():
if name.find('weight') != -1:
torch.nn.init.orthogonal_(param)
elif name.find('bias') != -1:
torch.nn.init.constant_(param, 0)
# Cost function
cost_fn = nn.BCELoss()
# Loop (gradient descent)
for i in range(0, num_iterations):
# Forward propagation: compute predicted labels by passing input data to the model.
Y_predicted = model(X_train)
A2 = (Y_predicted > 0.5).float()
# Cost function. Inputs: predicted and true values. Outputs: "cost".
cost = cost_fn(A2, Y_train)
# Print the cost every 100 iterations
if print_cost and i % 100 == 0:
print("Cost after iteration %i: %f" % (i, cost.item()))
# Zero the gradients before running the backward pass. See hint in problem description
model.zero_grad()
# Backpropagation. Compute gradient of the cost function with respect to all the
# learnable parameters of the model. Use autograd to compute the backward pass.
cost.backward()
# Gradient descent parameter update.
with torch.no_grad():
for param in model.parameters():
# Your code here !!
param -= learning_rate * param.grad
d = {"model": model,
"learning_rate": learning_rate,
"num_iterations": num_iterations}
return d
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I believe your problem is that you are mixing numpy arrays and torch tensors. Pytorch tensors are a bit like numpy arrays, but they also kept in a computational graph that is responsible for the backward pass.
The description of your received variables X_train, Y_train, X_test, Y_test says they are numpy arrays. You should convert them all to torch tensors:
x = torch.tensor(x)
I also noticed that you are manually performing gradient updates. Unless that was your intention, I would recomend you using one of pytorch's optimizers.
from torch.optim import SGD
model = nn.Sequential(
nn.Linear(n_x,n_h),
nn.ReLU(),
nn.Linear(n_h,n_y),
nn.Sigmoid() # You are using BCELoss, you should give it an input from 0 to 1
)
optimizer = SGD(model.parameters(), lr=learning_rate)
cost_fn = nn.BCELoss()
optimizer.zero_grad()
y = model(x)
cost = cost_fn(y, target)
cost.backward()
optimizer.step() # << updated the gradients of your model
Notice that it is recomended to use torch.nn.BCEWithLogitsLoss instead of BCELoss. The first implements sigmoid and the binary cross entropy together with some math tricks to make it more numerically stable. Your model should look something like:
model = nn.Sequential(
nn.Linear(n_x,n_h),
nn.ReLU(),
nn.Linear(n_h,n_y)
)

How to create an autoencoder where each layer of encoder should represent the same as a layer of the decoder

I want to build an autoencoder where each layer in the encoder has the same meaning as a correspondent layer in the decoder. So if the autoencoder is perfectly trained, the values of those layers should be roughly the same.
So lets say the autoencoder consists of e1 -> e2 -> e3 -> d2 -> d1, whereas e1 is the input and d1 the output. A normal autoencoder trains to have the same result in d1 as e1, but I want the additional constraint, that e2 and d2 are the same. Therefore I want an additional backpropagation path which leads from d2 to e2 and trains at the same time as the normal path from d1 to e1. (d stands for decoder, e for encoder).
I tried to use the error between e2 and d2 as a regularization term with the CustomRegularization layer from the first answer of this link https://github.com/keras-team/keras/issues/5563. I also use this for the error between e1 and d1 instead of the normal path.
The following code is written such that more than 1 intermediate layer can be handled and also uses 4 layers.
In the out commented code is a normal autoencoder which only propagates from start to end.
from keras.layers import Dense
import numpy as np
from keras.datasets import mnist
from keras.models import Model
from keras.engine.topology import Layer
from keras import objectives
from keras.layers import Input
import keras
import matplotlib.pyplot as plt
#A layer which can be given as an output to force a regularization term between two layers
class CustomRegularization(Layer):
def __init__(self, **kwargs):
super(CustomRegularization, self).__init__(**kwargs)
def call(self, x, mask=None):
ld=x[0]
rd=x[1]
bce = objectives.binary_crossentropy(ld, rd)
loss2 = keras.backend.sum(bce)
self.add_loss(loss2, x)
return bce
def get_output_shape_for(self, input_shape):
return (input_shape[0][0],1)
def zero_loss(y_true, y_pred):
return keras.backend.zeros_like(y_pred)
#Create regularization layer between two corresponding layers of encoder and decoder
def buildUpDownRegularization(layerNo, input, up_layers, down_layers):
for i in range(0, layerNo):
input = up_layers[i](input)
start = input
for i in range(layerNo, len(up_layers)):
input = up_layers[i](input)
for j in range(0, len(down_layers) - layerNo):
input = down_layers[j](input)
end = input
cr = CustomRegularization()([start, end])
return cr
# Define shape of the network, layers, some hyperparameters and training data
sizes = [784, 400, 200, 100, 50]
up_layers = []
down_layers = []
for i in range(1, len(sizes)):
layer = Dense(units=sizes[i], activation='sigmoid', input_dim=sizes[i-1])
up_layers.append(layer)
for i in range(len(sizes)-2, -1, -1):
layer = Dense(units=sizes[i], activation='sigmoid', input_dim=sizes[i+1])
down_layers.append(layer)
batch_size = 128
num_classes = 10
epochs = 100
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
x_train = x_train.reshape([x_train.shape[0], 28*28])
x_test = x_test.reshape([x_test.shape[0], 28*28])
y_train = x_train
y_test = x_test
optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
"""
### Normal autoencoder like in base mnist example
model = keras.models.Sequential()
for layer in up_layers:
model.add(layer)
for layer in down_layers:
model.add(layer)
model.compile(optimizer=optimizer, loss=keras.backend.binary_crossentropy)
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
score = model.evaluate(x_test, y_test, verbose=0)
#print('Test loss:', score[0])
#print('Test accuracy:', score[1])
decoded_imgs = model.predict(x_test)
n = 10 # how many digits we will display
plt.figure(figsize=(20, 4))
for i in range(n):
# display original
ax = plt.subplot(2, n, i + 1)
plt.imshow(x_test[i].reshape(28, 28))
plt.gray()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
# display reconstruction
ax = plt.subplot(2, n, i + 1 + n)
plt.imshow(decoded_imgs[i].reshape(28, 28))
plt.gray()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
"""
### My autoencoder where each subpart is also an autoencoder
#This part is only because the model needs a path from start to end, contentwise this should do nothing
output = input = Input(shape=(sizes[0],))
for i in range(0, len(up_layers)):
output = up_layers[i](output)
for i in range(0, len(down_layers)):
output = down_layers[i](output)
crs = [output]
losses = [zero_loss]
#Build the regularization layer
for i in range(len(up_layers)):
crs.append(buildUpDownRegularization(i, input, up_layers, down_layers))
losses.append(zero_loss)
#Create and train model with adapted training data
network = Model([input], crs)
optimizer = keras.optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
network.compile(loss=losses, optimizer=optimizer)
dummy_train = np.zeros([y_train.shape[0], 1])
dummy_test = np.zeros([y_test.shape[0], 1])
training_data = [y_train]
test_data = [y_test]
for i in range(len(network.outputs)-1):
training_data.append(dummy_train)
test_data.append(dummy_test)
network.fit(x_train, training_data, batch_size=batch_size, epochs=epochs,verbose=1, validation_data=(x_test, test_data))
score = network.evaluate(x_test, test_data, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
decoded_imgs = network.predict(x_test)
n = 10 # how many digits we will display
plt.figure(figsize=(20, 4))
for i in range(n):
# display original
ax = plt.subplot(2, n, i + 1)
plt.imshow(x_test[i].reshape(28, 28))
plt.gray()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
# display reconstruction
ax = plt.subplot(2, n, i + 1 + n)
plt.imshow(decoded_imgs[0][i].reshape(28, 28))
plt.gray()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
If you run the code as is it will show, that the reproduction ability is no longer given in my code.
I expect a similar behavior to the uncommented code, which shows a normal autoencoder.
Edit: As mentioned in the answers this works well with MSE instead of crossentropy and a lr of .01. 100 epochs with that setting produce really good results.
Edit 2: I would like that the backpropagation works as in this [image] (https://imgur.com/OOo757x). So the backpropagation of the loss of a certain layer stops at the corresponding layer. I think I didn't make this clear before and I don't know if the code currently does that.
Edit 3: Although this code runs and returns a good looking solution the CustomRegularization layer is not doing what I thought it would do, therefore it does not do the same things as in the description.
It seems like the main issue is the use of binary cross-entropy to minimize the difference between encoder and decoder. The internal representation in the network is not going to be a single class probability like the output might be if you were classifying MNIST digits. I was able to get your network to output some reasonable-looking reconstructions with these simple changes:
Using objectives.mean_squared_error instead of objectives.binary_crossentropy in the CustomRegularization class
Changing number of epochs to 5
Changing learning rate to .01
Changes 2 and 3 were simply made to speed up the testing. Change 1 is the key here. Cross entropy is designed for problems where there is a binary "ground truth" variable and an estimate of that variable. However, you do not have a binary truth value in the middle of your network, only at the output layer. Thus a cross entropy loss function in the middle of the network doesn't make much sense (at least to me) -- it will be trying to measure entropy for a variable that isn't binary. Mean squared error, on the other hand, is a bit more generic and should work for this case since you are simply minimizing the difference between two real values. In essence, the middle of the network is performing regression (difference between activations in two continuous values, i.e. layers), not classification, so it needs a loss function that is appropriate for regression.
I also want to suggest that there may be a better approach to accomplish what you want. If you really want the encoder and decoder to be exactly the same, you can share weights between them. Then they will be identical, not just highly similar, and your model will have fewer parameters to train. There is a decent explanation of shared (tied) weights autoencoders with Keras here if you're curious.
Reading your code it does seem like it is doing what you want in your illustration, but I am not really sure how to verify that.

Resources