Tuning multiple losses in a multi-headed neural network - pytorch

I have a network to simultaneously predict a max and a min using the same logits (an impossible task, but hear me out). Basically, I want to turn a knob to say "now predict the max of a given set of values" or to predict the min; if the knob is in between, it'll predict the min or the max with 50% probability. My code is based on the You Only Train Once paper: https://openreview.net/pdf?id=HyxY6JHKwr. The paper claims that you can train one network and then tune how the losses are combined to produce the network you want. In my case, I want to tune it so that the network predicts either the max or the min of a given set of numbers. But I am failing at this task. My network model is as follows:
class MyModel(Module):
    def __init__(self, vocab_size, embedding_dim, input_dim):
        super(MyModel, self).__init__()
        self.input_dim = input_dim
        self.embedding_dim = embedding_dim
        self.emb = Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.l1 = Linear(input_dim * embedding_dim, 64)
        self.l2 = Linear(64, 32)
        self.l3 = Linear(32, 10)
        # small MLP that maps the loss weights lambd to a scale and a shift
        self.loss_parameter_mlp = Sequential(
            Linear(2, 2),
            Sigmoid(),
        )

    def forward(self, x, lambd):
        lambd = self.loss_parameter_mlp(lambd)
        x = self.emb(x).reshape(-1, self.input_dim * self.embedding_dim)
        x = ReLU()(self.l1(x))
        # scale the hidden activations by lambd[:, 0] and shift them by lambd[:, 1]
        x = x * lambd[:, 0].reshape(-1, 1) + lambd[:, 1].reshape(-1, 1)
        x = ReLU()(self.l2(x))
        x = x * lambd[:, 0].reshape(-1, 1) + lambd[:, 1].reshape(-1, 1)
        logits = ReLU()(self.l3(x))
        return logits
My inputs are 10 integers from 1 to 99, and my model outputs are the logits, the argmax of which should point to the position of either the min or the max, depending on the hyperparameters lambd. I specifically chose this problem since I want the network to predict two polar opposites (max and min) at the same time, which it cannot. It's (in my mind) a simpler version of the problem the paper is trying to solve. My training code is shown below:
# Training
epochs = 200
alpha = np.linspace(0, 1, epochs)
np.random.shuffle(alpha)
for epoch in range(epochs):
    lambd = torch.tensor([[alpha[epoch], (1 - alpha[epoch])]], dtype=torch.float32)
    for batch, x in enumerate(train_loader):
        y_max = torch.argmax(x, axis=1)
        y_min = torch.argmin(x, axis=1)
        lambd_b = lambd.expand(len(y_max), -1)
        y_pred = model(x, lambd_b)
        loss_max = CE_loss(y_pred, y_max)
        loss_min = CE_loss(y_pred, y_min)
        optimizer.zero_grad()
        loss = alpha[epoch] * loss_max + (1 - alpha[epoch]) * loss_min
        loss.backward()
        optimizer.step()
However, the network learns to ignore the parameter lambd; in other words, the knob to tune between max and min just doesn't work. The network does learn to predict the max and the min (they share the same accuracy), which is expected. What should I do to make the knob work?
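For reference, a minimal sketch of the kind of check I mean (the exact evaluation code is not shown here; this assumes a batch x of shape (batch_size, 10) taken from train_loader):

# Sketch of the check: pin lambd to each extreme and compare predictions
# against the true argmax / argmin positions of the same batch x.
model.eval()
with torch.no_grad():
    for name, knob in [("max", [1.0, 0.0]), ("min", [0.0, 1.0])]:
        lambd_b = torch.tensor([knob], dtype=torch.float32).expand(len(x), -1)
        pred = model(x, lambd_b).argmax(dim=1)
        target = x.argmax(dim=1) if name == "max" else x.argmin(dim=1)
        print(name, (pred == target).float().mean().item())

With both settings of the knob, the predictions come out the same.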

Related

Why is the output of a convolutional network so exponentially large?

I am trying to reproduce a UNet result on the Carvana dataset using TernausNet in PyTorch with Lightning.
I am using a Dice loss with a sigmoid activation function. I think I am running into a vanishing-gradient issue, because all of the weight gradients are 0 and the minimum value of the network's output is of the order of 10^8.
What could be the issue here? How can I address the vanishing gradient? Also, if I use a different criterion, the loss keeps going into negative values without stopping (with BCE with logits, for instance).
Here is the code for my Dice loss:
class DiceLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, logits, targets, eps=0, threshold=None):
        # comment out if your model contains a sigmoid or
        # equivalent activation layer
        proba = torch.sigmoid(logits)
        proba = proba.view(proba.shape[0], 1, -1)
        targets = targets.view(targets.shape[0], 1, -1)
        if threshold:
            proba = (proba > threshold).float()
        # flatten label and prediction tensors
        intersection = torch.sum(proba * targets, dim=1)
        summation = torch.sum(proba, dim=1) + torch.sum(targets, dim=1)
        dice = (2.0 * intersection + eps) / (summation + eps)
        # print(intersection, summation, dice)
        return (1 - dice).mean()
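For context, here is a quick shape check of the loss above (random tensors standing in for my real batches; my actual masks have shape (batch, 1, H, W)):

# Quick shape check with random stand-ins for a real batch of logits and masks.
import torch

criterion = DiceLoss()
logits = torch.randn(4, 1, 64, 64, requires_grad=True)   # raw network outputs
targets = torch.randint(0, 2, (4, 1, 64, 64)).float()    # binary ground-truth masks
loss = criterion(logits, targets)
loss.backward()
print(loss.item(), logits.grad.abs().mean().item())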

Measuring uncertainty using MC Dropout on pytorch

I am trying to implement a Bayesian CNN using MC Dropout in PyTorch.
The main idea is that by applying dropout at test time and running many forward passes, you get predictions from a variety of different models.
I've found an application of MC Dropout, but I really did not get how they applied this method and how exactly they chose the correct prediction from the list of predictions.
Here is the code:
def mcdropout_test(model):
    model.train()
    test_loss = 0
    correct = 0
    T = 100
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output_list = []
        for i in xrange(T):
            output_list.append(torch.unsqueeze(model(data), 0))
        output_mean = torch.cat(output_list, 0).mean(0)
        test_loss += F.nll_loss(F.log_softmax(output_mean), target, size_average=False).data[0]  # sum up batch loss
        pred = output_mean.data.max(1, keepdim=True)[1]  # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()
    test_loss /= len(test_loader.dataset)
    print('\nMC Dropout Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

train()
mcdropout_test()
I have replaced
data, target = Variable(data, volatile=True), Variable(target)
by adding
with torch.no_grad(): at the beginning
And this is how I have defined my CNN
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.3)
        nn.init.xavier_uniform_(self.conv1.weight)
        nn.init.constant_(self.conv1.bias, 0.0)
        nn.init.xavier_uniform_(self.conv2.weight)
        nn.init.constant_(self.conv2.bias, 0.0)
        nn.init.xavier_uniform_(self.fc1.weight)
        nn.init.constant_(self.fc1.bias, 0.0)
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.constant_(self.fc2.bias, 0.0)
        nn.init.xavier_uniform_(self.fc3.weight)
        nn.init.constant_(self.fc3.bias, 0.0)

    def forward(self, x):
        x = self.pool(F.relu(self.dropout(self.conv1(x))))  # recommended to add the relu
        x = self.pool(F.relu(self.dropout(self.conv2(x))))  # recommended to add the relu
        x = x.view(-1, 192 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(self.dropout(x)))
        x = self.fc3(self.dropout(x))  # no activation function needed for the last layer
        return x
Can anyone help me get a correct implementation of the Monte Carlo Dropout method on a CNN?
Implementing MC Dropout in PyTorch is easy. All that needs to be done is to set the dropout layers of your model to train mode. This allows different dropout masks to be used during the different forward passes. Below is an implementation of MC Dropout in PyTorch illustrating how multiple predictions from the various forward passes are stacked together and used for computing different uncertainty metrics.
import sys
import numpy as np
import torch
import torch.nn as nn


def enable_dropout(model):
    """ Function to enable the dropout layers during test-time """
    for m in model.modules():
        if m.__class__.__name__.startswith('Dropout'):
            m.train()


def get_monte_carlo_predictions(data_loader,
                                forward_passes,
                                model,
                                n_classes,
                                n_samples):
    """ Function to get the monte-carlo samples and uncertainty estimates
    through multiple forward passes

    Parameters
    ----------
    data_loader : object
        data loader object from the data loader module
    forward_passes : int
        number of monte-carlo samples/forward passes
    model : object
        pytorch model
    n_classes : int
        number of classes in the dataset
    n_samples : int
        number of samples in the test set
    """
    dropout_predictions = np.empty((0, n_samples, n_classes))
    softmax = nn.Softmax(dim=1)
    for fp in range(forward_passes):
        predictions = np.empty((0, n_classes))
        model.eval()
        enable_dropout(model)  # only the dropout layers are put back in train mode
        for i, (image, label) in enumerate(data_loader):
            image = image.to(torch.device('cuda'))
            with torch.no_grad():
                output = model(image)
                output = softmax(output)  # shape (batch_size, n_classes)
            predictions = np.vstack((predictions, output.cpu().numpy()))
        dropout_predictions = np.vstack((dropout_predictions,
                                         predictions[np.newaxis, :, :]))
        # dropout_predictions - shape (forward_passes, n_samples, n_classes)

    # Calculating mean across multiple MCD forward passes
    mean = np.mean(dropout_predictions, axis=0)  # shape (n_samples, n_classes)

    # Calculating variance across multiple MCD forward passes
    variance = np.var(dropout_predictions, axis=0)  # shape (n_samples, n_classes)

    epsilon = sys.float_info.min
    # Calculating entropy across multiple MCD forward passes
    entropy = -np.sum(mean * np.log(mean + epsilon), axis=-1)  # shape (n_samples,)

    # Calculating mutual information across multiple MCD forward passes
    mutual_info = entropy - np.mean(np.sum(-dropout_predictions * np.log(dropout_predictions + epsilon),
                                           axis=-1), axis=0)  # shape (n_samples,)

    return mean, variance, entropy, mutual_info
Moving on to the implementation posted in the question above: multiple predictions from T different forward passes are obtained by first setting the entire model to train mode (model.train()). Note that this is not desirable, because unwanted stochasticity will be introduced into the predictions if the model contains layers other than dropout, such as batch norm. Hence the best way is to set only the dropout layers to train mode, as shown in the snippet above.
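As a quick usage sketch (the names test_loader and net and the class count of 10 are assumptions, not part of the answer above), the function could be called like this:

# Hypothetical usage of get_monte_carlo_predictions; assumes a 10-class problem,
# a standard PyTorch DataLoader called test_loader, and a model called net.
mean, variance, entropy, mutual_info = get_monte_carlo_predictions(
    data_loader=test_loader,
    forward_passes=100,                # T forward passes, each with a fresh dropout mask
    model=net,
    n_classes=10,
    n_samples=len(test_loader.dataset),
)
predicted_class = mean.argmax(axis=1)  # class with the highest mean softmax probability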

PyTorch LSTM data dimension

I'm using the past 7 days' data to predict today's value (price).
For each day, there are 6 features (let's call them feature1 - feature5, and price).
Suppose I have 1000 rows of data. What should be the shape of my data to be used in a PyTorch LSTM?
Is it (1000, 7, 6)?
If you check the documentation, the LSTM requires an input of shape seq_len x batch_size x input_size. If you declare the LSTM with batch_first=True, then it expects an input of shape batch_size x seq_len x input_size.
Now, in your case, since you have 1000 data records, I assume that is your training-data size. You can split the 1000 records into small batches and feed them to the LSTM.
For seq_len and input_size, you can have the size 7 x 6, where 7 = number of days and 6 = number of features.
However, my concern is with your problem definition. In your problem, you have 5 features, and price is the target variable whose value you want the model to predict. So you can feed the 5 feature values to the LSTM and use the output vectors to predict the price value.
A reasonable network would be:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # input_size = 5 (number of features), output_size = 50
        self.lstm = nn.LSTM(5, 50, 1, batch_first=True)
        # output_size = 1 (target price)
        self.dense = nn.Linear(50, 1)

    def forward(self, x):
        x = self.dense(self.lstm(x)[0])
        return x


model = Model()
batch_input = torch.randn(16, 7, 5)  # => batch_size = 16
y = model(batch_input)               # => torch.Size([16, 7, 1])
Now, you can optimize the model using MSELoss since the task is more like a regression problem.
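A minimal training-loop sketch along those lines (train_x and train_y below are random stand-ins, not data from the question):

# Hypothetical training loop with MSELoss; train_x/train_y are random placeholders
# for (num_windows, 7, 5) feature windows and (num_windows, 7, 1) price targets.
import torch
import torch.nn as nn

train_x = torch.randn(1000, 7, 5)
train_y = torch.randn(1000, 7, 1)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for start in range(0, len(train_x), 16):   # mini-batches of 16 windows
        x = train_x[start:start + 16]
        y = train_y[start:start + 16]
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()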

Pytorch customize weight

I have a network
class Net(nn.Module)
and two different weight vectors w0 and w1 (each obtained by concatenating the weights of all layers into a vector). Now I want to optimize the network on the line connecting w0 and w1, which means that the weights will have the form theta * w0 + (1 - theta) * w1. So now the parameter I want to optimize is no longer the weights themselves, but theta.
How can I implement this? In PyTorch, how can I define the parameter to be theta and set the weights to the form I want? To be specific, if I create a new class
NetOnLine(nn.Module)
how should I write the forward(self, X) function?
You can define the parameter theta in your net as an nn.Parameter. You'd define the forward function the same way as normal - pass the data through the layers or operations you want and then return it.
Here's a minimal example, where I train a "network" to learn to multiply a Tensor by 2:
import numpy as np
import torch


class SampleNet(torch.nn.Module):
    def __init__(self):
        super(SampleNet, self).__init__()
        self.theta = torch.nn.Parameter(torch.rand(1))

    def forward(self, x):
        x = x * self.theta.expand_as(x)  # expand_as() to match sizes
        return x


train_data = np.random.rand(1000, 10)
train_data[:, 5:] = 2 * train_data[:, :5]
train_data = torch.Tensor(train_data)

sample_net = SampleNet()
optimizer = torch.optim.Adam(params=sample_net.parameters())
mse_loss = torch.nn.MSELoss()

for epoch in range(5):
    for data in train_data:
        x = data[:5]
        y = data[5:]
        optimizer.zero_grad()
        prediction = sample_net(x)
        loss = mse_loss(y, prediction)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}, Loss {loss.data.item()}")

print(f"Learned theta: {sample_net.theta.data.item()}")
which prints out
Epoch 0, Loss 0.03369491919875145
Epoch 1, Loss 0.0018534092232584953
Epoch 2, Loss 1.2343853995844256e-05
Epoch 3, Loss 2.2044337466553543e-09
Epoch 4, Loss 4.0527581290916714e-12
Learned theta: 1.999994158744812
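To address the interpolation part of your question more directly, here is a hedged sketch (not a tested solution) of a wrapper that evaluates a small MLP with weights theta * w0 + (1 - theta) * w1 using torch.nn.functional, so that only theta is trainable; the layer sizes and the w0/w1 tensors below are made-up assumptions.

# Hypothetical sketch: optimize only theta, with the effective weights
# interpolated between two fixed weight sets w0 and w1.
import torch
import torch.nn.functional as F


class NetOnLine(torch.nn.Module):
    def __init__(self, w0, w1):
        super().__init__()
        # w0 and w1 are dicts of fixed tensors, e.g. {"fc1.weight": ..., "fc1.bias": ...}
        self.w0 = {k: v.detach() for k, v in w0.items()}
        self.w1 = {k: v.detach() for k, v in w1.items()}
        self.theta = torch.nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        # the effective weights live on the line between w0 and w1
        w = {k: self.theta * self.w0[k] + (1 - self.theta) * self.w1[k] for k in self.w0}
        x = F.relu(F.linear(x, w["fc1.weight"], w["fc1.bias"]))
        return F.linear(x, w["fc2.weight"], w["fc2.bias"])


# Made-up example weight sets for a 10 -> 32 -> 1 MLP:
w0 = {"fc1.weight": torch.randn(32, 10), "fc1.bias": torch.zeros(32),
      "fc2.weight": torch.randn(1, 32), "fc2.bias": torch.zeros(1)}
w1 = {k: torch.randn_like(v) for k, v in w0.items()}

net = NetOnLine(w0, w1)
optimizer = torch.optim.Adam([net.theta], lr=0.01)  # only theta is optimized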

tensorflow stop optimizing when activation step is in a function

I'm trying to reproduce the results of https://www.tensorflow.org/tutorials/mnist/beginners/
So I designed several functions to take care of the training step, such as these two:
def layer_computation(previous_layer_output, weights, bias, activation):
    return activation(tf.add(tf.matmul(previous_layer_output, weights), bias))


def multilayer_perceptron_forward(x, weights, biases, activations):
    return reduce(lambda output_layer, args: layer_computation(output_layer, *args),
                  zip(weights, biases, activations), x)
I then use these two functions in the training routine below:
def training(session,
             features, labels,
             mlp,
             # cost = (tf.reduce_mean, ),
             optimizer=tf.train.GradientDescentOptimizer,
             epochs=100, learning_rate=0.001, display=100):
    x = tf.placeholder("float")
    y = tf.placeholder("float")
    weights, biases, activations = mlp
    pred = multilayer_perceptron_forward(x, weights, biases, activations)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    opti = optimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    session.run(init)
    for i in range(1, epochs + 1):
        batch_size = 100
        avg_cost = 0
        number_of_batches = int(features.shape[0] / batch_size)
        for j in range(number_of_batches):
            my_x = features[j * batch_size:(j + 1) * batch_size, :]
            my_y = labels[j * batch_size:(j + 1) * batch_size, :]
            _, c = session.run([opti, cost], feed_dict={x: my_x,
                                                        y: my_y})
            avg_cost += c / number_of_batches
        if i % display == 0:
            print("Epoch {i} cost = {cost}".format(i=i, cost=avg_cost))
The optimization stops at a cost of 2.3... and the overall accuracy is 10%, whereas in the tutorial the cost gets close to zero and the accuracy is close to 96%. Does anyone have an explanation for this peculiar behavior?
PS: when I use layer_computation in the source code of the tutorial, I also get stuck at a cost of 2.3.
I caught the error: I was trying to perform back-propagation on the last layer. This question may have been better suited to Cross Validated.
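For anyone hitting the same 2.3 plateau: tf.nn.softmax_cross_entropy_with_logits expects raw, un-activated logits, so one thing to check (a sketch, assuming the last entry of activations was a softmax; not the exact code from my fix) is keeping the output layer linear:

# Sketch: keep the last layer as raw logits and let
# softmax_cross_entropy_with_logits apply the softmax internally.
activations = [tf.nn.relu, tf.nn.relu, tf.identity]  # identity on the output layer
pred = multilayer_perceptron_forward(x, weights, biases, activations)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))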
