how to add BatchNormalization with SWA:stochastic weights average? - python-3.x

I am a beginner in Deepleaning and Pytorch.
I don't understand how to use BatchNormalization in using SWA.
pytorch.org says in https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/:
Note that the SWA averages of the weights are never used to make
predictions during training, and so the batch normalization layers do
not have the activation statistics computed after you reset the
weights of your model with opt.swap_swa_sgd()
This means it's suitable for adding BatchNormalization layer after using SWA?
# it means, in my idea
#for example
opt = torchcontrib.optim.SWA(base_opt)
for i in range(100):
opt.zero_grad()
loss_fn(model(input), target).backward()
opt.step()
if i > 10 and i % 5 == 0:
opt.update_swa()
opt.swap_swa_sgd()
#save model once
torch.save(model,"swa_model.pt")
#model_load
saved_model=torch.load("swa_model.pt")
#it means adding BatchNormalization layer??
model2=saved_model
model2.add_module("Batch1",nn.BatchNorm1d(10))
# decay learning_rate more
learning_rate=0.005
optimizer = torch.optim.SGD(model2.parameters(), lr=learning_rate)
# train model again
for epoch in range(num_epochs):
loss = train(train_loader)
val_loss, val_acc = valid(test_loader)
I appreciate your replying.
following your advise,
I try to make example model adding optimizer.bn_update()
# add optimizer.bn_update() to model
criterion = nn.CrossEntropyLoss()
learning_rate=0.01
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
def train(train_loader):
#mode:train
model.train()
running_loss = 0
for batch_idx, (images, labels) in enumerate(train_loader):
optimizer.zero_grad()
outputs = model(images)
#loss
loss = criterion(outputs, labels)
running_loss += loss.item()
loss.backward()
optimizer.step()
optimizer.swap_swa_sgd()
train_loss = running_loss / len(train_loader)
return train_loss
def valid(test_loader):
model.eval()
running_loss = 0
correct = 0
total = 0
#torch.no_grad
with torch.no_grad():
for batch_idx, (images, labels) in enumerate(test_loader):
outputs = model(images)
loss = criterion(outputs, labels)
running_loss += loss.item()
_, predicted = torch.max(outputs, 1)
correct += (predicted == labels).sum().item()
total += labels.size(0)
val_loss = running_loss / len(test_loader)
val_acc = float(correct) / total
return val_loss, val_acc
num_epochs=30
loss_list = []
val_loss_list = []
val_acc_list = []
for epoch in range(num_epochs):
loss = train(train_loader)
val_loss, val_acc = valid(test_loader)
optimizer.bn_update(train_loader, model)
print('epoch %d, loss: %.4f val_loss: %.4f val_acc: %.4f'
% (epoch, loss, val_loss, val_acc))
# logging
loss_list.append(loss)
val_loss_list.append(val_loss)
val_acc_list.append(val_acc)
# optimizer.bn_updata()
optimizer.bn_update(train_loader, model)
# go on evaluating model,,,

What the documentation is telling you is that since SWA computes averages of weights but those weights aren't used for prediction during training the batch normalization layers won't see those weights. This means they haven't computed the respective statistics for them (as they were never able to) which is important because the weights are used during actual prediction (i.e. not during training).
This means, they assume you have batch normalization layers in your model and want to train it using SWA. This is (more or less) not straight-forward due to the reasons above.
One approach is given as follows:
To compute the activation statistics you can just make a forward pass on your training data using the SWA model once the training is finished.
Or you can use their helper class:
In the SWA class we provide a helper function opt.bn_update(train_loader, model). It updates the activation statistics for every batch normalization layer in the model by making a forward pass on the train_loader data loader. You only need to call this function once in the end of training.
In case you are using Pytorch's DataLoader class you can simply supply the model (after training) and the training loader to the bn_update function which updates all batch normalization statistics for you. You only need to call this function once in the end of training.
Steps to proceed:
Train your model that includes batch normalization layers using SWA
After your model has finished training, call opt.bn_update(train_loader, model) using your training data and providing your trained model

I tried to compare before and after using optimizer.bn_update() in Mnist Data.
as follows:
# using Mnist Data
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
# to compare Test Data Accuracy
X_train_=X_train[0:50000]
y_train_=y_train[0:50000]
# to validate for test data
X_train_ToCompare=X_train[50000:60000]
y_train_ToCompare=y_train[50000:60000]
print(X_train_.shape)
print(y_train_.shape)
print(X_train_ToCompare.shape)
print(y_train_ToCompare.shape)
#(50000, 784)
#(50000,)
#(10000, 784)
#(10000,)
# like keras,simple MLP model
from torch import nn
model = nn.Sequential()
model.add_module('fc1', nn.Linear(784, 1000))
model.add_module('relu1', nn.ReLU())
model.add_module('fc2', nn.Linear(1000, 1000))
model.add_module('relu2', nn.ReLU())
model.add_module('fc3', nn.Linear(1000, 10))
print(model)
# using GPU
model.cuda()
criterion = nn.CrossEntropyLoss()
learning_rate=0.01
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
def train(train_loader):
model.train()
running_loss = 0
for batch_idx, (images, labels) in enumerate(train_loader):
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
running_loss += loss.item()
loss.backward()
optimizer.step()
optimizer.swap_swa_sgd()
train_loss = running_loss / len(train_loader)
return train_loss
def valid(test_loader):
model.eval()
running_loss = 0
correct = 0
total = 0
with torch.no_grad():
for batch_idx, (images, labels) in enumerate(test_loader):
outputs = model(images)
loss = criterion(outputs, labels)
running_loss += loss.item()
_, predicted = torch.max(outputs, 1)
correct += (predicted == labels).sum().item()
total += labels.size(0)
val_loss = running_loss / len(test_loader)
val_acc = float(correct) / total
return val_loss, val_acc
num_epochs=30
loss_list = []
val_loss_list = []
val_acc_list = []
for epoch in range(num_epochs):
loss = train(train_loader)
val_loss, val_acc = valid(test_loader)
optimizer.bn_update(train_loader, model)
print('epoch %d, loss: %.4f val_loss: %.4f val_acc: %.4f'
% (epoch, loss, val_loss, val_acc))
# logging
loss_list.append(loss)
val_loss_list.append(val_loss)
val_acc_list.append(val_acc)
# output:
# epoch 0, loss: 0.7832 val_loss: 0.5381 val_acc: 0.8866
# ...
# epoch 29, loss: 0.0502 val_loss: 0.0758 val_acc: 0.9772
#evaluate model
# attempt to evaluate model before optimizer.bn_update()
# using X_train_toCompare for test data
model.eval()
predicted_list=[]
for i in range(len(X_train_ToCompare)):
temp_predicted=model(torch.cuda.FloatTensor(X_train_ToCompare[i]))
_,y_predicted=torch.max(temp_predicte,0)
predicted_list.append(int(y_predicted))
sum(predicted_list==y_train_ToCompare)
# test accuracy 9757/10000
#after using:optimizer.bn_update
model.train()
optimizer.bn_update(train_loader, model)
# evaluate model
model.eval()
predicted_list_afterBatchNorm=[]
for i in range(len(X_train_ToCompare)):
temp_predicted=model(torch.cuda.FloatTensor(X_train_ToCompare[i]))
_,y_predicted=torch.max(temp_predicted,0)
predicted_list_afterBatchNorm.append(int(y_predicted))
sum(predicted_list_withNorm==y_train_ToCompare)
# test accuracy 9778/10000
I don't know if this way is correct to validate...
Using optimizer.bn_update() method, I confirm test accuracy is improved ofen.
but some test accuracy is descended:I think this is because of
simple MLP model and learning epochs are not enough.
there is need to try test more.
thank you for reply.

Related

I am kinda new to the pytorch, now struggling with a classification problem

I built a very simple structure
class classifier (nn.Module):
def __init__(self):
super().__init__()
self.classify = nn.Sequential(
nn.Linear(166,80),
nn.Tanh(),
nn.Linear(80,40),
nn.Tanh(),
nn.Linear(40,1),
nn.Softmax()
)
def forward (self, x):
pred = self.classify(x)
return pred
model = classifier()
The loss function and optimizer are defined as
criteria = nn.BCEWithLogitsLoss()
iteration = 1000
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
and here is the training and evaluation section
for epoch in range (iteration):
model.train()
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
model.eval()
with torch.inference_mode():
test_pred = model(x_test)
test_loss = criteria(test_pred, y_test)
if epoch % 100 == 0:
print(loss)
print(test_loss)
I received the same loss values, and by debugging, I found that the weights were not being updated.
The problem is in the network architecture: you are using a Softmax layer on a single valued output at the end. As per the definition of the softmax function, for a output vector x, we have, for index i:
softmax(x_i) = e^{x_i} / sum_j (e^{x_j})
Here, you only have a single valued output. Due to this, the output of your neural network is always 1, irrespective of the inputs or the weights. To fix this, remove the Softmax layer at the end. An activation function like Sigmoid might be more appropriate, and in fact you are already applying this when using the BCEWithLogitsLoss.
The problem lies here
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
after loss is calculated, you are clearing the gradients by doing optimizer.zero_grad()
the ideal case should be:
optimizer.zero_grad()
y_pred = model(x_train)
loss = criteria(y_pred,y_train)
loss.backward()
optimizer.step()

MNIST dataset overfitting

I am working with the MNIST dataset and I have created the following network. I want to overfit the training data and I think I am doing that here. My training loss is lower than my validation loss. This is the code that I have come up with. Please look at it and let me know if I am overfitting the training data, if I am not then how do I go about it?
class NN(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Flatten(),
nn.Linear(784,4096),
nn.ReLU(),
nn.Linear(4096,2048),
nn.ReLU(),
nn.Linear(2048,1024),
nn.ReLU(),
nn.Linear(1024,512),
nn.ReLU(),
nn.Linear(512,256),
nn.ReLU(),
nn.Linear(256,128),
nn.ReLU(),
nn.Linear(128,64),
nn.ReLU(),
nn.Linear(64,32),
nn.ReLU(),
nn.Linear(32,16),
nn.ReLU(),
nn.Linear(16,10))
def forward(self,x):
return self.layers(x)
def accuracy_and_loss(model, loss_function, dataloader):
total_correct = 0
total_loss = 0
total_examples = 0
n_batches = 0
with torch.no_grad():
for data in testloader:
images, labels = data
outputs = model(images)
batch_loss = loss_function(outputs,labels)
n_batches += 1
total_loss += batch_loss.item()
_, predicted = torch.max(outputs, dim=1)
total_examples += labels.size(0)
total_correct += (predicted == labels).sum().item()
accuracy = total_correct / total_examples
mean_loss = total_loss / n_batches
return (accuracy, mean_loss)
def define_and_train(model,dataset_training, dataset_test):
trainloader = torch.utils.data.DataLoader( small_trainset, batch_size=500, shuffle=True)
testloader = torch.utils.data.DataLoader( dataset_test, batch_size=500, shuffle=True)
values = [1e-8,1e-7,1e-6,1e-5]
model = NN()
for params in values:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay = 1e-7)
train_acc = []
val_acc = []
train_loss = []
val_loss = []
for epoch in range(100):
total_loss = 0
total_correct = 0
total_examples = 0
n_mini_batches = 0
for i,mini_batch in enumerate(trainloader,0):
images,labels = mini_batch
optimizer.zero_grad()
outputs = model(images)
loss = loss_function(outputs,labels)
loss.backward()
optimizer.step()
n_mini_batches += 1
total_loss += loss.item()
_, predicted = torch.max(outputs, dim=1)
total_examples += labels.size(0)
total_correct += (predicted == labels).sum().item()
epoch_training_accuracy = total_correct / total_examples
epoch_training_loss = total_loss / n_mini_batches
epoch_val_accuracy, epoch_val_loss = accuracy_and_loss( model, loss_function, testloader )
print('Params %f Epoch %d loss: %.3f acc: %.3f val_loss: %.3f val_acc: %.3f'
%(params, epoch+1, epoch_training_loss, epoch_training_accuracy, epoch_val_loss, epoch_val_accuracy))
train_loss.append( epoch_training_loss )
train_acc.append( epoch_training_accuracy )
val_loss.append( epoch_val_loss )
val_acc.append( epoch_val_accuracy )
history = { 'train_loss': train_loss,
'train_acc': train_acc,
'val_loss': val_loss,
'val_acc': val_acc }
return ( history, model )
history1, net1 = define_and_train(model,dataset_training,dataset_test)
I am trying to overfit the training data so that later i can apply regularization and then reduce the overfitting which will give me a better understanding of the process
Although I won't attempt to provide a rigorous definition, the term "overfit" typically means that the training loss continues to decrease whereas the validation loss stays stagnant at a position higher than the training loss, or continues to increase with more iterations.
Therefore, it is difficult to know whether your network is overfitting solely based on your code alone. Since dense, fully-connected networks tend to overfit easily in the absence of dropout layers or other regularizers, my hunch would be that your network is indeed overfitting according to your intention. However, we would have to see your tensorboard logs or loss plot to determine whether the model is overfitting.
If you want to overfit your network to the dataset, I suggest that you construct a much larger model with more hidden layers. Overfitting occurs when the dataset is "too easy" for the model and it starts to remember the training set itself without learning generalizable patterns that can be applied to the validation set.

How to understand a periodicity in the training loss using a pre-trained model of PyTorch?

I'm using a pre-trained model from Pytorch ( Resnet 18,34,50) in order to classify images. During the training, a weird periodicity appears in the training as you can see in the image below. Did somebody already have a similar issue?In order to deal with the overfitting, I'm using Data augmentation in the preprocessing.
When using SGD as an optimizer with the following parameters, we obtain this sort of graph:
criterion: NLLLoss()
learning rate: 0.0001
epoch: 40
print every 40 iteration
We also try adam and Adam bound as optimizers but the same periodicity was observed.
Thank's in advance for your answer!
Here is the code :
def train_classifier():
start=0
stop=0
start = timeit.default_timer()
epochs = 40
steps = 0
print_every = 40
model.to('cuda')
epo=[]
train=[]
valid=[]
acc_valid=[]
for e in range(epochs):
print('Currently running epoch',e,':')
model.train()
running_loss = 0
for images, labels in iter(train_loader):
steps += 1
images, labels = images.to('cuda'), labels.to('cuda')
optimizer.zero_grad()
output = model.forward(images)
loss = criterion(output, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
if steps % print_every == 0:
model.eval()
# Turn off gradients for validation, saves memory and computations
with torch.no_grad():
validation_loss, accuracy = validation(model, val_loader, criterion)
print("Epoch: {}/{}.. ".format(e+1, epochs),
"Training Loss: {:.3f}.. ".format(running_loss/print_every),
"Validation Loss: {:.3f}.. ".format(validation_loss/len(val_loader)),
"Validation Accuracy: {:.3f}".format(accuracy/len(val_loader)))
stop = timeit.default_timer()
print('Time: ', stop - start)
acc_valid.append(accuracy/len(val_loader))
train.append(running_loss/print_every)
valid.append(validation_loss/len(val_loader))
epo.append(e+1)
running_loss = 0
model.train()
return train,epo,valid,acc_valid

Pytorch1.6 What is the actual learning rate during training?

I'd like to know the actual learning rate during training, here is my code.
learning_rate = 0.001
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1, 2], gamma=0.1)
def train(epoch):
train_loss = 0
for batch_idx, (input, target) in enumerate(train_loader):
predict_label = net(input)
loss = criterion(predict_label, target)
train_loss += loss.item()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(optimizer.param_groups[0]['lr'])
scheduler.step()
print(scheduler.state_dict()['_last_lr'])
print(optimizer.param_groups[0]['lr'])
the output is 0.001, 0.0001, 0.0001. So what is the actual lr during optimizer.step()? 0.001 or 0.0001? Thanks.
The important part is here:
for batch_idx, (input, target) in enumerate(train_loader):
...
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(optimizer.param_groups[0]['lr']) #### CURRENT LEARNING RATE
scheduler.step() #step through learning rate
print(scheduler.state_dict()['_last_lr']) #### NEW LEARNING RATE
print(optimizer.param_groups[0]['lr']) #### NEW LEARNING RATE
Because you step your scheduler after your epoch, then the first epoch will have your initial value which is set to 0.001. If you run for multiple epochs then it will continue to be annealed.

PyTorch Getting Custom Loss Function Running

I'm trying to use a custom loss function by extending nn.Module, but I can't get past the error
element 0 of variables does not require grad and does not have a grad_fn
Note: my labels are lists of size: num_samples, but each batch will have the same labels throughout the batch, so we shrink labels for the whole batch to be a single label by calling .diag()
My code is as follows and is based on the transfer learning tutorial:
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
since = time.time()
best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
for epoch in range(num_epochs):
print('Epoch {}/{}'.format(epoch, num_epochs - 1))
print('-' * 10)
# Each epoch has a training and validation phase
for phase in ['train', 'val']:
if phase == 'train':
scheduler.step()
model.train(True) # Set model to training mode
else:
model.train(False) # Set model to evaluate mode
running_loss = 0.0
running_corrects = 0
# Iterate over data.
for data in dataloaders[phase]:
# get the inputs
inputs, labels = data
inputs = inputs.float()
# wrap them in Variable
if use_gpu:
inputs = Variable(inputs.cuda())
labels = Variable(labels.cuda())
else:
inputs = Variable(inputs)
labels = Variable(labels)
# zero the parameter gradients
optimizer.zero_grad()
# forward
outputs = model(inputs)
#outputs = nn.functional.sigmoid(outputs).round()
_, preds = torch.max(outputs, 1)
label = labels.diag().float()
preds = preds.float()
loss = criterion(preds, label)
# backward + optimize only if in training phase
if phase == 'train':
loss.backward()
optimizer.step()
# statistics
running_loss += loss.data[0] * inputs.size(0)
running_corrects += torch.sum(pred == label.data)
epoch_loss = running_loss / dataset_sizes[phase]
epoch_acc = running_corrects / dataset_sizes[phase]
print('{} Loss: {:.4f} Acc: {:.4f}'.format(
phase, epoch_loss, epoch_acc))
# deep copy the model
if phase == 'val' and epoch_acc > best_acc:
best_acc = epoch_acc
best_model_wts = copy.deepcopy(model.state_dict())
print()
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format(
time_elapsed // 60, time_elapsed % 60))
print('Best val Acc: {:4f}'.format(best_acc))
# load best model weights
model.load_state_dict(best_model_wts)
return model
and my loss function is defined below:
class CustLoss(nn.Module):
def __init__(self):
super(CustLoss, self).__init__()
def forward(self, outputs, labels):
return cust_loss(outputs, labels)
def cust_loss(pred, targets):
'''preds are arrays of size classes with floats in them'''
'''targets are arrays of all the classes from the batch'''
'''we sum the classes from the batch and find the num correct'''
r = torch.sum(pred == targets)
return r
Then I run the following to run the model:
model_ft = models.resnet18(pretrained=True)
for param in model_ft.parameters():
param.requires_grad = False
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 3)
if use_gpu:
model_ft = model_ft.cuda()
criterion = CustLoss()
# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.fc.parameters(), lr=0.001, momentum=0.9)
# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,num_epochs=25)
I tried getting it to work with other loss functions to no avail. I always get the same error when loss.backward() is called.
It was my understanding that I wouldn't need a custom implementation of loss.backward if I extend nn.Module.
You are subclassing nn.Module to define a function, in your case Loss function. So, when you compute loss.backward(), it tries to store the gradients in the loss itself, instead of the model and there is no variable in the loss for which to store the gradients. Your loss needs to be a function and not a module. See Extending autograd.
You have two options here -
The easiest one is to directly pass cust_loss function as criterion parameter to train_model.
You can extend torch.autograd.Function to define the custom loss (and if you wish, the backward function as well).
P.S. - It is mentioned that you need to implement the backward of the custom loss functions. This is not always the case. It is required only when your loss function is non-differentiable at some point. But, I do not think so that you’ll need to do that.

Resources