CUDA out of memory when fine-tuning a large model - pytorch

I have previously trained a VGG model (say model1) and a two-layer model (say model2) separately. Now I have to train a new model that combines the two, with each part initialized from the learned weights of model1 and model2. I implemented it as follows:
class TransferModel(nn.Module):
    def __init__(self, VGG, TwoLayer):
        super(TransferModel, self).__init__()
        self.vgg_layer = VGG    # pre-trained VGG part (model1)
        self.linear = TwoLayer  # pre-trained two-layer part (model2)
        # fine-tune the VGG weights as well
        for param in self.vgg_layer.parameters():
            param.requires_grad = True

    def forward(self, x):
        h1_vgg = self.vgg_layer(x)
        y_pred = self.linear(h1_vgg)
        return y_pred

# for image_id in train_ids[0:1]:
#     img = load_image(train_id_to_file[image_id])
new_model = TransferModel(trained_vgg_instance, trained_twolayer_instance)
And when training, I try:
def train(model, learning_rate=0.001, batch_size=50, epochs=2):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = torch.nn.MultiLabelSoftMarginLoss()
    x = torch.zeros([batch_size, 3, img_size, img_size])
    y_true = torch.zeros([batch_size, 4096])
    for epoch in range(epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        # for i in range(20000):
        for batch_num in range(int(20000/batch_size)):
            optimizer.zero_grad()
            for j in range(batch_size):
                # ... some code to load batches of images into x....
                pass
            x_batch = Variable(x).cuda()
            y_true_batch = Variable(train_labels[batch_num*batch_size:(batch_num+1)*batch_size, :]).cuda()
            y_pred = model(x_batch)
            loss = criterion(y_pred, y_true_batch)
            loss.backward()
            optimizer.step()
            running_loss += loss
            del x_batch, y_true_batch, y_pred
        print("in epoch[%d] = %.8f " % (epoch, running_loss / (batch_num+1)))
        running_loss = 0.0
    print('Finished Training')
In the second iteration(batch_num=1) of the first epoch, I get this error:
CUDA out of memory. Tried to allocate 153.12 MiB (GPU 0; 5.93 GiB
total capacity; 4.83 GiB already allocated; 66.94 MiB free; 374.12 MiB
Although I explicitly use del in my training loop, nvidia-smi shows that it has no effect: the memory is not freed.
What should I do?

Change this line:
running_loss += loss
to this:
running_loss += loss.item()
loss is a tensor that is still attached to the autograd graph it was computed from. By adding loss to running_loss, you make running_loss reference the graph of every batch processed so far, so PyTorch has to keep all of those graphs in memory even after you start training on the next batch. PyTorch assumes that you may want to use running_loss in some larger loss function over multiple batches later, and therefore keeps the gradient history (and whatever buffers it still holds) for all batches.
By adding .item() you get the loss as a plain Python float rather than a tensor. This float is detached from the PyTorch graph, so PyTorch knows you don't want gradients with respect to it and can free the graph of each batch.
If you are running an older version of PyTorch without .item() (before 0.4), you can try:
running_loss += loss.data[0]
This could also be caused by a similar bug in a test() loop, if you have one.
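As a minimal sketch of the difference, here is a CPU-only comparison. The tiny nn.Linear model, the random tensors, and the variable names running_tensor and running_float are assumptions made up for illustration; they are not taken from the question's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)            # hypothetical stand-in for the real network
criterion = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Accumulating the tensor itself: the running total becomes a tensor that is
# still attached to the autograd graph of every batch added to it.
running_tensor = 0.0
for _ in range(3):
    loss = criterion(model(x), y)
    running_tensor += loss

# Accumulating .item(): the running total is a plain Python float and holds
# no reference to any graph.
running_float = 0.0
for _ in range(3):
    loss = criterion(model(x), y)
    running_float += loss.item()

print(type(running_tensor).__name__, running_tensor.requires_grad)
print(type(running_float).__name__)
```

The first total reports requires_grad=True and has a grad_fn, which is exactly the reference that keeps earlier batches alive; the second is just a number.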


Strange loss curve while training EfficientNetV2 with Pytorch

I'm new to PyTorch. My architecture is a pre-trained EfficientNetV2 model connected to a single fully connected layer with one neuron, followed by a ReLU activation, for a regression task. However, the losses on both the training and validation sets suddenly increase after the first epoch, stay at about the same value for 50 epochs, and then suddenly drop back to about the same value as in the first epoch. Can anyone help me figure out what's happening?
Some code for the model and training process:
# hyper-parameter
image_size = 256
learning_rate = 1e-3
batch_size = 32
epochs = 60
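class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # NOTE: the attribute name was lost from this listing; self.net is an assumed stand-in
        self.net = models.efficientnet_v2_m(pretrained=True,weights='DEFAULT')
        # replace the final classifier layer with a single-output head
        self.net.classifier[1] = nn.Linear(in_features=1280, out_features=1, bias=True)
        self.net = nn.Sequential(self.net,nn.ReLU())

    def forward(self, input):
        output = self.net(input)
        return output

model = Model()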
# Define the loss function with Classification Cross-Entropy loss and an optimizer with Adam optimizer
loss_fn = nn.L1Loss()
optimizer = Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
# Function to test the model with the test dataset and print the accuracy for the test images
def testAccuracy():
    loss = 0.0
    total = 0.0
    with torch.no_grad():
        for data in validation_loader:
            images, labels = data
            device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
            # print("The model test will be running on", device, "device")
            # get the inputs
            images = Variable(images.to(device))
            labels = Variable(labels.to(device))
            # run the model on the test set to predict labels
            outputs = model(images)
            # the label with the highest energy will be our prediction
            # print('outputs: ',outputs)
            # print('labels: ',labels)
            temp = loss_fn(outputs, labels.unsqueeze(1))
            loss += loss_fn(outputs, labels.unsqueeze(1)).item()
            total += 1
    # compute the accuracy over all test images
    mae = loss/total
    return mae
# Training function. We simply have to loop over our data iterator and feed the inputs to the network and optimize.
def train(num_epochs):
    best_accuracy = 0.0
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    train_loss_all = []
    val_loss_all = []
    for epoch in range(num_epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        total = 0
        for i, (images, labels) in tqdm(enumerate(train_loader, 0),total=len(train_loader)):
            # get the inputs
            images = Variable(images.to(device))
            labels = Variable(labels.to(device))
            # zero the parameter gradients
            optimizer.zero_grad()
            # predict classes using images from the training set
            outputs = model(images)
            # compute the loss based on model output and real labels
            loss = loss_fn(outputs, labels.unsqueeze(1))
            # backpropagate the loss
            loss.backward()
            # adjust parameters based on the calculated gradients
            optimizer.step()
            # Let's print statistics for every one batch
            running_loss += loss.item()  # extract the loss value
            total += 1
        train_loss = running_loss/total
        accuracy = testAccuracy()
        if accuracy > best_accuracy:
            best_accuracy = accuracy
    history = {'train_loss':train_loss_all,'val_loss':val_loss_all}
Loss curve:
[image: loss curve]

Looking for help on why GPU is not used when I train a Pytorch model

The machine I am using for training has 4 GPUs. I am "moving" classifier, loss function and tensors to GPU. But when I run nvidia-smi on the machine while training is ongoing, I see GPU utilization is very low (3%) on one core and 0 on other cores.
Questions I have are
Is there an easier approach to ask Pytorch to use GPU and as many cores as available without me having to do so many .to(device) all over the place
Is there something other than .to(device) that is needed to use GPU?
Is there a way to see if training is happening on CPU vs GPU or is running nvidia-smi on the machine and looking at GPU utilization the only way?
How do I interpret GPU utilization of 3% in nvidia-smi. Does it mean CPU is being used in many places? If yes, is there a way to debug what is making the training use CPU?
Will setting num_workers to number of available cores in DataLoader class be enough to use multiple GPU cores? Is there any generic way to automatically learn number of GPU cores available?
Code used to train
torch.backends.cudnn.deterministic = True
start_time = time.time()
clf = MLP(len(X_training[0]), hidden_size=[100, 100, 100, 100, 100])
# Move to GPU if available
use_gpu = torch.cuda.is_available()
device = torch.device('cuda' if use_gpu else 'cpu')
# Define the loss function and optimizer
optimizer = torch.optim.Adam(clf.parameters(), lr=8e-4)
clf = clf.to(device)
loss_function = nn.BCELoss()
loss_function = loss_function.to(device)
# Run the training loop
# per_epoch_precision = []
# per_epoch_recall = []
for epoch in range(0, 150):
    # Set current loss value
    current_loss = 0.0
    dataset = MyDataset(X_training, y_training, use_gpu)
    kwargs = {'num_workers': 1, 'pin_memory': True} if use_gpu else {}
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=10000, shuffle=True, **kwargs)
    # Iterate over the DataLoader for training data
    clf.train()  # set to train mode
    for i, data in enumerate(trainloader):
        # Get inputs
        inputs, targets = data
        inputs = inputs.to(device)
        targets = targets.to(device)
        # Zero the gradients
        optimizer.zero_grad()
        # Perform forward pass
        outputs = clf(inputs)
        # Compute loss
        targets = targets.float().unsqueeze(1)
        loss = loss_function(outputs, targets)
        # Perform backward pass
        loss.backward()
        # Perform optimization
        optimizer.step()
        # Print statistics
        current_loss += loss.item()
        if i % 20000 == 19999:
            print("Loss after mini-batch %5d: %.3f" % (i + 1, current_loss / 500))
            current_loss = 0.0
# Process is complete.
print("Training process has finished.")
print(f"Train time is {time.time() - start_time}")
class MyDataset(Dataset):
    def __init__(self, x, y, use_gpu=False):
        x = x.astype(np.float32)
        self.x_train = torch.from_numpy(x)
        self.y_train = torch.from_numpy(y.values)
        if use_gpu:
            device = torch.device("cuda")
            # self.y_train = torch.LongTensor(y.values,

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self,idx):
        return self.x_train[idx],self.y_train[idx]
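class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, act_fn=nn.ReLU(), use_dropout=False, drop_rate=0.25):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.layers = nn.Sequential()
        # NOTE: the dropout, activation, and sigmoid lines were missing from this
        # listing; they are reconstructed from the constructor arguments and BCELoss
        if use_dropout:
            self.layers.append(nn.Dropout(p=drop_rate))
        self.layers.append(nn.Linear(self.input_size, self.hidden_size[0]))
        self.layers.append(act_fn)
        for i in range(1, len(hidden_size)):
            if use_dropout:
                self.layers.append(nn.Dropout(p=drop_rate))
            self.layers.append(nn.Linear(self.hidden_size[i - 1], self.hidden_size[i]))
            self.layers.append(act_fn)
        if use_dropout:
            self.layers.append(nn.Dropout(p=drop_rate))
        self.layers.append(nn.Linear(self.hidden_size[-1], 1))
        self.layers.append(nn.Sigmoid())  # BCELoss expects probabilities in [0, 1]

    def forward(self, x):
        return self.layers(x)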

How to understand a periodicity in the training loss using a pre-trained model of PyTorch?

I'm using pre-trained models from PyTorch (ResNet-18, 34, 50) to classify images. During training, a strange periodicity appears in the training loss, as you can see in the image below. Has anybody run into a similar issue? To deal with overfitting, I'm using data augmentation in the preprocessing.
When using SGD as an optimizer with the following parameters, we obtain this sort of graph:
criterion: NLLLoss()
learning rate: 0.0001
epoch: 40
print every 40 iteration
We also tried Adam and AdaBound as optimizers, but the same periodicity was observed.
Thanks in advance for your answer!
Here is the code :
def train_classifier():
    start = timeit.default_timer()
    epochs = 40
    steps = 0
    print_every = 40
    model.to('cuda')
    for e in range(epochs):
        print('Currently running epoch',e,':')
        running_loss = 0
        for images, labels in iter(train_loader):
            steps += 1
            images, labels = images.to('cuda'), labels.to('cuda')
            optimizer.zero_grad()
            output = model.forward(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if steps % print_every == 0:
                # Turn off gradients for validation, saves memory and computations
                with torch.no_grad():
                    validation_loss, accuracy = validation(model, val_loader, criterion)
                print("Epoch: {}/{}.. ".format(e+1, epochs),
                      "Training Loss: {:.3f}.. ".format(running_loss/print_every),
                      "Validation Loss: {:.3f}.. ".format(validation_loss/len(val_loader)),
                      "Validation Accuracy: {:.3f}".format(accuracy/len(val_loader)))
                stop = timeit.default_timer()
                print('Time: ', stop - start)
                running_loss = 0
    return train,epo,valid,acc_valid

Pytorch quickstart calls model.eval() but not model.train()

In Pytorch quickstart tutorial the code uses model.eval() during evaluation/test but it does not call model.train() during training.
According to this and source, some modules like BatchNorm and Dropout need to know whether the model is in training or evaluation mode. The model in the tutorial does not use any such module, so it runs to convergence. Am I missing something, or does PyTorch's very first tutorial actually have a logical bug?
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
You can see there is no model.train() in the above code.
def test(dataloader, model):
    size = len(dataloader.dataset)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= size
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
At the second line, there is a model.eval().
Training loop:
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model)
This loop calls train() and test() methods without any call to model.train(). So after the first call of test(), the model is always in "evaluation" mode. If we add a BatchNorm to the model we'll be on our way to encounter a hard-to-find bug.
Main question:
Is it good practice to always call model.train() during training and model.eval() during evaluation/test?
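To make the concern concrete, here is a minimal sketch. It assumes a toy model with a Dropout layer (not the tutorial's model) and a hypothetical evaluate() helper; it shows that the mode set by model.eval() persists until model.train() is called again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy model with a mode-dependent layer (an assumption for illustration).
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

def evaluate(model, x):
    """Hypothetical evaluation helper: switches to eval mode and runs a forward pass."""
    model.eval()              # Dropout/BatchNorm switch to inference behaviour
    with torch.no_grad():
        return model(x)

first = evaluate(model, x)
second = evaluate(model, x)
# In eval mode Dropout is a no-op, so repeated forward passes agree.
deterministic_in_eval = torch.equal(first, second)

# The mode flag is sticky: nothing resets it when evaluate() returns.
mode_after_eval = model.training

# A training function would need to switch it back explicitly.
model.train()
mode_after_train = model.training

print(deterministic_in_eval, mode_after_eval, mode_after_train)
```

If the training function never calls model.train(), every epoch after the first evaluation runs with Dropout disabled and BatchNorm using its running statistics.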

How do you test a custom dataset in Pytorch?

I've been following PyTorch tutorials that use built-in datasets, which let you choose whether to load the training split or the test split. But now I'm using a .csv file and a custom dataset.
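class MyDataset(Dataset):
    def __init__(self, root, n_inp):
        self.df = pd.read_csv(root)
        # NOTE: the attribute name was lost from this listing; self.data is an assumed stand-in
        self.data = self.df.to_numpy()
        # assumed split: the first n_inp columns are inputs, the remaining columns are targets
        self.x , self.y = (torch.from_numpy(self.data[:,:n_inp]),
                           torch.from_numpy(self.data[:,n_inp:]))

    def __getitem__(self, idx):
        return self.x[idx, :], self.y[idx,:]

    def __len__(self):
        return len(self.data)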
How can I tell Pytorch not to train my test_dataset so I can use it as a reference of how accurate my model is?
train_dataset = MyDataset("heart.csv", input_size)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle =True)
test_dataset = MyDataset("heart.csv", input_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle =True)
In PyTorch, a custom dataset inherits from the Dataset class. It mainly implements two methods: __len__(), which reports the number of samples in the dataset, and __getitem__(), which returns a single sample for a given index. Batching is done by the DataLoader that wraps the dataset.
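A minimal sketch of this split of responsibilities is shown below. ToyDataset and its synthetic tensors are made up for illustration and stand in for the CSV-backed dataset from the question.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: n_samples rows with n_inp input features each."""

    def __init__(self, n_samples=10, n_inp=3):
        self.x = torch.arange(n_samples * n_inp, dtype=torch.float32).reshape(n_samples, n_inp)
        self.y = torch.arange(n_samples, dtype=torch.float32).unsqueeze(1)

    def __len__(self):
        # Number of samples, not number of batches.
        return len(self.x)

    def __getitem__(self, idx):
        # One (input, target) pair; the DataLoader stacks these into batches.
        return self.x[idx], self.y[idx]

dataset = ToyDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=False)

sample_x, sample_y = dataset[0]
batch_x, batch_y = next(iter(loader))
print(tuple(sample_x.shape), tuple(batch_x.shape), len(dataset), len(loader))
```

Indexing the dataset yields one sample, while iterating the loader yields stacked batches, with a smaller final batch when the dataset size is not a multiple of batch_size.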
Once the dataloader objects are initialized (train_loader and test_loader as specified in your code), you need to write a train loop and a test loop.
def train(model, optimizer, loss_fn, dataloader):
    for i, (input, gt) in enumerate(dataloader):
        if params.use_gpu:  # (if training on a GPU)
            input, gt = input.cuda(non_blocking = True), gt.cuda(non_blocking = True)
        optimizer.zero_grad()  # clear gradients from the previous step
        predicted = model(input)
        loss = loss_fn(predicted, gt)
        loss.backward()        # backpropagate
        optimizer.step()       # update the weights
and your test loop should be:
def test(model,loss_fn, dataloader):
    for i, (input, gt) in enumerate(dataloader):
        if params.use_gpu:  # (if testing on a GPU)
            input, gt = input.cuda(non_blocking = True), gt.cuda(non_blocking = True)
        predicted = model(input)
        loss = loss_fn(predicted, gt)
In addition, you can use a metrics dictionary to log your predictions, losses, epochs, etc. The main difference between the training and test loops is that the test loop excludes backpropagation (zero_grad(), backward(), step()), because it only runs inference.
for epoch in range(1, epochs + 1):
    train(model, optimizer, loss_fn, train_loader)
    test(model, loss_fn, test_loader)
There are a couple of things to note when you're testing in pytorch:
Put your model into evaluation mode so that things like dropout and batch normalization aren't in training mode: model.eval()
Put a wrapper around your testing code to avoid the computation of gradients (saving memory and time): with torch.no_grad():
Normalise or standardise your data according to your training set only. This is important for min/max normalisation or z-score standardisation so that the model accurately reflects test performance.
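As a rough sketch of the last point, z-score standardisation with training-set statistics could look like this. The synthetic tensors train_x and test_x are placeholders, not the heart.csv data.

```python
import torch

torch.manual_seed(0)
# Placeholder data: 3 features with a non-zero mean and a large spread.
train_x = torch.randn(100, 3) * 5.0 + 10.0
test_x = torch.randn(20, 3) * 5.0 + 10.0

# Statistics are computed from the training set only.
mean = train_x.mean(dim=0, keepdim=True)
std = train_x.std(dim=0, keepdim=True)

train_norm = (train_x - mean) / std
# The test set reuses the training statistics; computing its own mean/std
# would leak information about the test distribution into preprocessing.
test_norm = (test_x - mean) / std

print(train_norm.mean(dim=0), train_norm.std(dim=0))
```

The standardised training features have zero mean and unit standard deviation by construction; the test features will only be approximately so, which is expected.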
Other than that, what you've written looks pretty fine to me, as you're not applying any transforms to your data (for example, image flipping or gaussian noise injections). To show what code should look like in test mode, see below:
for e in range(num_epochs):
    model.train()
    for B, (dat, label) in enumerate(train_loader):
        # transforms here
        out = model(dat)
        loss = criterion(out, label)
        # backward pass and optimiser step go here
    model.eval()
    with torch.no_grad():
        global_corr = 0
        for B, (dat,label) in enumerate(test_loader):
            out = model(dat)
            # get batch eval metrics here!
