How to recover from PyTorch CUDA out of memory? - pytorch

I tried the following code. When the code in the try block fails because CUDA runs out of memory, I halve the batch size in the except block, but the same error appears again when running the model there. I'm sure the halved batch size is runnable, because running only the code from the except block (without first attempting the full batch) works fine. As an aside, is there any way to automatically pick a batch size that fully uses the CUDA memory without overflowing it?
try:
    output = model(Variable(torch.LongTensor(np.array(x))).to(device),
                   Variable(torch.LongTensor(np.array(pos))).to(device),
                   Variable(torch.LongTensor(np.array(m))).to(device))
    loss = criterion(output, Variable(torch.LongTensor(y)).to(device))  # lb.transform(y)
    loss.backward()
    optimizer.step()
    losses.append(loss.data.mean())
except:
    half = int(len(x) / 2)
    x1, x2 = x[:half], x[half:]
    pos1, pos2 = pos[:half], pos[half:]
    m1, m2 = m[:half], m[half:]
    y1, y2 = y[:half], y[half:]
    optimizer.zero_grad()
    output = model(Variable(torch.LongTensor(np.array(x1))).to(device),
                   Variable(torch.LongTensor(np.array(pos1))).to(device),
                   Variable(torch.LongTensor(np.array(m1))).to(device))
    loss = criterion(output, Variable(torch.LongTensor(y1)).to(device))  # lb.transform(y)
    loss.backward()
    optimizer.step()
    losses.append(loss.data.mean())
    output = model(Variable(torch.LongTensor(np.array(x2))).to(device),
                   Variable(torch.LongTensor(np.array(pos2))).to(device),
                   Variable(torch.LongTensor(np.array(m2))).to(device))
    loss = criterion(output, Variable(torch.LongTensor(y2)).to(device))  # lb.transform(y)
    loss.backward()
    optimizer.step()
    losses.append(loss.data.mean())

It looks like there is still something left on your GPU. Did you try freeing the CUDA cache with torch.cuda.empty_cache() at the beginning of the except block?
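If you want to apply that suggestion, one possible shape for it is sketched below. It reuses the names from the question (model, criterion, optimizer, losses, device); run_batch is a hypothetical helper, not part of the original code, and Variable is dropped since it is no longer needed in recent PyTorch versions.

# Sketch: catch only the out-of-memory error, clear the cache, retry on halves.
# run_batch is a hypothetical helper wrapping the forward/backward/step.
def run_batch(x, pos, m, y):
    optimizer.zero_grad()
    output = model(torch.LongTensor(np.array(x)).to(device),
                   torch.LongTensor(np.array(pos)).to(device),
                   torch.LongTensor(np.array(m)).to(device))
    loss = criterion(output, torch.LongTensor(y).to(device))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

try:
    run_batch(x, pos, m, y)
except RuntimeError as e:
    if "out of memory" not in str(e):
        raise
    # Tensors from the failed attempt went out of scope with run_batch,
    # so releasing the cached blocks frees that memory before retrying.
    torch.cuda.empty_cache()
    half = len(x) // 2
    run_batch(x[:half], pos[:half], m[:half], y[:half])
    run_batch(x[half:], pos[half:], m[half:], y[half:])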

Related

How to run a GNN example with Pytorch, on a CPU without CUDA?

I am trying to code a GNN example problem as shown in this link: https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8
I am using a 2016 MacBook Pro without an Nvidia graphics card.
The example uses the CUDA toolkit. Can I somehow modify the code and run it on my current laptop? I have made the dataset small enough that it does not require heavy computation and can run on my machine.
The part of the code that gives the error is as follows:
def train():
    model.train()
    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        label = data.y.to(device)
        loss = crit(output, label)
        loss.backward()
        loss_all += data.num_graphs * loss.item()
        optimizer.step()
    return loss_all / len(train_dataset)

device = torch.device('cuda')
model = Net().to(device)  # Net = a class inherited from torch.nn.Module
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
crit = torch.nn.BCELoss()
train_loader = DataLoader(train_dataset, batch_size=batch_size)

for epoch in range(num_epochs):
    train()
The error is as follows
AssertionError: Torch not compiled with CUDA enabled
You are using:
device = torch.device('cuda')
If you want to use the CPU, change it to:
device = torch.device('cpu')
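A common pattern (a small sketch, not from the original post) is to select the device based on availability, so the same script runs both on a CUDA machine and on a CPU-only laptop:

# Fall back to the CPU when no CUDA-capable GPU or build is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)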

Pytorch Repeating loss and AUC- When using cumulative loss

I am using PyTorch to accumulate losses and then run backpropagation (loss.backward()) once at the end.
When I do this, the loss is not updated and stays almost the same, and the AUC repeats exactly the same values. Is there anything I haven't considered when using a cumulative loss?
Thank you so much for any reply. :)
Below is the loss calculation that occurs in one batch.
opt.zero_grad()
for s in range(len(qshft)):
    for a in range(len(qshft[0])):
        if m[s][a]:
            y_pred = (y[s][a] * one_hot(qshft[s].long(), self.num_q)).sum(-1)
            y_pred = torch.masked_select(y_pred, m[s])
            t = torch.masked_select(rshft[s], m[s])
            loss += binary_cross_entropy(y_pred, t).clone().detach().requires_grad_(True)
            count += 1
loss = torch.tensor(loss / count, requires_grad=True)
loss.backward()
opt.step()
loss_mean.append(loss.detach().cpu().numpy())
The detach in the following line removes the computation graph, so loss.backward() and opt.step() won't update your weights, which results in the repeating loss and AUC.
loss += binary_cross_entropy(y_pred, t).clone().detach().requires_grad_(True)
You can do
loss += binary_cross_entropy(y_pred, t)
and change
loss = torch.tensor(loss/count,requires_grad=True)
to
loss = loss/count
But make sure you reset count and loss to 0 every time you go into this part.
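Putting both changes together, the per-batch computation might look like the sketch below. It reuses the names from the question, with loss and count reset at the top so gradients come only from the current batch.

opt.zero_grad()
loss = 0.0
count = 0
for s in range(len(qshft)):
    for a in range(len(qshft[0])):
        if m[s][a]:
            y_pred = (y[s][a] * one_hot(qshft[s].long(), self.num_q)).sum(-1)
            y_pred = torch.masked_select(y_pred, m[s])
            t = torch.masked_select(rshft[s], m[s])
            # Keep the graph: no clone().detach().requires_grad_() here.
            loss += binary_cross_entropy(y_pred, t)
            count += 1

loss = loss / count  # still connected to the computation graph
loss.backward()
opt.step()
loss_mean.append(loss.detach().cpu().numpy())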

Pytorch weights not updating.. sometimes

Not sure what causes this, but sometimes I start training my neural net and none of my weights update. This happens maybe 4 out of 5 times when I launch my script; the other time, everything updates, trains, and predicts as expected. Does anyone have any idea why this happens? It started when I changed my loss function, if that's relevant.
Here's the gross part of my training loop; let me know of any other relevant code I should include.
def train(model, train_loader, test_loader, test_data, full_test, args, epochs, early_stop=5):
    t0 = time()
    optimizer = Adam(model.parameters(), lr=args.lr)
    lr_decay = lr_scheduler.ExponentialLR(optimizer, gamma=args.lr_decay)
    best_val_acc, best_mae = 0, 500
    for epoch in range(epochs):
        model.train()
        ti = time()
        training_loss = 0.0
        for i, (x, y) in enumerate(train_loader):
            x, y = Variable(x.cuda()), Variable(y.cuda())
            y_pred = model(x, y)
            loss = mae_loss(y, y_pred) + rmse_loss(y, y_pred)
            loss.backward()
            training_loss += loss.detach() * x.size(0)
            optimizer.step()
            optimizer.zero_grad()
        lr_decay.step()
I believe by far the most likely issue is that your loss function is returning something incorrect. Try printing the first few losses to check that they are reasonable and have the correct dtype and shape. If the losses look fine, another possible reason for the weights not updating is that the learning rate is too low relative to your losses, so the weights change by such a small amount that the update is rounded off or simply not noticeable.
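A quick way to check both points is sketched below. It is not part of the original answer; it assumes the model, optimizer, and loss from the question's loop and prints the loss alongside how far the weights actually move on one step.

# Debugging sketch: watch the loss and how much the weights actually change.
def debug_step(model, optimizer, loss):
    with torch.no_grad():
        before = [p.detach().clone() for p in model.parameters()]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        delta = sum((p - b).abs().sum().item()
                    for p, b in zip(model.parameters(), before))
    print(f"loss={loss.item():.6f} dtype={loss.dtype} "
          f"shape={tuple(loss.shape)} weight change={delta:.3e}")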

Pytorch: Custom Loss involving Norm of End-to-End Jacobian

Cross posting from Pytorch discussion boards
I want to train a network using a modified loss function that has both a typical classification loss (e.g. nn.CrossEntropyLoss) and a penalty on the Frobenius norm of the end-to-end Jacobian (i.e., if f(x) is the output of the network, the norm of \nabla_x f(x)).
I've implemented a model that learns successfully with nn.CrossEntropyLoss. However, when I add the second loss term (by doing two backward passes), the training loop runs but the model never learns. Furthermore, if I compute the end-to-end Jacobian but don't include it in the loss, the model also never learns. At a high level, my code does the following:
1. Forward pass to get the predicted classes, yhat, from the inputs x
2. Call yhat.backward(torch.ones(appropriate shape), retain_graph=True)
3. Take the Jacobian norm as x.grad.data.norm(2)
4. Set the loss to the classification loss + a scalar coefficient * the Jacobian norm
5. Run loss.backward()
I suspect that I’m misunderstanding how backward() works when run twice, but I haven’t been able to find any good resources to clarify this.
Too much is required to produce a working example, so I’ve tried to extract the relevant code:
def train_model(model, train_dataloader, optimizer, loss_fn, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    train_loss = 0
    correct = 0
    for batch_idx, (batch_input, batch_target) in enumerate(train_dataloader):
        batch_input, batch_target = batch_input.to(device), batch_target.to(device)
        optimizer.zero_grad()
        batch_input.requires_grad_(True)
        model_batch_output = model(batch_input)
        loss = loss_fn(model_output=model_batch_output, model_input=batch_input, model=model, target=batch_target)
        train_loss += loss.item()  # sum up batch loss
        loss.backward()
        optimizer.step()
and
def end_to_end_jacobian_loss(model_output, model_input):
    model_output.backward(
        torch.ones(*model_output.shape),
        retain_graph=True)
    jacobian = model_input.grad.data
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
Edit 1: I swapped my previous .backward()-based implementation for autograd.grad and it apparently works! What's the difference?
def end_to_end_jacobian_loss(model_output, model_input):
    jacobian = autograd.grad(
        outputs=model_output['penultimate_layer'],
        inputs=model_input,
        grad_outputs=torch.ones(*model_output['penultimate_layer'].shape),
        retain_graph=True,
        only_inputs=True)[0]
    jacobian_norm = jacobian.norm(2)
    return jacobian_norm
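For reference, a minimal double-backprop sketch of a Jacobian-norm penalty is shown below. It is not the poster's code; it combines the pieces above into one function and uses create_graph=True, which is what allows the penalty term itself to be differentiated with respect to the weights when loss.backward() runs (x.grad, by contrast, is detached from the graph).

import torch
from torch import autograd, nn

# Sketch: classification loss plus an end-to-end Jacobian-norm penalty.
def combined_loss(model, x, target, coeff=0.01):
    x.requires_grad_(True)
    output = model(x)
    class_loss = nn.functional.cross_entropy(output, target)
    # create_graph=True records the gradient computation itself,
    # so the norm below can be backpropagated through.
    jacobian = autograd.grad(
        outputs=output,
        inputs=x,
        grad_outputs=torch.ones_like(output),
        create_graph=True)[0]
    return class_loss + coeff * jacobian.norm(2)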

Can we change the contents of the tensors once it is used?

I am wondering whether we are allowed to change the contents of tensors after they have been passed to a loss function in PyTorch. For example:
x = torch.zeros(1000)
y = torch.zeros(1000)
output = net(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()
After we have done this, can we change the contents of y and output without ill-effect? For example:
y[0] = 990
output[0] = 1000
If I do this after one mini-batch but keep feeding the network more mini-batches, will it cause issues?
I am not sure, because the tensors might still be referenced internally by the computational graph.
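One conservative way to sidestep the concern (a sketch, not from the original post) is to modify detached copies instead of the tensors that were fed to the loss, so nothing the graph might still reference is ever touched in place:

# Work on detached clones; the originals used in the loss stay untouched.
y_copy = y.clone()
output_copy = output.detach().clone()
y_copy[0] = 990
output_copy[0] = 1000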
