RuntimeError: CUDA error: unspecified launch failure with python 3.7 - pytorch

The training process starts fine. To avoid a CUDA out-of-memory error, I set the batch size to 1, but after a few epochs, sometimes even most of the time, I get RuntimeError: CUDA error: unspecified launch failure. It is very frustrating.
Every time the error occurs the input size is different, so it does not fail on the same input.
train.py:
for i, (inputs, target, _) in enumerate(train_loader):
    print(torch.cuda.is_available())
    input_var = [input.cuda() for input in inputs]
    target_var = target.cuda()
    output = model(input_var)
    loss = criterion(output, target_var)
    losses.update(loss.item(), 1)
    # compute accuracy
    prec1, prec5 = accuracy(output.data.cpu(), target, topk=(1, 5))
    top1.update(prec1[0].item(), 1)
    top5.update(prec5[0].item(), 1)
    # zero the parameter gradients
    optimizer.zero_grad()
    # compute gradient
    loss.backward()
    optimizer.step()
    .....
Output :
True
136
Traceback (most recent call last):
  File "....\train.py", line 273, in <module>
    train(train_loader, model, criterion, optimizer, epoch)
  File ".....\train.py", line 75, in train
    input_var = [input.cuda() for input in inputs]
  File "......\train.py", line 75, in <listcomp>
    input_var = [input.cuda() for input in inputs]
RuntimeError: CUDA error: unspecified launch failure
Do you have any idea how I can fix the error?
Thanks.
Windows 10
NVIDIA GeForce GTX 1060
Torch 1.6
CUDA 10.1
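Not part of the original post, but a common first debugging step for this error: CUDA kernels launch asynchronously, so the line in the traceback (here input.cuda()) is often only where the failure surfaces, not where it originates. A minimal sketch that forces synchronous launches so the traceback points at the kernel that actually failed:

import os

# Force synchronous CUDA kernel launches; must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after setting the environment variable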

Related

Tensorflow HammingLoss gives ValueError with keras.utils.Sequence

I am working on a multi-label image classification problem with 13 labels. I want to use Hamming Loss to evaluate the performance of the model. So I specified tfa.metrics.HammingLoss(mode = 'multilabel') in the metrics parameter during model compilation. This worked when I provided both X_train and y_train to model.fit(), but it threw a ValueError when I used a Sequence object (described below) for training.
Data Generator description
I used a keras.utils.Sequence input object similar to what is present here. The generator returns 2 numpy arrays for each batch - the first array consists of the input images of shape (128, 128, 3) and the second array consists of labels each of shape (13,).
This is what my code looks like:
model.compile(
    loss='binary_crossentropy',
    optimizer='rmsprop',
    metrics=[tfa.metrics.HammingLoss(mode='multilabel')]
)
model.fit(
    train_datagen,
    epochs=5,
    batch_size=BATCH_SIZE,
    steps_per_epoch=TOTAL // BATCH_SIZE
)
And this is the error that I obtained:
Epoch 1/5
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-140-978987a2bbaa> in <module>
      3     epochs=5,
      4     batch_size=BATCH_SIZE,
----> 5     steps_per_epoch = 2000 // BATCH_SIZE
      6     # validation_data=validation_generator,
      7 )

4 frames
/usr/local/lib/python3.7/dist-packages/tensorflow_addons/metrics/hamming.py in else_body_2()
     64     try:
     65         do_return = True
---> 66         retval_ = (ag__.ld(nonzero) / ag__.converted_call(ag__.ld(y_true).get_shape, (), None, fscope)[(- 1)])
     67     except:
     68         do_return = False

ValueError: in user code:

    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1051, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.7/dist-packages/tensorflow_addons/metrics/utils.py", line 66, in update_state  *
        matches = self._fn(y_true, y_pred, **self._fn_kwargs)
    File "/usr/local/lib/python3.7/dist-packages/tensorflow_addons/metrics/hamming.py", line 133, in hamming_loss_fn  *
        return nonzero / y_true.get_shape()[-1]

    ValueError: None values not supported.
How do I correct this? Is there any issue with the format of the labels?
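Not part of the original post, but the traceback shows hamming_loss_fn dividing by y_true.get_shape()[-1], which is None whenever the labels' static shape is unknown, as it is for batches coming from a plain generator. A sketch of one possible workaround, assuming train_datagen is the Sequence from the question: wrap it in a tf.data.Dataset with an explicit output_signature so the last label dimension (13) is statically known:

import tensorflow as tf

# Give the label batches a static last dimension so
# y_true.get_shape()[-1] is 13 rather than None.
train_ds = tf.data.Dataset.from_generator(
    lambda: iter(train_datagen),
    output_signature=(
        tf.TensorSpec(shape=(None, 128, 128, 3), dtype=tf.float32),  # image batches
        tf.TensorSpec(shape=(None, 13), dtype=tf.float32),           # multi-hot labels
    ),
)
model.fit(train_ds, epochs=5, steps_per_epoch=TOTAL // BATCH_SIZE)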

pytorch-error : RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

Hello everyone, I'm a beginner with PyTorch. I just defined a very simple linear regression model, but unfortunately my program throws an error. I searched for this error but was unable to resolve the problem. Can someone help me? Thank you in advance.
My program is as follows:
import torch
import numpy as np
import torch.nn as nn

x_values = [i for i in range(11)]
x_train = np.array(x_values, dtype=np.float32)
x_train = x_train.reshape(-1, 1)

y_values = [2*i + 1 for i in range(len(x_values))]
y_train = np.array(y_values, dtype=np.float32)
y_train = y_train.reshape(-1, 1)

class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        out = self.linear(x)
        return out

input_dim = 1
output_dim = 1
model = LinearRegressionModel(input_dim, output_dim)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 1000
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()

for epoch in range(epochs):
    epoch += 1
    inputs = torch.from_numpy(x_train).to(device)
    labels = torch.from_numpy(y_train).to(device)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print('epoch {}, loss {}'.format(epoch, loss.item()))

torch.save(model.state_dict(), 'model.pkl')
print(model.load_state_dict(torch.load('model.pkl')))
predicted = model(torch.from_numpy(x_train).requires_grad_()).data.numpy()
print('predicted:', predicted)
I have a basic understanding of the cause of the error: all tensors in a computation must live on the same device. I intended to train the linear regression model on the GPU, and I did move both the model and the training inputs there, but the program still raises the error.
The error occurs at the following line:
predicted = model(torch.from_numpy(x_train).requires_grad_()).data.numpy()
Traceback (most recent call last):
  File "F:\pytorch_Study\My_program.py", line 71, in <module>
    predicted = model(torch.from_numpy(x_train).requires_grad_()).data.numpy()
  File "D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\pytorch_Study\My_program.py", line 27, in forward
    out = self.linear(x)
  File "D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Anaconda\envs\pytorch\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_addmm)
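Not part of the original post, but the traceback makes the cause visible: the training inputs are moved with .to(device), while the prediction input torch.from_numpy(x_train) stays on the CPU and meets the cuda:0 weights inside nn.Linear. A minimal fix sketch for the prediction step:

# Move the prediction input to the model's device, and bring the
# output back to the CPU before converting it to a NumPy array.
with torch.no_grad():
    inputs = torch.from_numpy(x_train).to(device)
    predicted = model(inputs).cpu().numpy()
print('predicted:', predicted)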

Saving after Tensorflow model.fit() errors out with "ValueError: Model cannot be saved...."

Here is the code leading up to saving the model:
# split and validate data
x_train, x_test, y_train, y_test = train_test_split(INPUT, OUTPUT, test_size=0.1, random_state=42)
xi = [[float(y) for y in x] for x in x_train]
yi = [float(x) for x in y_train]
xt = [[float(y) for y in x] for x in x_test]
yt = [float(x) for x in y_test]

# Shuffle and slice the dataset.
train_dataset = tf.data.Dataset.from_tensor_slices((xi, yi))
train_dataset = train_dataset.shuffle(1024).cache().repeat().batch(BATCH_SIZE)

model = SubclassedModel()
opt = tf.keras.optimizers.SGD(
    learning_rate=model.learning_rate,
    momentum=model.momentum,
    nesterov=False,
    name="SGD"
)
model.compile(run_eagerly=True, optimizer=opt)
lr_schedule = tf.keras.callbacks.LearningRateScheduler(scheduler)

model.fit(
    train_dataset,
    epochs=50,
    verbose=1,
    batch_size=64,
    use_multiprocessing=True,
    steps_per_epoch=300,
    shuffle=True,
    callbacks=[lr_schedule]
)

model.save(os.path.join(CURRENT_DIRECTORY, "net"), save_format="tf")
The model loads and trains just fine (loss decreases across training steps), but then errors out when saving with the following:
WARNING:tensorflow:Skipping full serialization of Keras layer <src.model.SubclassedModel object at 0x7f5e0873c790>, because it is not built.
2022-10-28 13:59:19: train - ERROR - Traceback Error: Traceback (most recent call last):
  File "/home/.../src/train.py", line 336, in train
    output_model_path, output_weights_path = save_model_with_weights(model)
  File "/home/.../src/model/SubclassedModel.py", line 549, in save_model_with_weights
    model.save(full_model_path, save_format="tf")
  File "/home/.../venv/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/.../venv/lib/python3.9/site-packages/keras/saving/saving_utils.py", line 97, in raise_model_input_error
    raise ValueError(
ValueError: Model <src.model.SubclassedModel object at 0x7f5e0873c790> cannot be saved either because the input shape is not available or because the forward pass of the model is not defined. To define a forward pass, please override `Model.call()`. To specify an input shape, either call `build(input_shape)` directly, or call the model on actual data using `Model()`, `Model.fit()`, or `Model.predict()`. If you have a custom training step, please make sure to invoke the forward pass in train step through `Model.__call__`, i.e. `model(inputs)`, as opposed to `model.call()`.
Environment:
Tensorflow 2.10
Ubuntu 20.04
Python 3.9
The model uses custom train_step, test_step, and predict_step to handle the training, evaluating, and predicting.
Custom train_step:
def train_step(self, data):
    # Unpack the data.
    x, y, _ = data_adapter.unpack_x_y_sample_weight(data)
    # Compute gradients
    with tf.GradientTape(persistent=True) as tape:
        tape.watch([x, y])
        loss = self.regression_reconstruction(x, y)
    autoencoder_vars = self.autoencoder.trainable_variables
    reconstruction_gradients = tape.gradient(loss, autoencoder_vars)
    ... logic to apply gradients ...
    # Return a dict mapping metric names to current value
    return {"loss": loss, "lr": self.optimizer._decayed_lr(var_dtype=tf.float32)}
Am I calling fit incorrectly before saving?
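Not part of the original post, but the final sentence of the error message matches what the custom train_step does: it calls self.regression_reconstruction(x, y) directly, so Model.__call__ is never invoked and Keras never learns the input shape. A sketch of one workaround, assuming SubclassedModel overrides call(): run one real batch through the model before saving.

# Pass one batch through Model.__call__ so Keras records the input
# shape and marks the model as built, then save as before.
sample_x, _ = next(iter(train_dataset))
_ = model(sample_x)  # model(...), not model.call(...)
model.save(os.path.join(CURRENT_DIRECTORY, "net"), save_format="tf")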

Optuna Pytorch: returned value from the objective function cannot be cast to float

def autotune(trial):
    cfg = { 'device' : "cuda" if torch.cuda.is_available() else "cpu",
            # 'train_batch_size' : 64,
            # 'test_batch_size' : 1000,
            # 'n_epochs' : 1,
            # 'seed' : 0,
            # 'log_interval' : 100,
            # 'save_model' : False,
            # 'dropout_rate' : trial.suggest_uniform('dropout_rate', 0, 1.0),
            'lr' : trial.suggest_loguniform('lr', 1e-3, 1e-2),
            'momentum' : trial.suggest_uniform('momentum', 0.4, 0.99),
            'optimizer': trial.suggest_categorical('optimizer', [torch.optim.Adam, torch.optim.SGD, torch.optim.RMSprop, torch.optim.$
            'activation': F.tanh}

    optimizer = cfg['optimizer'](model.parameters(), lr=cfg['lr'])
    #optimizer = torch.optim.Adam(model.parameters(),lr=0.001
As you can see above, I am trying to run Optuna trials to search for the optimal hyperparameters for my CNN model.
# Train the model
# use small epoch for large dataset
# An epoch is 1 run through all the training data
# losses = [] # use this array for plotting losses
for _ in range(epochs):
    # using data_loader
    for i, (data, labels) in enumerate(trainloader):
        # Forward and get a prediction
        # x is the training data which is X_train
        if name.lower() == "rnn":
            model.hidden = (torch.zeros(1, 1, model.hidden_sz),
                            torch.zeros(1, 1, model.hidden_sz))
        y_pred = model.forward(data)

        # compute loss/error by comparing predicted out vs actual labels
        loss = criterion(y_pred, labels)
        #losses.append(loss)

        if i % 10 == 0:  # print out the loss every 10 batches
            print(f'epoch {i} and loss is: {loss}')

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
study = optuna.create_study(sampler=optuna.samplers.TPESampler(), direction='minimize',pruner=optuna.pruners.SuccessiveHalvingPrune$
study.optimize(autotune, n_trials=1)
But when I run the above code to tune and find the optimal parameters, the following error occurs. It seems the trial failed, even though I still get epoch losses and values. Please advise, thanks!
[W 2020-11-11 13:59:48,000] Trial 0 failed, because the returned value from the objective function cannot be cast to float. Returned value is: None
Traceback (most recent call last):
  File "autotune2", line 481, in <module>
    n_instances, n_features, scores = run_analysis()
  File "autotune2", line 350, in run_analysis
    print(study.best_params)
  File "/home/shar/anaconda3/lib/python3.7/site-packages/optuna/study.py", line 67, in best_params
    return self.best_trial.params
  File "/home/shar/anaconda3/lib/python3.7/site-packages/optuna/study.py", line 92, in best_trial
    return copy.deepcopy(self._storage.get_best_trial(self._study_id))
  File "/home/shar/anaconda3/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 287, in get_best_trial
    raise ValueError("No trials are completed yet.")
ValueError: No trials are completed yet.
This exception is raised because the objective function of your study must return a float.
In your case, the problem is in this line:
study.optimize(autotune, n_trials=1)
The autotune function defined above does not return a value, so it cannot be used for optimization.
How to fix?
For hyperparameter search, the autotune function must return some metric obtained after training, such as the loss or the cross-entropy.
A quick fix to your code could look something like this:
def autotune(trial):
    cfg = { 'device' : "cuda" if torch.cuda.is_available() else "cpu"
            ...etc...
          }
    best_loss = float('inf')
    # Train the model
    for _ in range(epochs):
        for i, (data, labels) in enumerate(trainloader):
            ... (train the model) ...
            # compute loss/error by comparing predicted out vs actual labels
            loss = criterion(y_pred, labels)
            best_loss = min(loss.item(), best_loss)  # .item() casts the tensor to a Python float
    return best_loss
There is a good example with PyTorch in the Optuna repo that uses a PyTorch callback to retrieve the accuracy (it can easily be changed to use the RMSE if needed). It also runs more than one trial and takes the median for the hyperparameters.
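Not part of the original answer, but a minimal self-contained sketch of the shape Optuna expects an objective to have (the quadratic stands in for real training; the names are hypothetical, not the question's CNN):

import optuna

def objective(trial):
    lr = trial.suggest_loguniform('lr', 1e-3, 1e-2)
    # ... train the real model here; this quadratic is only a stand-in ...
    validation_loss = (lr - 5e-3) ** 2
    return float(validation_loss)  # Optuna needs a value castable to float

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print(study.best_params)  # works now because completed trials returned floats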

Pytorch training loss function throws: "TypeError: 'Tensor' object is not callable"

I use Python 3.x and PyTorch 1.5.0 with a GPU. I am trying to write a simple multinomial logistic regression on the MNIST data.
My issue is that the loss() call throws a TypeError: 'Tensor' object is not callable while looping through the training batches. The thing that baffles me is that the error does not show up in the first iteration of the loop, but only for the second batch, where I get the full error below:
Traceback (most recent call last):
  File "/snap/pycharm-community/207/plugins/python-ce/helpers/pydev/pydevd.py", line 1448, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-community/207/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/pytorch_tutorial/Pytorch_feed_fwd_310720.py", line 78, in <module>
    loss = loss(preds,ys)
TypeError: 'Tensor' object is not callable
The loss() function here is simply loss = nn.CrossEntropyLoss(). The full code is below. Any pointers would be very welcome.
for epoch in range(5):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        xs, ys = data
        opt.zero_grad()
        preds = net(xs)
        loss = loss(preds, ys)
        loss.backward()
        opt.step()

        # print statistics
        running_loss += loss.item()
        if i % 1000 == 999:  # print every 1000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
    print('epoch {}, loss {}'.format(epoch, loss.item()))
a = 1
It is because you are rebinding the name loss inside the loop: on the first iteration, loss = loss(preds, ys) replaces the nn.CrossEntropyLoss module with the tensor it returns, so on the second iteration you are calling a tensor.
Change loss = loss(preds, ys) to _loss = loss(preds, ys), or give the criterion a different name so it is never shadowed.
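A minimal sketch of the rename, keeping the criterion and the per-batch loss under different names:

criterion = nn.CrossEntropyLoss()
for i, data in enumerate(trainloader, 0):
    xs, ys = data
    opt.zero_grad()
    preds = net(xs)
    loss = criterion(preds, ys)  # 'criterion' stays callable on every iteration
    loss.backward()
    opt.step()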
