Why am I getting: CUDA error: device-side assert triggered - PyTorch

I have the following model:
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Sequential                               --                        --
├─Identity: 1-1                          [64, 1, 6, 2]             --
├─Conv2d: 1-2                            [64, 64, 6, 2]            576
├─Flatten: 1-3                           [64, 768]                 --
├─Linear: 1-4                            [64, 10]                  7,690
├─ReLU: 1-5                              [64, 10]                  --
├─Linear: 1-6                            [64, 1]                   11
==========================================================================================
I'm trying to train the model and calculate accuracy:
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Loss = nn.CrossEntropyLoss()

inputs, labels = next(iter(oTrainDL))
inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)

oModel = GetModel()
oModel.to(DEVICE)
oModel.train(True)

inputs = inputs.type(torch.cuda.FloatTensor)
mZ = oModel(inputs)
labels = labels.type(torch.cuda.LongTensor)
loss = Loss(mZ, labels)

with torch.no_grad():
    print(f"{mZ.shape}, {labels.shape}")
    print(f"{mZ.is_cuda}, {labels.is_cuda}")
    res = (mZ == labels).float().mean().item()
But I'm getting an error:
RuntimeError Traceback (most recent call last)
Input In [13], in <cell line: 18>()
21 print(f"{mZ.shape}, {labels.shape}")
22 print(f"{mZ.is_cuda}, {labels.is_cuda}")
---> 23 res = (mZ == labels).float().mean().item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
What am I doing wrong?
How can I fix it?
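(Not a confirmed diagnosis, just a sketch of two things worth checking.) nn.CrossEntropyLoss expects raw logits of shape [batch, num_classes] and integer labels in [0, num_classes); an out-of-range label is a classic cause of this device-side assert, and with the last layer being Linear -> [64, 1] only label 0 would be valid. Accuracy is also normally computed from the arg-max of the logits rather than by comparing float logits with integer labels. A minimal sketch, reusing the tensor names from the question:

```python
import torch

# Re-run with CUDA_LAUNCH_BLOCKING=1 (or on the CPU) first, so the assert
# points at the real failing operation instead of a later call.

with torch.no_grad():
    num_classes = mZ.shape[1]  # size of the model's output dimension

    # Labels must lie in [0, num_classes) for nn.CrossEntropyLoss.
    assert labels.min().item() >= 0 and labels.max().item() < num_classes, "label out of range"

    # Accuracy: compare predicted class indices, not raw logits, against the labels.
    preds = mZ.argmax(dim=1)
    res = (preds == labels).float().mean().item()

print(res)
```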

Related

Tensorflow HammingLoss gives ValueError with keras.utils.Sequence

I am working on a multi-label image classification problem with 13 labels. I want to use Hamming Loss to evaluate the performance of the model. So I specified tfa.metrics.HammingLoss(mode = 'multilabel') in the metrics parameter during model compilation. This worked when I provided both X_train and y_train to model.fit(), but it threw a ValueError when I used a Sequence object (described below) for training.
Data Generator description
I used a keras.utils.Sequence input object similar to what is present here. The generator returns 2 numpy arrays for each batch - the first array consists of the input images of shape (128, 128, 3) and the second array consists of labels each of shape (13,).
This is what my code looks like:
model.compile(
    loss='binary_crossentropy',
    optimizer='rmsprop',
    metrics=[tfa.metrics.HammingLoss(mode='multilabel')]
)

model.fit(
    train_datagen,
    epochs=5,
    batch_size=BATCH_SIZE,
    steps_per_epoch=TOTAL // BATCH_SIZE
)
And this is the error that I obtained:
Epoch 1/5
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-140-978987a2bbaa> in <module>
3 epochs=5,
4 batch_size=BATCH_SIZE,
----> 5 steps_per_epoch = 2000 // BATCH_SIZE
6 # validation_data=validation_generator,
7 )
4 frames
/usr/local/lib/python3.7/dist-packages/tensorflow_addons/metrics/hamming.py in else_body_2()
64 try:
65 do_return = True
---> 66 retval_ = (ag__.ld(nonzero) / ag__.converted_call(ag__.ld(y_true).get_shape, (), None, fscope)[(- 1)])
67 except:
68 do_return = False
ValueError: in user code:
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1051, in train_function *
return step_function(self, iterator)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_addons/metrics/utils.py", line 66, in update_state *
matches = self._fn(y_true, y_pred, **self._fn_kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_addons/metrics/hamming.py", line 133, in hamming_loss_fn *
return nonzero / y_true.get_shape()[-1]
ValueError: None values not supported.
How do I correct this? Is there any issue with the format of the labels?
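One workaround, as a sketch rather than an official tensorflow_addons fix: the metric fails because y_true.get_shape()[-1] is None when the label shape cannot be inferred from the Sequence, so you can compute the multilabel Hamming loss yourself with the label count (13 here) pinned explicitly. The 0.8 threshold mirrors the tfa default and is an assumption:

```python
import tensorflow as tf

NUM_LABELS = 13  # known from the problem description

def hamming_loss_metric(y_true, y_pred, threshold=0.8):
    # Pin the static label dimension that the Sequence does not expose.
    y_true = tf.ensure_shape(tf.cast(y_true, tf.float32), (None, NUM_LABELS))
    y_pred = tf.cast(y_pred > threshold, tf.float32)
    # Fraction of label positions that disagree, averaged over the batch.
    mismatches = tf.reduce_sum(tf.abs(y_true - y_pred), axis=-1)
    return tf.reduce_mean(mismatches / NUM_LABELS)

model.compile(
    loss='binary_crossentropy',
    optimizer='rmsprop',
    metrics=[hamming_loss_metric]
)
```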

Unexpected ValueError after training Keras NN a few times

I am working on time series prediction using RNNs implemented in Keras on Google Colaboratory. I implemented the RNN as follows:
from tensorflow import keras
mae = keras.losses.MeanAbsoluteError()
hidden_neurons = 50
output_neurons = 1
epoch_size = 50
batch_size = 72
# x_train has shape (500, 1, 23)
LSTM_layer = keras.layers.LSTM(hidden_neurons, input_shape = (x_train.shape[1], x_train.shape[2]), dropout = 0.05)
output_layer = keras.layers.Dense(1)
test_model = keras.Sequential(layers = (LSTM_layer, output_layer))
test_model.reset_states()
test_model.compile(optimizer = 'adam', loss = mae)
test_model.summary()
history = test_model.fit(tf.expand_dims(x_train, axis=-1), y_train, epochs = epoch_size, batch_size = batch_size, validation_data=(x_test, y_test), shuffle = False)
# y_train has shape (500, 1)
# x_test has shape (500, 1, 23)
# y_test has shape (500, 1)
I have the above code (except the import) in a single code cell. When I start a fresh runtime, the network trains fine as expected. But after executing the code cell around 3-4 times, Colab throws the following error:
ValueError Traceback (most recent call last)
<ipython-input-23-3ac5cc808611> in <module>
12 test_model.compile(optimizer = 'adam', loss = mae)
13 test_model.summary()
---> 14 history = test_model.fit(tf.expand_dims(x_train, axis=-1), y_train, epochs = epoch_size, batch_size = batch_size, validation_data=(x_test, y_test), shuffle = False)
...
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
1145 except Exception as e: # pylint:disable=broad-except
1146 if hasattr(e, "ag_error_metadata"):
-> 1147 raise e.ag_error_metadata.to_exception(e)
1148 else:
1149 raise
ValueError: Input 0 of layer "sequential_2" is incompatible with the layer: expected shape=(None, 1, 23), found shape=(None, 23)
The error persists if tf.expand_dims(x_train, axis=-1) is omitted from test_model.fit() while fitting the Sequential model.
I guess this has something to do with the layer inputs somehow being changed during execution. I have tried using test_model.reset_states() and running
keras.backend.clear_session()
del test_model
in a separate code cell, but only forcibly killing the Colab runtime seems to work:
import os
os.kill(os.getpid(), 9)
What could cause the layer inputs to change midway during program run?
EDIT: I got the same error when I tried running the cells on Jupyter Notebook on my PC rather than on Colab.
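For comparison, a sketch that avoids stale layer objects between cell runs (assuming x_train really has shape (500, 1, 23) as the comment says, so no expand_dims is needed): build the model inside a function and clear the backend session first, so re-running the cell always starts from freshly created layers.

```python
import tensorflow as tf
from tensorflow import keras

def build_model(timesteps, features, hidden_neurons=50):
    # Start from a clean graph so previously built layers cannot be reused
    # with a different input shape when the cell is re-run.
    keras.backend.clear_session()
    model = keras.Sequential([
        keras.layers.LSTM(hidden_neurons, input_shape=(timesteps, features), dropout=0.05),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss=keras.losses.MeanAbsoluteError())
    return model

test_model = build_model(x_train.shape[1], x_train.shape[2])
history = test_model.fit(x_train, y_train, epochs=50, batch_size=72,
                         validation_data=(x_test, y_test), shuffle=False)
```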

ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor

I am trying to connect the last layers of two UNet models using the functional API, and I am running into this issue. I think the error happens somewhere in the linking between the input layer and each model, but I have no clue where to fix it.
data_input = keras.Input(shape=(512,512,3))
model_a = sm.Unet(BACKBONE1, input_shape=(512,512,3), encoder_weights='imagenet', classes=n_classes, activation=activation)
model_a(data_input)
model_a_output = model_a.get_layer('decoder_stage4b_relu').output
model_b = sm.Unet(BACKBONE2, input_shape=(512,512,3), encoder_weights='imagenet', classes=n_classes, activation=activation)
model_b(data_input)
model_b_output = model_b.get_layer('decoder_stage4b_relu').output
concat = tf.keras.layers.concatenate([model_a_output,model_b_output], axis=3)
data_output = layers.Conv2D(3, 2, padding="same", activation = "sigmoid")(concat)
ensemble_model= keras.Model(inputs=data_input, outputs=data_output, name="ensemble_model")
ensemble_model.summary()
The issue I'm getting is
ValueError Traceback (most recent call last)
in ()
14 data_output = layers.Conv2D(3, 2, padding="same", activation = "sigmoid")(concat)
15
---> 16 ensemble_model= keras.Model(inputs=data_input, outputs=data_output, name="ensemble_model")
17
18 ensemble_model.summary()
4 frames
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py in _map_graph_network(inputs, outputs)
1035 if id(x) not in computable_tensors:
1036 raise ValueError(
-> 1037 f'Graph disconnected: cannot obtain value for tensor {x} '
1038 f'at layer "{layer.name}". The following previous layers '
1039 f'were accessed without issue: {layers_with_complete_input}')
ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 512, 512, 3), dtype=tf.float32, name='data'), name='data', description="created by layer 'data'") at layer "bn_data". The following previous layers were accessed without issue: []
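The get_layer(...).output tensors belong to each U-Net's own internal graph, not to data_input, so the new Model cannot trace data_output back to data_input. One common rewiring, shown here as a sketch reusing the names from the question (sm, BACKBONE1/BACKBONE2, n_classes and activation are assumed to be defined as above): wrap each U-Net up to the layer of interest in its own sub-model and call that sub-model on the shared input.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

data_input = keras.Input(shape=(512, 512, 3))

model_a = sm.Unet(BACKBONE1, input_shape=(512, 512, 3), encoder_weights='imagenet',
                  classes=n_classes, activation=activation)
model_b = sm.Unet(BACKBONE2, input_shape=(512, 512, 3), encoder_weights='imagenet',
                  classes=n_classes, activation=activation)

# Sub-models from each U-Net's own input to its decoder layer; calling them on
# data_input makes every downstream tensor traceable back to data_input.
feat_a = keras.Model(model_a.input, model_a.get_layer('decoder_stage4b_relu').output)(data_input)
feat_b = keras.Model(model_b.input, model_b.get_layer('decoder_stage4b_relu').output)(data_input)

concat = layers.concatenate([feat_a, feat_b], axis=3)
data_output = layers.Conv2D(3, 2, padding="same", activation="sigmoid")(concat)
ensemble_model = keras.Model(inputs=data_input, outputs=data_output, name="ensemble_model")
ensemble_model.summary()
```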

Semantic Segmentation runtime error at loss function

I am using a custom model for segmentation (SETRModel). The model output shape is (nBatch, 256, 256) and the code below confirms it (note that the channel dimension is squeezed out). The target shape is the same (it's a PILMask).
When I start training, I get a runtime error (see below) related to the loss function. What am I doing wrong?
```
size = 480
half = (256, 256)

splitter = FuncSplitter(lambda o: Path(o).parent.name == 'validation')
dblock = DataBlock(blocks=(ImageBlock, MaskBlock(codes)),
                   get_items=get_relevant_images,
                   splitter=splitter,
                   get_y=get_mask,
                   item_tfms=Resize((size, size)),
                   batch_tfms=[*aug_transforms(size=half), Normalize.from_stats(*imagenet_stats)])
dls = dblock.dataloaders(path/'images', bs=4)

model = SETRModel(patch_size=(32, 32),
                  in_channels=3,
                  out_channels=1,
                  hidden_size=1024,
                  num_hidden_layers=8,
                  num_attention_heads=16,
                  decode_features=[512, 256, 128, 64])

# Create a Learner using a custom model
loss = nn.BCEWithLogitsLoss()
learn = Learner(dls, model, loss_func=loss, lr=1.0e-4, cbs=callbacks, metrics=[Dice()])

# Let's test and make sure the loss function is happy with its inputs
learn.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t1 = torch.rand(4, 3, 256, 256).to(device)
print("input: " + str(t1.shape))
pred = learn.model(t1).to(device)
print("output: " + str(pred.shape))
# prints this:
# input: torch.Size([4, 3, 256, 256])
# output: torch.Size([4, 256, 256])

target = next(iter(learn.dls.train))[1]
target = target.type(torch.float32).to(device)
target.size(), pred.size()
# prints this:
# (torch.Size([4, 256, 256]), torch.Size([4, 256, 256]))

loss(pred, target)
# prints this:
# TensorMask(0.6844, device='cuda:0', grad_fn=<AliasBackward>)
# so, the loss function is happy with its inputs

learn.fine_tune(50)
# prints this:
# ---------------------------------------------------------------------------
# RuntimeError                              Traceback (most recent call last)
# <ipython-input-114-0e514c73651a> in <module>()
# ----> 1 learn.fine_tune(50)
# 19 frames
# /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
#    2827 pixel_shuffle = _add_docstr(torch.pixel_shuffle, r"""
#    2828 Rearranges elements in a tensor of shape :math:`(*, C \times r^2, H, W)` to a
# -> 2829 tensor of shape :math:`(*, C, H \times r, W \times r)`.
#    2830
#    2831 See :class:`~torch.nn.PixelShuffle` for details.
# RuntimeError: result type Float can't be cast to the desired output type Long
```
This is something that happens when you use plain PyTorch losses inside fastai (I believe this should be fixed).
Just create a custom loss_func. For example:
def loss_func(output, target): return CrossEntropyLossFlat()(output, target.long())
and pass it when creating the Learner:
learn = Learner(dls, model, loss_func=loss_func, lr=1.0e-4, cbs=callbacks, metrics=[Dice()])
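If you would rather keep the binary setup from the question instead of switching to CrossEntropyLossFlat, an equivalent sketch (assuming a single-channel mask as above) wraps BCEWithLogitsLoss and casts the target to float:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def bce_loss_func(output, target):
    # Cast the fastai TensorMask (long) target to float so BCEWithLogitsLoss
    # does not try to write a float result into a long tensor.
    return bce(output, target.float())

learn = Learner(dls, model, loss_func=bce_loss_func, lr=1.0e-4,
                cbs=callbacks, metrics=[Dice()])
```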

keras multi dimensions input to simpleRNN: dimension mismatch

Each input element has 3 rows of 199 columns, and each output element has 46 rows and 1 column:
Input.shape, output.shape
((204563, 3, 199), (204563, 46, 1))
With the following code, an error is thrown:
from keras.layers import Dense
from keras.models import Sequential
from keras.layers.recurrent import SimpleRNN
model = Sequential()
model.add(SimpleRNN(100, input_shape = (Input.shape[1], Input.shape[2])))
model.add(Dense(output.shape[1], activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.fit(Input, output, epochs = 20, batch_size = 200)
error thrown:
Epoch 1/20
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-134-378dd431cf45> in <module>()
3 model.add(Dense(y_target.shape[1], activation = 'softmax'))
4 model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
----> 5 model.fit(X_input, y_target, epochs = 20, batch_size = 200)
.
.
.
ValueError: Error when checking model target: expected dense_6 to have 2 dimensions, but got array with shape (204563, 46, 1)
Please explain the reason for the problem and a possible solution.
The problem is that SimpleRNN(100) returns a tensor of shape (204563, 100); hence the Dense(46) layer (since output.shape[1] = 46) returns a tensor of shape (204563, 46), but your y_target has shape (204563, 46, 1). You need to remove the last dimension, for example with y_target = np.squeeze(y_target), so that the dimensions are consistent.
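Applied to the arrays from the question, the reshape is a one-liner (a sketch; Input and output are the arrays shown above):

```python
import numpy as np

# Drop the trailing singleton dimension so the targets match the Dense(46) output.
y_target = np.squeeze(output, axis=-1)   # (204563, 46, 1) -> (204563, 46)

model.fit(Input, y_target, epochs=20, batch_size=200)
```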

Resources