I just got this error when running a feed-forward pass through a torch.nn.Conv2d layer; here is the stack trace:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-26-04bd4a00565d> in <module>
3
4 # call training function
----> 5 losses = train(D, G, n_epochs=n_epochs)
<ipython-input-24-b539315e0aa0> in train(D, G, n_epochs, print_every)
46 real_images = real_images.cuda()
47
---> 48 D_real = D(real_images)
49 d_real_loss = real_loss(D_real, True) # smoothing label 1 => 0.9
50
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)
<ipython-input-14-bf68e57c25ff> in forward(self, x)
48 """
49
---> 50 x = self.leaky_relu(self.conv1(x))
51 x = self.leaky_relu(self.conv2(x))
52 x = self.leaky_relu(self.conv3(x))
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
98 def forward(self, input):
99 for module in self:
--> 100 input = module(input)
101 return input
102
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
347
348 def forward(self, input):
--> 349 return self._conv_forward(input, self.weight)
350
351 class Conv3d(_ConvNd):
~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight)
344 _pair(0), self.dilation, self.groups)
345 return F.conv2d(input, weight, self.bias, self.stride,
--> 346 self.padding, self.dilation, self.groups)
347
348 def forward(self, input):
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
Running nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 770 On | 00000000:01:00.0 N/A | N/A |
| 38% 50C P8 N/A / N/A | 624MiB / 4034MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
I'm using Python 3.7 and PyTorch 1.5, the GPU is an Nvidia GeForce GTX 770, and I'm running Ubuntu 18.04.2. I haven't found that error message anywhere. Does it ring any bells?
Thanks a lot in advance.
According to this answer to a similar issue with TensorFlow, the error can occur because the VRAM limit was hit (which is rather non-intuitive given the error message).
In my case, with PyTorch model training, decreasing the batch size helped. You could try that, or reduce your model size so it consumes less VRAM.
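For illustration, a minimal sketch of the kind of change that helped; the dataset here is a dummy stand-in, only the smaller batch_size matters:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the real image dataset (for illustration only).
train_dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 2, (256,)))

# Reducing the batch size lowers peak VRAM usage, which is usually what
# resolves this error when it is really an out-of-memory in disguise.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=2)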
This error can be quite tricky: in some circumstances, running out of memory is also reported with this message.
I got this error while speed-testing inference on different EC2 machine types. When I dug through the logs, I found this:
(pid=20839) /home/ubuntu/src/skai-ml/venv/lib/python3.7/site-packages/torch/cuda/__init__.py:87: UserWarning:
(pid=20839) Found GPU0 GRID K520 which is of cuda capability 3.0.
(pid=20839) PyTorch no longer supports this GPU because it is too old.
(pid=20839) The minimum cuda capability that we support is 3.5.
Lesson learned: don't use g2.XX instance types for PyTorch models. g3.XX and p series worked fine.
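If you want to check this up front, a small sketch that prints the card's compute capability before training starts; nothing here is specific to a particular instance type:
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    # PyTorch builds of that era required compute capability >= 3.5,
    # so a GRID K520 (3.0) falls below the supported minimum.
else:
    print("No CUDA device visible to PyTorch")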
Check the number of classes you assign in the code.
This error appeared for me when I tried to run the code on CIFAR-100 instead of CIFAR-10 but forgot to change num_classes from 10 to 100.
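For illustration, a minimal sketch of the mismatch; the 512-feature input to the classifier head is a placeholder:
import torch.nn as nn

num_classes = 100  # was still 10 from the CIFAR-10 run

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512, num_classes),  # the output size must equal the number of classes
)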
The problem is that you are building the feed-forward network with torch.nn.Module but computing the output with the functional F.conv2d(). Change the forward pass to use nn.Conv2d layers instead.
This will probably help you more: https://pytorch.org/docs/stable/nn.html?highlight=conv2d#torch.nn.Conv2d
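For illustration, a minimal sketch of a forward pass built from nn.Conv2d layers declared in __init__; the channel sizes and LeakyReLU slope are placeholders, not taken from your model:
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Declare the layers once, as module attributes.
        self.conv1 = nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1)
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, x):
        # Call the layer objects here rather than F.conv2d(...).
        return self.leaky_relu(self.conv1(x))

out = Discriminator()(torch.randn(1, 3, 32, 32))  # quick sanity check on CPU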
This has happened to me a couple of times. It may sound basic, but shutting down other running kernels helped me a lot: once they were stopped, GPU memory was almost completely restored and the problem was gone.
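If you want to see whether memory pressure really is the culprit before restarting everything, a quick sketch; note that this only inspects and releases the current process's cache, other kernels still have to be shut down separately:
import torch

if torch.cuda.is_available():
    print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by this process")
    print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
    # Releases cached blocks back to the driver; it does not free memory
    # held by other notebooks or processes.
    torch.cuda.empty_cache()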
Related
I am working through a PyTorch Lightning tutorial.
https://pytorch-lightning.readthedocs.io/en/stable/starter/introduction.html
Because I wanted to try GPU training, I changed the definition of the trainer as below.
trainer = pl.Trainer(limit_train_batches=100, max_epochs=1, gpus=1)
Then I got the following error.
RuntimeError Traceback (most recent call last)
Cell In [3], line 4
1 # train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
2 # trainer = pl.Trainer(limit_train_batches=100, max_epochs=3)
3 trainer = pl.Trainer(limit_train_batches=100, max_epochs=3, accelerator='gpu', devices=1)
----> 4 trainer.fit(model=autoencoder, train_dataloaders=train_loader)
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:696, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
677 r"""
678 Runs the full optimization routine.
679
(...)
693 datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.LightningDataModule`.
694 """
695 self.strategy.model = model
--> 696 self._call_and_handle_interrupt(
697 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
698 )
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:650, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
648 return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
649 else:
--> 650 return trainer_fn(*args, **kwargs)
651 # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re-raise
652 except KeyboardInterrupt as exception:
[...]
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/pytorch_lightning/core/module.py:1450, in LightningModule.backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
1433 def backward(
1434 self, loss: Tensor, optimizer: Optional[Optimizer], optimizer_idx: Optional[int], *args, **kwargs
1435 ) -> None:
1436 """Called to perform backward on the loss returned in :meth:`training_step`. Override this hook with your
1437 own implementation if you need to.
1438
(...)
1448 loss.backward()
1449 """
-> 1450 loss.backward(*args, **kwargs)
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/torch/_tensor.py:396, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
387 if has_torch_function_unary(self):
388 return handle_torch_function(
389 Tensor.backward,
390 (self,),
(...)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
168 retain_graph = create_graph
170 # The reason we repeat same the comment below is that
171 # some Python versions print out the first line of a multi-line function
172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
175 allow_unreachable=True, accumulate_grad=True)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
The only thing I added to the tutorial code is gpus=1, so I cannot figure out what the problem is. How can I fix it?
FYI, I also tried passing devices=1, accelerator='ddp' instead of gpus=1, and got the following error.
ValueError: You selected an invalid accelerator name: `accelerator='ddp'`. Available names are: cpu, cuda, hpu, ipu, mps, tpu.
My environments are:
CUDA 11.6
Python 3.8.13
PyTorch 1.12.1
PyTorch Lightning 1.7.7
I think you made a mistake with the trainer's arguments:
accelerator should be one of cpu, cuda, hpu, ipu, mps, tpu;
devices is the number (or list) of devices, e.g. GPUs;
and "ddp" should be passed to strategy, not to accelerator.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0],
    strategy="ddp",
)
hope it helps!
Though I'm not sure about the reason, the issue disappeared when I used Python 3.10 instead of 3.8.
I'm training a CNN on an Ubuntu server with Keras and TensorFlow 2. If I run my code without the memory-growth snippet added below, it throws the error posted further down. I checked my GPU memory (nvidia-smi output below), and it looks like the graphics card was running out of memory. I've posted my original code and the error below, and I took the memory-growth snippet from the SO post referenced below. That post says it allows the GPU memory to grow; what does that mean exactly? Also, after running the additional code I still see a GPU device enabled, so does that mean my code is still being run on the GPU?
post referenced:
AttributeError: module 'tensorflow' has no attribute 'ConfigProto'
added code:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
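For reference, the check I mean is just listing the visible devices, e.g.:
import tensorflow as tf

print(tf.config.experimental.list_physical_devices('GPU'))
print(tf.config.list_logical_devices('GPU'))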
code:
nvidia-smi
output:
Tue Mar 8 14:52:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:42:00.0 Off | N/A |
| 0% 37C P8 10W / 260W | 7970MiB / 7979MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2607 C ...a3/envs/tf-gpu/bin/python 7967MiB |
+-----------------------------------------------------------------------------+
original code:
code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Flatten
model = Sequential()
# CONVOLUTIONAL LAYER
model.add(Conv2D(filters=32, kernel_size=(4,4),input_shape=(28, 28, 1), activation='relu',))
# POOLING LAYER
model.add(MaxPool2D(pool_size=(2, 2)))
# FLATTEN IMAGES FROM 28 by 28 to 764 BEFORE FINAL LAYER
model.add(Flatten())
# 128 NEURONS IN DENSE HIDDEN LAYER (YOU CAN CHANGE THIS NUMBER OF NEURONS)
model.add(Dense(128, activation='relu'))
# LAST LAYER IS THE CLASSIFIER, THUS 10 POSSIBLE CLASSES
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()
output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 25, 25, 32) 544
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 4608) 0
_________________________________________________________________
dense (Dense) (None, 128) 589952
_________________________________________________________________
dense_1 (Dense) (None, 10) 1290
=================================================================
Total params: 591,786
Trainable params: 591,786
Non-trainable params: 0
code:
model.fit(x_train,y_cat_train,epochs=10)
error:
---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
<ipython-input-19-bed1e94810c6> in <module>
----> 1 model.fit(x_train,y_cat_train,epochs=10)
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
64 def _method_wrapper(self, *args, **kwargs):
65 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
---> 66 return method(self, *args, **kwargs)
67
68 # Running inside `run_distribute_coordinator` already.
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
846 batch_size=batch_size):
847 callbacks.on_train_batch_begin(step)
--> 848 tmp_logs = train_function(iterator)
849 # Catch OutOfRangeError for Datasets of unknown size.
850 # This blocks until the batch has finished executing.
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
578 xla_context.Exit()
579 else:
--> 580 result = self._call(*args, **kwds)
581
582 if tracing_count == self._get_tracing_count():
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
642 # Lifting succeeded, so variables are initialized and we can run the
643 # stateless function.
--> 644 return self._stateless_fn(*args, **kwds)
645 else:
646 canon_args, canon_kwds = \
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
2418 with self._lock:
2419 graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2420 return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
2421
2422 #property
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs)
1659 `args` and `kwargs`.
1660 """
-> 1661 return self._call_flat(
1662 (t for t in nest.flatten((args, kwargs), expand_composites=True)
1663 if isinstance(t, (ops.Tensor,
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1743 and executing_eagerly):
1744 # No tape is watching; skip to running the function.
-> 1745 return self._build_call_outputs(self._inference_function.call(
1746 ctx, args, cancellation_manager=cancellation_manager))
1747 forward_backward = self._select_forward_and_backward_functions(
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
591 with _InterpolateFunctionError(self):
592 if cancellation_manager is None:
--> 593 outputs = execute.execute(
594 str(self.signature.name),
595 num_outputs=self._num_outputs,
~/anaconda3/envs/tf-gpu/lib/python3.8/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
57 try:
58 ctx.ensure_initialized()
---> 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node sequential/conv2d/Conv2D (defined at <ipython-input-19-bed1e94810c6>:1) ]] [Op:__inference_train_function_753]
Function call stack:
train_function
I am trying to use WandB gradient visualization to debug the gradient flow in my neural net on Google Colab. Without WandB logging, training runs without error, using about 11 GB of the 16 GB on the P100 GPU. However, adding the line wandb.watch(model, log='all', log_freq=3) causes a CUDA out-of-memory error.
How does WandB logging create extra GPU memory overhead?
Is there some way to reduce the overhead?
--adding training loop code--
learning_rate = 0.001
num_epochs = 50

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = MyModel()
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

wandb.watch(model, log='all', log_freq=3)

for epoch in range(num_epochs):
    train_running_loss = 0.0
    train_accuracy = 0.0

    model = model.train()

    ## training step
    for i, (name, output_array, input) in enumerate(trainloader):
        output_array = output_array.to(device)
        input = input.to(device)
        comb = torch.zeros(1, 1, 100, 1632).to(device)

        ## forward + backprop + loss
        output = model(input, comb)
        loss = my_loss(output, output_array)

        optimizer.zero_grad()
        loss.backward()

        ## update model params
        optimizer.step()

        train_running_loss += loss.detach().item()

        temp = get_accuracy(output, output_array)
        print('check 13')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'
        train_accuracy += temp
-----edit-----
I think WandB is creating an extra copy of the gradient during logging preprocessing. Here is the traceback:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-13de83557b55> in <module>()
60 get_ipython().system("nvidia-smi | grep MiB | awk '{print $9 $10 $11}'")
61
---> 62 loss.backward()
63
64 print('check 10')
4 frames
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
253 create_graph=create_graph,
254 inputs=inputs)
--> 255 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
256
257 def register_hook(self, hook):
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
147 Variable._execution_engine.run_backward(
148 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149 allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
150
151
/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in <lambda>(grad)
283 self.log_tensor_stats(grad.data, name)
284
--> 285 handle = var.register_hook(lambda grad: _callback(grad, log_track))
286 self._hook_handles[name] = handle
287 return handle
/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in _callback(grad, log_track)
281 if not log_track_update(log_track):
282 return
--> 283 self.log_tensor_stats(grad.data, name)
284
285 handle = var.register_hook(lambda grad: _callback(grad, log_track))
/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in log_tensor_stats(self, tensor, name)
219 # Remove nans from tensor. There's no good way to represent that in histograms.
220 flat = flat[~torch.isnan(flat)]
--> 221 flat = flat[~torch.isinf(flat)]
222 if flat.shape == torch.Size([0]):
223 # Often the whole tensor is nan or inf. Just don't log it in that case.
RuntimeError: CUDA out of memory. Tried to allocate 4.65 GiB (GPU 0; 15.90 GiB total capacity; 10.10 GiB already allocated; 717.75 MiB free; 14.27 GiB reserved in total by PyTorch)
---update----
Indeed, commenting out the offending line flat = flat[~torch.isinf(flat)]
gets the logging step to just barely fit into the GPU memory.
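One mitigation I may try is logging less aggressively, so the temporary tensors created inside the logging hook are smaller and rarer; log and log_freq are existing wandb.watch parameters, and I haven't measured how much this actually saves:
import wandb

# 'model' is the MyModel instance from the training loop above.
# Log only gradients (not parameters as well), and far less often.
wandb.watch(model, log='gradients', log_freq=100)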
I ran this block of code using TF 2.2.0, Keras and some TPU config:
try:
    TPU_WORKER = os.environ["TPU_NAME"]
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f"Running on TPU: {tpu.cluster_spec().as_dict()['worker']}")
    print(f"TPU_WORKER: {TPU_WORKER}")
except ValueError:
    tpu = None
    gpus = tf.config.experimental.list_logical_devices("GPU")

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
elif len(gpus) > 1:  # multiple GPUs on the VM
    strategy = tf.distribute.MirroredStrategy(gpus)
else:
    strategy = tf.distribute.get_strategy()
and got this error message:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-27-a49335a43189> in <module>
15
16 if tpu:
---> 17 tf.config.experimental_connect_to_cluster(tpu)
18 tf.tpu.experimental.initialize_tpu_system(tpu)
19 strategy = tf.distribute.experimental.TPUStrategy(tpu)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/remote.py in connect_to_cluster(cluster_spec_or_resolver, job_name, task_index, protocol, make_master_device_default, cluster_device_filters)
181 context.set_server_def(server_def)
182 else:
--> 183 context.update_server_def(server_def)
184
185 if make_master_device_default and isinstance(
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in update_server_def(server_def)
2137
2138 def update_server_def(server_def):
-> 2139 context().update_server_def(server_def)
2140
2141
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in update_server_def(self, server_def, keep_alive_secs)
596 # Current executor might have pending nodes that involves updated remote
597 # devices. Wait for them to finish before updating.
--> 598 self.executor.wait()
599 self.executor.clear_error()
600 pywrap_tfe.TFE_ContextUpdateServerDef(self._context_handle,
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/executor.py in wait(self)
65 def wait(self):
66 """Waits for ops dispatched in this executor to finish."""
---> 67 pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
68
69 def clear_error(self):
InvalidArgumentError: {{function_node __inference_train_function_75067}} Compilation failure: XLA can't deduce compile time constant output shape for strided slice: [4,?], output shape must be a compile-time constant
[[{{node model/tf_op_layer_strided_slice/strided_slice}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_6359544293025479410/_3]]
This error:
InvalidArgumentError: {{function_node __inference_train_function_75067}} Compilation failure: XLA can't deduce compile time constant output shape for strided slice: [4,?], output shape must be a compile-time constant
[[{{node model/tf_op_layer_strided_slice/strided_slice}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_6359544293025479410/_3]]
occurred during a previous run as well, and since then I can't re-run my code.
The workaround is to restart the notebook and run everything again.
But then I get the same error elsewhere:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
64 def _method_wrapper(self, *args, **kwargs):
65 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
---> 66 return method(self, *args, **kwargs)
67
68 # Running inside `run_distribute_coordinator` already.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
853 context.async_wait()
854 logs = tmp_logs # No error, now safe to assign to logs.
--> 855 callbacks.on_train_batch_end(step, logs)
856 epoch_logs = copy.copy(logs)
857
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py in on_train_batch_end(self, batch, logs)
387 """
388 if self._should_call_train_batch_hooks:
--> 389 logs = self._process_logs(logs)
390 self._call_batch_hook(ModeKeys.TRAIN, 'end', batch, logs=logs)
391
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py in _process_logs(self, logs)
263 """Turns tensors into numpy arrays or Python scalars."""
264 if logs:
--> 265 return tf_utils.to_numpy_or_python_type(logs)
266 return {}
267
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in to_numpy_or_python_type(tensors)
521 return t # Don't turn ragged or sparse tensors to NumPy.
522
--> 523 return nest.map_structure(_to_single_numpy_or_python_type, tensors)
524
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
615
616 return pack_sequence_as(
--> 617 structure[0], [func(*x) for x in entries],
618 expand_composites=expand_composites)
619
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
615
616 return pack_sequence_as(
--> 617 structure[0], [func(*x) for x in entries],
618 expand_composites=expand_composites)
619
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in _to_single_numpy_or_python_type(t)
517 def _to_single_numpy_or_python_type(t):
518 if isinstance(t, ops.Tensor):
--> 519 x = t.numpy()
520 return x.item() if np.ndim(x) == 0 else x
521 return t # Don't turn ragged or sparse tensors to NumPy.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in numpy(self)
959 """
960 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
--> 961 maybe_arr = self._numpy() # pylint: disable=protected-access
962 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
963
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _numpy(self)
927 return self._numpy_internal()
928 except core._NotOkStatusException as e:
--> 929 six.raise_from(core._status_to_exception(e.code, e.message), None)
930
931 #property
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: {{function_node __inference_train_function_78422}} Compilation failure: XLA can't deduce compile time constant output shape for strided slice: [16,?], output shape must be a compile-time constant
[[{{node model/tf_op_layer_strided_slice/strided_slice}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_626429452001451780/_8]]
when trying to train/fit the Keras model, although from the call stack above it's not clear at which point the error occurred.
One more question: how do we clear the cache or buffer that is holding this error, so that we can reset the TPU and run the code again after making changes, without having to restart sessions or kernels?
When I run the same TPU initialization code in Colab (runtime set to TPU):
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print(f"Running on TPU: {tpu.cluster_spec().as_dict()['worker']}")
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
It runs without errors, reinitializes the TPU, and clears out the eager caches; here are the logs:
Running on TPU: ['10.18.71.154:8470']
WARNING:tensorflow:TPU system grpc://10.18.71.154:8470 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
WARNING:tensorflow:TPU system grpc://10.18.71.154:8470 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
INFO:tensorflow:Initializing the TPU system: grpc://10.18.71.154:8470
INFO:tensorflow:Initializing the TPU system: grpc://10.18.71.154:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Clearing out eager caches
....
For your second issue, "[4,?], output shape must be a compile-time constant", please give your model static input and output shapes when building it.
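For example, a minimal sketch of building a model with a fully static input shape and a fixed batch size (the sizes and layers here are placeholders, not taken from your model), combined with drop_remainder=True when batching the dataset so every batch really has that size:
import tensorflow as tf

BATCH_SIZE = 16  # placeholder; the per-replica batch size the TPU will see

inputs = tf.keras.Input(shape=(128,), batch_size=BATCH_SIZE, dtype=tf.int32)
x = tf.keras.layers.Embedding(30000, 64)(inputs)
x = x[:, 0, :]                       # the strided slice now has a static shape
outputs = tf.keras.layers.Dense(2)(x)
model = tf.keras.Model(inputs, outputs)

# dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # keep batches static too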
I have an issue with memory usage on a TPU v2.
I would like to experiment with a large model, but unfortunately it does not fit in memory. I would like to use bfloat16 to save memory, but I run into a problem when I build the model:
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', resolver.master())
except ValueError:
    resolver = None

if resolver:
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

policy = tf.keras.mixed_precision.experimental.Policy('mixed_bfloat16')
tf.keras.mixed_precision.experimental.set_policy(policy)

with strategy.scope():
    model = CustomModel(TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-large"), num_classes=5)
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
    optimizer = tf.mixed_precision.LossScaleOptimizer(optimizer, loss_scale='dynamic')
    model.compile(optimizer=optimizer, loss=['mse'])
InvalidArgumentError                      Traceback (most recent call last)
in <module>()
      3 with strategy.scope():
      4 
----> 5     model = CustomModel(TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-large"), num_classes=5)
      6     optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
      7     optimizer = tf.mixed_precision.LossScaleOptimizer(optimizer, loss_scale='dynamic')
13 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_tf_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
399 return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file, allow_missing_keys=True)
400
--> 401 model(model.dummy_inputs, training=False) # build the network with dummy inputs
402
403 assert os.path.isfile(resolved_archive_file), "Error retrieving file {}".format(resolved_archive_file)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
966 with base_layer_utils.autocast_context_manager(
967 self._compute_dtype):
--> 968 outputs = self.call(cast_inputs, *args, **kwargs)
969 self._handle_activity_regularization(inputs, outputs)
970 self._set_mask_metadata(inputs, outputs, input_masks)
/usr/local/lib/python3.6/dist-packages/transformers/modeling_tf_roberta.py in call(self, inputs, **kwargs)
222
223 """
--> 224 outputs = self.roberta(inputs, **kwargs)
225 return outputs
226
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
966 with base_layer_utils.autocast_context_manager(
967 self._compute_dtype):
--> 968 outputs = self.call(cast_inputs, *args, **kwargs)
969 self._handle_activity_regularization(inputs, outputs)
970 self._set_mask_metadata(inputs, outputs, input_masks)
/usr/local/lib/python3.6/dist-packages/transformers/modeling_tf_bert.py in call(self, inputs, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, training)
567 # head_mask = tf.constant([0] * self.num_hidden_layers)
568
--> 569 embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)
570 encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)
571
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
966 with base_layer_utils.autocast_context_manager(
967 self._compute_dtype):
--> 968 outputs = self.call(cast_inputs, *args, **kwargs)
969 self._handle_activity_regularization(inputs, outputs)
970 self._set_mask_metadata(inputs, outputs, input_masks)
/usr/local/lib/python3.6/dist-packages/transformers/modeling_tf_bert.py in call(self, inputs, mode, training)
146 """
147 if mode == "embedding":
--> 148 return self._embedding(inputs, training=training)
149 elif mode == "linear":
150 return self._linear(inputs)
/usr/local/lib/python3.6/dist-packages/transformers/modeling_tf_roberta.py in _embedding(self, inputs, training)
79 position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
80
---> 81 return super()._embedding([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)
82
83
/usr/local/lib/python3.6/dist-packages/transformers/modeling_tf_bert.py in _embedding(self, inputs, training)
173
174 embeddings = inputs_embeds + position_embeddings + token_type_embeddings
--> 175 embeddings = self.LayerNorm(embeddings)
176 embeddings = self.dropout(embeddings, training=training)
177 return embeddings
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
962 # Eager execution on data tensors.
963 with backend.name_scope(self._name_scope()):
--> 964 self._maybe_build(inputs)
965 cast_inputs = self._maybe_cast_inputs(inputs)
966 with base_layer_utils.autocast_context_manager(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in _maybe_build(self, inputs)
   2406         self._dtype_policy = policy.Policy(dtype)
   2407       input_shapes = None
->  2408       if all(hasattr(x, 'shape') for x in input_list):
   2409         input_shapes = nest.map_structure(lambda x: x.shape, inputs)
   2410         # Only call build if the user has manually overridden the build method.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in <genexpr>(.0)
   2406         self._dtype_policy = policy.Policy(dtype)
   2407       input_shapes = None
->  2408       if all(hasattr(x, 'shape') for x in input_list):
   2409         input_shapes = nest.map_structure(lambda x: x.shape, inputs)
   2410         # Only call build if the user has manually overridden the build method.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in shape(self)
   1065         self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
   1066       except core._NotOkStatusException as e:
->  1067         six.raise_from(core._status_to_exception(e.code, e.message), None)
   1068
   1069     return self._tensor_shape
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a bfloat16 tensor but is a float tensor
I suppose I have to cast something in the model? How can I do that?
I am using TensorFlow 2.1 and a TPU v2.
I have seen this thread, but I suppose it was for TensorFlow 1.x, as the code does not work for me:
Memory reduction Tensorflow TPU v2/v3 bfloat16
I think the problem is that you are trying to load a pretrained model that was trained in full float32 into a bfloat16 model. I don't think that will work; you would have to train from scratch.
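If you do go the train-from-scratch route, a rough sketch of what I mean, using the transformers config API (I have not trained this to convergence myself):
import tensorflow as tf
from transformers import TFXLMRobertaModel, XLMRobertaConfig

# Set the bfloat16 policy before any layers are created (TF 2.1-era experimental API,
# as in your snippet).
policy = tf.keras.mixed_precision.experimental.Policy('mixed_bfloat16')
tf.keras.mixed_precision.experimental.set_policy(policy)

# Build the architecture from its config only (randomly initialised weights),
# so every layer is created under the bfloat16 policy instead of loading float32 weights.
config = XLMRobertaConfig.from_pretrained("jplu/tf-xlm-roberta-large")
model = TFXLMRobertaModel(config)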