RuntimeError: CUDNN_STATUS_INTERNAL_ERROR - pytorch

On Ubuntu 14.04, I am using PyTorch with cuDNN, and this error occurred:
Traceback (most recent call last):
File "main.py", line 58, in <module>
test_detect(test_loader, nod_net, get_pbb, bbox_result_path,config1,n_gpu=config_submit['n_gpu'])
File "/home/ubuntu/nndl/DSB2017/test_detect.py", line 52, in test_detect
output = net(input,inputcoord)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/nndl/DSB2017/net_detector.py", line 102, in forward
out = self.preBlock(x)#16
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 351, in forward
self.padding, self.dilation, self.groups)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 119, in conv3d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
I have been googling this for several hours and am really confused. What causes this?

I just encountered this problem on Ubuntu 16.04 and solved it. My solution was to run
sudo rm -rf ~/.nv
and then reboot.
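
For reference, the same cleanup can be scripted. The sketch below is only a rough equivalent of the command above, assuming the cache lives in ~/.nv as in this answer; the function name clear_nv_cache is hypothetical, and a reboot is still needed afterwards. If the directory is owned by root, the script needs matching permissions.

```python
import shutil
from pathlib import Path


def clear_nv_cache(cache_dir=None):
    """Delete the per-user NVIDIA cache directory; return True if it was removed.

    Assumption: the cache lives in ~/.nv, as in the answer above.
    """
    path = Path(cache_dir) if cache_dir is not None else Path.home() / ".nv"
    if not path.is_dir():
        return False  # nothing to clean up
    shutil.rmtree(path)
    return True
```

Calling clear_nv_cache() with no argument targets ~/.nv; passing a path makes it easy to try the function on a throwaway directory first.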

Related

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED while using GPU with pytorch

I am running this code (github_code) on a computer with an RTX 3090 Ti. However, the code raises an error at the first forward layer, although it runs successfully on the CPU.
The stack trace:
Traceback (most recent call last):
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 322, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 136, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/tekre/Desktop/video_captioning_studies/HMN/main.py", line 37, in <module>
model = train_fn(cfgs, cfgs.model_name, model, hungary_matcher, train_loader, valid_loader, device)
File "/home/tekre/Desktop/video_captioning_studies/HMN/train.py", line 66, in train_fn
preds, objects_pending, action_pending, video_pending = model(objects, object_masks, feature2ds, feature3ds, numberic_caps)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 95, in forward
objects_feats, action_feats, video_feats, objects_semantics, action_semantics, video_semantics = self.forward_encoder(objects_feats, objects_mask, feature2ds, feature3ds)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 57, in forward_encoder
objects_feats, objects_semantics = self.entity_level(feature2ds, feature3ds, objects, objects_mask)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/encoders/entity_level.py", line 53, in forward
features_2d = self.feature2d_proj(features_2d.view(-1, features_2d.shape[-1]))
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
exponential_average_factor, self.eps)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I installed my environment as the GitHub repo instructed. Do I need to install the cuDNN package separately, or does PyTorch handle it within the environment? I am asking here because there has not been much response there.
In my case, this error was caused by a mismatch between the CUDA version I was using (11.7) and the CUDA version my PyTorch build was installed to work with.
After installing the correct version from the PyTorch installation page, I was able to run the code.
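
One way to look for such a mismatch is to print the versions PyTorch was built against and compare them with what nvidia-smi reports. This is a minimal sketch, not a definitive diagnostic: the helper name cuda_build_report is hypothetical, and it returns a stub if PyTorch is not installed in the current environment.

```python
def cuda_build_report():
    """Collect the versions this PyTorch build was compiled with (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # PyTorch is not installed in this environment

    report = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,    # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_build_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda is older than what the GPU requires (RTX 30-series cards generally need a build for CUDA 11 or newer), reinstalling PyTorch with a matching CUDA build is the likely fix.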

PyG: RuntimeError: Tensors must have same number of dimensions: got 2 and 3

I am using TransformerConv and encountered this error:
Traceback (most recent call last):
File "pipeline_model_gat.py", line 1018, in <module>
output = model(
File "/mount/arbeitsdaten61/studenten3/advanced-ml/2022/gogirlspower/nicole/conda/envs/new_gvqa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "pipeline_model_gat.py", line 881, in forward
questions_encoded = self.question_encoder(question_graphs)
File "/mount/arbeitsdaten61/studenten3/advanced-ml/2022/gogirlspower/nicole/conda/envs/new_gvqa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "pipeline_model_gat.py", line 628, in forward
= self.conv1(x, question_graphs.edge_index, edge_attr)
File "/mount/arbeitsdaten61/studenten3/advanced-ml/2022/gogirlspower/nicole/conda/envs/new_gvqa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mount/arbeitsdaten61/studenten3/advanced-ml/2022/gogirlspower/nicole/conda/envs/new_gvqa/lib/python3.8/site-packages/torch_geometric/nn/conv/transformer_conv.py", line 190, in forward
beta = self.lin_beta(torch.cat([out, x_r, out - x_r], dim=-1))
RuntimeError: Tensors must have same number of dimensions: got 2 and 3
Can someone please tell me what could have gone wrong?

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED in pytorch

I am running a CNN using PyTorch on my new machine with 3 NVIDIA GPUs and getting the error below:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
File "code.py", line 342, in <module>
trainer.fit(model)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 514, in fit
self.dispatch()
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 554, in dispatch
self.accelerator.start_training(self)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
self._results = trainer.run_train()
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 615, in run_train
self.run_sanity_check(self.lightning_module)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in run_sanity_check
_, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 733, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 164, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 178, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 290, in validation_step
return self.model(*args, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 63, in forward
output = self.module.validation_step(*inputs, **kwargs)
File "code.py", line 314, in validation_step
pred = self.forward(x)
File "code.py", line 259, in forward
x = self.conv0(x) #([12, 600, 600])
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
NVIDIA-SMI:
The code runs without any issue on another machine with driver version 450.51.06 and CUDA version 11. You can see the nvidia-smi output of the new machine above. I checked the comments on other questions about this issue, and none of them resolved my problem.

Unable to run super-AND repository in JupyterLab

I am trying to run this repository in JupyterLab. I get the error below when running the command python3 main.py --dataset cifar10 --network resnet18. For the installation setup, I just followed the steps mentioned in the GitHub link.
Log:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100.0%
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Traceback (most recent call last):
File "main.py", line 237, in <module>
main()
File "main.py", line 215, in main
train(r, epoch, net, trainloader, optimizer, npc, criterion, criterion2, ANs_discovery, args.device)
File "main.py", line 124, in train
features = net(inputs) # (256, 128)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/jovyan/super-AND/models/resnet_cifar.py", line 111, in forward
out = self.layer2(out)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/jovyan/super-AND/models/resnet_cifar.py", line 26, in forward
out = F.relu(self.bn1(self.conv1(x)))
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/functional.py", line 914, in relu
result = torch.relu(input)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()

Pytorch to ONNX export function fails and causes legacy function error

I am trying to convert the PyTorch model in this link to an ONNX model using the code below:
import torch as t  # the snippet uses `t` as the alias for torch

# FasterRCNNVGG16 and FasterRCNNTrainer come from the linked repository
device = t.device('cuda:0' if t.cuda.is_available() else 'cpu')
print(device)
faster_rcnn = FasterRCNNVGG16()
trainer = FasterRCNNTrainer(faster_rcnn).cuda()
#trainer = FasterRCNNTrainer(faster_rcnn).to(device)
trainer.load('./checkpoints/model.pth')
dummy_input = t.randn(1, 3, 300, 300, device='cuda')
#dummy_input = dummy_input.to(device)
t.onnx.export(faster_rcnn, dummy_input, "model.onnx", verbose=True)
But I get the following error (sorry for the block quote below; Stack Overflow wouldn't let the whole trace be in code format and wouldn't let the question be posted otherwise):
Traceback (most recent call last):
small_object_detection_master_samirsen\onnxtest.py", line 44, in <module>
t.onnx.export(faster_rcnn, dummy_input, "fasterrcnn_10120119_06025842847785781.onnx", verbose = True)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\__init__.py", line 132, in export
strip_doc_string, dynamic_axes)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py", line 64, in export
example_outputs=example_outputs, strip_doc_string=strip_doc_string, dynamic_axes=dynamic_axes)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py", line 329, in _export
_retain_param_name, do_constant_folding)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py", line 213, in _model_to_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args, training)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py", line 171, in _trace_and_get_graph_from_model
trace, torch_out = torch.jit.get_trace_graph(model, args, _force_outplace=True)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\jit\__init__.py", line 256, in get_trace_graph
return LegacyTracedModule(f, _force_outplace, return_inputs)(*args, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\jit\__init__.py", line 323, in forward
out = self.inner(*trace_inputs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 545, in __call__
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\faster_rcnn.py", line 133, in forward
h, rois, roi_indices)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 545, in __call__
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\faster_rcnn_vgg16.py", line 142, in forward
pool = self.roi(x, indices_and_rois)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 545, in __call__
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\roi_module.py", line 85, in forward
return self.RoI(x, rois)
RuntimeError: Attempted to trace RoI, but tracing of legacy functions is not supported
This is because the ONNX export does not support tracing this kind of torch.autograd.Function. The issue lies in the RoI class (see the linked source).
To overcome the issue, you have to implement the forward and backward functions as separate function definitions rather than as members of the RoI class.
The call to RoI in FasterRCNNVGG16 then has to be altered to explicitly call the forward and backward functions.
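
As an illustration, one common reading of this advice in current PyTorch is the static-method style of torch.autograd.Function, where forward and backward are static methods and the function is invoked through .apply instead of instantiating the class. The sketch below uses ScaleBy2, a hypothetical stand-in and not the repository's RoI class. Exporting a custom function to ONNX may additionally require a symbolic definition, which this sketch does not cover.

```python
import torch


class ScaleBy2(torch.autograd.Function):
    """Hypothetical stand-in for a custom op; not the repository's RoI class."""

    @staticmethod
    def forward(ctx, x):
        # Static-method functions keep no state on self; use ctx to save tensors if needed.
        return x * 2

    @staticmethod
    def backward(ctx, grad_output):
        # d(2x)/dx = 2, so the incoming gradient is scaled by 2.
        return grad_output * 2


x = torch.ones(3, requires_grad=True)
y = ScaleBy2.apply(x)  # call via .apply rather than ScaleBy2()(x)
y.sum().backward()
```

In the model, the equivalent change would be replacing the instance call self.RoI(x, rois) with a call through .apply on the restructured class.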
