Unable to run super-AND repository in JupyterLab - pytorch

I am trying to run this repository in JupyterLab. I am getting below error, while running this command python3 main.py --dataset cifar10 --network resnet18. For installation setup i just followed the steps mentioned in the github link.
Log:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100.0%Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Traceback (most recent call last):
File "main.py", line 237, in <module>
main()
File "main.py", line 215, in main
train(r, epoch, net, trainloader, optimizer, npc, criterion, criterion2, ANs_discovery, args.device)
File "main.py", line 124, in train
features = net(inputs) # (256, 128)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/jovyan/super-AND/models/resnet_cifar.py", line 111, in forward
out = self.layer2(out)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/jovyan/super-AND/models/resnet_cifar.py", line 26, in forward
out = F.relu(self.bn1(self.conv1(x)))
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/nn/functional.py", line 914, in relu
result = torch.relu(input)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()

Related

Error about torch tensor precision on gpu

I tried to finetune a Bert model on GPU using PyTorch-Lightning's class Trainer using the following code:
from pytorch_lightning import Trainer
from models import LitAdModel, AdModel
from dataloaders import train_dataloader, test_dataloader
model = AdModel()
litmodel = LitAdModel(model=model)
trainer = Trainer(accelerator='gpu', devices=1)
trainer.fit(model=litmodel, train_dataloaders=train_dataloader,
val_dataloaders=test_dataloader)
in which train_dataloader, test_dataloader and AdModel and LitAdModel classes are defined elsewhere. When I do this without using the GPU, it works ( slowly), but with GPU it gives the following error:
File "/Users/sanjinjuricfot/developer/copy_models/test_pl.py", line
24, in
main() File "/Users/sanjinjuricfot/developer/copy_models/test_pl.py", line 18, in
main
littrain(train=train, test=test) File "/Users/sanjinjuricfot/developer/copy_models/src/_torch/littrain.py",
line 39, in littrain
trainer.fit(model=litmodel, train_dataloaders=train_dataloader, val_dataloaders=test_dataloader) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 582, in fit
call._call_and_handle_interrupt( File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py",
line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 624, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1061, in _run
results = self._run_stage() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1140, in _run_stage
self._run_train() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1153, in _run_train
self._run_sanity_check() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1225, in _run_sanity_check
val_loop.run() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py",
line 199, in run
self.advance(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py",
line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py",
line 199, in run
self.advance(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py",
line 121, in advance
batch = next(data_fetcher) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py",
line 184, in next
return self.fetching_function() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py",
line 275, in fetching_function
return self.move_to_device(batch) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py",
line 294, in move_to_device
batch = self.batch_to_device(batch) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py",
line 142, in batch_to_device
batch = self.trainer._call_strategy_hook("batch_to_device", batch, dataloader_idx=dataloader_idx) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1443, in _call_strategy_hook
output = fn(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py",
line 273, in batch_to_device
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py",
line 295, in _apply_batch_transfer_handler
batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py",
line 283, in _call_batch_hook
return trainer_method(hook_name, *args) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1305, in _call_lightning_module_hook
output = fn(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/core/hooks.py",
line 632, in transfer_batch_to_device
return move_data_to_device(batch, device) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_lite/utilities/apply_func.py",
line 101, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py",
line 55, in apply_to_collection
v = apply_to_collection( File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py",
line 47, in apply_to_collection
return function(data, *args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_lite/utilities/apply_func.py",
line 95, in batch_to
data_output = data.to(device, **kwargs) TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support
float64. Please use float32 instead.
I tried using this command
torch.set_default_dtype(torch.float32)
in all the relevant files and adding
.to(torch.float32)
extension to all the tensors, but it didn't work.
I am using MacBook Pro with M2 processor. Thanks in advance for any help!

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED while using GPU with pytorch

I am running this code in a computer with rtx 3090ti github_code. However, the code raises an error with first forward layer. Although, the code succesfully runs on cpu.
The stack trace:
Traceback (most recent call last):
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 322, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 136, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/tekre/Desktop/video_captioning_studies/HMN/main.py", line 37, in <module>
model = train_fn(cfgs, cfgs.model_name, model, hungary_matcher, train_loader, valid_loader, device)
File "/home/tekre/Desktop/video_captioning_studies/HMN/train.py", line 66, in train_fn
preds, objects_pending, action_pending, video_pending = model(objects, object_masks, feature2ds, feature3ds, numberic_caps)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 95, in forward
objects_feats, action_feats, video_feats, objects_semantics, action_semantics, video_semantics = self.forward_encoder(objects_feats, objects_mask, feature2ds, feature3ds)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 57, in forward_encoder
objects_feats, objects_semantics = self.entity_level(feature2ds, feature3ds, objects, objects_mask)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/encoders/entity_level.py", line 53, in forward
features_2d = self.feature2d_proj(features_2d.view(-1, features_2d.shape[-1]))
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
exponential_average_factor, self.eps)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I installed my environment as the github repo instructed. Do i need to additionally install cudnn package because pytorch handles it in environment. I am putting this question here because there is not much response there.
In my case this error was caused by a mismatch between the version of Cuda I was using (11.7) and the version of Cuda pytorch was installed to work with.
On installing the correct version from the Pytorch Installation Page, I was able to run the code.

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED in pytorch

I am running CNN algorithm using PyTorch on my new machine with 3 Nvidia GPUs and getting the error below:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
File "code.py", line 342, in <module>
trainer.fit(model)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 514, in fit
self.dispatch()
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 554, in dispatch
self.accelerator.start_training(self)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
self._results = trainer.run_train()
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 615, in run_train
self.run_sanity_check(self.lightning_module)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in run_sanity_check
_, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 733, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 164, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 178, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 290, in validation_step
return self.model(*args, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 63, in forward
output = self.module.validation_step(*inputs, **kwargs)
File code.py", line 314, in validation_step
pred = self.forward(x)
File code.py", line 259, in forward
x = self.conv0(x) #([12, 600, 600])
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
NVIDIA-MSI:
The code is running without any issue on another machine with driver version 450.51.06 and Cuda version 11. You can see nvidia-smi of new machine above. I checked different comments on other questions same to this issue and non of them resolved my issue.

Pytorch to ONNX export function fails and causes legacy function error

I am trying to convert the pytorch model in this link to onnx model using the code below :
device=t.device('cuda:0' if t.cuda.is_available() else 'cpu')
print(device)
faster_rcnn = FasterRCNNVGG16()
trainer = FasterRCNNTrainer(faster_rcnn).cuda()
#trainer = FasterRCNNTrainer(faster_rcnn).to(device)
trainer.load('./checkpoints/model.pth')
dummy_input = t.randn(1, 3, 300, 300, device = 'cuda')
#dummy_input = dummy_input.to(device)
t.onnx.export(faster_rcnn, dummy_input, "model.onnx", verbose = True)
But I get the following error (Sorry for the block quote below stackoverflow wouldn't let the whole trace be in code format and wouldn't let the question be posted otherwise):
Traceback (most recent call last):
small_object_detection_master_samirsen\onnxtest.py", line 44, in <module>
t.onnx.export(faster_rcnn, dummy_input, "fasterrcnn_10120119_06025842847785781.onnx", verbose = True)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\__init__.py",
line 132, in export
strip_doc_string, dynamic_axes)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 64, in export
example_outputs=example_outputs, strip_doc_string=strip_doc_string, dynamic_axes=dynamic_axes)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 329, in _export
_retain_param_name, do_constant_folding)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 213, in _model_to_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args, training)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 171, in _trace_and_get_graph_from_model
trace, torch_out = torch.jit.get_trace_graph(model, args, _force_outplace=True)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\jit__init__.py",
line 256, in get_trace_graph
return LegacyTracedModule(f, _force_outplace, return_inputs)(*args, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 547, in call
result = self.forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\jit__init__.py",
line 323, in forward
out = self.inner(*trace_inputs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 545, in call
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\faster_rcnn.py", line
133, in forward
h, rois, roi_indices)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 545, in call
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\faster_rcnn_vgg16.py",
line 142, in forward
pool = self.roi(x, indices_and_rois)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 545, in call
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\roi_module.py", line
85, in forward
return self.RoI(x, rois)
RuntimeError: Attempted to trace RoI, but tracing of legacy functions is not supported
This is because ONNX does not support torch.grad.Function. The issue is because ROI class Refer this
To overcome the issue, you have to implement the forward and backward function as a separate function definition rather than a member of ROI class.
The function call to ROI in FasterRCNNVGG16 is supposed to be altered to explicit call forward and backward functions.

RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

On ubuntu14.04,I use pytorch with cudnn.This problem happened:
Traceback (most recent call last):
File "main.py", line 58, in <module>
test_detect(test_loader, nod_net, get_pbb, bbox_result_path,config1,n_gpu=config_submit['n_gpu'])
File "/home/ubuntu/nndl/DSB2017/test_detect.py", line 52, in test_detect
output = net(input,inputcoord)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/nndl/DSB2017/net_detector.py", line 102, in forward
out = self.preBlock(x)#16
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 351, in forward
self.padding, self.dilation, self.groups)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 119, in conv3d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
I have google it for severial hours an am really confused.What made this happen?
I just encountered this problem on ubuntu16.04 and solved it. My solution was to run
sudo rm -rf ~/.nv
and then reboot.

Resources