RuntimeError: cuda runtime error (100). The GPU is enabled but it still gives the error - pytorch

I am new to Google Colab and PyTorch. I am running a PyTorch model, but it keeps giving me a CUDA runtime error in Colab. The GPU is enabled on my Colab runtime, yet the error persists; the GPU description is shown in the image below. Can anyone please help me out?
torch GPU
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
File "run.py", line 338, in <module>
main()
File "run.py", line 303, in main
model = model.cuda()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in cuda
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 376, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 458, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 190, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:47
Read prediction from logs/logs_sparc_editsql/valid_use_predicted_queries_predictions.json

Never mind, I had set CUDA_VISIBLE_DEVICES to 5.
It should be set to a valid device index for the GPUs you actually have (a Colab runtime exposes a single GPU, so the only valid index is 0).
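For reference, a minimal sketch of checking the device before calling model.cuda(); the index 0 assumption holds for a standard single-GPU Colab runtime:
import os
import torch

# CUDA_VISIBLE_DEVICES must name an existing device; pointing it at index 5
# on a one-GPU machine makes PyTorch see no CUDA-capable device at all.
# It has to be set before CUDA is initialized (i.e. before the first .cuda() call).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(torch.cuda.is_available())   # expected: True
print(torch.cuda.device_count())   # expected: 1 on Colab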

Related

RuntimeError: value cannot be converted to type int without overflow

I want to reproduce DVGO on my platform, and the environment appears to be set up already
(the torch version is 1.12.1+cu113, the gcc version is 7.5.0, and the nvcc version is 11.0.194).
However, when I ran the command "python run.py --config configs/nerf/hotdog.py --render_test", the overflow error occurred.
The traceback is below.
Traceback (most recent call last):
File "run.py", line 630, in <module>
train(args, cfg, data_dict)
File "run.py", line 545, in train
data_dict=data_dict, stage='coarse')
File "run.py", line 449, in scene_rep_reconstruction
**render_kwargs)
File "/opt/pyenv/versions/mlhw/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/DirectVoxGO/lib/dvgo.py", line 309, in forward
rays_o=rays_o, rays_d=rays_d, **render_kwargs)
File "/home/user/DirectVoxGO/lib/dvgo.py", line 288, in sample_ray
ray_pts, mask_outbbox, ray_id, step_id, N_steps, t_min, t_max = render_utils_cuda.sample_pts_on_rays(
RuntimeError: value cannot be converted to type int without overflow
Does anyone know how to deal with this problem?
I have tried to trace the source code in the GitHub page above, but I am not familiar with the underlying C++/CUDA extension code.
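Not an answer, but one mismatch worth ruling out (this is an assumption on my part, not something the question confirms as the cause): the installed wheel is built for CUDA 11.3 (cu113) while nvcc is 11.0, and render_utils_cuda is a custom extension compiled with the local nvcc. A quick check:
import torch

print(torch.__version__)     # 1.12.1+cu113 -> wheel built against CUDA 11.3
print(torch.version.cuda)    # CUDA version PyTorch was compiled with
print(torch.cuda.get_device_name(0))
# Compare with `nvcc --version` (11.0.194 here); rebuilding the extension with a
# matching CUDA toolkit rules out one common source of odd runtime errors.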

RuntimeError: The 'data' object was created by an older version of PyG

Thanks for your great contribution to science.
I have installed the following PyTorch and pytorch_geometric versions, as you mentioned in this link:
conda create -n tox-env python=3.6
conda install pytorch=1.6.0 torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install torch-scatter==2.0.6 torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-geometric==2.0.0
The reason is that I am trying to run the code from a GitHub repository; when it reaches this line, it raises an error (with the latest version of PyTorch). I therefore had to downgrade the PyG and PyTorch versions; however, I am now getting the following error:
/home/es/anaconda3/envs/tox-env/bin/python /home/es/PycharmProjects/1-Meta-MGNN/Meta-MGNN/main.py
/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/cuda/__init__.py:125: UserWarning:
NVIDIA GeForce RTX 3090 Ti with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA GeForce RTX 3090 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
tox21
Iteration: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/es/PycharmProjects/1-Meta-MGNN/Meta-MGNN/main.py", line 131, in <module>
main("tox21", "model_gin/supervised_contextpred.pth", "gin", True, True, True, 0.1, 5)
File "/home/es/PycharmProjects/1-Meta-MGNN/Meta-MGNN/main.py", line 105, in main
support_grads = model(epoch)
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/es/PycharmProjects/1-Meta-MGNN/Meta-MGNN/meta_model.py", line 183, in forward
for step, batch in enumerate(tqdm(support_loaders[task], desc="Iteration")):
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
return self._process_data(data)
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch_geometric/data/dataset.py", line 198, in __getitem__
data = self.get(self.indices()[idx])
File "/home/es/PycharmProjects/1-Meta-MGNN/Meta-MGNN/loader.py", line 142, in get
for key in self.data.keys:
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch_geometric/data/data.py", line 103, in keys
for store in self.stores:
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch_geometric/data/data.py", line 393, in stores
return [self._store]
File "/home/es/anaconda3/envs/tox-env/lib/python3.6/site-packages/torch_geometric/data/data.py", line 341, in __getattr__
"The 'data' object was created by an older version of PyG. "
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.
Process finished with exit code 1
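For what it's worth, a minimal sketch of the step the error message itself suggests; the dataset root path below is a guess and depends on how the repository's loader.py constructs its dataset:
import os
import shutil

root = "dataset/tox21"                        # hypothetical dataset root, adjust as needed
processed_dir = os.path.join(root, "processed")

# Deleting the cached files forces PyG to re-run process() with the currently
# installed version, which is exactly what the error message asks for.
if os.path.isdir(processed_dir):
    shutil.rmtree(processed_dir)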

[ERROR]: training MobileNetV2 with the Object Detection API fails during training with an unknown error

W0615 19:12:26.293519 16220 deprecation.py:554] From C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\util\deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2022-06-15 19:13:01.007705: W tensorflow/core/framework/op_kernel.cc:1733] UNKNOWN: JIT compilation failed.
Traceback (most recent call last):
File "C:\Users\oknor\Documents\Programming\TrainingModels\TensorFlow\workspace\car_training\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\Users\oknor\Documents\Programming\TrainingModels\TensorFlow\workspace\car_training\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
losses_dict = _dist_train_step(train_input_iter)
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
File "C:\Program Files\Python310\lib\threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "C:\Program Files\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
if record_summaries:
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
tf.compat.v2.summary.image(
File "C:\Program Files\Python310\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
File "C:\Program Files\Python310\lib\threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "C:\Program Files\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
if record_summaries:
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
tf.compat.v2.summary.image(
File "C:\Program Files\Python310\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node train_input_images/write_summary/mod}}]]
[[train_input_images/write_summary/Equal_1/_16]]
(1) UNKNOWN: JIT compilation failed.
[[{{node train_input_images/write_summary/mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_59118]
This is part of the console output; if required I can share the whole output.
I am using TensorFlow 2.9, Python 3.10, CUDA 11.7, and cuDNN 8401.
I want to train MobileNetV2 to detect cars in images (custom object detection).
I get this error as soon as I run the command to start the training process.
Had the same issue.
Try running it with TF 2.9.1, CUDA 11.2, and cuDNN 8.1.
Source
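Beyond the version change above, the "Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice" lines indicate XLA cannot locate the CUDA installation. A commonly reported workaround (not part of the answer above, and the path below is a hypothetical Windows install location) is to point XLA at the CUDA directory that actually contains nvvm\libdevice before TensorFlow is imported:
import os

# Hypothetical CUDA 11.2 install path; adjust to the directory on your machine
# that contains nvvm\libdevice.
os.environ["XLA_FLAGS"] = (
    "--xla_gpu_cuda_data_dir=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2"
)

import tensorflow as tf  # must be imported after XLA_FLAGS is set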

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED while using flair

I have been following https://github.com/zalandoresearch/flair#example-usage
and tried using flair to experiment with it, but I don't know why I am not able to use the GPU.
I tried the following:
>>> from flair.data import Sentence
>>> from flair.models import SequenceTagger
>>> sentence = Sentence('I love Berlin .')
>>> tagger = SequenceTagger.load('ner')
2019-07-20 17:52:15,062 loading file /home/vz/.flair/models/en-ner-conll03-v0.4.pt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/flair/nn.py", line 103, in load
model = cls._init_model_with_state_dict(state)
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 205, in _init_model_with_state_dict
locked_dropout=use_locked_dropout,
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 166, in __init__
self.to(flair.device)
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/home/vz/miniconda3/envs/gp/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Can anyone please help me figure out how to fix this error?
Thanks in advance.
The error was with my machine's cuDNN setup. I would suggest everyone install PyTorch with conda, so the install command should be something like this:
conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
in order to avoid any issues with the installation.
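After reinstalling, a quick sanity check (just standard PyTorch calls) to confirm that the CUDA runtime and cuDNN shipped with the conda package are visible:
import torch

print(torch.version.cuda)               # e.g. 9.0 for the package above
print(torch.backends.cudnn.version())   # cuDNN build PyTorch links against
print(torch.cuda.is_available())        # should be True on a working setup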

InternalError: Invalid variable reference. [Op:ResourceApplyAdam] on TensorFlow

I am currently working with eager execution on TensorFlow 1.7.0.
I get this error when I am running on GPU:
tensorflow.python.framework.errors_impl.InternalError: Invalid variable reference. [Op:ResourceApplyAdam]
Unfortunately, I wasn't able to isolate the error, so I can't provide a snippet that reproduces it.
The error doesn't occur when I run on CPU. My code was working fine on GPU until a recent update. I don't think it is machine-related, because it occurs on different machines.
I wasn't able to find anything relevant, so if you have any hints about what could cause this error, please let me know.
Complete traceback:
2018-07-19 17:52:32.393711: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at training_ops.cc:2507 : Internal: Invalid variable reference.
Traceback (most recent call last):
File "debugging_jules_usage.py", line 391, in <module>
mainLoop()
File "debugging_jules_usage.py", line 370, in mainLoop
raise e
File "debugging_jules_usage.py", line 330, in mainLoop
Kn.fit(train)
File "/home/jbayet/xai-link-prediction/xai_lp/temporal/models_temporal.py", line 707, in fit
self._train_one_batch(X_bis, i)
File "/home/jbayet/xai-link-prediction/xai_lp/temporal/models_temporal.py", line 639, in _train_one_batch
self.optimizer.minimize(batch_model_loss, global_step=tf.train.get_global_step())
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 409, in minimize
name=name)
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 564, in apply_gradients
update_ops.append(processor.update_op(self, grad))
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 161, in update_op
update_op = optimizer._resource_apply_dense(g, self._v)
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/adam.py", line 166, in _resource_apply_dense
grad, use_locking=self._use_locking)
File "/home/jbayet/miniconda3/envs/xai/lib/python3.6/site-packages/tensorflow/python/training/gen_training_ops.py", line 1105, in resource_apply_adam
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Invalid variable reference. [Op:ResourceApplyAdam]
