RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED in PyTorch

I am running a CNN using PyTorch on my new machine with 3 NVIDIA GPUs and get the error below:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
File "code.py", line 342, in <module>
trainer.fit(model)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 514, in fit
self.dispatch()
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 554, in dispatch
self.accelerator.start_training(self)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
self._results = trainer.run_train()
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 615, in run_train
self.run_sanity_check(self.lightning_module)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in run_sanity_check
_, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 733, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 164, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 178, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 290, in validation_step
return self.model(*args, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 63, in forward
output = self.module.validation_step(*inputs, **kwargs)
File "code.py", line 314, in validation_step
pred = self.forward(x)
File "code.py", line 259, in forward
x = self.conv0(x) #([12, 600, 600])
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
NVIDIA-SMI: [nvidia-smi screenshot of the new machine, not preserved]
The code runs without any issue on another machine with driver version 450.51.06 and CUDA version 11. The nvidia-smi output of the new machine is shown above. I checked the answers to other questions similar to this issue, and none of them resolved my problem.

Related

Error using RandomizedSearchCV on a LSTM model

I want to find the optimal number of neurons and layers for a LSTM model and I have created the following:
def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[20,11]):
    model = keras.models.Sequential()
    options = {"input_shape": input_shape}
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="swish", **options))
        options = {}
    model.add(keras.layers.Dense(10, **options))
    #optimizer = keras.optimizers.SGD(learning_rate)
    model.compile(loss="mse", optimizer='adam')
    return model
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    #"learning_rate": reciprocal(3e-4, 3e-2),
}
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10)
rnd_search_cv.fit(x_tr, y_tr, epochs=100,
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])
The dataset has 11 features. The model looks at the previous 20 time steps and aims to predict the following 10 time steps. When I run the code above, I get the following error:
> /home/use/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning:
50 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
> --------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
File "/home/use/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/wrappers/scikit_learn.py", line 175, in fit
history = self.model.fit(x, y, **fit_args)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/use/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'gradient_tape/mean_squared_error/BroadcastGradientArgs' defined at (most recent call last):
File "/home/use/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/use/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
app.launch_new_instance()
File "/home/use/anaconda3/lib/python3.9/site-packages/traitlets/config/application.py", line 846, in launch_instance
app.start()
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 712, in start
self.io_loop.start()
File "/home/use/anaconda3/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 199, in start
self.asyncio_loop.run_forever()
File "/home/use/anaconda3/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/home/use/anaconda3/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/home/use/anaconda3/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue
await self.process_one()
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 499, in process_one
await dispatch(*args)
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell
await result
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 730, in execute_request
reply_content = await reply_content
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 390, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/home/use/anaconda3/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 528, in run_cell
return super().run_cell(*args, **kwargs)
File "/home/use/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2914, in run_cell
result = self._run_cell(
File "/home/use/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2960, in _run_cell
return runner(coro)
File "/home/use/anaconda3/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner
coro.send(None)
File "/home/use/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3185, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File "/home/use/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3377, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "/home/use/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/tmp/ipykernel_10055/1251208469.py", line 9, in <module>
rnd_search_cv.fit(x_tr, y_tr , epochs=100,
File "/home/use/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 891, in fit
self._run_search(evaluate_candidates)
File "/home/use/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1766, in _run_search
evaluate_candidates(
File "/home/use/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 838, in evaluate_candidates
out = parallel(
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 572, in __init__
self.results = batch()
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/use/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/use/anaconda3/lib/python3.9/site-packages/sklearn/utils/fixes.py", line 216, in __call__
return self.function(*args, **kwargs)
File "/home/use/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/wrappers/scikit_learn.py", line 175, in fit
history = self.model.fit(x, y, **fit_args)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 526, in minimize
grads_and_vars = self.compute_gradients(loss, var_list, tape)
File "/home/use/anaconda3/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 259, in compute_gradients
grads = tape.gradient(loss, var_list)
Node: 'gradient_tape/mean_squared_error/BroadcastGradientArgs'
Incompatible shapes: [32,20,10] vs. [32,10]
[[{{node gradient_tape/mean_squared_error/BroadcastGradientArgs}}]] [Op:__inference_train_function_1033752]
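A small sketch of where the two shapes in the last error line could come from, assuming each input sample is a window of shape (20, 11) and each target has shape (10,), as the message suggests. A Keras Dense layer only transforms the last axis of its input, so the time axis of length 20 survives into the prediction. The helper dense_output_shape is hypothetical and models shapes only; it is not part of Keras.

```python
def dense_output_shape(input_shape, units):
    """Shape produced by a Dense layer: only the last axis is replaced by `units`."""
    return tuple(input_shape[:-1]) + (units,)


batch = 32                        # batch size seen in the error message
window_shape = (batch, 20, 11)    # 20 time steps, 11 features (assumed input layout)
target_shape = (batch, 10)        # target shape reported in the error message

# Dense(10) applied directly to the 3-D window keeps the time axis.
pred_shape = dense_output_shape(window_shape, 10)
print(pred_shape, "vs", target_shape)

# Flattening the window first removes the time axis before the Dense layer.
flat_shape = (batch, 20 * 11)
flat_pred_shape = dense_output_shape(flat_shape, 10)
print(flat_pred_shape, "vs", target_shape)
```

If this reading is right, collapsing the time axis before the final Dense layer (for example with keras.layers.Flatten) would make the prediction shape match the targets. This has not been checked against the asker's data.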

Error about torch tensor precision on gpu

I tried to fine-tune a BERT model on GPU using PyTorch Lightning's Trainer class with the following code:
from pytorch_lightning import Trainer
from models import LitAdModel, AdModel
from dataloaders import train_dataloader, test_dataloader
model = AdModel()
litmodel = LitAdModel(model=model)
trainer = Trainer(accelerator='gpu', devices=1)
trainer.fit(model=litmodel, train_dataloaders=train_dataloader,
val_dataloaders=test_dataloader)
Here train_dataloader, test_dataloader and the AdModel and LitAdModel classes are defined elsewhere. When I do this without using the GPU, it works (slowly), but with the GPU it gives the following error:
File "/Users/sanjinjuricfot/developer/copy_models/test_pl.py", line
24, in
main() File "/Users/sanjinjuricfot/developer/copy_models/test_pl.py", line 18, in
main
littrain(train=train, test=test) File "/Users/sanjinjuricfot/developer/copy_models/src/_torch/littrain.py",
line 39, in littrain
trainer.fit(model=litmodel, train_dataloaders=train_dataloader, val_dataloaders=test_dataloader) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 582, in fit
call._call_and_handle_interrupt( File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py",
line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 624, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1061, in _run
results = self._run_stage() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1140, in _run_stage
self._run_train() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1153, in _run_train
self._run_sanity_check() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1225, in _run_sanity_check
val_loop.run() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py",
line 199, in run
self.advance(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py",
line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py",
line 199, in run
self.advance(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py",
line 121, in advance
batch = next(data_fetcher) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py",
line 184, in next
return self.fetching_function() File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py",
line 275, in fetching_function
return self.move_to_device(batch) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py",
line 294, in move_to_device
batch = self.batch_to_device(batch) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py",
line 142, in batch_to_device
batch = self.trainer._call_strategy_hook("batch_to_device", batch, dataloader_idx=dataloader_idx) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1443, in _call_strategy_hook
output = fn(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py",
line 273, in batch_to_device
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py",
line 295, in _apply_batch_transfer_handler
batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py",
line 283, in _call_batch_hook
return trainer_method(hook_name, *args) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py",
line 1305, in _call_lightning_module_hook
output = fn(*args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/pytorch_lightning/core/hooks.py",
line 632, in transfer_batch_to_device
return move_data_to_device(batch, device) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_lite/utilities/apply_func.py",
line 101, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to) File
"/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py",
line 55, in apply_to_collection
v = apply_to_collection( File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py",
line 47, in apply_to_collection
return function(data, *args, **kwargs) File "/Users/sanjinjuricfot/developer/copy_models/.venv/lib/python3.10/site-packages/lightning_lite/utilities/apply_func.py",
line 95, in batch_to
data_output = data.to(device, **kwargs) TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support
float64. Please use float32 instead.
I tried calling
torch.set_default_dtype(torch.float32)
in all the relevant files and appending
.to(torch.float32)
to all the tensors, but it didn't work.
I am using a MacBook Pro with an M2 processor. Thanks in advance for any help!
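The traceback fails while the batch is being moved to the device, which suggests that some tensor inside the batch is still float64 at that point. torch.set_default_dtype only affects floating-point tensors created afterwards without an explicit dtype; a tensor built from a float64 NumPy array keeps float64. A minimal sketch of one way to deal with that, assuming the batch is a nested dict, list or tuple of tensors: cast every float64 tensor to float32 on the CPU before it reaches the trainer. The helper cast_float64_to_float32 and the example batch are hypothetical, not part of PyTorch Lightning.

```python
import torch


def cast_float64_to_float32(batch):
    """Recursively replace float64 tensors in a batch with float32 copies."""
    if isinstance(batch, torch.Tensor):
        # Leave integer and float32 tensors untouched.
        return batch.to(torch.float32) if batch.dtype == torch.float64 else batch
    if isinstance(batch, dict):
        return {key: cast_float64_to_float32(value) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        # Rebuild the same container type (plain list or tuple).
        return type(batch)(cast_float64_to_float32(item) for item in batch)
    return batch


# Hypothetical batch: one float64 feature tensor and one integer label tensor.
example_batch = {
    "features": torch.zeros(2, 3, dtype=torch.float64),
    "labels": [torch.ones(2, dtype=torch.int64)],
}
clean_batch = cast_float64_to_float32(example_batch)
print(clean_batch["features"].dtype, clean_batch["labels"][0].dtype)
```

Such a helper could be called at the end of the dataset's __getitem__ or inside a custom collate_fn, so that no float64 tensor is left when the batch is transferred to the MPS device.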

output and feed_dict inside session: FailedPreconditionError (see above for traceback): Attempting to use uninitialized value

I am converting the MTCNN TensorFlow model to TensorFlow-TensorRT.
When I run camera_test.py
I get the error FailedPreconditionError: Attempting to use uninitialized value in TensorFlow:
Traceback (most recent call last): File
"/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 1334, in _do_call
return fn(*args) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 1407, in _call_tf_sessionrun
run_metadata) tensorflow.python.framework.errors_impl.FailedPreconditionError:
Attempting to use uninitialized value conv4_2/biases [[{{node
conv4_2/biases/read}}]] [[{{node Squeeze_1}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "camera_test_trrt.py", line
48, in
boxes_c,landmarks = mtcnn_detector.detect(image) File "../Detection/MtcnnDetector.py", line 371, in detect
boxes, boxes_c, _ = self.detect_pnet(img) File "../Detection/MtcnnDetector.py", line 221, in detect_pnet
cls_cls_map, reg = self.pnet_detector.predict(im_resized) File "../Detection/fcn_detector_trrt.py", line 56, in predict
self.height_op: height}) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 929, in run
run_metadata_ptr) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 1152, in _run
feed_dict_tensor, options, run_metadata) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 1328, in _do_run
run_metadata) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/client/session.py",
line 1348, in _do_call
raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.FailedPreconditionError:
Attempting to use uninitialized value conv4_2/biases [[node
conv4_2/biases/read (defined at ../train_models/mtcnn_model.py:208) ]]
[[node Squeeze_1 (defined at ../train_models/mtcnn_model.py:245) ]]
Caused by op 'conv4_2/biases/read', defined at: File
"camera_test_trrt.py", line 23, in
PNet = FcnDetector(P_Net, '/home/jetsonnano/Downloads/MTCNN-Tensorflow-master/test/p_output_graph_FP16.pb')
File "../Detection/fcn_detector_trrt.py", line 23, in init
self.cls_prob, self.bbox_pred, _ = net_factory(image_reshape, training=False) File "../train_models/mtcnn_model.py", line 208, in
P_Net
bbox_pred = slim.conv2d(net,num_outputs=4,kernel_size=[1,1],stride=1,scope='conv4_2',activation_fn=None)
File
"/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py",
line 182, in func_with_args
return func(*args, **current_args) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py",
line 1158, in convolution2d
conv_dims=2) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py",
line 182, in func_with_args
return func(*args, **current_args) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py",
line 1061, in convolution
outputs = layer.apply(inputs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py",
line 1227, in apply
return self.call(inputs, *args, **kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/layers/base.py",
line 530, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs) File
"/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py",
line 538, in call
self._maybe_build(inputs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py",
line 1603, in _maybe_build
self.build(input_shapes) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py",
line 174, in build
dtype=self.dtype) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/layers/base.py",
line 435, in add_weight
getter=vs.get_variable) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py",
line 349, in add_weight
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/training/checkpointable/base.py",
line 607, in _add_variable_with_custom_getter
**kwargs_for_getter) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py",
line 1479, in get_variable
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py",
line 1220, in get_variable
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py",
line 530, in get_variable
return custom_getter(**custom_getter_kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py",
line 1753, in layer_variable_getter
return _model_variable_getter(getter, *args, **kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py",
line 1744, in _model_variable_getter
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py",
line 182, in func_with_args
return func(*args, **current_args) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py",
line 350, in model_variable
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py",
line 182, in func_with_args
return func(*args, **current_args) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py",
line 277, in variable
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py",
line 499, in _true_getter
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py",
line 911, in _get_single_variable
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variables.py",
line 213, in call
return cls._variable_v1_call(*args, **kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variables.py",
line 176, in _variable_v1_call
aggregation=aggregation) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variables.py",
line 155, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py",
line 2495, in default_variable_creator
expected_shape=expected_shape, import_scope=import_scope) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variables.py",
line 217, in call
return super(VariableMetaclass, cls).call(*args, **kwargs) File
"/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variables.py",
line 1395, in init
constraint=constraint) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/variables.py",
line 1557, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read") File
"/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py",
line 180, in wrapper
return target(*args, **kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py",
line 81, in identity
ret = gen_array_ops.identity(input, name=name) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py",
line 3890, in identity
"Identity", input=input, name=name) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py",
line 788, in _apply_op_helper
op_def=op_def) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py",
line 507, in new_func
return func(*args, **kwargs) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/framework/ops.py",
line 3300, in create_op
op_def=op_def) File "/home/jetsonnano/.virtualenvs/jetsonnanotest/lib/python3.6/site-packages/tensorflow/python/framework/ops.py",
line 1801, in init
self._traceback = tf_stack.extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use
uninitialized value conv4_2/biases [[node conv4_2/biases/read
(defined at ../train_models/mtcnn_model.py:208) ]] [[node Squeeze_1
(defined at ../train_models/mtcnn_model.py:245) ]]
How do I run tf.global_variables_initializer() through sess.run, as in
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
when sess.run already takes output tensors and a feed_dict, as in
cls_prob, bbox_pred,landmark_pred = self.sess.run([self.cls_prob, self.bbox_pred,self.landmark_pred], feed_dict={self.image_op: data})
in detector.py
and
cls_prob, bbox_pred = self.sess.run([self.cls_prob, self.bbox_pred],feed_dict={self.image_op: databatch, self.width_op: width,self.height_op: height})
in fcn_detector.py?
Can anyone help out here?
Just after the following line
self.sess = tf.Session( config=tf.ConfigProto(allow_soft_placement=True, gpu_options=tf.GPUOptions(allow_growth=True)))
declare
init_op = tf.global_variables_initializer()
and do
self.sess.run(init_op)

PyTorch to ONNX export function fails and causes legacy function error

I am trying to convert the PyTorch model in this link to an ONNX model using the code below:
device=t.device('cuda:0' if t.cuda.is_available() else 'cpu')
print(device)
faster_rcnn = FasterRCNNVGG16()
trainer = FasterRCNNTrainer(faster_rcnn).cuda()
#trainer = FasterRCNNTrainer(faster_rcnn).to(device)
trainer.load('./checkpoints/model.pth')
dummy_input = t.randn(1, 3, 300, 300, device = 'cuda')
#dummy_input = dummy_input.to(device)
t.onnx.export(faster_rcnn, dummy_input, "model.onnx", verbose = True)
But I get the following error (sorry for the block quote below; Stack Overflow wouldn't let the whole trace be in code format and wouldn't let the question be posted otherwise):
Traceback (most recent call last):
small_object_detection_master_samirsen\onnxtest.py", line 44, in <module>
t.onnx.export(faster_rcnn, dummy_input, "fasterrcnn_10120119_06025842847785781.onnx", verbose = True)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\__init__.py",
line 132, in export
strip_doc_string, dynamic_axes)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 64, in export
example_outputs=example_outputs, strip_doc_string=strip_doc_string, dynamic_axes=dynamic_axes)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 329, in _export
_retain_param_name, do_constant_folding)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 213, in _model_to_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args, training)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\onnx\utils.py",
line 171, in _trace_and_get_graph_from_model
trace, torch_out = torch.jit.get_trace_graph(model, args, _force_outplace=True)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\jit\__init__.py",
line 256, in get_trace_graph
return LegacyTracedModule(f, _force_outplace, return_inputs)(*args, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 547, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\jit\__init__.py",
line 323, in forward
out = self.inner(*trace_inputs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 545, in __call__
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\faster_rcnn.py", line
133, in forward
h, rois, roi_indices)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 545, in call
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\faster_rcnn_vgg16.py",
line 142, in forward
pool = self.roi(x, indices_and_rois)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 545, in call
result = self._slow_forward(*input, **kwargs)
File "C:\Users\HP\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py",
line 531, in _slow_forward
result = self.forward(*input, **kwargs)
File "D:\smallobject2\export test s\small_object_detection_master_samirsen\model\roi_module.py", line
85, in forward
return self.RoI(x, rois)
RuntimeError: Attempted to trace RoI, but tracing of legacy functions is not supported
This is because the tracer that torch.onnx.export relies on does not support legacy torch.autograd.Function classes, and the RoI class is one: its forward and backward are instance methods, and the model calls an RoI instance directly.
To overcome the issue, you have to implement forward and backward as separate (static) functions rather than as instance methods of the RoI class.
The call to RoI in FasterRCNNVGG16 then has to be changed so that it goes through those forward and backward functions explicitly.
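As a rough illustration of the non-legacy pattern, here is a minimal sketch. ScaledClamp and ScaledClampModule are hypothetical stand-ins invented for this example, not the repository's RoI pooling code (which is not shown in the question). The sketch only shows the shape of the change: forward and backward become static methods that take a ctx object, and the module calls the function through .apply instead of calling an instance. Whether the ONNX exporter can then export the custom op is a separate question and may depend on the PyTorch version.

```python
import torch
from torch.autograd import Function


class ScaledClamp(Function):
    """Hypothetical stand-in op: clamp negatives to zero, then scale."""

    @staticmethod
    def forward(ctx, x, scale):
        # State goes on ctx, not on self: there is no instance any more.
        ctx.save_for_backward(x)
        ctx.scale = scale
        return x.clamp(min=0) * scale

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_x = grad_output * (x > 0).to(grad_output.dtype) * ctx.scale
        # One return value per forward input; `scale` is not a tensor.
        return grad_x, None


class ScaledClampModule(torch.nn.Module):
    """Wraps the function so it can be used like any other layer."""

    def __init__(self, scale):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        # Call through .apply rather than instantiating the Function.
        return ScaledClamp.apply(x, self.scale)


x = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
y = ScaledClampModule(0.5)(x)
y.sum().backward()
```

In the question's code, the analogous change would be to move whatever the RoI constructor stores into arguments of the static forward, and to replace the self.RoI(x, rois) call with a call through .apply.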

RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

On Ubuntu 14.04, I use PyTorch with cuDNN, and this problem happened:
Traceback (most recent call last):
File "main.py", line 58, in <module>
test_detect(test_loader, nod_net, get_pbb, bbox_result_path,config1,n_gpu=config_submit['n_gpu'])
File "/home/ubuntu/nndl/DSB2017/test_detect.py", line 52, in test_detect
output = net(input,inputcoord)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/nndl/DSB2017/net_detector.py", line 102, in forward
out = self.preBlock(x)#16
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 351, in forward
self.padding, self.dilation, self.groups)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 119, in conv3d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
I have googled it for several hours and am really confused. What made this happen?
I just encountered this problem on Ubuntu 16.04 and solved it. My solution was to run
sudo rm -rf ~/.nv
and then reboot.
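After clearing ~/.nv (as far as I know, this is where the NVIDIA driver keeps its per-user compute cache) and rebooting, a quick check like the one below can confirm whether cuDNN convolutions run again. This is only a sketch: the function name cudnn_smoke_test and the tensor sizes are arbitrary choices for illustration, and it simply reports what PyTorch sees rather than diagnosing the cause.

```python
import torch


def cudnn_smoke_test():
    """Report CUDA/cuDNN visibility and try one small 3D convolution."""
    info = {
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),
        "conv3d_ok": None,  # stays None when no GPU is visible
    }
    if info["cuda_available"]:
        try:
            conv = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1).cuda()
            x = torch.randn(1, 1, 8, 8, 8, device="cuda")
            with torch.no_grad():
                out = conv(x)
            # Force any asynchronous CUDA error to surface here.
            torch.cuda.synchronize()
            info["conv3d_ok"] = tuple(out.shape) == (1, 2, 8, 8, 8)
        except RuntimeError as err:
            info["conv3d_ok"] = False
            info["error"] = str(err)
    return info


result = cudnn_smoke_test()
print(result)
```

If conv3d_ok comes back False, the "error" entry holds the message PyTorch raised, which can be compared against the CUDNN_STATUS_INTERNAL_ERROR in the traceback above.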

Resources