I'm following this guide without changing anything. I'm using an aws server with deep learning ami: Deep Learning AMI (Ubuntu 18.04) Version 40.0
I've tried to change my custom dataset to the coco dataset and to a small subset of the custom one.
batch size doesn't seems to matter, CUDA and other drivers seems to work.
The exception is thrown when the batch starts the training process. This is the full stack trace:
Logging results to runs/train/exp66
Starting training for 5 epochs...
Epoch gpu_mem box obj cls total targets img_size
0%| | 0/22 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 533, in <module>
train(hyp, opt, device, tb_writer, wandb)
File "train.py", line 298, in train
pred = model(imgs) # forward
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/yolov5/models/yolo.py", line 121, in forward
return self.forward_once(x, profile) # single-scale inference, train
File "/home/ubuntu/yolov5/models/yolo.py", line 137, in forward_once
x = m(x) # run
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/yolov5/models/common.py", line 113, in forward
return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubuntu/yolov5/models/common.py", line 38, in forward
return self.act(self.bn(self.conv(x)))
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
I don't know why but it seems as torch 1.8 is built on older version of cuda.
Also as pytorch has its own cuda it seems to doesn't care what version you have on your machine.
Changing the torch version (and matching compatible tochvision) solved my problem.
In my case I did as follows:
Changed two lines in "requirements.txt":
torch==1.7.1
torchvision==0.8.2
Created fresh conda environment with python=3.8
Activated the environment
Installed requirements from modified file:
$ pip install -r requirements.txt
Hope it'll help to someone :)
I fixed it using conda, I cloned the pytorch environment one which came with the image and it works perfectly. I still don't know the cause though.
I ran into something similar when trying to train yolov5 in a script. I found that upgrading to torch==1.9.0 and torchvision==0.10.0 also works (in case you dont want to downgrade as mentioned above)
Related
I am trying to run this 3D pose estimation repo in Google Colab on a GPU, but after doing all of the steps and putting in my own left/right cam vids, I get this error in Colab:
infering thread started
1 1
: cannot connect to X server
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/content/Stereo-3D-Pose-Estimation/poseinferscheduler.py", line 59, in infer_pose_loop
l_pose_t = infer_fast(self.net, l_img, height, self.stride, self.upsample_ratio, self.cpu)
File "/content/Stereo-3D-Pose-Estimation/pose3dmodules.py", line 47, in infer_fast
stages_output = net(tensor_img)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/Stereo-3D-Pose-Estimation/models/with_mobilenet.py", line 115, in forward
backbone_features = self.model(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/activation.py", line 102, in forward
return F.relu(input, inplace=self.inplace)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1296, in relu
result = torch.relu_(input)
RuntimeError: unknown parameter type
I am a bit confused as to why I am seeing it, I have already installed all necessary prerequisites; also can't interpret what it means either.
Since the traceback happens in the pytorch library, I checked the code there on the pytorch github.
What the error means is that you are calling an inplace activation function in torch.relu_ to some object called input. However, what is happening is that the type of input is not recognized by the torch backend which is why it is a runtime error.
Therefore, I would suggest to print out input and also run
type(input)
to find out what object input represents and what that variable is. As a further reference, this is the particular script that Pytorch runs in the backend that leads it to throw an unknown parameter type error. From a quick look, it seems to be a switch statement that confirms if a value falls into a list of types. If it is not in the list of types, then it will run the default block which throws unknown parameter type error.
https://github.com/pytorch/pytorch/blob/aacc722aeca3de1aedd35adb41e6f8149bd656cd/torch/csrc/utils/python_arg_parser.cpp#L518-L541
EDIT:
If type(input) returns a torch.tensor then it is probably an issue with the version of python you are using. I know you said you have the prerequisites but I think it would be good to double check if you have python 3.6, and maybe but less preferably python 3.5 or 3.7. These are the python versions that work with the repo you just sent me.
You can find the python version on your collab by typing
!python --version on one of cells. Make sure that it returns a correct version supported by the software you are running. This error might come from the fact that instead of torch, python itself is expressing this error in its backend.
I found this stackoverflow useful as it shows how some code was unable to recognize a built in type dictionary in python: "TypeError: Unknown parameter type: <class 'dict_values'>"
The solution to this was to check python versions.
Sarthak
I'm writing a GUI application with wxpython. The application uses yolo to detect pavement breakage. I use the yolo code to train and detect. It is too time-consuming to load the yolo model, so the GUI will freeze. Therefore, I expect to show a progress bar during loading yolo model with threading.Thread. I can use main thread to load yolo model, but I get a exception during loading yolo model with a new thread.
The error:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 5652, in get_controller
yield g
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 76, in generate
self.yolo_model = load_model(model_path, compile=False)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 221, in _deserialize_model
model_config = f['model_config']
File "C:\Program Files\Python36\lib\site-packages\keras\utils\io_utils.py", line 302, in __getitem__
raise ValueError('Cannot create group in read only mode.')
ValueError: Cannot create group in read only mode.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadDetectionModel.py", line 166, in init
self.__m_oVideoDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myVideoDetector.py", line 130, in init
self.__m_oDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadBreakageDetector.py", line 87, in init
self.__m_oYoloDetector.init()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 46, in init
self.boxes, self.scores, self.classes = self.generate()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 80, in generate
self.yolo_model.load_weights(self.model_path) # make sure model, anchors and classes match
File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 1058, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2470, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1098, in _run
raise RuntimeError('The Session graph is empty. Add operations to the '
RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
May somebody give me any suggestion?
When using wxPython with threads, you need to make sure that you are using a thread-safe method to communicate back to the GUI. There are 3 thread-safe methods you can use with wxPython:
wx.CallAfter
wx.CallLater
wx.PostEvent
Check out either of the following articles for more information
https://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
https://wiki.wxpython.org/LongRunningTasks
I am new to Python and trying to install Airflow in my Mac, by following this tutorial
While these two commands work fine:
$ airflow initdb
$ airflow webserver -p 8080
The scheduler command (airflow scheduler) throws the following error:
[2020-02-18 13:18:09,012] {scheduler_job.py:1382} ERROR - Exception when executing execute_helper Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1380, in _execute
self._execute_helper()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1413, in _execute_helper
self.processor_agent.start()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 554, in start
self._process.start()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SchedulerJob._execute.<locals>.processor_factory'
[2020-02-18 13:18:09,035] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID None
Traceback (most recent call last): File "/Users/mac/Workspace/airflow/airflow_venv/bin/airflow", line 37, in <module>
args.func(args) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/bin/cli.py", line 1040, in scheduler
job.run() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 221, in run
self._execute() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1384, in _execute
self.processor_agent.end() File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 707, in end
reap_process_group(self._process.pid, log=self.log) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 324, in reap_process_group
signal_procs(sig) File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 293, in signal_procs
os.killpg(pgid, sig)
TypeError: an integer is required (got type NoneType)
EDIT: Python 3.8 is supported now https://github.com/apache/airflow#requirements. So this answer might not be relevant now.
This due to the Python version you are using. Airflow doesn't support Python 3.8 yet https://github.com/apache/airflow#stable-version-1109.
Downgrade your Python to 3.7 and check.
Maybe there are some compatibility problems?
Using Python 3.6.10 and airflow v1.10.4, I can get airflow running. Maybe you could try some other versions?
This worked for me!
1- Make sure you are using the correct celery version that supports your other packages like RabbitMQ ( as V5 doesn't support AMQP in its usual format), my advice is to use V4.6.X
2-THIS HAS NOTHING TO DO WITH PYTHON VERSION IF YOU ARE USING AIRFLOW V2.0
3- simply make yourself happy with airflow db reset (command may differ if you are using airflow Version X<2.0 )
4- Avoid deleting any dag like you delete a file and use airflow dag ... commands to do so. (it makes up a mess in your environment that you wont like, trust me on this..)
Wish you luck bearing python stuff..
https://github.com/zzh8829/yolov3-tf2 is the project. I've installed all the correct versions ofthings I think.
google is telling me that it is probably a low VRAM issue but I am still looking around for other reasons. please help.
I am using :
Windows 10 (don't say "there's your problem" I need it)
cuDNN 7.4.6
CUDA 10.0
tensorflow 2.0.0
python 3.6
I have a gtx1660 super 6GB VRAM with a ryzen 7 2700x on 16GB of RAM. I'm getting a gt1080 8gig in a few days I'm going to add to the second PCI slot.
the Error is as follows:
2019-11-30 06:31:26.167368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-11-30 06:31:27.843742: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-11-30 06:31:27.853725: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
File ".\convert.py", line 34, in <module>
app.run(main)
File "C:\Program Files\Python36\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Program Files\Python36\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File ".\convert.py", line 25, in main
output = yolo(img)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\keras\layers\convolutional.py", line 197, in call
outputs = self._convolution_op(inputs, self.kernel)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 1134, in __call__
return self.conv_op(inp, filter)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 639, in __call__
return self.call(inp, filter)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 238, in __call__
name=self.name)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
name=name)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1031, in conv2d
data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1130, in conv2d_eager_fallback
ctx=_ctx, name=name)
File "C:\Program Files\Python36\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a wa
rning log message was printed above. [Op:Conv2D]
I had the same problem in the same repository.
The solution that worked for me and my team was to upgrade cuDNN to version 7.5 or higher (as opposed to your 7.4).
The instructions for updating can be found on Nvidia's site:
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
This could happen for a few reasons.
(1) As you mentioned, it may be a a memory issue, which you could try to verify by allocating less memory to the GPU and seeing if that error still occurs. You can do this in TF 2.0 like so (https://github.com/tensorflow/tensorflow/issues/25138#issuecomment-484428798):
import tensorflow as tf
tf.config.gpu.set_per_process_memory_fraction(0.75)
tf.config.gpu.set_per_process_memory_growth(True)
# your model creation, etc.
model = MyModel(...)
I see the code you're running sets dynamic memory growth if you have > 1 GPU (https://github.com/zzh8829/yolov3-tf2/blob/master/train.py#L46-L47), but since you only have 1 GPU, then it is likely just trying to allocate all memory (>90%) at the start.
(2) Some users seem to have experienced this on Windows when there were other TensorFlow or similar processes using the GPU simultaneously, either by you or by other users: https://stackoverflow.com/a/53707323/10993413
(3) As always, make sure your PATH variables are correct. Sometimes if you tried multiple installations and didn't clean things up properly, the PATHs may be finding the wrong version first and cause an issue. If you add new paths to the beginning of PATH, they should be found first: https://www.tensorflow.org/install/gpu#windows_setup
(4) As mentioned by #xenotecc, you could try upgrading to a newer version of CUDNN, though I'm not sure this will help since your config is listed as supported on TF docs: https://www.tensorflow.org/install/source#gpu. If this does solve it, it may have been PATH issue after all since you will likely update the PATHs after installing the newer version.
Got the same error and resolved by below:
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(
gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5000)])
(with GTX 1660, 6G memory, tensorflow 2.0.1)
Simple fix:
insert this line under the imports in "convert.py"
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
this will ignore your gpu while loading the weights.
I am running this example code ( seq2seq built on Keras)form https://github.com/fchollet/keras/blob/master/examples/lstm_seq2seq.py.
This code runs correctly on my Ubuntu. But an error occured when I ran the same code on my Windows.
It says:
Using TensorFlow backend.
Number of samples: 10000
Number of unique input tokens: 73
Number of unique output tokens: 86
Max sequence length for inputs: 17
Max sequence length for outputs: 42
Traceback (most recent call last):
File "h:/eclipse_workspace/Keras_DL/src/seq2seq/lstm_seq2seq.py", line 125, in
encoder = LSTM(latent_dim, return_state = True)
File "D:\software\anaconda\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "D:\software\anaconda\lib\site-packages\keras\layers\recurrent.py", line 949, in init
super(LSTM, self).init(**kwargs)
File "D:\software\anaconda\lib\site-packages\keras\layers\recurrent.py", line 191, in init
super(Recurrent, self).init(**kwargs)
File "D:\software\anaconda\lib\site-packages\keras\engine\topology.py", line 281, in init
raise TypeError('Keyword argument not understood:', kwarg)
TypeError: ('Keyword argument not understood:', 'return_state')
I found that return_state do exists in
keras.layers.recurrent.Recurrent(return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False, implementation=0)
Can anyone tell me how can I run this demo correctly on Windows?
My system info:
- OS : Windows 10 64 bit
- python 3.5.2 64 bit
- cudnn-8.0-windows10-x64-v5.1
- keras 2.04 tensorflow-gpu 1.1.0
Your Keras version is too old. return_state is added in Keras 2.0.5. I suggest you install the latest version from GitHub, since the example code you're running has been added to the library less than 24 hours ago.