Running training using torch.distributed.launch - pytorch

I'm trying to train the following model: https://github.com/Atten4Vis/ConditionalDETR
using the script conddetr_r50_epoch50.sh, as described in the README. The script looks like this:
script_name1=`basename $0`
script_name=${script_name1:0:${#script_name1}-3}
python -m torch.distributed.launch \
--nproc_per_node=8 \
--use_env \
main.py \
--coco_path ../data/coco \
--output_dir output/$script_name
But I am getting the following errors:
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-16DB4TE]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-16DB4TE]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "C:\DETR\ConditionalDETR\main.py", line 258, in <module>
main(args)
File "C:\DETR\ConditionalDETR\main.py", line 116, in main
utils.init_distributed_mode(args)
File "C:\DETR\ConditionalDETR\util\misc.py", line 429, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\cuda\__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 55928) of binary: C:\ProgramData\Anaconda3\envs\conditional_detr\python.exe
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 195, in <module>
main()
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 191, in main
launch(args)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launch.py", line 176, in launch
run(args)
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\run.py", line 753, in run
elastic_launch(
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\ProgramData\Anaconda3\envs\conditional_detr\lib\site-packages\torch\distributed\launcher\api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I am very new to PyTorch and do not quite understand why I'm getting these errors. What should I do to fix them?
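One thing worth checking first, offered as a sketch rather than a confirmed diagnosis: the AttributeError on torch._C._cuda_setDevice is commonly reported when the installed PyTorch is a CPU-only build, and --nproc_per_node=8 assumes eight GPUs on the machine. The helper name describe_torch_build below is hypothetical; it only reports what the installed package says about itself.

```python
def describe_torch_build():
    """Summarise whether the installed PyTorch build can use CUDA.

    Hypothetical helper for illustration; it only reads public attributes.
    """
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # torch.version.cuda is None on CPU-only builds
        "built_with_cuda": torch.version.cuda is not None,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

If built_with_cuda is False, installing a CUDA-enabled build is the likely next step; if gpu_count is smaller than 8, --nproc_per_node would need to be lowered to match.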


Unknown error while training mobnetv2 with the object_detection API

W0615 19:12:26.293519 16220 deprecation.py:554] From C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\util\deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2022-06-15 19:13:01.007705: W tensorflow/core/framework/op_kernel.cc:1733] UNKNOWN: JIT compilation failed.
Traceback (most recent call last):
File "C:\Users\oknor\Documents\Programming\TrainingModels\TensorFlow\workspace\car_training\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\Users\oknor\Documents\Programming\TrainingModels\TensorFlow\workspace\car_training\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
losses_dict = _dist_train_step(train_input_iter)
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
File "C:\Program Files\Python310\lib\threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "C:\Program Files\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
if record_summaries:
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
tf.compat.v2.summary.image(
File "C:\Program Files\Python310\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
Detected at node 'train_input_images/write_summary/mod' defined at (most recent call last):
File "C:\Program Files\Python310\lib\threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "C:\Program Files\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 629, in train_step_fn
if record_summaries:
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 630, in train_step_fn
tf.compat.v2.summary.image(
File "C:\Program Files\Python310\lib\site-packages\tensorboard\plugins\image\summary_v2.py", line 141, in image
tag=tag, tensor=lazy_tensor, step=step, metadata=summary_metadata
File "C:\Users\oknor\AppData\Roaming\Python\Python310\site-packages\object_detection\model_lib_v2.py", line 599, in <lambda>
lambda: global_step % num_steps_per_iteration == 0):
Node: 'train_input_images/write_summary/mod'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node train_input_images/write_summary/mod}}]]
[[train_input_images/write_summary/Equal_1/_16]]
(1) UNKNOWN: JIT compilation failed.
[[{{node train_input_images/write_summary/mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dist_train_step_59118]
This is only part of the console output; I can share the whole output if required.
I am using TensorFlow 2.9, Python 3.10, CUDA 11.7, and cuDNN 8401.
I want to train mobnetv2 to detect cars in images (custom object detection).
I get this error once I run the command to start the training process.
I had the same issue.
Try running it with TF 2.9.1, CUDA 11.2, and cuDNN 8.1.
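To compare the locally installed toolkit against what the TensorFlow wheel was built for, something like the following sketch may help. It assumes TensorFlow 2.3 or newer (where tf.sysconfig.get_build_info is available); describe_tf_build is a hypothetical helper name, and the CUDA keys may be absent on CPU-only builds, which is why .get() is used.

```python
def describe_tf_build():
    """Report the CUDA/cuDNN versions the installed TensorFlow was built against.

    Hypothetical helper for illustration; assumes TensorFlow >= 2.3.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"installed": False}
    build = tf.sysconfig.get_build_info()
    return {
        "installed": True,
        "tf_version": tf.__version__,
        # These keys are expected on GPU builds and may be missing otherwise
        "built_for_cuda": build.get("cuda_version"),
        "built_for_cudnn": build.get("cudnn_version"),
    }


if __name__ == "__main__":
    print(describe_tf_build())
```

If the reported versions differ from the CUDA and cuDNN versions installed on the machine, that mismatch is a plausible reason for the libdevice and JIT errors above.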

How to solve TemplateRuntimeError: no test named '>' when compiling Chromium?

When compiling Chromium, I get the following error messages:
ninja: Entering directory `out/Cros'
[44/32769] ACTION //components/exo/wayland/compatibility_test:generated_client_event_receiver_version_tests(//build/toolchain/linux:clang_x64)
FAILED: gen/components/exo/wayland/compatibility_test/all_generated_client_event_receiver_version_tests.cc
python3 ../../components/exo/wayland/compatibility_test/wayland_protocol_codegen.py ../../third_party ../../buildtools/linux64/clang-format ../../components/exo/wayland/compatibility_test/template_client_event_receiver_version_tests.cc.tmpl gen/components/exo/wayland/compatibility_test/all_generated_client_event_receiver_version_tests.cc ../../components/exo/wayland/protocol/aura-shell.xml ../../third_party/wayland/src/protocol/wayland.xml ../../third_party/wayland-protocols/src/stable/presentation-time/presentation-time.xml ../../third_party/wayland-protocols/src/stable/viewporter/viewporter.xml ../../third_party/wayland-protocols/src/stable/xdg-shell/xdg-shell.xml ../../third_party/wayland-protocols/src/unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/input-timestamps/input-timestamps-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/linux-explicit-synchronization/linux-explicit-synchronization-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/pointer-constraints/pointer-constraints-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/pointer-gestures/pointer-gestures-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/relative-pointer/relative-pointer-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/text-input/text-input-unstable-v1.xml ../../third_party/wayland-protocols/src/unstable/xdg-shell/xdg-shell-unstable-v6.xml ../../third_party/wayland-protocols/unstable/alpha-compositing/alpha-compositing-unstable-v1.xml ../../third_party/wayland-protocols/unstable/color-space/color-space-unstable-v1.xml ../../third_party/wayland-protocols/unstable/cursor-shapes/cursor-shapes-unstable-v1.xml ../../third_party/wayland-protocols/unstable/gaming-input/gaming-input-unstable-v2.xml 
../../third_party/wayland-protocols/unstable/keyboard/keyboard-configuration-unstable-v1.xml ../../third_party/wayland-protocols/unstable/keyboard/keyboard-extension-unstable-v1.xml ../../third_party/wayland-protocols/unstable/notification-shell/notification-shell-unstable-v1.xml ../../third_party/wayland-protocols/unstable/remote-shell/remote-shell-unstable-v1.xml ../../third_party/wayland-protocols/unstable/secure-output/secure-output-unstable-v1.xml ../../third_party/wayland-protocols/unstable/stylus/stylus-unstable-v2.xml ../../third_party/wayland-protocols/unstable/stylus-tools/stylus-tools-unstable-v1.xml ../../third_party/wayland-protocols/unstable/vsync-feedback/vsync-feedback-unstable-v1.xml
Traceback (most recent call last):
File "../../components/exo/wayland/compatibility_test/wayland_protocol_codegen.py", line 120, in <module>
main()
File "../../components/exo/wayland/compatibility_test/wayland_protocol_codegen.py", line 111, in main
expanded = expand_template(args.template,
File "../../components/exo/wayland/compatibility_test/wayland_protocol_codegen.py", line 58, in expand_template
return env.get_template(os.path.basename(template)).render(context)
File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 989, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 754, in handle_exception
reraise(exc_type, exc_value, tb)
File "/usr/lib/python3/dist-packages/jinja2/_compat.py", line 37, in reraise
raise value.with_traceback(tb)
File "../../components/exo/wayland/compatibility_test/template_client_event_receiver_version_tests.cc.tmpl", line 32, in <module>
{% if interface.events|selectattr("since")|selectattr("since", ">", min_version)|list|length %}
File "/usr/lib/python3/dist-packages/jinja2/filters.py", line 750, in do_list
return list(value)
File "/usr/lib/python3/dist-packages/jinja2/filters.py", line 942, in _select_or_reject
if modfunc(func(transfunc(item))):
File "/usr/lib/python3/dist-packages/jinja2/filters.py", line 935, in <lambda>
func = lambda item: context.environment.call_test(
File "/usr/lib/python3/dist-packages/jinja2/environment.py", line 449, in call_test
raise TemplateRuntimeError('no test named %r' % name)
jinja2.exceptions.TemplateRuntimeError: no test named '>'
[61/32769] CXX obj/mojo/public/tools/fuzzers/fuzz_mojom_blink/fuzz.mojom-blink.o
ninja: build stopped: subcommand failed.
I was not sure whether it is related to the version of jinja2; mine is definitely v2.8.
autoninja -C out/Default works fine for me.
The problem appears when I try to compile CrOS on my Ubuntu 16.04 machine.
autoninja -C out/Cros chrome also works fine. #_#
Found the reason!
ninja was using the system-wide jinja2 instead of the version under the src/third_party dir.
I solved the problem by uninstalling the system jinja2.
Thanks to tikuta#chromium.org for the guidance. \o/
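A small sketch for confirming which jinja2 copy gets picked up, assuming it is run with the same python3 that the build invokes. It reports where jinja2 was imported from and whether its default environment registers a test named '>', which is what the template's selectattr call needs. The function name inspect_jinja2 is hypothetical.

```python
def inspect_jinja2():
    """Report which jinja2 is imported and whether it knows the '>' test.

    Hypothetical helper for illustration; returns None if jinja2 is missing.
    """
    try:
        import jinja2
    except ImportError:
        return None
    env = jinja2.Environment()
    return {
        "version": getattr(jinja2, "__version__", "unknown"),
        # A /usr/lib/python3/dist-packages path means the system copy is used
        "path": jinja2.__file__,
        "has_gt_test": ">" in env.tests,
    }


if __name__ == "__main__":
    print(inspect_jinja2())
```

If path points at the system dist-packages directory and has_gt_test is False, that matches the situation described above.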

SyntaxError while trying to launch a pynodered server

I am trying to run a simple piece of Python code as a node in Node-RED. I installed pynodered and created a .py file containing the following code:
from pynodered import node_red
@node_red(category="pyfuncs")
def lower_case(node, msg):
    msg['payload'] = str(msg['payload']).lower()
    return msg
I tried launching the pynodered server, but I got a syntax error (the same error on both Windows 10 and Ubuntu 16.04):
C:\Users\omara\Downloads\Cyber Physical Systems>pynodered test.py
Traceback (most recent call last):
File "c:\users\omara\appdata\local\programs\python\python35\lib\runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\omara\appdata\local\programs\python\python35\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\omara\AppData\Local\Programs\Python\Python35\Scripts\pynodered.exe\__main__.py", line 5, in <module>
File "c:\users\omara\appdata\local\programs\python\python35\lib\site-packages\pynodered\server.py", line 93
print(f"From {name} register {obj.name}")
^
SyntaxError: invalid syntax
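A note on the traceback, with a small sketch: the caret points at an f-string inside pynodered's own server.py, and the paths show the interpreter is Python 3.5. f-strings were added in Python 3.6 (PEP 498), so an older interpreter reports them as a syntax error regardless of what is in test.py. The helper name supports_f_strings is hypothetical.

```python
import sys


def supports_f_strings(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings."""
    # f-string literals were introduced in Python 3.6 (PEP 498)
    return tuple(version_info[:2]) >= (3, 6)


if __name__ == "__main__":
    print(sys.version)
    print("f-strings supported:", supports_f_strings())
```

Running pynodered under Python 3.6 or newer would be the thing to try, on the assumption that the installed pynodered release relies on f-strings throughout.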

tensorboard debugger is not working

I have tried to initialise the tensorboard debugger via the following command
tensorboard --logdir summarytrain/ --debugger_port 7000
The output from this command is:
Traceback (most recent call last):
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\plugins\debugger\debugger_plugin_loader.py", line 79, in _ConstructDebuggerPluginWithGrpcPort
from tensorboard.plugins.debugger import debugger_plugin as debugger_plugin_lib
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\plugins\debugger\debugger_plugin.py", line 36, in <module>
from tensorboard.plugins.debugger import debugger_server_lib
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\plugins\debugger\debugger_server_lib.py", line 33, in <module>
from tensorflow.python.debug.lib import grpc_debug_server
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorflow\python\debug\lib\grpc_debug_server.py", line 27, in <module>
import grpc
ModuleNotFoundError: No module named 'grpc'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\krisb\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\krisb\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\krisb\AppData\Local\Programs\Python\Python36\Scripts\tensorboard.exe\__main__.py", line 9, in <module>
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\main.py", line 36, in run_main
tf.app.run(main)
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
_sys.exit(main(argv))
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\main.py", line 45, in main
default.get_assets_zip_provider())
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\program.py", line 165, in main
tb = create_tb_app(plugins, assets_zip_provider)
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\program.py", line 199, in create_tb_app
window_title=FLAGS.window_title)
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\backend\application.py", line 126, in standard_tensorboard_wsgi
plugin_instances = [constructor(context) for constructor in plugins]
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\backend\application.py", line 126, in <listcomp>
plugin_instances = [constructor(context) for constructor in plugins]
File "c:\users\krisb\appdata\local\programs\python\python36\lib\site-packages\tensorboard\plugins\debugger\debugger_plugin_loader.py", line 87, in _ConstructDebuggerPluginWithGrpcPort
err.message +
AttributeError: 'ModuleNotFoundError' object has no attribute 'message'
I have tried to install gRPC via pip install grpc, but I get the following error too:
Command "python setup.py egg_info" failed with error code 1 in
C:\Users\krisb\AppData\Local\Temp\pip-install-h7zjk7ly\grpc\
I'm totally stuck. Does anyone have a solution?
It turns out the solution is to install the grpcio package:
pip install grpcio
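The grpcio distribution is what provides the importable grpc module, which is why installing it clears the ModuleNotFoundError. A minimal sketch for checking this before starting TensorBoard; module_available is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # 'grpc' is the import name; 'grpcio' is the name of the pip package
    print("grpc importable:", module_available("grpc"))
```

If this prints False inside the same environment that runs tensorboard, the debugger plugin will fail in the way shown above.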

TypeError: can't pickle memoryview objects when running basic add.delay(1,2) test

Trying to run the most basic test of add.delay(1,2) using celery 4.1.0 with Python 3.6.4 and getting the following error:
[2018-02-27 13:58:50,194: INFO/MainProcess] Received task: exb.tasks.test_tasks.add[52c3fb33-ce00-4165-ad18-15026eca55e9]
[2018-02-27 13:58:50,194: CRITICAL/MainProcess] Unrecoverable error: SystemError(' returned a result with an error set',)
Traceback (most recent call last):
File "/opt/myapp/lib/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
return on_m(message) if on_m else self.receive(decoded, message)
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 570, in on_task_received
callbacks,
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/strategy.py", line 145, in task_message_handler
handle(req)
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 221, in _process_task_sem
return self._quick_acquire(self._process_task, req)
File "/opt/myapp/lib/python3.6/site-packages/kombu/async/semaphore.py", line 62, in acquire
callback(*partial_args, **partial_kwargs)
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 226, in _process_task
req.execute_using_pool(self.pool)
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/request.py", line 531, in execute_using_pool
correlation_id=task_id,
File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/base.py", line 155, in apply_async
**options)
File "/opt/myapp/lib/python3.6/site-packages/billiard/pool.py", line 1486, in apply_async
self._quick_put((TASK, (result._job, None, func, args, kwds)))
File "/opt/myapp/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 813, in send_job
body = dumps(tup, protocol=protocol)
TypeError: can't pickle memoryview objects
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
File "/opt/myapp/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/opt/myapp/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
File "/opt/myapp/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
cb(*cbargs)
File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
reader(loop)
File "/opt/myapp/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
drain_events(timeout=0)
File "/opt/myapp/lib/python3.6/site-packages/librabbitmq-2.0.0-py3.6-linux-x86_64.egg/librabbitmq/__init__.py", line 227, in drain_events
self._basic_recv(timeout)
SystemError: returned a result with an error set
I cannot find any previous evidence of anyone hitting this error. I noticed on the Celery site that only Python 3.5 is mentioned as supported. Is that the issue, or am I missing something?
Any help would be much appreciated!
UPDATE: Tried with Python 3.5.5 and the problem persists. Tried with Django 4.0.2 and the problem persists.
UPDATE: Uninstalled librabbitmq and the problem stopped. This was seen after migration from Python 2.7.5, Django 1.7.7 to Python 3.6.4, Django 2.0.2.
After uninstalling librabbitmq, the problem was resolved.
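For anyone curious about the inner TypeError, here is a minimal sketch that reproduces it outside Celery. It rests on an assumption not confirmed in this thread: that with librabbitmq installed, the message body reached the worker as a memoryview, which the pool then tried to pickle. The payload value is made up for illustration.

```python
import pickle

# Hypothetical message body, wrapped in a memoryview as a C transport might do
payload = memoryview(b'{"args": [1, 2]}')

try:
    pickle.dumps(payload)
    error = None
except TypeError as exc:
    # The exact wording differs between Python versions
    error = str(exc)

# Copying the buffer into a bytes object makes it picklable
data = pickle.dumps(bytes(payload))

if __name__ == "__main__":
    print("pickling memoryview failed with:", error)
    print("round trip via bytes:", pickle.loads(data))
```

Removing librabbitmq makes kombu fall back to its pure-Python AMQP transport, which is consistent with the error disappearing.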
