Pytorch 1.8 hangs by chance when calling loss.backward()

While I was training an LSTM in PyTorch, the training process hung at random and could not be terminated with Ctrl+C. I then used faulthandler to locate the problem. The training parameters, environment, and faulthandler traceback output are listed below. It seems to be a problem with the C++ backend, CUDA, or even my graphics card; I do not know which.
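For readers who want to capture a similar dump, here is a minimal sketch of a faulthandler watchdog built on the standard library's faulthandler.dump_traceback_later. The helper name run_with_watchdog, the timeout value, and slow_step (a stand-in for a training step such as loss.backward()) are illustrative assumptions, not part of the original script.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(step, timeout_s, log_file):
    """Run step(); if it is still running after timeout_s seconds,
    faulthandler writes the stacks of all threads to log_file."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
    try:
        return step()
    finally:
        # Disarm the watchdog once the step has returned (or raised).
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Hypothetical stand-in for a step that hangs, e.g. loss.backward().
    time.sleep(0.5)
    return "done"


# faulthandler needs a real file descriptor, so use a temporary file.
with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_watchdog(slow_step, timeout_s=0.1, log_file=log)
    log.seek(0)
    dump = log.read()

print(dump)
```

The dump is written by a watchdog thread at the C level, so it can still appear when the main thread is stuck inside native code and does not react to Ctrl+C.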
Trace output
batch_size = 64/32/16, num_workers = 2, CUDA 11.1, PyTorch 1.8.0, cuDNN 8.0.5/8.0.1:
Thread 0x000036e8 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x00004644 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x00000efc (most recent call first):
Thread 0x00000138 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 558 in wait
File "C:\Users\myUserName\anaconda3\lib\site-packages\tqdm\_monitor.py", line 59 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x00001644 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
File "C:\Users\myUserName\anaconda3\lib\queue.py", line 179 in get
File "C:\Users\myUserName\anaconda3\lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 232 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x0000443c (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145 in backward
File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\tensor.py", line 245 in backward
File "train.py", line 129 in main
File "train.py", line 246 in <module>
batch_size = 64, num_workers = 0, CUDA 11.1, PyTorch 1.8.0, cuDNN 8.0.5:
Thread 0x00003650 (most recent call first):
Thread 0x000043b4 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 558 in wait
File "C:\Users\myUserName\anaconda3\lib\site-packages\tqdm\_monitor.py", line 59 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x000017c4 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 306 in wait
File "C:\Users\myUserName\anaconda3\lib\queue.py", line 179 in get
File "C:\Users\myUserName\anaconda3\lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 232 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x00001458 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145 in backward
File "C:\Users\myUserName\anaconda3\lib\site-packages\torch\tensor.py", line 245 in backward
File "train.py", line 129 in main
File "train.py", line 246 in <module>
With num_workers = 0, the output is the same except that the two threads below, which I think belong to the DataLoader, are missing.
Thread 0x000036e8 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Thread 0x00004644 (most recent call first):
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 302 in wait
File "C:\Users\myUserName\anaconda3\lib\multiprocessing\queues.py", line 227 in _feed
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 870 in run
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 932 in _bootstrap_inner
File "C:\Users\myUserName\anaconda3\lib\threading.py", line 890 in _bootstrap
Resource usage is moderate: around 20% CPU, 16 of 32 GB of memory, and 3.8 of 8 GB of graphics memory. GPU utilization is low when training RNNs. The script was run on Windows 10. The graphics card is an RTX 3070 with driver version 461.09.
More Information
When I started debugging, I was using mismatched versions: CUDA 11.2 with PyTorch 1.7.1 and cuDNN 8.1.0. At that time I ran into CUDA exceptions from time to time, with messages such as "kernel launch failed" or "failed to synchronize". After I changed my CUDA version, training simply hangs without any error.

Related

python, websocket exception with run_forever()

I have a piece of code that uses a websocket and run_forever().
The program runs fine and then suddenly crashes with the following output.
I have no idea how to trace this error, and it just keeps happening at will :)
Requesting any help on how to debug this.
Exception in thread WebSocketClient:
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\cgs\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\ws4py\websocket.py", line 531, in run
self.terminate()
File "C:\Users\cgs\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\ws4py\websocket.py", line 431, in terminate
self.closed(1006, "Going away")
TypeError: WsClient.closed() takes 1 positional argument but 3 were given
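The TypeError in the last line of the traceback says that terminate() called closed() with a code and a reason, while the WsClient override apparently accepts only self. A minimal sketch of that mismatch, using hypothetical stand-in classes (BaseSocket, BrokenClient and FixedClient are assumptions for illustration and do not import ws4py):

```python
class BaseSocket:
    """Stand-in for the library base class (this is not ws4py itself)."""

    def terminate(self):
        # Mirrors the call shown in the traceback: two arguments besides self.
        self.closed(1006, "Going away")

    def closed(self, code, reason=None):
        pass


class BrokenClient(BaseSocket):
    def closed(self):  # drops code and reason, so terminate() fails
        print("connection closed")


class FixedClient(BaseSocket):
    def closed(self, code, reason=None):  # same parameters as the base class
        self.last_close = (code, reason)


try:
    BrokenClient().terminate()
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

fixed = FixedClient()
fixed.terminate()
print(fixed.last_close)
```

If the real WsClient overrides closed() without those parameters, giving it the same parameters as the base class method should make this particular exception go away; the reason the connection was closed (code 1006) would still need separate investigation.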

Why Error urllib.error.ContentTooShortError:

I get this error while building the program I wrote with Kivy.
I have to use a VPN.
Downloading https://dl.google.com/android/repository/android-ndk-r19c-linux-x86_64.zip
Traceback (most recent call last):
File "/home/mm/kivyenv/bin/buildozer", line 8, in <module>
sys.exit(main())
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/scripts/client.py", line 13, in main
Buildozer().run_command(sys.argv[1:])
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/__init__.py", line 1047, in run_command
self.target.run_commands(args)
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/target.py", line 92, in run_commands
func(args)
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/target.py", line 102, in cmd_debug
self.buildozer.prepare_for_build()
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/__init__.py", line 169, in prepare_for_build
self.target.install_platform()
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/targets/android.py", line 665, in install_platform
self._install_android_ndk()
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/targets/android.py", line 455, in _install_android_ndk
self.buildozer.download(url,
File "/home/mm/kivyenv/lib/python3.8/site-packages/buildozer/__init__.py", line 677, in download
urlretrieve(url, filename, report_hook)
File "/usr/lib/python3.8/urllib/request.py", line 1866, in retrieve
raise ContentTooShortError(
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 261037633 out of 823376982 bytes>

From the urllib documentation:
exception urllib.error.ContentTooShortError(msg, content)
This exception is raised when the urlretrieve() function detects that the amount of the downloaded data is less than the expected amount (given by the Content-Length header). The content attribute stores the downloaded (and supposedly truncated) data.
In practice, it's likely that the VPN terminated the socket. You may need to implement retry/resume capability in your program.
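The retry/resume idea can be sketched as follows, assuming the server supports HTTP Range requests. The function name download_with_resume and its parameters are hypothetical; buildozer itself calls urlretrieve directly, so this only illustrates the technique and is not a drop-in patch. It also does not handle the case where the file on disk is already complete.

```python
import http.client
import os
import urllib.request


def download_with_resume(url, dest, max_attempts=5, chunk_size=64 * 1024,
                         opener=urllib.request.urlopen):
    """Download url to dest, asking the server to continue from the bytes
    already on disk whenever an earlier attempt was cut short."""
    for _ in range(max_attempts):
        have = os.path.getsize(dest) if os.path.exists(dest) else 0
        request = urllib.request.Request(url)
        if have:
            request.add_header("Range", f"bytes={have}-")
        try:
            with opener(request) as response:
                # 206 means the server honoured the Range header; any other
                # status is a full response, so the file is rewritten.
                resumed = have > 0 and response.status == 206
                expected = response.headers.get("Content-Length")
                received = 0
                with open(dest, "ab" if resumed else "wb") as out:
                    while True:
                        chunk = response.read(chunk_size)
                        if not chunk:
                            break
                        out.write(chunk)
                        received += len(chunk)
            # Without a Content-Length header, completeness cannot be checked.
            if expected is None or received == int(expected):
                return dest
        except (OSError, http.client.HTTPException):
            pass  # connection dropped; try again from the current offset
    raise RuntimeError(f"download incomplete after {max_attempts} attempts")
```

The opener parameter exists only so that the function can be exercised without a network connection; in normal use the default urllib.request.urlopen is enough.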

How to solve "ValueError: Cannot create group in read only mode" during loading yolo model?

I'm writing a GUI application with wxPython. The application uses YOLO to detect pavement breakage. I use the YOLO code to train and detect. Loading the YOLO model takes so long that the GUI freezes, so I want to show a progress bar while the model is loaded in a threading.Thread. Loading the model in the main thread works, but I get an exception when loading it in a new thread.
The error:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 5652, in get_controller
yield g
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 76, in generate
self.yolo_model = load_model(model_path, compile=False)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 221, in _deserialize_model
model_config = f['model_config']
File "C:\Program Files\Python36\lib\site-packages\keras\utils\io_utils.py", line 302, in __getitem__
raise ValueError('Cannot create group in read only mode.')
ValueError: Cannot create group in read only mode.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadDetectionModel.py", line 166, in init
self.__m_oVideoDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myVideoDetector.py", line 130, in init
self.__m_oDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadBreakageDetector.py", line 87, in init
self.__m_oYoloDetector.init()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 46, in init
self.boxes, self.scores, self.classes = self.generate()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 80, in generate
self.yolo_model.load_weights(self.model_path) # make sure model, anchors and classes match
File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 1058, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2470, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1098, in _run
raise RuntimeError('The Session graph is empty. Add operations to the '
RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
Could somebody give me a suggestion?

When using wxPython with threads, you need to make sure that you are using a thread-safe method to communicate back to the GUI. There are 3 thread-safe methods you can use with wxPython:
wx.CallAfter
wx.CallLater
wx.PostEvent
Check out either of the following articles for more information:
https://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
https://wiki.wxpython.org/LongRunningTasks

How to handle a segmentation fault occurred in python

I have written a piece of code that checks whether there is a new entry in the database. If a new entry is found, it fetches the data and the client tries to send it to the server. If data is found on every execution of the threading.Timer cycle, it tries to send it to the server each time.
The issue is that if the server is unreachable, it prints that the server is not alive, etc. If this cycle goes on for a few minutes, the script just crashes and shows Segmentation fault. I want to handle this exception and run some processing when a segmentation fault occurs.
working environment:
linux
python3
sql-server
EDIT
This is what shows after script gets crashed...
Fatal Python error: Segmentation fault
Thread 0xb3dff460 (most recent call first):
File "/usr/lib/python3.4/threading.py", line 294 in wait
File "/usr/lib/python3.4/threading.py", line 553 in wait
File "/usr/lib/python3.4/threading.py", line 1184 in run
File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap
Current thread 0xb33ff460 (most recent call first):
File "abc.py", line 343 in send_to_server
File "abc.py", line 244 in sql_connect1
File "/usr/lib/python3.4/threading.py", line 1186 in run
File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap
Thread 0xb45ff460 (most recent call first):
File "abc.py", line 408 in send_to_server
File "abc.py", line 244 in sql_connect1
File "/usr/lib/python3.4/threading.py", line 1186 in run
File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap
Thread 0xb4fff460 (most recent call first):
File "abc.py", line 343 in send_to_server
File "abc.py", line 244 in sql_connect1
File "/usr/lib/python3.4/threading.py", line 1186 in run
File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap
Thread 0xb6f39300 (most recent call first):
File "/usr/lib/python3.4/threading.py", line 1076 in _wait_for_tstate_lock
File "/usr/lib/python3.4/threading.py", line 1060 in join
File "/usr/lib/python3.4/threading.py", line 1294 in _shutdown
Segmentation fault

Well, I didn't know a segmentation fault was even possible here. A segmentation fault means that somewhere the program accesses memory that is not mapped.
You should note the line where the problem is reported to have originally happened and check the surrounding lines.

It's hard to say, but:
You are using Python 3.4, but we don't know the exact version, so upgrading to the latest 3.4 release (or better, to 3.6 or 3.7) could help.
We also don't know which version of pyodbc you are using. Maybe this bug has already been resolved, but several segfaults have been reported on GitHub.
We don't see your code. Maybe you are using the driver incorrectly (can't see why, but...).
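A segmentation fault cannot be caught with try/except inside the crashing process, but a parent process can notice it and react. Here is a minimal sketch for Linux, where a child killed by a signal reports a negative return code. The helper name run_supervised and the crashing one-liner are hypothetical and only simulate the crash; in practice the command would be something like the script from the question.

```python
import signal
import subprocess
import sys


def run_supervised(argv):
    """Run a command and report whether it was killed by SIGSEGV."""
    proc = subprocess.run(argv)
    # On POSIX, a negative return code -N means "terminated by signal N".
    crashed = proc.returncode == -signal.SIGSEGV
    return proc.returncode, crashed


# Hypothetical child that simulates a crash by sending itself SIGSEGV.
CRASHING_CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

returncode, crashed = run_supervised([sys.executable, "-c", CRASHING_CHILD])
if crashed:
    print("worker segfaulted; run recovery or restart logic here")
```

This keeps the recovery logic in a process that is still healthy, which is why it is preferable to installing a SIGSEGV handler in the process whose memory may already be corrupted.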

FileNotFoundError when using subprocess

from subprocess import check_output
output=check_output(["ls", "F:\myData\input"]).decode("utf8")
print(output)
I'm trying to run this code to view the files in this directory and save the results as output but this line is throwing an error upon execution.
Can anyone help me solve this and understand the issue?
Traceback (most recent call last):
File "F:\myData\input\Analysis.py", line 21, in <module>
output=check_output(["ls", "F:\myData\input\Analysis.py]).decode("utf8")
File "C:\Users\Abhinav\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 336, in check_output
**kwargs).stdout
File "C:\Users\Abhinav\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 403, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Users\Abhinav\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "C:\Users\Abhinav\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
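A likely cause, stated as an assumption since the thread has no accepted answer: the traceback shows Windows paths, and ls is normally not an executable on Windows, so Popen cannot find the program to run. A minimal sketch that lists a directory without starting any external program is shown below; list_directory is a hypothetical helper, and the temporary directory stands in for the folder from the question.

```python
import os
import tempfile


def list_directory(path):
    """Return the directory's entries as newline-separated text,
    similar to the output of a plain ls."""
    return "\n".join(sorted(os.listdir(path)))


# Demo on a throwaway directory; in the question the path would be
# the input folder, written as a raw string such as r"F:\myData\input".
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.txt", "a.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    output = list_directory(demo_dir)

print(output)
```

Using os.listdir avoids the dependency on a shell command altogether, so the same code runs on Windows and Linux.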
