Unpickling error when running fairseq on AML using multiple GPUs - azure-machine-learning-service

I am trying to run a fairseq translation task on AML using 4 GPUs (P100) and it fails with the following error:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 174, in all_gather_list
    result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 272, in distributed_main
    main(args, init_distributed=True)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 82, in main
    train(args, trainer, task, epoch_itr)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 123, in train
    log_output = trainer.train_step(samples)
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py", line 305, in train_step
    [logging_outputs, sample_sizes, ooms, self._prev_grad_norm],
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 178, in all_gather_list
    'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.
The same code with the same parameters runs fine on a single local GPU. How do I resolve this issue?
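For context on why a synchronisation problem shows up as an unpickling error: fairseq's all_gather_list (roughly speaking) pickles each worker's stats into a fixed-size byte buffer behind a small size header, gathers the buffers across workers, and unpickles each one. If one worker is out of step, for example because it ran out of memory, the bytes it contributes are stale or partial and pickle.loads fails exactly as above. A rough, single-process sketch of that failure mode (not fairseq's actual code):

import pickle

BUFFER_SIZE = 256

def encode(obj, buffer):
    # pickle the payload behind a crude 2-byte size header
    data = pickle.dumps(obj)
    size = len(data)
    buffer[0] = size // 255
    buffer[1] = size % 255
    buffer[2:size + 2] = data

def decode(buffer):
    size = buffer[0] * 255 + buffer[1]
    return pickle.loads(bytes(buffer[2:size + 2]))

buf = bytearray(BUFFER_SIZE)
encode({"loss": 1.23, "sample_size": 8}, buf)
print(decode(buf))   # round-trips fine when the buffers line up

buf[2] = 0xAD        # simulate a worker whose buffer is out of sync
try:
    decode(buf)
except pickle.UnpicklingError as e:
    print(e)         # invalid load key, '\xad'.

Since the exception text singles out per-worker OOM as the usual cause, the first thing worth checking on the 4-GPU run is the per-GPU memory footprint (for example by lowering --max-tokens or the batch size); a configuration that fits on a single local GPU can still push one of several distributed workers over the edge.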

Related

Kernel restarts when training a sklearn regression model in Sagemaker

I have been trying to train a regression model with big data on AWS SageMaker.
The instance I used on my last try was ml.m5.12xlarge and I was confident it would work this time, but no, I still get the error.
After some minutes into training I get this error in CloudWatch:
[E 07:00:35.308 NotebookApp] KernelRestarter: restart callback <bound method ZMQChannelsHandler.on_kernel_restarted of ZMQChannelsHandler(f92aff37-be6b-48df-a5f5-522bcc6dd072)> failed
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/jupyter_client/restarter.py", line 86, in _fire_callbacks
callback()
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 473, in on_kernel_restarted
self._send_status_message('restarting')
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 469, in _send_status_message
self.write_message(json.dumps(msg, default=date_default))
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/tornado/websocket.py", line 337, in write_message
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
Does anyone know what the error could be?
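The traceback above is just the notebook front end reporting that the kernel died and was restarted; with a large training set loaded inside the notebook, this commonly means the kernel was killed for running out of memory rather than a sklearn error. One way to keep the memory footprint bounded is to stream the data in chunks and fit an incremental regressor; a minimal sketch, where the file name and target column are placeholders:

import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
# read the CSV 100k rows at a time instead of loading it all at once
for chunk in pd.read_csv("big_training_data.csv", chunksize=100000):
    X = chunk.drop(columns=["target"])
    y = chunk["target"]
    model.partial_fit(X, y)   # incremental fit, one chunk at a time

Alternatively, moving the fit out of the notebook into a SageMaker training job keeps the notebook kernel out of the memory path entirely.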

How to solve "ValueError: Cannot create group in read only mode" when loading a YOLO model?

I'm writing a GUI application with wxPython. The application uses YOLO to detect pavement breakage, and I use the YOLO code for both training and detection. Loading the YOLO model is time-consuming, so the GUI freezes while it happens. I therefore want to show a progress bar during loading by running the load in a threading.Thread. Loading the model on the main thread works, but I get an exception when loading it in a new thread.
The error:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 5652, in get_controller
yield g
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 76, in generate
self.yolo_model = load_model(model_path, compile=False)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 221, in _deserialize_model
model_config = f['model_config']
File "C:\Program Files\Python36\lib\site-packages\keras\utils\io_utils.py", line 302, in __getitem__
raise ValueError('Cannot create group in read only mode.')
ValueError: Cannot create group in read only mode.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadDetectionModel.py", line 166, in init
self.__m_oVideoDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myVideoDetector.py", line 130, in init
self.__m_oDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadBreakageDetector.py", line 87, in init
self.__m_oYoloDetector.init()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 46, in init
self.boxes, self.scores, self.classes = self.generate()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 80, in generate
self.yolo_model.load_weights(self.model_path) # make sure model, anchors and classes match
File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 1058, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2470, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1098, in _run
raise RuntimeError('The Session graph is empty. Add operations to the '
RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
Can somebody give me a suggestion?
When using wxPython with threads, you need to make sure that you are using a thread-safe method to communicate back to the GUI. There are 3 thread-safe methods you can use with wxPython:
wx.CallAfter
wx.CallLater
wx.PostEvent
Check out either of the following articles for more information:
https://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
https://wiki.wxpython.org/LongRunningTasks
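A minimal sketch of that pattern, assuming a plain wx.Frame and using a stub in place of the question's YOLO loader: the expensive load runs on a worker thread, and wx.CallAfter marshals the result back to the GUI thread.

import threading
import time
import wx

def load_yolo_model():
    # stand-in for the question's time-consuming YOLO load
    time.sleep(3)
    return object()

class LoaderFrame(wx.Frame):
    def __init__(self):
        super().__init__(None, title="Loading model")
        self.label = wx.StaticText(self, label="Loading...")
        threading.Thread(target=self.load_model, daemon=True).start()

    def load_model(self):
        model = load_yolo_model()            # runs off the GUI thread
        wx.CallAfter(self.on_loaded, model)  # thread-safe hand-off back to the GUI

    def on_loaded(self, model):
        self.label.SetLabel("Model ready")

if __name__ == "__main__":
    app = wx.App()
    LoaderFrame().Show()
    app.MainLoop()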

Having trouble catching an exception in python 3

Working with Python 3.7.3, still figuring out how exception handling works.
I'm writing an XMPP bot using slixmpp. I'm trying to make it so that if it loses the connection to the server, it will try to reconnect. There doesn't seem to be any built-in way to do this in slixmpp, so I'm writing something into my own code to do it.
I've imported slixmpp as xmpp, and I'm using its send_raw() method to test that we're still connected to the server.
while True:
    time.sleep(5)  # Send every 5 seconds just for testing purposes
    xmpp.send_raw('aroo?')
When I sever the connection to the server, this is what it spits out:
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "testcom.py", line 19, in run
eval(self.thing)()
File "testcom.py", line 28, in check_conn
xmpp.send_raw('aroo?')
File "C:\Program Files\Python37\lib\site-packages\slixmpp\xmlstream\xmlstream.py", line 926, in send_raw
raise NotConnectedError
slixmpp.xmlstream.xmlstream.NotConnectedError
I'm assuming that "NotConnectedError" is the exception that I need to catch, so I put the code inside a try block, like so:
try:
    while True:
        time.sleep(5)  # Send every 5 seconds just for testing purposes
        xmpp.send_raw('aroo?')
except NotConnectedError:
    # Do a thing
    pass
And this is what I get:
Traceback (most recent call last):
File "testcom.py", line 28, in check_conn
xmpp.send_raw('aroo?')
File "C:\Program Files\Python37\lib\site-packages\slixmpp\xmlstream\xmlstream.py", line 926, in send_raw
raise NotConnectedError()
slixmpp.xmlstream.xmlstream.NotConnectedError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\threading.py", line 917, in _bootstrap_inner
self.run()
"testcom.py", line 19, in run
eval(self.thing)()
File "testcom.py", line 29, in check_conn
except NotConnectedError:
NameError: name 'NotConnectedError' is not defined
Can anyone tell me what I'm doing wrong here?
Thanks!
I can't see your imports, but make sure you have from slixmpp.xmlstream.xmlstream import NotConnectedError; otherwise there is no definition of NotConnectedError within your application. Since you've imported slixmpp as xmpp, you could also catch xmpp.xmlstream.xmlstream.NotConnectedError instead if you don't want to import it separately.
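A minimal sketch of the fix described above, keeping the question's 5-second poll; xmpp here stands for the question's slixmpp client object:

import time
from slixmpp.xmlstream.xmlstream import NotConnectedError

try:
    while True:
        time.sleep(5)            # send every 5 seconds, as in the question
        xmpp.send_raw('aroo?')   # raises NotConnectedError once the link drops
except NotConnectedError:
    # connection lost: reconnect, log, or back off here
    pass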

Python 3 exception also gives the output of the previous program

I ran into an interesting bug when writing a JSON parser (called /home/myusername/py/json.py) in Python 3.
I raised a basic exception and got unexpected output. When investigating this further, I wrote a new script entirely, given below.
/home/myusername/py/error.py
raise Exception("basic exception")
After running "python3 error.py" I should get a really short error message, but instead I get the console output of the previously run program.
[unexpected debug output of json.py]
[truncated for readability]
[it is extremely long but does not contain further errors]
Traceback (most recent call last):
File "error.py", line 1, in <module>
raise Exception("basic exception")
Exception: basic exception
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
from apport.fileutils import likely_packaged, get_recent_crashes
File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
from apport.report import Report
File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
import apport.fileutils
File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
from apport.packaging_impl import impl as packaging
File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 17, in <module>
import json
File "/home/myusername/py/json.py", line 174, in <module>
rs = parser.Object(testcase)
File "/home/myusername/py/json.py", line 104, in Object
raise Exception(self.Array(source, "crashing object scanner"))
Exception: None
Original exception was:
Traceback (most recent call last):
File "error.py", line 1, in <module>
raise Exception("basic exception")
Exception: basic exception
I don't know why I get such a long message, nor do I know why I get debug output from a script I never called. I would like an explanation. I am running Ubuntu, and I have not yet found related bug reports on the internet.
It appears that Ubuntu's crash-reporting hook (apport) imports the json module while handling the exception, so when my error.py raises an exception it loads my json.py instead of the built-in module, and then my json script throws an exception of its own.
The solution is to rename my json.py.
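A quick way to confirm that kind of shadowing before renaming anything is to print the path each import actually resolves to, run from the same directory as the local json.py:

import json
import sys

print(json.__file__)   # stdlib: .../lib/python3.x/json/__init__.py
                       # shadowed: /home/myusername/py/json.py
print(sys.path[0])     # the script's own directory, which is searched first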

Silencing broken pipe errors (`[Errno 32] Broken pipe`) in python with WSGI+Pyramid

I have a rather simple, naive Python/WSGI/Pyramid web-server.
It's run using wsgiref.simple_server.make_server(), serving an app built using pyramid.config.Configurator().make_wsgi_app(). This server works fine.
However, the application it's serving has a lot of JavaScript image mouseover popups. If you run the mouse across the page, it can generate 20+ image requests. This is fine as well (it's an internal thing, not a lot of users).
However, doing so causes the server to emit something like half a dozen error tracebacks:
10.1.1.4 - - [25/Apr/2014 01:56:42] "GET /*SNIP* 500 59
----------------------------------------
Exception happened during processing of request from ('10.1.1.4', 18338)
Traceback (most recent call last):
File "/usr/lib/python3.4/wsgiref/handlers.py", line 138, in run
self.finish_response()
File "/usr/lib/python3.4/wsgiref/handlers.py", line 180, in finish_response
self.write(data)
File "/usr/lib/python3.4/wsgiref/handlers.py", line 274, in write
self.send_headers()
File "/usr/lib/python3.4/wsgiref/handlers.py", line 333, in send_headers
self._write(bytes(self.headers))
File "/usr/lib/python3.4/wsgiref/handlers.py", line 453, in _write
self.stdout.write(data)
File "/usr/lib/python3.4/socket.py", line 391, in write
return self._sock.send(b)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.4/wsgiref/handlers.py", line 141, in run
self.handle_error()
File "/usr/lib/python3.4/wsgiref/handlers.py", line 368, in handle_error
self.finish_response()
File "/usr/lib/python3.4/wsgiref/handlers.py", line 180, in finish_response
self.write(data)
File "/usr/lib/python3.4/wsgiref/handlers.py", line 274, in write
self.send_headers()
File "/usr/lib/python3.4/wsgiref/handlers.py", line 331, in send_headers
if not self.origin_server or self.client_is_modern():
File "/usr/lib/python3.4/wsgiref/handlers.py", line 344, in client_is_modern
return self.environ['SERVER_PROTOCOL'].upper() != 'HTTP/0.9'
TypeError: 'NoneType' object is not subscriptable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.4/socketserver.py", line 306, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.4/socketserver.py", line 332, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.4/socketserver.py", line 345, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.4/socketserver.py", line 666, in __init__
self.handle()
File "/usr/lib/python3.4/wsgiref/simple_server.py", line 126, in handle
handler.run(self.server.get_app())
File "/usr/lib/python3.4/wsgiref/handlers.py", line 144, in run
self.close()
File "/usr/lib/python3.4/wsgiref/simple_server.py", line 35, in close
self.status.split(' ',1)[0], self.bytes_sent
AttributeError: 'NoneType' object has no attribute 'split'
I understand why I'm getting broken pipe errors (the request for the image is canceled before the image has fully transfered, because the mouseover popup has closed), and it seems harmless.
However, I have no idea how to silence this traceback. There are thousands of them in my logs, and it makes debugging actual errors a nightmare. I don't care that I'm getting broken pipe errors, how can I catch them and swallow them silently?
It seems like wsgiref.simple_server.make_server() installs an internal handler that catches BrokenPipeError: [Errno 32] Broken pipe, prints the traceback, and then swallows the error. I've tried wrapping the run_server() call in a try-except clause, and it doesn't have any effect.
I wound up just switching to the CherryPy WSGI server. It doesn't suffer from the broken-pipe log issue, and is probably far more robust as well.
It also uses a thread pool, so it's more performant too (multiple requests aren't blocking).
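For reference, a minimal sketch of that switch, assuming the standalone cheroot package is installed (this is how the CherryPy WSGI server is packaged nowadays; at the time of the question it lived in cherrypy.wsgiserver). The Pyramid app construction is the same make_wsgi_app() call from the question, with routes and views omitted:

from cheroot import wsgi
from pyramid.config import Configurator

config = Configurator()
# ... add routes and views here, as in the existing application ...
app = config.make_wsgi_app()

# threaded WSGI server in place of wsgiref.simple_server.make_server()
server = wsgi.Server(('0.0.0.0', 6543), app, numthreads=10)
try:
    server.start()
except KeyboardInterrupt:
    server.stop()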
I didn't find a straightforward way of achieving this; however, you can always do some monkey patching:
from wsgiref.handlers import BaseHandler
import sys

def ignore_broken_pipes(self):
    if sys.exc_info()[0] != BrokenPipeError:
        BaseHandler.__handle_error_original_(self)

BaseHandler.__handle_error_original_ = BaseHandler.handle_error
BaseHandler.handle_error = ignore_broken_pipes
You will not see these annoyances anymore after running this code somewhere near the start of your program.
To me, it looks like a bug somewhere in the wsgiref implementation of BaseHandler:
def handle_error(self):
    """Log current error, and send error output to client if possible"""
    self.log_exception(sys.exc_info())
    if not self.headers_sent:
        self.result = self.error_output(self.environ, self.start_response)
        self.finish_response()
    # XXX else: attempt advanced recovery techniques for HTML or text?
If BrokenPipeError is handled by this method, finish_response crashes. Why would we ever want to finish the response if the pipe is broken? Where would the data even be sent?
