Kernel restarts when training a sklearn regression model in Sagemaker - scikit-learn

I have been trying to train a regression model with big data on AWS SageMaker.
The instance I used on my last try was ml.m5.12xlarge, and I was confident it would work this time, but it did not; I still get the error.
A few minutes into training I get this error in CloudWatch:
[E 07:00:35.308 NotebookApp] KernelRestarter: restart callback <bound method ZMQChannelsHandler.on_kernel_restarted of ZMQChannelsHandler(f92aff37-be6b-48df-a5f5-522bcc6dd072)> failed
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/jupyter_client/restarter.py", line 86, in _fire_callbacks
callback()
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 473, in on_kernel_restarted
self._send_status_message('restarting')
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 469, in _send_status_message
self.write_message(json.dumps(msg, default=date_default))
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/tornado/websocket.py", line 337, in write_message
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
Does anyone know what might be causing this error?
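For context, a minimal sketch of the kind of notebook cell being described; the S3 path, feature columns, and estimator are illustrative assumptions, not the actual code:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Reading a very large CSV loads it entirely into the notebook's RAM; with big
# data this is often the step that exhausts memory and triggers the kernel
# restart reported by KernelRestarter in CloudWatch.
df = pd.read_csv("s3://my-bucket/training-data.csv")  # hypothetical S3 path (requires s3fs)

X = df.drop(columns=["target"])  # hypothetical target column
y = df["target"]

model = RandomForestRegressor(n_jobs=-1)
model.fit(X, y)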

Related

Error when running meta-analysis of LCIA methods notebook

I am learning Brightway2 and have been working through the notebooks from the brightway2 GitHub repo. So far every notebook I have tried has run smoothly, but I am stuck on the one concerning meta-analysis of LCA methods, specifically when running cell [8].
This cell runs 50,000 LCA calculations and times them. Here is the code:
from time import time
start = time()
lca_scores, methods, activities = get_lots_of_lca_scores()
print(time() - start)
This enters a never-ending loop, with the following message repeating:
Traceback (most recent call last):
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/pool.py", line 114, in worker
task = get()
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/queues.py", line 368, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'many_activities_one_method' on <module '__main__' (built-in)>
I tried looking at the called functions def many_activities_one_method(activities, method) and def get_lots_of_lca_scores(), but I had no luck, and when I make changes I have the feeling I only make things worse.
Here is my question: has anyone run this notebook successfully? What could I be missing?
Note: I have done the prerequisite notebook Getting started with Brightway2.
Thank you!
The notebook has been updated to remove this error.
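For anyone hitting the same AttributeError in their own code: it usually means the spawned worker processes cannot import the function the pool is asked to run, because the function only exists in the notebook's __main__. A minimal sketch of the usual workaround, with illustrative names rather than the notebook's actual code:

# worker.py -- keep the function the pool runs in an importable module,
# not only in the notebook's __main__, so child processes can find it.
def many_activities_one_method(activities, method):  # illustrative signature
    return (method, len(activities))                  # stand-in for the real LCA work

# notebook cell / main script
import multiprocessing as mp
from worker import many_activities_one_method

if __name__ == "__main__":
    jobs = [(["activity-1", "activity-2"], "method-A"), (["activity-3"], "method-B")]
    with mp.Pool(processes=2) as pool:
        results = pool.starmap(many_activities_one_method, jobs)
    print(results)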

Python Firestore insert return error 503 DNS resolution failed

I have a problem when executing my Python script from crontab; the script performs an insert operation into the Firestore database.
db.collection(u'ab').document(str(row["Name"])).collection(str(row["id"])).document(str(row2["id"])).set(self.packStructure(row2))
When I run it normally with the python3 script.py command it works, but when I execute it from crontab it returns the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/axatel/angel_bridge/esportazione_firebase/main.py", line 23, in <module>
dato.getDati(dato, db, cursor, cursor2, fdb, select, anagrafica)
File "/home/axatel/angel_bridge/esportazione_firebase/dati.py", line 19, in getDati
db.collection(u'ab').document(str(row["Name"])).collection(str(row["id"])).document(str(row2["id"])).set(self.packStructure(row2))
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/document.py", line 234, in set
write_results = batch.commit()
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/batch.py", line 147, in commit
metadata=self._client._rpc_metadata,
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/gapic/firestore_client.py", line 1121, in commit
request, retry=retry, timeout=timeout, metadata=metadata
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
return wrapped_func(*args, **kwargs)
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 DNS resolution failed for service: firestore.googleapis.com:443
I really don't understand what the problem is, because the connection to the database works every time the script is started in both ways.
Is there a fix for this kind of issue?
I found something that might be helpful. There is a nice troubleshooting guide, and one part of it seems related:

If your command works by invoking a runtime like python some-command.py, perform a few checks to determine that the runtime version and environment are correct. Each language runtime has quirks that can cause unexpected behavior under crontab.
For python you might find that your web app is using a virtual environment you need to invoke in your crontab.

I haven't seen such an error when running the Firestore API, but this seems to match your issue.
I found the solution.
The problem occurred because the sleep() timeout value was too low, so the database connection function started too early during the machine's boot phase. Increasing this value to 45 or 60 seconds fixed the problem.
import sys
import time

import firebase_admin
from firebase_admin import credentials, firestore

def firebaseConnection():
    # Initialise the Firebase app and return a Firestore client.
    cred = credentials.Certificate('/database/axatel.json')
    firebase_admin.initialize_app(cred)
    fdb = firestore.client()
    if fdb:
        return fdb
    print("Error")
    sys.exit()

# time.sleep(10)  # old version: ran too early in the boot phase
time.sleep(60)    # working version: wait for the network before connecting
fdb = firebaseConnection()

How to solve "ValueError: Cannot create group in read only mode" during loading yolo model?

I'm writing a GUI application with wxPython. The application uses YOLO to detect pavement breakage, and I use the YOLO code for both training and detection. Loading the YOLO model is too time-consuming, so the GUI freezes. I therefore want to show a progress bar while the model loads, using threading.Thread. I can load the YOLO model from the main thread, but I get an exception when loading it from a new thread.
The error:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 5652, in get_controller
yield g
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 76, in generate
self.yolo_model = load_model(model_path, compile=False)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 221, in _deserialize_model
model_config = f['model_config']
File "C:\Program Files\Python36\lib\site-packages\keras\utils\io_utils.py", line 302, in __getitem__
raise ValueError('Cannot create group in read only mode.')
ValueError: Cannot create group in read only mode.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadDetectionModel.py", line 166, in init
self.__m_oVideoDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myVideoDetector.py", line 130, in init
self.__m_oDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadBreakageDetector.py", line 87, in init
self.__m_oYoloDetector.init()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 46, in init
self.boxes, self.scores, self.classes = self.generate()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 80, in generate
self.yolo_model.load_weights(self.model_path) # make sure model, anchors and classes match
File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 1058, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2470, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1098, in _run
raise RuntimeError('The Session graph is empty. Add operations to the '
RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
Can somebody give me a suggestion?
When using wxPython with threads, you need to make sure that you are using a thread-safe method to communicate back to the GUI. There are 3 thread-safe methods you can use with wxPython:
wx.CallAfter
wx.CallLater
wx.PostEvent
Check out either of the following articles for more information
https://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
https://wiki.wxpython.org/LongRunningTasks
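As an illustration, here is a minimal sketch of that pattern using wx.CallAfter; the frame layout and the slow_load placeholder standing in for the real YOLO load are assumptions, not the asker's actual code:

import threading
import time
import wx

class MainFrame(wx.Frame):
    def __init__(self):
        super().__init__(None, title="Detector")
        self.status = wx.StaticText(self, label="Loading model...")
        self.Show()
        # Do the slow load on a worker thread so the GUI stays responsive.
        threading.Thread(target=self.load_model, daemon=True).start()

    def load_model(self):
        model = self.slow_load()  # placeholder for the real YOLO model load
        wx.CallAfter(self.on_model_loaded, model)  # thread-safe hand-off to the GUI thread

    def slow_load(self):
        time.sleep(2)  # stands in for load_model(model_path, compile=False)
        return "yolo model"

    def on_model_loaded(self, model):
        self.status.SetLabel("Loaded: %s" % model)

if __name__ == "__main__":
    app = wx.App()
    MainFrame()
    app.MainLoop()

wx.CallLater and wx.PostEvent follow the same idea when you need a delay or a custom event instead of a direct callback.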

Unpickling error when running fairseq on AML using multiple GPUs

I am trying to run a fairseq translation task on AML using 4 GPUs (P100), and it fails with the following error:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 174, in all_gather_list
    result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 272, in distributed_main
    main(args, init_distributed=True)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 82, in main
    train(args, trainer, task, epoch_itr)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 123, in train
    log_output = trainer.train_step(samples)
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py", line 305, in train_step
    [logging_outputs, sample_sizes, ooms, self._prev_grad_norm],
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 178, in all_gather_list
    'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.
The same code with the same parameters runs fine on a single local GPU. How do I resolve this issue?

Gensim multicore LDA overflow error

I'm having an issue running multicore LDA in gensim (generating 2000 topics, 1 pass, using 15 workers). I get the error below; I initially thought it might not have to do with saving the model, but I'm not sure, and the code keeps running (at least the process hasn't quit).
Anyone know what I can do to prevent this error from occurring?
python3 run.py --method MultiLDA --ldaparams 2000 1 --workers 15 --path $DATA/gender_spectrum/
Traceback (most recent call last):
File "/usr/lib64/python3.5/multiprocessing/queues.py", line 241, in _feed
obj = ForkingPickler.dumps(obj)
File "/usr/lib64/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
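For reference, a minimal sketch of how a gensim multicore LDA run with these parameters is typically set up; the toy corpus and dictionary below are assumptions, not the data behind run.py:

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Toy corpus standing in for the real data behind run.py.
docs = [["human", "computer", "interface"], ["graph", "trees", "survey"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Mirrors the reported parameters: 2000 topics, 1 pass, 15 workers.
# Workers receive their jobs and state through multiprocessing queues, which is
# where the 4 GiB pickle limit in the traceback can be hit once the serialized
# objects get very large.
lda = LdaMulticore(corpus=corpus, id2word=dictionary,
                   num_topics=2000, passes=1, workers=15)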
