Gensim multicore LDA overflow error - python-3.x

I'm having an issue running multicore LDA in gensim (generating 2000 topics in 1 pass with 15 workers). I get the error below. I initially suspected it wasn't related to saving the model, and judging by the traceback the code still keeps running (at least the process hasn't quit).
Anyone know what I can do to prevent this error from occurring?
python3 run.py --method MultiLDA --ldaparams 2000 1 --workers 15 --path $DATA/gender_spectrum/
Traceback (most recent call last):
File "/usr/lib64/python3.5/multiprocessing/queues.py", line 241, in _feed
obj = ForkingPickler.dumps(obj)
File "/usr/lib64/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
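For context on the error itself: multiprocessing serializes queue items with `ForkingPickler`, and before Python 3.8 it used pickle protocol 3 or lower, whose `bytes` framing stores the length in a 4-byte field, hence the 4 GiB ceiling. Protocol 4 (PEP 3154) adds 8-byte length framing and lifts that limit, and Python 3.8+ made it the default. A minimal sketch of the protocol difference (tiny payload standing in for the multi-GiB model state):

```python
import pickle

# Protocols <= 3 frame bytes objects with a 4-byte length, so anything
# over 4 GiB raises OverflowError on serialization. Protocol 4 adds an
# 8-byte length opcode and is the multiprocessing default from 3.8 on.
payload = b"x" * 16  # tiny stand-in for the > 4 GiB worker payload

p3 = pickle.dumps(payload, protocol=3)
p4 = pickle.dumps(payload, protocol=4)

assert pickle.loads(p3) == payload
assert pickle.loads(p4) == payload
```

So on Python 3.5 the practical options are upgrading to a Python where multiprocessing uses protocol 4, or shrinking what each worker has to receive per batch.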

Related

Error when running meta-analysis of LCIA methods notebook

I am learning Brightway2 and working through the notebooks from the brightway2 GitHub repo. So far every notebook has run smoothly, but I am stuck on the one concerning Meta-analysis of LCA methods, specifically when running cell [8].
This cell computes 50,000 LCA calculations and times them. Here is the code:
from time import time
start = time()
lca_scores, methods, activities = get_lots_of_lca_scores()
print(time() - start)
This enters into a never-ending loop, with the following message repeating:
Traceback (most recent call last):
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/pool.py", line 114, in worker
task = get()
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/queues.py", line 368, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'many_activities_one_method' on <module '__main__' (built-in)>
I tried looking at the called functions def many_activities_one_method(activities, method) and def get_lots_of_lca_scores(), but I had no luck, and when I make changes I have the feeling I make things worse.
Here is my question: has anyone run this notebook successfully? What could I be missing?
*Note: I have completed the prerequisite notebook Getting started with Brightway2
Thank you!
The notebook has been updated to remove this error.
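For anyone hitting the same AttributeError elsewhere: it typically means the multiprocessing start method is "spawn" (the macOS default since Python 3.8), under which worker functions must be importable by name, not defined interactively inside a notebook's `__main__`. A minimal sketch of the pattern that works (`square` is a made-up stand-in for `many_activities_one_method`):

```python
import multiprocessing as mp

def square(x):
    # Top-level function: under the "spawn" start method each worker
    # re-imports the defining module and looks this function up by name,
    # which is exactly the lookup that fails for functions living only
    # in a notebook's __main__.
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

In a notebook, the equivalent fix is moving the worker function into a `.py` module and importing it.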

Kernel restarts when training a sklearn regression model in Sagemaker

I have been trying to train a regression model on big data in AWS SageMaker.
The instance I used on my last try was ml.m5.12xlarge, and I was confident it would work this time, but no, I still get the error.
A few minutes into training I get this error in CloudWatch:
[E 07:00:35.308 NotebookApp] KernelRestarter: restart callback <bound method ZMQChannelsHandler.on_kernel_restarted of ZMQChannelsHandler(f92aff37-be6b-48df-a5f5-522bcc6dd072)> failed
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/jupyter_client/restarter.py", line 86, in _fire_callbacks
callback()
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 473, in on_kernel_restarted
self._send_status_message('restarting')
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 469, in _send_status_message
self.write_message(json.dumps(msg, default=date_default))
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/tornado/websocket.py", line 337, in write_message
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
Does anyone know what the cause of this error could be?
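The WebSocket traceback itself is only the notebook server reporting that the kernel died; with a large dataset the usual culprit is the kernel being killed for running out of memory. One common mitigation (a sketch with synthetic data, not the asker's code) is to train incrementally with an estimator that supports `partial_fit`, so the full dataset never has to sit in memory at once:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([1.0, 2.0, 0.5, -1.0, 3.0])

model = SGDRegressor(random_state=0)
for _ in range(50):
    # Each iteration stands in for one chunk streamed from disk or S3,
    # so peak memory stays at one chunk instead of the whole dataset.
    X = rng.normal(size=(1000, 5))
    y = X @ true_coef + rng.normal(scale=0.1, size=1000)
    model.partial_fit(X, y)

print(model.coef_)
```

If the model genuinely needs the whole dataset in memory, moving the training out of the notebook kernel into a SageMaker training job also avoids the notebook instance's memory limits.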

Airflow scheduler starts up with exception when parallelism is set to a large number

I am new to Airflow and trying to use it to build a data pipeline, but it keeps raising exceptions. My airflow.cfg looks like this:
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
sql_alchemy_pool_size = 5
parallelism = 96
dag_concurrency = 96
worker_concurrency = 96
max_threads = 96
broker_url = postgresql+psycopg2://airflow:airflow@localhost/airflow
result_backend = postgresql+psycopg2://airflow:airflow@localhost/airflow
When I start airflow webserver -p 8080 in one terminal and then airflow scheduler in another, the scheduler run raises the exception below. It fails when I set parallelism above a certain amount and works fine otherwise; the threshold may be machine-specific, but at least we know the failure is caused by the parallelism setting. I have tried running 1000 Python processes on my computer and that worked fine, and I have configured Postgres to allow a maximum of 500 database connections, but it is still giving me the errors.
[2019-11-20 12:15:00,820] {dag_processing.py:556} INFO - Launched DagFileProcessorManager with pid: 85050
Process QueuedLocalWorker-18:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 811, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/edward/.local/share/virtualenvs/avat-utils-JpGzQGRW/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 111, in run
key, command = self.task_queue.get()
File "<string>", line 2, in get
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 815, in _callmethod
self._connect()
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 802, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
Thanks
Update: I tried running it in PyCharm and it worked fine there, but in the terminal it sometimes fails and sometimes doesn't.
I had the same issue. It turned out I had set max_threads = 10 in airflow.cfg in combination with LocalExecutor. Switching to max_threads = 2 solved the issue.
I found out a few days ago that Airflow actually starts all of the parallel processes at startup. I had been thinking of max_* and parallelism as capacity limits, but they are the number of processes launched when it starts. So it looks like this issue is caused by insufficient resources on the machine.
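Consistent with that finding, a more conservative airflow.cfg keeps the number of processes launched at startup within what the machine and Postgres can handle (the values below are illustrative only, not recommended settings):

```
executor = LocalExecutor
sql_alchemy_pool_size = 5
parallelism = 32
dag_concurrency = 16
worker_concurrency = 16
max_threads = 2
```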

Unpickling error when running fairseq on AML using multiple GPUs

I am trying to run a fairseq translation task on AML using 4 GPUs (P100) and it fails with the following error:
-- Process 2 terminated with the following error: Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 174, in all_gather_list
result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py",
line 19, in _wrap
fn(i, *args) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 272, in distributed_main
main(args, init_distributed=True) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 82, in main
train(args, trainer, task, epoch_itr) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 123, in train
log_output = trainer.train_step(samples) File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py",
line 305, in train_step
[logging_outputs, sample_sizes, ooms, self._prev_grad_norm], File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 178, in all_gather_list
'Unable to unpickle data from other workers. all_gather_list requires all ' Exception: Unable to unpickle data from other workers.
all_gather_list requires all workers to enter the function together,
so this error usually indicates that the workers have fallen out of
sync somehow. Workers can fall out of sync if one of them runs out of
memory, or if there are other conditions in your training script that
can cause one worker to finish an epoch while other workers are still
iterating over their portions of the data.
2019-09-18 17:28:44,727|azureml.WorkerPool|DEBUG|[STOP]
The same code with the same parameters runs fine on a single local GPU. How do I resolve this issue?
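As a side note on the error itself (a toy reproduction, not fairseq code): all_gather_list pickles each worker's stats into a shared buffer, and if a worker reads a slice that its peer has not finished writing because they fell out of sync, pickle sees garbage bytes and fails in exactly this way:

```python
import pickle

payload = pickle.dumps({"loss": 1.0, "sample_size": 32})

# Overwrite the first byte, as if the buffer slice were stale or
# half-written by an out-of-sync worker.
corrupted = b"\xad" + payload[1:]

try:
    pickle.loads(corrupted)
except pickle.UnpicklingError as exc:
    print(exc)  # invalid load key, '\xad'.
```

That is why the fairseq message points at out-of-memory or uneven data sharding: both make one worker reach the gather point with a buffer the others haven't matched.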

zarr.consolidate_metadata yields error: 'memoryview' object has no attribute 'decode'

I have an existing LMDB zarr archive (~6GB) saved at path. Now I want to consolidate the metadata to improve read performance.
Here is my script:
store = zarr.LMDBStore(path)
root = zarr.open(store)
zarr.consolidate_metadata(store)
store.close()
I get the following error:
Traceback (most recent call last):
File "zarr_consolidate.py", line 12, in <module>
zarr.consolidate_metadata(store)
File "/local/home/marcel/.virtualenvs/noisegan/local/lib/python3.5/site-packages/zarr/convenience.py", line 1128, in consolidate_metadata
return open_consolidated(store, metadata_key=metadata_key)
File "/local/home/marcel/.virtualenvs/noisegan/local/lib/python3.5/site-packages/zarr/convenience.py", line 1182, in open_consolidated
meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
File "/local/home/marcel/.virtualenvs/noisegan/local/lib/python3.5/site-packages/zarr/storage.py", line 2455, in __init__
d = store[metadata_key].decode() # pragma: no cover
AttributeError: 'memoryview' object has no attribute 'decode'
I am using zarr 2.3.2 and Python 3.5.2. I have another machine running Python 3.6.2 where this works. Could it have to do with the Python version?
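For reference, the last line of the traceback reduces to calling .decode() on a memoryview: per the traceback, store[metadata_key] comes back from LMDBStore as a memoryview rather than bytes, and memoryview has no decode method. A minimal reproduction (not a patch):

```python
# The store hands back a buffer as memoryview; calling .decode() on it
# fails, which is exactly the AttributeError in the traceback.
mv = memoryview(b'{"zarr_consolidated_format": 1}')
assert not hasattr(mv, "decode")

# Converting to bytes first restores the behaviour the code expects.
text = bytes(mv).decode()
print(text)  # {"zarr_consolidated_format": 1}
```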
Thanks for the report. Should be fixed with gh-452. Please test it out (if you are able).
If you are able to share a bit more information on why read performance suffers in your case, that would be interesting to learn about. :)
