Airflow scheduler starts up with exception when parallelism is set to a large number - python-3.x

I am new to Airflow and I am trying to use airflow to build a data pipeline, but it keeps getting some exceptions. My airflow.cfg look like this:
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow#localhost/airflow
sql_alchemy_pool_size = 5
parallelism = 96
dag_concurrency = 96
worker_concurrency = 96
max_threads = 96
broker_url = postgresql+psycopg2://airflow:airflow#localhost/airflow
result_backend = postgresql+psycopg2://airflow:airflow#localhost/airflow
When I started up airflow webserver -p 8080 in one terminal and then airflow scheduler in another terminal, the scheduler run will have the following execption(It failed when I set the parallelism number greater some amount, it works fine otherwise, this may be computer-specific but at least we know that it is resulted by the parallelism). I have tried run 1000 python processes on my computer and it worked fine, I have configured Postgres to allow maximum 500 database connections but it is still giving me the errors.
[2019-11-20 12:15:00,820] {dag_processing.py:556} INFO - Launched DagFileProcessorManager with pid: 85050
Process QueuedLocalWorker-18:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 811, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/edward/.local/share/virtualenvs/avat-utils-JpGzQGRW/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 111, in run
key, command = self.task_queue.get()
File "<string>", line 2, in get
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 815, in _callmethod
self._connect()
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 802, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
Thanks
Updated: I tried run in Pycharm, and it worked fine in Pycharm but sometimes failed in the terminal and sometimes it's not

I had the same issue. Turns out I had set max_threads=10 in airflow.cfg in combination with LocalExecutor. Switching max_threads=2 solved the issue.

Found out few days ago, Airflow actually starts up all the parallel process when starting up, I was thinking max_sth and parallelism as the capacity but it is the number of processes it will run when start up. So it looks like this issue is caused by the insufficient resources of the computer.

Related

Error when running meta-analysis of LCIA methods notebook

I am learning Brightway2 and I have been doing the notebooks from the brightway2 Github repo. So far all notebooks I have done have run smoothly, I am stuck in one concerning Meta-analysis of LCA methods, more specifically when running line [8].
This line computes 50.000 LCA calculations and times them. Here is the code:
from time import time
start = time()
lca_scores, methods, activities = get_lots_of_lca_scores()
print(time() - start)</code>
This enters into a never-ending loop, with the following message repeating:
Traceback (most recent call last):
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/pool.py", line 114, in worker
task = get()
File "/Users/.../miniconda3/envs/bw2_rosetta/lib/python3.9/multiprocessing/queues.py", line 368, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'many_activities_one_method' on <module '__main__' (built-in)>
I tried looking at the called functions def many_activities_one_method(activities, method) and def get_lots_of_lca_scores(). But I had no luck and when I make changes I have the feeling I make things worse.
Here is my question: Has anyone tried this notebook and worked successfully? What could I be missing?
*Note: I have done the required notebook Getting started with Brightwway2
Thank you!
The notebook has been updated to remove this error.

Python Firestore insert return error 503 DNS resolution failed

I have a problem during the execution of my python script from crontab, which consists of an insert operation in the firestore database.
db.collection(u'ab').document(str(row["Name"])).collection(str(row["id"])).document(str(row2["id"])).set(self.packStructure(row2))
When I execute normally with python3 script.py command it works, but when I execute it from crontab it return the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/axatel/angel_bridge/esportazione_firebase/main.py", line 23, in <module>
dato.getDati(dato, db, cursor, cursor2, fdb, select, anagrafica)
File "/home/axatel/angel_bridge/esportazione_firebase/dati.py", line 19, in getDati
db.collection(u'ab').document(str(row["Name"])).collection(str(row["id"])).document(str(row2["id"])).set(self.packStructure(row2))
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/document.py", line 234, in set
write_results = batch.commit()
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/batch.py", line 147, in commit
metadata=self._client._rpc_metadata,
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/gapic/firestore_client.py", line 1121, in commit
request, retry=retry, timeout=timeout, metadata=metadata
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
return wrapped_func(*args, **kwargs)
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 DNS resolution failed for service: firestore.googleapis.com:443
I really don't understand what's the problem, because the connection at the database works every time the script is started in both ways.
Is there a fix for this kind of issue?
I found something that might be helpful. There is nice troubleshooting guide and there is a part there, which seems to be related:
If your command works by invoking a runtime like python some-command.py perform a few checks to determine that the runtime
version and environment is correct. Each language runtime has quirks
that can cause unexpected behavior under crontab.
For python you might find that your web app is using a virtual
environment you need to invoke in your crontab.
I haven't seen such error running Firestore API, but this seems to match to your issue.
I found the solution.
The problem occured because the timeout sleep() value was lower than expected, so the database connection function starts too early during boot phase of machine. Increasing this value to 45 or 60 seconds fixed the problem.
#time.sleep(10) # old version
time.sleep(60) # working version
fdb = firebaseConnection()
def firebaseConnection():
# firebase connection
cred = credentials.Certificate('/database/axatel.json')
firebase_admin.initialize_app(cred)
fdb = firestore.client()
if fdb:
return fdb
else:
print("Error")
sys.exit()

Unpickling error when running fairseq on AML using multiple GPUs

I am trying to run fairseq translation task on AML using 4 GPUs (P100)and it fails with the following error:
-- Process 2 terminated with the following error: Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 174, in all_gather_list
result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py",
line 19, in _wrap
fn(i, *args) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 272, in distributed_main
main(args, init_distributed=True) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 82, in main
train(args, trainer, task, epoch_itr) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 123, in train
log_output = trainer.train_step(samples) File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py",
line 305, in train_step
[logging_outputs, sample_sizes, ooms, self._prev_grad_norm], File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 178, in all_gather_list
'Unable to unpickle data from other workers. all_gather_list requires all ' Exception: Unable to unpickle data from other workers.
all_gather_list requires all workers to enter the function together,
so this error usually indicates that the workers have fallen out of
sync somehow. Workers can fall out of sync if one of them runs out of
memory, or if there are other conditions in your training script that
can cause one worker to finish an epoch while other workers are still
iterating over their portions of the data.
2019-09-18
17:28:44,727|azureml.WorkerPool|DEBUG|[STOP]
Error occurred: User program failed with Exception:
-- Process 2 terminated with the following error: Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 174, in all_gather_list
result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py",
line 19, in _wrap
fn(i, *args) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 272, in distributed_main
main(args, init_distributed=True) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 82, in main
train(args, trainer, task, epoch_itr) File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py",
line 123, in train
log_output = trainer.train_step(samples) File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py",
line 305, in train_step
[logging_outputs, sample_sizes, ooms, self._prev_grad_norm], File
"/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py",
line 178, in all_gather_list
'Unable to unpickle data from other workers. all_gather_list requires all ' Exception: Unable to unpickle data from other workers.
all_gather_list requires all workers to enter the function together,
so this error usually indicates that the workers have fallen out of
sync somehow. Workers can fall out of sync if one of them runs out of
memory, or if there are other conditions in your training script that
can cause one worker to finish an epoch while other workers are still
iterating over their portions of the data.
The same code with same param runs fine on a single local GPU. How do I resolve this issue?

pexpect telnet on windows

I've used pexpect on linux successfully to telnet/ssh into Cisco switches for CLI scrapping. I'm trying to convert this code over to Windows and having some issues since it doesn't support the pxpect.spawn() command.
I read some online documentation and it suggested to use the pexpect.popen_spawn.PopenSpawn command. Can someone please point to me what I'm doing wrong here? I removed all my exception handling to simplify the code. Thanks.
import pexpect
from pexpect.popen_spawn import PopenSpawn
child = pexpect.popen_spawn.PopenSpawn('C:/Windows/System32/telnet 192.168.1.1')
child.expect('Username:')
child.sendline('cisco')
child.expect('Password:')
child.sendline('cisco')
child.expect('>')
child.close()
Error:
Traceback (most recent call last):
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\expect.py", line 98, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\popen_spawn.py", line 68, in read_nonblocking
raise EOF('End Of File (EOF).')
pexpect.exceptions.EOF: End Of File (EOF).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python\windows scripts\telnet\telnet.py", line 37, in <module>
main()
File "C:\Python\windows scripts\telnet\telnet.py", line 20, in main
child.expect('Username:')
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\spawnbase.py", line 327, in expect
timeout, searchwindowsize, async_)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\spawnbase.py", line 355, in expect_list
return exp.expect_loop(timeout)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\expect.py", line 104, in expect_loop
return self.eof(e)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\site-packages\pexpect\expect.py", line 50, in eof
raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF).
<pexpect.popen_spawn.PopenSpawn object at 0x000000000328C550>
searcher: searcher_re:
0: re.compile("b'Username:'")
The solution is to use plink.exe, which is a part of putty installation. You can download it from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
Enable TelnetClient on windows system and put plink.exe in the same folder from where you are running pexpect. Then you can telnet using pexpect as mentioned in the below example.
Also, use timeout flag with PopenSpawn to wait for the connection to establish. The above error is due to a timeout flag is not set.
p = pexpect.popen_spawn.PopenSpawn('plink.exe -telnet 192.168.0.1 -P 23', timeout=1)

memsql-ops start error in thread-7

After installing and starting memsql-ops, it shows the following error:
# ./memsql-ops start
Starting MemSQL Ops...
Exception in thread Thread-7:
Traceback (most recent call last):
File "/usr/local/updated-openssl/lib/python3.4/threading.py", line 921, in _bootstrap_inner
File "/usr/local/updated-openssl/lib/python3.4/threading.py", line 869, in run
File "/memsql_platform/memsql_platform/agent/daemon/manage.py", line 200, in startup_watcher
File "/memsql_platform/memsql_platform/network/api_client.py", line 34, in call
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/__init__.py", line 516, in loads
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/decoder.py", line 370, in decode
File "/usr/local/updated-openssl/lib/python3.4/site-packages/simplejson/decoder.py", line 400, in raw_decode
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Does anyone know the issue?
OS : CentOS 6.7
Memsql : 5.1.0 Enterprise Trial
It is likely that you have another server running on the port Ops is trying to start with (default 9000) that is returning data which is not JSON decodable. The solution is to either start MemSQL Ops on a different port, or kill the server running at that port.
We will fix this bug in an upcoming release of Ops! Thanks for pointing it out.

Resources