APScheduler add lots of jobs concurrently (database Jobstore) - python-3.x

How can I schedule lots of APScheduler jobs (4,000+) concurrently? (I must schedule all these after certain user events.)
Iteratively calling add_job simply takes too long with many jobs. But when I try to use AsyncIOScheduler and the following async code, I don't get any performance increase either.
NOTE: my scheduler needs to connect to a SQL jobstore via SQLAlchemy
import asyncio
import datetime
import time

from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler(jobstores={"default": SQLAlchemyJobStore(url="a valid db connection str")})
scheduler.start()

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # try to schedule jobs concurrently
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(schedule_all_jobs(jobs, send_time))
    duration = time.time() - start_time
    print(f"Created {len(jobs)} jobs in {duration} seconds")

async def schedule_all_jobs(all_jobs, send_time):
    tasks = []
    for job in all_jobs:
        task = asyncio.ensure_future(schedule_job(job, send_time))
        tasks.append(task)
    await asyncio.gather(*tasks, return_exceptions=True)

async def schedule_job(job, send_time):
    # one-off job that fires at send_time
    scheduler.add_job(send_email_if_needed, trigger="date", run_date=send_time)
The result is very slow. How can I speed this up?
>>> schedule_jobs_quickly()
...
Created 3000 jobs in 401.9982771873474 seconds
For comparison, this is how long it took with a BackgroundScheduler() using the default memory jobstore:
Created 3000 jobs in 0.9155495166778564 seconds
So, it seems to be the database connections that are so expensive. Maybe there's a way to create multiple jobs using the same connection, instead of re-connecting for each add_job?
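One quick, hedged way to test that hypothesis (reusing the scheduler and the send_email_if_needed callable from the snippet above) is to time a single add_job call; at roughly 130 ms per job, 3,000 sequential inserts would account for the ~400 seconds observed:

import datetime
import time

t0 = time.time()
scheduler.add_job(
    send_email_if_needed,
    trigger="date",
    run_date=datetime.datetime.now() + datetime.timedelta(days=2),
)
print(f"single add_job took {time.time() - t0:.3f} s")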

It's not the solution I was looking for, but I decided to give up on AsyncIOScheduler and instead schedule my many tasks in a separate thread so the rest of my program could continue without being held up by all of the DB connections. Example below.
from threading import Thread

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # schedule jobs in a new thread so the rest of the program isn't blocked
    scheduler_thread = Thread(target=schedule_email_jobs, args=(jobs, send_time))
    scheduler_thread.start()

def schedule_email_jobs(jobs, send_time):
    for job in jobs:
        scheduler.add_job(send_email, trigger="date", run_date=send_time)

def send_email():
    # sends email
    ...

Related

How to fire N Requests per Second in Scala

I am developing an application in Scala that is a kind of batch request processor. It gets the required data from storage, forms a request, and calls an API of another service (named ServiceB).
ServiceB allocates 110 TPS of its throughput to my service. In order to utilize the maximum available throughput, I want to make requests at a rate of 100 TPS.
How can I fire requests at a rate of exactly 100 TPS?
In theory, if the API call takes around 500 milliseconds to execute, one thread can execute at most 2 requests per second, so 50 threads are needed to achieve 100 TPS.
But Futures in Scala are executed in parallel anyway, which makes it difficult to come up with an exact number of threads required.
I tried the following:
def main(args: Array[String]): Unit = {
  val executorService = Executors.newFixedThreadPool(1) // thread pool size 1
  implicit val executionContext: ExecutionContextExecutor = ExecutionContextFactory.get(executorService)
  ....
  val request = getRequest(..)
  val responseList = ListBuffer[Future[Response]]()
  val stTime = System.currentTimeMillis()
  val rateLimiter = RateLimiter.create(300) // guava RateLimiter

  while (System.currentTimeMillis() - stTime <= (1000 * 60 * 3)) { // running for 3 mins
    rateLimiter.acquire(1)
    responseList += http(request) // using dispatch
  }

  // TODO: use a different thread pool to process the futures in responseList
}
Metrics from Datadog:
TPS is 83
Total number of calls to that API is 16.4k
So this doesn't even reach 100 TPS, even though I allowed 300 requests per second (see the rate limiter value).
This problem is similar to load testing: how do load-testing frameworks fire requests at a constant X TPS?
Thanks.
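For what it's worth, load-testing tools typically hold a constant rate by pacing submissions against a precomputed schedule rather than relying only on a rate limiter in front of a small thread pool: request i is due at start + i/rate, the pacer sleeps until that deadline, and the actual call is handed off to a separate pool so a slow response never delays the next submission. A minimal sketch of that pacing idea (shown in Python for brevity; fire_request, the 100 TPS target, and the pool size are placeholders, not from the question):

import time
from concurrent.futures import ThreadPoolExecutor

RATE = 100          # target requests per second (placeholder)
DURATION_S = 180    # run for 3 minutes, as in the question

def fire_request(i):
    ...             # placeholder for the real HTTP call

pool = ThreadPoolExecutor(max_workers=64)   # sized so slow calls don't block pacing
start = time.monotonic()
total = int(RATE * DURATION_S)
for i in range(total):
    deadline = start + i / RATE             # when request i is due
    delay = deadline - time.monotonic()
    if delay > 0:
        time.sleep(delay)
    pool.submit(fire_request, i)            # hand off; don't wait for the response
pool.shutdown(wait=True)

The key point is that pacing and execution are decoupled: the loop only controls when work is submitted, while the pool controls how much of it runs concurrently.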

How to get maximum/minimum duration of all DagRun instances in Airflow?

Is there a way to find the maximum/minimum or even the average duration of all DagRun instances in Airflow? That is, all dag runs from all DAGs, not just one single DAG.
I can't find anywhere to do this in the UI, or even a page with a programmatic/command-line example.
You can use the Airflow REST API to get all dag_runs per DAG and calculate statistics.
An example that gets the dag runs for each DAG and calculates each run's total time:
import datetime

import requests
from requests.auth import HTTPBasicAuth

airflow_server = "http://localhost:8080/api/v1/"
auth = HTTPBasicAuth("airflow", "airflow")

get_dags_url = f"{airflow_server}dags"
get_dag_params = {
    "limit": 100,
    "only_active": "true",
}
response = requests.get(get_dags_url, params=get_dag_params, auth=auth)
dags = response.json()["dags"]

get_dag_run_params = {
    "limit": 100,
    "state": "success",
}
for dag in dags:
    dag_id = dag["dag_id"]
    dag_run_url = f"{airflow_server}dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params=get_dag_run_params, auth=auth)
    dag_runs = response.json()["dag_runs"]
    for dag_run in dag_runs:
        execution_date = datetime.datetime.fromisoformat(dag_run["execution_date"])
        end_date = datetime.datetime.fromisoformat(dag_run["end_date"])
        duration = end_date - execution_date
        duration_in_s = duration.total_seconds()
        print(duration_in_s)
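To get the min/max/average across all DAGs that the question asks for, the inner loop above can append each duration_in_s to a single list (say all_durations) instead of printing it, and then:

# assumes every duration_in_s from the loop above was appended to all_durations
print("min:", min(all_durations))
print("max:", max(all_durations))
print("avg:", sum(all_durations) / len(all_durations))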
The easiest way will be to query your Airflow metastore. All the scheduling, DAG runs, and task instances are stored there and Airflow can't operate without it. I do recommend filtering on DAG/execution date if your use-case allows. It's not obvious to me what one can do with just these three overarching numbers alone.
select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30'
      and d.state = 'success'
) runtimes
You might also consider joining to the task_instance table to get some task-level data, and perhaps use the min start and max end times for DAG tasks within a DAG run for your timestamps.

Python API: synchronous multithreaded DB queries

I have a Python Flask API that applies some SQL-based filtering on an object.
Steps of the API workflow (a rough sketch follows these lists):
receive a POST request (with arguments)
run multiple SQL read queries (against a Postgres DB) depending on some of the posted arguments
apply some simple "pure Python" rules to the SQL results to get a boolean result
store the boolean result and the associated posted arguments in the Postgres DB
return the boolean result
Constraints of the API:
the API needs to return the boolean answer in under 150 ms
I can store the boolean result asynchronously in the DB, to avoid waiting for the write query to complete before returning the boolean result
however, as explained, the boolean answer depends on the SQL read queries, so I cannot run those queries asynchronously
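As a rough, hedged sketch only (not the poster's actual code): the workflow above, with the read queries fanned out to a thread pool and the result write fired off without being awaited, could look something like this. run_read_query, apply_rules and store_result are hypothetical placeholders for the poster's logic:

from concurrent.futures import ThreadPoolExecutor

from flask import Flask, jsonify, request

app = Flask(__name__)
read_pool = ThreadPoolExecutor(max_workers=8)    # parallel read queries
write_pool = ThreadPoolExecutor(max_workers=2)   # fire-and-forget result writes

def run_read_query(name, args):
    ...  # placeholder: one SQL read query against Postgres

def apply_rules(results, args):
    ...  # placeholder: simple pure-Python rules
    return True

def store_result(decision, args):
    ...  # placeholder: write decision + args back to Postgres

@app.route("/filter", methods=["POST"])
def filter_object():
    args = request.get_json()
    # run the read queries in parallel and block until all of them return
    futures = [read_pool.submit(run_read_query, name, args) for name in ("q1", "q2", "q3")]
    results = [f.result() for f in futures]
    decision = apply_rules(results, args)
    write_pool.submit(store_result, decision, args)  # don't wait for the write
    return jsonify({"result": decision})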
Test made:
While running some tests, I saw that I can make the read queries in parallel. The test I did was:
Running the query below 2 times, not using multithreading => the code ran in roughly 10 seconds
import os
import time

from sqlalchemy import create_engine

engine = create_engine(os.getenv("POSTGRES_URL"))

def run_query():
    with engine.connect() as conn:
        rs = conn.execute("""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    for i in range(5):
        run_query()
    end = time.time() - start
    print(f"Elapsed: {end:.1f}s")
Running the query using multithreading => the code ran in roughly 5 seconds
import os
import threading
import time

from sqlalchemy import create_engine

engine = create_engine(os.getenv("POSTGRES_URL"))

def run_query():
    with engine.connect() as conn:
        rs = conn.execute("""
            SELECT
                *
                , pg_sleep(5)
            FROM users
        """)
        for row in rs:
            print(row)

if __name__ == "__main__":
    start = time.time()
    threads = []
    for i in range(5):
        t = threading.Thread(target=run_query)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    end = time.time() - start
    print(f"Elapsed: {end:.1f}s")
Question:
What is the bottleneck of the code? I'm sure there must be a maximum number of read queries that I can run in parallel in one API call, but I'm wondering what determines that limit.
Thank you very much for your help!
This scales well beyond the point that is sensible. With some tweaks to the built-in connection pool's pool_size, you could easily have 100 pg_sleep calls going simultaneously. But as soon as you change that to do real work rather than just sleeping, it would fall apart. You only have so many CPUs and so many disk drives, and that number is probably way less than 100.
You should start by looking at those read queries to see why they are slow and whether they can't be made faster with indexes or something.
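As a hedged illustration of the pool tweak mentioned above (the numbers are arbitrary, not a recommendation): SQLAlchemy's default QueuePool allows 5 pooled connections plus 10 overflow, so beyond roughly 15 concurrent threads the extra queries queue up inside the application waiting for a connection rather than running in Postgres. Raising the limits looks like this:

import os

from sqlalchemy import create_engine

engine = create_engine(
    os.getenv("POSTGRES_URL"),
    pool_size=100,    # connections kept in the pool
    max_overflow=0,   # no extra connections beyond pool_size
    pool_timeout=30,  # seconds a thread waits for a free connection
)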

Can't get Asyncio to run 2 tasks in parallel (process a query result while waiting for the next query to finish)

Hi, I'm trying to run tasks in an asynchronous way, but I can't get it to work.
What I want to do is query a DB (takes 30 s) and process the result (takes 15 s) while I make the next query.
My problem seems very simple to solve, but for some reason I can't get it to work.
Thank you very much for your help.
Here is my code so far:
import asyncio

import pandas as pd

# engine is assumed to be an existing SQLAlchemy engine

async def query_db(date):
    sql_query = f"SELECT * FROM tablename WHERE date='{date}'"
    df = pd.read_sql(sql_query, engine)
    df.to_csv(f"{date}_data.csv", index=False)

async def process_df(filepath):
    df = pd.read_csv(filepath)
    # Do processing stuff here and save the modified file

for dte in pd.date_range("2020-10-10", "2020-10-12", freq='D'):
    query_data_task = asyncio.create_task(query_db(dte))
    filepath = f"{dte}_data.csv"
    await query_data_task
    process_dataframe = asyncio.create_task(process_df(filepath))
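Not from the original thread, but for context: the loop above never overlaps the two steps, because pd.read_sql and pd.read_csv are blocking calls and the query task is awaited before the processing task is even created. One common pattern (Python 3.9+) is to push each blocking step into a worker thread with asyncio.to_thread and overlap at that level; a minimal sketch, assuming plain (non-async) query_db/process_df functions like the ones in the question and an existing engine:

import asyncio

import pandas as pd

def query_db(date):
    # blocking DB read, as in the question (engine is assumed to exist)
    df = pd.read_sql(f"SELECT * FROM tablename WHERE date='{date}'", engine)
    path = f"{date}_data.csv"
    df.to_csv(path, index=False)
    return path

def process_df(filepath):
    df = pd.read_csv(filepath)
    # do processing stuff here and save the modified file

async def main():
    pending = None
    for dte in pd.date_range("2020-10-10", "2020-10-12", freq="D"):
        # run the 30 s query in a worker thread
        filepath = await asyncio.to_thread(query_db, dte)
        if pending is not None:
            await pending  # make sure the previous 15 s processing step finished
        # process this result in another thread while the next query runs
        pending = asyncio.create_task(asyncio.to_thread(process_df, filepath))
    if pending is not None:
        await pending

asyncio.run(main())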

Send the output of one task to another task in airflow

I am using Airflow, and I want to pass the output of the function of task 1 to task 2.
import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def create_dag(dag_id,
               schedule,
               default_args):

    def getData(**kwargs):
        res = requests.post('https://dummyURL')
        return res.json()

    def sendAlert(**kwargs):
        requests.post('https://dummyURL', params="here i want to send res.json() from task 1")

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(task_id='task1', python_callable=getData, provide_context=True, dag=dag)
        t2 = PythonOperator(task_id='task2', python_callable=sendAlert, provide_context=True, dag=dag)

    return dag
Check out XComs; as long as the data you want to pass is relatively small, they're the best option.
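A minimal, hedged sketch of what that could look like here, assuming the default behaviour where a PythonOperator pushes its callable's return value to XCom and that task1 runs before task2:

def sendAlert(**kwargs):
    ti = kwargs["ti"]
    # pull the JSON that getData returned in task1 (pushed to XCom automatically)
    data = ti.xcom_pull(task_ids="task1")
    requests.post("https://dummyURL", json=data)

For the pull to find anything, the dependency t1 >> t2 needs to be declared inside the with dag: block so task1 has finished before task2 runs.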
