Apache Airflow: "Cannot use more than 1 thread when using sqlite. Setting max_threads to 1"

I have set up Airflow with a PostgreSQL database, and I am creating multiple DAGs:
# Airflow 1.x-style imports
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def subdag(parent_dag_name, child_dag_name, currentDate, batchId, category,
           subCategory, yearMonth, utilityType, REALTIME_HOME, args):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@once",
    )

    # get site list to run bs reports (user-defined helper)
    site_list = getSiteListforProcessing(category, subCategory, utilityType, yearMonth)
    print(site_list)

    def update_status(siteId, **kwargs):
        createdDate = getCurrentTimestamp()
        print('N', siteId, batchId, yearMonth, utilityType, 'N')
        updateJobStatusLog('N', siteId, batchId, yearMonth, utilityType, 'P')

    def error_status(siteId, **kwargs):
        createdDate = getCurrentTimestamp()
        print('N', siteId, batchId, yearMonth, utilityType, 'N')

    BS_template = """
    echo "{{ params.date }}"
    java -cp xx.jar com.xGenerator {{ params.siteId }} {{ params.utilityType }} {{ params.date }}
    """

    for index, siteid in enumerate(site_list):
        t1 = BashOperator(
            task_id='%s-task-%s' % (child_dag_name, index + 1),
            bash_command=BS_template,
            params={'date': currentDate, 'realtime_home': REALTIME_HOME,
                    'siteId': siteid, 'utilityType': utilityType},
            default_args=args,
            dag=dag_subdag)
        t2 = PythonOperator(
            task_id='%s-updatetask-%s' % (child_dag_name, index + 1),
            dag=dag_subdag,
            python_callable=update_status,
            op_kwargs={'siteId': siteid})
        t2.set_upstream(t1)

    return dag_subdag
It creates the dynamic tasks, but regardless of how many tasks are created, the last one always fails and logs this error:
"Cannot use more than 1 thread when using sqlite. Setting max_threads to 1"
E.g.: if 4 tasks are created, 3 run; if 2 tasks are created, 1 runs.
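For what it's worth, that log line is printed when Airflow detects a SQLite connection string for its metadata database, so it is worth confirming which database and executor the running installation actually picks up. A small diagnostic sketch, not part of the original question; it only reads the active configuration:

# Diagnostic sketch: print the metadata DB connection and the executor that
# the running Airflow installation is really using. If sql_alchemy_conn still
# starts with "sqlite://", the PostgreSQL setting in airflow.cfg (or the
# AIRFLOW__CORE__SQL_ALCHEMY_CONN environment variable) is not being picked up.
from airflow.configuration import conf

print(conf.get('core', 'sql_alchemy_conn'))
print(conf.get('core', 'executor'))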

Related

Why the query result() difference when using the BigQuery Python Client Library?

I am trying to figure out the difference between executing a query in the following two ways:
job1 = client.query(query).result()
vs.
job2 = client.query(query)
job2.result()
Code is below:
from google.cloud import bigquery

bq_project = '<project_name>'
client = bigquery.Client(project=bq_project)
m_query = "SELECT * FROM <dataset.tbl>"
## NOTE: This query result has just 1 row.

x = client.query(m_query)
job1 = x.result()
for row in job1:
    val1 = row

job2 = client.query(m_query).result()
for row in job2:
    val2 = row

print(job1 == job2)  # This prints False
print(val1 == val2)  # True
I understand that the final output of both query executions will be the same,
but I am not able to understand why job1 is not equal to job2.
Is the internal working different for job1 and job2?
Note: I have gone through this link, but the question there is different.
You are comparing two objects (iterables) stored in different spots in memory, so there is no point in comparing them directly. If you want to compare their contents, that's doable:
A = [each for each in job1]
B = [each for each in job2]
print(A == B)
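For context, this is ordinary Python behaviour rather than anything BigQuery-specific: two distinct objects whose class does not define __eq__ compare by identity, not by content. A minimal sketch of the same effect, without the client library and with a made-up class:

# Two separate objects holding identical data still compare unequal,
# because the default __eq__ falls back to identity.
class RowHolder:
    def __init__(self, rows):
        self.rows = rows

a = RowHolder([1, 2, 3])
b = RowHolder([1, 2, 3])

print(a == b)            # False: different objects in memory
print(a.rows == b.rows)  # True: the contents match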

Best practice to reduce the number of Spark jobs started. Use union?

With Spark, starting a job takes time.
For a complex workflow, it's possible to invoke a job in a loop,
but we pay for each 'start'.
def test_loop(spark):
    all_datas = []
    for i in ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']:
        all_datas.extend(spark.sql(f"""
            select * from data where id=='{i}'
        """).collect())  # Start a job
    return all_datas
Sometimes, it's possible to collapse the loop into one big job with 'union'.
def test_union(spark):
    full_request = None
    for i in ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']:
        q = f"""
            select '{i}' ID, * from data where leh_be_lot_id=='{i}'
        """
        partial_df = spark.sql(q)
        if full_request is None:
            full_request = partial_df
        else:
            full_request = full_request.union(partial_df)
    return full_request.collect()  # Start a job
For clarity, my samples are elementary (I know I could use in (...)); the real requests will be more complex.
Is it a good idea?
With the union approach, I can drastically reduce the number of jobs submitted, at the cost of a more complex job.
My tests show that it's possible to union more than 1,000 requests for better performance.
For 950 requests with local[6]:
with 0 unions: 1h53m
with 10 unions: 20m01s
with 100 unions: 7m12s
with 200 unions: 6m02s
with 500 unions: 6m25s
Sometimes the union version has to broadcast big data, or triggers an "Out of memory" error.
My final approach: set a level_of_union.
Merge a batch of requests, start the job, get the data,
then continue the loop with another batch.
def test_union(spark, level_of_union):
    full_request = None
    all_datas = []
    todo = ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']
    for idx, i in enumerate(todo):
        q = f"""
            select '{i}' ID, * from leh_be where leh_be_lot_id=='{i}'
        """
        partial_df = spark.sql(q)
        if full_request is None:
            full_request = partial_df
        else:
            full_request = full_request.union(partial_df)
        # flush the accumulated union every `level_of_union` requests,
        # and once more at the end of the list
        if idx % level_of_union == level_of_union - 1 or idx == len(todo) - 1:
            all_datas.extend(full_request.collect())  # Start a job
            full_request = None
    return all_datas
Run tests to tune the meta-parameter level_of_union.
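For reference, a hypothetical usage sketch of the batched version above (the SparkSession, the registered leh_be view, and the level_of_union value of 100 are assumptions to illustrate the call, not figures from the tests):

# Sketch only: assumes `spark` is an existing SparkSession and that the
# `leh_be` table/view queried inside test_union is already registered.
rows = test_union(spark, level_of_union=100)  # one collect() every 100 sub-queries
print(len(rows))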

Having trouble extracting a DAG name from a JSON in Airflow

I'm trying to get the DAG name from the following JSON:
INFO - {'conf': <airflow.configuration.AutoReloadableProxy object at ... >, 'dag': <DAG: dag-name-i-want-to-get>, 'ds': '2021-07-29' ... N }
By the way, I got the JSON using the following function in Airflow:
def execute_dag_run(**kwargs):
    print(kwargs)
    dag = kwargs['dag']
    print(type(dag))
    print(dag)

get_dag_run_task = PythonOperator(
    task_id='get_dag_run',
    python_callable=execute_dag_run,
    dag=dag,
    provide_context=True
)
However, I'm getting a class if I print type(dag):
INFO - <class 'airflow.models.dag.DAG'>
Do you have any idea how to get this without doing a manual extraction?
You are printing the DAG object. If you want to get the DAG name, you need to get it from the DAG object as:
def execute_dag_run(**kwargs):
    dag = kwargs['dag']
    print("dag_id from dag:")
    print(dag.dag_id)
Alternatively, you can also get it from the task_instance as:
def execute_dag_run(**kwargs):
    ti = kwargs['task_instance']
    print("dag_id from task instance:")
    print(ti.dag_id)
Another option is to get it from the dag_run as:
def execute_dag_run(**kwargs):
    dag_run = kwargs['dag_run']
    print("dag_id from dag_run:")
    print(dag_run.dag_id)

Send the output of one task to another task in Airflow

I am using Airflow, and I want to pass the output of the function in task 1 to task 2.
import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_dag(dag_id, schedule, default_args):
    def getData(**kwargs):
        res = requests.post('https://dummyURL')
        return res.json()

    def sendAlert(**kwargs):
        requests.post('https://dummyURL', params="here i want to send res.json() from task 1")

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(task_id='task1', python_callable=getData, provide_context=True, dag=dag)
        t2 = PythonOperator(task_id='task2', python_callable=sendAlert, provide_context=True, dag=dag)

    return dag
Check out XComs; as long as the data you want to pass is relatively small, it's the best option.
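A minimal sketch of what that could look like with XComs, reusing the dummy URLs and operators from the question (the key-less xcom_pull call and the explicit t1 >> t2 dependency are assumptions, not code from the original post). With provide_context=True, the value returned by getData is pushed to XCom automatically, and sendAlert can pull it by task id:

def getData(**kwargs):
    res = requests.post('https://dummyURL')
    # the return value of a PythonOperator callable is pushed to XCom
    return res.json()

def sendAlert(**kwargs):
    # pull whatever task1 returned from XCom via the task instance
    payload = kwargs['ti'].xcom_pull(task_ids='task1')
    requests.post('https://dummyURL', params=payload)

with dag:
    t1 = PythonOperator(task_id='task1', python_callable=getData, provide_context=True, dag=dag)
    t2 = PythonOperator(task_id='task2', python_callable=sendAlert, provide_context=True, dag=dag)
    t1 >> t2  # task2 runs after task1, so the XCom value exists when it pulls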

How to run multiple Jira search_issues queries in parallel inside a Python function?

I have a Python function (Python 3.6) that executes approximately 140-150 Jira search_issues queries, and each query takes roughly 7-8 seconds to return its result, so in total the function takes about 4-5 minutes to execute.
Is there a way to run these queries in parallel inside the function so that its execution time is reduced?
I am using the 'jira' package in Python.
from jira import JIRA

def jira():
    user = #####
    pwd = #####
    jira_access = JIRA("https://#####.atlassian.net", basic_auth=(user, pwd))
    return jira_access

def jira_count(jira_filter):
    result = jira().search_issues(jira_filter, startAt=0, maxResults=0)
    total_count = len(result)
    return total_count

def final_view():
    query1 = jira_count(jira_filter1)
    query2 = jira_count(jira_filter2)
    query3 = jira_count(jira_filter3)
    ...
    query150 = jira_count(jira_filter150)
    return query1, query2, query3, ..., query150
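A minimal sketch of one common approach for this kind of I/O-bound workload: a thread pool (the jira_filters list, the max_workers value, and the reuse of the jira_count function above are assumptions, not code from the original post). Because each search_issues call spends most of its time waiting on the network, threads are enough to overlap the waits.

from concurrent.futures import ThreadPoolExecutor

def final_view_parallel(jira_filters, max_workers=10):
    # run jira_count for every filter concurrently; map() preserves the
    # input order, so the results line up with jira_filters
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(jira_count, jira_filters))
    return counts

Tune max_workers to what the Jira server tolerates; too many concurrent searches may hit rate limits.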
