Scheduling airflow TaskGroup throws AttributeError - python-3.x

So I am creating tasks in a TaskGroup and trying to add them to my DAG's sequence of tasks, but it is throwing this error:
Broken DAG: [/Users/abc/projects/abc/airflow_dags/dag.py] Traceback (most recent call last):
  File "/Users/abc/.pyenv/versions/3.8.12/envs/vmd-3.8.12/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 1234, in set_downstream
    self._set_relatives(task_or_task_list, upstream=False)
  File "/Users/abc/.pyenv/versions/3.8.12/envs/vmd-3.8.12/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 1178, in _set_relatives
    task_object.update_relative(self, not upstream)
AttributeError: 'NoneType' object has no attribute 'update_relative'
I am creating my task group and tasks like this:
def get_task_group(dag, task_group):
    t1 = DummyOperator(task_id='t1', dag=dag, task_group=task_group)
    t2 = DummyOperator(task_id='t2', dag=dag, task_group=task_group)
    t3 = DummyOperator(task_id='t3', dag=dag, task_group=task_group)
    t4 = DummyOperator(task_id='t4', dag=dag, task_group=task_group)
    t5 = DummyOperator(task_id='t5', dag=dag, task_group=task_group)
    t_list = [t2, t3, t4]
    t1.set_downstream(t_list)
    t5.set_upstream(t_list)

with DAG('some_dag', default_args=args) as dag:
    with TaskGroup(group_id=f"run_model_tasks", dag=dag) as tg:
        run_model_task_group = get_task_group(dag, tg)

    a1 = DummyOperator(task_id='a1', dag=dag)
    a2 = DummyOperator(task_id='a2', dag=dag)
    a3 = DummyOperator(task_id='a3', dag=dag)
    a4 = DummyOperator(task_id='a4', dag=dag)

    a1.set_downstream(a2)
    a2.set_downstream(run_model_task_group)
    a3.set_upstream(run_model_task_group)
    a3.set_downstream(a4)
If I leave the task group out of the sequencing by removing the lines
a2.set_downstream(run_model_task_group)
a3.set_upstream(run_model_task_group)
I can see that a1, a2, a3 & a4 are sequenced properly and I can see the disconnected run_model_task_group tasks, but as soon as I add the task group back into the sequence, I get the aforementioned error.
Can anyone guide me on what might be happening here?
Note that I am using a function that takes dag and task_group parameters to create the task group's tasks, because I want to create the same set of tasks for another DAG too.
Python Version: 3.8.8
Airflow Version: 2.0.1

AttributeError: 'NoneType' object has no attribute 'update_relative'
It's happening because run_model_task_group is None: get_task_group() never returns anything, so outside the scope of the with block the variable simply holds None, which is expected Python behaviour.
Without changing things too much from what you have done so far, you could refactor get_task_group() to return a TaskGroup object, like this:
def get_task_group(dag, group_id):
    with TaskGroup(group_id=group_id, dag=dag) as tg:
        t1 = DummyOperator(task_id='t1', dag=dag)
        t2 = DummyOperator(task_id='t2', dag=dag)
        t3 = DummyOperator(task_id='t3', dag=dag)
        t4 = DummyOperator(task_id='t4', dag=dag)
        t5 = DummyOperator(task_id='t5', dag=dag)
        t_list = [t2, t3, t4]
        t1.set_downstream(t_list)
        t5.set_upstream(t_list)
    return tg
In the DAG definition simply call it with:
run_model_task_group = get_task_group(dag, "run_model_tasks")
The resultant graph view looks like this:
DAG definition:
with DAG('some_dag',
         default_args=default_args,
         start_date=days_ago(2),
         schedule_interval='@once') as dag:

    # with TaskGroup(group_id=f"run_model_tasks", dag=dag) as tg:
    #     run_model_task_group = get_task_group(dag, )
    run_model_task_group = get_task_group(dag, "run_model_tasks")

    a1 = DummyOperator(task_id='a1', dag=dag)
    a2 = DummyOperator(task_id='a2', dag=dag)
    a3 = DummyOperator(task_id='a3', dag=dag)
    a4 = DummyOperator(task_id='a4', dag=dag)

    a1.set_downstream(a2)
    a2.set_downstream(run_model_task_group)
    a3.set_upstream(run_model_task_group)
    a3.set_downstream(a4)
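Since the helper now takes the dag and a group_id, the same set of tasks can be reused for the other DAG mentioned in the question; a minimal sketch, where another_dag is a hypothetical second DAG using the same default_args:

with DAG('another_dag', default_args=default_args) as another_dag:
    # Same task group, instantiated a second time for a different DAG
    run_model_task_group_2 = get_task_group(another_dag, "run_model_tasks")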
Finally, consider using the bitshift operators (>> and <<) instead of set_downstream and set_upstream; it's the recommended way to declare dependencies and also less verbose (source here).
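For example, the dependencies in the DAG above could be written as a single chain:

a1 >> a2 >> run_model_task_group >> a3 >> a4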
Let me know if that worked for you.
Tested with: Airflow version: 2.1.4, Python 3.8.10

Related

How can I only call a BigQuery table each dag run instead of every time Airflow refreshes?

I have a DAG that queries a table, pulling data from it, and which also uses a ShortCircuitOperator to check if the DAG needs to run, which is also based on a BQ table. The issue is that this currently queries the table every time Airflow parses the DAG file. The table is small (each query is less than 1 kb), but I'm concerned this will get more expensive as it's scaled up. Is there a way to only query this table each DAG run instead?
Here's a code snippet to show what's going on:
client = bigquery.Client()

def create_query_lists(list_type):
    query_job = client.query(
        """
        SELECT filename
        FROM `requests`
        """
    )
    results = query_job.result()
    results_list = []
    for row in results:
        results_list.append(row.filename)
    return results_list

def check_contents():
    if len(create_query_lists()) == 0:
        raise ValueError('Nothing to do')
        return False
    else:
        print("There's stuff to do")
        return True

# Create task to check if data being pulled is empty, if so fail so other tasks don't run
check_list = ShortCircuitOperator(
    task_id="check_column_not_empty",
    provide_context=True,
    python_callable=check_list_contents
)

check_list  # do subsequent tasks which use the same function
If I correctly understood your need, you want to execute tasks only if the result of the SQL query is not empty.
In this case you can also use a BranchPythonOperator, for example:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from google.cloud import bigquery

client = bigquery.Client()

def create_query_lists(list_type=None):  # default added so the call below works without an argument
    query_job = client.query(
        """
        SELECT filename
        FROM `requests`
        """
    )
    results = query_job.result()
    results_list = []
    for row in results:
        results_list.append(row.filename)
    return results_list

def check_contents(**kwargs):  # **kwargs absorbs op_kwargs and the context passed by the operator
    if len(create_query_lists()) == 0:
        return 'KO'
    else:
        return 'OK'

with airflow.DAG(
        "your_dag",
        schedule_interval=None) as dag:

    branching = BranchPythonOperator(
        task_id='file_exists',
        python_callable=check_contents,
        provide_context=True,
        op_kwargs={
            'param': 'param'
        },
        dag=dag
    )

    ok = DummyOperator(task_id='OK', dag=dag)
    ko = DummyOperator(task_id='KO', dag=dag)
    fake_task = DummyOperator(task_id='fake_task', dag=dag)

    (branching >>
     ok >>
     fake_task)

    branching >> ko
The BranchPythonOperator executes the query; if the result is not empty, it returns OK, otherwise KO.
We create 2 DummyOperators, one for OK, the other for KO (2 branches).
Depending on the result, we go to the OK or the KO branch.
The KO branch finishes the DAG without running other tasks.
The OK branch continues the DAG with the tasks that follow (fake_task in my example).
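One extra note on the original concern about the query running every time Airflow parses the DAG file: as long as the client creation and the query stay inside the python_callable (rather than at module level), BigQuery is only hit when the task actually executes, i.e. once per DAG run. A minimal sketch of that idea, reusing the names above:

from google.cloud import bigquery

def check_contents(**kwargs):
    # Creating the client and running the query here means they only
    # execute when the task runs, not when the scheduler parses the file.
    client = bigquery.Client()
    results = client.query("SELECT filename FROM `requests`").result()
    return 'OK' if results.total_rows > 0 else 'KO'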

Facing Issue while comparing xcom value to a static variable in airflow 2.3.4

I have a use case wherein I need to extract the value using xcom_pull.
Version of Airflow : 2.3.4
Composer version : 2.1.1
live_fw_num="{{ ti.xcom_pull(dag_id='" + DAG_ID + "',task_ids='get_fw_of_month')[0][0] }}"
The output coming out is 1. Images Attached
The first image shows the xcom tab value
The second image shows the value when extracted in a live_fw_num variable
Code:
today = datetime.date.today()

def function_1(table2, table3, live_fw_num):
    if live_fw_num == '1':  # (Tried with Integer value as well)
        <do something>
    else:
        <do something else>

with dag:
    task_1 = PythonOperator(
        task_id='get_fw_of_month',
        python_callable=get_data,
        op_kwargs={'sql': task_1_func(tb1=<some table name>,
                                      curr_dte=today,
                                      )
                   }
    )
    task_3 = PythonOperator(
        task_id='task3',
        python_callable=function_1,
        op_kwargs={'table2': <table name>,
                   'table3': <table name>,
                   'live_fw_num': "{{ ti.xcom_pull(dag_id='" + DAG_ID + "',task_ids='get_fw_of_month')[0][0] }}"
                   }
    )
    task_1 >> task_3
But when I compare this value to a static variable using an if-else clause, it goes to the else condition instead of the if condition, even though the output value of live_fw_num is 1.
I tested your DAG and your code and it worked on my side.
The only difference: I mocked the get_data method used by the first task to return 1 as a String.
import datetime
import logging

import airflow
from airflow.operators.python import PythonOperator

from integration_ocd.config import settings

today = datetime.date.today()

def get_data():
    return '1'

def function_1(table2, table3, live_fw_num):
    if live_fw_num == '1':
        logging.info('############################################<do something>')
    else:
        logging.info('############################################<do something else>')

with airflow.DAG(
        'dag_test_xcom',
        default_args=your_default_dag_args,
        schedule_interval=None) as dag:

    task_1 = PythonOperator(
        task_id='get_fw_of_month',
        python_callable=get_data
    )
    task_3 = PythonOperator(
        task_id='task3',
        python_callable=function_1,
        op_kwargs={
            'table2': 'table1',
            'table3': 'table2',
            'live_fw_num': "{{ ti.xcom_pull(dag_id='dag_test_xcom',task_ids='get_fw_of_month')[0][0] }}"
        }
    )
    task_1 >> task_3
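One detail worth remembering with this pattern: Jinja-templated op_kwargs are rendered to strings by default, so live_fw_num arrives in function_1 as the string '1', not the integer 1, and if live_fw_num == 1 will always fall into the else branch. Either compare against the string as above, cast the value first, or set render_template_as_native_obj=True on the DAG. A small sketch of the casting option:

import logging

def function_1(table2, table3, live_fw_num):
    # Templated values come through as strings by default; normalise before comparing.
    if int(live_fw_num) == 1:
        logging.info('<do something>')
    else:
        logging.info('<do something else>')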

How to only run certain operator when dag conf value exist

def skip_update_job_pod_name(dag):
    """
    :param dag: Airflow DAG
    :return: Dummy operator to skip update pod name
    """
    return DummyOperator(task_id="skip_update_job_pod_name", dag=dag)

def update_pod_name_branch_operator(dag: DAG, job_id: str):
    """branch operator to update pod name."""
    return BranchPythonOperator(
        dag=dag,
        trigger_rule="all_done",
        task_id="update_pod_name",
        python_callable=update_pod_name_func,
        op_kwargs={"job_id": job_id},
    )

def update_pod_name_func(job_id: Optional[str]) -> str:
    """function for update pod name."""
    return "update_job_pod_name" if job_id else "skip_update_pod_name"

def update_job_pod_name(dag: DAG, job_id: str, process_name: str) -> MySqlOperator:
    """
    :param dag: Airflow DAG
    :param job_id: Airflow Job ID
    :param process_name: name of the current running process
    :return: MySqlOperator to update Airflow job ID
    """
    return MySqlOperator(
        task_id="update_job_pod_name",
        mysql_conn_id="semantic-search-airflow-sdk",
        autocommit=True,
        sql=[
            f"""
            INSERT INTO airflow.Pod (job_id, pod_name, task_name)
            SELECT * FROM (SELECT '{job_id}', '{xcom_pull("pod_name")}', '{process_name}') AS temp
            WHERE NOT EXISTS (
                SELECT pod_name FROM airflow.Pod WHERE pod_name = '{{{{ ti.xcom_pull(key="pod_name") }}}}'
            ) LIMIT 1;
            """
        ],
        task_concurrency=1,
        dag=dag,
        trigger_rule="all_done",
    )

def create_k8s_pod_operator_without_volume(dag: DAG,
                                            job_id: int,
                                            ...variables) -> TaskGroup:
    """
    Create task group for k8s operator without volume
    """
    with TaskGroup(group_id="k8s_pod_operator_without_volume", dag=dag) as eks_without_volume_group:
        emit_pod_name_branch = update_pod_name_branch_operator(dag=dag, job_id=job_id)
        update_pod_name = update_job_pod_name(dag=dag, job_id=job_id, process_name=process_name)
        skip_update_pod_name = skip_update_job_pod_name(dag=dag)
        emit_pod_name_branch >> [update_pod_name, skip_update_pod_name]
    return eks_without_volume_group
I updated the code based on the comment. I am curious how the TaskGroup works with the branch operator; when I try this, I get:
airflow.exceptions.AirflowException: Branch callable must return valid task_ids. Invalid tasks found: {'update_job_pod_name'}
You can use a BranchPythonOperator that gets the value and returns the name of the task to run for each condition.
def choose_job_func(job_id):
    if job_id:
        return "update_pod_name_rds"

choose_update_job = BranchPythonOperator(task_id="choose_update_job", python_callable=choose_job_func,
                                         op_kwargs={"job_id": "{{ params.job_id }}"})
Or, with the TaskFlow API it would look like this:
@task.branch
def choose_update_job(job_id):
    if job_id:
        return "update_pod_name_rds"
Full DAG example:
with DAG(
    dag_id="test_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    render_template_as_native_obj=True,
    params={
        "job_id": Param(default=None, type=["null", "string"])
    },
    tags=["test"],
) as dag:

    def update_job_pod_name(job_id: str, process_name: str):
        return MySqlOperator(
            task_id="update_pod_name_rds",
            mysql_conn_id="semantic-search-airflow-sdk",
            autocommit=True,
            sql=[
                f"""
                INSERT INTO airflow.Pod (job_id, pod_name, task_name)
                SELECT * FROM (SELECT '{job_id}', '{xcom_pull("pod_name")}', '{process_name}') AS temp
                WHERE NOT EXISTS (
                    SELECT pod_name FROM airflow.Pod WHERE pod_name = '{{{{ ti.xcom_pull(key="pod_name") }}}}'
                ) LIMIT 1;
                """
            ],
            task_concurrency=1,
            dag=dag,
            trigger_rule="all_done",
        )

    @task.branch
    def choose_update_job(job_id):
        print(job_id)
        if job_id:
            return "update_pod_name_rds"
        return "do_nothing"

    sql_task = update_job_pod_name(
        "{{ params.job_id}}",
        "process_name",
    )
    do_nothing = EmptyOperator(task_id="do_nothing")

    start_dag = EmptyOperator(task_id="start")
    end_dag = EmptyOperator(task_id="end", trigger_rule=TriggerRule.ONE_SUCCESS)

    (start_dag >> choose_update_job("{{ params.job_id }}") >> [sql_task, do_nothing] >> end_dag)
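Regarding the AirflowException in the question (Invalid tasks found: {'update_job_pod_name'}): when the branch targets live inside a TaskGroup, their task ids are prefixed with the group id, so the branch callable has to return the fully qualified id. A minimal sketch, assuming the group_id from the question's code:

def update_pod_name_func(job_id):
    # Tasks created inside a TaskGroup get ids of the form "<group_id>.<task_id>"
    prefix = "k8s_pod_operator_without_volume"
    if job_id:
        return f"{prefix}.update_job_pod_name"
    return f"{prefix}.skip_update_job_pod_name"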

How to process multiple Spark SQL queries in parallel [duplicate]

I am trying to run 2 functions doing completely independent transformations on a single RDD in parallel using PySpark. What are some methods to do the same?
from multiprocessing import Process

from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

def doXTransforms(sampleRDD):
    (X transforms)

def doYTransforms(sampleRDD):
    (Y transforms)

if __name__ == "__main__":
    sc = SparkContext(appName="parallelTransforms")
    sqlContext = SQLContext(sc)
    hive_context = HiveContext(sc)

    rows_rdd = hive_context.sql("select * from tables.X_table")

    p1 = Process(target=doXTransforms, args=(rows_rdd,))
    p1.start()
    p2 = Process(target=doYTransforms, args=(rows_rdd,))
    p2.start()
    p1.join()
    p2.join()
    sc.stop()
This does not work and I now understand this will not work.
But is there an alternative way to make this work? Specifically, are there any PySpark-specific solutions?
Just use threads and make sure the cluster has enough resources to process both tasks at the same time.
from threading import Thread
import time

def process(rdd, f):
    def delay(x):
        time.sleep(1)
        return f(x)
    return rdd.map(delay).sum()

rdd = sc.parallelize(range(100), int(sc.defaultParallelism / 2))

t1 = Thread(target=process, args=(rdd, lambda x: x * 2))
t2 = Thread(target=process, args=(rdd, lambda x: x + 1))
t1.start(); t2.start()
Arguably this is not that often useful in practice but otherwise should work just fine.
You can further use in-application scheduling with the FAIR scheduler and scheduler pools for better control over the execution strategy.
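A minimal sketch of the scheduler-pool idea, assuming spark.scheduler.mode is set to FAIR and reusing sc and rdd from above (the pool names are arbitrary):

from threading import Thread

def run_in_pool(sc, rdd, f, pool):
    # Assign the jobs submitted from this thread to a named scheduler pool.
    sc.setLocalProperty("spark.scheduler.pool", pool)
    return rdd.map(f).count()

t1 = Thread(target=run_in_pool, args=(sc, rdd, lambda x: x * 2, "pool_x"))
t2 = Thread(target=run_in_pool, args=(sc, rdd, lambda x: x + 1, "pool_y"))
t1.start(); t2.start()
t1.join(); t2.join()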
You can also try pyspark-asyncactions (disclaimer: the author of this answer is also the author of the package), which provides a set of wrappers around the Spark API and concurrent.futures:
import asyncactions
import concurrent.futures
f1 = rdd.filter(lambda x: x % 3 == 0).countAsync()
f2 = rdd.filter(lambda x: x % 11 == 0).countAsync()
[x.result() for x in concurrent.futures.as_completed([f1, f2])]

How to handle different task intervals on a single Dag in airflow?

I have a single DAG with multiple tasks in a simple structure: tasks A, B, and C can run at the start without any dependencies, but task D depends on A. Now here is my question:
Tasks A, B, and C run daily, but I need task D to run weekly, after A succeeds. How can I set up this DAG?
Does changing the schedule_interval of a task work? Is there any best practice for this problem?
Thanks for your help.
You can use a ShortCircuitOperator to do this.
import airflow
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'schedule_interval': '0 10 * * *'
}

dag = DAG(dag_id='example', default_args=args)

a = DummyOperator(task_id='a', dag=dag)
b = DummyOperator(task_id='b', dag=dag)
c = DummyOperator(task_id='c', dag=dag)
d = DummyOperator(task_id='d', dag=dag)

def check_trigger(execution_date, **kwargs):
    return execution_date.weekday() == 0

check_trigger_d = ShortCircuitOperator(
    task_id='check_trigger_d',
    python_callable=check_trigger,
    provide_context=True,
    dag=dag
)

a.set_downstream(b)
b.set_downstream(c)
a.set_downstream(check_trigger_d)

# Perform D only if trigger function returns a true value
check_trigger_d.set_downstream(d)
In Airflow version >= 2.1.0, you can use the BranchDayOfWeekOperator which is exactly suited for your case.
See this answer for more details.
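For reference, a minimal sketch of that approach, replacing the ShortCircuitOperator above and reusing the same a and d tasks (double-check the parameter names against your Airflow version):

from airflow.operators.weekday import BranchDayOfWeekOperator
from airflow.operators.dummy import DummyOperator

branch_on_monday = BranchDayOfWeekOperator(
    task_id='branch_on_monday',
    follow_task_ids_if_true='d',        # run D only on Mondays
    follow_task_ids_if_false='skip_d',  # otherwise take the skip branch
    week_day='Monday',
    use_task_execution_day=True,
    dag=dag,
)
skip_d = DummyOperator(task_id='skip_d', dag=dag)

a.set_downstream(branch_on_monday)
branch_on_monday.set_downstream(d)
branch_on_monday.set_downstream(skip_d)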
