Airflow example_branch_operator usage of join - bug? - python-3.x

As a newbie to airflow, I'm looking at the example_branch_operator:
"""Example DAG demonstrating the usage of the BranchPythonOperator."""
import random
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago
args = {
'owner': 'airflow',
}
with DAG(
dag_id='example_branch_operator',
default_args=args,
start_date=days_ago(2),
schedule_interval="#daily",
tags=['example', 'example2'],
) as dag:
run_this_first = DummyOperator(
task_id='run_this_first',
)
options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
branching = BranchPythonOperator(
task_id='branching',
python_callable=lambda: random.choice(options),
)
run_this_first >> branching
join = DummyOperator(
task_id='join',
trigger_rule='none_failed_or_skipped',
)
for option in options:
t = DummyOperator(
task_id=option,
)
dummy_follow = DummyOperator(
task_id='follow_' + option,
)
branching >> t >> dummy_follow >> join
Looking at the join operator, I'd expect it to collect all the branches, but instead it's just another task that happens at the end of each branch. If multiple branches are executed, join will run that many times.
(yes, yes, it should be idempotent, but that's not the point of the question)
Is this a bug, a poorly named task, or am I missing something?

The tree view displays a complete branch from each DAG root node. Multiple branches that converge on a single task will be shown multiple times, but they will only be executed once. Check out the Graph View of this DAG, where all the branches converge on the single join task.

Related

How to get maximum/minimum duration of all DagRun instances in Airflow?

Is there a way to find the maximum/minimum or even an average duration of all DagRun instances in Airflow? That is, all dag runs from all dags, not just one single dag.
I can't find anywhere to do this on the UI or even a page with a programmatic/command line example.
You can use the Airflow REST API to get all dag_runs for each dag and calculate statistics.
An example that gets all dag_runs per dag and calculates the total time:
import datetime

import requests
from requests.auth import HTTPBasicAuth

airflow_server = "http://localhost:8080/api/v1/"
auth = HTTPBasicAuth("airflow", "airflow")

get_dags_url = f"{airflow_server}dags"
get_dag_params = {
    "limit": 100,
    "only_active": "true",
}

response = requests.get(get_dags_url, params=get_dag_params, auth=auth)
dags = response.json()["dags"]

get_dag_run_params = {
    "limit": 100,
    "state": "success",
}

for dag in dags:
    dag_id = dag["dag_id"]
    dag_run_url = f"{airflow_server}dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params=get_dag_run_params, auth=auth)
    dag_runs = response.json()["dag_runs"]
    for dag_run in dag_runs:
        execution_date = datetime.datetime.fromisoformat(dag_run['execution_date'])
        end_date = datetime.datetime.fromisoformat(dag_run['end_date'])
        duration = end_date - execution_date
        duration_in_s = duration.total_seconds()
        print(duration_in_s)
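Since the question asks for min/max/average rather than a list of per-run durations, a minimal sketch extending the loop above (it reuses airflow_server, auth, dags and the same execution_date/end_date fields from that snippet) could aggregate across all runs:

# Sketch: collect the per-run durations instead of printing them, then report
# min / max / average across all dag_runs of all DAGs.
durations = []
for dag in dags:
    dag_id = dag["dag_id"]
    dag_run_url = f"{airflow_server}dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params={"limit": 100, "state": "success"}, auth=auth)
    for dag_run in response.json()["dag_runs"]:
        execution_date = datetime.datetime.fromisoformat(dag_run["execution_date"])
        end_date = datetime.datetime.fromisoformat(dag_run["end_date"])
        durations.append((end_date - execution_date).total_seconds())

if durations:
    print(f"min: {min(durations):.1f}s")
    print(f"max: {max(durations):.1f}s")
    print(f"avg: {sum(durations) / len(durations):.1f}s")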
The easiest way is to query your Airflow metastore (the metadata database). All the scheduling data, DAG runs, and task instances are stored there, and Airflow can't operate without it. I do recommend filtering on DAG and/or execution date if your use case allows; it's not obvious to me what one can do with just these three overarching numbers alone.
select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30'
      and d.state = 'success'
) dag_runtimes
You might also consider joining to the task_instance table to get some task-level data, and perhaps use the min start and max end times for DAG tasks within a DAG run for your timestamps.

Notify on status of other operators in an Airflow DAG using Email Operator

Let's say I have a DAG in Airflow whose definition file looks like:
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.trigger_rule import TriggerRule

default_args = {
    'owner': 'airflow',
    'retries': 0
}

dag = DAG(
    dag_id='email_notification_test',
    start_date=airflow.utils.dates.days_ago(2),
    default_args=default_args,
    schedule_interval=None,
    catchup=False
)

start = DummyOperator(task_id='start', dag=dag)

def built_to_fail(ds, **kwargs):
    raise Exception('This operator fails')

def built_to_succeed(ds, **kwargs):
    print('This Operator succeeds')
    return

operator_that_fails = PythonOperator(
    task_id='operator_that_fails',
    python_callable=built_to_fail,
    dag=dag
)

operator_that_succeeds = PythonOperator(
    task_id='operator_that_succeeds',
    python_callable=built_to_succeed,
    dag=dag
)

email = EmailOperator(
    task_id='send_email',
    to='<email address>',
    subject='DAG Run Complete',
    html_content="""run_id: {{ run_id }} </p>
        dag_run: {{ dag_run }} </p>
        dag_run.id: {{ dag_run.id }} </p>
        dag_run.state: {{ dag_run.state }}""",
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag
)

start >> [operator_that_fails, operator_that_succeeds] >> email
DAG TLDR: The dag has two operators, one which fails and one which succeeds. After both have finished executing, run a third task - an email operator - that sends a notification summary of the statuses of the preceding operators. For a visual aid, here is the webui graph view:
As I've demonstrated in the html_content part of the email operator, you can use Jinja to reference objects and their attributes. What I really need, though, is to report not only on the status of the dag itself, but on the individual operators that have already run, so something like:
html_content="""operator_that_fails status: {{ <dynamic reference to preceding task status> }} </p>
operator_that_succeeds status: {{ <ditto for the other task> }}"""
I was trying to do this by exploring the Airflow object model documentation, i.e. this page for the "dag run" object, but wasn't able to find a good way of getting preceding task statuses.
Anyone know how best to achieve my goal here?
You can use templating to access the context, like this:
def contact_endpoints(**context):
    return ("subject text", "body text")

execute = PythonOperator(
    task_id='source_task_id',
    python_callable=contact_endpoints,
    provide_context=True,
    do_xcom_push=True
)

email = EmailOperator(
    task_id='send_mail',
    to="email@email.com",
    subject=f"""[{dag.dag_id}] {{{{ task_instance.xcom_pull(task_ids='source_task_id', key='return_value')[0] }}}}""",
    html_content=f"""
        <br><br>
        <h3>Report</h3>
        {{{{ task_instance.xcom_pull(task_ids='source_task_id', key='return_value')[1] }}}}
    """)
In my example, the task whose id is referenced in the xcom_pull calls returns a tuple ("part to insert in subject", "part to insert in body").
You have to set do_xcom_push to True on the referenced task.
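If the goal is specifically the states of the preceding tasks, here is a minimal sketch in the same spirit, reworking the question's send_email task (the task ids come from the question's DAG; everything else is an assumption): a summary task looks the states up via dag_run.get_task_instance() and pushes them for the EmailOperator to pull.

# Hypothetical summary task: inspect the states of the two upstream tasks and return
# (subject, body) for the EmailOperator to pull from XCom.
def summarize_states(**context):
    dag_run = context["dag_run"]
    states = {
        task_id: dag_run.get_task_instance(task_id).state
        for task_id in ["operator_that_fails", "operator_that_succeeds"]
    }
    body = " </p> ".join(f"{task_id} status: {state}" for task_id, state in states.items())
    return ("DAG Run Complete", body)

summary = PythonOperator(
    task_id="summarize_states",
    python_callable=summarize_states,
    provide_context=True,               # needed on Airflow 1.x; implicit on 2.x
    trigger_rule=TriggerRule.ALL_DONE,  # run even though one upstream task failed
    dag=dag,
)

email = EmailOperator(
    task_id="send_email",
    to="<email address>",
    subject="{{ task_instance.xcom_pull(task_ids='summarize_states')[0] }}",
    html_content="{{ task_instance.xcom_pull(task_ids='summarize_states')[1] }}",
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag,
)

[operator_that_fails, operator_that_succeeds] >> summary >> email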

Airflow Composer custom module not found - PythonVirtualenvOperator

I have a very simple Airflow instance set up in GCP Composer. It has the bucket and everything. I want to set up each DAG to run in its own environment with PythonVirtualenvOperator.
The structure in it is as follows:
dags ->
------> code_snippets/
----------> print_name.py - has a function called print_my_name() which prints a string to the terminal
------> test_dag.py
test_dag.py:
import datetime

from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow import DAG

def main_func():
    import datetime

    import pandas as pd

    from code_snippets.print_name import print_my_name

    print_my_name()

    df = pd.DataFrame(data={
        'date': [str(datetime.datetime.now().date())]
    })
    print(df)

default_args = {
    'owner': 'test_dag',
    'start_date': datetime.datetime(2020, 7, 3, 5, 1, 00),
    'concurrency': 1,
    'retries': 0
}

dag = DAG('test_dag', description='Test DAGS with environment',
          schedule_interval='0 5 * * *',
          default_args=default_args, catchup=False)

test_the_dag = PythonVirtualenvOperator(
    task_id="test_dag",
    python_callable=main_func,
    python_version='3.8',
    requirements=["DateTime==4.3", "numpy==1.20.2", "pandas==1.2.4", "python-dateutil==2.8.1",
                  "pytz==2021.1", "six==1.15.0", "zope.interface==5.4.0"],
    system_site_packages=False,
    dag=dag,
)

test_the_dag
Everything works until I start importing custom modules. Having an __init__.py does not help; it still gives the same error, which in my case is:
from code_snippets.print_name import print_my_name
ModuleNotFoundError: No module named 'code_snippets'
I also have a local instance of Airflow and I experience the same issue there. I have tried moving things around, adding the path to the folders to PATH, adding __init__.py files in the directories, and even changing the import statements, but the error persists as long as I am importing custom modules.
system_site_packages=False or True also has no effect
Is there a fix for that or a way to go around it so I can utilize the custom code I have separated outside of the DAGs?
Airflow Version : 1.10.14+composer
Python version for Airflow is set to: 3
The implementation of airflow.operators.python.PythonVirtualenvOperator is such that the python_callable must not reference names defined outside of the function: the callable is serialized and executed in a separate virtualenv process, so all of its imports have to happen inside it.
Any non-standard-library packages used in the callable must be declared as external dependencies in the operator's requirements.
If you need to use code_snippets, publish it as a package, either to PyPI or to a VCS repository, and add it to the list of packages in the requirements kwarg of the PythonVirtualenvOperator.
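For example, a minimal sketch of that last option (the distribution name, Git URL, and tag are placeholders for wherever code_snippets ends up being published):

# Hypothetical: code_snippets published as an installable package in a Git repository.
test_the_dag = PythonVirtualenvOperator(
    task_id="test_dag",
    python_callable=main_func,
    python_version="3.8",
    requirements=[
        "pandas==1.2.4",
        # pip "direct reference" syntax: <distribution name> @ <VCS URL>
        "code-snippets @ git+https://github.com/<your-org>/code-snippets.git@<tag>",
    ],
    system_site_packages=False,
    dag=dag,
)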

What is the best way to transfer information between dags in Airflow?

What is the best way to transfer information between dags?
Since I have a scenario where multiple dags, let's say dag A and dag B, can call dag C, I thought of 2 ways to do so:
XCOM - I cannot use xcom_pull from dag C since I don't know which dag id to give as input.
Use context["dag_run"].payload when calling TriggerDagRunOperator, but that solution forces me to create some kind of custom operator for each task in dag C in order to be able to use the context inside the execute func and read that info.
Any other suggestions? pros/cons?
Thanks.
# Pass info from DAG_A / DAG_B:
from airflow.operators.dagrun_operator import TriggerDagRunOperator

def push_info(context, dag_run_obj):
    dag_run_obj.payload = {PARAMS_KEY: {"feature_ids": 1234}}
    return dag_run_obj

trigger_DAG_C = TriggerDagRunOperator(
    task_id="trigger_DAG_C",
    trigger_dag_id="DAG_C",
    python_callable=push_info,
    dag=dag,
)

# Read the info from DAG C:
def execute(self, context):
    info = context["dag_run"].conf
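If the tasks in DAG_C are plain PythonOperators, the payload can also be read from the context without writing a custom operator. A minimal sketch, assuming DAG_C's DAG object is available as dag (the task id and callable name are hypothetical):

from airflow.operators.python_operator import PythonOperator

# Hypothetical task inside DAG_C: the payload set via dag_run_obj.payload arrives
# as dag_run.conf in the triggered run's context.
def read_payload(**context):
    conf = context["dag_run"].conf or {}
    print(conf)

read_conf = PythonOperator(
    task_id="read_conf",
    python_callable=read_payload,
    provide_context=True,  # Airflow 1.x; the context is passed automatically on 2.x
    dag=dag,
)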

Repeating a airflow DAG with different date parameters for data migrations

For the data migrations, I have created a DAG which ultimately inserts data into a migration table after all the tasks with the required logic.
The DAG has a SQL query, similar to the one below, which initially extracts the data and feeds it to the other tasks:
sql="select col_names from tables where created_on >= date1 and created_on <= date2"
For each DAG run I am manually changing date1 and date2 in the above SQL and initiating the data migration (as the data chunk is heavy, the date range length is currently one week).
I just want to automate this date-changing process, e.g. if I give date intervals, then after the first DAG run finishes, the second run is initiated, and so on until the end of the date interval.
From my research so far, one solution is dynamic DAGs in Airflow. But the problem is that it creates multiple DAG file instances and is also very difficult to debug and maintain.
Is there a way to repeat a DAG with a changing date parameter so that I no longer have to keep changing dates manually?
I had the exact same issue! Backfilling in Airflow doesn't seem to make any sense if you don't have the DAG interval start and end as input parameters. If you want to do data migration, you'll probably need to store your last migration time in a file to read. However, this goes against some of the properties an Airflow DAG/task should have (idempotence).
My solution was to add two tasks to my DAG before the start of my "main" tasks. I have two operators (you can possibly make it one) which get the start and end times of the current DAG run. The "start" and "end" names are sort of misleading, because the "start" is actually the start of the previous run and the "end" is the start of the current run.
I can't reveal the custom operator I wrote but you can do this in a single Python operator:
from croniter import croniter

def get_interval_start_end(**kwargs):
    dag = kwargs['dag']
    ti = kwargs['ti']
    dag_execution = ti.execution_date          # current DAG scheduled start
    dag_interval = dag._scheduled_interval     # note the preceding underscore
    cron_iter = croniter(dag_interval, dag_execution)
    dag_prev_execution = cron_iter.get_prev()
    return (dag_execution, dag_prev_execution)

# dag
task = PythonOperator(task_id='blabla',
                      python_callable=get_interval_start_end,
                      provide_context=True)
# other tasks
Then pull these values from xcom in your next task.
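A minimal sketch of that pull, reusing the placeholder task id from the snippet above and the SQL from the question (the other names are hypothetical):

# Pull the (current, previous) execution dates pushed by the task above and use them
# to parameterize the extraction query.
def run_extraction(**kwargs):
    ti = kwargs['ti']
    date2, date1 = ti.xcom_pull(task_ids='blabla')  # (dag_execution, dag_prev_execution)
    sql = (f"select col_names from tables "
           f"where created_on >= '{date1}' and created_on <= '{date2}'")
    print(sql)

extract = PythonOperator(task_id='extract',
                         python_callable=run_extraction,
                         provide_context=True)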
There is also a way to get the "last_run" of the DAG using dag.get_last_dagrun() instead. However, it doesn't return the previous scheduled run but the previous actual run. If you have already run your DAG for a "future" time, your "last dag run" will be after your current execution! Then again, I might not have tested with the right settings, so you can try that out first.
I had a similar requirement, and here is how I accessed the dates, which can later be used in SQLs for backfill.
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from datetime import datetime, timedelta

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'end_date': datetime(2020, 8, 3),
    'retries': 0,
}

dag = DAG('helloWorld_v1', default_args=default_args, catchup=True,
          schedule_interval='0 1 * * *')

def print_dag_run_date(**kwargs):
    print(kwargs)
    execution_date = kwargs['ds']
    prev_execution_date = kwargs['prev_ds']
    return (execution_date, prev_execution_date)

# t1, t2 are examples of tasks created using operators
bash = BashOperator(
    task_id='bash',
    depends_on_past=True,
    bash_command='echo "Hello World from Task 1"',
    dag=dag)

py = PythonOperator(
    task_id='py',
    depends_on_past=True,
    python_callable=print_dag_run_date,
    provide_context=True,
    dag=dag)

py.set_upstream(bash)
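The same date window can also be consumed directly through templating, without a Python callable. A hedged sketch using the question's SQL; prev_ds and ds are built-in Airflow template variables, and the task id is hypothetical:

# Each catchup run renders its own date window: prev_ds is the previous execution
# date and ds the current one, so the query advances automatically run by run.
extract = BashOperator(
    task_id='extract',
    depends_on_past=True,
    bash_command=('echo "select col_names from tables '
                  "where created_on >= '{{ prev_ds }}' and created_on <= '{{ ds }}'\""),
    dag=dag)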
