For a data migration, I have created a DAG that, once all the tasks with the required logic have run, inserts data into a migration table.
The DAG has a SQL statement, similar to the one below, that initially extracts the data and feeds it to the other tasks:
sql = "select col_names from tables where created_on >= date1 and created_on <= date2"
For each DAG run I am manually changing date1 and date2 in the SQL above and starting the migration (the data chunks are heavy, so for now the date range is one week long).
I just want to automate this date change. For example, if I give a set of date intervals, then after the first DAG run finishes the second run is started, and so on until the last interval.
From my research so far, one solution is dynamic DAGs in Airflow. The problem is that this creates multiple DAG file instances, and it is also very difficult to debug and maintain.
Is there a way to repeat a DAG with a changing date parameter so that I no longer have to change the dates manually?
I had the exact same issue! Backfilling in Airflow doesn't seem to make any sense if you don't have the DAG interval start and end as input parameters. If you want to do data migration, you'll probably need to store your last migration time in a file to read. However, this goes against some of the properties an Airflow DAG/task should have (idempotence).
My solution was to add two tasks to my DAG before the start of my "main" tasks. I have two operators (you could probably make it one) which get the start and end times of the current DAG run. The names "start" and "end" are somewhat misleading, because "start" is actually the start of the previous run and "end" is the start of the current run.
I can't reveal the custom operator I wrote but you can do this in a single Python operator:
from datetime import datetime
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path
from croniter import croniter
def get_interval_start_end(**kwargs):
    dag = kwargs['dag']
    ti = kwargs['ti']
    dag_execution = ti.execution_date  # current DAG scheduled start
    dag_interval = dag._schedule_interval  # note the leading underscore; must be a cron expression
    cron_iter = croniter(dag_interval, dag_execution)
    dag_prev_execution = cron_iter.get_prev(datetime)  # previous scheduled start, as a datetime
    return (dag_execution, dag_prev_execution)
# dag
task = PythonOperator(task_id='blabla',
                      python_callable=get_interval_start_end,
                      provide_context=True)
# other tasks
Then pull these values from xcom in your next task.
There is also a way to get the "last_run" of the DAG using dag.get_last_dagrun() instead. However, it doesn't return the previous scheduled run but the previous actual run. If you have already run your DAG for a "future" time, your "last dag run" will be after your current execution! Then again, I might not have tested with the right settings, so you can try that out first.
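For the one-week chunks in the question, it may help to see the windows written out. Below is a minimal sketch in plain Python (no Airflow) of how a date range splits into consecutive windows. The name date_windows and the dates are made up for illustration, and treating each window as half-open (start included, end excluded) is my assumption, so that rows on a boundary are not picked up by two runs.

```python
from datetime import date, timedelta


def date_windows(start, end, days=7):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    step = timedelta(days=days)
    current = start
    while current < end:
        # The last window is cut short so it never runs past the overall end date.
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


# Hypothetical migration range, chosen only for the example.
windows = list(date_windows(date(2020, 8, 1), date(2020, 8, 20)))
for window_start, window_end in windows:
    print(window_start, "->", window_end)
```

A DAG with a weekly schedule_interval, a start_date, an end_date and catchup=True produces the same sequence of windows, one DAG run per window, which is what removes the manual date editing.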
I had a similar requirement, and here is how I accessed the dates, which can later be used in the SQL for a backfill.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
# The following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'end_date': datetime(2020, 8, 3),
    'retries': 0,
}
dag = DAG('helloWorld_v1', default_args=default_args, catchup=True, schedule_interval='0 1 * * *')
def print_dag_run_date(**kwargs):
    print(kwargs)
    execution_date = kwargs['ds']  # this run's date as 'YYYY-MM-DD'
    prev_execution_date = kwargs['prev_ds']  # previous run's date as 'YYYY-MM-DD'
    return (execution_date, prev_execution_date)
# bash and py are examples of tasks created using operators
bash = BashOperator(
    task_id='bash',
    depends_on_past=True,
    bash_command='echo "Hello World from Task 1"',
    dag=dag)
py = PythonOperator(
    task_id='py',
    depends_on_past=True,
    python_callable=print_dag_run_date,
    provide_context=True,
    dag=dag)
py.set_upstream(bash)
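Once ds and prev_ds are available, they can take the place of the hand-edited date1 and date2. A rough sketch, assuming a database driver that accepts pyformat-style (%(name)s) parameters; build_extract_sql is a hypothetical helper, and the exclusive upper bound is my choice so that consecutive runs do not overlap:

```python
def build_extract_sql(prev_ds, ds):
    """Return the extract query and its bind values for one run's date window."""
    sql = (
        "select col_names from tables "
        "where created_on >= %(date1)s and created_on < %(date2)s"
    )
    # Values are passed separately instead of being formatted into the string.
    return sql, {"date1": prev_ds, "date2": ds}


# Example values in the 'YYYY-MM-DD' form that ds and prev_ds use.
sql, params = build_extract_sql("2020-08-01", "2020-08-02")
print(sql)
print(params)
```

In operators whose sql field is templated, the same values can be written directly in the query as {{ prev_ds }} and {{ ds }}.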
Is there a way to find the maximum, minimum, or even average duration of all DagRun instances in Airflow? That is, all DAG runs from all DAGs, not just a single DAG.
I can't find anywhere to do this in the UI, or even a page with a programmatic or command-line example.
You can use the Airflow REST API to get all the dag_runs for each DAG and calculate statistics.
An example that gets all dag_runs per DAG and calculates each run's total time:
import datetime
import requests
from requests.auth import HTTPBasicAuth
airflow_server = "http://localhost:8080/api/v1"
auth = HTTPBasicAuth("airflow", "airflow")
get_dags_url = f"{airflow_server}/dags"
get_dag_params = {
    "limit": 100,
    "only_active": "true"
}
response = requests.get(get_dags_url, params=get_dag_params, auth=auth)
dags = response.json()["dags"]
get_dag_run_params = {
    "limit": 100,
    "state": "success",
}
for dag in dags:
    dag_id = dag["dag_id"]
    dag_run_url = f"{airflow_server}/dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params=get_dag_run_params, auth=auth)
    dag_runs = response.json()["dag_runs"]
    for dag_run in dag_runs:
        # start_date is when the run actually began; execution_date is only the logical date
        start_date = datetime.datetime.fromisoformat(dag_run['start_date'])
        end_date = datetime.datetime.fromisoformat(dag_run['end_date'])
        duration = end_date - start_date
        duration_in_s = duration.total_seconds()
        print(duration_in_s)
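To get from the per-run durations to the min/max/average the question asks for, something like the following could be applied to the collected runs. This is only a sketch: the sample records are made up, shaped like the start_date and end_date fields of the API's dag_runs entries, and duration_stats is a hypothetical helper.

```python
from datetime import datetime
from statistics import mean

# Made-up sample data; in practice this would be the dag_runs collected from every DAG.
sample_dag_runs = [
    {"start_date": "2022-01-01T00:00:00+00:00", "end_date": "2022-01-01T00:01:00+00:00"},
    {"start_date": "2022-01-02T00:00:00+00:00", "end_date": "2022-01-02T00:03:00+00:00"},
    {"start_date": "2022-01-03T00:00:00+00:00", "end_date": None},  # run not finished yet
]


def duration_stats(dag_runs):
    """Return min, max and average runtime in seconds, skipping unfinished runs."""
    durations = []
    for run in dag_runs:
        if not run.get("start_date") or not run.get("end_date"):
            continue
        start = datetime.fromisoformat(run["start_date"])
        end = datetime.fromisoformat(run["end_date"])
        durations.append((end - start).total_seconds())
    if not durations:
        return None
    return {"min": min(durations), "max": max(durations), "avg": mean(durations)}


stats = duration_stats(sample_dag_runs)
print(stats)
```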
The easiest way will be to query your Airflow metastore. All the scheduling, DAG runs, and task instances are stored there and Airflow can't operate without it. I do recommend filtering on DAG/execution date if your use-case allows. It's not obvious to me what one can do with just these three overarching numbers alone.
select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30' and d.state = 'success'
) runs
You might also consider joining to the task_instance table to get some task-level data, and perhaps use the min start and max end times for DAG tasks within a DAG run for your timestamps.
As a newbie to airflow, I'm looking at the example_branch_operator:
"""Example DAG demonstrating the usage of the BranchPythonOperator."""
import random
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago
args = {
    'owner': 'airflow',
}
with DAG(
    dag_id='example_branch_operator',
    default_args=args,
    start_date=days_ago(2),
    schedule_interval="@daily",
    tags=['example', 'example2'],
) as dag:
    run_this_first = DummyOperator(
        task_id='run_this_first',
    )
    options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
    branching = BranchPythonOperator(
        task_id='branching',
        python_callable=lambda: random.choice(options),
    )
    run_this_first >> branching
    join = DummyOperator(
        task_id='join',
        trigger_rule='none_failed_or_skipped',
    )
    for option in options:
        t = DummyOperator(
            task_id=option,
        )
        dummy_follow = DummyOperator(
            task_id='follow_' + option,
        )
        branching >> t >> dummy_follow >> join
Looking at the join operator, I'd expect for it to collect all the branches, but instead it's just another task that happens at the end of each branch. If multiple branches are executed, join will run that many times.
(yes, yes, it should be idempotent, but that's not the point of the question)
Is this a bug, a poorly named task, or am I missing something?
The tree view displays a complete branch from each DAG root node. Multiple branches that converge on a single task are shown multiple times, but the task is executed only once. Check out the Graph View of this DAG, where join appears as a single node.
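The difference between "drawn" and "executed" can be illustrated with a toy graph. This is only a sketch of the idea, not how Airflow renders the view internally; the edge dictionary is written out by hand for two of the four branches, and root_to_leaf_paths is a made-up helper.

```python
# Downstream edges of a cut-down version of the example DAG (two branches only).
downstream = {
    "run_this_first": ["branching"],
    "branching": ["branch_a", "branch_b"],
    "branch_a": ["follow_branch_a"],
    "branch_b": ["follow_branch_b"],
    "follow_branch_a": ["join"],
    "follow_branch_b": ["join"],
    "join": [],
}


def root_to_leaf_paths(graph, node):
    """List every path from node down to a task with no downstream tasks."""
    children = graph[node]
    if not children:
        return [[node]]
    return [[node] + path for child in children for path in root_to_leaf_paths(graph, child)]


paths = root_to_leaf_paths(downstream, "run_this_first")
# A tree rendering repeats a shared task once per path that reaches it...
times_drawn = sum(path.count("join") for path in paths)
# ...but the DAG still contains that task exactly once.
unique_tasks = {task for path in paths for task in path}
print(times_drawn)
print(len(unique_tasks))
```

Here join shows up on both paths, yet it is a single key in the graph, which is the situation the tree view is showing.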
I want to use badRecordsPath in Spark in Azure Databricks to get a count of the corrupt records associated with the job, but there is no simple way to know:
- whether a file has been written
- in which partition the file has been written
I thought maybe I could check whether the last partition was created in the last 60 seconds, with some code like this:
import time
from datetime import datetime, timedelta, timezone
df = spark.read.format('csv').option("badRecordsPath", corrupt_record_path)
partition_dict = {}  # maps each partition name to its timestamp
for i in dbutils.fs.ls(corrupt_record_path):
    name = i.name[:-1]  # strip the trailing "/"
    partition_dict[name] = time.mktime(datetime.strptime(name, "%Y%m%dT%H%M%S").timetuple())
# here I get the timestamp of one minute ago
submit_timestamp_utc_minus_minute = datetime.now(timezone.utc) - timedelta(seconds=60)
submit_timestamp_utc_minus_minute = time.mktime(datetime.strptime(submit_timestamp_utc_minus_minute.strftime("%Y%m%dT%H%M%S"), "%Y%m%dT%H%M%S").timetuple())
# here I check whether the latest partition is more recent than 60 seconds ago
latest_partition = max(partition_dict, key=lambda k: partition_dict[k])
if partition_dict[latest_partition] > submit_timestamp_utc_minus_minute:
    corrupt_dataframe = spark.read.format('json').load(corrupt_record_path + latest_partition + '/bad_records')
    corrupt_records_count = corrupt_dataframe.count()
else:
    corrupt_records_count = 0
But I see two issues:
- it is a lot of overhead (OK, the code could also be better written, but still)
- I'm not even sure when the partition name is defined in the reading job. Is it at the beginning of the job or at the end? If it is at the beginning, then the 60 seconds are not relevant at all.
As a side note, I cannot use a PERMISSIVE read with corrupt_records_column, as I don't want to cache the dataframe (you can see my other question here).
Any suggestions or observations would be much appreciated!
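For what it's worth, the recency check described above can be reduced to comparing timezone-aware datetimes directly, without the mktime round trip. A minimal sketch, assuming the partition names are timestamps in UTC (that is an assumption, not something confirmed here); is_recent_partition is a hypothetical helper, and it does not answer when the partition name is fixed during the job.

```python
from datetime import datetime, timedelta, timezone


def is_recent_partition(partition_name, now, max_age_seconds=60):
    """True if the timestamp encoded in partition_name is at most max_age_seconds before now."""
    # Assumption: the name encodes a UTC timestamp such as 20210301T120000.
    written_at = datetime.strptime(partition_name, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
    age = now - written_at
    return timedelta(0) <= age <= timedelta(seconds=max_age_seconds)


# Fixed "now" so the example is reproducible; a real job would use datetime.now(timezone.utc).
now = datetime(2021, 3, 1, 12, 0, 30, tzinfo=timezone.utc)
print(is_recent_partition("20210301T120000", now))
print(is_recent_partition("20210301T115800", now))
```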
What is the best way to transfer information between DAGs?
I have a scenario where multiple DAGs, let's say DAG A and DAG B, can call DAG C, and I thought of two ways to do so:
1. XCom: I cannot use an XCom pull from DAG C, since I don't know which DAG id to give as input.
2. Use context["dag_run"].payload when calling TriggerDagRunOperator, but that solution forces me to create some kind of custom operator for each task in DAG C in order to be able to use the context inside the execute function and read that info.
Any other suggestions? Pros/cons?
Thanks.
from airflow.operators.dagrun_operator import TriggerDagRunOperator  # Airflow 1.x import path
# Pass info from DAG_A / DAG_B:
def push_info(context, dag_run_obj):
    dag_run_obj.payload = {PARAMS_KEY: {"feature_ids": 1234}}
    return dag_run_obj
trigger_DAG_C = TriggerDagRunOperator(
    task_id="trigger_DAG_C",
    trigger_dag_id="DAG_C",
    python_callable=push_info,
    dag=dag,
)
# Read the info from DAG C (method of a custom operator):
def execute(self, context):
    info = context["dag_run"].conf
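A custom operator may not be needed just to reach the context: a callable given to a PythonOperator receives the same context as keyword arguments (with provide_context=True on Airflow 1.x). A minimal sketch, where read_trigger_info is a hypothetical callable, the conf is flattened to a single key for brevity, and the fake context only stands in for what Airflow would pass at run time:

```python
from types import SimpleNamespace


def read_trigger_info(**context):
    """Return the feature_ids sent by whichever DAG triggered this run, or None."""
    # conf is None when the run was not triggered with a payload.
    conf = context["dag_run"].conf or {}
    return conf.get("feature_ids")


# Stand-in for the context Airflow passes to the callable; not a real DagRun.
fake_context = {"dag_run": SimpleNamespace(conf={"feature_ids": 1234})}
print(read_trigger_info(**fake_context))
```

Because the callable only looks at dag_run.conf, it does not need to know whether DAG A or DAG B did the triggering.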
I am trying to schedule an Airflow script to run every Tuesday at 9:10 AM UTC. Below is how I have defined it.
dag = DAG(
    dag_id=DAG_NAME,
    default_args=args,
    schedule_interval="10 9 * * 2",
    catchup=False
)
However, when the time comes, the script does not get triggered automatically. If I do not define a value in the day column (the last column), the scheduler works fine. Any idea where I am going wrong?
Thanks
Update:
args = {
    'owner': 'admin',
    'start_date': airflow.utils.dates.days_ago(9)
}
dag = DAG(
    dag_id=DAG_NAME,
    default_args=args,
    schedule_interval="10 9 * * 2",
    catchup=False
)
This one stumps people more than anything else in Airflow, but as a commenter and the Airflow documentation state,
The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
In this case you can either move your DAG's start_date back by one schedule_interval or wait for the next schedule_interval to complete.
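To make the rule concrete for a weekly Tuesday 09:10 schedule, here is a small sketch in plain Python. It deliberately ignores catchup, timezones and the scheduler's own bookkeeping; first_trigger_time is a made-up helper and the start date is only an example. Note that Python counts Monday as 0, so Tuesday is 1, while the cron expression uses 2.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date, weekday=1, hour=9, minute=10):
    """Return (first execution_date, time the run is actually triggered) for a weekly schedule."""
    # First schedule point on or after start_date.
    candidate = start_date.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate < start_date:
        candidate += timedelta(weeks=1)
    # The run for that period only starts once the period has ended, one interval later.
    return candidate, candidate + timedelta(weeks=1)


# Example: a start_date of Monday 2020-08-03 at midnight.
execution_date, triggered_at = first_trigger_time(datetime(2020, 8, 3))
print(execution_date)
print(triggered_at)
```

With a weekly interval this gap is a full week, which is why a weekly DAG can look as if it never fires when a daily one appears to work.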