I want to schedule a DAG on the first of the month at 5 AM UTC. So let's say I want to start running my DAG from 01/01/2021 5 AM. What should my start date and schedule interval be? I want the DAG run on 01/01/2021 to have the same execution date, i.e. 01/01/2021. Any leads on how this could be achieved.
Thanks
The FAQs about execution_date may help you understand what's happening, (see also DAG Runs):
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if you want to summarize data for 2016-02-19, You would do it at 2016-02-20 midnight UTC, which would be right after all data for 2016-02-19 becomes available.
Basically, the DAG run with execution_date = 2021-01-01T05:00:00+00:00 will actually be executed one schedule_interval later (2021-02-01T05:00:00+00:00). The actual date the execution occurred is represented in the start_date attribute of the "dag_run" object (you can access it through the execution context parameters). It is the same date you can find in the UI under Browse >> DAG Runs >> Start Date.
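For example, here is a minimal sketch (task and function names are illustrative, not from the original answer) of an Airflow 2.x task that prints both dates from the execution context:
from airflow.operators.python import PythonOperator

# Compare the logical execution_date with the wall-clock start_date of the run
def print_run_dates(**context):
    print("execution_date:", context["execution_date"])
    print("dag_run start_date:", context["dag_run"].start_date)

print_dates = PythonOperator(
    task_id="print_run_dates",
    python_callable=print_run_dates,
)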
Try creating a dummy DAG like the following:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

args = {
    "owner": "airflow",
}

with DAG(
    dag_id="dummy_dag",
    default_args=args,
    start_date=datetime(2021, 1, 1, 5),
    schedule_interval="0 5 1 * *",
) as dag:
    t1 = DummyOperator(task_id="task_1")
After the first execution, you could play around with the CLI to calculate future execution dates:
~/airflow$ airflow dags next-execution dummy_dag -n 10 -h
usage: airflow dags next-execution [-h] [-n NUM_EXECUTIONS] [-S SUBDIR] dag_id
Get the next execution datetimes of a DAG.
It returns one execution unless the num-executions option is given
Let me know if that worked for you!
Related
I need to schedule my Spark v3.0.2 job to run on specified dates (i.e. March 31 and December 31) of every year.
I am using Airflow for scheduling.
How do I handle this use case?
If you want to run your job only on March 31st and December 31st, you can set a cron expression in the schedule_interval argument of your DAG definition.
The cron expression will be 0 0 31 3,12 *, which translates to: run at midnight on the 31st day of the month for month 3 (March) and month 12 (December). Thus your DAG definition should be:
from airflow import DAG

your_dag = DAG(
    dag_id='your_dag_id',
    ...
    schedule_interval='0 0 31 3,12 *',
    ...
)
For more complicated cases, such as runs on April 15th and August 23rd that cannot be defined with a cron expression, I guess you should do as Iñigo suggested.
You have some options here:
Option 1:
Create a DAG whose first step is a PythonOperator that checks the date and fails if it is not Dec 31 or Mar 31.
Make this first step required for the next step to run (see the sketch below).
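A minimal sketch of that idea, assuming Airflow 2.x (task and function names are illustrative):
from datetime import datetime

from airflow.exceptions import AirflowFailException
from airflow.operators.python import PythonOperator

# Fail the run unless today is one of the allowed (month, day) pairs
def check_run_date(**context):
    allowed = {(3, 31), (12, 31)}
    today = datetime.utcnow()
    if (today.month, today.day) not in allowed:
        raise AirflowFailException(f"{today:%Y-%m-%d} is not Mar 31 or Dec 31")

date_gate = PythonOperator(
    task_id="check_run_date",
    python_callable=check_run_date,
)
# date_gate >> spark_job  # the Spark step only runs if the check succeeds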
Option 2:
Create one DAG per date, each running yearly. This looks awful but it can be done easily with a single Python file like this:
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Create a dag for an exact date
def createYearlyDagForDate(startdate):
    with DAG(
        dag_id=f"createdagfordate_{startdate.strftime('month_%m_day_%d')}",
        start_date=startdate,
        schedule_interval="@yearly",
    ) as dag:
        sparkjob = SparkSubmitOperator(...)
    return dag

for x in [datetime(2021, 12, 31), datetime(2021, 3, 31)]:
    dag_object = createYearlyDagForDate(x)
    globals()[dag_object.dag_id] = dag_object  # expose it so the DAG parser finds it
The trick here is having a distinct dag_id for each DAG. If you reuse the same dag_id across the DAGs, you overwrite them and end up with just one declared.
How can I configure Airflow (MWAA) so that it will fire at the same time (6am PST) every day regardless of when the DAG is deployed?
I have tried what makes sense to me:
set the schedule_interval to 0 6 * * *.
set the start date to:
now = datetime.utcnow()
now = now.replace(tzinfo=pendulum.timezone('America/Los_Angeles'))
previous_five_am = now.replace(hour = 5, minute = 0, second = 0, microsecond = 0)
start_date = previous_five_am
It seems that whenever I deploy, with the start_date set to 5am the previous day, it always fires at the next 6am, no matter what time I deploy the DAG or do an Airflow update.
Your confusion may be because you expect Airflow to schedule DAGs the way cron does, which it doesn't.
The first DAG Run is created based on the minimum start_date for the tasks in your DAG. Subsequent DAG Runs are created by the scheduler process, based on your DAG's schedule_interval, sequentially. Airflow schedules tasks at the END of the interval (see docs); you can view this answer for examples.
As for your sample code - never set your start_date to be dynamic. It's a bad practice that can lead to the DAG never being executed, because now() keeps moving, so now() + interval may never be reached; see the Airflow FAQ.
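A minimal sketch of the static alternative, assuming Airflow 2.x (the dag_id and start date are illustrative): pin start_date to a fixed point in the DAG's timezone and let the cron expression produce the 6am trigger, remembering that the run for a given day is started at the end of that day's interval, i.e. at the next 6am.
import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

local_tz = pendulum.timezone("America/Los_Angeles")

with DAG(
    dag_id="daily_6am_pst",  # illustrative name
    start_date=pendulum.datetime(2021, 1, 1, tz=local_tz),  # static date in the past
    schedule_interval="0 6 * * *",  # interpreted in the DAG's timezone
    catchup=False,  # don't backfill runs for intervals before deployment
) as dag:
    t1 = DummyOperator(task_id="task_1")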
I need some help understanding the behaviour of the monthly cron expression [43 10 3,8,12 */3 *] with start_date as datetime(year=2019, month=11, day=18, hour=1, minute=30, second=0, tzinfo=pendulum.timezone("UTC")) and end_date as None. This has backfill set as true.
Current Date is: 2020-10-19
As per my understanding it should not have triggered the last two runs, 10-03 and 10-08. Can someone please help me understand this behavior? Also, if it is triggering runs for the execution_dates of 10-03 and 10-08, then why not for 10-12?
Could you elaborate on "it should not have triggered the last two runs"?
The cron expression 43 10 3,8,12 */3 * matches:
“At 10:43 on day-of-month 3, 8, and 12 in every 3rd month.”
A good tool to validate cron expressions is crontab.guru.
The run with execution date 10-12 hasn't been triggered yet because of how Airflow handles execution_date - see airflow scheduler:
The scheduler won’t trigger your tasks until the period it covers has ended e.g., A job with schedule_interval set as #daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the dag is executed. In the UI, it appears as if Airflow is running your tasks a day late
Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
This means the run with execution date 2020-10-12 10:43:00 will be triggered shortly after 2021-01-03 10:43:00, once that interval has ended.
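You can sanity-check those schedule points with a minimal sketch using the croniter package (which Airflow itself depends on); the expression and start date below are taken from the question:
from datetime import datetime

from croniter import croniter

# Walk the cron expression forward from the DAG's start_date
it = croniter("43 10 3,8,12 */3 *", datetime(2019, 11, 18, 1, 30))
for _ in range(8):
    print(it.get_next(datetime))
# Prints 2020-01-03 10:43, 2020-01-08 10:43, 2020-01-12 10:43, 2020-04-03 10:43, ...
# Each run is triggered once the interval ending at the *next* point has passed.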
I am having issues scheduling my DAG runs properly, especially in terms of schedule_interval & start_date.
Airflow's default behavior doesn't suit my use case: if I set an arbitrarily early date for start_date, Airflow will always trigger the initial run outside the "intuitive" first schedule (e.g. if I set my schedule_interval to run at 12AM every Wednesday & Friday, and I created this DAG on a Monday with an arbitrarily early start_date many weeks back, there is a run at the exact instant the DAG is created on the Monday, which I do not want).
I understand from the references that the scheduled run happens one schedule_interval cycle later. While I have looked at How to work correctly airflow schedule_interval & https://airflow.apache.org/docs/1.10.6/scheduler.html#backfill-and-catchup, they generally show examples with regular intervals (e.g. hourly, daily, monthly, etc.).
However, in my scenario, I have two DAGs with schedule_intervals of 0 16 * * 1,2,0 and 0 16 * * 1,3,5. As the intervals between runs are irregular, how do I set my start_date such that:
The first run happens exactly on the next datetime when it is "intuitive" to run? e.g.
if I create these 2 DAGs on a Saturday, I would want my DAGs to make their first run at 1600h on Sunday & Monday respectively
if I create these 2 DAGs on Monday 1700h, I would want my DAGs to make their first run at 1600h on Tuesday & Wednesday respectively
I'm trying to export DAG statistics out of Airflow. The statsd output is not very useful, so I decided to basically run a DAG to query the metadata database and export the results to, say, InfluxDB.
It's easy enough to create a DAG to query the Postgres Airflow database; however, I'm a little stumped by the schema. I would have thought:
select run_id, start_date, end_date from dag_run where dag_id= 'blah';
would do it, but end_date never appears to be populated.
All I'm really after is the total time from when the DAG run started (when the first task is initiated, as opposed to when the run is first put into a running state) to when the DAG went into a success state.
Try hitting the task_instance table:
SELECT execution_date
     , MIN(start_date) AS "start"
     , MAX(end_date) AS "end"
     , MAX(end_date) - MIN(start_date) AS duration
FROM task_instance
WHERE dag_id = 'blah'
  AND state = 'success'
GROUP BY execution_date
ORDER BY execution_date DESC;
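If the goal is to push those numbers to InfluxDB from inside a DAG, here is a minimal sketch of the query step using PostgresHook; the connection id airflow_db, the default dag_id, and the export step are illustrative assumptions:
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Same aggregation as above, parameterized by dag_id
DURATION_SQL = """
    SELECT execution_date,
           MIN(start_date) AS "start",
           MAX(end_date) AS "end",
           MAX(end_date) - MIN(start_date) AS duration
    FROM task_instance
    WHERE dag_id = %(dag_id)s AND state = 'success'
    GROUP BY execution_date
    ORDER BY execution_date DESC
"""

def export_dag_durations(dag_id="blah", **context):
    # "airflow_db" is an assumed connection pointing at the metadata database
    hook = PostgresHook(postgres_conn_id="airflow_db")
    rows = hook.get_records(DURATION_SQL, parameters={"dag_id": dag_id})
    for execution_date, start, end, duration in rows:
        # Replace the print with a write to InfluxDB (or any other sink)
        print(dag_id, execution_date, duration.total_seconds())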