I have scheduled a DAG in Airflow to run every 10 minutes, but it is not doing anything.
here is my dags code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('Python_call', default_args=default_args, schedule_interval='*/10 * * * *')

t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /var/www/projects/python_airflow/airpy/hello.py',
    dag=dag)
and the scheduler log looks like this:
[2018-01-05 14:05:08,536] {jobs.py:351} DagFileProcessor484 INFO - Processing /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py took 2.278 seconds
[2018-01-05 14:05:09,712] {jobs.py:343} DagFileProcessor485 INFO - Started process (PID=29795) to work on /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py
[2018-01-05 14:05:09,715] {jobs.py:534} DagFileProcessor485 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2018-01-05 14:05:09,717] {jobs.py:1521} DagFileProcessor485 INFO - Processing file /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py for tasks to queue
[2018-01-05 14:05:09,717] {models.py:167} DagFileProcessor485 INFO - Filling up the DagBag from /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py
[2018-01-05 14:05:10,057] {jobs.py:1535} DagFileProcessor485 INFO - DAG(s) dict_keys(['example_passing_params_via_test_command', 'latest_only_with_trigger', 'example_branch_operator', 'example_subdag_operator', 'latest_only', 'example_skip_dag', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'tutorial', 'example_http_operator', 'example_trigger_controller_dag', 'example_bash_operator', 'example_python_operator', 'test_utils', 'Python_call', 'example_trigger_target_dag', 'example_xcom', 'example_short_circuit_operator', 'example_branch_dop_operator_v3']) retrieved from /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py
[2018-01-05 14:05:12,039] {jobs.py:1169} DagFileProcessor485 INFO - Processing Python_call
[2018-01-05 14:05:12,048] {jobs.py:566} DagFileProcessor485 INFO - Skipping SLA check for <DAG: Python_call> because no tasks in DAG have SLAs
[2018-01-05 14:05:12,060] {models.py:322} DagFileProcessor485 INFO - Finding 'running' jobs without a recent heartbeat
[2018-01-05 14:05:12,061] {models.py:328} DagFileProcessor485 INFO - Failing jobs without heartbeat after 2018-01-05 14:00:12.061146
and the command-line output of airflow scheduler:
[2018-01-05 14:31:20,496] {dag_processing.py:627} INFO - Started a process (PID: 32222) to generate tasks for /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py - logging into /var/www/projects/python_airflow/airpy/airflow_home/logs/scheduler/2018-01-05/scheduler.py.log
[2018-01-05 14:31:23,122] {jobs.py:1002} INFO - No tasks to send to the executor
[2018-01-05 14:31:23,123] {jobs.py:1440} INFO - Heartbeating the executor
[2018-01-05 14:31:23,123] {jobs.py:1450} INFO - Heartbeating the scheduler
[2018-01-05 14:31:24,243] {jobs.py:1404} INFO - Heartbeating the process manager
[2018-01-05 14:31:24,244] {dag_processing.py:559} INFO - Processor for /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py finished
Airflow is an ETL/data-pipelining tool. This means it is meant to execute things over periods that have already "gone by". E.g. using:
Task parameter 'start_date': datetime(2018,1,4)
Default DAG parameter schedule_interval='@daily'
means that the DAG won't run until a whole schedule interval unit (one day) has passed since the start date, i.e. when the Airflow server time reaches datetime(2018,1,5).
Since you have a start_date of datetime.now() with a @daily interval (which again is the default), that condition is never fulfilled (refer to the official FAQ).
You can change the start_date parameter to, e.g., yesterday using timedelta to get a relative start_date earlier than today (although this is not recommended). I would advise using 'start_date': datetime(2018,1,1) and adding schedule_interval='@once' to the DAG parameters for test purposes. This should get your DAG to run.
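For illustration, a minimal sketch of your DAG with a fixed start_date in the past, assuming the same hello.py path from your question; switch '@once' back to '*/10 * * * *' once you have confirmed a run gets triggered:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Fixed date in the past instead of datetime.now(),
    # so a full schedule interval can elapse after it.
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# '@once' triggers a single test run; afterwards change this back
# to '*/10 * * * *' for the 10-minute schedule.
dag = DAG('Python_call', default_args=default_args, schedule_interval='@once')

t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /var/www/projects/python_airflow/airpy/hello.py',
    dag=dag)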
Related
I have been reading a lot about logging in Airflow and experimenting a lot, but could not achieve what I am looking for. I want to customize the logging for Airflow. I have a lot of DAGs, and each DAG has multiple tasks corresponding to one alert_name. The DAGs run hourly and push logs to S3. If something goes wrong and debugging is required, it is very tough to look for logs in S3. I want to customize the logs so that log lines can be searched by RUN_ID and alert_name.
I have the following piece of code.
from airflow import DAG  # noqa
from datetime import datetime
from datetime import timedelta
from airflow.operators.python_operator import PythonOperator
import y
import logging

log = logging.getLogger(__name__)

default_args = {
    'owner': 'SRE',
    'execution_timeout': timedelta(minutes=150)
}

dag = DAG(
    dag_id='new_dag',
    default_args=default_args,
    start_date=datetime(year=2021, month=11, day=22),
    schedule_interval=timedelta(days=1),
    catchup=False,
    max_active_runs=3,
)

def implement_alert_logic(alert_name):
    log.info(f'In the implementation for {alert_name}')
    pass

def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f'Executing logic for {alert}')
        implement_alert_logic(alert)

t1 = PythonOperator(
    task_id='testing_this',
    python_callable=myfunc,
    provide_context=True,
    dag=dag)

t2 = PythonOperator(
    task_id='testing_this2',
    python_callable=myfunc,
    provide_context=True,
    dag=dag)

t1 >> t2
It prints something like
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_1
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_1
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_2
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_2
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_3
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_3
Actual code is much more complex and sophisticated than this. That's why I need faster and more customized debugging logs.
What I am trying to achieve is to customize the log formatter and add RUN_ID and alert_name as part of the log message.
Logs should be something like this:
[2022-06-13, 08:16:54 UTC] [manual__2022-06-13T08:16:54.103265+00:00] {myenv.py:32} INFO - Executing logic for alert_1
[2022-06-13, 08:16:54 UTC] [manual__2022-06-13T08:16:54.103265+00:00] [alert1]{myenv.py:32} INFO - In the implementation for alert_1
You are already sending the context to the callable; just make use of it:
def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f"[{wargs['run_id']}] [{alert}] Executing logic for {alert}")
        implement_alert_logic(alert)
Then pass the whole context, or just the run_id, on to the implement_alert_logic function as well.
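For example, a minimal sketch of what that could look like, reusing the implement_alert_logic and myfunc names and the module-level log from your DAG file; the exact bracketed prefix format is up to you:

def implement_alert_logic(alert_name, run_id=None):
    # Prefix every line with the run_id and alert name so the
    # logs shipped to S3 can be searched by either value.
    log.info(f"[{run_id}] [{alert_name}] In the implementation for {alert_name}")

def myfunc(**wargs):
    run_id = wargs['run_id']
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f"[{run_id}] [{alert}] Executing logic for {alert}")
        implement_alert_logic(alert, run_id=run_id)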
Problem: We're having difficulty getting a DAG to fire off on a defined interval. It's also preventing manual DAG executions. We've added catchup=False to the DAG definition as well.
Context: We're planning to have a DAG execute on a 4HR interval from M-F. We've defined this behavior using the following CRON expression:
0 0 0/4 ? * MON,TUE,WED,THU,FRI *
We're unsure at this time whether the interval has been defined properly or if there are extra parameters we're ignoring.
Any help would be appreciated.
I think what you are looking for is 0 0/4 * * 1-5, which will run every 4th hour from 0 through 23 on every day-of-week from Monday through Friday.
Your DAG object can be:
from airflow import DAG
from datetime import datetime

with DAG(
    dag_id="my_dag",
    start_date=datetime(2022, 2, 22),
    catchup=False,
    schedule_interval="0 0/4 * * 1-5",
) as dag:
    # add your operators here
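For completeness, a sketch with a placeholder task filled in, assuming Airflow 2.x (where BashOperator lives in airflow.operators.bash); the task_id and command are only illustrative:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="my_dag",
    start_date=datetime(2022, 2, 22),
    catchup=False,
    schedule_interval="0 0/4 * * 1-5",
) as dag:
    # placeholder task so the DAG parses and can also be triggered manually
    hello = BashOperator(
        task_id="hello",
        bash_command="echo 'running on the 4-hourly weekday schedule'",
    )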
I have an Airflow DAG which is scheduled to run once a month. The DAG ran wonderfully until 01-08-2021. The next schedule was on 01-09-2021. The DAG and the task were in SUCCESS state, but the DAG actually did not run. I couldn't see the start and end dates for the task instance. There are no logs either. Any help is appreciated. Thanks!
Airflow version: 1.10.6
Task Instances: all are in SUCCESS state
DAG details:
schedule_interval: 0 6 1 * *
max_active_runs: 0 / 16
concurrency: 16
default_args: {'provide_context': True, 'depends_on_past': False, 'start_date': <Pendulum [2021-05-01T00:00:00+00:00]>, 'retries': 1, 'catchup_by_default': False, 'retry_delay': datetime.timedelta(0, 300)}
tasks count: 1
task ids: ['GENERATE_INVOICE']
filepath: invoicing.py
owner: airflow
For data migrations, I have created a DAG which ultimately inserts data into a migration table after all the tasks with the required logic.
The DAG has a SQL query, similar to the one below, which initially extracts the data and feeds it to the other tasks:
sql=" select col_names from tables where created_on >=date1 and created_on <=date2"
For each DAG run I am manually changing date1 and date2 in the above SQL and initiating the data migration (as the data chunk is heavy, the date range length is currently 1 week).
I just want to automate this date-changing process, e.g. if I give date intervals, then after the first DAG run finishes the second run is initiated, and so on until the end of the date interval.
From my research so far, one solution I found was dynamic DAGs in Airflow. But the problem is that it creates multiple DAG file instances, and it is also very difficult to debug and maintain.
Is there a way to repeat a DAG with a changing date parameter so that I no longer have to keep changing the dates manually?
I had the exact same issue! Backfilling in Airflow doesn't seem to make any sense if you don't have the DAG interval start and end as input parameters. If you want to do data migration, you'll probably need to store your last migration time in a file to read. However, this goes against some of the properties an Airflow DAG/task should have (idempotence).
My solution was to add two tasks to my DAG before the start of my "main" tasks. I have two operators (you can possibly make it one) which gets the start and end times of the current DAG run. The "start" and "end" names are sort of misleading because the "start" is actually the start of the previous run and "end" the start of the current run.
I can't reveal the custom operator I wrote but you can do this in a single Python operator:
from croniter import croniter
from datetime import datetime
from airflow.operators.python_operator import PythonOperator

def get_interval_start_end(**kwargs):
    dag = kwargs['dag']
    ti = kwargs['ti']
    dag_execution = ti.execution_date  # current DAG scheduled start
    dag_interval = dag._schedule_interval  # note the preceding underscore
    cron_iter = croniter(dag_interval, dag_execution)
    # pass datetime so get_prev returns a datetime rather than a float timestamp
    dag_prev_execution = cron_iter.get_prev(datetime)
    return (dag_execution, dag_prev_execution)

# dag
task = PythonOperator(task_id='blabla',
                      python_callable=get_interval_start_end,
                      provide_context=True)
# other tasks
Then pull these values from xcom in your next task.
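For instance, a rough sketch of pulling those values in a downstream task, assuming the task_id 'blabla' from the snippet above and the PythonOperator import from your DAG file; the SQL string is just the placeholder from the question:

def run_migration(**kwargs):
    ti = kwargs['ti']
    # returns the (dag_execution, dag_prev_execution) tuple pushed above
    end_date, start_date = ti.xcom_pull(task_ids='blabla')
    sql = ("select col_names from tables "
           f"where created_on >= '{start_date}' and created_on <= '{end_date}'")
    # execute sql here with the DB hook of your choice

migrate = PythonOperator(task_id='run_migration',
                         python_callable=run_migration,
                         provide_context=True)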
There is also a way to get the "last_run" of the DAG using dag.get_last_dagrun() instead. However, it doesn't return the previous scheduled run but the previous actual run. If you have already run your DAG for a "future" time, your "last dag run" will be after your current execution! Then again, I might not have tested with the right settings, so you can try that out first.
I had a similar requirement, and here is how I accessed the dates, which can later be used in SQLs for backfill.
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from datetime import datetime, timedelta

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'end_date': datetime(2020, 8, 3),
    'retries': 0,
}

dag = DAG('helloWorld_v1', default_args=default_args, catchup=True, schedule_interval='0 1 * * *')

def print_dag_run_date(**kwargs):
    print(kwargs)
    execution_date = kwargs['ds']
    prev_execution_date = kwargs['prev_ds']
    return (execution_date, prev_execution_date)

# t1, t2 are examples of tasks created using operators
bash = BashOperator(
    task_id='bash',
    depends_on_past=True,
    bash_command='echo "Hello World from Task 1"',
    dag=dag)

py = PythonOperator(
    task_id='py',
    depends_on_past=True,
    python_callable=print_dag_run_date,
    provide_context=True,
    dag=dag)

py.set_upstream(bash)
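Building on that, a rough sketch of how those ds/prev_ds dates could be substituted into the migration SQL from the original question; the task and function names here are made up, and the SQL stays the placeholder from the question:

def build_migration_sql(**kwargs):
    # prev_ds / ds are the previous and current execution dates as YYYY-MM-DD strings
    date1 = kwargs['prev_ds']
    date2 = kwargs['ds']
    sql = ("select col_names from tables "
           f"where created_on >= '{date1}' and created_on <= '{date2}'")
    return sql  # pushed to XCom, so a downstream task can pull and execute it

build_sql = PythonOperator(
    task_id='build_migration_sql',
    python_callable=build_migration_sql,
    provide_context=True,
    dag=dag)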
I am trying to have an Airflow script scheduled to run every Tuesday at 9:10 AM UTC. Given below is how I have defined it.
dag = DAG(
    dag_id=DAG_NAME,
    default_args=args,
    schedule_interval="10 9 * * 2",
    catchup=False
)
I however find that when the time comes, the script does not get triggered automatically. However, if I do not have a value defined in the day-of-week column (the last column), the scheduler works fine. Any idea where I am going wrong?
Thanks
Update:
args = {
    'owner': 'admin',
    'start_date': airflow.utils.dates.days_ago(9)
}

dag = DAG(
    dag_id=DAG_NAME,
    default_args=args,
    schedule_interval="10 9 * * 2",
    catchup=False
)
This one stumps people more than anything else in Airflow, but as the commenter and the Airflow documentation state:
The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
In this case, you can either bump your DAG start_date back by one schedule_interval or wait for the next schedule_interval to complete.
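A minimal sketch of that first option, with a static start_date far enough in the past that a full weekly interval has already elapsed (the date shown is just an example):

from airflow import DAG
from datetime import datetime

args = {
    'owner': 'admin',
    # static date at least one schedule_interval (one week) in the past,
    # instead of a moving days_ago() value
    'start_date': datetime(2021, 1, 1)
}

dag = DAG(
    dag_id=DAG_NAME,  # assumes DAG_NAME is defined elsewhere, as in the question
    default_args=args,
    schedule_interval="10 9 * * 2",
    catchup=False
)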