How to make a DAG run every 25 days in GCP? - cron

The DAG constructor has a "schedule_interval" attribute that accepts a cron expression. I think cron has a limitation here: it cannot run a job every 25 consecutive days. The cron expression below runs a job at 10:05 on fixed days of the month (with */25, days 1 and 26), not on a rolling 25-day interval:
5 10 */25 * *
But I need the job/DAG to run every 25 consecutive days. Is there a way to schedule the DAG to meet this requirement?

You can set schedule_interval to a datetime.timedelta instead of a cron expression.
For example, to have the DAG run for the first time 25 days from today at 10:05 CET and then every 25 days after that, the DAG script can look like this:
import pendulum
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'Airflow',
    'start_date': datetime(
        2019, 11, 24, 10, 5, tzinfo=pendulum.timezone('Europe/Berlin')
    ),
}
with DAG(
    'my_dag', schedule_interval=timedelta(days=25), default_args=default_args,
) as dag:
    op = DummyOperator(task_id='dummy')
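To see why the cron expression and the timedelta behave differently, here is a small plain-Python sketch (no Airflow needed). The helper names cron_step_days and interval_run_dates are made up for illustration, and the datetimes are naive rather than Europe/Berlin to keep it short. It only lists the points on each schedule; it does not try to model exactly when Airflow triggers a run.

```python
from datetime import datetime, timedelta
from typing import List


def cron_step_days(step: int, first_day: int = 1, last_day: int = 31) -> List[int]:
    """Days of the month matched by a '*/step' day-of-month cron field."""
    return list(range(first_day, last_day + 1, step))


def interval_run_dates(start: datetime, interval: timedelta, count: int) -> List[datetime]:
    """First `count` schedule points spaced exactly `interval` apart, starting at `start`."""
    return [start + i * interval for i in range(count)]


# '*/25' in the day-of-month field restarts its count every month.
cron_days = cron_step_days(25)

# Naive datetimes for simplicity; the answer above attaches Europe/Berlin instead.
points = interval_run_dates(datetime(2019, 11, 24, 10, 5), timedelta(days=25), 4)

print("cron */25 matches days of month:", cron_days)
for p in points:
    print("timedelta schedule point:", p.isoformat())
```

The cron field is tied to the calendar month, while the timedelta keeps a constant 25-day gap across month boundaries.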

Related

Airflow DAG issue

I am trying out Airflow on my local Windows machine. The Airflow UI refreshes continuously and takes a long time to show the DAG. I also cannot figure out what is wrong with my code: the DAG is published, but its graph is not displayed, and the ON/OFF toggle resets itself for this DAG only. Code I copied from Google for another scenario works fine. Can someone tell me what I am missing here?
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
def _check_value():
    return "Monday"
with DAG(
    "first_dag",
    start_date=datetime(2022, 10, 15),
    schedule_interval='10 3-4 * * *',
    catchup=False
) as dag:
    task_1 = PythonOperator(task_id="task_1", python_callable=_check_value)
    task_2 = PythonOperator(task_id="task_2", python_callable=_check_value)
    task_1 >> task_2

Airflow task is stuck in status queued

Airflow is running, but the task is stuck in the queued state.
I have started the airflow scheduler.
Here are my code and screenshots of the Airflow UI.
Can anyone explain what the problem might be?
import datetime as dt
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import pandas as pd
def CSVToJson():
    df = pd.read_csv('/Users/daeyong/Desktop/Projects/Python/airflow2/file.csv')
    # iterrows() yields (index, row) pairs
    for i, r in df.iterrows():
        print(r['name'])
    # orient='records' writes a JSON array of row objects
    df.to_json('fromAirflow.json', orient='records')
default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2022, 3, 10),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5)
}
with DAG('MyCSVDAG',
default_args=default_args,
schedule_interval=timedelta(minutes=5),
# '0 * * * *',
) as dag:
print_starting = BashOperator(task_id='starting',
bash_command='echo "I am reading the CSV now....."')
CSVJson = PythonOperator(task_id='convertCSVtoJson', python_callable=CSVToJson)
print_starting >> CSVJson
[Screenshots: airflow_screenshot_1, airflow_screenshot_2]
Without more context, there are two possible reasons:
1. Your default pool does not have any slots assigned or available.
2. Your task declarations need to be indented so that they fall within the "with DAG" statement.
Scheduler logs and a screenshot of your Pools page would help narrow this down.

Writing txt file to disk in Airflow is not working

I have been trying to write some simple text to a local txt file from a DAG script. Even though the task runs successfully, I cannot find the file anywhere. Is it because I am using WSL on Windows?
Here is my simple script:
import os
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 12, 5),
    "retries": 0,
}
dag = DAG(
    "simple_dag",
    default_args=default_args,
    schedule_interval="@once",
)
t1 = BashOperator(
    task_id="print_file",
    bash_command='echo "pipeline" > opDE.txt',
    dag=dag)
t1
You will need to define a path for the output file.
When Airflow executes your command, it runs it from a temporary location, so the file is written there rather than next to your DAG. You can also see this in the task log.
So the fix is to write the file to an explicit path of your choice.
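The effect can be reproduced without Airflow. The sketch below is only an illustration using the standard library: it runs the same echo command from a fresh temporary working directory (standing in for the temporary location the answer describes) and then again with an explicit target path. The function names and the "desired_output" directory are hypothetical, and it assumes a POSIX-style shell.

```python
import subprocess
import tempfile
from pathlib import Path

# Same shell command as in the question: the output path is relative.
RELATIVE_CMD = 'echo "pipeline" > opDE.txt'


def run_in_temp_dir(command: str) -> Path:
    """Run a shell command from a fresh temporary directory and return that directory."""
    work_dir = Path(tempfile.mkdtemp(prefix="airflowtmp"))
    subprocess.run(command, shell=True, cwd=work_dir, check=True)
    return work_dir


def run_with_explicit_path(target: Path) -> None:
    """Run the same echo from a temporary directory, redirecting to an absolute path."""
    work_dir = tempfile.mkdtemp(prefix="airflowtmp")
    subprocess.run(f'echo "pipeline" > "{target}"', shell=True, cwd=work_dir, check=True)


temp_dir = run_in_temp_dir(RELATIVE_CMD)
relative_result = temp_dir / "opDE.txt"  # the file lands inside the temp dir

output_dir = Path(tempfile.mkdtemp(prefix="desired_output"))  # hypothetical target dir
explicit_result = output_dir / "opDE.txt"
run_with_explicit_path(explicit_result)

print("relative path wrote to:", relative_result)
print("explicit path wrote to:", explicit_result)
```

In the DAG itself, the equivalent change would be to put the full target path in bash_command instead of the bare file name.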

Airflow's DAG runs multiple times in one minute, although it was scheduled to run every 5 minutes

I've created a DAG that is scheduled to run every 5 minutes using cron syntax.
I also created a pool for this DAG with a single slot.
I've tried restarting the server/scheduler and resetting the database. Currently, the DAG runs in UTC. I've also tried setting my local timezone, 'Europe/Minsk' (UTC+3), but it has no effect.
import random
import time
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'pool': 'download',
    # 'priority_weight': 10,
    # 'queue': 'bash_queue',
}
params = {
    'table': 'api_avitoimage',
}
dag = DAG(
    dag_id='test_download_avitoimage',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
)
def sleep_for_a_bit(random_base):
    time.sleep(random_base)
with dag:
    download = BashOperator(
        task_id='download',
        bash_command='/usr/bin/python3 /home/artur/downloader.py --table {{ params.table }}',
        params=params,
        dag=dag)
    sleep = PythonOperator(
        task_id='sleep_for_a_bit',
        python_callable=sleep_for_a_bit,
        op_kwargs={'random_base': random.uniform(0, 1)},
        dag=dag,
    )
    download >> sleep
Issue: the DAG runs about 2-3 times per minute, which is not the intended schedule.
EDITED: It turns out there are 16/16 simultaneously active DAG runs, but I cannot understand where this "magic number 16" comes from.
By default, Airflow tries to complete all "missed" DAG runs since start_date. Because your start_date is set to airflow.utils.dates.days_ago(2), Airflow is going to run the DAG 576 times (2 days of 5-minute intervals) before it starts launching runs on schedule. You can turn this off by adding catchup=False to your DAG definition (not to default_args).
The magic number 16 comes from the max_active_runs_per_dag = 16 configuration parameter, which is the default.
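The 576 figure is just arithmetic on the schedule, which can be checked with a few lines of plain Python. This is a rough sketch: count_missed_intervals is a made-up helper, the "now" value is arbitrary, and days_ago(2) is modeled as exactly two days before it.

```python
from datetime import datetime, timedelta


def count_missed_intervals(start: datetime, now: datetime, interval: timedelta) -> int:
    """Number of complete schedule intervals between `start` and `now`."""
    if now <= start:
        return 0
    return (now - start) // interval


# Hypothetical "now"; days_ago(2) is modeled as exactly two days earlier.
now = datetime(2019, 1, 3)
start = now - timedelta(days=2)

# 2 days * 24 hours * 12 runs per hour
backlog = count_missed_intervals(start, now, timedelta(minutes=5))
print("runs to catch up:", backlog)
```

With catchup=False on the DAG this backlog is skipped; with the default, the backlog is worked through with at most max_active_runs_per_dag runs active at once.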

airflow druid hook not working

I am trying to use the Druid hook to load data from HDFS into Druid. Below is my DAG script:
from datetime import datetime, timedelta
import json
from airflow.hooks import HttpHook, DruidHook
from airflow.operators import PythonOperator
from airflow.models import DAG
def check_druid_con():
    dr_hook = DruidHook(druid_ingest_conn_id='DRUID_INDEX', druid_query_conn_id='DRUID_QUERY')
    dr_hook.load_from_hdfs(
        "druid_airflow", "hdfs://xx.xx.xx.xx/demanddata/demand2.tsv", "stay_date",
        ["channel", "rate"], "2016-12-11/2017-12-13", 1, -1,
        metric_spec=[{"name": "count", "type": "count"}],
        hadoop_dependency_coordinates="org.apache.hadoop:hadoop-client:2.7.3")
default_args = {
    'owner': 'TC',
    'start_date': datetime(2017, 8, 7),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG('druid_data_load', default_args=default_args)
druid_task1 = PythonOperator(task_id='check_druid',
                             python_callable=check_druid_con,
                             dag=dag)
I keep getting the error TypeError: load_from_hdfs() takes at least 10 arguments (10 given). However, I have passed 10 arguments to load_from_hdfs and it still fails. Please help.
Regards
Rahul
The issue was a mismatch between the function signature in the Airflow documentation and the actual definition in the code.
In the documentation it is defined as:
load_from_hdfs(datasource, static_path, ts_dim, columns, intervals, num_shards, target_partition_size, metric_spec=None, hadoop_dependency_coordinates=None)
But the actual function definition, which adds the required query_granularity and segment_granularity parameters, is:
def load_from_hdfs(
        self, datasource, static_path, ts_dim, columns,
        intervals, num_shards, target_partition_size, query_granularity, segment_granularity,
        metric_spec=None, hadoop_dependency_coordinates=None)
Passing the arguments as required by the actual definition resolved the error.
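When the documentation and the installed code may disagree like this, one way to check is to inspect the signature that is actually installed. The sketch below uses a stand-in function with the parameter list quoted above (minus self) rather than the real DruidHook, so it runs without Airflow; required_parameters is a made-up helper. It also suggests where the confusing "10 arguments (10 given)" probably comes from: the call in the question passes self, seven positional values, and two keyword arguments, while the nine required parameters plus self also add up to ten.

```python
import inspect


def load_from_hdfs(datasource, static_path, ts_dim, columns, intervals,
                   num_shards, target_partition_size,
                   query_granularity, segment_granularity,
                   metric_spec=None, hadoop_dependency_coordinates=None):
    """Stand-in with the parameter list quoted in the answer (minus self); does nothing."""


def required_parameters(func):
    """Names of parameters that have no default and must be supplied by the caller."""
    named_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.default is inspect.Parameter.empty and param.kind in named_kinds
    ]


required = required_parameters(load_from_hdfs)
print(len(required), "required:", ", ".join(required))
```

Against a real installation you would pass the hook's own method to the same helper, for example required_parameters(DruidHook.load_from_hdfs), and compare the result with the call you are making.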
