I am trying out Airflow on my local Windows machine and have noticed that the Airflow UI refreshes continuously and takes a long time to show the DAG. I also cannot figure out what is wrong with my code: the DAG is published, but its graph is not displayed. In addition, the ON/OFF toggle resets itself for this DAG only, whereas code I copied from Google for another scenario works fine. Can someone tell me what I am missing here?
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime
def _check_value():
    return "Monday"
with DAG(
    "first_dag",
    start_date=datetime(2022, 10, 15),
    schedule_interval='10 3-4 * * *',
    catchup=False
) as dag:
    task_1 = PythonOperator(task_id="task_1", python_callable=_check_value)
    task_2 = PythonOperator(task_id="task_2", python_callable=_check_value)
    task_1 >> task_2
Related
I have a DAG like the one below:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import PythonVirtualenvOperator
import platform
def get_info():
    from modelsFile import model_config
    print("version: ")
dag = DAG('sample', description='Sample DAG',
          schedule_interval=None,
          start_date=datetime(2022, 5, 1), catchup=False)
get_info_operator = PythonVirtualenvOperator(task_id='get_info_task', python_callable=get_info, dag=dag)
get_info_operator
Since I am using PythonVirtualenvOperator, I need to put all the dependencies inside the python_callable function.
But when I try to import the class model_config from the file modelsfile, it throws the error "ModuleNotFoundError: No module named 'modelsfile'".
When I switch from PythonVirtualenvOperator to PythonOperator, it works fine.
Can anyone help me solve this issue with the Airflow PythonVirtualenvOperator?
Airflow is running, but the task is stuck in the queued state.
I ran the airflow scheduler.
Here are my code and screenshots of the Airflow UI.
Can anyone explain what the problem might be?
import datetime as dt
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import pandas as pd
def CSVToJson():
    df = pd.read_csv('/Users/daeyong/Desktop/Projects/Python/airflow2/file.csv')
    # iterrows() yields (index, row) pairs
    for i, r in df.iterrows():
        print(r['name'])
    # 'records' writes one JSON object per row
    df.to_json('fromAirflow.json', orient='records')
default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2022, 3, 10),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5)
}
with DAG('MyCSVDAG',
         default_args=default_args,
         schedule_interval=timedelta(minutes=5),
         # '0 * * * *',
) as dag:
print_starting = BashOperator(task_id='starting',
bash_command='echo "I am reading the CSV now....."')
CSVJson = PythonOperator(task_id='convertCSVtoJson', python_callable=CSVToJson)
print_starting >> CSVJson
[Screenshot: airflow_screenshot_1]
[Screenshot: airflow_screenshot_2]
Without more context, there are two possible reasons:
1. Your default pool does not have any slots assigned or available.
2. Your task declarations need to be indented so that they fall within the "with DAG" statement.
Scheduler logs and an image of your pools page would help narrow this down.
I have been trying to write a simple line of text to a local txt file through a DAG script. Even though the task runs successfully, I cannot seem to find the file anywhere. Is it because I am using WSL on Windows?
Here is my simple script:
import os
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 12, 5),
    "retries": 0,
}
dag = DAG(
    "simple_dag",
    default_args=default_args,
    schedule_interval="@once",
)
t1 = BashOperator(
    task_id="print_file",
    bash_command='echo "pipeline" > opDE.txt',
    dag=dag)
t1
You will need to define a path for the output file.
When Airflow executes your command, it runs it from a temporary location, so a file written with a relative path ends up in that location. You can also see this in the task log.
So the fix is to write the file to the desired path explicitly.
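To illustrate the point about relative paths, here is a minimal sketch that uses only the Python standard library rather than Airflow itself. It assumes a POSIX shell (as under WSL). The two temporary directories are hypothetical stand-ins: scratch_dir plays the role of the temporary working directory, and target_dir plays the role of the location where you actually want the file.

```python
import os
import shlex
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir, \
        tempfile.TemporaryDirectory() as target_dir:
    # Relative path: the shell resolves it against its working directory,
    # so the file lands in scratch_dir, not next to the script.
    subprocess.run('echo "pipeline" > opDE.txt',
                   shell=True, cwd=scratch_dir, check=True)
    relative_landed_in_scratch = os.path.exists(
        os.path.join(scratch_dir, "opDE.txt"))

    # Absolute path: the working directory no longer matters.
    out_path = os.path.join(target_dir, "opDE.txt")
    subprocess.run(f'echo "pipeline" > {shlex.quote(out_path)}',
                   shell=True, cwd=scratch_dir, check=True)
    absolute_landed_in_target = os.path.exists(out_path)
    with open(out_path) as fh:
        content = fh.read()

print(relative_landed_in_scratch, absolute_landed_in_target, content.strip())
```

In the DAG from the question, the same idea would mean giving bash_command an absolute path of your choosing in place of the bare opDE.txt.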
The DAG constructor has a "schedule_interval" attribute that accepts a cron expression. I think cron has a limitation here: we cannot make a job/task run every 25 consecutive days. Below is the cron expression to run a job on every 25th day of the month.
5 10 */25 * *
But I need a job/DAG that runs every 25 days, counted consecutively across months. Is there a way to schedule the DAG to satisfy this requirement?
You can set schedule_interval using datetime.timedelta.
For example, to schedule a DAG to run for the first time in 25 days from today at 10:05 CET and then run every 25 days, the DAG script can be specified like this:
import pendulum
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'Airflow',
    'start_date': datetime(
        2019, 11, 24, 10, 5, tzinfo=pendulum.timezone('Europe/Berlin')
    ),
}
with DAG(
    'my_dag', schedule_interval=timedelta(days=25), default_args=default_args,
) as dag:
    op = DummyOperator(task_id='dummy')
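To see why a timedelta fits this requirement better than a cron day-of-month step, the following sketch compares the spacing of the two schedules using only the standard library. It is an illustration under assumptions, not Airflow's scheduler: the helper names and the 100-day horizon are made up for this example, and the cron helper assumes the usual cron behaviour where a day-of-month field of */25 matches days 1 and 26.

```python
from datetime import date, timedelta

START = date(2019, 11, 24)  # date part of the start_date used above
HORIZON_DAYS = 100          # arbitrary look-ahead window for the comparison


def fixed_interval_dates(start, every_days, horizon_days):
    """Dates spaced exactly `every_days` apart, like a timedelta schedule."""
    return [start + timedelta(days=n)
            for n in range(0, horizon_days + 1, every_days)]


def cron_day_step_dates(start, step, horizon_days):
    """Dates whose day-of-month matches a cron `*/step` field (1, 1+step, ...)."""
    days = (start + timedelta(days=n) for n in range(horizon_days + 1))
    return [d for d in days if (d.day - 1) % step == 0]


def gaps(dates):
    """Number of days between consecutive dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


interval_gaps = gaps(fixed_interval_dates(START, 25, HORIZON_DAYS))
cron_dates = cron_day_step_dates(START, 25, HORIZON_DAYS)
cron_gaps = gaps(cron_dates)

print("timedelta gaps:", interval_gaps)  # [25, 25, 25, 25]
print("cron */25 gaps:", cron_gaps)      # [5, 25, 6, 25, 6, 25, 4]
```

The cron pattern restarts its count at the beginning of every month, so the gaps alternate between 25 days and a few days, while the fixed interval keeps the 25-day spacing across month boundaries.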
I've created a DAG that is scheduled to run every 5 minutes using cron syntax.
I also created a pool for this DAG, with a single slot only.
I've tried restarting the server/scheduler and resetting the database. Currently, the DAG runs in UTC. I've also tried setting my local timezone, which is 'Europe/Minsk' (UTC+3), but it has no effect.
import random
import time
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'pool': 'download',
    # 'priority_weight': 10,
    # 'queue': 'bash_queue',
}
params = {
    'table': 'api_avitoimage',
}
dag = DAG(
    dag_id='test_download_avitoimage',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
)
def sleep_for_a_bit(random_base):
    time.sleep(random_base)
with dag:
    download = BashOperator(
        task_id='download',
        bash_command='/usr/bin/python3 /home/artur/downloader.py --table {{ params.table }}',
        params=params,
        dag=dag)
    sleep = PythonOperator(
        task_id='sleep_for_a_bit',
        python_callable=sleep_for_a_bit,
        op_kwargs={'random_base': random.uniform(0, 1)},
        dag=dag,
    )
    download >> sleep
Issue: the DAG runs about 2-3 times per minute, which is not the schedule I configured.
EDITED: It turns out that there are 16/16 simultaneously active DAG runs, but I cannot understand where this "magic number 16" comes from.
By default, Airflow tries to complete all "missed" DAG runs since start_date. As your start_date is set to airflow.utils.dates.days_ago(2), Airflow is going to run the DAG 576 times before it starts launching runs on schedule. You can turn this off by adding catchup=False to your DAG definition (not default_args).
The magic number 16 comes from the parameter max_active_runs_per_dag = 16, which is the default.
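The figure of 576 can be checked with a little arithmetic. The sketch below uses only the standard library; the variable names are chosen for this example, and the values come from the question (a start_date two days back, one run every 5 minutes) and from the default quoted above.

```python
from datetime import timedelta

# Values from the question and the answer above.
backfill_window = timedelta(days=2)      # start_date = days_ago(2)
schedule_period = timedelta(minutes=5)   # schedule_interval = '*/5 * * * *'
max_active_runs_per_dag = 16             # default quoted in the answer

# timedelta // timedelta yields the integer number of whole intervals.
missed_runs = backfill_window // schedule_period

# At most this many of the missed runs can be active at once.
concurrent_runs = min(missed_runs, max_active_runs_per_dag)

print(f"missed runs to catch up: {missed_runs}")            # 576
print(f"runs active at the same time: {concurrent_runs}")   # 16
```

This matches what the question describes: a backlog of 576 runs being worked through, 16 at a time, which looks like the DAG firing several times per minute.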