I have been reading a lot about logging in Airflow and experimenting a lot, but I could not achieve what I am looking for. I want to customize the logging for Airflow. I have a lot of DAGs, and each DAG has multiple tasks corresponding to one alert_name. Each DAG runs hourly and pushes logs to S3. If something goes wrong and debugging is required, it is very tough to look for logs in S3. I want to customize the logs so I can search the log lines by RUN_ID and alert_name.
I have the following piece of code.
from airflow import DAG # noqa
from datetime import datetime
from datetime import timedelta
from airflow.operators.python_operator import PythonOperator
import y
import logging
log = logging.getLogger(__name__)
default_args = {
    'owner': 'SRE',
    'execution_timeout': timedelta(minutes=150)
}

dag = DAG(
    dag_id='new_dag',
    default_args=default_args,
    start_date=datetime(year=2021, month=11, day=22),
    schedule_interval=timedelta(days=1),
    catchup=False,
    max_active_runs=3,
)

def implement_alert_logic(alert_name):
    log.info(f'In the implementation for {alert_name}')
    pass

def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f'Executing logic for {alert}')
        implement_alert_logic(alert)

t1 = PythonOperator(
    task_id='testing_this',
    python_callable=myfunc,
    provide_context=True,
    dag=dag)

t2 = PythonOperator(
    task_id='testing_this2',
    python_callable=myfunc,
    provide_context=True,
    dag=dag)
t1 >> t2
It prints something like
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_1
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_1
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_2
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_2
[2022-06-13, 08:16:54 UTC] {myenv.py:32} INFO - Executing logic for alert_3
[2022-06-13, 08:16:54 UTC] {myenv.py:27} INFO - In the implementation for alert_3
The actual code is much more complex and sophisticated than this; that's why I need faster and more customized debugging logs.
What I am trying to achieve is to customize the log formatter and add the RUN_ID and alert_name as part of the log message.
Logs should be something like this:
[2022-06-13, 08:16:54 UTC] [manual__2022-06-13T08:16:54.103265+00:00] {myenv.py:32} INFO - Executing logic for alert_1
[2022-06-13, 08:16:54 UTC] [manual__2022-06-13T08:16:54.103265+00:00] [alert1]{myenv.py:32} INFO - In the implementation for alert_1
You are already passing the context to the callable, so just make use of it:
def myfunc(**wargs):
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f"[{wargs['run_id']}] [{alert}] Executing logic for {alert}")
        implement_alert_logic(alert)
Then pass the whole context, or just the run_id, to the implement_alert_logic function as well.
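For example, a minimal sketch of that idea (reusing the names from the question, and assuming implement_alert_logic is free to take an extra argument) that threads the run_id through so every line is searchable by both RUN_ID and alert_name:
def implement_alert_logic(alert_name, run_id):
    # every message carries the run_id and alert_name, so the S3 logs can be grepped by either
    log.info(f'[{run_id}] [{alert_name}] In the implementation for {alert_name}')

def myfunc(**wargs):
    run_id = wargs['run_id']  # provided by Airflow via the task context
    for alert in ['alert_1', 'alert_2', 'alert_3']:
        log.info(f'[{run_id}] [{alert}] Executing logic for {alert}')
        implement_alert_logic(alert, run_id)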
I am trying to copy a file from SFTP to Azure Blob Storage using SFTPToWasbOperator, and I am getting an error. It seems like I'm doing something wrong, but I can't figure out what it is. Could someone please check the following code and see if there is anything wrong with it?
Airflow Logs
[2022-07-10, 13:08:48 UTC] {sftp_to_wasb.py:188} INFO - Uploading /SPi_ESG_Live/07-04-2022/DataPoint_2022_07_04.csv to wasb://testcotainer as https://test.blob.core.windows.net/testcotainer/DataPoint_2022_07_04.csv
[2022-07-10, 13:08:48 UTC] {_universal.py:473} INFO - Request URL: 'https://.blob.core.windows.net/***/test/https%3A//test.blob.core.windows.net/testcontainer/DataPoint_2022_07_04.csv'
Error msg
"azure.core.exceptions.ServiceRequestError: URL has an invalid label."
Airflow DAG
import os
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.providers.microsoft.azure.operators.wasb_delete_blob import WasbDeleteBlobOperator
from airflow.providers.microsoft.azure.transfers.sftp_to_wasb import SFTPToWasbOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.sftp.operators.sftp import SFTPOperator
AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "https://test.blob.core.windows.net/testcotainer/"
SFTP_SRC_PATH = "/SPi_test_Live/07-04-2022/"
ENV_ID = os.environ.get("SYSTEM_TESTS_ENV_ID")
DAG_ID = "example_sftp_to_wasb"
with DAG(
    DAG_ID,
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 1, 1),  # Override to match your needs
) as dag:
    # [START how_to_sftp_to_wasb]
    transfer_files_to_azure = SFTPToWasbOperator(
        task_id="transfer_files_from_sftp_to_wasb",
        # SFTP args
        sftp_source_path=SFTP_SRC_PATH,
        # AZURE args
        container_name=AZURE_CONTAINER_NAME,
        blob_prefix=BLOB_PREFIX,
    )
    # [END how_to_sftp_to_wasb]
The problem is with BLOB_PREFIX: it's not a URL, it's the prefix that is prepended to the blob name inside the container, i.e. the part that comes after the Azure account/container URL.
See this source example: https://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/_modules/tests/system/providers/microsoft/azure/example_sftp_to_wasb.html
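A minimal sketch of the fix, inside the same with DAG(...) block as before, assuming you want the uploaded files to land under a folder such as SPi_ESG_Live/07-04-2022/ inside the container (that folder name is only illustrative; use whatever prefix fits your layout):
AZURE_CONTAINER_NAME = "testcotainer"
BLOB_PREFIX = "SPi_ESG_Live/07-04-2022/"  # a relative prefix inside the container, not a full https:// URL
SFTP_SRC_PATH = "/SPi_test_Live/07-04-2022/"

transfer_files_to_azure = SFTPToWasbOperator(
    task_id="transfer_files_from_sftp_to_wasb",
    sftp_source_path=SFTP_SRC_PATH,
    container_name=AZURE_CONTAINER_NAME,
    blob_prefix=BLOB_PREFIX,
)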
I'm trying to get the DAG name from the following JSON:
INFO - {'conf': <airflow.configuration.AutoReloadableProxy object at ... >, 'dag': <DAG: dag-name-i-want-to-get>, 'ds': '2021-07-29' ... N }
By the way, I got the JSON using the following function in Airflow:
def execute_dag_run(**kwargs):
    print(kwargs)
    dag = kwargs['dag']
    print(type(dag))
    print(dag)

get_dag_run_task = PythonOperator(
    task_id='get_dag_run',
    python_callable=execute_dag_run,
    dag=dag,
    provide_context=True
)
However, I'm getting a class if I print type(dag):
INFO - <class 'airflow.models.dag.DAG'>
Do you have any idea how to get this without doing a manual extraction?
You are printing the DAG object. If you want the DAG name, you need to read it from the DAG object's dag_id attribute:
def execute_dag_run(**kwargs):
    dag = kwargs['dag']
    print("dag_id from dag:")
    print(dag.dag_id)
Alternatively, you can also get it from the task_instance as:
def execute_dag_run(**kwargs):
    ti = kwargs['task_instance']
    print("dag_id from task instance:")
    print(ti.dag_id)
Another option is to get it from the dag_run as:
def execute_dag_run(**kwargs):
    dag_run = kwargs['dag_run']
    print("dag_id from dag_run:")
    print(dag_run.dag_id)
As a newbie to airflow, I'm looking at the example_branch_operator:
"""Example DAG demonstrating the usage of the BranchPythonOperator."""
import random
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago
args = {
    'owner': 'airflow',
}

with DAG(
    dag_id='example_branch_operator',
    default_args=args,
    start_date=days_ago(2),
    schedule_interval="@daily",
    tags=['example', 'example2'],
) as dag:
    run_this_first = DummyOperator(
        task_id='run_this_first',
    )

    options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']

    branching = BranchPythonOperator(
        task_id='branching',
        python_callable=lambda: random.choice(options),
    )
    run_this_first >> branching

    join = DummyOperator(
        task_id='join',
        trigger_rule='none_failed_or_skipped',
    )

    for option in options:
        t = DummyOperator(
            task_id=option,
        )
        dummy_follow = DummyOperator(
            task_id='follow_' + option,
        )
        branching >> t >> dummy_follow >> join
Looking at the join operator, I'd expect it to collect all the branches, but instead it's just another task that happens at the end of each branch. If multiple branches are executed, join will run that many times.
(yes, yes, it should be idempotent, but that's not the point of the question)
Is this a bug, a poorly named task, or am I missing something?
The tree view displays a complete branch from each DAG root node, so multiple branches that converge on a single task are shown multiple times, but they will only be executed once. Check out the Graph View of this DAG instead.
For data migrations, I have created a DAG which ultimately inserts data into a migration table after all the tasks with the required logic have run.
The DAG has a SQL statement, something like the one below, which initially extracts the data and feeds the other tasks:
sql = "select col_names from tables where created_on >= date1 and created_on <= date2"
For each DAG run I am manually changing date1 and date2 in the SQL above and initiating the data migration (the data chunk is heavy, so as of now the date range length is one week).
I just want to automate this date-changing process: e.g. if I give the date intervals, after the first DAG run finishes, the second run is initiated, and so on until the end of the date interval.
From my research so far, one solution I found was dynamic DAGs in Airflow. But the problem is that it creates multiple DAG file instances, and it is also very difficult to debug and maintain.
Is there a way to repeat a DAG with a changing date parameter so that I no longer have to keep changing the dates manually?
I had the exact same issue! Backfilling in Airflow doesn't seem to make any sense if you don't have the DAG interval start and end as input parameters. If you want to do data migration, you'll probably need to store your last migration time in a file to read. However, this goes against some of the properties an Airflow DAG/task should have (idempotence).
My solution was to add two tasks to my DAG before the start of my "main" tasks. I have two operators (you can possibly make it one) which gets the start and end times of the current DAG run. The "start" and "end" names are sort of misleading because the "start" is actually the start of the previous run and "end" the start of the current run.
I can't reveal the custom operator I wrote but you can do this in a single Python operator:
from datetime import datetime
from croniter import croniter

def get_interval_start_end(**kwargs):
    dag = kwargs['dag']
    ti = kwargs['ti']
    dag_execution = ti.execution_date  # current DAG scheduled start
    dag_interval = dag._schedule_interval  # note the preceding underscore
    cron_iter = croniter(dag_interval, dag_execution)
    dag_prev_execution = cron_iter.get_prev(datetime)  # previous scheduled start as a datetime
    return (dag_execution, dag_prev_execution)

# dag
task = PythonOperator(task_id='blabla',
                      python_callable=get_interval_start_end,
                      provide_context=True)

# other tasks
Then pull these values from xcom in your next task.
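For instance, a minimal sketch of that pull (assuming the upstream task_id 'blabla' from the snippet above):
def use_interval(**kwargs):
    ti = kwargs['ti']
    # the PythonOperator's return value is pushed to XCom under the default 'return_value' key
    dag_execution, dag_prev_execution = ti.xcom_pull(task_ids='blabla')
    print(f'Processing window: {dag_prev_execution} -> {dag_execution}')

use_interval_task = PythonOperator(task_id='use_interval',
                                   python_callable=use_interval,
                                   provide_context=True)

task >> use_interval_task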
There is also a way to get the "last_run" of the DAG using dag.get_last_dagrun() instead. However, it doesn't return the previous scheduled run but the previous actual run. If you have already run your DAG for a "future" time, your "last dag run" will be after your current execution! Then again, I might not have tested with the right settings, so you can try that out first.
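For reference, a rough sketch of that alternative (with the same caveat that it returns the last actual run, not the last scheduled one):
def print_last_run(**kwargs):
    dag = kwargs['dag']
    last_run = dag.get_last_dagrun()  # may be None if the DAG has never run
    if last_run is not None:
        print(last_run.execution_date)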
I had a similar requirement, and here is how I accessed the dates, which can later be used in the SQL for the backfill.
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from datetime import datetime, timedelta

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'end_date': datetime(2020, 8, 3),
    'retries': 0,
}

dag = DAG('helloWorld_v1', default_args=default_args, catchup=True, schedule_interval='0 1 * * *')

def print_dag_run_date(**kwargs):
    print(kwargs)
    execution_date = kwargs['ds']
    prev_execution_date = kwargs['prev_ds']
    return (execution_date, prev_execution_date)

# t1, t2 are examples of tasks created using operators
bash = BashOperator(
    task_id='bash',
    depends_on_past=True,
    bash_command='echo "Hello World from Task 1"',
    dag=dag)

py = PythonOperator(
    task_id='py',
    depends_on_past=True,
    python_callable=print_dag_run_date,
    provide_context=True,
    dag=dag)

py.set_upstream(bash)
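Building on that, a rough sketch of how the ds / prev_ds values could feed the migration SQL from the question (the query text is the question's placeholder, and the task name is only illustrative):
def run_migration_window(**kwargs):
    date1 = kwargs['prev_ds']  # start of the window (previous run's date)
    date2 = kwargs['ds']       # end of the window (current run's date)
    sql = ("select col_names from tables "
           f"where created_on >= '{date1}' and created_on <= '{date2}'")
    print(sql)  # hand the query to your database hook or downstream task here

migrate = PythonOperator(
    task_id='migrate_window',
    python_callable=run_migration_window,
    provide_context=True,
    dag=dag)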
I am scheduling a DAG in Airflow to run every 10 minutes, but it is not doing anything.
Here is my DAG code:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('Python_call', default_args=default_args, schedule_interval='*/10 * * * *')

t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /var/www/projects/python_airflow/airpy/hello.py',
    dag=dag)
And the scheduler log looks like this:
[2018-01-05 14:05:08,536] {jobs.py:351} DagFileProcessor484 INFO - Processing /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py took 2.278 seconds
[2018-01-05 14:05:09,712] {jobs.py:343} DagFileProcessor485 INFO - Started process (PID=29795) to work on /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py
[2018-01-05 14:05:09,715] {jobs.py:534} DagFileProcessor485 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2018-01-05 14:05:09,717] {jobs.py:1521} DagFileProcessor485 INFO - Processing file /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py for tasks to queue
[2018-01-05 14:05:09,717] {models.py:167} DagFileProcessor485 INFO - Filling up the DagBag from /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py
[2018-01-05 14:05:10,057] {jobs.py:1535} DagFileProcessor485 INFO - DAG(s) dict_keys(['example_passing_params_via_test_command', 'latest_only_with_trigger', 'example_branch_operator', 'example_subdag_operator', 'latest_only', 'example_skip_dag', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'tutorial', 'example_http_operator', 'example_trigger_controller_dag', 'example_bash_operator', 'example_python_operator', 'test_utils', 'Python_call', 'example_trigger_target_dag', 'example_xcom', 'example_short_circuit_operator', 'example_branch_dop_operator_v3']) retrieved from /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py
[2018-01-05 14:05:12,039] {jobs.py:1169} DagFileProcessor485 INFO - Processing Python_call
[2018-01-05 14:05:12,048] {jobs.py:566} DagFileProcessor485 INFO - Skipping SLA check for <DAG: Python_call> because no tasks in DAG have SLAs
[2018-01-05 14:05:12,060] {models.py:322} DagFileProcessor485 INFO - Finding 'running' jobs without a recent heartbeat
[2018-01-05 14:05:12,061] {models.py:328} DagFileProcessor485 INFO - Failing jobs without heartbeat after 2018-01-05 14:00:12.061146
The command-line output of airflow scheduler:
[2018-01-05 14:31:20,496] {dag_processing.py:627} INFO - Started a process (PID: 32222) to generate tasks for /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py - logging into /var/www/projects/python_airflow/airpy/airflow_home/logs/scheduler/2018-01-05/scheduler.py.log
[2018-01-05 14:31:23,122] {jobs.py:1002} INFO - No tasks to send to the executor
[2018-01-05 14:31:23,123] {jobs.py:1440} INFO - Heartbeating the executor
[2018-01-05 14:31:23,123] {jobs.py:1450} INFO - Heartbeating the scheduler
[2018-01-05 14:31:24,243] {jobs.py:1404} INFO - Heartbeating the process manager
[2018-01-05 14:31:24,244] {dag_processing.py:559} INFO - Processor for /var/www/projects/python_airflow/airpy/airflow_home/dags/scheduler.py finished
Airflow is an ETL/data-pipelining tool. This means it's meant to execute things over periods that have already gone by. E.g. using:
Task parameter 'start_date': datetime(2018,1,4)
Default DAG parameter schedule_interval='@daily'
means that the DAG won't run until a whole schedule-interval unit (one day) has gone by since the start date, i.e. when the Airflow server time reaches datetime(2018,1,5).
Since you have a start_date of datetime.now(), the aforementioned condition is never fulfilled: the start date moves forward every time the DAG file is parsed, so a full schedule interval never elapses (refer to the official FAQ).
You can change the start_date parameter to, e.g., yesterday using timedelta for a relative start_date earlier than today (although this is not recommended). I would advise using 'start_date': datetime(2018,1,1) and adding schedule_interval='@once' to the DAG parameters for test purposes. This should get your DAG to run.
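A minimal sketch of that suggestion, reusing the question's DAG (the fixed start_date and '@once' interval are the test settings recommended above):
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1),  # a fixed date in the past, never datetime.now()
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('Python_call', default_args=default_args, schedule_interval='@once')

t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /var/www/projects/python_airflow/airpy/hello.py',
    dag=dag)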