Having trouble extracting a DAG name from a JSON in Airflow - python-3.x

I'm trying to get the DAG name from the following JSON:
INFO - {'conf': <airflow.configuration.AutoReloadableProxy object at ... >, 'dag': <DAG: dag-name-i-want-to-get>, 'ds': '2021-07-29' ... N }
By the way, I got the JSON using the following function in Airflow:
def execute_dag_run(**kwargs):
    print(kwargs)
    dag = kwargs['dag']
    print(type(dag))
    print(dag)

get_dag_run_task = PythonOperator(
    task_id='get_dag_run',
    python_callable=execute_dag_run,
    dag=dag,
    provide_context=True
)
However, I'm getting a class when I print type(dag):
INFO - <class 'airflow.models.dag.DAG'>
Do you have any idea how to get this without doing a manual extraction?

You are printing the dag object. If you want to get the dag name, you need to get it from the dag object:
def execute_dag_run(**kwargs):
    dag = kwargs['dag']
    print("dag_id from dag:")
    print(dag.dag_id)
Alternatively, you can get it from the task_instance:
def execute_dag_run(**kwargs):
    ti = kwargs['task_instance']
    print("dag_id from task instance:")
    print(ti.dag_id)
Another option is to get it from the dag_run:
def execute_dag_run(**kwargs):
    dag_run = kwargs['dag_run']
    print("dag_id from dag_run:")
    print(dag_run.dag_id)
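All three objects expose the same dag_id attribute, so any of them works. A minimal combined sketch of the callable (no manual extraction from the printed JSON needed):

def execute_dag_run(**kwargs):
    # Each context object carries the dag_id string directly.
    print(kwargs['dag'].dag_id)            # from the DAG object
    print(kwargs['task_instance'].dag_id)  # from the TaskInstance
    print(kwargs['dag_run'].dag_id)        # from the DagRun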

Related

Binding data in type (list) is not supported

I wrote this script to execute a query and return the results to a data frame. It works like a charm and is quite fast.
Now, I want to give the script bind parameters at runtime, a customer id and a date.
from snowflake.snowpark.functions import *
from snowflake.snowpark import *
import snowflake.connector
import pandas as pd
import sys

# Parameter style has to be set before connecting
snowflake.connector.paramstyle = 'numeric'

# Connection
conn = snowflake.connector.connect(
    account=...
)

try:
    print('Session started...')
    # Session handle
    cs = conn.cursor()
    # Debug
    print(f"Arguments : {sys.argv[1:]=}")
    # Query with bind variables
    Query = "with dims as (\n" \
            " select...
            " from revenue where site_id = :1 and subscription_product = TRUE and first_or_recurring = 'First'\n" \
            " qualify row_number = 1\n" \
            ...
            ")\n" \
            "select to_date(:2) as occurred_date, count(distinct srm.subscription_id) as ..."
    # Bind variables - create list first
    params = [ [sys.argv[1]] ]
    cs.executemany(Query, [ [params] ])
    # This works
    # cs.execute(Query, ['ksisk5kZRvk', '2022-10-28'])
    # Write to a dataframe & show
    df = cs.fetch_pandas_all()
    print(df.head(1))
    # Cleanup
    cs.close()
    conn.close()
except Exception as e:
    print(e)
finally:
    if conn:
        cs.close()
        conn.close()
    print('Done...')
At runtime, I get the following error:
myscript.py ksisk5kZRvk 2022-10-28
session started...
Arguments : sys.argv[1:]=['ksisk5kZRvk,', '2022-10-28']
255001: Binding data in type (list) is not supported.
done...
OK, so I changed params to be a tuple instead:
... params = ( (sys.argv[1]) )
cs.execute(Query, ( (params) ) )...
And I get a new error
myscript.py ksisk5kZRvk 2022-10-28
session started...
Arguments : sys.argv[1:]=['ksisk5kZRvk,', '2022-10-28']
252004: Binding parameters must be a list: ksisk5kZRvk,
done...
Obviously, I am missing something here. I think the issue is how I use sys.argv[1:] but I can't figure it out.
Any ideas? Thanks in advance!
My guess is that you are nesting a list inside a list inside a list. Here:
params = [ [sys.argv[1]] ]
cs.executemany(Query, [ [params] ] )
Instead try:
params = [ [sys.argv[1]] ]
cs.executemany(Query, params)
Or just:
cs.executemany(Query, [sys.argv[1]])
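Since the query has two numeric placeholders (:1 and :2), here is a minimal sketch of binding both runtime arguments (assuming sys.argv[1] is the site id and sys.argv[2] is the date):

# execute() takes one flat sequence with one value per placeholder
cs.execute(Query, [sys.argv[1], sys.argv[2]])

# executemany() takes a sequence of such sequences, one per execution
# cs.executemany(Query, [[sys.argv[1], sys.argv[2]]])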

Notify on status of other operators in an Airflow DAG using Email Operator

Let's say I have a dag in Airflow whose definition file looks like:
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.trigger_rule import TriggerRule

default_args = {
    'owner': 'airflow',
    'retries': 0
}

dag = DAG(
    dag_id='email_notification_test',
    start_date=airflow.utils.dates.days_ago(2),
    default_args=default_args,
    schedule_interval=None,
    catchup=False
)

start = DummyOperator(task_id='start', dag=dag)

def built_to_fail(ds, **kwargs):
    raise Exception('This operator fails')

def built_to_succeed(ds, **kwargs):
    print('This operator succeeds')
    return

operator_that_fails = PythonOperator(
    task_id='operator_that_fails',
    python_callable=built_to_fail,
    dag=dag
)

operator_that_succeeds = PythonOperator(
    task_id='operator_that_succeeds',
    python_callable=built_to_succeed,
    dag=dag
)

email = EmailOperator(
    task_id='send_email',
    to='<email address>',
    subject='DAG Run Complete',
    html_content="""run_id: {{ run_id }} </p>
        dag_run: {{ dag_run }} </p>
        dag_run.id: {{ dag_run.id }} </p>
        dag_run.state: {{ dag_run.state }}""",
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag
)

start >> [operator_that_fails, operator_that_succeeds] >> email
DAG TLDR: The dag has two operators, one which fails and one which succeeds. After both have finished executing, run a third task - an email operator - that sends a notification summary of the statuses of the preceding operators. For a visual aid, here is the webui graph view:
As I've demonstrated in the html_content part of the email operator, you can use Jinja to reference objects and their attributes. What I really need, though, is not only to report on the status of the dag itself, but on the individual operators that have already run, so something like:
html_content="""operator_that_fails status : {{ <dynamic reference to preceding task status> }} </p>
operator_that_succeeds status: {{ <ditto for the other task> }}"""
I was trying to do this by exploring the Airflow object model documentation, i.e. this page for the "dag run" object, but wasn't able to find a good way of getting preceding task statuses.
Anyone know how best to achieve my goal here?
You can use templating to access the context, like this:
def contact_endpoints(**context):
    return ("subject text", "body text")

execute = PythonOperator(
    task_id='source_task_id',
    python_callable=contact_endpoints,
    provide_context=True,
    do_xcom_push=True
)

email = EmailOperator(
    task_id='send_mail',
    to="email@email.com",
    subject=f"""[{dag.dag_id}] {{{{ task_instance.xcom_pull(task_ids='source_task_id', key='return_value')[0] }}}}""",
    html_content=f"""
        <br><br>
        <h3>Report</h3>
        {{{{ task_instance.xcom_pull(task_ids='source_task_id', key='return_value')[1] }}}}
    """)
In my example, the task whose id is referenced in the XCom pulls returns a tuple ("part to insert in subject", "part to insert in body").
You have to set do_xcom_push to True on the referenced task.
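If what you need is the state of each upstream task rather than an XCom value, here is a minimal sketch of an alternative (the function name is illustrative, and it assumes dag_run.get_task_instance is available in your Airflow version) that builds the report inside the callable from the context:

def build_status_report(**context):
    # Look up sibling task states through the DagRun object in the context.
    dag_run = context['dag_run']
    lines = []
    for task_id in ('operator_that_fails', 'operator_that_succeeds'):
        ti = dag_run.get_task_instance(task_id)
        lines.append(f"{task_id} status: {ti.state}")
    return ("DAG Run Complete", " </p> ".join(lines))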

What is the best way to transfer information between dags in Airflow?

What is the best way to transfer information between dags?
Since I have a scenario where multiple dags, let's say dag A and dag B, can call dag C, I thought of 2 ways to do so:
XCom - I cannot use xcom_pull from dag C since I don't know which dag id to give as input.
use context["dag_run"].payload when calling TriggerDagRunOperator, but that solution forces me to create some kind of custom operator for each task in dag C in order to use the context inside the execute function and read that info.
Any other suggestions? pros/cons?
Thanks.
# Pass info from DAG_A / DAG_B:
def push_info(context, dag_run_obj):
    dag_run_obj.payload = {PARAMS_KEY: {"feature_ids": 1234}}
    return dag_run_obj

trigger_DAG_C = TriggerDagRunOperator(
    task_id="trigger_DAG_C",
    trigger_dag_id="DAG_C",
    python_callable=push_info,
    dag=dag,
)

# Read the info from DAG C:
def execute(**context):
    info = context["dag_run"].conf
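In Airflow 2.x the TriggerDagRunOperator no longer takes a python_callable, so here is a sketch of the same idea using the 2.x-style conf argument (import path and argument names assume the 2.x operator):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# DAG A / DAG B side: pass the payload through `conf`
trigger_DAG_C = TriggerDagRunOperator(
    task_id="trigger_DAG_C",
    trigger_dag_id="DAG_C",
    conf={"feature_ids": 1234},
    dag=dag,
)

# DAG C side: any task with access to the context can read it back
def read_conf(**context):
    info = context["dag_run"].conf
    print(info.get("feature_ids"))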

Send the output of one task to another task in airflow

I am using Airflow, and I want to pass the output of the function in task 1 to task 2.
def create_dag(dag_id,
               schedule,
               default_args):
    def getData(**kwargs):
        res = requests.post('https://dummyURL')
        return res.json()

    def sendAlert(**kwargs):
        requests.post('https://dummyURL', params="here i want to send res.json() from task 1")

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(task_id='task1', python_callable=getData, provide_context=True, dag=dag)
        t2 = PythonOperator(task_id='task2', python_callable=sendAlert, provide_context=True, dag=dag)

    return dag
Check out XComs; as long as the data you want to pass is relatively small, it's the best option.
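A minimal sketch of that approach with the two callables from the question: return the response from task 1 so it is pushed to XCom automatically, then pull it in task 2 through the task instance in the context:

def getData(**kwargs):
    res = requests.post('https://dummyURL')
    # The return value of a PythonOperator callable is pushed to XCom automatically.
    return res.json()

def sendAlert(**kwargs):
    # Pull what task1 returned from XCom via the task instance.
    data = kwargs['ti'].xcom_pull(task_ids='task1')
    requests.post('https://dummyURL', params=data)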

Apache airflow: Cannot use more than 1 thread when using sqlite. Setting max_threads to 1

I have set up Airflow with a PostgreSQL database, and I am creating multiple dags:
def subdag(parent_dag_name, child_dag_name, currentDate, batchId, category, subCategory, yearMonth, utilityType, REALTIME_HOME, args):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@once",
    )

    # get site list to run bs reports
    site_list = getSiteListforProcessing(category, subCategory, utilityType, yearMonth)
    print(site_list)

    def update_status(siteId, **kwargs):
        createdDate = getCurrentTimestamp()
        print('N', siteId, batchId, yearMonth, utilityType, 'N')
        updateJobStatusLog('N', siteId, batchId, yearMonth, utilityType, 'P')

    def error_status(siteId, **kwargs):
        createdDate = getCurrentTimestamp()
        print('N', siteId, batchId, yearMonth, utilityType, 'N')

    BS_template = """
    echo "{{ params.date }}"
    java -cp xx.jar com.xGenerator {{params.siteId}} {{params.utilityType}} {{params.date}}
    """

    for index, siteid in enumerate(site_list):
        t1 = BashOperator(
            task_id='%s-task-%s' % (child_dag_name, index + 1),
            bash_command=BS_template,
            params={'date': currentDate, 'realtime_home': REALTIME_HOME, 'siteId': siteid, "utilityType": utilityType},
            default_args=args,
            dag=dag_subdag)

        t2 = PythonOperator(
            task_id='%s-updatetask-%s' % (child_dag_name, index + 1),
            dag=dag_subdag,
            python_callable=update_status,
            op_kwargs={'siteId': siteid})

        t2.set_upstream(t1)

    return dag_subdag
It creates the dynamic tasks, but no matter how many tasks are created, the last one always fails and logs the error:
"Cannot use more than 1 thread when using sqlite. Setting max_threads to 1"
E.g., if 4 tasks are created, 3 run; if 2 tasks are created, 1 runs.
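That warning usually means the scheduler is still reading the default SQLite metadata connection, so it is worth checking that sql_alchemy_conn in airflow.cfg really points at PostgreSQL. A sketch of the relevant settings (values are illustrative assumptions):

# airflow.cfg
[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow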
