How to get maximum/minimum duration of all DagRun instances in Airflow? - python-3.x

Is there a way to find the maximum, minimum, or even average duration of all DagRun instances in Airflow? That is, all DagRuns from all DAGs, not just one single DAG.
I can't find anywhere to do this in the UI, or even a page with a programmatic/command-line example.

You can use the Airflow REST API to get all dag_runs for each DAG and calculate statistics from them.
An example that gets all dag_runs per DAG and computes the duration of each run:
import datetime
import requests
from requests.auth import HTTPBasicAuth
airflow_server = "http://localhost:8080/api/v1/"
auth = HTTPBasicAuth("airflow", "airflow")
# List the active DAGs (first 100 only)
get_dags_url = f"{airflow_server}dags"
get_dag_params = {
    "limit": 100,
    "only_active": "true"
}
response = requests.get(get_dags_url, params=get_dag_params, auth=auth)
dags = response.json()["dags"]
get_dag_run_params = {
    "limit": 100,
    "state": "success",
}
for dag in dags:
    dag_id = dag["dag_id"]
    # airflow_server already ends with "/", so no extra slash is needed here
    dag_run_url = f"{airflow_server}dags/{dag_id}/dagRuns"
    response = requests.get(dag_run_url, params=get_dag_run_params, auth=auth)
    dag_runs = response.json()["dag_runs"]
    for dag_run in dag_runs:
        # start_date is when the run actually began; execution_date is only the logical date
        start_date = datetime.datetime.fromisoformat(dag_run['start_date'])
        end_date = datetime.datetime.fromisoformat(dag_run['end_date'])
        duration = end_date - start_date
        duration_in_s = duration.total_seconds()
        print(duration_in_s)
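To turn per-run durations into the minimum, maximum, and average the question asks for, the values can be collected and summarized. This is a minimal sketch, assuming each item carries the `start_date` and `end_date` ISO-8601 fields returned by the dagRuns endpoint. `summarize_durations` is a hypothetical helper name, not part of Airflow, and the sample data is hand-written.

```python
import datetime
import statistics


def summarize_durations(dag_runs):
    """Return (min, max, mean) run duration in seconds, or None if no run has finished."""
    durations = []
    for run in dag_runs:
        # Skip runs that have not started or finished yet
        if not run.get("start_date") or not run.get("end_date"):
            continue
        start = datetime.datetime.fromisoformat(run["start_date"])
        end = datetime.datetime.fromisoformat(run["end_date"])
        durations.append((end - start).total_seconds())
    if not durations:
        return None
    return min(durations), max(durations), statistics.mean(durations)


# Hand-written sample data for illustration only
sample_runs = [
    {"start_date": "2022-01-01T00:00:00+00:00", "end_date": "2022-01-01T00:01:00+00:00"},
    {"start_date": "2022-01-02T00:00:00+00:00", "end_date": "2022-01-02T00:03:00+00:00"},
    {"start_date": "2022-01-03T00:00:00+00:00", "end_date": None},
]
print(summarize_durations(sample_runs))
```

In the loop over all DAGs, the `dag_runs` lists could be concatenated first so that the summary covers every DAG rather than one.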

The easiest way will be to query your Airflow metastore. All the scheduling, DAG runs, and task instances are stored there and Airflow can't operate without it. I do recommend filtering on DAG/execution date if your use-case allows. It's not obvious to me what one can do with just these three overarching numbers alone.
select
    min(runtime_seconds) min_runtime,
    max(runtime_seconds) max_runtime,
    avg(runtime_seconds) avg_runtime
from (
    select extract(epoch from (d.end_date - d.start_date)) runtime_seconds
    from public.dag_run d
    where d.execution_date between '2022-01-01' and '2022-06-30' and d.state = 'success'
) runs -- PostgreSQL before version 16 requires an alias on a subquery in FROM
You might also consider joining to the task_instance table to get some task-level data, and perhaps use the min start and max end times for DAG tasks within a DAG run for your timestamps.
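The task-level variant can be sketched in Python once the rows are fetched. This is a minimal sketch, assuming each row is a `(dag_id, run_key, start, end)` tuple already read from task_instance, where `run_key` stands for whatever column identifies the DAG run in your Airflow version. `run_spans` is a hypothetical helper and the rows are made up.

```python
from datetime import datetime


def run_spans(task_rows):
    """Map (dag_id, run_key) to seconds between the earliest task start and the latest task end."""
    bounds = {}
    for dag_id, run_key, start, end in task_rows:
        key = (dag_id, run_key)
        if key not in bounds:
            bounds[key] = (start, end)
        else:
            earliest, latest = bounds[key]
            bounds[key] = (min(earliest, start), max(latest, end))
    return {key: (latest - earliest).total_seconds() for key, (earliest, latest) in bounds.items()}


# Made-up rows for illustration only
rows = [
    ("etl", "run_1", datetime(2022, 1, 1, 0, 0, 0), datetime(2022, 1, 1, 0, 5, 0)),
    ("etl", "run_1", datetime(2022, 1, 1, 0, 5, 0), datetime(2022, 1, 1, 0, 12, 0)),
    ("etl", "run_2", datetime(2022, 1, 2, 0, 0, 0), datetime(2022, 1, 2, 0, 3, 0)),
]
print(run_spans(rows))
```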

Related

Best practice to reduce the number of spark job started. Use union?

With Spark, starting a job takes time.
In a complex workflow, a job may be started inside a loop.
But then we pay the start-up cost on every iteration.
def test_loop(spark):
    all_datas = []
    for i in ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']:
        all_datas.extend(spark.sql(f"""
            select * from data where id=='{i}'
        """).collect())  # Start a job
    return all_datas
Sometimes it's possible to unroll the loop into one big job, using 'union'.
def test_union(spark):
    full_request = None
    for i in ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']:
        q = f"""
            select '{i}' ID,* from data where leh_be_lot_id=='{i}'
        """
        partial_df = spark.sql(q)
        if full_request is None:
            full_request = partial_df
        else:
            full_request = full_request.union(partial_df)
    return full_request.collect()  # Start a job
For clarity, my samples are elementary (I know I could use in (...)). The real requests will be more complex.
Is it a good idea?
With the union approach, I can drastically reduce the number of jobs submitted, at the cost of a more complex job.
My tests show that:
It's possible to union more than 1000 requests and still get better performance.
For 950 requests with local[6]:
With 0 unions: 1h53m
With 10 unions: 20m01s
With 100 unions: 7m12s
With 200 unions: 6m02s
With 500 unions: 6m25s
Sometimes the union version must "broadcast big data", or it fails with an "Out of memory" error.
My final approach: set a level_of_union.
Merge some requests, start the job, get the data,
then continue the loop with another batch.
def test_union(spark, level_of_union):
    full_request = None
    all_datas = []
    todo = ['CZ12905K01', 'CZ12809WRH', 'CZ129086RP']
    for idx, i in enumerate(todo):
        q = f"""
            select '{i}' ID,* from leh_be where leh_be_lot_id=='{i}'
        """
        partial_df = spark.sql(q)
        if full_request is None:
            full_request = partial_df
        else:
            full_request = full_request.union(partial_df)
        # Run the batch once it holds level_of_union requests, or at the end of the list
        if idx % level_of_union == level_of_union-1 or idx == len(todo)-1:
            all_datas.extend(full_request.collect())  # Start a job
            full_request = None
    return all_datas
Run a test to tune the meta-parameter level_of_union.
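The batching can also be separated from the query building, which may make level_of_union easier to tune. This is a minimal sketch, not the approach used in the timings above. `batches` and `collect_in_batches` are hypothetical helpers, and `make_df` is assumed to be a caller-supplied function that turns one id into a DataFrame (for example by calling spark.sql).

```python
from functools import reduce


def batches(items, level_of_union):
    """Yield consecutive slices of at most level_of_union items."""
    if level_of_union < 1:
        raise ValueError("level_of_union must be at least 1")
    for start in range(0, len(items), level_of_union):
        yield items[start:start + level_of_union]


def collect_in_batches(make_df, items, level_of_union):
    """Union the DataFrames of each batch and collect once per batch."""
    rows = []
    for batch in batches(items, level_of_union):
        # Fold the batch into one DataFrame with successive union calls
        merged = reduce(lambda left, right: left.union(right), map(make_df, batch))
        rows.extend(merged.collect())  # one Spark job per batch
    return rows


print(list(batches(['CZ12905K01', 'CZ12809WRH', 'CZ129086RP'], 2)))
```

With this split, trying another level_of_union only changes one argument, and the last partial batch is handled by the slicing rather than by an index check.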

Airflow example_branch_operator usage of join - bug?

As a newbie to airflow, I'm looking at the example_branch_operator:
"""Example DAG demonstrating the usage of the BranchPythonOperator."""
import random
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.dates import days_ago
args = {
    'owner': 'airflow',
}
with DAG(
    dag_id='example_branch_operator',
    default_args=args,
    start_date=days_ago(2),
    schedule_interval="@daily",
    tags=['example', 'example2'],
) as dag:
    run_this_first = DummyOperator(
        task_id='run_this_first',
    )
    options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']
    branching = BranchPythonOperator(
        task_id='branching',
        python_callable=lambda: random.choice(options),
    )
    run_this_first >> branching
    join = DummyOperator(
        task_id='join',
        trigger_rule='none_failed_or_skipped',
    )
    for option in options:
        t = DummyOperator(
            task_id=option,
        )
        dummy_follow = DummyOperator(
            task_id='follow_' + option,
        )
        branching >> t >> dummy_follow >> join
Looking at the join operator, I'd expect for it to collect all the branches, but instead it's just another task that happens at the end of each branch. If multiple branches are executed, join will run that many times.
(yes, yes, it should be idempotent, but that's not the point of the question)
Is this a bug, a poorly named task, or am I missing something?
The Tree View displays a complete branch from each DAG root node. Multiple branches that converge on a single task are shown multiple times, but that task is executed only once. Check out the Graph View of this DAG, where join appears as a single node.
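The difference between "drawn once per branch" and "executed once" can be illustrated without Airflow. This is a rough sketch, assuming the task graph of the example DAG is written by hand as an adjacency dict. `root_to_leaf_paths` is a hypothetical helper that approximates how a tree-style view unrolls the graph; it is not Airflow's rendering code.

```python
# Task graph of example_branch_operator, written by hand as an adjacency dict
edges = {"run_this_first": ["branching"], "branching": [], "join": []}
for option in ["branch_a", "branch_b", "branch_c", "branch_d"]:
    edges["branching"].append(option)
    edges[option] = ["follow_" + option]
    edges["follow_" + option] = ["join"]


def root_to_leaf_paths(graph, node):
    """Enumerate every path from node down to a task with no downstream tasks."""
    if not graph[node]:
        return [[node]]
    return [[node] + rest for child in graph[node] for rest in root_to_leaf_paths(graph, child)]


paths = root_to_leaf_paths(edges, "run_this_first")
print(len(paths))                        # branches a tree-style view would draw
print(sum("join" in p for p in paths))   # how many of those branches contain join
print(len(edges))                        # distinct tasks in the DAG, join counted once
```

Every unrolled branch ends in join, yet the graph holds a single join task, which is the one the scheduler runs.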

How do I know if a file has been dumped in badrecordspath ? [ Azure Databricks -Spark]

I want to use badRecordsPath in Spark on Azure Databricks to get a count of the corrupt records associated with the job, but there is no simple way to know:
whether a file has been written
in which partition the file has been written
I thought maybe I could check whether the latest partition was created in the last 60 seconds, with some code like this:
import time
from datetime import datetime, timedelta, timezone
df = spark.read.format('csv').option("badRecordsPath", corrupt_record_path)
partition_dict = {}  # maps each partition name to its timestamp
for i in dbutils.fs.ls(corrupt_record_path):
    # i.name[:-1] drops the last character of the directory name before parsing
    partition_dict[i.name[:-1]] = time.mktime(datetime.strptime(i.name[:-1], "%Y%m%dT%H%M%S").timetuple())
# here I get the timestamp of one minute ago
submit_timestamp_utc_minus_minute = datetime.now().replace(tzinfo=timezone.utc) - timedelta(seconds=60)
submit_timestamp_utc_minus_minute = time.mktime(datetime.strptime(submit_timestamp_utc_minus_minute.strftime("%Y%m%dT%H%M%S"), "%Y%m%dT%H%M%S").timetuple())
# here I compare the latest partition to check whether it is more recent than 60 seconds ago
latest_partition = max(partition_dict, key=lambda k: partition_dict[k])
if partition_dict[latest_partition] > submit_timestamp_utc_minus_minute:
    corrupt_dataframe = spark.read.format('json').load(corrupt_record_path + latest_partition + '/bad_records')
    corrupt_records_count = corrupt_dataframe.count()
else:
    corrupt_records_count = 0
But I see two issues:
it is a lot of overhead (OK, the code could also be better written, but still)
I'm not even sure when the partition name is defined in the reading job. Is it at the beginning of the job or at the end? If it is at the beginning, then the 60 seconds are not relevant at all.
As a side note, I cannot use a PERMISSIVE read with corrupt_records_column, as I don't want to cache the dataframe (you can see my other question here).
Any suggestion or observation would be much appreciated!
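The partition-name comparison can be written without the round trip through strftime and mktime. This is a minimal sketch, assuming the directory names follow the "%Y%m%dT%H%M%S" pattern used in the code above and that the cutoff is expressed in the same clock as those names. `partitions_since` is a hypothetical helper and the sample names are made up. It does not settle the open question of when Spark chooses the partition name.

```python
from datetime import datetime


def partitions_since(names, since, fmt="%Y%m%dT%H%M%S"):
    """Return the partition names whose embedded timestamp is at or after `since`."""
    recent = []
    for name in names:
        clean = name.rstrip("/")  # directory listings may end names with "/"
        if datetime.strptime(clean, fmt) >= since:
            recent.append(clean)
    return sorted(recent)


# Made-up directory names for illustration only
names = ["20220101T120000/", "20220101T121500/", "20220101T123000/"]
cutoff = datetime(2022, 1, 1, 12, 10, 0)
print(partitions_since(names, cutoff))
```

Recording the cutoff just before the read starts, instead of subtracting 60 seconds afterwards, would avoid guessing how long the job takes.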

Repeating a airflow DAG with different date parameters for data migrations

For data migrations, I have created a DAG which ultimately inserts data into a migration table after all the tasks with the required logic have run.
The DAG has a SQL query similar to the one below, which initially extracts the data and feeds it to the other tasks:
sql=" select col_names from tables where created_on >=date1 and created_on <=date2"
For each DAG run I am manually changing date1 and date2 in the above SQL and initiating the data migration (as each data chunk is heavy, the date range is currently 1 week long).
I just want to automate this date-changing process: e.g. if I give a date interval, after the first DAG run finishes, the second run is initiated, and so on until the end of the interval.
From my research so far, one solution is dynamic DAGs in Airflow. But the problem is that it creates multiple DAG file instances, and it is also very difficult to debug and maintain.
Is there a way to repeat a DAG with a changing date parameter so that I no longer have to keep changing the dates manually?
I had the exact same issue! Backfilling in Airflow doesn't seem to make any sense if you don't have the DAG interval start and end as input parameters. If you want to do data migration, you'll probably need to store your last migration time in a file to read. However, this goes against some of the properties an Airflow DAG/task should have (idempotence).
My solution was to add two tasks to my DAG before the start of my "main" tasks. I have two operators (you can possibly make it one) which get the start and end times of the current DAG run. The "start" and "end" names are somewhat misleading, because the "start" is actually the start of the previous run and the "end" is the start of the current run.
I can't reveal the custom operator I wrote but you can do this in a single Python operator:
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path
from croniter import croniter
def get_interval_start_end(**kwargs):
    dag = kwargs['dag']
    ti = kwargs['ti']
    dag_execution = ti.execution_date  # current DAG scheduled start
    dag_interval = dag._schedule_interval  # private attribute in Airflow 1.x; note the preceding underscore
    cron_iter = croniter(dag_interval, dag_execution)
    dag_prev_execution = cron_iter.get_prev()  # previous scheduled start
    return (dag_execution, dag_prev_execution)
# dag
task = PythonOperator(task_id='blabla',
                      python_callable=get_interval_start_end,
                      provide_context=True)
# other tasks
Then pull these values from xcom in your next task.
There is also a way to get the "last_run" of the DAG using dag.get_last_dagrun() instead. However, it doesn't return the previous scheduled run but the previous actual run. If you have already run your DAG for a "future" time, your "last dag run" will be after your current execution! Then again, I might not have tested with the right settings, so you can try that out first.
I had a similar requirement; here is how I accessed the dates, which can later be used in SQL queries for a backfill.
from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from datetime import datetime, timedelta
# Following are defaults which can be overridden later on
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'end_date': datetime(2020, 8, 3),
    'retries': 0,
}
dag = DAG('helloWorld_v1', default_args=default_args, catchup=True, schedule_interval='0 1 * * *')
def print_dag_run_date(**kwargs):
    print(kwargs)
    execution_date = kwargs['ds']  # date of this run
    prev_execution_date = kwargs['prev_ds']  # date of the previous scheduled run
    return (execution_date, prev_execution_date)
# bash and py are examples of tasks created using operators
bash = BashOperator(
    task_id='bash',
    depends_on_past=True,
    bash_command='echo "Hello World from Task 1"',
    dag=dag)
py = PythonOperator(
    task_id='py',
    depends_on_past=True,
    python_callable=print_dag_run_date,
    provide_context=True,
    dag=dag)
py.set_upstream(bash)
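Once the two dates are available, they can replace the hand-edited date1 and date2 in the migration query. This is a minimal sketch, assuming both values arrive as YYYY-MM-DD strings such as the `prev_ds` and `ds` context values. `build_migration_sql` is a hypothetical helper, and the table and column names are the placeholders from the question.

```python
from datetime import date


def build_migration_sql(prev_ds, ds):
    """Return the extraction query for the window from prev_ds (inclusive) to ds (exclusive)."""
    # Parse the strings so that only well-formed dates reach the SQL text
    start = date.fromisoformat(prev_ds)
    end = date.fromisoformat(ds)
    if start >= end:
        raise ValueError("prev_ds must be earlier than ds")
    return (
        "select col_names from tables "
        f"where created_on >= '{start.isoformat()}' and created_on < '{end.isoformat()}'"
    )


print(build_migration_sql("2020-08-01", "2020-08-02"))
```

The upper bound is exclusive here, unlike the `<=` in the question, so that consecutive runs do not migrate rows on the boundary twice. Keep `<=` if your windows are meant to overlap.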

Spark 1.5.2: Grouping DataFrame Rows over a Time Range

I have a df with the following schema:
ts: TimestampType
key: int
val: int
The df is sorted in ascending order of ts. Starting with row(0), I would like to group the dataframe within certain time intervals.
For example, if I say df.filter(row(0).ts + expr(INTERVAL 24 HOUR)).collect(), it should return all the rows within the 24 hr time window of row(0).
Is there a way to achieve the above within Spark DF context?
Generally speaking, it is a relatively simple task. All you need is basic arithmetic on UNIX timestamps. First, let's cast all timestamps to numerics:
val dfNum = df.withColumn("ts", $"ts".cast("long"))
Next, let's find the minimum timestamp over all rows:
val offset = dfNum.agg(min($"ts")).first.getLong(0)
and use it to compute groups:
val aDay = lit(60 * 60 * 24)
val group = (($"ts" - lit(offset)) / aDay).cast("long")
val dfWithGroups = dfNum.withColumn("group", group)
Finally, you can use it as a grouping column:
dfWithGroups.groupBy($"group").agg(min($"val"))
If you want meaningful intervals (interpretable as timestamps), just multiply the groups by aDay (and add offset to get absolute timestamps).
Obviously this won't handle complex cases like daylight saving time or leap seconds, but it should be good enough most of the time. If you need to handle any of these properly, you can use similar logic with Joda-Time in a UDF.
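The bucket arithmetic can be checked on a few numbers outside Spark. This is a plain-Python sketch of the same calculation, not a replacement for the DataFrame code. `assign_groups` is a hypothetical helper and the timestamps are made up.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def assign_groups(timestamps):
    """Map each UNIX timestamp (seconds) to its 24-hour bucket index, counted from the earliest one."""
    offset = min(timestamps)
    return [(ts - offset) // SECONDS_PER_DAY for ts in timestamps]


# Made-up timestamps: the first row, then 23, 25, and 49 hours later
ts = [1_000_000, 1_000_000 + 23 * 3600, 1_000_000 + 25 * 3600, 1_000_000 + 49 * 3600]
groups = assign_groups(ts)
print(groups)
# Start of each bucket as an absolute timestamp: group * day + offset
print([g * SECONDS_PER_DAY + min(ts) for g in groups])
```

Rows 23 hours apart share a bucket, while a row 25 hours after the first falls into the next one, which matches the integer division used for the group column.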
