Get statistics of DAG run times

I'm trying to export DAG statistics out of Airflow. The statsd output is not very useful, so I decided to basically run a DAG that queries the metadata database and exports the results to, say, InfluxDB.
It's easy enough to create a DAG to query the Postgres Airflow database; however, I'm a little stumped by the schema. I would have thought:
select run_id, start_date, end_date from dag_run where dag_id = 'blah';
would do it, but end_date never appears to be populated.
All I'm really after is the total time from when the DAG run started (i.e. when the first task is actually initiated, as opposed to when the run is first put into a running state) to when the DAG went into a success state.

Try hitting the task_instance table:
SELECT execution_date
, MIN(start_date) AS start
, MAX(end_date) AS end
, MAX(end_date) - MIN(start_date) AS duration
FROM task_instance
WHERE dag_id = 'blah'
AND state = 'success'
GROUP BY execution_date
ORDER BY execution_date DESC
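To close the loop on the original goal (exporting to InfluxDB), here is a minimal sketch of a DAG around that query. It assumes Airflow 2.x with the Postgres provider and the influxdb-client package installed, a Postgres connection id airflow_db pointing at the metadata database, and placeholder InfluxDB URL/token/org/bucket values:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

STATS_SQL = """
SELECT execution_date,
       MIN(start_date) AS start,
       MAX(end_date) AS "end",
       MAX(end_date) - MIN(start_date) AS duration
FROM task_instance
WHERE dag_id = 'blah' AND state = 'success'
GROUP BY execution_date
"""

def export_dag_stats():
    # Query the Airflow metadata database (connection id is an assumption).
    rows = PostgresHook(postgres_conn_id="airflow_db").get_records(STATS_SQL)
    # Write one point per DAG run; URL, token, org and bucket are placeholders.
    with InfluxDBClient(url="http://influxdb:8086", token="my-token", org="my-org") as client:
        write_api = client.write_api(write_options=SYNCHRONOUS)
        for execution_date, start, end, duration in rows:
            point = (
                Point("dag_run_duration")
                .tag("dag_id", "blah")
                .field("duration_seconds", duration.total_seconds())
                .time(end)
            )
            write_api.write(bucket="airflow-stats", record=point)

with DAG(
    dag_id="export_dag_stats",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="export_dag_stats", python_callable=export_dag_stats)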

Related

DB Connect and workspace notebooks returns different results

I'm using DB Connect 9.1.9.
My cluster version is 9.1 LTS with a single node (for test purposes).
My data is stored on S3 as a Delta table.
Running the following:
df = spark.sql("select * from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'")
When I run it with DB Connect I get: 31.
When I run it on a Databricks Workspace: 462.
Of course you can't check those numbers yourself; I just wanted to find out why we have a difference.
If I remove the condition on runDate, I get the same results on both platforms.
So I deduced that runDate was at fault, but I can't find out why.
The schema:
StructType(List(StructField(id,StringType,False),
StructField(runDate,TimestampType,true)))
I have the same explain plan on both platform too.
Did I miss something about timestamp usage?
Update 1: funnily enough, when I put the count() directly inside the spark.sql("SELECT count(*) ...") I still get 31 rows. It might be the way db-connect translates the query for the cluster.
The problem was the timezone associated with the Spark session.
Add this after your Spark session declaration (in case your dates are stored in UTC):
spark.conf.set("spark.sql.session.timeZone", "UTC")
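For context, a minimal sketch of where that line sits in a db-connect script (the table name placeholder and date range are just the ones from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Pin the session timezone so timestamp literals are interpreted as UTC,
# matching how the data was written.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.sql(
    "select * from <table_name> "
    "where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'"
)
print(df.count())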

Airflow Scheduling first of month

I want to schedule a DAG on the first of the month at 5 AM UTC. So let's say I want to start running my DAG from 01/01/2021 at 5 AM: what should my start date and schedule interval be? I want the DAG run on 01/01/2021 to have the same execution date, i.e. 01/01/2021. Any leads on how this could be achieved?
Thanks
The FAQs about execution_date may help you understand what's happening, (see also DAG Runs):
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if you want to summarize data for 2016-02-19, You would do it at 2016-02-20 midnight UTC, which would be right after all data for 2016-02-19 becomes available.
Basically, the DAG with execution_date = 2021-01-01T05:00:00+00:00 will actually be executed one schedule_interval later (2021-02-01T05:00:00+00:00). The actual date the execution occurred is represented in the start_date attribute of the "dag_run" object (you can access it through the execution context parameters). It is the same date that you can find in the Explore UI >> Dag Runs >> Start Date column.
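To see both values side by side, here's a minimal sketch of a task that reads them from the execution context (assuming Airflow 2.x; the task_id and function name are just illustrative):
from airflow.operators.python import PythonOperator

def print_run_dates(**context):
    # execution_date: the logical date of the run (start of the schedule interval)
    print("execution_date:", context["execution_date"])
    # dag_run.start_date: when the run actually started executing
    print("actual start:", context["dag_run"].start_date)

# Add this inside your DAG definition (e.g. the dummy DAG below):
report_dates = PythonOperator(task_id="print_run_dates", python_callable=print_run_dates)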
Try creating a dummy DAG like the following:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

args = {
    "owner": "airflow",
}

with DAG(
    dag_id="dummy_dag",
    default_args=args,
    start_date=datetime(2021, 1, 1, 5),
    schedule_interval="0 5 1 * *",
) as dag:
    t1 = DummyOperator(task_id="task_1")
After the first execution, you can play around with the CLI to calculate future execution dates:
~/airflow$ airflow dags next-execution dummy_dag -n 10 -h
usage: airflow dags next-execution [-h] [-n NUM_EXECUTIONS] [-S SUBDIR] dag_id
Get the next execution datetimes of a DAG.
It returns one execution unless the num-executions option is given
Let me know if that worked for you!

Airflow dags - reporting of the runtime for tracking purposes

I am trying to find a way to capture DAG stats, i.e. run time (start time, end time), status, dag id, task id, etc., for various DAGs and their tasks, in a separate table.
I found the default logs, which go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into the S3 table.
Building a separate process to load those logs into S3 would replicate data, and there would be too much data to scan and filter, as tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other options are there to do this efficiently, or is there any other built-in Airflow feature that can be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available under Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_date, end_date, etc., you can simply query in the format below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name'
The above query returns start_date and end_date for all runs of the DAG. If you wish to get details for a particular run, you can add another filter condition like below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' -- this is a sample time
You can get this execution_date from the tree or graph views. You can also get other stats like id, dag_id, execution_date, state, run_id and conf.
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if they suit your needs.
However, pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.
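If a periodic batch pull is acceptable, here is a minimal sketch of reading dag_run stats with a hook into a DataFrame. The connection id and the final write step are assumptions; replace the write with whatever S3/warehouse sink you use:
from airflow.providers.postgres.hooks.postgres import PostgresHook

def pull_dag_run_stats(dag_id="your_dag_name"):
    # Read run-level stats straight from the Airflow metadata database.
    hook = PostgresHook(postgres_conn_id="airflow_db")  # assumed connection id
    df = hook.get_pandas_df(
        """
        SELECT id, dag_id, run_id, execution_date, state, start_date, end_date
        FROM dag_run
        WHERE dag_id = %(dag_id)s
        """,
        parameters={"dag_id": dag_id},
    )
    df["duration_seconds"] = (df["end_date"] - df["start_date"]).dt.total_seconds()
    # Write df wherever you need it, e.g. df.to_parquet("s3://your-bucket/dag_stats/")
    return df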

In PySpark groupBy, how do I calculate execution time by group?

I am using PySpark for a university project, where I have large dataframes and I apply a PandasUDF, using groupBy. Basically the call looks like this:
df.groupBy(col).apply(pandasUDF)
I am using 10 cores in my Spark config (SparkConf().setMaster('local[10]')).
The goal is to report the time each group takes to run my code, so that I can take the average; I am also interested in calculating the standard deviation.
I am now testing with cleaned data that I know will be separated into 10 groups, and I have the UDF print the running time using time.time(). But with more groups this won't be feasible (for context, all my data will be separated into 3000-something groups). Is there a way to measure the execution time per group?
If you don't want to print the execution time to stdout, you could return it as an extra column from the Pandas UDF instead, e.g.:
from datetime import datetime
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("my_col long, execution_time double", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    start = datetime.now()
    # Some business logic
    return pdf.assign(execution_time=(datetime.now() - start).total_seconds())
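Since the question also asks about the standard deviation, and assuming the UDF's output schema also carries the grouping column (written here as the string "col", as in the question), the per-group times could then be aggregated on the driver side, e.g.:
from pyspark.sql import functions as F

# Each row of a group carries that group's execution_time, so reduce to one
# value per group before computing the mean and standard deviation.
result = df.groupBy("col").apply(my_pandas_udf)
per_group = result.groupBy("col").agg(F.first("execution_time").alias("execution_time"))
per_group.agg(
    F.mean("execution_time").alias("mean_seconds"),
    F.stddev("execution_time").alias("stddev_seconds"),
).show()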
Alternatively, to compute the average execution time in the driver application, you could accumulate the total execution time and the number of UDF calls with two accumulators, e.g.:
# `sc` is the SparkContext, e.g. spark.sparkContext
udf_count = sc.accumulator(0)
total_udf_execution_time = sc.accumulator(0.0)

@pandas_udf("my_col long", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    start = datetime.now()
    # Some business logic
    udf_count.add(1)
    total_udf_execution_time.add((datetime.now() - start).total_seconds())
    return pdf

# Some Spark action to run the business logic, e.g. df.groupBy("col").apply(my_pandas_udf).count()
mean_udf_execution_time = total_udf_execution_time.value / udf_count.value

Airflow schedule_interval and start_date to get it to always fire the next interval

How can I configure Airflow (MWAA) so that it will fire at the same time (6 AM PST) every day, regardless of when the DAG is deployed?
I have tried what makes sense to me:
Set the schedule_interval to 0 6 * * *.
Set the start date to:
from datetime import datetime
import pendulum

now = datetime.utcnow()
now = now.replace(tzinfo=pendulum.timezone('America/Los_Angeles'))
previous_five_am = now.replace(hour=5, minute=0, second=0, microsecond=0)
start_date = previous_five_am
It seems that whenever I deploy with the start_date set to 5 AM the previous day, it always fires at the next 6 AM, no matter what time I deploy the DAG or do an Airflow update.
Your confusion may come from expecting Airflow to schedule DAGs the way cron does, which it does not.
The first DAG Run is created based on the minimum start_date for the tasks in your DAG. Subsequent DAG Runs are created by the scheduler process, based on your DAG's schedule_interval, sequentially. Airflow schedules tasks at the END of the interval (see the docs); you can view this answer for examples.
As for your sample code: never set your start_date to be dynamic. It's a bad practice that can lead to the DAG never being executed, because now() keeps moving forward and now() + interval may never be reached; see the Airflow FAQ.
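A minimal sketch of the static, timezone-aware pattern the answer recommends (the dag_id and the catchup=False setting are just illustrative choices):
import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="daily_6am_pacific",  # hypothetical dag_id
    start_date=pendulum.datetime(2021, 1, 1, tz="America/Los_Angeles"),  # fixed, timezone-aware
    schedule_interval="0 6 * * *",  # 6 AM Pacific every day
    catchup=False,  # don't backfill runs between start_date and now
) as dag:
    t1 = DummyOperator(task_id="task_1")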

Resources