Airflow DAGs - reporting runtime statistics for tracking purposes

I am trying to find a way to capture DAG stats - i.e. run time (start time, end time), status, dag id, task id, etc. - for various DAGs and their tasks in a separate table.
I found the default logs that go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into an S3 table.
Building a separate process to load those logs into S3 would replicate data, and there would also be too much data to scan and filter, since tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other possibilities are there to get this done efficiently, or is there any other Airflow built-in feature that can be used?

You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available at Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start time, end time, etc., you can simply query in the format below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name'
The above query returns the start_date and end_date of the DAG for all DAG runs. If you wish to get details for a particular run, you can add another filter condition, like below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' ##this is a sample time
You can get this execution_date from the tree or graph views. You can also get other stats such as id, dag_id, execution_date, state, run_id, and conf.
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
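If you prefer to pull these stats programmatically (for example, to land them in S3 on a schedule) instead of through the UI, the same metadata tables can be read with plain SQLAlchemy and pandas. Below is a minimal sketch, assuming you have a connection URL for the Airflow metadata database (the URL here is a placeholder) and the standard dag_run and task_instance tables; note that column names vary slightly across Airflow versions (newer releases key task_instance on run_id rather than execution_date).
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the Airflow metadata DB - adjust to your setup.
AIRFLOW_DB_URL = "postgresql://user:password@airflow-metadata-host/airflow"
engine = create_engine(AIRFLOW_DB_URL)

# DAG-level stats: one row per DAG run.
dag_runs = pd.read_sql(
    """
    SELECT dag_id, run_id, execution_date, start_date, end_date, state
    FROM dag_run
    WHERE dag_id = 'your_dag_name'
    """,
    engine,
)

# Task-level stats: one row per task instance (duration is in seconds).
task_runs = pd.read_sql(
    """
    SELECT dag_id, task_id, start_date, end_date, duration, state
    FROM task_instance
    WHERE dag_id = 'your_dag_name'
    """,
    engine,
)

print(dag_runs.head())
print(task_runs.head())
From there, writing the resulting dataframes out to S3 (for example with to_parquet and s3fs installed) avoids touching the individual DAGs at all.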

You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if that suits your need.
Pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.


Can we set task-wise parameters using the Databricks Jobs API "run-now"?

I have a job with multiple tasks like Task1 -> Task2. I am trying to call the job using the "run now" API. Task details are below:
Task1 - executes a notebook with some input parameters
Task2 - executes a notebook with some input parameters
So, how can I provide parameters to the Jobs API using the "run now" command for Task1 and Task2?
I have a parameter "lib" which needs to have the value 'pandas' for one task and 'spark' for the other.
I know that we can give unique parameter names like Task1_lib, Task2_lib and read them that way.
Current way:
json = {"job_id": 3234234, "notebook_params": {"Task1_lib": "a", "Task2_lib": "b"}}
Is there a way to send task-wise parameters?
It's not supported right now - parameters are defined at the job level. You can ask your Databricks representative (if you have one) to communicate this ask to the product team that works on Databricks Workflows.
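For completeness, here is a minimal sketch of the prefixed-parameter workaround already described in the question, calling the Jobs run-now REST endpoint directly; the host, token, and job_id values are placeholders, and each task's notebook is assumed to read its own prefixed widget (e.g. dbutils.widgets.get("Task1_lib")).
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "dapiXXXXXXXX"                                              # placeholder

payload = {
    "job_id": 3234234,  # placeholder job id
    # notebook_params apply to the whole job run, so each task reads its own
    # prefixed key instead of a per-task parameter.
    "notebook_params": {"Task1_lib": "pandas", "Task2_lib": "spark"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # response includes the run_id of the triggered run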

Profiling the Spark Analyzer: how to access the QueryPlanningTracker for a pyspark query?

Any Spark & Py4J gurus available to explain how to reliably access Spark's java objects and variables from the Python side of pyspark? Specifically, how to access the data in Spark's QueryPlanningTracker from python?
Details
I am trying to profile the creation of a pyspark dataframe (df = spark_session.sql(thousand_line_query)) - not running the query, just creating the dataframe so I can inspect its schema. Merely waiting for the return from that .sql() call, which initializes the dataframe with no data, takes a long time (10-30 seconds). I have tracked the slow steps to Spark's Analyzer stage. Logging (below) suggests Spark is recomputing the same sub-query too many times, so I'm trying to dig in and see what is going on by profiling Spark's work on my query. I tried methods from a number of articles on profiling the Spark Optimizer stage for executing queries (e.g. Luca Canali's sparkMeasure, Rose Toomey's Care and Feeding of Catalyst Optimizer), but I have found no guide that focuses on profiling the Spark Analyzer stage that runs before the Optimizer stage. (Hence I also include extra details below on what I've found, which others may find helpful.)
Reading Spark's Scala sourcecode, I see the Analyzer is a RuleExecutor, and RuleExecutors have a QueryPlanningTracker which seems to record details on each invocation of each Analyzer Rule that Spark runs, specifically to allow a person to reconstruct a timeline of what the analyzer is doing for a single query.
However, I cannot seem to access the data in the Analyzer's QueryPlanningTracker from Python. I would like to retrieve a QueryPlanningTracker java object with the full details of the run of one query, and to inspect from the Python side what fields & methods are available on it. Any advice?
Examples
In python using pyspark, request a dataframe for my 1,000-line query and find it is slow:
query_sql = 'SELECT ... <long query here>'
spark_df = spark_session.sql(query_sql) # takes 10-30 seconds
Turn on copious logging, rerun the query above, and look at the output: the slow steps all mention the PlanChangeLogger, which is in the Spark Analyzer. Also access Spark's RuleExecutor to see how much time is used by each rule & which rules are not effective:
spark_session.sparkContext.setLogLevel('ALL')
rule_executor = spark_session._jvm.org.apache.spark.sql.catalyst.rules.RuleExecutor
rule_executor.resetMetrics()
spark_df = spark_session.sql(query_sql) # logs 10,000+ lines of output, lines with keyword `PlanChangeLogger` give timestamps showing the slow steps are in the Analyzer, but not the order of steps that occur
print(rule_executor.dumpTimeSpent()) # prints Analyzer rules that ran, how much time was 'effective' for each rule, but no specifics on order of rules run, no details on why rules took up a lot of time but were not effective.
Next: Try (unsuccessfully) to access Spark's QueryPlanningTracker data to drill down to a timeline of rules run, how long each call to each rule took, and any other specifics I can get:
tracker = spark_session._jvm.org.apache.spark.sql.catalyst.QueryPlanningTracker
## Use some call here to show the contents of the tracker; e.g. an initial exploration:
tracker.measurePhase.topRulesByTime(10)
*** TypeError: 'JavaPackage' object is not callable ....
The above is one example; the tracker code suggests it has other methods & fields I could use. However, I do not see how to access those, nor how to inspect from Python which methods & fields are available, so it is just trial & error from reading Spark's GitHub repository...
You can try this:
>>> df = spark.range(1000).selectExpr("count(*)")
>>> tracker = df._jdf.queryExecution().tracker()
>>> print(tracker)
org.apache.spark.sql.catalyst.QueryPlanningTracker@5702d8be
>>> print(tracker.topRulesByTime(10))
Stream((org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions,RuleSummary(27004600, 2, 1)), ?)
I'm not sure what kind of info you need, but if you want to see the generated query plan, you can use df.explain().
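As a follow-up, here is a small sketch of poking further at the tracker from Python. It assumes Spark 3.x, where QueryPlanningTracker also exposes rules and phases as no-argument Scala defs that py4j can call as methods; the printed output is simply the Scala toString of the underlying collections.
df = spark_session.sql(query_sql)               # the slow query from the question
tracker = df._jdf.queryExecution().tracker()    # per-query tracker, as in the answer above

# Wall-clock summary per planning phase (analysis, optimization, planning, ...)
print(tracker.phases().toString())

# Per-rule summary: RuleSummary(totalTimeNs, numInvocations, numEffectiveInvocations)
print(tracker.rules().toString())

# Top-k rules by cumulative time, same call as shown in the answer above
print(tracker.topRulesByTime(10))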

Basic query with TIMESTAMP BY not producing output

I have a very basic setup, in which I never get any output if I use the TIMESTAMP BY statement.
I have a stream analytics job which is reading from Event Hub and writing to the table storage.
The query is the following:
SELECT
*
INTO
MyOutput
FROM
MyInput TIMESTAMP BY myDateTime;
If the query uses the TIMESTAMP BY statement, I never get any output events. I do see incoming events in the monitoring, and there are no errors in either the monitoring or the maintenance logs. I am pretty sure that the source data has the right column in the right format.
If I remove the TIMESTAMP BY statement, then everything works fine. The reason I need it in the first place is that I need to write a number of queries in the same job, writing various aggregations to different outputs, and if I use TIMESTAMP BY in one query, I am required to use it in all the other queries as well.
Am I doing something wrong? Perhaps SELECT * does not play well with TIMESTAMP BY? I just did not find any documentation explaining that...
{"myDateTime":"2015-08-02T10:59:02.0000000Z", "EventEnqueuedUtcTime":"2015-08-07T10:59:07.6980000Z"}
Late tolerance window: 00.00:00:05
All of your events are considered late arriving because myDateTime is 5 days before EventEnqueuedUtcTime. Can you try sending new events where myDateTime is in UTC and is "now" so it matches within a couple of seconds?
Also, when you started the job, what did you pick as the job start date time? Can you make sure you pick a date before the myDateTime values? You might try this first.

Duplicate jobs in Sun Grid Engine

When I run qacct with the job ID after it has finished, I get two results:
the one I ran and an older job with the same job ID.
How can I delete the history of qacct?
Does anyone know how to solve this?
Thanks,
Tsvi
Grid Engine (or SGE) has job IDs in the range 0..99999. This may roll over quickly in some clusters, and people may be interested in finding statistics of older jobs with the same ID. You can identify your jobs if you also know the approximate submit time.
Anyway, if you want to eliminate the duplicate job IDs from qacct, you can rotate the accounting file (<sge_root>/<cell>/common/accounting) using utilities like logchecker.sh.
Check the man page or this Grid Engine online documentation:
http://gridscheduler.sourceforge.net/howto/rotatelogs.html
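If you want to pick out your own run among the duplicates programmatically, a rough sketch along these lines may help. It assumes qacct's usual plain-text output, where records are separated by ruler lines of '=' characters and each record carries jobnumber, qsub_time, start_time, and end_time fields; the job ID is a placeholder.
import subprocess

job_id = "12345"  # placeholder
out = subprocess.run(["qacct", "-j", job_id], capture_output=True, text=True).stdout

# Split the output into records; a ruler line of '=' characters starts each record.
records, current = [], []
for line in out.splitlines():
    if line.startswith("===="):
        if current:
            records.append(current)
        current = []
    else:
        current.append(line)
if current:
    records.append(current)

# Print key fields per record so the right run can be picked by approximate submit time.
for rec in records:
    fields = {}
    for line in rec:
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    print(fields.get("jobnumber"), fields.get("qsub_time"),
          fields.get("start_time"), fields.get("end_time"))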

How to know OPC job status using Syncsort or any other method?

My objective is to get the current timestamp using Syncsort if one OPC job (an existing job) ran fine in production. In my case I cannot insert my new job after the existing OPC job. Is there any facility to check whether the existing job ran fine in production?
I mean, is there any reference table holding production job details with a status for each day?
Can anyone please help me move forward?
There are commercial packages that track jobs and job status. CA (Computer Associates) is one such vendor.
However, these packages cost a lot. A simple, home-grown solution is to have a dataset known to both jobs: write a one-line record into that dataset when job1 completes, and have job2 read the dataset to "KNOW" whether job1 ran. If this is what you are trying to do, it is not exactly clear from your question, but any solution along these lines works - until management wants to cough up $50K (or whatever) for a commercial package.
