Airflow: Get trigger date of DAG (airflow, python-3.x)

I schedule my DAGs daily, but some DAG runs take 2-3 days. I have also set max_active_runs=1, which means exactly one DAG run is active at a time and the others are queued.
So is there a way to find out exactly when my DAG run was triggered and when it was queued?

You can get this info from the scheduler logs.
Search for the DAG name and you will find the state changes and the point where the scheduler picked up the DAG.
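If you want something more structured than the logs, a minimal sketch (assuming Airflow 2.1+, where DagRun has a queued_at column, and a hypothetical DAG id my_daily_dag) is to query the metadata database through the ORM; queued_at shows when a run was queued and start_date shows when it actually started:

```python
# Sketch: read DAG run timings from the Airflow metadata DB (assumes Airflow 2.1+).
# "my_daily_dag" is a placeholder; queued_at is when the run was queued,
# start_date is when the scheduler actually started it.
from airflow.models import DagRun
from airflow.utils.session import provide_session


@provide_session
def print_run_times(dag_id, session=None):
    runs = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id)
        .order_by(DagRun.execution_date.desc())
        .limit(5)
    )
    for run in runs:
        print(run.run_id, run.queued_at, run.start_date, run.end_date)


print_run_times("my_daily_dag")
```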

Related

Why do the job running time and command execution time not match in a Databricks notebook?

I have an Azure Databricks job and it's triggered via ADF using an API call. I want to see why the job has been taking n minutes to complete its tasks. In the job execution results, the job execution time says 15 minutes, but the individual cells/commands don't add up to even 4-5 minutes.
The interactive cluster was already up and running when this was triggered. Please tell me why the sum of individual cell execution times doesn't match the overall job execution time. Where can I see what has taken the additional time here?
Please follow the references below; they have detailed explanations of:
Execution time taken for a command cell in a Databricks notebook.
Measuring Apache Spark workload metrics for performance.
References:
How to measure the execution time of a query on Spark
https://db-blog.web.cern.ch/blog/luca-canali/2017-03-measuring-apache-spark-workload-metrics-performance-troubleshooting
https://spark.apache.org/docs/latest/monitoring.html
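As a minimal sketch of the second point (assuming PySpark and the third-party sparkmeasure package described in the CERN blog post above, with its spark-measure jar available to the session), you can wrap a query with stage-level metric collection to see where the wall-clock time actually goes:

```python
# Sketch: measure stage-level metrics around a query with sparkmeasure (third-party package;
# requires the ch.cern.sparkmeasure:spark-measure jar on the session's classpath).
from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)  # assumes an existing SparkSession named `spark`

stagemetrics.begin()
spark.sql("SELECT count(*) FROM range(1000 * 1000)").show()
stagemetrics.end()

# Prints executor run time, CPU time, shuffle metrics, etc. for the measured stages,
# which can be compared against the wall-clock duration reported for the job.
stagemetrics.print_report()
```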

Common Airflow DAG file to run multiple jobs

I have 20 bash scripts. Some run every minute, some every day, and some every hour using cron. Now I need to migrate them to Airflow. For this, as per the Airflow concept, I would need to create 20 more files (DAG files).
Does Airflow provide a way to create a generic DAG template which can execute all the bash scripts on their given schedules with different DAG IDs?
I found a reference - Airflow dynamic DAG and Task Ids
But I am in doubt whether it is the right way or not.
You can create 3 DAGs:
a DAG scheduled every minute
an hourly scheduled DAG
a daily scheduled DAG
Under these DAGs, create tasks to execute your corresponding scripts, as sketched below.
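A minimal sketch of the hourly DAG, assuming Airflow 2.x and placeholder script paths (the minute and daily DAGs would look the same with a different schedule_interval):

```python
# Sketch: one DAG per schedule, one BashOperator task per script (Airflow 2.x).
# The script paths below are placeholders; replace them with your own.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

HOURLY_SCRIPTS = ["/opt/scripts/report.sh", "/opt/scripts/cleanup.sh"]  # hypothetical paths

with DAG(
    dag_id="hourly_scripts",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    for path in HOURLY_SCRIPTS:
        BashOperator(
            task_id=path.split("/")[-1].replace(".sh", ""),
            # Trailing space stops Jinja from trying to render the .sh file as a template.
            bash_command=f"bash {path} ",
        )
```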

In Apache Spark, do tasks in the same stage work simultaneously or not?

Do tasks in the same stage work simultaneously? If so, what does the line between partitions in a stage refer to? (See the attached example of a DAG.)
Here is a good link for your reading that explains the DAG in detail and a few other things that may be of interest: the Databricks blog on the DAG.
I can try to explain. As each stage is created, it has a set of tasks that are divided among the executors. When an action is encountered, the driver sends the tasks to the executors. Based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. The arrows that you are seeing represent the execution plan: for example, the map function cannot run prior to reading the file. Each node that holds some data will execute those tasks in the order given by the DAG, and the tasks within a stage run in parallel across the available executor cores.
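As a quick illustration (a sketch, assuming a local PySpark session with 4 cores), each partition becomes one task, and the tasks of a stage run concurrently up to the number of available cores:

```python
# Sketch: tasks within one stage run in parallel, one task per partition (local PySpark).
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("stage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(8), numSlices=8)  # 8 partitions -> 8 tasks in the stage

def slow_square(x):
    time.sleep(1)  # each task sleeps for 1 second
    return x * x

start = time.time()
rdd.map(slow_square).collect()  # single stage; 8 tasks run 4 at a time on local[4]
print(f"elapsed: {time.time() - start:.1f}s")  # roughly 2s rather than 8s, because tasks overlap
spark.stop()
```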

Spark Streaming - Job Duration vs Submitted

I am trying to optimize a Spark Streaming application which collects data from a Kafka cluster, processes it and saves the results to various database tables. The Jobs tab in the Spark UI shows the duration of each job as well as the time it was submitted.
I would expect that for a specific batch, a job starts processing when the previous job is done. However, in the attached screenshot, the "Submitted" time of a job is not right after the previous job finishes. For example, job 1188 has a duration of 1 second and it was submitted at 12:02:12. I would expect that the next job would be submitted one second later, or at least close to it, but instead it was submitted six seconds later.
Any ideas on how this delay can be explained? These jobs belong to the same batch and are done sequentially. I know that there is some scheduling delay between jobs and tasks, but I would not expect it to be that large. Moreover, the Event Timeline of a Stage does not show large Scheduling Delay.
I am using PySpark in standalone mode.

Using two Spark jobs to write at the same time to the same HDFS directory

I wonder if it is bad practice to write (SaveMode.Append) to the same HDFS directory at the same time with two Spark jobs.
Do you have any idea?
It's not a bad practice in itself, but in reality, when jobs are chained one after another, the chances are very high that the output from one job is missed.
Example: Spark job 1 and job 2 write to the HDFS path /user/output, and Spark job 3 consumes from that path.
If you build the job chain in Oozie, there can be a situation where job 1 and job 3 have run while job 2 ran after job 3, leading to job 2's data not being consumed by job 3.
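For reference, a minimal sketch of what one of the concurrent writers looks like (PySpark, with a hypothetical /user/output path); each job using SaveMode.Append simply adds its own part files to the directory, so neither writer fails, but a downstream consumer cannot tell which writers have finished:

```python
# Sketch: one of two independent Spark jobs appending Parquet files to the same HDFS directory.
# The path is hypothetical; there is no ordering guarantee between the two writers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writer-job-1").getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "value")
df.write.mode("append").parquet("hdfs:///user/output")  # writer-job-2 does the same concurrently

spark.stop()
```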
