I have 20 bash scripts. Some run every minute, some every hour, and some every day via cron. Now I need to migrate them to Airflow. As per the Airflow concept, this means I need to create 20 more files (DAG files).
Does Airflow provide a way to create a generic DAG template which can execute all the bash scripts on their given schedules with different DAG IDs?
I found a reference - Airflow dynamic DAG and Task Ids
But I am in doubt whether it is the right way or not.
You can create 3 DAGs:
a DAG scheduled every minute
an hourly scheduled DAG
a daily scheduled DAG
Under these DAGs, create tasks to execute your corresponding scripts.
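As a minimal sketch, one of these DAGs could look like the following; the script names and the /opt/scripts path are placeholders, and the BashOperator import path assumes Airflow 2.x (on 1.10 it lives in airflow.operators.bash_operator):

```python
# Minimal sketch of the hourly DAG; script names and path are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="hourly_bash_scripts",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # One task per script that used to run hourly from cron.
    for script in ["script_a.sh", "script_b.sh"]:
        BashOperator(
            task_id=script.replace(".sh", ""),
            # Trailing space keeps Airflow from trying to load the .sh file as a Jinja template.
            bash_command=f"bash /opt/scripts/{script} ",
        )
```

The minute-level and daily DAGs would be the same pattern with schedule_interval="* * * * *" and "@daily" respectively.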
I have an Azure Databricks job that is triggered via ADF using an API call. I want to see why the job has been taking n minutes to complete its tasks. When the job execution finishes, the job execution time says 15 minutes, but the individual cells/commands don't add up to even 4-5 minutes.
The interactive cluster is already up and running when this is triggered. Please tell me why the sum of the individual cell execution times doesn't match the overall job execution time. Where can I see what has taken the additional time?
Please follow the references below; they have detailed explanations about:
Execution time taken for a command cell in a Databricks notebook.
Measuring Apache Spark workload metrics for performance.
References:
How to measure the execution time of a query on Spark
https://db-blog.web.cern.ch/blog/luca-canali/2017-03-measuring-apache-spark-workload-metrics-performance-troubleshooting
https://spark.apache.org/docs/latest/monitoring.html
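For a rough first-pass check you can also time a single query yourself before digging into the Spark UI metrics from the links above; a minimal sketch, where my_table is a placeholder:

```python
# Rough wall-clock timing of one query; 'my_table' is a placeholder table name.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.time()
row_count = spark.sql("SELECT COUNT(*) AS c FROM my_table").collect()[0]["c"]
print(f"rows={row_count}, elapsed={time.time() - start:.1f}s")
```

The gap between this per-query time and the overall job duration usually shows up in the Spark UI (job/stage timeline) or in the cluster event log rather than in the cells themselves.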
I schedule my DAGs daily, but some DAG runs take 2-3 days. I have also set max_active_runs=1, which means exactly one run will be active at a time and the others will be queued.
So is there a way to find out exactly when my DAG run was triggered and when it was queued?
You can get this info from the scheduler logs.
Search for the DAG name and you will find the state changes and the point where the scheduler picked up the DAG.
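If grepping the scheduler logs is cumbersome, the same timestamps are also stored in the Airflow metadata database. A minimal sketch using the DagRun model, where my_daily_dag is a placeholder and the queued_at field only exists on newer Airflow 2.x releases:

```python
# Sketch: read run timestamps from the Airflow metadata DB; 'my_daily_dag' is a placeholder.
from airflow.models import DagRun

for run in DagRun.find(dag_id="my_daily_dag"):
    # start_date is when the run actually started executing;
    # queued_at (only present on newer Airflow 2.x versions) is when it entered the queue.
    print(run.execution_date, run.state, run.start_date, getattr(run, "queued_at", None))
```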
I have 3 Spark scripts, and each of them has one Spark SQL query that reads a partitioned table and stores the result to an HDFS location. Every script has a different SQL statement and a different folder location to store data into.
test1.py - Read from table 1 and store to location 1.
test2.py - Read from table 2 and store to location 2.
test3.py - Read from table 3 and store to location 3.
I run these scripts using a fork action in Oozie and all three run. But the problem is that the scripts are not storing data in parallel.
Once the store from one script is done, the next store starts.
My expectation is to store all 3 tables' data into their respective locations in parallel.
I have tried FAIR scheduling and other scheduler techniques in the Spark scripts, but those don't work. Can anyone please help? I have been stuck on this for the last 2 days.
I am using AWS EMR 5.15, Spark 2.4 and Oozie 5.0.0.
For the Capacity Scheduler
If you are submitting jobs to a single queue, whichever job comes first in the queue gets the resources; intra-queue preemption won't work.
There is a related Jira for intra-queue preemption in the Capacity Scheduler: https://issues.apache.org/jira/browse/YARN-10073
You can read more at https://blog.cloudera.com/yarn-capacity-scheduler/
For the Fair Scheduler
Setting the yarn.scheduler.fair.preemption parameter to true in yarn-site.xml enables preemption at the cluster level. By default it is false, i.e. no preemption.
Your problem could be:
One job is taking the maximum resources. To verify this, please check the YARN UI and Spark UI.
Or, if you have more than one YARN queue (other than default), try setting User Limit Factor > 1 for the queue you are using.
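If one job is indeed grabbing all the executors, a minimal sketch of capping each script's footprint so the three jobs can hold containers at the same time; the table name, output path, and all numbers are placeholders to be tuned for your EMR cluster:

```python
# Sketch: cap each Spark job's resource footprint so three concurrent jobs can share the cluster.
# Table name, path, and all numbers are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("test1")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "5")  # leave room for the other two jobs
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

spark.sql("SELECT * FROM table1").write.mode("overwrite").parquet("hdfs:///data/location1")
```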
When a DataFrame is split and then joined again on different columns, how many stages are created in the DAG, how are they created, and how are tasks created within the stages?
How does the DAG work in Spark?
The interpreter is the first layer: using a Scala interpreter, Spark interprets the code with some modifications.
Spark creates an operator graph when you enter your code in the Spark console.
When we call an action on a Spark RDD, at a high level Spark submits the operator graph to the DAG Scheduler.
The DAG Scheduler divides the operators into stages of tasks. A stage contains tasks based on the partitions of the input data. The DAG Scheduler pipelines operators together; for example, map operators are scheduled in a single stage.
The stages are passed on to the Task Scheduler, which launches the tasks through the cluster manager. The dependencies between stages are unknown to the Task Scheduler.
The workers execute the tasks on the slave nodes.
You can get more information from this link: https://data-flair.training/blogs/dag-in-apache-spark/
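As a small illustration of the stage boundaries: narrow transformations such as map are pipelined into one stage, while a shuffle (e.g. reduceByKey or a join) starts a new stage. A minimal sketch you can run and inspect in the Spark UI:

```python
# Sketch: two stages, split at the shuffle introduced by reduceByKey.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), 4)             # 4 partitions -> 4 tasks per stage
pairs = rdd.map(lambda x: (x % 10, 1))          # narrow transformation: stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)  # shuffle: starts a new stage
counts.collect()                                # the action triggers the DAG; see the "Stages" tab in the Spark UI
```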
I wonder if it is bad practice to write (SaveMode.Append) to the same HDFS directory at the same time from two Spark jobs.
Do you have any idea?
It's not a bad practice in itself, but when jobs are chained one after another, the chances are very high that the output from one job is missed.
Example: Spark job 1 and job 2 both write to the HDFS path /user/output, and Spark job 3 consumes from that HDFS path.
If you build the job chain in Oozie, there can be a situation where job 1 and job 3 have run while job 2 runs after job 3, leading to job 2's data not being consumed by job 3.
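A minimal sketch of the scenario being described, with placeholder DataFrames and the /user/output path from the example; whether job 3 sees both outputs depends entirely on the order in which the jobs actually ran:

```python
# Sketch of the scenario: two jobs appending to the same HDFS path, a third consuming it.
# The DataFrames and the path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Job 1 and Job 2 both append to the same directory.
spark.range(10).write.mode("append").parquet("hdfs:///user/output")
spark.range(10, 20).write.mode("append").parquet("hdfs:///user/output")

# Job 3 reads whatever files are present at the moment it runs;
# if Job 2 appends only after Job 3 has already read, that data is missed.
spark.read.parquet("hdfs:///user/output").show()
```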