I schedule my DAGs daily, but some DAG runs take 2-3 days. I have also set max_active_runs=1, which means exactly one DAG run executes at a time and the others are queued.
So is there a way to get exactly when my DAG run was triggered and when it was queued?
You can get this information from the scheduler logs.
Search for the DAG name and you will find the state changes and the point where the scheduler picks up the DAG.
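If you prefer structured data over grepping logs, the same timestamps are also stored on the DagRun records in the Airflow metadata database. Here is a minimal sketch, assuming Airflow 2.1+ (where DagRun exposes queued_at alongside start_date) and a placeholder DAG id of my_dag:
from airflow.models import DagRun

# Fetch all runs for the DAG (placeholder id "my_dag"), newest first
runs = DagRun.find(dag_id="my_dag")
for run in sorted(runs, key=lambda r: r.execution_date, reverse=True):
    # queued_at  = when the run entered the queue (Airflow >= 2.1)
    # start_date = when the scheduler actually started the run
    print(run.execution_date, run.queued_at, run.start_date, run.state)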
I have created a scheduled job in Databricks to execute a notebook at regular intervals.
Within the notebook there are many commands split across cells, and one of the cells contains a Spark streaming query.
The scheduled job fails because the streaming query takes some time to complete, but the next command starts executing before the streaming query has finished, so the job fails.
How can I create a dependency between these two commands? I want the next command to run only after the streaming query has completed.
I am using the DataFrame API with PySpark. Thanks
You need to wait for the query to finish. This is usually done with the .awaitTermination function (doc), like this:
query = df.writeStream.....
query.awaitTermination()
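For a scheduled (non-continuous) job, a slightly fuller sketch of the same pattern might look like this, assuming you want the stream to process the data that is currently available and then stop; the sink format, checkpoint location and output path below are placeholders, not your actual settings:
# Start the stream; trigger(once=True) processes the available data and stops,
# which suits a periodically scheduled job
query = (df.writeStream
           .format("delta")                                             # placeholder sink format
           .option("checkpointLocation", "/tmp/checkpoints/my_stream")  # placeholder path
           .trigger(once=True)
           .start("/tmp/output/my_stream"))                             # placeholder path

# Block this cell until the streaming query finishes,
# so the next notebook cell only runs afterwards
query.awaitTermination()
If the query is meant to run continuously instead, awaitTermination() will block forever; in that case awaitTermination(timeout) lets you decide how long the cell should wait.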
I have 20 bash scripts scheduled with cron: some run every minute, some every hour and some every day. Now I need to migrate them to Airflow. As per the Airflow concept, this means I would need to create 20 more files (DAG files).
Does Airflow provide a way to create a generic DAG template which can execute all the bash scripts at the given schedule times with different DAG IDs?
I found a reference - Airflow dynamic DAG and Task Ids
But I am not sure whether that is the right way to do it.
You can create 3 DAGs:
a DAG scheduled every minute
an hourly scheduled DAG
a daily scheduled DAG
Under these DAGs, create tasks to execute the corresponding scripts.
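If you would rather keep a single generic template, as in the reference you linked, you can also generate the DAGs in a loop from a small config. A minimal sketch, assuming Airflow 2.x with BashOperator, and placeholder script names, schedules and a /opt/scripts path:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder mapping: script -> cron schedule
SCRIPTS = {
    "cleanup.sh": "* * * * *",   # every minute
    "report.sh": "@hourly",
    "backup.sh": "@daily",
}

for script, schedule in SCRIPTS.items():
    dag_id = "run_" + script.replace(".sh", "")
    with DAG(
        dag_id=dag_id,
        schedule_interval=schedule,
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_script",
            # trailing space stops Airflow from treating the .sh path as a Jinja template file
            bash_command=f"bash /opt/scripts/{script} ",
        )
    # expose the generated DAG at module level so the scheduler discovers it
    globals()[dag_id] = dag
Each script gets its own DAG id and schedule, but they all come from one file.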
I am trying to understand how jobs, stages, partitions and tasks interact in Spark. So I wrote the following simple script:
import org.apache.spark.sql.Row
case class DataRow(customer: String, ppg_desc: String, yyyymm: String, qty: Integer)
val data = Seq(
DataRow("23","300","201901",45),
DataRow("19","234","201902", 0),
DataRow("23","300","201901", 22),
DataRow("19","171","201901", 330)
)
val df = data.toDF()
val sums = df.groupBy("customer","ppg_desc","yyyymm").sum("qty")
sums.show()
Since I have only one action (the sums.show call), I expected to see one job. Since there is a groupBy involved, I expected this job to have 2 stages. Also, since I have not changed any defaults, I expected to have 200 partitions after the group by and therefore 200 tasks. However, when I ran this in spark-shell, I see 5 jobs being created:
All of these jobs appear to be triggered by the sums.show() call. I am running via spark-shell and lscpu for my docker container shows:
Looking within Job 0, I see the two stages I expect:
But looking in Job 3, I see that the first stage is skipped and the second executed. This, I gather, is because the input is already cached.
What I'm failing to understand is how Spark decides how many jobs to schedule. Is it related to the number of partitions to be processed?
I have a SLURM job script a which internally issues an sbatch call for a second job script b, so job a starts job b.
Now I also have an srun command in job a which depends on the successful execution of b. So I did:
srun -d afterok:$jobid <command>
The issue is that dependencies are seemingly not honoured for job steps, which is what I have here because my srun runs within the allocation of job a (see the --dependency section of https://slurm.schedmd.com/srun.html).
The question: I really need to wait for job b to finish before I issue the job step. How can I do this without resorting to separate jobs?