Apache Spark: Relationship between action and job, Spark UI - apache-spark

To the best of my understanding so far, a job is submitted in Spark whenever an action is called on a Dataset/DataFrame. The job may further be divided into stages and tasks, and I understand how to find the number of stages and tasks. Given below is my small piece of code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
val df = spark.read.json("/Users/vipulrajan/Downloads/demoStuff/data/rows/*.json").select("user_id", "os", "datetime", "response_time_ms")
df.show()
df.groupBy("user_id").count().show
To the best of my understanding, it should have submitted one job for the read on the spark.read.json line, one for the first show and one for the second show. The first two assumptions are correct, but the second show submits 5 jobs and I can't understand why. Below is a screenshot of my UI:
As you can see, job 0 is for reading the JSON, job 1 is for the first show, and 5 jobs are for the second show. Can anyone help me understand what a job is in the Spark UI?

Add something like
df.groupBy("user_id").count().explain()
to see what is actually going on under the hood of your last show().
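For instance, passing true prints the parsed, analyzed, optimized and physical plans rather than just the physical one, which makes it easier to spot where the extra work comes from:

// extended = true prints all plan variants, not just the physical plan
df.groupBy("user_id").count().explain(true)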

Related

Multiple jobs from a single action (Read, Transform, Write)

I'm currently using PySpark on a Databricks interactive cluster (with databricks-connect to submit jobs) and Snowflake as the input/output data source.
My Spark application is supposed to read data from Snowflake, apply some simple SQL transformations (mainly F.when.otherwise, i.e. narrow transformations), then load the result back to Snowflake. (FYI, schemas are passed to the Snowflake reader & writer.)
EDIT: There's also a sort transformation at the end of the process, before writing.
For testing purposes, I named my jobs like this (the Writer and Reader steps are supposed to be named):
sc.setJobDescription("Step Snowflake Reader")
I have trouble understanding what the Spark UI is showing me:
So, I get 3 jobs, all with the same job name (Writer).
I can understand that I have only one Spark action, so I'm supposed to have one job, and that Spark named the jobs with the last value set by sc.setJobDescription (Reader, which triggers the Spark computation).
I also tagged my "ReaderClass":
sc = spark.sparkContext
sc.setJobDescription("Step Snowflake Reader")
Why doesn't it show?
Is the first job something like "Downloading data from Snowflake", the second "Applying the SQL transformations", and the third "Uploading data to Snowflake"?
Why are all my jobs related to the same SQL query?
What is Query 0, which is related to... zero jobs?
Thanks for help.
There are a few things going on here.
First, a job is triggered by an action, and transformations are not really part of it (they're computed during an action, but a single action can cover multiple transformations).
In your case, the reading, transformation and sorting all take place when the action is triggered.
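As a rough sketch of that idea (the column names and paths below are made up, not the asker's actual pipeline):

import org.apache.spark.sql.functions.{col, when}

val df = spark.read.parquet("/tmp/input")                // builds the read plan
val transformed = df.withColumn("status",
  when(col("qty") > 0, "has_stock").otherwise("empty"))  // lazy narrow transformation
val sorted = transformed.sort("customer")                // lazy wide transformation
sorted.write.mode("overwrite").parquet("/tmp/output")    // the action: everything above runs here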
Please note that reading from Snowflake doesn't trigger a job (this is an assumption, as the same behaviour is exhibited by Hive) because Snowflake already has the metadata that Spark would otherwise need to gather by traversing the files.
If you read a parquet file directly, it'll trigger a separate job, and you'll be able to see its job description.
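For illustration, a minimal Scala sketch (the path and fields are hypothetical): supplying an explicit schema to a file source avoids the extra schema-inference job Spark would otherwise run, much like the Snowflake source already carrying its own metadata.

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("response_time_ms", LongType)
))

// with an explicit schema, Spark skips the job it would otherwise
// run to infer the schema by sampling the JSON files
val df = spark.read.schema(schema).json("/tmp/rows/*.json")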
Now comes the part about naming your job:
sc.setJobDescription("Step Snowflake Reader")
This will name the job that was triggered by your write action. That action in turn spawns multiple jobs (which are still part of the last action you're performing); for more details see this post.
Similarly, the last configuration you set before calling an action is the one picked up (the same thing happens when setting shuffle partitions, for instance: you may want a particular step to use more or fewer shuffle partitions, but for one complete action it'll be set to a single value).
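A rough sketch of how that plays out (the paths and step names are made up; sc.setJobDescription is the same call used in the question):

val sc = spark.sparkContext

sc.setJobDescription("Step Snowflake Reader")
val input = spark.read.parquet("/tmp/input")         // hypothetical source; no action triggered yet

sc.setJobDescription("Step Writer")
input.sort("some_column")                            // still lazy
  .write.mode("overwrite").parquet("/tmp/output")    // the action: every job spawned here
                                                     // carries the last description set above

In other words, whichever description is active at the moment the action fires is the one all of its spawned jobs inherit.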
Hope this answers your question.

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?

I have this code snippet that I ran locally in standalone mode using 100 records only:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()  # in a real Glue job this SparkContext is provided for you
glue_context = GlueContext(sc)
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
print(df.count())
The schema contains 89 columns all having string data type except 5 columns that have array of struct data type. The data size is 3.1 MB.
Also, here is some info about the environment used to run the code:
spark.executor.cores: 2
spark.executor.id: driver
spark.driver.memory: 1000M
The problem is I can't figure out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't find out what the "Scan parquet" and "Exchange" tasks mean, as shown in this image:
Stage 1 DAG Visualization
My question is: is there a more systematic way to understand what those tasks mean? As a beginner, I have relied heavily on the Spark UI, but it doesn't give much information about the tasks it has executed. I was able to find which task took the most time, but I have no idea why that was the case or how to systematically resolve it.
The running time of Spark code is made up of the cluster kick-off time, the DAG scheduler optimisation time, and the time spent running stages. In your case, the issue could be due to the following:
The number of parquet files. To test this easily, read the table and write it back as one parquet file. You are calling a table, but behind the scenes it's reading the physical parquet files, so the number of files is something to consider.
The number of Spark partitions relative to your cluster resources. The partition count should match the computing resources you have. For example, in your case you have 2 cores and a small table, so it's more efficient to have just a few partitions instead of the default partition number, which is 200.
To get more clarity on the Spark stages, use the explain function and read the resulting plan. With this function you can see and compare the Analyzed Logical Plan, Optimized Logical Plan, and Physical Plan calculated by the internal optimiser (a short sketch follows below).
For a more detailed description of the explain function, please visit this LINK.
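As a rough Scala sketch of both suggestions (the path and column name are hypothetical; the configuration key is the standard Spark one, not Glue-specific):

// far fewer shuffle partitions than the default 200 for a ~3 MB table
spark.conf.set("spark.sql.shuffle.partitions", "4")

val df = spark.read.parquet("/tmp/small_table")   // hypothetical path

// explain(true) prints the analyzed, optimized and physical plans, including the
// "Scan parquet" and "Exchange" (shuffle) operators seen in the stage DAG
df.groupBy("some_column").count().explain(true)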

Why does this simple spark application create so many jobs?

I am trying to understand how jobs, stages, partitions and tasks interact in Spark. So I wrote the following simple script:
import org.apache.spark.sql.Row
import spark.implicits._  // needed for .toDF() when not running in spark-shell

case class DataRow(customer: String, ppg_desc: String, yyyymm: String, qty: Integer)

val data = Seq(
  DataRow("23", "300", "201901", 45),
  DataRow("19", "234", "201902", 0),
  DataRow("23", "300", "201901", 22),
  DataRow("19", "171", "201901", 330)
)
val df = data.toDF()
val sums = df.groupBy("customer","ppg_desc","yyyymm").sum("qty")
sums.show()
Since I have only one action (the sums.show call), I expected to see one job. Since there is a groupBy involved, I expected that job to have 2 stages. Also, since I have not changed any defaults, I expected to have 200 partitions after the groupBy and therefore 200 tasks. However, when I ran this in spark-shell, I saw 5 jobs being created:
All of these jobs appear to be triggered by the sums.show() call. I am running via spark-shell and lscpu for my docker container shows:
Looking within Job 0, I see the two stages I expect:
But looking in Job 3, I see that the first stage is skipped and the second executed. This, I gather, is because the input is already cached.
What I'm failing to understand is, how does Spark decide how many jobs to schedule? Is it related to the number of partitions to be processed?

What triggers Jobs in Spark?

I'm learning how Spark works inside Databricks. I understand how shuffling causes stages within jobs, but I don't understand what causes jobs. I thought the relationship was one job per action, but sometimes many jobs happen per action.
E.g.
val initialDF = spark
  .read
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")

val someDF = initialDF
  .orderBy($"project")

someDF.show
triggers two jobs, one to peek at the schema and one to do the .show.
And the same code with .groupBy instead
val initialDF = spark
  .read
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")

val someDF = initialDF
  .groupBy($"project").sum()

someDF.show
...triggers nine jobs.
Replacing .show with .count, the .groupBy version triggers two jobs, and the .orderBy version triggers three.
Sorry I can't share the data to make this reproducible, but I was hoping to understand, in the abstract, the rules for when jobs are created. Happy to share the results of .explain if that's helpful.
show without an argument displays the first 20 rows of the result.
When show is triggered on a Dataset, it gets converted to a head(20) action, which in turn gets converted to a limit(20) action.
show -> head -> limit
About limit
Spark executes limit in an incremental fashion until the limit query is satisfied.
In its first attempt, it tries to retrieve the required number of rows from one partition.
If the limit requirement is not satisfied, in the second attempt it tries to retrieve the required number of rows from 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4); after that, 16 partitions are processed, and so on, until either the limit is satisfied or the data is exhausted.
In each of the attempts, a separate job is spawned.
code reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L365
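A small Scala sketch of how this shows up in practice (the input path is hypothetical; spark.sql.limit.scaleUpFactor is the configuration referenced above):

spark.conf.set("spark.sql.limit.scaleUpFactor", "4")   // the default value

val df = spark.read.parquet("/tmp/many_partitions")    // hypothetical multi-partition input

// show() boils down to head(20), i.e. an incremental limit(20):
// attempt 1 scans 1 partition, attempt 2 scans 4, attempt 3 scans 16, ...
// and each attempt appears as a separate job in the Spark UI
df.show()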
Normally it is 1:1 as you state. That is to say, 1 action results in 1 job with 1..N stages and M tasks per stage, where some stages may be skipped.
However, some actions trigger extra jobs 'under the hood'. E.g. pivot: if you pass only the column as a parameter and not the values for the pivot, then Spark has to fetch all the distinct values first in order to generate the columns, performing a collect, i.e. an extra job.
show is also a special case of extra Job(s) being generated.
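For example, a minimal Scala sketch of the pivot case (the DataFrame and column names are made up for illustration):

import spark.implicits._

val sales = Seq(("23", "201901", 45), ("19", "201902", 10))
  .toDF("customer", "month", "qty")

// without explicit pivot values, Spark first runs an extra job (a collect of
// the distinct "month" values) just to determine the output columns
sales.groupBy("customer").pivot("month").sum("qty").show()

// with the values supplied up front, that extra job is avoided
sales.groupBy("customer").pivot("month", Seq("201901", "201902")).sum("qty").show()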

How does Spark decide the number of jobs?

As I understand it, Spark decides the number of jobs based on each action performed. I have 6 actions in my Spark application, which are further divided into stages, but I see more than 6 jobs being spawned.
Is my understanding correct, or am I missing something?
Thanks
