Why does this simple spark application create so many jobs? - apache-spark

I am trying to understand how jobs, stages, partitions and tasks interact in Spark. So I wrote the following simple script:
import org.apache.spark.sql.Row
case class DataRow(customer: String, ppg_desc: String, yyyymm: String, qty: Integer)
val data = Seq(
DataRow("23","300","201901",45),
DataRow("19","234","201902", 0),
DataRow("23","300","201901", 22),
DataRow("19","171","201901", 330)
)
val df = data.toDF()
val sums = df.groupBy("customer","ppg_desc","yyyymm").sum("qty")
sums.show()
Since I have only one action (the sums.show call), I expected to see one job. Since there is a groupBy involved, I expected this job to have 2 stages. Also, since I have not changed any defaults, I expected to have 200 partitions after the group by and therefore 200 tasks. However, when I ran this in spark-shell, I see 5 jobs being created:
All of these jobs appear to be triggered by the sums.show() call. I am running via spark-shell and lscpu for my docker container shows:
Looking within Job 0, I see the two stages I expect:
But looking in Job 3, I see that the first stage is skipped and the second executed. This, I gather, is because the input is already cached.
What I'm failing to understand is, how does Spark decide how many jobs to schedule? Is it related to the number of partitions to be processed?

Related

What triggers Jobs in Spark?

I'm learning how Spark works inside Databricks. I understand how shuffling causes stages within jobs, but I don't understand what causes jobs. I thought the relationship was one job per action, but sometimes many jobs happen per action.
E.g.
val initialDF = spark
.read
.parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")
val someDF = initialDF
.orderBy($"project")
someDF.show
triggers two jobs, one to peek at the schema and one to do the .show.
And the same code with .groupBy instead
val initialDF = spark
.read
.parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")
val someDF = initialDF
.groupBy($"project").sum()
someDF.show
...triggers nine jobs.
Replacing .show with .count, the .groupBy version triggers two jobs, and the .orderBy version triggers three.
Sorry I can't share the data to make this reproducible, but was hoping to understand the rules of when jobs are created in abstract. Happy to share the results of .explain if that's helpful.
show without an argument shows the first 20 rows as a result.
When show is called on a Dataset, it is converted to a head(20) action, which in turn is converted to a limit(20) action.
show -> head -> limit
About limit
Spark executes limit in an incremental fashion until the limit query is satisfied.
In its first attempt, it tries to retrieve the required number of rows from one partition.
If the limit is not satisfied, the second attempt tries to retrieve the required number of rows from 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4). After that, 16 partitions are processed, and so on, until either the limit is satisfied or the data is exhausted.
In each of the attempts, a separate job is spawned.
code reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L365
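To make the incremental behaviour concrete, here is a minimal pure-Python model of it (not Spark code). The function name simulate_take and the input lists are made up for illustration. It follows the 1, 4, 16, ... progression described above; the real implementation behind the link sizes later batches from the partitions already scanned and the rows found so far, so the exact task counts per job can differ.

```python
def simulate_take(partition_row_counts, limit=20, scale_up_factor=4):
    """Return how many partitions each job of an incremental limit scans.

    partition_row_counts[i] is the number of rows in partition i
    (a hypothetical input; Spark does not know these counts up front).
    """
    tasks_per_job = []
    collected = 0   # rows gathered so far
    scanned = 0     # partitions already processed
    batch = 1       # the first attempt reads a single partition
    total = len(partition_row_counts)
    while collected < limit and scanned < total:
        end = min(scanned + batch, total)
        for rows in partition_row_counts[scanned:end]:
            collected += min(rows, limit - collected)
        tasks_per_job.append(end - scanned)  # one job, one task per partition
        scanned = end
        batch *= scale_up_factor
    return tasks_per_job


# 200 shuffle partitions holding only 3 result rows in total:
sparse = [0] * 200
sparse[10] = sparse[90] = sparse[150] = 1
print(simulate_take(sparse))

# The first partition alone satisfies the limit:
print(simulate_take([50] + [0] * 199))
```

In this model, the sparse case never reaches 20 rows, so every partition ends up being scanned across five jobs, which is consistent with the 5 jobs seen for a groupBy over a handful of rows with 200 shuffle partitions. When the first partition already holds enough rows, a single job is enough.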
Normally it is 1:1 as you state. That is to say, 1 Action results in 1 Job with 1..N Stages with M Tasks per Stage, and Stages which may be skipped.
However, some Actions trigger extra Jobs under the hood. E.g. pivot: if you pass only the pivot column as a parameter and not the values to pivot on, Spark first has to fetch all the distinct values in order to generate the columns. That means performing a collect, i.e. an extra Job.
show is also a special case of extra Job(s) being generated.

Apache Spark: Relationship between action and job, Spark UI

To the best of my understanding so far, in Spark a job is submitted whenever an action is called on a Dataset/DataFrame. The job may be further divided into stages and tasks, and I understand how to work out the number of stages and tasks. Below is my small piece of code:
val spark = SparkSession.builder().master("local").getOrCreate()
val df = spark.read.json("/Users/vipulrajan/Downloads/demoStuff/data/rows/*.json").select("user_id", "os", "datetime", "response_time_ms")
df.show()
df.groupBy("user_id").count().show
To the best of my understanding it should have submitted one job at line 4 when I read, one on the first show and one on the second show. The first two assumptions are correct, but the second show submits 5 jobs. I can't understand why. Below is a screenshot of my UI:
As you can see, there is job 0 for reading the JSON, job 1 for the first show and 5 jobs for the second show. Can anyone help me understand what these jobs in the Spark UI are?
Add something like
df.groupBy("user_id").count().explain()
to see what is actually under the hood of your last show().

SparkSQL Number of Tasks

I have a Spark Standalone Cluster (which consists of two Workers with 2 cores each). I run a SQL query which joins 2 DataFrames and shows the result. I have some questions regarding the simple example below.
val df1 = sc.read.text(fn1).toDF()
val df2 = sc.read.text(fn2).toDF()
df1.createOrReplaceTempView("v1")
df2.createOrReplaceTempView("v2")
val df_join = sc.sql("SELECT * FROM v1,v2 WHERE v1.value=v2.value AND v2.value<1500").show()
DAG Scheduler - Number of Tasks
From what I've understood so far, when I spark-submit the application, a SparkContext is spawned to handle the Job (where the job is the printing of the result rows). The SparkContext creates a TaskScheduler instance and a DAGScheduler. Through a simple event mechanism, the DAGScheduler handles the job for execution (the handleJobSubmitted function in the code). The SparkSQL query has been transformed into a physical execution plan (through the Catalyst Optimizer), and then into an RDD graph (with the toRdd function). The DAGScheduler receives the RDD graph and recursively creates all the stages.
I do not understand how it finds the number of tasks in the last stage before any stage has executed, keeping in mind that the result stage is the one that performs the join (and prints the results). The amount of data (and the RDDs and their number of partitions, which define the number of tasks) is unknown until the parent stages have finished executing.
Parallel Execution of Stages
Each of the first two stages is independent of the other, as each loads data from a different file. I have read many posts saying that stages with no dependencies between them MAY be executed in parallel by the cluster. Under what condition are the tasks of independent stages executed in parallel?
Task Dependencies
Finally, I've read that the TaskScheduler does not know about stage dependencies. If I keep in mind that each stage in Spark is a TaskSet (i.e. a set of non-dependent tasks, each with the same functionality applied to a different partition of the data), then the TaskScheduler also does not know the dependencies between tasks of different stages. As a result, how and when does a task know which data it will execute its function on?
If, for example, the task knows a priori where to look for its input data, then it could be launched as soon as that data becomes available.

When could one Spark application create multiple jobs and stages?

I use Databricks Community Edition.
My Spark program creates multiple jobs. Why? I thought there should be one job and it could have multiple stages.
My understanding is that when a Spark program is submitted, it creates one job with multiple stages (usually a new stage per shuffle operation).
Below is the code being used, where I have 2 possible shuffle operations (reduceByKey / sortByKey) and one action (take(5)).
rdd1 = sc.textFile('/databricks-datasets/flights')
rdd2 = (rdd1.flatMap(lambda x: x.split(","))
            .map(lambda x: (x, 1))
            .reduceByKey(lambda x, y: x + y, 8)
            .sortByKey(ascending=False)
            .take(5))
One more observation: each job seems to have new stages (some of them are skipped). What is causing the new jobs to be created?
Generally there will be a job for each action, but sortByKey is really weird: it is technically a transformation (so it should be lazily evaluated), but its implementation requires an eager action to be performed. For that reason you're seeing a job for the sortByKey plus a job for the take.
That accounts for you seeing 2 of the jobs - I can't see where the third is coming from.
(The skipped stages are where the results of a shuffle are automatically cached - this is an optimization that has been present since around Spark 1.3).
Further information on the sortByKey internals - Why does sortBy transformation trigger a Spark job?
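One way to see why a sort needs an eager pass over the data: range partitioning needs split points between partitions, and those can only be chosen by looking at (a sample of) the keys. The following is a minimal pure-Python sketch of that idea, not Spark's implementation; range_bounds and partition_for are hypothetical names, and the evenly spaced split points are a simplifying assumption.

```python
import bisect


def range_bounds(sample_keys, num_partitions):
    """Pick num_partitions - 1 split points from a sample of keys.

    This step has to read data, which is why it cannot be fully lazy.
    """
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, bounds):
    """Return the index of the range partition that a key falls into."""
    return bisect.bisect_right(bounds, key)


bounds = range_bounds(range(100), 4)
print(bounds)
print(partition_for(10, bounds), partition_for(99, bounds))
```

Only once the bounds are known can every record be routed to the partition that holds its key range, so the sampling runs as its own job before the job for the take.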

How are stages split into tasks in Spark?

Let's assume for the following that only one Spark job is running at every point in time.
What I get so far
Here is what I understand what happens in Spark:
When a SparkContext is created, each worker node starts an executor.
Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting the driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How do I split the stage into those tasks?
Specifically:
Are the tasks determined by the transformations and actions, or can multiple transformations/actions be in one task?
Are the tasks determined by the partitions (e.g. one task per stage per partition)?
Are the tasks determined by the nodes (e.g. one task per stage per node)?
What I think (only partial answer, even if right)
In https://0x0fff.com/spark-architecture-shuffle, the shuffle is explained with the image
and I get the impression that the rule is
each stage is split into #number-of-partitions tasks, with no regard for the number of nodes
For my first image I'd say that I'd have 3 map tasks and 3 reduce tasks.
For the image from 0x0fff, I'd say there are 8 map tasks and 3 reduce tasks (assuming that there are only three orange and three dark green files).
Open questions in any case
Is that correct? Even if it is, my questions above are not all answered, because it is still open whether multiple operations (e.g. multiple maps) run within one task or are separated into one task per operation.
What others say
"What is a task in Spark? How does the Spark worker execute the jar file?" and "How does the Apache Spark scheduler split files into tasks?" are similar, but I did not feel that my question was answered clearly there.
You have a pretty nice outline here. To answer your questions
A separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside on distinct physical locations - e.g. blocks in HDFS or directories/volumes for a local file system.
Note that the submission of Stages is driven by the DAG Scheduler. This means that stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched.
We can see that in action in the following toy example in which we do the following types of operations:
load two datasources
perform some map operation on both of the data sources separately
join them
perform some map and filter operations on the result
save the result
So then how many stages will we end up with?
1 stage each for loading the two datasources in parallel = 2 stages
A third stage representing the join that is dependent on the other two stages
Note: all of the follow-on operations working on the joined data may be performed in the same stage because they must happen sequentially. There is no benefit to launching additional stages because they cannot start work until the prior operation has completed.
Here is that toy program
val sfi = sc.textFile("/data/blah/input").map{ x => val xi = x.toInt; (xi,xi*xi) }
val sp = sc.parallelize{ (0 until 1000).map{ x => (x,x * x+1) }}
val spj = sfi.join(sp)
val sm = spj.mapPartitions{ iter => iter.map{ case (k,(v1,v2)) => (k, v1+v2) }}
val sf = sm.filter{ case (k,v) => v % 10 == 0 }
sf.saveAsTextFile("/data/blah/out")
And here is the DAG of the result
Now: how many tasks? The number of tasks should be equal to
the sum, over all stages, of (#partitions in the stage)
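As a small worked example of that sum, here is a pure-Python sketch. The partition counts below are made-up assumptions for the three stages of the toy program; the real numbers depend on the input splits and the default parallelism of the cluster.

```python
def total_tasks(partitions_per_stage):
    """Total tasks of a job = sum of the partition counts of its stages."""
    return sum(partitions_per_stage.values())


# Hypothetical partition counts, for illustration only.
stages = {
    "load + map (sfi)": 4,
    "parallelize + map (sp)": 2,
    "join + mapPartitions + filter + save": 4,
}
print(total_tasks(stages))
```

With these assumed counts the job would run 4 + 2 + 4 = 10 tasks in total.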
This might help you better understand the different pieces:
Stage: a collection of tasks. The same process running against different subsets of data (partitions).
Task: represents a unit of work on a partition of a distributed dataset. So in each stage, number-of-tasks = number-of-partitions, or as you said, "one task per stage per partition".
Each executor runs in one YARN container, and each container resides on one node.
Each stage utilizes multiple executors; each executor is allocated multiple vcores.
Each vcore can execute exactly one task at a time.
So at any stage, multiple tasks could be executed in parallel: number-of-tasks running = number-of-vcores being used.
If I understand correctly there are 2 ( related ) things that confuse you:
1) What determines the content of a task?
2) What determines the number of tasks to be executed?
Spark's engine "glues" together simple operations on consecutive rdds, for example:
rdd1 = sc.textFile( ... )
rdd2 = rdd1.filter( ... )
rdd3 = rdd2.map( ... )
rdd3RowCount = rdd3.count
So when rdd3 is (lazily) computed, Spark will generate one task per partition of rdd1, and each task will execute both the filter and the map on each line to produce rdd3.
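The "gluing" can be pictured with plain Python generators. This is only a model of the idea, not Spark code; run_stage and its arguments are hypothetical names. Each loop iteration stands for one task, and inside a task every record flows through the filter and then the map without an intermediate collection being materialized.

```python
def run_stage(partitions, keep, transform):
    """Model one stage: one task per partition, filter and map pipelined."""
    tasks_run = 0
    row_count = 0
    for partition in partitions:      # each iteration models one task
        tasks_run += 1
        # filter and map are fused: each record passes through both in turn
        pipeline = (transform(x) for x in partition if keep(x))
        row_count += sum(1 for _ in pipeline)   # the "count" action
    return tasks_run, row_count


partitions = [[1, 2, 3], [4, 5], [6]]
print(run_stage(partitions, keep=lambda x: x % 2 == 0, transform=lambda x: x * 10))
```

Three partitions give three tasks, regardless of how many narrow operations are chained inside each task.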
The number of tasks is determined by the number of partitions. Every RDD has a defined number of partitions. For a source RDD that is read from HDFS ( using sc.textFile( ... ) for example ) the number of partitions is the number of splits generated by the input format. Some operations on RDD(s) can result in an RDD with a different number of partitions:
rdd2 = rdd1.repartition( 1000 ) will result in rdd2 having 1000 partitions ( regardless of how many partitions rdd1 had ).
Another example is joins:
rdd3 = rdd1.join( rdd2 , numPartitions = 1000 ) will result in rdd3 having 1000 partitions ( regardless of partitions number of rdd1 and rdd2 ).
( Most ) operations that change the number of partitions involve a shuffle. When we do, for example:
rdd2 = rdd1.repartition( 1000 )
what actually happens is that the task on each partition of rdd1 needs to produce an end-output that can be read by the following stage, so that rdd2 ends up with exactly 1000 partitions ( how this is done depends on the shuffle implementation: hash or sort ). Tasks on this side are sometimes referred to as "Map ( side ) tasks".
A task that will later run on rdd2 will act on one partition ( of rdd2! ) and would have to figure out how to read/combine the map-side outputs relevant to that partition. Tasks on this side are sometimes referred to as "Reduce ( side ) tasks".
The 2 questions are related: the number of tasks in a stage is the number of partitions ( common to the consecutive rdds "glued" together ) and the number of partitions of an rdd can change between stages ( by specifying the number of partitions to some shuffle causing operation for example ).
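The map-side / reduce-side split can be sketched in pure Python as follows. This is a simplified hash-shuffle model under stated assumptions, not Spark's shuffle code: the name shuffle is hypothetical, keys are small integers so that hash(key) is stable, and everything lives in memory.

```python
def shuffle(input_partitions, num_output_partitions):
    """Model a hash shuffle from N input partitions to M output partitions."""
    # Map side: one task per input partition buckets its records by key.
    map_outputs = []
    for partition in input_partitions:
        buckets = [[] for _ in range(num_output_partitions)]
        for key, value in partition:
            buckets[hash(key) % num_output_partitions].append((key, value))
        map_outputs.append(buckets)
    # Reduce side: task i reads bucket i from every map output.
    return [
        [record for buckets in map_outputs for record in buckets[i]]
        for i in range(num_output_partitions)
    ]


inputs = [[(0, "a"), (1, "b"), (2, "c")],
          [(3, "d"), (4, "e"), (5, "f")],
          [(6, "g"), (7, "h"), (8, "i")]]
outputs = shuffle(inputs, 2)
print(len(inputs), "map tasks,", len(outputs), "reduce tasks")
```

The stage before the shuffle has as many tasks as there are input partitions (3 here), and the stage after it has as many tasks as the requested number of output partitions (2 here).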
Once the execution of a stage commences, its tasks can occupy task slots. The number of concurrent task-slots is numExecutors * ExecutorCores. In general, these can be occupied by tasks from different, non-dependent stages.
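A back-of-the-envelope helper for the slot arithmetic above, as a minimal Python sketch. The name task_waves is hypothetical, and the notion of "waves" assumes each task needs one core and that all tasks take about the same time, which real workloads rarely do.

```python
import math


def task_waves(num_tasks, num_executors, executor_cores):
    """Return (concurrent task slots, rounds needed to run num_tasks)."""
    slots = num_executors * executor_cores
    return slots, math.ceil(num_tasks / slots)


# e.g. 200 tasks on 2 executors with 2 cores each
print(task_waves(200, 2, 2))
```

So a stage with 200 partitions on a cluster with 4 task slots still creates 200 tasks; they simply run at most 4 at a time.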
