Number of Tasks in Spark UI - apache-spark

I am new to Spark. I have a couple of questions regarding the Spark Web UI:
I have seen that Spark can create multiple Jobs for the same
application. On what basis does it create the Jobs?
I understand Spark creates multiple Stages for a single Job around
shuffle boundaries, and that there is 1 task per partition. However,
I have seen a particular Stage (e.g. Stage 1) of a Job create fewer
tasks than the default shuffle partitions value (e.g. only 2/2
completed), and I have also seen the next Stage (Stage 2) of the same
Job create 1500 tasks (e.g. 1500/1500 completed), which is more than
the default shuffle partitions value.
So, how does Spark determine how many tasks it should
create for any particular Stage?
Can anyone please help me understand the above.

The maximum number of tasks running at any one moment depends on your number of cores and executors;
different stages have different task counts.
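
To make that concrete, here is a minimal sketch (Scala), not a definitive answer: the file path is hypothetical, and adaptive query execution is disabled so the shuffle partition count is not coalesced at runtime. The scan stage gets one task per input partition, while a post-shuffle stage gets spark.sql.shuffle.partitions tasks, which is one way to see both 2 and 1500 tasks inside the same job.

import org.apache.spark.sql.SparkSession

// Sketch: a stage's task count follows the partition count of the data it computes.
// Assumes local mode, a hypothetical "input.txt", and AQE disabled.
val spark = SparkSession.builder()
  .appName("task-count-sketch")
  .master("local[4]")
  .config("spark.sql.adaptive.enabled", "false")
  .getOrCreate()

// Scan stage: one task per input partition. A small file may produce only
// 2 partitions, hence "2/2 completed" even though the shuffle default is 200.
val df = spark.read.text("input.txt")
println(s"scan partitions = ${df.rdd.getNumPartitions}")

// Post-shuffle stage: task count = spark.sql.shuffle.partitions, so setting
// it (or explicitly repartitioning) to 1500 yields a 1500-task stage in the UI.
spark.conf.set("spark.sql.shuffle.partitions", "1500")
df.groupBy("value").count().collect()   // action -> one job; its reduce stage has 1500 tasks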

Related

Spark execution - Relationship between spark execution job and spark action

I have one question regarding Spark execution.
We all know that each Spark application (or the driver program) may contain one or many actions.
My question is which one is correct: does a collection of jobs correspond to one action, or does each job correspond to one action? Here "job" means the one that can be seen in the Spark execution UI.
I think the latter is true (each job corresponds to one action). Please validate.
Thanks.
Your understanding is correct.
Each action in Spark corresponds to a Spark Job, and these actions are called by the driver program in the application.
An action can involve many transformations on the dataset (or RDD), which creates the stages in the job.
A stage can be thought of as a set of calculations (tasks) that can each be computed on an executor without communication with other executors or with the driver.
In other words, a new stage begins whenever network travel between workers is required, for example in a shuffle. The dependencies that create stage boundaries are called ShuffleDependencies.
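
A small RDD sketch of that mapping (the input file name is hypothetical, and an existing SparkContext `sc` is assumed): each action below shows up as its own Job in the UI, and the ShuffleDependency introduced by reduceByKey splits the job into two stages.

// Sketch, assuming an existing SparkContext `sc` and a hypothetical input file.
val pairs = sc.textFile("words.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

val counts = pairs.reduceByKey(_ + _)   // ShuffleDependency -> stage boundary

counts.count()     // action #1 -> Job 0 with two stages (shuffle map stage + result stage)
counts.collect()   // action #2 -> Job 1; its map stage may show as "skipped" because the shuffle output is reused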

In Apache Spark, do Tasks in the same Stage work simultaneously or not?

Do tasks in the same stage work simultaneously? If so, what do the lines between partitions in a stage refer to? (example of a DAG)
Here is a good link for your reading that explains the DAG in detail, and a few other things that may be of interest: the Databricks blog on DAGs.
I can try to explain. As each stage is created, it has a set of tasks that are divided up. When an action is encountered, the driver sends the tasks to the executors. Based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. So the arrows that you are seeing are the execution plan, in the sense that it cannot perform the map function prior to reading the file. Each node that has some data will execute those tasks in the order given by the DAG.
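
As a rough sketch of that point (the numbers are arbitrary): the partition count fixes how many tasks the stage has, while the cores available fix how many of those tasks actually run at the same time.

// Sketch: local[2] gives 2 task slots; 8 partitions give the stage 8 tasks,
// of which at most 2 run simultaneously while the rest wait in the queue.
val sc = new org.apache.spark.SparkContext("local[2]", "parallel-tasks-sketch")
val rdd = sc.parallelize(1 to 1000, numSlices = 8)   // 8 partitions -> 8 tasks
rdd.map(_ * 2).count()                               // one stage: 8 tasks, 2 at a time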

What is a stage in Apache Spark?

So I understand that a stage is a set of tasks that work on the same node.
So why do I get two stages when I work in local mode?
A stage is a set of parallel tasks, one task per partition.
The number of stages is determined by the number of shuffle/wide transformations.
So, coming back to your case: if you have a shuffle operation, then it will result in two stages.
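
For illustration, a minimal local-mode sketch (assuming an existing SparkContext `sc`): even with every task on one machine, a wide transformation introduces a shuffle, so the single collect() job is split into two stages.

// distinct() is a wide transformation (it shuffles by key internally),
// so this one collect() job shows two stages in the UI even in local mode.
val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), numSlices = 3)
data.distinct().collect()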

SparkSQL Number of Tasks

I have a Spark Standalone cluster (which consists of two Workers with 2 cores each). I run a SQL query that joins 2 DataFrames and shows the result. I have some questions regarding this simple example.
val df1 = spark.read.text(fn1).toDF()
val df2 = spark.read.text(fn2).toDF()
df1.createOrReplaceTempView("v1")
df2.createOrReplaceTempView("v2")
val df_join = spark.sql("SELECT * FROM v1,v2 WHERE v1.value=v2.value AND v2.value<1500")
df_join.show()
DAG Scheduler - Number of Tasks
From what I've understood so far, when I spark-submit the application, a SparkContext is spawned for handling the Job (where the job is the printing of the result rows). The SparkContext creates a TaskScheduler instance, which then creates a DAGScheduler. Through a simple event mechanism, the DAGScheduler handles the job for execution (the handleJobSubmitted function in the code). The SparkSQL query has been transformed into a physical execution plan (through the Catalyst Optimizer), and then into an RDD graph (with the toRdd function). The DAGScheduler receives the RDD graph and recursively creates all the stages.
I do not understand how it finds the number of tasks in the last stage (before the execution of any stage), keeping in mind that the result stage is the one that performs the join (and prints the results). The amount of data (and the RDDs and the number of their partitions, which define the number of tasks) is unknown until the parent stages have finished their execution.
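
One way to see why that task count can be known up front (a sketch under assumptions, not a definitive answer): if the planner picks a shuffle-based join rather than a broadcast join, the join stage reads shuffled data, so its partition count is the configured spark.sql.shuffle.partitions, which is available before any parent stage runs.

// Sketch, reusing the temp views v1/v2 registered above and assuming a
// shuffle-based join with AQE not coalescing partitions at runtime.
println(spark.conf.get("spark.sql.shuffle.partitions"))   // "200" by default

val joined = spark.sql(
  "SELECT * FROM v1, v2 WHERE v1.value = v2.value AND v2.value < 1500")
joined.explain()                       // the Exchange operators mark the shuffles
println(joined.rdd.getNumPartitions)   // 200 here, independent of the input data size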
Parallel Execution of Stages
Each of the first two stages is independent of the other, as they load data from different files. I have read many posts saying that stages that do not have dependencies between them MAY be executed in parallel by the cluster. What is the condition under which the tasks of independent stages are executed in parallel?
Task Dependencies
Finally, I've read that the TaskScheduler does not know about stage dependencies. If I keep in mind that each Stage in Spark is a TaskSet (i.e. a set of non-dependent tasks, each task with the same functionality packaged with a different partition of data), then the TaskScheduler also does not know the dependencies between tasks of different stages. As a result, how and when does a task know the data on which it will execute its function?
If, for example, a task knows a priori where to look for its input data, then it could be launched as soon as that data becomes available.

Distribution of Spark code into jobs, stages and tasks [duplicate]

This question already has answers here:
What is the concept of application, job, stage and task in spark?
As per my understanding, each action in the whole application is translated into a job, each shuffle boundary within a job is translated into a stage, and each partition of a stage's input is translated into a task.
Please correct me if I am wrong; I have been unable to find any actual definition.
Invoking an action inside a Spark application triggers the launch of a Spark job to fulfill it. Spark examines the DAG and formulates an execution plan. The execution plan consists of assembling the job's transformations into stages.
When Spark optimises the code internally, it splits it into stages, where
each stage consists of many little tasks. Each stage contains a sequence of transformations that can be completed without shuffling the full data.
Every task for a given stage is a single-threaded atom of computation consisting of exactly the same
code, just applied to a different set of data. The number of tasks is determined by the number of partitions.
To manage the job flow and schedule tasks, Spark relies on an active driver process.
The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache.
A single executor has a number of slots for running tasks and will run many concurrently throughout its lifetime.
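
As a rough sketch of those slots (the numbers are hypothetical, and spark.executor.instances applies when running under a resource manager such as YARN or Kubernetes): each executor can run up to spark.executor.cores tasks at once, so the cluster-wide ceiling on concurrently running tasks is roughly instances times cores.

// Sketch (hypothetical numbers): with 4 executors of 2 cores each, the
// cluster has about 4 * 2 = 8 task slots, so at most ~8 tasks of a stage
// run at the same time; the remaining tasks wait for a free slot.
val executors = 4          // e.g. --num-executors / spark.executor.instances
val coresPerExecutor = 2   // e.g. --executor-cores / spark.executor.cores
println(s"approximate concurrent task slots = ${executors * coresPerExecutor}")  // 8

// On a live cluster, sc.defaultParallelism (for an existing SparkContext `sc`)
// usually reflects this total core count.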
