Spark execution: relationship between a Spark job and a Spark action

I have a question regarding Spark execution.
We all know that each Spark application (or driver program) may contain one or many actions.
My question is which of the following is correct: does a collection of jobs correspond to one action, or does each job correspond to one action? Here "job" means the one that can be seen in the Spark UI.
I think the latter is true (each job corresponds to one action). Please validate.
Thanks.

Your understanding is correct.
Each action in Spark corresponds to one Spark job, and these actions are called by the driver program of the application.
An action can involve many transformations on the dataset (or RDD), and it is these transformations that create the stages within the job.
A stage can be thought of as the set of calculations(tasks) that can each be computed on an executor without communication with other executors or with the driver.
In other words, a new stage begins whenever network travel between workers is required; for example in a shuffle. These dependencies that create stage boundaries are called ShuffleDependencies.
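By way of illustration, here is a minimal word-count sketch (the class name, sample data and local master are assumptions made up for the example): it calls a single action, so the Spark UI would show one job, and the reduceByKey shuffle splits that job into two stages.

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    import scala.Tuple2;

    public class OneActionOneJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("one-action-one-job")
                    .master("local[4]")                 // 4 local cores, just for the sketch
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            JavaRDD<String> lines = jsc.parallelize(
                    Arrays.asList("a b a", "b c", "a c c"), 3);   // 3 partitions

            // Transformations only: nothing is executed yet and no job appears in the UI.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);         // ShuffleDependency -> stage boundary

            // One action -> one job in the Spark UI, made of two stages:
            // the map side (flatMap/mapToPair) and the reduce side of the shuffle.
            System.out.println(counts.collect());

            spark.stop();
        }
    }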

Related

Number of Tasks in Spark UI

I am new to Spark. I have a couple of questions regarding the Spark web UI:
1. I have seen that Spark can create multiple jobs for the same application. On what basis does it create the jobs?
2. I understand that Spark creates multiple stages for a single job around shuffle boundaries, and that there is one task per partition. However, I have seen a particular stage (e.g. stage 1) of a particular job create fewer tasks than the default shuffle-partitions value (e.g. only 2/2 completed), and I have also seen the next stage (stage 2) of the same job create 1500 tasks (e.g. 1500/1500 completed), which is more than the default shuffle-partitions value. So how does Spark determine how many tasks it should create for any particular stage?
Can anyone please help me understand the above?
The maximum number of tasks running at any one moment depends on your number of cores and executors.
Different stages have different task counts, because each stage gets one task per partition of its input.
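As a hedged sketch of why the reduce-side stage can show 1500 tasks (the class name, data and the 1500 value are made up; spark.sql.shuffle.partitions and spark.sql.adaptive.enabled are real settings): the task count of a stage that reads input follows the number of input partitions, while the task count of a shuffle (reduce-side) stage of a DataFrame job follows spark.sql.shuffle.partitions.

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TaskCountSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("task-count-sketch")
                    .master("local[4]")
                    // Disable AQE so the shuffle stage keeps exactly this many partitions.
                    .config("spark.sql.adaptive.enabled", "false")
                    .config("spark.sql.shuffle.partitions", "1500")   // made-up value
                    .getOrCreate();

            // spark.range produces a Dataset with only a few input partitions,
            // so the first (map-side) stage of the job has only a few tasks.
            Dataset<Row> grouped = spark.range(0, 1_000_000)
                    .groupBy(col("id").mod(10).alias("bucket"))
                    .count();

            // The action triggers a job; its reduce-side stage runs 1500 tasks,
            // one per shuffle partition, regardless of how few cores run them at a time.
            grouped.collectAsList();

            spark.stop();
        }
    }

Note that with adaptive query execution left enabled (the default in recent Spark versions), Spark may coalesce those shuffle partitions into fewer tasks, which is another reason the numbers in the UI can differ from the configured value.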

In Apache Spark, do Tasks in the same Stage work simultaneously or not?

Do tasks in the same stage work simultaneously? If so, what does the line between partitions in a stage refer to? (example of a DAG)
Here is a good link for your reading that explains the DAG in detail, along with a few other things that may be of interest: the Databricks blog on DAGs.
I can try to explain. As each stage is created, it has a set of tasks that are divided among the executors. When an action is encountered, the driver sends the tasks to the executors, and based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. So yes, tasks in the same stage do run simultaneously, up to the number of available executor cores. The arrows you are seeing represent the execution plan: for instance, the map function cannot run before the file is read. Each node that holds some of the data executes its tasks in the order given by the DAG.
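To make that concrete, here is a small sketch (the class name, data and the local[4] master are assumptions): the job below has a single stage with 8 tasks, one per partition, and at most 4 of them run at the same moment because the local master only offers 4 cores.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.TaskContext;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    public class ParallelTasksSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("parallel-tasks")
                    .master("local[4]")   // 4 task slots
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 80; i++) data.add(i);

            // 8 partitions -> 8 tasks in the single stage of this job.
            // foreachPartition is the action that triggers the job.
            jsc.parallelize(data, 8).foreachPartition(it -> {
                int partition = TaskContext.get().partitionId();
                // Each task reports which partition it handles and which thread runs it;
                // with 4 cores you should see roughly 4 partitions in flight at once.
                System.out.println("partition " + partition
                        + " on " + Thread.currentThread().getName());
                Thread.sleep(1000);   // keep the task alive long enough to observe the overlap
            });

            spark.stop();
        }
    }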

Distribution of spark code into jobs, stages and tasks [duplicate]

As per my understanding, each action in the application is translated into a job, each shuffle boundary within a job is translated into a stage, and each partition of a stage's input is translated into a task.
Please correct me if I am wrong; I have been unable to find a precise definition.
Invoking an action inside a Spark application triggers the launch of a Spark job to fulfill it. Spark examines the DAG and formulates an execution plan. The execution plan consists of assembling the job's transformations into stages.
When Spark optimises code internally, it splits it into stages, where each stage consists of many little tasks. Each stage contains a sequence of transformations that can be completed without shuffling the full data.
Every task for a given stage is a single-threaded atom of computation consisting of exactly the same code, just applied to a different set of data. The number of tasks is determined by the number of partitions.
To manage the job flow and schedule tasks, Spark relies on an active driver process.
The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache.
A single executor has a number of slots for running tasks and will run many concurrently throughout its lifetime.
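To tie the terms together, here is a hypothetical sketch (the class name, data and partition counts are made up): the application calls two actions, so the Spark UI would show two jobs; each job is cut into stages at the reduceByKey shuffle, and each stage runs one task per partition of its input.

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    import scala.Tuple2;

    public class JobsStagesTasks {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("jobs-stages-tasks")   // one application
                    .master("local[2]")
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            JavaPairRDD<String, Integer> counts = jsc
                    .parallelize(Arrays.asList("a b", "b c", "c a", "a a"), 4)  // 4 partitions
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum);     // shuffle -> stage boundary

            // Action #1 -> job 0: its map-side stage has 4 tasks (one per input partition),
            // and its reduce-side stage has one task per partition of the shuffled RDD.
            long distinctWords = counts.count();

            // Action #2 -> job 1 (stages whose shuffle output is still available
            // from job 0 show up as skipped in the UI).
            System.out.println(distinctWords + " words: " + counts.collect());

            spark.stop();
        }
    }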

Schedule each Apache Spark Stage to run on a specific Worker Node

Suppose, I am running a simple Wordcount application on Spark (actually Spark Streaming) with 2 worker nodes. By default each task (from any stage) is scheduled to any available resource based on a scheduling algorithm. However, I want to change the default scheduling to fix each stage to a specific worker node.
Here is what I am trying to achieve -
Worker Node 'A' should only process the first Stage (like 'map' stage). So all the data that comes in must first go to worker 'A'
and Worker Node 'B' should only process the second stage (like 'reduce' stage). Effectively, the results of Worker A are processed by Worker B.
My first question is - Is this sort of customisation possible on Spark or Spark Streaming by tuning the parameters or choosing a correct config option? (I don't think it is, but can someone confirm this?)
My second question is - Can I achieve this by making some change to the Spark scheduler code? I am OK with hardcoding the IPs of the workers if necessary. Any hints or pointers on this specific problem, or on understanding the Spark scheduler code in more detail, would be helpful.
I understand that this change defeats the efficiency goals of Spark to some extent but I am only looking to experiment with different setups for a project.
Thanks!

How to know which piece of code runs on driver or executor?

I am new to Spark. How do I know which piece of code will run on the driver and which will run on the executors?
Do we always have to try to code such that everything runs on the executors? Are there any recommendations/ways to make most of the code run on executors?
Update: As far as I understand, transformations run on executors and actions run on the driver because they need to return a value. So is it fine if an action runs on the driver, or should it also run on an executor? Where does the driver actually run? On the cluster?
Any Spark application consists of a single Driver process and one or more Executor processes. The Driver process runs either on the machine that submitted the application (client deploy mode) or on one of the cluster's nodes (cluster deploy mode), while the Executor processes run on the worker nodes. You can increase or decrease the number of Executor processes dynamically depending on your usage, but the Driver process exists throughout the lifetime of your application.
The Driver process is responsible for a lot of things including directing the overall control flow of your application, restarting failed stages and the entire high level direction of how your application will process the data.
Coding your application so that more data is processed by Executors falls more under the purview of optimising your application so that it processes data more efficiently/faster making use of all the resources available to it in the cluster.
In practice, you do not really need to worry about making sure that more of your data is being processed by executors.
That being said, there are some Actions which, when triggered, necessarily involve moving data to the Driver. If you call the collect action on an RDD, all of its data is brought to the Driver process, and if the RDD holds a sufficiently large amount of data, the application will fail with an out-of-memory error, because the single machine running the Driver process cannot hold all of it.
Keeping the above in mind, Transformations are lazy and Actions are not.
Transformations basically transform one RDD into another, but calling a transformation on an RDD does not actually cause any data to be processed anywhere, on the Driver or on an Executor. All a transformation does is extend the lineage graph (the DAG), which will be executed when an Action is called.
So the actual processing happens when you call an Action on an RDD. The simplest example is that of calling collect. As soon as an action is called, Spark gets to work and executes the previously saved DAG computations on the specified RDD, returning the result back. Where these computations are executed depends entirely on your application.
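A small sketch of that laziness (the class name, data and printed messages are made up): declaring the map below does nothing; its lambda runs inside executor tasks only when collect is called, and only the resulting values travel back to the driver.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    public class LazyTransformations {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("lazy-transformations")
                    .master("local[2]")
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

            // Runs on the driver, but only extends the lineage/DAG: nothing is printed
            // from inside the lambda yet, and no job appears in the Spark UI.
            JavaRDD<Integer> doubled = numbers.map(n -> {
                System.out.println("doubling " + n + " inside an executor task");
                return n * 2;
            });

            System.out.println("transformation declared, nothing executed yet");

            // The action triggers a job: the map lambda now runs inside executor tasks,
            // and only the resulting list is shipped back to the driver.
            List<Integer> result = doubled.collect();
            System.out.println("result on the driver: " + result);

            spark.stop();
        }
    }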
There is no simple and straightforward answer here.
As a rule of thumb, everything that is executed inside the closures of higher-order functions like mapPartitions (map, filter, flatMap) or combineByKey is handled mostly by the executor machines. Everything outside these is handled by the driver. But be aware that this is a serious simplification.
Depending on the specific method and language, at least a part of the job can be handled by the driver. For example, when you use combine-like methods (reduce, aggregate), the final merging is applied locally on the driver machine. Complex algorithms (like many ML / MLlib tools) can interleave distributed and local processing when needed.
Moreover, data processing is only a fraction of the whole job. The driver is responsible for bookkeeping, accumulator processing, initial broadcasting and other secondary tasks. It also handles lineage and DAG processing and generates execution plans for the higher-level APIs (Dataset, Spark SQL).
While the whole picture is relatively complex, in practice your choices are relatively limited. You can:
Avoid collecting data (collect, toLocalIterator) to process it locally.
Perform more work on the workers with the tree* methods (treeAggregate, treeReduce), as in the sketch after this list.
Avoid unnecessary tasks which increase bookkeeping costs.
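A hedged sketch of the tree* point above (the class name and data are made up): reduce ships every partition's partial result to the driver for the final merge, whereas treeReduce performs extra merge rounds on the executors, so the driver only has to combine a handful of values.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    public class TreeReduceSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("tree-reduce-sketch")
                    .master("local[4]")
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            List<Long> data = new ArrayList<>();
            for (long i = 1; i <= 100_000L; i++) data.add(i);
            JavaRDD<Long> numbers = jsc.parallelize(data, 100);   // 100 partitions

            // reduce: each of the 100 tasks computes a partial sum on an executor,
            // then the driver merges those 100 partial results itself.
            long sum1 = numbers.reduce(Long::sum);

            // treeReduce: partial results are merged in additional rounds on the executors,
            // so the driver only receives (and merges) a few values. Useful when there are
            // many partitions or the partial results are large.
            long sum2 = numbers.treeReduce(Long::sum, 2);   // depth 2

            System.out.println(sum1 + " == " + sum2);
            spark.stop();
        }
    }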
Regarding this part of your question: "As far as I understand, transformations run on executors and actions run on the driver because they need to return a value."
It is not true that transformations run only on the executors while all actions run on the driver.
For example, if we have to join two datasets with no aggregate operation to perform:
dataset1.join(dataset2, dataset1.col("colA").equalTo(dataset2.col("colA")), "left_semi")
        .as(Encoders.bean(Some.class))
        .write().save("/user/datasetresult");
In this case, as soon as an executor finishes working on its partition, it starts writing its result to HDFS (or some other persistent storage) without waiting for the other executors to complete. This is why we see multiple part files, which are technically the partitions that each executor processed.
The driver does not wait for all executors to complete their computation.
"Where does the driver actually run? On the cluster?"
It depends on the --deploy-mode chosen.
With --deploy-mode client, the gateway machine from which you launch your Spark application is the driver machine.
With --deploy-mode cluster, the cluster manager (YARN/Mesos) chooses a machine that it decides has sufficient resources to run the driver.

Resources