Spark Shuffle - How workers know where to pull data from - apache-spark

I am trying to understand how Spark handles shuffle dependencies under the hood. I have two questions:
In Spark, how does an executor know which other executors it has to pull data from?
Does each executor, after finishing its map-side task, update its status and location to some central entity (maybe the driver), and does the reduce-side executor first contact the driver to get the location of each executor to pull from, and then pull from those executors directly?
In a job with a shuffle dependency, does the driver schedule joins (or other tasks on the shuffle dependency) only after all map-side tasks have finished?
Does it mean that each task will notify the driver about its status, and the driver will orchestrate other dependent tasks in a timely manner?

I will answer your questions point by point.
1. How does an executor know which other executors it has to pull data from?
Simply put, an executor doesn't know what the other executors are doing, but the driver does. You can think of this process as a queen and workers: the driver pushes tasks to the executors, and each one returns its results when it finishes its task.
2. Does each executor, after finishing its map-side task, update its status and location to some central entity (maybe the driver)?
Yes, the driver monitors the process. When you create the SparkContext, each worker starts an executor. An executor is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program, and the driver can then send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down. You can also look at this answer: What is a task in Spark? How does the Spark worker execute the jar file?
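As an illustration of the commands the driver sends (flatMap, map, reduceByKey), here is a minimal word-count sketch; the application name and input path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkContext; the cluster manager then launches the executors,
    // which connect back to this driver process.
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))

    // flatMap and map run as map-side tasks on the executors;
    // reduceByKey introduces a shuffle dependency between the two stages.
    val counts = sc.textFile("hdfs:///path/to/input")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)   // the action triggers job scheduling by the driver
    sc.stop()
  }
}
```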
3. Does the reduce-side executor first contact the driver to get the location of each executor to pull from, and then pull from those executors directly?
A reduce task is preferentially scheduled on the same machine the data lives on, so there will not be any shuffle unless the data is not available there or there are no free resources.
4. In a job with a shuffle dependency, does the driver schedule joins (or other tasks on the shuffle dependency) only after all map-side tasks have finished?
It is configurable; you can change it. Have a look at this link for more information: https://0x0fff.com/spark-architecture-shuffle/
5. Does it mean that each task will notify the driver about its status, and the driver will orchestrate other dependent tasks in a timely manner?
Each task notifies and sends heartbeats to the driver, and Spark implements a speculative execution technique. So if any task fails or is slow, Spark will run another copy of it. More details here: http://asyncified.io/2016/08/13/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
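For reference, speculative execution is driven by Spark configuration; a short sketch of the relevant settings (the values shown are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune them for your workload.
val conf = new SparkConf()
  .set("spark.speculation", "true")            // re-launch copies of slow-running tasks
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
```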

Related

Is Broadcast variable in spark moved to each executor or each nodemanager in YARN?

In YARN, how is a broadcast variable distributed across nodes? I am confused whether it goes to each executor under a NodeManager, or whether there is only one copy on each node manager in the cluster.
Please let me know about it. Thanks in advance.
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors.
In order to understand broadcast variable behavior, first understand how the Spark life cycle works:
The Spark driver is started by YARN.
It creates the DAG for the job.
A job consists of mapping and reducing tasks.
These mapping and reducing tasks are run in an executor (i.e. a separate JVM process with its own thread pool).
Each executor gets its own copy of the broadcast variable at the time of initialization. The broadcast variable is distributed by Spark, not YARN.
Step 3 is repeated per task depending on the volume of data, or repeatedly if it is a Spark Streaming job, but there will always be one copy of the broadcast variable, and it stays with the executor until the executor goes down.
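As a short illustration of this lifecycle, here is a minimal sketch of creating and reading a broadcast variable; the lookup data is made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-demo").getOrCreate()
val sc = spark.sparkContext

// The driver ships this read-only map to each executor once, not once per task.
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))  // made-up lookup data

val codes = sc.parallelize(Seq("DE", "FR", "DE", "XX"))
val resolved = codes.map(code => countryNames.value.getOrElse(code, "unknown"))  // read on the executors
resolved.collect().foreach(println)
```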
If by node manager you are referring to Spark worker nodes, then I don't think Spark worker nodes need the broadcast variable, so they might not keep any copy of it.
The node manager (i.e. Spark worker) is responsible for:
- allocating the required resources
- registering executor to the driver
- maintaining the Driver - Executor communication

In a Spark cluster, is there one driver process per machine or one driver process per cluster?

I am reading Spark: The Definitive Guide and am learning a great deal.
However, one thing I am confused about while reading is how many driver processes there are per Spark job. When the text first introduces driver and executor processes, it implies that there is a driver per machine, but in the discussion about broadcast variables, it sounds as though there is one driver per cluster.
This is because the text talks about the driver process sending the broadcast variable to every node in the cluster so that it can be cached there. That makes it sound as though there is only one driver process in the whole cluster that is responsible for this.
Which one is it: one driver process per cluster, or one per machine? Or can it be both? I think I am missing something here.
In the Spark architecture, there is only one driver for your Spark application.
The Spark driver, as part of the Spark application, is responsible for instantiating a Spark session. The driver has multiple responsibilities:
It communicates with the cluster manager (CM).
It requests resources from the CM for Spark's executor JVMs.
It transforms all Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
Its interaction with the CM is merely to get Spark executor resources.
So, the workflow of running spark applications on a cluster can be seen as:
The user submits an application using spark-submit.
spark-submit launches the driver program and invokes the main method specified by the user.
The driver program contacts the cluster manager to ask for resources to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD or dataset operations in the program, the driver sends work to executors in the form of tasks.
The tasks are run in executor processes to compute and save the results.
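A minimal application skeleton that follows this workflow might look like the sketch below; the class name, master, and jar in the spark-submit comment are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Submitted with something like:
//   spark-submit --class example.MyApp --master yarn my-app.jar
// (class name, master and jar are placeholders)
object MyApp {
  def main(args: Array[String]): Unit = {
    // Runs in the single driver process; executors are requested via the cluster manager.
    val spark = SparkSession.builder().appName("my-app").getOrCreate()

    // The driver turns these operations into a DAG and ships them to the executors as tasks.
    val df = spark.range(0, 1000000)
    println(df.filter("id % 2 = 0").count())   // the action triggers job execution

    spark.stop()
  }
}
```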

apache spark executors and data locality

The Spark literature says
Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads.
If I understand this right, in static allocation the executors are acquired by the Spark application on all nodes in the cluster when the SparkContext is created (in cluster mode). I have a couple of questions:
If executors are acquired on all nodes and will stay allocated to
this application during the duration of the whole application,
isn't there a chance a lot of nodes remain idle?
What is the advantage of acquiring resources when Spark context is
created and not in the DAGScheduler? I mean the application could be
arbitrarily long and it is just holding the resources.
So when the DAGScheduler tries to get the preferred locations and
the executors in those nodes are running the tasks, would it
relinquish the executors on other nodes?
I have checked a related question
Does Spark on yarn deal with Data locality while launching executors
But I'm not sure there is a conclusive answer
If executors are acquired on all nodes and will stay allocated to this application during the duration of the whole application, isn't there a chance a lot of nodes remain idle?
Yes, there is a chance. If you have data skew this will happen. The challenge is to tune the executors and executor cores so that you get maximum utilization. Spark also provides dynamic resource allocation, which ensures that idle executors are removed (see the sketch below).
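A sketch of the dynamic resource allocation configuration mentioned above (values are illustrative; on YARN the external shuffle service also needs to be enabled):

```scala
import org.apache.spark.SparkConf

// Illustrative values only.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.shuffle.service.enabled", "true")   // keeps shuffle files available when an executor is removed
```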
What is the advantage of acquiring resources when Spark context is created and not in the DAGScheduler? I mean the application could be arbitrarily long and it is just holding the resources.
Spark tries to keep data in memory while doing transformations, contrary to the MapReduce model, which writes to disk after every map operation. Spark can keep the data in memory only if it can ensure the code is executed on the same machine. This is the reason for allocating resources beforehand.
So when the DAGScheduler tries to get the preferred locations and the executors in those nodes are running the tasks, would it relinquish the executors on other nodes?
Spark can't start a task on an executor unless that executor is free. The Spark application master negotiates with YARN to get the preferred locations. It may or may not get them; if it doesn't, it will start the task on a different executor.
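How long the scheduler waits for a preferred (data-local) executor before falling back to a less-local one is governed by Spark's locality wait settings; a short sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

// How long the scheduler waits for a free executor at each locality level
// before launching the task on a less-local executor. Illustrative values.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")        // default wait applied per locality level
  .set("spark.locality.wait.node", "3s")   // wait for NODE_LOCAL before falling back to RACK_LOCAL
  .set("spark.locality.wait.rack", "3s")   // wait for RACK_LOCAL before falling back to ANY
```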

Does master node execute actual tasks in Spark?

My question may sound silly, but it bothers me for a long time.
The picture shown above depicts the components of a distributed Spark application. I think this picture indicates that the master node never executes actual tasks, but only serves as a cluster manager. Is that true?
By the way, the tasks here refer to user-submitted tasks.
Yes, the master node executes the driver process and does not run tasks. Tasks run in executor processes on the worker nodes. The master node is rarely stressed from a CPU standpoint but, depending on how broadcast variables, accumulators and collect are used, it may be quite stressed in terms of RAM usage.
To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
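To make the earlier point about driver memory concrete, here is a small sketch: transformations and foreach run on the executors, while collect() pulls every result back into the driver JVM:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("driver-vs-executors").getOrCreate()
val sc = spark.sparkContext

val squared = sc.parallelize(1L to 1000000L).map(x => x * x)

// Runs entirely on the executors; nothing is sent back to the driver.
squared.foreach(_ => ())

// collect() moves every element into the driver JVM; a large result can exhaust driver memory.
val onDriver: Array[Long] = squared.collect()
println(onDriver.length)
```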

Is there a way to understand which executors were used for a job in java/scala code?

I'm having trouble evenly distributing streaming receivers among all executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors. I create 8 custom streaming receivers, and Spark is supposed to launch these receivers one per executor. However, this doesn't happen all the time, and sometimes all receivers are launched on the same executor (here's the JIRA bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job, and, if I got all the executors, create the streaming receivers.
To do that, though, I need to know whether there is a way to find out which executors were used for a job in Java/Scala code.
I believe it is possible to see which executors were doing which jobs by accessing the Spark UI and the Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
On the Executors screen of the web UI you can see which executors are active. If there are cores/nodes that are not being used, you can detect them by looking at which cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks that are running on it.
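Beyond the UI, one programmatic option along the lines the question describes (run a dummy job so the executors register, then inspect them) is SparkContext's getExecutorMemoryStatus; a minimal sketch, noting that the returned map typically also includes the driver's own block manager:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("executor-probe"))

// A dummy job with many partitions forces executors to register and do work first.
sc.parallelize(1 to 10000, numSlices = 100).count()

// Keys are "host:port" block-manager addresses; the driver's own block manager
// is typically included, so the executor count is roughly the map size minus one.
val blockManagers = sc.getExecutorMemoryStatus.keys.toSeq
println(s"Registered block managers: ${blockManagers.mkString(", ")}")
```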
