Does the master node execute actual tasks in Spark? - apache-spark

My question may sound silly, but it bothers me for a long time.
The picture shown above shows the components of a distributed Spark application. I think it indicates that the master node never executes actual tasks, but only serves as a cluster manager. Is that true?
By the way, the tasks here refer to the user-submitted tasks.

Yes, the master node executes the driver process and does not run tasks. Tasks run in executor processes on the worker nodes. The master node is rarely stressed from a CPU standpoint, but depending on how broadcast variables, accumulators and collect are used, it may be quite stressed in terms of RAM usage.

To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs (see the sketch below).
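To make the division of roles concrete, here is a minimal PySpark sketch (the input path is only a hypothetical placeholder): everything in it runs in the driver process, and only the lambda bodies run as tasks inside the executors once the action is triggered.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("roles-sketch").getOrCreate()
    sc = spark.sparkContext

    # Transformations: declared in the driver, nothing is executed yet.
    rdd = sc.textFile("hdfs:///data/input.txt")        # hypothetical input path
    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

    # Action: the scheduler now builds stages and tasks, and the executors run them.
    top = counts.take(10)
    print(top)   # a plain Python list living in the driver's memory

    spark.stop()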

Related

Spark BroadcastHashJoin operator and Dynamic Allocation enabled

In my company, we have the following scenario:
we have dynamic allocation enabled by default for every data pipeline we write, so we can save some costs and enable resource sharing among the different executions
also, most of the running queries perform joins, and Spark has some interesting optimizations here, such as changing the join strategy when it identifies that one side of the join is small enough to be broadcast. This is what is called a BroadcastHashJoin, and lots of our queries have this operator in their query plans
last but not least, our pipelines run on EMR clusters using client mode.
We are having a problem that happens when the YARN (the RM on EMR) queue to which a job was submitted is full and there are not enough resources to allocate new executors for a given application. Since the driver process runs on the machine that submitted the application (client mode), the broadcast job starts and, after 300s, fails with the broadcast timeout error.
Running the same job at a different time (when queue usage is not so high), it completes successfully.
My questions are all related to how these three things work together (dynamic allocation enabled, BHJ, client mode). If you haven't enabled dynamic allocation, it's easy to see that the broadcast operation will occur for every executor that was requested initially through the spark-submit command. But if we enable dynamic allocation, how will the broadcast operation occur for the executors that are dynamically allocated later? Will the driver have to send it again to every new executor? Will they be subject to the same 300-second timeout? Is there a way to prevent the driver (client mode) from starting the broadcast operation until it has enough executors?
Source: BroadcastExchangeExec source code here
PS: we have already tried setting the spark.dynamicAllocation.minExecutors property to 1, but without success. The job still started with only the driver allocated and errored out after 300s.
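For reference, a minimal sketch of the standard configuration properties involved in the questions above (the values are only illustrative, not recommendations for this pipeline):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bhj-dynalloc-sketch")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "1")
             # spark.sql.broadcastTimeout defaults to 300s, which matches the error above;
             # raising it only buys more time, it does not make executors appear faster.
             .config("spark.sql.broadcastTimeout", "600")
             # Setting the threshold to -1 disables automatic BroadcastHashJoin, so Spark
             # falls back to a sort-merge join if you want to avoid the broadcast path.
             .config("spark.sql.autoBroadcastJoinThreshold", "-1")
             .getOrCreate())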

spark executors running multiple copies of the same application

In the Spark documentation it says:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
From the phrase
... executor side (tasks from different applications run in different JVMs)
Does this mean that
If you're running multiple copies (multiple spark-submits) of the same application in a cluster that has many executors, is it possible for an executor to run tasks that belong to different spark-submits in parallel?
If the above is possible, can using singleton objects that are shared between tasks of an executor cause data collisions between different copies (different spark-submits) of the same application?
Each executor is a separate JVM process and is only used by one application, so there is no need to worry about data collisions.
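A small sketch of that isolation, assuming a hypothetical Counter singleton: the singleton lives once per executor process of this application (in PySpark, once per Python worker process), and a second spark-submit gets its own processes and therefore its own, completely separate instance.

    from pyspark.sql import SparkSession

    class Counter:
        # Hypothetical module-level singleton; never shared across applications.
        _instance = None

        @classmethod
        def get(cls):
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

        def __init__(self):
            self.seen = 0

    def count_partition(rows):
        counter = Counter.get()   # one instance per worker process of this application
        for row in rows:
            counter.seen += 1
            yield row

    spark = SparkSession.builder.appName("isolation-sketch").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(100), 4)
    rdd.mapPartitions(count_partition).count()
    spark.stop()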

How does spark.dynamicAllocation.enabled influence the order of jobs?

I need an understanding of when to use spark.dynamicAllocation.enabled - what are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: my data is huge (processing will be done on 10 GB of data with transformations).
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, Job A will usually finish before Job B gets executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle multiple jobs within a Spark application. That means you may submit jobs at the same time or one after another and, with some configuration, expect that these two jobs will run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
Wrapping up...
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
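For completeness, a brief sketch of enabling fair scheduling within one application, using the standard properties quoted above (the pool name is just an example):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fair-scheduling-sketch")
             .config("spark.scheduler.mode", "FAIR")   # the default is FIFO
             .getOrCreate())
    sc = spark.sparkContext

    # Jobs submitted from this thread are scheduled in the named pool.
    sc.setLocalProperty("spark.scheduler.pool", "high-priority")
    sc.parallelize(range(1000), 8).count()

    # Reset to the default pool for subsequent jobs from this thread.
    sc.setLocalProperty("spark.scheduler.pool", None)

Note that this fair sharing applies to jobs within a single application; executor allocation between two separate applications (Job A and Job B above) is still decided by the cluster manager.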

Spark Shuffle - How workers know where to pull data from

I am trying to understand how Spark handles shuffle dependencies under the hood. Thus I have two questions:
In Spark, how does an executor know which other executors it has to pull data from?
Does each executor, after finishing its map-side tasks, update its status and location to some central entity (maybe the driver), and does the reduce-side executor first contact the driver to get the location of each executor to pull from, and then pull from those executors directly?
In a job with a shuffle dependency, does the driver schedule joins (or other tasks that depend on the shuffle) only after all map-side tasks have finished?
Does it mean that each task will notify the driver about its status, and the driver will orchestrate the other dependent tasks in a timely manner?
I will answer your questions point by point.
1. How does an executor know which other executors it has to pull data from?
Simply put, an executor doesn't know what the other executors do, but the driver does. You can think of this process as a queen and workers: the queen pushes the tasks out to the executors, and each one reports back with the results when it finishes its task.
2. Does each executor, after finishing its map-side tasks, update its status and location to some central entity (maybe the driver)?
Yes, the driver actually monitors the process. When you create the SparkContext, each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program, and the driver can then send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down. You can also look at this answer: What is a task in Spark? How does the Spark worker execute the jar file?
3. Does the reduce-side executor first contact the driver to get the location of each executor to pull from, and then pull from those executors directly?
A reduce task is preferentially scheduled on the same machine the data it needs lives on, so there will not be any shuffle unless that data is not available locally and there are no resources there.
4. In a job with a shuffle dependency, does the driver schedule joins (or other tasks that depend on the shuffle) only after all map-side tasks have finished?
It is configurable; you can change it. You can have a look at this link for more information: https://0x0fff.com/spark-architecture-shuffle/
5. Does it mean that each task will notify the driver about its status, and the driver will orchestrate the other dependent tasks in a timely manner?
Each task notifies and sends heartbeats to the driver, and Spark implements a speculative execution technique, so if any task fails or runs slowly, Spark will run another copy of it. More details here: http://asyncified.io/2016/08/13/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
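For point 5, a small sketch of the standard speculative-execution properties (speculation is off by default; the other values shown are Spark's documented defaults):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("speculation-sketch")
             .config("spark.speculation", "true")             # disabled by default
             .config("spark.speculation.interval", "100ms")   # how often to check for slow tasks
             .config("spark.speculation.multiplier", "1.5")   # how much slower than the median counts as slow
             .config("spark.speculation.quantile", "0.75")    # fraction of tasks that must finish before speculating
             .getOrCreate())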

Creating spark tasks from within tasks (map functions) on the same application

Is it possible to do a map from a mapper function (i.e. from tasks) in pyspark?
In other words, is it possible to open "sub tasks" from a task?
If so, how do I pass the SparkContext to the tasks - just as a variable?
I would like to have a job that is composed of many tasks, and each of these tasks should create many tasks as well, without going back to the driver.
My use case is like this:
I am porting an application that was written using work queues to pyspark.
In my old application, tasks created other tasks, and we used this functionality. I don't want to redesign the whole codebase because of the move to Spark (especially because I will have to make sure that both platforms work during the transition phase between the systems)...
Is it possible to open "sub tasks" from a task?
No, at least not in a healthy manner*.
A task is a command sent from the driver, and Spark has one driver (central coordinator) that communicates with many distributed workers (executors).
As a result, what you ask for here implies that every task could play the role of a sub-driver. Not even a worker can do that, and it would meet the same fate in my answer as the task.
Relevant resources:
What is a task in Spark? How does the Spark worker execute the jar file?
What are workers, executors, cores in Spark Standalone cluster?
*With that said, I mean that I am not aware of any hack to do this, and if one exists it would be too specific.
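To make the constraint concrete, a small PySpark sketch with a trivial, made-up fan-out workload: the SparkContext exists only in the driver, so referencing it inside a task fails, and the usual way to express "tasks creating work" is a driver-side transformation such as flatMap.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("no-subtasks-sketch").getOrCreate()
    sc = spark.sparkContext

    def bad_mapper(x):
        # This would need the SparkContext inside a task; it only exists on the
        # driver, so running the commented-out line below raises an error.
        return sc.parallelize(range(x)).count()

    # sc.parallelize([1, 2, 3]).map(bad_mapper).collect()   # fails: no SparkContext on executors

    # One common way to express the same fan-out without "sub tasks": let the driver
    # declare the expansion as a transformation, so the extra work becomes ordinary tasks.
    result = sc.parallelize([1, 2, 3]).flatMap(lambda x: list(range(x))).collect()
    print(result)

    spark.stop()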
