Spark BroadcastHashJoin operator and Dynamic Allocation enabled - apache-spark

In my company, we have the following scenario:
we have dynamic allocation enabled by default for every data pipeline we write, so we can save some costs and share resources among the different executions
also, most of the queries we run perform joins, and Spark has some interesting optimizations here, like the join strategy change that occurs when Spark identifies that one side of the join is small enough to be broadcast. This is the BroadcastHashJoin, and we have lots of queries with this operator in their query plans
last but not least, our pipelines run on EMR clusters using the client mode.
We are having a problem that happens when the YARN queue (the RM on EMR) where a job was submitted is full and there are not enough resources to allocate new executors for a given application. Since the driver process runs on the machine that submitted the application (client mode), the broadcast job starts anyway and, after 300s, fails with the broadcast timeout error.
Running the same job on a different schedule (at a time when queue usage is not as high), it completes successfully.
My questions are all related to how these three different things work together (dynamic allocation enabled, BHJ, client mode). If dynamic allocation is not enabled, it's easy to see that the broadcast operation will occur for every executor that was requested initially through the spark-submit command. But if we enable dynamic allocation, how will the broadcast operation occur for the executors that are allocated dynamically later? Will the driver have to send the broadcast again for every new executor? Will they be subject to the same 300-second timeout? Is there a way to prevent the driver (client mode) from starting the broadcast operation until it has enough executors?
Source: BroadcastExchangeExec source code here
PS: we have already tried setting the spark.dynamicAllocation.minExecutors property to 1, but with no success. The job still started with only the driver allocated and failed after the 300s.
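For reference, a minimal sketch of the two Spark SQL settings that usually come into play with this error, assuming a Spark 2.x job built around a SparkSession. The values are illustrative only, and neither setting addresses the underlying lack of executors in the busy queue:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("broadcast-timeout-sketch")
      // Raise the broadcast timeout (default is 300 seconds), giving the
      // BroadcastExchange job more time to wait for executors to come up.
      .config("spark.sql.broadcastTimeout", "1200")
      // Or disable automatic broadcast joins (-1) so the planner falls back
      // to a SortMergeJoin instead of a BroadcastHashJoin.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()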

Related

How to handle executor failure in apache spark

I ran a job using spark-submit and, during execution, we lost an executor. At that point, can the executor be recovered or not? If it can, how does the recovery work, and how do we get that executor back?
You cannot handle executor failures programmatically in your application, if that's what you are asking.
You can configure Spark configuration properties that guide the actual job execution, including how YARN schedules jobs and handles task and executor failures.
https://spark.apache.org/docs/latest/configuration.html#scheduling
Some important properties you may want to check out:
spark.task.maxFailures (default=4): Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1.
spark.blacklist.application.maxFailedExecutorsPerNode (default=2): (Experimental) How many different executors must be blacklisted for the entire application, before the node is blacklisted for the entire application. Blacklisted nodes will be automatically added back to the pool of available resources after the timeout specified by spark.blacklist.timeout. Note that with dynamic allocation, though, the executors on the node may get marked as idle and be reclaimed by the cluster manager.
spark.blacklist.task.maxTaskAttemptsPerExecutor (default=1): (Experimental) For a given task, how many times it can be retried on one executor before the executor is blacklisted for that task.
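As an illustration only, here is a minimal sketch of how these properties could be set programmatically (they can equally be passed as --conf flags to spark-submit). The values below are assumptions, not recommendations:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Hypothetical values; tune them for your cluster and failure patterns.
    val conf = new SparkConf()
      .set("spark.task.maxFailures", "8")                               // give up on the job only after 8 failures of the same task
      .set("spark.blacklist.enabled", "true")                           // turn on executor/node blacklisting
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "2")
      .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")

    val spark = SparkSession.builder().config(conf).getOrCreate()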

How does spark.dynamicAllocation.enabled influence the order of jobs?

Need an understanding of when to use spark.dynamicAllocation.enabled - what are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: My Data is huge (processing will be done on 10GB data with transformations).
Which job gets preference on the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, job A will usually finish before job B will be executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle jobs submitted from multiple threads of a Spark application. That means you may submit jobs at the same time or one after another and, with some configuration, expect these two jobs to run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
Wrapping up...
Which job gets preference on the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
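If what you actually need is fair sharing of jobs inside a single application, a minimal sketch could look like the following. The allocation file path and pool name are hypothetical, and this does not change how YARN arbitrates between two separate applications:

    import org.apache.spark.sql.SparkSession

    // Switch the in-application scheduler from the default FIFO to FAIR.
    val spark = SparkSession.builder()
      .config("spark.scheduler.mode", "FAIR")
      // Optional: an XML file defining named pools with weights/minShare.
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    // Jobs submitted from this thread are assigned to the "high-priority" pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high-priority")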

Sudden surge in number of YARN apps on HDInsight cluster

For some reason the cluster sometimes seems to misbehave, and I suddenly see a surge in the number of YARN jobs. We are using an HDInsight Linux-based Hadoop cluster. We run Azure Data Factory jobs that basically execute Hive scripts pointing to this cluster. Generally, the average number of YARN apps at any given time is around 50 running and 40-50 pending. Nobody uses this cluster for ad-hoc query execution. But once every few days we notice something weird: the number of YARN apps suddenly starts increasing, both running and pending, but especially pending apps. The count goes to more than 100 running YARN apps, and the pending count goes above 400 or sometimes even 500+. We have a script that kills all YARN apps one by one, but it takes a long time, and that too is not really a solution. From our experience we found that the only solution, when this happens, is to delete and recreate the cluster. It may be that for some time the cluster's response is delayed (the Hive component especially), but in that case, even if ADF keeps retrying a failing slice several times, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run them when it can? That's probably the only explanation for why it could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job, which is the actual Hive script you are trying to run. The YARN queue can get deadlocked if there are so many parent jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, and thus no work is actually being done. Note that if you kill the Templeton job, Data Factory will mark the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run, slowly draining the queue.
As the correct long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
To create the queue, use Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to the Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

Does master node execute actual tasks in Spark?

My question may sound silly, but it bothers me for a long time.
The picture shown above shows the components of a distributed Spark application. I think this picture indicates that the master node never executes actual tasks, but only serves as a cluster manager. Is that true?
By the way, the tasks here refer to user-submitted tasks.
Yes, the master node executes the driver process and does not run tasks. Tasks run in executor processes on the worker nodes. The master node is rarely stressed from a CPU standpoint but, depending on how broadcast variables, accumulators and collect are used, it may be quite stressed in terms of RAM usage.
To explain a bit more on the different roles:
The driver prepares the context and declares the operations on the data using RDD transformations and actions.
The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
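A tiny sketch of that split, assuming a live SparkContext named sc: the transformation closure is shipped to and executed by the executors, while the action result comes back to the driver.

    // Runs on the driver: only builds the (lazy) RDD lineage.
    val doubled = sc.parallelize(1 to 10).map(_ * 2)

    // Triggers a job: the map closure runs as tasks inside executor JVMs,
    // and the results are collected back to the driver process.
    val result = doubled.collect()
    println(result.mkString(", "))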

Is there a way to understand which executors were used for a job in java/scala code?

I'm having trouble evenly distributing streaming receivers among all executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors; I create 8 custom streaming receivers, and Spark is supposed to launch these receivers one per executor. However, this doesn't happen all the time, and sometimes all receivers are launched on the same executor (here's the JIRA bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job and, if I got all the executors, create the streaming receivers.
To do that, however, I need to know whether there is a way to find out which executors were used for a job in Java/Scala code.
I believe it is possible to see which executors were doing which jobs by accessing the Spark UI and the Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
In the following screen you can see which executors are active. If there are cores/nodes that are not being used, you can detect them by looking at which cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks that are running on it.
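Programmatically, a rough sketch of the idea from the question (run a dummy job and record which executors its tasks landed on) could look like this, assuming a live SparkContext named sc. It is an illustration of the listener API, not a guaranteed way to force receiver placement:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.mutable

    // Collect the IDs of the executors that actually ran tasks.
    val usedExecutors = mutable.Set.empty[String]

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        usedExecutors.synchronized { usedExecutors += taskEnd.taskInfo.executorId }
    })

    // Dummy job with many partitions so tasks spread across executors.
    sc.parallelize(1 to 1000, 100).count()

    // Listener events are delivered asynchronously, so give the bus a moment.
    Thread.sleep(1000)
    println(s"Executors used by the dummy job: ${usedExecutors.mkString(", ")}")

    // Registered executors (including idle ones) can also be listed,
    // keyed by "host:port".
    println(sc.getExecutorMemoryStatus.keys.mkString(", "))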
