Is there a way to understand which executors were used for a job in java/scala code? - apache-spark

I'm having trouble evenly distributing streaming receivers among all executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors, I create 8 streaming custom receivers and spark is supposed to launch these receivers one per executor. However this doesn't happen all the time and sometimes all receivers are launched on the same executor (here's the jira bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job and if I got all the executors, create the streaming receivers.
To do that, though, I need a way to find out from Java/Scala code which executors were used for a job.

I believe it is possible to see which executors were doing which jobs by accessing the Spark UI and Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
In the following screen you can see which executors are active. If there are cores/nodes that are not being used, you can detect them by looking at which cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks that are running on it.
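If you also need this programmatically (for example, to run a dummy job first and only create the receivers once tasks have landed on all executors, as the question suggests), one possible approach is a SparkListener that records the executor ID of every finished task. This is only a sketch; the dummy job and its partition count are arbitrary:

import scala.collection.concurrent.TrieMap
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Records the ID of every executor that finished a task.
class ExecutorTracker extends SparkListener {
  val seen = TrieMap.empty[String, Unit]
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    seen.put(taskEnd.taskInfo.executorId, ())
}

def executorsUsedByDummyJob(sc: SparkContext): Set[String] = {
  val tracker = new ExecutorTracker
  sc.addSparkListener(tracker)
  // Dummy job with many small partitions so tasks get spread across executors.
  sc.parallelize(1 to 10000, numSlices = 64).foreach(_ => ())
  tracker.seen.keySet.toSet
}

Another rough check is sc.getExecutorMemoryStatus.keys, which lists the block managers of all registered executors (plus the driver), so it can tell you how many executors have registered before you start the receivers.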

Related

In a Spark cluster, is there one driver process per machine or one driver process per cluster?

I am reading Spark: The Definitive Guide and am learning a great deal.
However, one thing I am confused about while reading is how many driver processes there are per Spark job. When the text first introduces driver and executor processes, it implies that there is a driver per machine, but in the discussion about broadcast variables, it sounds as though there is one driver per cluster.
This is because the text talks about the driver process sending the broadcast variable to every node in the cluster so that it can be cached there. That makes it sound as though there is only one driver process in the whole cluster that is responsible for this.
Which one is it: one driver process per cluster, or one per machine? Or can it be both? I think I am missing something here.
In the Spark architecture, there is only one driver for your Spark application.
The Spark driver, as part of the Spark application, is responsible for instantiating the Spark session. The driver has several responsibilities:
It communicates with the cluster manager (CM).
It requests resources from the CM for Spark's executor JVMs.
It transforms all Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
Its interaction with the CM is merely to get Spark executor resources.
So, the workflow of running spark applications on a cluster can be seen as:
The user submits an application using spark-submit.
spark-submit launches the driver program and invokes the main method specified by the user.
The driver program contacts the cluster manager to ask for resources to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD or dataset operations in the program, the driver sends work to executors in the form of tasks.
The tasks run on executor processes to compute and save the results.
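To make the single-driver picture concrete, here is a minimal sketch of a driver program (the app name and numbers are arbitrary). Everything up to the count() action is planned on the one driver; only the resulting tasks run on the executors.

import org.apache.spark.sql.SparkSession

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The single driver process instantiates the SparkSession/SparkContext.
    val spark = SparkSession.builder()
      .appName("driver-example")
      .getOrCreate()

    // Transformations are planned on the driver; the count() action below is
    // split into tasks that the driver schedules across the executors.
    val count = spark.sparkContext
      .parallelize(1 to 1000000, numSlices = 100)
      .map(_ * 2)
      .count()

    println(s"count = $count")
    spark.stop()
  }
}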

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory: 24576, vCores: 6
I have two Spark streaming jobs to submit, one after the other.
First, I tried submitting with the default configuration, spark.dynamicAllocation.enabled=true.
In 30% of cases, I saw that the first job grabbed almost all available memory and the second was queued and waited for resources for ages. (This is a streaming job which took a small portion of resources every batch.)
My second try was to disable dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly in Yarn UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
In the Spark UI there are 6 executors and a driver for the first job, and 2 executors and a driver for the second job.
After retrying (deleting the previous jobs and submitting the same jobs) without dynamic allocation and with the same configuration, I got a totally different result:
5 containers with 59g memory allocation for both jobs, and 71g reserved memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1. If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but it differs in memory.)
2. If dynamicAllocation=false, why doesn't YARN create containers to my exact requirements, i.e. 6 containers (Spark executors) for both jobs? And why do two attempts with the same configuration lead to different results?
3. If dynamicAllocation=true, how can a Spark job with low memory consumption take control of all the YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
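As a side note, these are the main configuration knobs behind that behavior; the values below are purely illustrative, not recommendations:

import org.apache.spark.sql.SparkSession

// Executors are requested while tasks sit in the backlog longer than
// schedulerBacklogTimeout, and released after being idle for executorIdleTimeout.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-knobs")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")  // usually required with dynamic allocation on YARN
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()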
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory; that's why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
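A minimal sketch of that pattern with Structured Streaming, using the built-in rate source and console sink as stand-ins for your real sources and sinks (each start() returns immediately and the query runs on its own thread, so both queries share the executors of this single application):

import org.apache.spark.sql.SparkSession

object TwoStreamsOneApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-streams-one-app")
      .getOrCreate()

    // Two independent streaming queries inside one Spark application.
    val q1 = spark.readStream.format("rate").load()
      .writeStream.format("console").queryName("stream-1").start()
    val q2 = spark.readStream.format("rate").load()
      .writeStream.format("console").queryName("stream-2").start()

    spark.streams.awaitAnyTermination()
  }
}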
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. First, the resource allocation by YARN depends on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. When you throw dynamic allocation into the mix, things get a little more confusing/complex. In your case:
7 Running Containers with 84g Memory allocation for the first job.
--You got 7 containers because you requested 6 executors: one container for each executor, and the extra container is for the Application Master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
--Since the second job was submitted some time later, YARN allocated the remaining resources: one container for each of the 2 executors, and the extra one for your Application Master.
The number of containers will never match the number of executors you requested; it will always be one more, because you need one container to run your Application Master.
Hope that answers part of your question.

What is 'Active Jobs' in Spark History Server Spark UI Jobs section

I'm trying to understand Spark History server components.
I know that the History Server shows completed Spark applications.
Nonetheless, I see 'Active Jobs' set to 1 for a completed Spark application. I'm trying to understand what 'Active Jobs' means in the Jobs section.
Also, the application completed within 30 minutes, but when I opened the History Server after 8 hours, 'Duration' showed 8.0h.
Please see the screenshot.
Could you please help me understand the 'Active Jobs', 'Duration' and 'Stages: Succeeded/Total' items in the above image?
Finally, after some research, I found the answer to my question.
A Spark application consists of a driver and one or more executors. The driver program instantiates the SparkContext, which coordinates the executors to run the Spark application. This information is displayed in the Spark History Server Web UI's 'Active Jobs' section.
The executors run tasks assigned by the driver.
When a Spark application runs on YARN, it has its own implementations of a YARN client and a YARN application master.
A YARN application has a YARN client, a YARN application master, and a list of containers running on the node managers.
In my case YARN is running in standalone mode, thus the driver program runs as a thread of the YARN application master. The YARN client pulls status from the application master, and the application master coordinates the containers to run the tasks.
This running job could be monitored in YARN applications page in the Cloudera Manager Admin Console, while it is running.
If the application succeeds, then the History Server will show the list of 'Completed Jobs' and the 'Active Jobs' section will be removed.
If the application fails at the container level and YARN communicates this information to the driver, then the History Server will show the list of 'Failed Jobs' and the 'Active Jobs' section will be removed.
However, if the application fails at the container level and YARN couldn't communicate that to the driver, then the driver-instantiated job ends up in a limbo state: the driver thinks the job is still running and keeps waiting to hear about the job status from the YARN application master. Hence, in the History Server, it still shows up in 'Active Jobs' as running.
So my take away from this is:
To check the status of a running job, go to the YARN applications page in the Cloudera Manager Admin Console or use the YARN CLI.
After job completion/failure, open the Spark History Server to get more details on resource usage, the DAG, and execution timeline information.
Invoking an action (count, in your case) inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages.
A stage is a physical unit of the execution plan. In short, a stage is a set of parallel tasks, one task per partition. Each job is divided into these smaller sets of tasks, the stages, which depend on each other; this is roughly analogous to the map and reduce stages in MapReduce.
The types of Spark stages in detail:
a. ShuffleMapStage in Spark
A ShuffleMapStage is an intermediate stage in the physical execution of the DAG.
It produces data for another stage (or stages); you can think of a ShuffleMapStage as input for the stages that follow it in the DAG.
A ShuffleMapStage can contain any number of pipelined operations, such as map and filter, before the shuffle operation. Furthermore, a single ShuffleMapStage can be shared among different jobs.
b. ResultStage in Spark
A ResultStage is the final stage of a job: it runs the function of a Spark action on one or more partitions of the target RDD to compute the result of that action.
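As a concrete illustration, here is a small word-count-style job (the input path is purely hypothetical) whose DAG consists of exactly one ShuffleMapStage and one ResultStage:

// reduceByKey forces a shuffle, so Spark plans a ShuffleMapStage
// (flatMap + map + shuffle write) followed by a ResultStage (count).
val total = sc.textFile("hdfs:///tmp/input.txt")   // illustrative path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // stage boundary: the ShuffleMapStage ends here
  .count()              // action: triggers the job; runs as the ResultStage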
Coming back to the question of active jobs on the History Server: the official docs have some notes on incomplete applications shown by the History Server, and there is also a JIRA issue, [SPARK-7889], regarding the same behavior.
Follow the links for more details.

How does spark.dynamicAllocation.enabled influence the order of jobs?

I need to understand when to use spark.dynamicAllocation.enabled - what are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: my data is huge (processing will be done on 10GB of data with transformations).
Which job gets preference on the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, job A will usually finish before job B will be executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle multiple jobs from within a Spark application. That means you may submit jobs at the same time or one after another and, with the right configuration, expect these two jobs to run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
Wrapping up...
Which Job gets the preference on allocation of executors to Job A or Job B and how does the spark co-ordinates b/w 2 applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
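If you do want concurrent jobs inside a single application, a minimal sketch could look like the following; the pool name and the job bodies are purely illustrative:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession

// Two jobs submitted from separate threads inside one application,
// with the FAIR scheduler enabled.
val spark = SparkSession.builder()
  .appName("fair-scheduling-sketch")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()
val sc = spark.sparkContext

val jobA = Future {
  sc.setLocalProperty("spark.scheduler.pool", "high-priority")
  sc.parallelize(1 to 1000000, 50).map(_ * 2).count()
}
val jobB = Future {
  sc.setLocalProperty("spark.scheduler.pool", "default")
  sc.parallelize(1 to 1000000, 50).map(_ + 1).count()
}

// Both actions run concurrently; FAIR scheduling shares executors between them.
Await.result(jobA, Duration.Inf)
Await.result(jobB, Duration.Inf)
spark.stop()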

Spark Standalone cluster master web UI inaccessible after an application finishes

I have a Spark application that finishes without error, but once it's done, has saved all of its outputs, and the process has terminated, the Spark standalone cluster master process becomes a CPU hog, using 16 CPUs full time for hours, and the web UI becomes unresponsive. I have no idea what it could be doing; is there some complicated clean-up step?
Some more details:
I've got a Spark standalone cluster (27 workers/nodes) that I've been successfully submitting jobs to for a while. I recently scaled up the size of my applications; the largest now takes 3.5 hours using 100 cores across 27 workers, and each worker has ~dozens of GB of shuffle read/write over the course of the job. Otherwise, the application is no different from the smaller jobs that ran successfully before.
This is a known issue with Spark's standalone cluster, and is caused by the massive event log created by large applications. You can read more at the issue tracking link below.
https://issues.apache.org/jira/browse/SPARK-12299
At the current time, the best work-around is to disable event logging for large jobs.
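For example, a sketch of that work-around when building the session (note that the application will then no longer show up in the History Server):

import org.apache.spark.sql.SparkSession

// Disable the event log for this (large) job so the standalone master
// doesn't have to replay a huge log once the application finishes.
val spark = SparkSession.builder()
  .appName("big-job-without-event-log")
  .config("spark.eventLog.enabled", "false")
  .getOrCreate()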
