Uneven distribution of tasks among the spark executors - apache-spark

I am using spark-streaming 2.2.1 in production. In this application I read data from RabbitMQ, do some further processing, and finally save it to Cassandra. I am facing a strange issue where tasks are not evenly distributed among the executors on one of the nodes. I restarted the streaming job, but the issue still persists.
As you can see, on 10.10.4.72 I have 2 executors. The one running on port 41893 has completed roughly double the number of tasks of the executors on the other nodes (10.10.3.73 and 10.10.3.72), whereas the executor running on port 33451 on 10.10.4.72 has completed only 18 tasks. The issue persists even if I restart my spark-streaming job.
Edit:
After 12 hours, as you can see in the image below, the same executor has still not processed a single task.

Related

best way to run 300+ concurrent spark jobs in Dataproc?

I have a Dataproc cluster with 2 worker nodes (n1s2). There is an external server which submits around 360 spark jobs within an hour (with a couple of minutes spacing between each submission). The first job completes successfully but the subsequent ones get stuck and do not proceed at all.
Each job crunches some time-series numbers and writes to Cassandra; it usually takes 3-6 minutes when the cluster is completely free.
I feel this could be solved by just scaling up the cluster, but that would become very costly for me.
What would be the other options to best solve this use case?
Running 300+ concurrent jobs on a cluster with 2 worker nodes doesn't sound feasible. You want to first estimate how much resource (CPU, memory, disk) each job needs and then plan the cluster size accordingly. YARN metrics such as available CPU, available memory, and especially pending memory are helpful for identifying whether the cluster is short on resources.
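As a back-of-the-envelope sketch of that estimate (the per-job CPU/memory figures below are assumptions for illustration, not measurements): with roughly 360 jobs per hour, each running 3-6 minutes, Little's law says about 27 jobs are in flight at any moment, and the cluster has to hold all of them at once.

```scala
// Rough cluster-sizing sketch for the workload described above.
object SizingEstimate extends App {
  val jobsPerHour   = 360.0
  val avgJobMinutes = 4.5   // the question says 3-6 minutes per job
  val coresPerJob   = 2     // assumed per-job footprint, illustration only
  val memPerJobGb   = 8     // assumed

  // Little's law: average jobs in flight = arrival rate * duration
  val concurrentJobs = jobsPerHour / 60.0 * avgJobMinutes   // ~27 jobs at once

  val coresNeeded = math.ceil(concurrentJobs * coresPerJob).toInt
  val memNeededGb = math.ceil(concurrentJobs * memPerJobGb).toInt

  println(f"~$concurrentJobs%.0f jobs in flight -> roughly $coresNeeded cores and $memNeededGb GB needed")
}
```

Two small worker nodes are far below that kind of footprint, which is consistent with jobs queueing behind pending YARN resources.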

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory: 24576, vCores: 6
I have two spark-streaming jobs to submit, one after another.
First, I tried to submit with the default configuration (spark.dynamicAllocation.enabled=true).
In 30% of cases, I saw that the first job grabbed almost all available memory and the second was queued and waited for resources for ages. (This is a streaming job which takes only a small portion of resources every batch.)
My second try was to disable dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job.
In the Spark UI there are 6 executors and a driver for the first job, and 2 executors and a driver for the second job.
After retrying (deleting the previous jobs and submitting the same jobs) without dynamic allocation and with the same configurations, I got a totally different result:
5 containers with 59g Memory allocation for both jobs, and 71g Reserved Memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1. If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but it differs in memory.)
2. If dynamicAllocation=false, why doesn't YARN create containers according to my exact requirements, i.e. 6 containers (Spark executors) for both jobs? And why do two attempts with the same configuration lead to different results?
3. If dynamicAllocation=true, how can a Spark job with low memory consumption take control of all the YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
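For illustration, a minimal sketch of the two configurations (assuming the DStream API; spark.streaming.dynamicAllocation.enabled is the streaming-specific mechanism from SPARK-12133 and requires the batch spark.dynamicAllocation.enabled to be off - verify the property names against your Spark version):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Option A: static sizing for a streaming job (batch dynamic allocation off).
val staticConf = new SparkConf()
  .setAppName("streaming-static")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "6")
  .set("spark.executor.cores", "3")
  .set("spark.executor.memory", "12g")

// Option B (DStreams only; an assumption -- check your Spark version):
// the streaming-specific allocation from SPARK-12133 instead of the batch one.
val streamingConf = new SparkConf()
  .setAppName("streaming-dynamic")
  .set("spark.dynamicAllocation.enabled", "false")            // batch mode must stay off
  .set("spark.streaming.dynamicAllocation.enabled", "true")

val ssc = new StreamingContext(staticConf, Seconds(10))
```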
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so I'm not sure what's going on there. You have 8 workers with 24g each given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master plus 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory, which is why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
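A quick sketch of that slot arithmetic (ignoring executor memory overhead and the smaller app-master container, which are simplifications):

```scala
// Container/slot math for the cluster in the question.
val workers          = 8
val yarnMemPerNodeGb = 24   // yarn.nodemanager.resource.memory-mb = 24576
val executorMemGb    = 12

val slotsPerNode = yarnMemPerNodeGb / executorMemGb   // 2
val totalSlots   = workers * slotsPerNode             // 16

// Each application = 1 application master container + its executors.
def containersPerApp(executors: Int): Int = executors + 1

println(s"total executor slots: $totalSlots")                    // 16
println(s"containers with 6 executors: ${containersPerApp(6)}")  // 7 per app
println(s"containers with 5 executors: ${containersPerApp(5)}")  // 6 per app
```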
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
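A minimal sketch of that setup with Structured Streaming (the rate/console source and sink are placeholders for the real pipeline): each start() call is non-blocking, so both queries run concurrently inside one application and share its fixed pool of executors.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("two-streams-one-app")
  .getOrCreate()

// Placeholder source/sink -- swap in the real input and Cassandra output.
val queryA = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("jobA").start()

val queryB = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("jobB").start()

// Block until either query terminates; both run on the shared executors.
spark.streams.awaitAnyTermination()
```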
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. First, the resource allocation by YARN depends on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. Now, when you throw dynamic allocation into the mix, things get a little confusing/complex. In your case below:
7 Running Containers with 84g Memory allocation for the first job.
-- You got 7 containers because you requested 6 executors: one container for each executor, plus one extra container for the application master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
-- Since the second job was submitted some time later, YARN allocated the remaining resources: 2 containers, one for each executor, plus the extra one for your application master.
The number of containers will never match the number of executors you requested; it will always be one more, because you need one container to run your application master.
Hope that answers part of your question.

Google dataproc spark cluster with too many Preemptible nodes sometime hangs

When running a Spark cluster on Dataproc with only 2 non-preemptible worker nodes and another ~100 preemptible nodes, I sometimes get a cluster that is not usable at all due to too many connection errors, datanode errors, and lost executors that are still being tracked for heartbeats...
I always get errors like this:
18/08/08 15:40:11 WARN org.apache.hadoop.hdfs.DataStreamer: Error Recovery for BP-877400388-10.128.0.31-1533740979408:blk_1073742308_1487 in pipeline [DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK], DatanodeInfoWithStorage[10.128.0.7:9866,DS-9f1d8b17-0fee-41c7-9d31-8ad89f0df69f,DISK]]: datanode 0(DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK]) is bad.
And errors reporting "Slow ReadProcessor read fields for block BP-877400388-10.128.0.31-1533740979408:blk_1073742314_1494".
From what I can see, something appears to be not functioning correctly in those clusters, but nothing is reported to indicate what.
Also, the application master is created on a preemptible node; why is that?
According to the documentation, the number of preemptible workers needs to be less than 50% of the total number of nodes in your cluster to get the best results. Regarding the application master being placed on a preemptible node, you could report this behavior by filing an issue in the Dataproc issue tracker.

Spark Standalone cluster master web UI inaccessible after an application finishes

I have a Spark application that finishes without error, but once it's done, has saved all of its outputs, and the process terminates, the Spark standalone cluster master process becomes a CPU hog, using 16 CPUs full time for hours, and the web UI becomes unresponsive. I have no idea what it could be doing; is there some complicated clean-up step?
Some more details:
I've got a Spark standalone cluster (27 workers/nodes) that I've been successfully submitting jobs to for a while. I recently scaled up the size of my applications, the largest now takes 3.5 hours using 100 cores over 27 workers, and each worker has ~dozens of GB of shuffle read/write over the course of the job. Otherwise, the application is no different than the smaller jobs that have run successfully before.
This is a known issue with Spark's standalone cluster, and is caused by the massive event log created by large applications. You can read more at the issue tracking link below.
https://issues.apache.org/jira/browse/SPARK-12299
At the current time, the best work-around is to disable event logging for large jobs.
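A minimal sketch of that work-around (spark.eventLog.enabled is the standard switch; note that you lose the history-server record for that run):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Turn off event logging for this (large) application only.
val conf = new SparkConf()
  .setAppName("large-job-no-eventlog")
  .set("spark.eventLog.enabled", "false")

val spark = SparkSession.builder().config(conf).getOrCreate()
```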

Is there a way to understand which executors were used for a job in java/scala code?

I'm having trouble evenly distributing streaming receivers among all executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors. I create 8 custom streaming receivers, and Spark is supposed to launch these receivers one per executor. However, this doesn't always happen; sometimes all receivers are launched on the same executor (here's the JIRA bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job, and, if I got all the executors, create the streaming receivers.
To do that, I need a way to find out, in Java/Scala code, which executors were used for a job.
I believe it is possible to see which executors were running which jobs by accessing the Spark UI and the Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
In the following screen you can see which executors are active. If there are cores/nodes that are not being used, you can detect them just by looking at which cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks that are running on it.
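If you prefer the programmatic route described in the question, here is a hedged sketch of the dummy-job idea: run a throwaway job whose tasks report which executor they landed on, and compare that against the executors the driver knows about (the keys of getExecutorMemoryStatus are host:port strings and include the driver). Note that Spark gives no guarantee that later receivers will land on the same executors.

```scala
import org.apache.spark.{SparkContext, SparkEnv}

// Distinct executor IDs that actually ran tasks for a dummy job.
def executorsUsedByDummyJob(sc: SparkContext, partitions: Int = 100): Set[String] =
  sc.parallelize(1 to partitions, partitions)
    .map(_ => SparkEnv.get.executorId)   // ID of the executor running the task
    .distinct()
    .collect()
    .toSet

// All executors currently known to the driver ("host:port" keys, driver included).
def knownExecutors(sc: SparkContext): Set[String] =
  sc.getExecutorMemoryStatus.keySet.toSet
```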
