Spark is dropping all executors at the beginning of a job - apache-spark

I'm trying to configure a Spark job to run with fixed resources on a Dataproc cluster. However, after the job had been running for 6 minutes I noticed that all but 7 executors had been dropped. 45 minutes later the job has not progressed at all, and I cannot find any errors or logs to explain why.
When I check the timeline in the job details, it shows all but 7 executors being removed at the 6 minute mark, with the message Container [really long number] exited from explicit termination request.
The command I am running is:
gcloud dataproc jobs submit spark --region us-central1 --cluster [mycluster] \
--class=path.to.class.app --jars="gs://path-to-jar-file" --project=my-project \
--properties=spark.executor.instances=72,spark.driver.memory=28g,spark.executor.memory=28g
My cluster is 1 + 24 n2-highmem-16 instances, if that helps.
EDIT: I terminated the job, reset, and tried again. The exact same thing happened at the same point in the job (Job 9 Stage 9/12)

That message is typically associated with Spark dynamic allocation; if you want to always keep a fixed number of executors, you can try adding the property:
...
--properties=spark.dynamicAllocation.enabled=false,spark.executor.instances=72...
However, that probably won't address the root problem in your case beyond keeping the idle executors around: if dynamic allocation was relinquishing those executors, it is because their tasks had already completed while the remaining executors, for whatever reason, still are not done after a long time. This often indicates some kind of data skew, where the remaining executors have much more work to do than the ones that already finished, unless the remaining executors were simply all equally loaded as part of a smaller phase of the pipeline, perhaps a "reduce" phase.
If you're seeing lagging tasks out of a large number of equivalent tasks, you might consider adding a repartition() step to your job to chop the data up more finely, in the hope of spreading out the skewed partitions, or otherwise changing the way you group or partition your data.
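For illustration, here is a minimal spark-shell style sketch of what an explicit repartition() step could look like; the input/output paths, the grouping column someKey, and the partition count 720 are hypothetical placeholders, not details from the asker's job:

import org.apache.spark.sql.SparkSession

// Minimal sketch, not the asker's actual job: paths, column name and
// partition count below are placeholders.
val spark = SparkSession.builder()
  .appName("repartition-sketch")
  .getOrCreate()

val df = spark.read.parquet("gs://my-bucket/input")

// repartition(n) triggers a full shuffle that redistributes rows roughly
// evenly across n partitions, so a few oversized partitions get chopped
// into smaller pieces that all executors can work on.
val evenlySpread = df.repartition(720)

evenlySpread
  .groupBy("someKey")   // placeholder grouping column
  .count()
  .write.mode("overwrite").parquet("gs://my-bucket/output")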

Fixed. The job was running out of resources. Allocated some more executors to the job and it completed.

Related

Spark UI Executor

In the Spark UI, 18 executors were added and 6 executors were removed. When I checked the Executors tab, I saw many dead and excluded executors. Currently, dynamic allocation is used on EMR.
I've looked up some postings about dead executors, but those are mostly related to job failures. In my case the job itself does not fail, yet I can still see dead and excluded executors.
What are these "dead" and "excluded" executors?
How does it affect the performance of current spark cluster configuration?
(If it affects performance) then what would be good way to improve the performance?
With dynamic allocation enabled, Spark tries to adjust the number of executors to the number of tasks in the active stages. Let's take a look at an example:
The job starts, and the first stage reads from a huge source, which takes some time. Let's say this source is partitioned and Spark generated 100 tasks to get the data. If your executors have 5 cores each, Spark is going to spawn 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel).
Let's say that in the next step you are doing a repartition or a sort-merge join; with shuffle partitions set to 200, Spark is going to generate 200 tasks. It is smart enough to figure out that it currently has only 100 cores available, so if new resources are available it will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel).
Now the join is done, and in the next stage you have only 50 partitions; to compute them in parallel you don't need 40 executors, 10 are enough (10 executors x 5 cores = 50 tasks in parallel). If this phase takes long enough, Spark can free some resources and you are going to see executors being removed.
Now the next stage involves a repartition, with the number of partitions equal to 200. With 10 executors you can process only 50 partitions in parallel, so Spark will try to get new executors...
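As a rough sketch of the walkthrough above (the paths, the join key "id" and the concrete numbers are hypothetical, not from the asker's setup):

import org.apache.spark.sql.SparkSession

// Sketch only: paths and the join key "id" are placeholders.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-walkthrough")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.executor.cores", "5")            // 5 cores per executor
  .config("spark.sql.shuffle.partitions", "200")  // shuffle stages produce 200 tasks
  .getOrCreate()

// Stage 1: a source split into 100 partitions yields 100 tasks; with
// 5 cores per executor, 20 executors can run them all in parallel.
val left  = spark.read.parquet("hdfs:///data/left")
val right = spark.read.parquet("hdfs:///data/right")

// Stage 2: the join shuffles into 200 tasks (spark.sql.shuffle.partitions),
// so dynamic allocation may request up to 40 executors (40 x 5 = 200).
val joined = left.join(right, "id")

// Stage 3: coalescing to 50 partitions means only 50 tasks, so about
// 10 executors suffice and the idle ones can be released.
joined.coalesce(50).write.mode("overwrite").parquet("hdfs:///data/out")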
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that
it requires you to set subproperties. Some example subproperties are
spark.dynamicAllocation.initialExecutors, minExecutors, and
maxExecutors. Subproperties are required for most cases to use the
right number of executors in a cluster for an application, especially
when you need multiple applications to run simultaneously. Setting
subproperties requires a lot of trial and error to get the numbers
right. If they’re not right, the capacity might be reserved but never
actually used. This leads to wastage of resources or memory errors for
other applications.
There you will find some hints. From my experience, it is worth setting maxExecutors if you are going to run a few jobs in parallel on the same cluster, since most of the time it is not worth starving the other jobs just to get 100% efficiency out of one of them.
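As a sketch of what capping looks like when the session is built (all of these numbers are hypothetical and should be tuned for your cluster; the same keys can also be passed as --conf flags to spark-submit):

import org.apache.spark.sql.SparkSession

// Hypothetical bounds; tune them for your cluster and workload.
val spark = SparkSession.builder()
  .appName("bounded-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "4")
  .config("spark.dynamicAllocation.minExecutors", "2")
  // Cap how many executors this one job can grab, so other jobs running
  // on the same cluster are not starved.
  .config("spark.dynamicAllocation.maxExecutors", "40")
  .getOrCreate()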

Spark not using all CPUs available

I am running a query using Hive on Spark which is exhibiting some strange behavior. I've run it multiple times and observed the same behavior. The query:
reads from a large Hive external table
Spark creates ~990,000 tasks
runs in a YARN queue with > 2900 CPUs available
uses 700 executors with 4 CPUs per executor
All is well at the start of the job. After ~1.5 hours of 2800 CPUs cranking, the job is ~80% complete (800k/990k tasks). From there, things start to nosedive: Spark stops using all of the CPUs available to it to work on tasks. With ~190k tasks to go, Spark will gradually drop from using 2800 CPUs to double digits (usually bottoming out around 20 total CPUs). This makes the last 190k tasks take significantly longer to finish than the previous 800k.
I could see as the job got very close to completing that Spark would be unable to parallelize a small amount of remaining tasks across a large number of CPUs. But with 190k tasks left to be started, it seems way too early for that.
Things I've checked:
No other job is pre-empting its resources in YARN. (In addition, if this were the case, I would expect the job to randomly lose/regain resources, instead of predictably losing steam at the 80% mark).
This occurs whether dynamic allocation is enabled or disabled. If disabled, Spark has all 2800 CPUs available for the entire run time of the job - it just doesn't use them. If enabled, Spark does spin down executors as it decides it no longer needs them.
If data skew were the issue, I could see some tasks taking longer than others to finish. But it doesn't explain why Spark wouldn't be using idle CPUs to start on the backlog of tasks still to go.
Does anyone have any advice?
For posterity, this answer from Travis Hegner contained the solution.
Setting spark.locality.wait=0s fixed the issue. I'm still not sure why a 3 second wait causes such a pile-up in Spark's ability to schedule tasks, but setting it to 0 makes the job run extremely well.
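For reference, a minimal sketch of applying that setting when the session is built (it can equally be passed as --conf spark.locality.wait=0s to spark-submit); everything else here is hypothetical:

import org.apache.spark.sql.SparkSession

// Sketch: only spark.locality.wait=0s comes from the answer above.
val spark = SparkSession.builder()
  .appName("no-locality-wait")
  // Don't hold a free CPU for up to 3s (the default) hoping a data-local
  // task will show up; schedule whatever pending task is available now.
  .config("spark.locality.wait", "0s")
  .getOrCreate()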

Possible reasons that spark waits and does not schedule tasks to run?

This might be a very generic question, but I hope someone can point out a hint. I have found that my Spark job sometimes seems to hit a "pause" many times:
The nature of the job is: read ORC files (from a Hive table), filter by certain columns, no join, then write out to another Hive table.
There were total 64K tasks for my job / stage (FileScan orc, followed by Filter, Project).
The application has 500 executors, each has 4 cores. Initially, about 2000 tasks were running concurrently, things look good.
After a while, I noticed the number of running tasks dropped all the way down to around 100. Many cores/executors were just waiting with nothing to do. (I checked the logs from these waiting executors; there were no errors. All assigned tasks were done on them, they were just waiting.)
After about 3-5 minutes, these waiting executors suddenly got tasks assigned and were working happily again.
Any particular reasons this can be? The application is running from spark-shell (--master yarn --deploy-mode client, with number of executors/sizes etc. specified)
Thanks!

Why is there a delay in the launch of spark executors?

While trying to optimise a Spark job, I am having trouble understanding a delay of 3-4s in the launch of the second executor, and of 6-7s for the third and fourth executors.
This is what I'm working with:
Spark 2.2
Two worker nodes with 8 CPU cores each (the master node is separate).
Executors are configured to use 3 cores each.
Following is a screenshot of the Jobs tab in the Spark UI.
The job is divided into three stages. As seen, the second, third and fourth executors are added only during the second stage.
Following is a snapshot of Stage 0.
And following is a snapshot of Stage 1.
As seen in the image above, executor 2 (on the same worker as the first) takes around 3s to launch. Executors 3 and 4 (on the second worker) take even longer, approximately 6s.
I tried playing around with the spark.locality.wait variable, with values of 0s, 1s and 1ms, but there does not seem to be any change in the launch times of the executors.
Is there some other reason for this delay? Where else can I look to understand this better?
You might be interested to check Spark's executor request policy, and review the settings spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout for your application.
A Spark application with dynamic allocation enabled requests
additional executors when it has pending tasks waiting to be
scheduled. ...
Spark requests executors in rounds. The actual request is triggered
when there have been pending tasks for
spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then
triggered again every
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds
thereafter if the queue of pending tasks persists. Additionally, the
number of executors requested in each round increases exponentially
from the previous round. For instance, an application will add 1
executor in the first round, and then 2, 4, 8 and so on executors in
the subsequent rounds.
Another potential source for a delay could be spark.locality.wait. Since in Stage 1 you have quite a bit of tasks with sub-optimal locality levels (Rack local: 59), and the default for spark.locality.wait is 3 seconds, it could actually be the primary reason for the delays that you're seeing.
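A sketch of where those knobs would go, with hypothetical values (they can also be passed as --conf flags to spark-submit):

import org.apache.spark.sql.SparkSession

// Sketch only; the values are hypothetical starting points, not recommendations.
val spark = SparkSession.builder()
  .appName("executor-request-tuning")
  .config("spark.dynamicAllocation.enabled", "true")
  // How long tasks must sit pending before the first executor request...
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  // ...and how often the follow-up (exponentially growing) requests are made.
  .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
  // Lower the 3s default if waiting for data-local slots is what stalls you.
  .config("spark.locality.wait", "1s")
  .getOrCreate()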
It takes time for YARN to create the executors; nothing can be done about that overhead. If you want to optimize, you can set up a long-running Spark server and submit requests to it, which saves the warm-up time.

Spark (yarn-client mode) not releasing memory after job/stage finishes

We are consistently observing this behavior with interactive Spark jobs in spark-shell, or when running sparklyr in RStudio, etc.
Say I launched spark-shell in yarn-client mode and performed an action, which triggered several stages in a job and consumed x cores and y MB of memory. Once this job finishes and the corresponding Spark session is still active, the allocated cores and memory are not released even though the job is finished.
Is this normal behavior?
Until the corresponding Spark session is finished, the YARN REST endpoint ip:8088/ws/v1/cluster/apps/application_1536663543320_0040/ kept showing the same allocation: y (memory), x (cores), and z.
I would have assumed that YARN would dynamically allocate these unused resources to other Spark jobs that are awaiting resources.
Please clarify if I am missing something here.
You need to play with the dynamic allocation configs (https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation):
Set spark.dynamicAllocation.executorIdleTimeout to a smaller value, say 10s. The default value of this parameter is 60s. This config tells Spark that it should release an executor only when it has been idle for this much time.
Check spark.dynamicAllocation.initialExecutors / spark.dynamicAllocation.minExecutors. Set these to a small number, say 1 or 2. The Spark application will never scale below this number unless the SparkSession is closed.
Once you set these two configs, your application will release the extra executors once they have been idle for 10 seconds.
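Putting those two suggestions together as a minimal sketch (the numbers are examples only; note that on YARN, dynamic allocation also requires the external shuffle service or shuffle tracking to be enabled):

import org.apache.spark.sql.SparkSession

// Sketch of the two suggestions above; numbers are examples, not a recommendation.
val spark = SparkSession.builder()
  .appName("release-idle-executors")
  .config("spark.dynamicAllocation.enabled", "true")
  // Give an executor back after 10s of inactivity instead of the 60s default.
  .config("spark.dynamicAllocation.executorIdleTimeout", "10s")
  // Keep only a small floor of executors while the session sits idle.
  .config("spark.dynamicAllocation.initialExecutors", "1")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .getOrCreate()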
Yes, the resources stay allocated as long as the SparkSession is active. To handle this better, you can use dynamic allocation.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html
