I am getting two types of errors when running a job on Google Dataproc, and they cause executors to be lost one by one until the last executor is lost and the job fails. My master node is an n1-highmem-2 (2 vCPUs, 13 GB memory) and my two worker nodes are n1-highmem-8 (8 vCPUs, 52 GB memory). The two errors I get are:
"Container exited from explicit termination request."
"Lost executor x: Executor heartbeat timed out"
From what I could find online, I believe I need to increase spark.executor.memoryOverhead. I don't know if this is the right answer or not, but I can't see how to change it in the Google Dataproc console, and I don't know what to change it to. Any help would be great!
Thanks,
jim
You can set Spark properties at the cluster level with
gcloud dataproc clusters create ... --properties spark:<name>=<value>,...
and/or at the job level with
gcloud dataproc jobs submit spark ... --properties <name>=<value>,...
The former requires the spark: prefix; the latter doesn't. If both are set, the latter takes precedence. See more details in the Dataproc cluster properties documentation.
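For example, a minimal sketch of raising spark.executor.memoryOverhead (the cluster name, class, jar path, and the 2048 MB value are placeholders to adjust for your workload; flags such as --region are omitted; on Spark 2.2 and earlier the property is spark.yarn.executor.memoryOverhead):

# Cluster-level: note the spark: prefix
gcloud dataproc clusters create my-cluster \
    --properties spark:spark.executor.memoryOverhead=2048

# Job-level: no prefix; overrides the cluster-level value if both are set
gcloud dataproc jobs submit spark --cluster my-cluster \
    --class com.example.MyJob \
    --jars gs://my-bucket/my-job.jar \
    --properties spark.executor.memoryOverhead=2048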
It turns out the memory per vCPU was the limitation causing the executors to fail one by one. Initially, I was trying to use the custom machine type configuration in the console to add additional memory per vCPU. It turns out the console UI has a bug (per the Google Dataproc team) that prevents you from increasing the memory per vCPU: if you use the slider to increase the memory beyond the default maximum of 6.5 GB per vCPU, the cluster setup fails. However, the command-line equivalent of the console does allow the cluster to be created, and the increased memory per vCPU was enough to complete the job without the executors failing one by one.
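For reference, a hedged sketch of what that command-line setup might look like (cluster name and sizes are placeholders; custom machine types use the custom-VCPUS-MEMORY_MB naming, with an -ext suffix for memory above the standard per-vCPU ceiling):

gcloud dataproc clusters create my-cluster \
    --master-machine-type n1-highmem-2 \
    --worker-machine-type custom-8-65536-ext \
    --num-workers 2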
My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster.
We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled).
During execution, if the Dataproc cluster downscales, all tasks on the decommissioned node are lost and the data processed on that node is never pushed to Elasticsearch.
This problem does not exist when I save to GCS or HDFS for example.
How can I make this task resilient even when nodes are decommissioned?
An extract of the stacktrace:
Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337
Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx
Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
Thanks.
Fred
I'll go through a bit of background on the "downscaling problem" and how to mitigate it. Note that this information applies to both manual downscaling as well as preemptible VMs getting preempted.
Background
Autoscaling removes nodes based on the amount of "available" YARN memory in the cluster. It does not take into account shuffle data on the cluster. Here's an illustration from a recent presentation we gave.
In a MapReduce-style job (Spark jobs are a set of MapReduce-style shuffles between stages), data from all mappers must get to all reducers. Mappers write their shuffle data to local disk, and then reducers fetch data from each mapper. There is a server on every node dedicated to serving shuffle data, and it runs outside of YARN. Therefore, a node can appear idle in YARN even though it needs to stay around to serve its shuffle data.
When a single node gets removed, pretty much all reducers will fail, since they all need to fetch data from every node. Reducers will specifically fail with FetchFailedException (as you saw), indicating they were unable to get shuffle data from a particular node. The driver will eventually re-run necessary mappers, and then re-run the reduce stage. Spark is a bit inefficient (https://issues.apache.org/jira/browse/SPARK-20178), but it works.
Note that you can lose nodes in one of three scenarios:
Intentionally removing nodes (autoscaling or manual downscaling)
Preemptible VMs getting preempted. Preemptible VMs get preempted at least every 24 hours.
(Relatively rare) A standard GCE VM being ungracefully terminated by GCE and restarted. Usually, standard VMs are transparently live migrated.
When you create an autoscaling cluster, Dataproc adds several properties to improve
job resiliency in the face of losing nodes:
yarn:yarn.resourcemanager.am.max-attempts=10
mapred:mapreduce.map.maxattempts=10
mapred:mapreduce.reduce.maxattempts=10
spark:spark.task.maxFailures=10
spark:spark.stage.maxConsecutiveAttempts=10
spark:spark.yarn.am.attemptFailuresValidityInterval=1h
spark:spark.yarn.executor.failuresValidityInterval=1h
Note that if you enable autoscaling on an existing cluster, it will not have these properties set. (But you can set them manually when creating a cluster).
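If you do want to set them yourself when creating the cluster, a sketch using the same values as above (the cluster name is a placeholder):

gcloud dataproc clusters create my-cluster \
    --properties "yarn:yarn.resourcemanager.am.max-attempts=10,mapred:mapreduce.map.maxattempts=10,mapred:mapreduce.reduce.maxattempts=10,spark:spark.task.maxFailures=10,spark:spark.stage.maxConsecutiveAttempts=10,spark:spark.yarn.am.attemptFailuresValidityInterval=1h,spark:spark.yarn.executor.failuresValidityInterval=1h"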
Mitigations
1) Use graceful decommissioning
Dataproc integrates with YARN's Graceful Decommissioning, which can be enabled on autoscaling policies or on manual downscale operations.
When gracefully decommissioning a node, YARN keeps it around until applications that ran containers on the node finish, but does not let it run new containers. That gives nodes an opportunity to serve their shuffle data before being removed.
You will need to ensure that your graceful decommission timeout is long enough to encompass your longest jobs. The autoscaling docs suggest 1h as a starting point.
Note that graceful decommissioning really only makes sense on long-running clusters that process lots of short jobs.
On ephemeral clusters, you would be better off "right-sizing" the cluster from the start, or disabling downscaling unless the cluster is completely idle (set scaleDownMinWorkerFraction=1.0).
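As an illustrative example (cluster name, worker count, and timeout are placeholders), graceful decommissioning on a manual downscale looks roughly like:

gcloud dataproc clusters update my-cluster \
    --num-workers 2 \
    --graceful-decommission-timeout 1h

For autoscaling, the equivalent is the gracefulDecommissionTimeout field in the autoscaling policy.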
2) Avoid preemptible VMs
Even when using graceful decommissioning, preemptible VMs will be periodically terminated through "preemptions". GCE guarantees preemptible VMs will get preempted within 24 hours, and preemptions on large clusters are very spread out.
If you are using graceful decommissioning, and the FetchFailedException error messages include -sw-, you are likely seeing fetch failures due to nodes being preempted.
You have two options to avoid using preemptible VMs:
1. In your autoscaling policy, you can set secondaryWorkerConfig to have 0 min and max instances, and instead put all workers in the primary group.
2. Alternatively, you can keep using "secondary" workers, but set --properties dataproc:secondary-workers.is-preemptible.override=false. That will make your secondary workers be standard VMs.
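For example, a hedged sketch of option 2 at cluster-creation time (cluster name and worker counts are placeholders):

gcloud dataproc clusters create my-cluster \
    --num-workers 2 \
    --num-preemptible-workers 10 \
    --properties dataproc:secondary-workers.is-preemptible.override=false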
3) Long term: Enhanced Flexibility Mode
Dataproc's Enhanced Flexibility Mode is the long term answer to the shuffle problem.
The downscaling problem is caused by shuffle data getting stored on local disk. EFM will include new shuffle implementations that allow placing shuffle data on a fixed set of nodes (e.g. just primary workers), or on storage outside of the cluster.
That will make secondary workers stateless, which means they can be removed at any time. This makes autoscaling far more compelling.
At the moment, EFM is still in Alpha, and does not scale to real-world workloads, but look out for a production-ready Beta by the summer.
I have a Dataproc cluster:
master: 6 cores | 32 GB
worker{0-7}: 6 cores | 32 GB
Maximum allocation: memory: 24576, vCores: 6
I have two Spark streaming jobs to submit, one after another.
At first, I tried to submit with the default configuration (spark.dynamicAllocation.enabled=true).
In 30% of cases, I saw that the first job grabbed almost all available memory and the second was queued and waited for resources for ages. (This is a streaming job which takes a small portion of resources every batch.)
My second try was to disable dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
In the Spark UI there are 6 executors and a driver for the first job, and 2 executors and a driver for the second job.
After retrying (deleting the previous jobs and submitting the same jobs) without dynamic allocation and with the same configurations, I got a totally different result:
5 containers and 59g memory allocation for both jobs, and 71g reserved memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1. If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but its memory differs.)
2. If dynamicAllocation=false, why doesn't YARN create containers according to my exact requirements: 6 containers (Spark executors) for both jobs? Why do two different attempts with the same configuration lead to different results?
3. If dynamicAllocation=true, how can a Spark job with low memory consumption take control of all YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
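If you want to try it on DStreams, the streaming variant is switched on with its own properties (these come from that JIRA and are not well documented, so treat the following as a hedged sketch; as far as I recall, the regular spark.dynamicAllocation.enabled must be false for it to take effect):

spark-submit \
    --conf spark.dynamicAllocation.enabled=false \
    --conf spark.streaming.dynamicAllocation.enabled=true \
    --conf spark.streaming.dynamicAllocation.minExecutors=1 \
    --conf spark.streaming.dynamicAllocation.maxExecutors=6 \
    ...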
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so I'm not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors is 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory, which is why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
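Concretely, a hedged sketch of such a submission (the cluster name, class, and jar path are placeholders; the rest mirrors your configuration):

gcloud dataproc jobs submit spark --cluster my-cluster \
    --class com.example.StreamingJob \
    --jars gs://my-bucket/streaming-job.jar \
    --properties spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.memory=12g,spark.executor.cores=3,spark.driver.memory=8g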
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. Firstly, the resource allocation by YARN depends on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. Now when you throw dynamic allocation into the mix, things get a little confusing/complex. Now in your case below:
7 Running Containers with 84g Memory allocation for the first job.
--You got 7 containers because you requested 6 executors: one container for each executor, and the extra container is for the application master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
--Since the second job was submitted some time later, YARN allocated the remaining resources: 2 executors with one container each, plus the extra container for your application master (3 containers in total).
The number of containers will never match the number of executors you requested; it will always be one more, because you need one container to run your application master.
Hope that answers part of your question.
When running a Spark cluster on Dataproc with only 2 non-preemptible worker nodes and ~100 other preemptible nodes, I sometimes get a cluster that is not usable at all due to too many connection errors, datanode errors, and lost executors that are still being tracked for heartbeats...
I always get errors like this:
18/08/08 15:40:11 WARN org.apache.hadoop.hdfs.DataStreamer: Error Recovery for BP-877400388-10.128.0.31-1533740979408:blk_1073742308_1487 in pipeline [DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK], DatanodeInfoWithStorage[10.128.0.7:9866,DS-9f1d8b17-0fee-41c7-9d31-8ad89f0df69f,DISK]]: datanode 0(DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK]) is bad.
And errors reporting: Slow ReadProcessor read fields for block BP-877400388-10.128.0.31-1533740979408:blk_1073742314_1494
From what I can see, something appears to not be functioning correctly in those clusters, but nothing is reported to indicate what.
Also, the application master is created on a preemptible node; why is that?
According to the documentation, the number of preemptible workers needs to be less than 50% of the total number of nodes in your cluster for the best results. Regarding the application master running on a preemptible node, you could report this behavior by filing an issue in the Dataproc issue tracker.
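For example, a hedged sketch of a ratio that stays under that limit (cluster name and worker counts are placeholders):

gcloud dataproc clusters create my-cluster \
    --num-workers 60 \
    --num-preemptible-workers 40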
I was running an application on AWS EMR Spark. Here is the spark-submit job:
Arguments : spark-submit --deploy-mode cluster --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
AWS uses YARN for resource management. I was looking at the metrics (screenshot below) and have a question about the YARN 'container' metrics.
Here, the number of containers allocated is shown as 2. However, I was using 4 nodes (3 slave + 1 master), all with 8-core CPUs. So how are only 2 containers allocated?
There are a couple of things you need to do. First of all, you need to set the following configuration in capacity-scheduler.xml:
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
otherwise YARN will not use all the cores you specify. Secondly, you need to actually specify the number of executors you need, as well as the number of cores and the amount of memory you want allocated on the executors (and possibly on the driver as well, if you either have many shuffle partitions or if you collect data to the driver).
YARN is designed to manage clusters running many different jobs at the same time, so by default it will not assign all resources to a single job unless you force it to by setting the above-mentioned configuration. Furthermore, the default settings for Spark are also not sufficient for most jobs, and you need to set them explicitly. Please have a read through this blog post to get a better understanding of how to tune Spark settings for optimal performance.
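For example, a hedged sketch of sizing the job explicitly (the numbers are placeholders to tune to your instance types; the class and jar are taken from your command):

spark-submit --deploy-mode cluster \
    --class com.amazon.JavaSparkPi \
    --num-executors 6 \
    --executor-cores 4 \
    --executor-memory 8g \
    --driver-memory 4g \
    s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08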
We (an engineering team) are running an EMR cluster with YARN and Spark. What typically happens is that when one user submits a heavy, memory-intensive job, it grabs all the available YARN memory, and then all the jobs submitted by subsequent users have to wait for that memory to clear. (I know that autoscaling will solve this problem to a certain extent, and we are looking into that, but we would like to avoid a single user occupying all the memory even when the cluster is autoscaled to its full limits.)
Is there a way to configure YARN such that any application (Spark or otherwise) may not occupy more than, say, 75% of the available memory?
Thanks
According to the documentation, you can manage the amount of memory allocated to an executor using the parameter: spark.executor.memory
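For example (hedged; the values are illustrative and the rest of the command is elided), you can cap each executor at submission time:

spark-submit --executor-memory 8g --num-executors 4 ...

Note that an application's total footprint is roughly the executor memory (plus overhead) times the number of executors, plus the driver, so both numbers matter when bounding how much of the cluster one application can take.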