YARN workers running out of disk space - apache-spark

We are facing a No space on device error with Spark jobs running on our YARN cluster.
This has a few bad results. First, the Spark jobs take longer or fail. Second, since the disk fills up, the nodes are disabled by the YARN NodeManager and are removed from the pool and marked as unhealthy.
Is there a way to configure the maximum disk space that jobs are allowed to use on each NodeManager?
I'm hoping to be able to say something like "I have a disk of 1TB, you can use up to 900GB for jobs" and have YARN manage those resources is such a way that will never result in filling up the disk.
Alternatively, how can I make sure that YARN keeps removing old data from its local disk so it doesn't fill up? I don't care if that causes jobs to fail. That's inevitable when you overuse resources.

Related

How is abnormal Driver termination handled for a Spark App in Yarn cluster mode

We're using AWS EMR for our spark jobs. All our jobs are submitted in yarn cluster mode, so the driver will run in one of the cluster nodes. We use on-demand node for master, and spot-instances for the core nodes. Now, although we almost always choose instances with < 5% interruption rate, sometimes it so happens that a significant fraction of our cluster nodes get terminated prematurely (probably because of higher demands).
So, I was wondering, in the above situation, what happens if a node containing the driver process goes down? Is there any chance of recovery for the spark job in that case? Or is the job gone forever?
The Spark driver is a single point of failure because it holds all cluster state for the running App.
In practice non-ephemeral storage can be used for check-pointing batch Apps after expensive expensive transformations. That said, trying to re-start after such a situation can be done, but when I looked into it, it is quite difficult to say the least. I asked such a question under my name some time ago, you can find it. I am quite technical but felt: gosh what a lot of hard work.
So, the recovery means rolling your own stuff, or accepting a re-run. Since I last evaluated EMR I see that the driver can run on the Master and that can be failed-over, but that is not the same thing as far as I can see, nor what you wish.
EMR has node leveling for CORE nodes in Yarn. Your spark driver/ Application master only gets created in CORE nodes. And HDFS also resides in CORE nodes only.
So to handle your situation in a best way, you may consider to use both CORE and TASK group.
What you can do to tackle this -
MASTER: On-demand
CORE: On-demand. Minimum no of Instances can be 1.
TASK: Spot with autoscaling with minimal EBS volume. Minimum no of Instances can be 0 this case.
This will reduce your cost also ensure that node containing the driver process never goes down.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html

Spark 2.4 to Elasticsearch : prevent data loss during Dataproc nodes decommissioning?

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster.
We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled).
During the execution, if the Dataproc cluster downscaled, all tasks on the decommissioned node are lost and the processed data on this node are never pushed to elastic.
This problem does not exist when I save to GCS or HDFS for example.
How to make resilient this task even when nodes are decommissioned?
An extract of the stacktrace:
Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337
Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx
Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
Thanks.
Fred
I'll go through a bit of background on the "downscaling problem" and how to mitigate it. Note that this information applies to both manual downscaling as well as preemptible VMs getting preempted.
Background
Autoscaling removes nodes based on the amount of "available" YARN memory in the cluster. It does not take into account shuffle data on the cluster. Here's an illustration from a recent presentation we gave.
In a MapReduce-style job (Spark jobs are a set of MapReduce-style shuffles between stages), data from all mappers must get to all reducers. Mappers write their shuffle data to local disk, and then reducers fetch data from each mapper. There is a server on every node dedicated to serving shuffle data, and it runs outside of YARN. Therefore, a node can appear idle in YARN even though it needs to stay around to serve its shuffle data.
When a single node gets removed, pretty much all reducers will fail, since they all need to fetch data from every node. Reducers will specifically fail with FetchFailedException (as you saw), indicating they were unable to get shuffle data from a particular node. The driver will eventually re-run necessary mappers, and then re-run the reduce stage. Spark is a bit inefficient (https://issues.apache.org/jira/browse/SPARK-20178), but it works.
Note that you can lose nodes in one of three scenarios:
Intentionally removing nodes (autoscaling or manual downscaling)
Preemptible VMs
getting preempted. Preemptible VMs get preempted at least every 24 hours.
(Relatively rare) a standard GCE VM is
ungracefully terminated by GCE, and restarted. Usually, standard VMs
are transparently live migrated.
When you create an autoscaling cluster, Dataproc adds several properties to improve
job resiliency in the face of losing nodes:
yarn:yarn.resourcemanager.am.max-attempts=10
mapred:mapreduce.map.maxattempts=10
mapred:mapreduce.reduce.maxattempts=10
spark:spark.task.maxFailures=10
spark:spark.stage.maxConsecutiveAttempts=10
spark:spark.yarn.am.attemptFailuresValidityInterval=1h
spark:spark.yarn.executor.failuresValidityInterval=1h
Note that if you enable autoscaling on an existing cluster, it will not have these properties set. (But you can set them manually when creating a cluster).
Mitigations
1) Use graceful decommissioning
Dataproc integrates with YARN's Graceful Decommissioning, and can be set on Autoscaling Policies or manual downscale operations.
When gracefully decommissioning a node, YARN keeps it around until applications that ran containers on the node finish, but does not let it run new containers. That gives nodes an opportunity to serve their shuffle data before being removed.
You will need to ensure that your graceful decommission timeout is long enough to encompass your longest jobs. The autoscaling docs suggest 1h as a starting point.
Note that graceful decommissioning only really makes sense only long-running clusters that process lots of short jobs.
On ephemeral clusters, you would be better off "right-sizing" the cluster from the start, or disabling downscaling unless the cluster is completely idle (set scaleDownMinWorkerFraction=1.0).
2) Avoid preemptible VMs
Even when using graceful decommissioning, preemptible VMs will be periodically terminated through "preemptions". GCE guarantees preemptible VMs will get preempted within 24 hours, and preemptions on large clusters are very spread out.
If you are using graceful decommissioning, and the FetchFailedException error messages include -sw-, you are likely seeing fetch failures due to nodes being preempted.
You have two options to avoid using preemptible VMs:
1. In your autoscaling policy, you can set secondaryWorkerConfig to have 0 min and max instances, and instead put all workers in the primary group.
2. Alternatively, you can keep using "secondary" workers, but set --properties dataproc:secondary-workers.is-preemptible.override=false. That will make your secondary workers be standard VMs.
3) Long term: Enhanced Flexibility Mode
Dataproc's Enhanced Flexibility Mode is the long term answer to the shuffle problem.
The downscaling problem is caused by shuffle data getting stored on local disk. EFM will include new shuffle implementations that allow placing shuffle data on a fixed set of nodes (e.g. just primary workers), or on storage outside of the cluster.
That will make secondary workers stateless, which means they can be removed at any time. This makes autoscaling far more compelling.
At the moment, EFM is still in Alpha, and does not scale to real-world workloads, but look out for a production-ready Beta by the summer.

Apache Spark running on YARN with fix allocation

What's happening right now is YARN simply gets a number of executor from one spark job and give it to another spark job. As a result, this spark job encounters error and die.
Is there a way or an existing configuration where a certain spark job running on YARN have a fix resource allocation?
Fix resource allocation is an old concept and doesn't give benefit of proper resource utilization. Dynamic resource allocation is an advanced/expected feature of YARN. So, I recommend that you see what is happening actually. If a job is already running on then YARN doesn't take the resources and gives it to others. If resources are not available then the 2nd job will get queued and resources will not be pulled up abruptly from the 1st job. The reason is containers have a combination of memory and CPU. If memory is allocated to other job then basically it means that the JVM of the 1st job is lost for ever. YARN doesn't do what have mentioned.

Limit Spark application from grabbing all the resources in a YARN cluster

We (an engineering team) are running an EMR cluster with YARN and Spark. What is typically happening is that when one user submits a heavy memory intensive job, it grabs all the YARN available memory and then all the subsequent users submitted jobs have to wait for that memory to clear (I know that autoscaling will solve this problem to a certain extent and we are looking into that, but we would like to avoid a single user occupying all the memory even when the cluster is autoscaled to it's full limits).
Is there a way to configure YARN such that any application (Spark or otherwise) may not occupy more than, say 75% of available memory?
Thanks
According to the documentation, you can manage the amount of memory allocated to an executor using the parameter: spark.executor.memory

Make YARN clean up appcache before retry

The situation is the following:
A YARN application is started. It gets scheduled.
It writes a lot to its appcache directory.
The application fails.
YARN restarts it. It goes pending, because there is not enough disk space anywhere to schedule it. The disks are filled up by the appcache from the failed run.
If I manually intervene and kill the application, the disk space is cleaned up. Now I can manually restart the application and it's fine.
I wish I could tell the automated retry to clean up the disk. Alternatively I suppose it could count that used disk as part of the new allocation, since it belongs to the application anyway.
I'll happily take any solution you can offer. I don't know much about YARN. It's an Apache Spark application started with spark-submit in yarn-client mode. The files that fill up the disk are the shuffle spill files.
So here's what happens:
When you submit yarn application it creates a private local resource folder (appcache directory).
Inside this directory spark block manager creates directory for storing block data. As mentioned:
local directories and won't be deleted on JVM exit when using the external shuffle service.
This directory can be cleaned via:
Shutdown hook. This what's happen when you kill the application.
Yarn DeletionService. It should be done automatically on application finish. Make sure yarn.nodemanager.delete.debug-delay-sec=0. Otherwise there is some unresolved yarn bug

Resources