Make YARN clean up appcache before retry - apache-spark

The situation is the following:
A YARN application is started. It gets scheduled.
It writes a lot to its appcache directory.
The application fails.
YARN restarts it. It goes pending, because there is not enough disk space anywhere to schedule it. The disks are filled up by the appcache from the failed run.
If I manually intervene and kill the application, the disk space is cleaned up. Now I can manually restart the application and it's fine.
I wish I could tell the automated retry to clean up the disk. Alternatively I suppose it could count that used disk as part of the new allocation, since it belongs to the application anyway.
I'll happily take any solution you can offer. I don't know much about YARN. It's an Apache Spark application started with spark-submit in yarn-client mode. The files that fill up the disk are the shuffle spill files.

Here is what happens:
When you submit a YARN application, it creates a private local resource folder (the appcache directory).
Inside this directory, the Spark block manager creates a directory for storing block data. As mentioned, these live in local directories and won't be deleted on JVM exit when using the external shuffle service.
This directory can be cleaned via:
A shutdown hook. This is what happens when you kill the application.
The YARN DeletionService. This should happen automatically when the application finishes. Make sure yarn.nodemanager.delete.debug-delay-sec=0. Otherwise there is some unresolved YARN bug.
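To verify the setting mentioned above, one option is to read it out of the NodeManager's yarn-site.xml. The following is only a minimal sketch: the helper names and the sample XML are hypothetical, and it assumes the property falls back to 0 when it is not set in the file.

```python
import xml.etree.ElementTree as ET

# Property name taken from the answer above; 0 means "delete without delay".
DELAY_PROPERTY = "yarn.nodemanager.delete.debug-delay-sec"


def read_yarn_property(yarn_site_xml: str, name: str, default: str = "0") -> str:
    """Return a property value from yarn-site.xml text, or the default if absent."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value", default="").strip()
            return value or default
    return default


def deletion_is_immediate(yarn_site_xml: str) -> bool:
    """True when application directories are deleted without a debug delay."""
    return int(read_yarn_property(yarn_site_xml, DELAY_PROPERTY)) == 0


# Hypothetical yarn-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>600</value>
  </property>
</configuration>"""

print(deletion_is_immediate(SAMPLE))
```

With a non-zero delay like the one in the sample, the appcache of a failed attempt would stay on disk for that many seconds after the application finishes.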

Related

How to know if a machine in a Spark cluster 'participate's a job

I wanted to know when it is safe to remove a machine from a cluster.
My assumption is that it could be safe to remove a machine if the machine does not have any containers, and it does not store any useful data.
By the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get the information of each node like
<node>
  <rack>/default-rack</rack>
  <state>RUNNING</state>
  <id>host1.domain.com:54158</id>
  <nodeHostName>host1.domain.com</nodeHostName>
  <nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
  <lastHealthUpdate>1476995346399</lastHealthUpdate>
  <version>3.0.0-SNAPSHOT</version>
  <healthReport></healthReport>
  <numContainers>0</numContainers>
  <usedMemoryMB>0</usedMemoryMB>
  <availMemoryMB>8192</availMemoryMB>
  <usedVirtualCores>0</usedVirtualCores>
  <availableVirtualCores>8</availableVirtualCores>
  <resourceUtilization>
    <nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
    <nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
    <nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
    <aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
    <aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
    <containersCPUUsage>0.0</containersCPUUsage>
  </resourceUtilization>
</node>
If numContainers is 0, I assume the node is not running any containers. However, can it still store data on disk that downstream tasks will read?
I could not find out whether Spark exposes this. I assume that if a machine still stores data useful to the running job, it may maintain a heartbeat with the Spark driver or some central controller. Can we check this by scanning TCP or UDP connections?
Is there any other way to check whether a machine in a Spark cluster participates in a job?
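The container check described in the question could be sketched roughly as follows. This assumes the XML body of the nodes endpoint has already been fetched; the function name, the wrapping <nodes> element in the sample, and the second host are made up for illustration. It only covers the "no containers" condition and says nothing about shuffle data left on local disk.

```python
import xml.etree.ElementTree as ET
from typing import List


def idle_nodes(nodes_xml: str) -> List[str]:
    """Return host names of RUNNING nodes that report zero containers."""
    root = ET.fromstring(nodes_xml)
    idle = []
    # iter() also matches the root element, so a single <node> document works too.
    for node in root.iter("node"):
        state = node.findtext("state", default="")
        containers = int(node.findtext("numContainers", default="0"))
        if state == "RUNNING" and containers == 0:
            idle.append(node.findtext("nodeHostName", default=""))
    return idle


# Trimmed, partly hypothetical response body (host2 is invented).
SAMPLE_NODES = """<nodes>
  <node>
    <state>RUNNING</state>
    <nodeHostName>host1.domain.com</nodeHostName>
    <numContainers>0</numContainers>
  </node>
  <node>
    <state>RUNNING</state>
    <nodeHostName>host2.domain.com</nodeHostName>
    <numContainers>3</numContainers>
  </node>
</nodes>"""

print(idle_nodes(SAMPLE_NODES))
```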
I am not sure whether you just want to know if a node is running any task (if that is what you mean by 'participate') or whether it is safe to remove a node from the Spark cluster.
I will try to explain the latter point.
Spark has the ability to recover from the failure, which also applies to any node being removed from the cluster.
The node removed can be an executor or an application master.
If an application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured in:
yarn.resourcemanager.am.max-attempts
By default, this value is 2
If a node on which a task is running is removed, the resource manager (which is handled by YARN) will stop receiving heartbeats from that node. The application master will know it has to reschedule the failed task because it no longer receives progress status from that node. It will then ask the resource manager for resources and reschedule the task.
As far as data on these nodes is concerned, you need to understand how the tasks and their output are handled. Every node has its own local storage to store the output of the tasks running on it. After the tasks run successfully, the OutputCommitter moves the output from local storage to the job's shared storage (HDFS), from where the data is picked up for the next step of the job.
When a task fails (maybe because the node running it failed or was removed), the task is rerun on another available node.
In fact, the application master will also rerun the tasks that had completed successfully on this node, as their output stored on the node's local storage will no longer be available.

Spark temp files not getting deleted automatically

I have spark yarn client submitting jobs and when it does that, it creates a directory under my "spark.local.dir" which has files like:
__spark_conf__8681611713144350374.zip
__spark_libs__4985837356751625488.zip
Is there a way these can be cleaned up automatically? Whenever I submit a Spark job I see new entries for these in the same folder. This is flooding my directory. What should I set to have it cleared automatically?
I have looked at a couple of links online even on SO but couldn't find a solution to this problem. All I found was a way to specify the dir path by
"spark.local.dir".
Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here from the Spark docs for reference:
spark.worker.cleanup.enabled, default value is false, Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800, i.e. 30 minutes, Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl, default is 7*24*3600 (7 days), The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
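As a rough illustration, the three properties above could be assembled into a SPARK_WORKER_OPTS value in the "-Dx=y" form that standalone workers read. The helper name is hypothetical and the defaults simply mirror the values quoted above. As the docs note, this only affects standalone mode, not YARN.

```python
def worker_cleanup_opts(enabled: bool = True,
                        interval_sec: int = 1800,
                        app_data_ttl_sec: int = 7 * 24 * 3600) -> str:
    """Build a SPARK_WORKER_OPTS string for periodic worker directory cleanup."""
    props = {
        "spark.worker.cleanup.enabled": str(enabled).lower(),
        "spark.worker.cleanup.interval": str(interval_sec),
        "spark.worker.cleanup.appDataTtl": str(app_data_ttl_sec),
    }
    # Each property becomes a JVM system property flag.
    return " ".join(f"-D{key}={value}" for key, value in props.items())


# Line one might place in conf/spark-env.sh on each worker.
print(f'SPARK_WORKER_OPTS="{worker_cleanup_opts()}"')
```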

YARN workers running out of disk space

We are facing a No space on device error with Spark jobs running on our YARN cluster.
This has a few bad results. First, the Spark jobs take longer or fail. Second, since the disk fills up, the nodes are disabled by the YARN NodeManager and are removed from the pool and marked as unhealthy.
Is there a way to configure the maximum disk space that jobs are allowed to use on each NodeManager?
I'm hoping to be able to say something like "I have a disk of 1TB, you can use up to 900GB for jobs" and have YARN manage those resources in such a way that the disk never fills up.
Alternatively, how can I make sure that YARN keeps removing old data from its local disk so it doesn't fill up? I don't care if that causes jobs to fail. That's inevitable when you overuse resources.

Why are my Spark completed applications still using my worker's disk space?

My completed Datastax Spark applications are still using my worker's disk space. As a result, Spark can't run because there is no disk space left.
This is my Spark worker directory. The blue-lined applications take up 92GB in total, but they shouldn't even exist anymore since they are completed applications. Thanks for the help; I don't know where the problem lies.
This is my spark front UI:
Spark doesn't automatically clean up the jars transferred to the worker nodes. If you want it to do so, and you're running Spark Standalone (YARN is a bit different and won't work the same way), you can set spark.worker.cleanup.enabled to true and set the cleanup interval via spark.worker.cleanup.interval. This allows Spark to clean up the data retained on your workers. You may also configure a default TTL for all application directories.
From the docs of spark.worker.cleanup.enabled:
Enable periodic cleanup of worker / application directories. Note that
this only affects standalone mode, as YARN works differently. Only the
directories of stopped applications are cleaned up.
For more, see Spark Configuration.

Best practise to clean up old Spark 1.2.0 application logs?

I am running Spark 1.2.0. I noticed that I have a bunch of old application logs under /var/lib/spark/work that don't seem to get cleaned up. What are the best practices for cleaning these up? A cron job? It looks like newer Spark versions have some kind of cleaner.
Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here from the Spark docs for reference:
spark.worker.cleanup.enabled, default value is false, Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800, i.e. 30 minutes, Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl, default is 7*24*3600 (7 days), The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
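If the built-in cleaner is not available, the cron job idea from the question could look something like the sketch below. It is not Spark's own cleaner: the function names are hypothetical, it judges age by directory modification time only, and it does not check whether an application is still running, so a dry run first is advisable.

```python
import os
import shutil
import time
from typing import List, Optional


def stale_app_dirs(work_dir: str, ttl_sec: int,
                   now: Optional[float] = None) -> List[str]:
    """List subdirectories of work_dir last modified more than ttl_sec ago."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(work_dir):
        # Skip symlinks and plain files; only real directories are considered.
        if entry.is_dir(follow_symlinks=False):
            if now - entry.stat().st_mtime > ttl_sec:
                stale.append(entry.path)
    return sorted(stale)


def remove_stale_app_dirs(work_dir: str, ttl_sec: int,
                          dry_run: bool = True) -> List[str]:
    """Delete stale directories unless dry_run is set; return what was selected."""
    selected = stale_app_dirs(work_dir, ttl_sec)
    if not dry_run:
        for path in selected:
            shutil.rmtree(path)
    return selected
```

A cron entry could then call remove_stale_app_dirs("/var/lib/spark/work", 7 * 24 * 3600, dry_run=False) once a day, using the work directory from the question and the 7-day TTL quoted above.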
