I am running Spark 1.2.0. I noticed that I have a bunch of old application logs under /var/lib/spark/work that don't seem to get cleaned up. What are the best practices for cleaning these up? A cron job? It looks like newer Spark versions have some kind of cleaner.
Three SPARK_WORKER_OPTS properties exist to support worker application folder cleanup, copied here for reference from the Spark documentation (an example of setting them follows the list):
spark.worker.cleanup.enabled, default is false: Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800 (30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl, default is 7*24*3600 (7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
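For example, these can be passed through SPARK_WORKER_OPTS in conf/spark-env.sh on each worker. A minimal sketch, with the interval and TTL set to their defaults purely for illustration:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"
The workers have to be restarted to pick up the new options, and only the directories of stopped applications are removed.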
Related
I am currently running/experimenting with Spark in a Windows environment and have noticed a large number of orphaned blockmgr folders and rdd files. These are being created when I have insufficient memory to cache a full data set.
I suspect that they are being left behind when processes fail.
At the moment, I am manually deleting them from time to time (when I run out of disk space...). I have also toyed around with a simple file-operation script.
I was wondering: are there any PySpark functions or scripts available that would clean these up, or any way to check for them when a process is initiated?
Thanks
As per #cronoik, this was solved by setting the following properties:
spark.worker.cleanup.enabled true
In my instance, using both 'local' and 'standalone' modes in a single-node Windows environment, I have set this within the spark-defaults.conf file.
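For reference, a minimal spark-defaults.conf sketch with all three cleanup properties (the interval and TTL values are only illustrative and should be tuned to the available disk space):
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 1800
spark.worker.cleanup.appDataTtl 604800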
Refer to the documentation for more information: Spark Standalone Mode
I have a Spark YARN client submitting jobs, and when it does, it creates a directory under my "spark.local.dir" which has files like:
__spark_conf__8681611713144350374.zip
__spark_libs__4985837356751625488.zip
Is there a way these can be cleaned automatically? Whenever I submit a Spark job, I see new entries for these in the same folder again. This is flooding my directory; what should I set to make it clear up automatically?
I have looked at a couple of links online, even on SO, but couldn't find a solution to this problem. All I found was a way to specify the directory path via "spark.local.dir".
Three SPARK_WORKER_OPTS properties exist to support worker application folder cleanup, copied here for reference from the Spark documentation:
spark.worker.cleanup.enabled, default is false: Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800 (30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl, default is 7*24*3600 (7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
My completed DataStax Spark applications are using up my workers' disk space. Therefore Spark can't run, because there is no disk space left.
This is my Spark worker directory. The applications outlined in blue take up 92 GB in total, but they shouldn't even exist anymore since they are completed applications. Thanks for the help; I don't know where the problem lies.
This is my Spark front-end UI:
Spark doesn't automatically clean up the JARs transferred to the worker nodes. If you want it to do so, and you're running Spark Standalone (YARN is a bit different and won't work the same way), you can set spark.worker.cleanup.enabled to true and set the cleanup interval via spark.worker.cleanup.interval. This will allow Spark to clean up the data retained on your workers. You may also configure a default TTL for all application directories via spark.worker.cleanup.appDataTtl.
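A minimal sketch of how this could look in conf/spark-env.sh on each worker; the values are only illustrative, with the TTL shortened to one day (86400 seconds) so completed application directories are reclaimed sooner:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=86400"
In a DataStax Enterprise install the spark-env.sh file may live in a different location, but these are standard Spark properties.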
From the docs of spark.worker.cleanup.enabled:
Enable periodic cleanup of worker / application directories. Note that
this only affects standalone mode, as YARN works differently. Only the
directories of stopped applications are cleaned up.
For more, see Spark Configuration.
I have a Spark application that finishes without error, but once it's done, has saved all of its outputs, and the process terminates, the Spark standalone cluster master process becomes a CPU hog, using 16 CPUs full-time for hours, and the web UI becomes unresponsive. I have no idea what it could be doing; is there some complicated clean-up step?
Some more details:
I've got a Spark standalone cluster (27 workers/nodes) that I've been successfully submitting jobs to for a while. I recently scaled up the size of my applications; the largest now takes 3.5 hours using 100 cores over 27 workers, and each worker does dozens of GB of shuffle read/write over the course of the job. Otherwise, the application is no different from the smaller jobs that have run successfully before.
This is a known issue with Spark's standalone cluster, and is caused by the massive event log created by large applications. You can read more at the issue tracking link below.
https://issues.apache.org/jira/browse/SPARK-12299
At the current time, the best work-around is to disable event logging for large jobs.
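For example, event logging could be switched off for just the large job at submit time. A sketch, where the master URL and application file are placeholders:
spark-submit --master spark://master:7077 --conf spark.eventLog.enabled=false my_large_job.py
Alternatively, setting spark.eventLog.enabled to false in spark-defaults.conf disables it for all applications submitted with that configuration.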
The situation is the following:
A YARN application is started. It gets scheduled.
It writes a lot to its appcache directory.
The application fails.
YARN restarts it. It goes pending, because there is not enough disk space anywhere to schedule it. The disks are filled up by the appcache from the failed run.
If I manually intervene and kill the application, the disk space is cleaned up. Now I can manually restart the application and it's fine.
I wish I could tell the automated retry to clean up the disk. Alternatively I suppose it could count that used disk as part of the new allocation, since it belongs to the application anyway.
I'll happily take any solution you can offer. I don't know much about YARN. It's an Apache Spark application started with spark-submit in yarn-client mode. The files that fill up the disk are the shuffle spill files.
So here's what happens:
When you submit a YARN application, it creates a private local resource folder (the appcache directory).
Inside this directory, the Spark block manager creates directories for storing block data. As the documentation notes, these local directories won't be deleted on JVM exit when using the external shuffle service.
This directory can be cleaned up via:
A shutdown hook. This is what happens when you kill the application.
YARN's DeletionService. This should happen automatically when the application finishes. Make sure yarn.nodemanager.delete.debug-delay-sec=0 (see the snippet below); otherwise you may be hitting an unresolved YARN bug.
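For reference, a minimal sketch of that setting in yarn-site.xml on each node manager (0 is the default; a non-zero value delays deletion so the application directories stick around for debugging):
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>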