Temp storage in Apache Spark - apache-spark

I'm setting up a Spark cluster of 10 nodes.
Spark creates temp files while running a job. Does it create the temp files for all worker nodes on the master node, or on the respective worker nodes?
What will be the path of that temp directory? Where do we set that path?
Secondly, if that temp dir path fills up, it will surely throw an error when trying to store more. How can I delete those temp files while the Spark job is still running, to avoid this error? Will setting spark.worker.cleanup.enabled to true work?

From the Spark docs, spark.local.dir can be used to set the temp dir:
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
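For illustration, a minimal sketch of pointing the scratch space at dedicated disks (the mount points below are hypothetical); per the note above, in standalone mode the SPARK_LOCAL_DIRS environment variable from conf/spark-env.sh takes precedence over the property:
# conf/spark-defaults.conf (hypothetical mount points)
spark.local.dir /mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp
# conf/spark-env.sh (standalone mode; overrides spark.local.dir)
export SPARK_LOCAL_DIRS=/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp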
Spark docs for the temp dir cleanup configs:
spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
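As a sketch of how these typically get applied on a standalone cluster: pass them to each worker through SPARK_WORKER_OPTS in conf/spark-env.sh and restart the workers (the interval and TTL shown simply restate the defaults; enabled is switched on):
# conf/spark-env.sh on each worker (standalone mode)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"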

Related

Balance files created by Spark on SPARK_LOCAL_DIR

I have read about the spark.local.dir configuration property, but I still don't understand how Spark uses this directory.
I have 4 machines in my cluster; one machine acts as both worker and master, the other machines are workers only. spark.local.dir in my spark-defaults.conf is set to the default value, /tmp.
Then, for every Spark application, a lot of files are created and stored under /tmp on machine 1, while the other machines do not store any files/data.
How can I make the files/data created by the Spark application balanced across all machines?

Spark - Is there a way to cleanup orphaned RDD files and block manager folders (using pyspark)?

I am currently running/experimenting with Spark in a Windows environment and have noticed a large number of orphaned blockmgr folders and rdd files. These are being created when I have insufficient memory to cache a full data set.
I suspect that they are being left behind when processes fail.
At the moment, I am manually deleting them from time to time (when I run out of disk space...). I have also toyed around with a simple file operation script.
I was wondering: are there any pyspark functions or scripts available that would clean these up, or any way to check for them when a process is initiated?
Thanks
As per #cronoik, this was solved by setting the following properties:
spark.worker.cleanup.enabled true
In my instance, using both 'local' and 'standalone' modes on a single-node Windows environment, I have set this within the spark-defaults.conf file.
Refer to the documentation for more information: Spark Standalone Mode
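For reference, a sketch of the corresponding spark-defaults.conf entries (only the enabled flag comes from the answer above; the interval and TTL lines are illustrative additions):
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 1800
spark.worker.cleanup.appDataTtl 604800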

Spark temp files not getting deleted automatically

I have a Spark YARN client submitting jobs, and when it does, it creates a directory under my spark.local.dir which has files like:
__spark_conf__8681611713144350374.zip
__spark_libs__4985837356751625488.zip
Is there a way these can be automatically cleaned? Whenever I submit a Spark job I see new entries for these again in the same folder. This is flooding my directory; what should I set to make these clear automatically?
I have looked at a couple of links online, even on SO, but couldn't find a solution to this problem. All I found was a way to specify the dir path via spark.local.dir.
Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here for further reference from the Spark docs:
spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.

Why are my Spark completed applications still using my worker's disk space?

My Datastax Spark completed applications are using my workers' disk space. Therefore Spark can't run, because it doesn't have any disk space left.
This is my Spark worker directory. The applications outlined in blue take up 92 GB in total, but they shouldn't even exist anymore, since they are completed applications. Thanks for the help; I don't know where the problem lies.
Spark doesn't automatically clean up the JARs transferred to the worker nodes. If you want it to, and you're running Spark Standalone (YARN is a bit different and won't work the same way), you can set spark.worker.cleanup.enabled to true and set the cleanup interval via spark.worker.cleanup.interval. This will allow Spark to clean up the data retained on your workers. You may also configure a default TTL for all application directories.
From the docs of spark.worker.cleanup.enabled:
Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
For more, see Spark Configuration.
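A minimal sketch of what that could look like via SPARK_WORKER_OPTS in conf/spark-env.sh on each worker (the one-day TTL is only an illustrative value):
# 86400 seconds = 1 day; adjust to your disk space
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400"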

Best practice to clean up old Spark 1.2.0 application logs?

I am running Spark 1.2.0. I noticed that I have a bunch of old application logs under /var/lib/spark/work that don't seem to get cleaned up. What are best practices for cleaning these up? A cronjob? It looks like newer Spark versions have some kind of cleaner.
Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here for further reference from the Spark docs:
spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
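If you prefer to handle it outside Spark (the question mentions a cronjob), a minimal sketch assuming the /var/lib/spark/work path from the question and a 7-day retention:
# crontab entry: nightly at 02:00, remove application work dirs older than 7 days
0 2 * * * find /var/lib/spark/work -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +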
