Spark Worker /tmp directory - apache-spark

I'm using spark-2.1.1-bin-hadoop-2.7 in standalone mode (a cluster of 4 workers, 120g memory, 32 cores total).
Although I set the spark.local.dir conf param to write to /opt, the Spark worker keeps writing to the /tmp dir, for example /tmp/spark-e071ae1b-1970-47b2-bfec-19ca66693768
Is there a way to tell the Spark worker not to write to the /tmp dir?

As per the Spark documentation, a few environment variables override the 'spark.local.dir' property; please check whether any of these are set.
Quoting from the documentation:
spark.local.dir
Directory to use for "scratch" space in Spark, including map output
files and RDDs that get stored on disk. This should be on a fast,
local disk in your system. It can also be a comma-separated list of
multiple directories on different disks. Note: This will be overridden
by SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS
(YARN) environment variables set by the cluster manager.
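Given that note, a commonly suggested fix for the standalone setup above is to set the environment variable on every worker, for example in conf/spark-env.sh, and then restart the workers. This is only a sketch: the /opt/spark-tmp path is an example subdirectory and must exist and be writable by the worker process.
export SPARK_LOCAL_DIRS=/opt/spark-tmp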

Related

Specify which file system spark uses for spilling RDDs

How do we specify the local (unix) file system where Spark spills RDDs when they won't fit in memory? We cannot find this in the documentation. Analysis confirms that it is being saved in the Unix file system, not in HDFS.
We are running on Amazon with Elastic Map Reduce. Spark is spilling to /mnt. On our system, /mnt is an EBS volume while /mnt1 is an SSD. We want to spill to /mnt1. If that fills up, we want to spill to /mnt2. We want /mnt to be the spillage of last resort. It's unclear how to configure it this way, or how to monitor spilling.
We have reviewed the existing SO questions:
Understanding Spark shuffle spill appears out of date.
Why SPARK cached RDD spill to disk? and Use SSD for SPARK RDD discuss spill behavior, but not where the files are spilled.
Spark shuffle spill metrics is an unanswered question showing the Spill UI, but does not provide the details we are requesting.
Check out https://spark.apache.org/docs/2.2.1/configuration.html#application-properties and search for
spark.local.dir
This defaults to /tmp; try setting it to a location on your EBS volume.
NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
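A minimal sketch of that setting in SPARK_HOME/conf/spark-defaults.conf, using the mount points named in the question (the spark-scratch subdirectories are placeholders and must exist, and the NOTE above about cluster-manager overrides still applies):
spark.local.dir /mnt1/spark-scratch,/mnt2/spark-scratch
Be aware that with a comma-separated list Spark spreads its scratch files across all listed directories rather than filling them in priority/overflow order.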
Also look at the following Stack Overflow post for more insight:

Spark worker directory

I'm running Spark on a node with 4 disks (/mnt1, /mnt2, /mnt3, /mnt4). I want to write the temporary output from executors to a local directory. Is there any way to assign these disks to the executors uniformly, so that all disks are used evenly? Currently, all data is written to /mnt1 from the "foreachPartition" action.
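If the goal is to spread Spark's own scratch and shuffle output across the four disks (output written by your own code inside foreachPartition is a separate matter), one hedged sketch is to list one directory per disk in conf/spark-env.sh on the worker; Spark distributes its temporary files across all directories in the list. The /spark subdirectory names are placeholders.
SPARK_LOCAL_DIRS=/mnt1/spark,/mnt2/spark,/mnt3/spark,/mnt4/spark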

Spark - config file that sets spark.storage.memoryFraction

I have come to learn that spark.storage.memoryFraction and spark.storage.safetyFraction are multiplied by the executor memory supplied in the SparkContext. Also, I have learned that it is desirable to lower the memoryFraction for better performance.
The question is where do I set the spark.storage.memoryFraction? Is there a config file?
The default file that Spark searches for such configuration is conf/spark-defaults.conf.
If you want to point Spark at a customized configuration directory, set SPARK_CONF_DIR in conf/spark-env.sh.
I recommend keeping this on a per-job basis instead of updating spark-defaults.conf.
You can create a config file per job, say spark.properties, and pass it to spark-submit:
--properties-file /spark.properties
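For example, a small per-job setup might look like this (a sketch only; the 0.4 value, class name, and jar name are placeholders):
contents of spark.properties:
spark.storage.memoryFraction 0.4
submit command:
spark-submit --properties-file spark.properties --class your.main.Class your-app.jar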

Worker doesn't have sufficient memory

I get the following WARN message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
when I try to run the following Spark task:
spark/bin/spark-submit --master $SPARK_MASTER_URL --executor-memory 8g --driver-memory 8g --name "Test-Task" --class path.to.my.Class myJAR.jar
The master and all workers have enough memory for this task, but it seems like it doesn't get allocated.
My setup looks like this:
SparkConf conf = new SparkConf().set("spark.executor.memory", "8g");
When I start my task and then type
ps -fux | more
in my console, it shows me these options:
-Xms512m -Xmx512m
Can anyone tell me what I'm doing wrong?
Edit:
What I am doing:
I have a huge file saved on my master's disk, which takes about 5 GB when I load it into memory (it's a map of maps). So I first load the whole map into memory and then give each node a part of this map to process. As I understand it, that's why I also need a lot of memory on my master instance. Maybe that's not a good solution?
To enlarge the heap size of the master node you can set the SPARK_DAEMON_MEMORY environment variable (in spark-env.sh, for instance). But I doubt it will solve your memory allocation problem, since the master node does not load the data.
I don't understand what your "map of maps" file is. But usually, to process a big file, you make it available to each worker node using a shared folder (NFS) or, better, a distributed file system (HDFS, GlusterFS). Then each worker can read a part of the file and process it. This works as long as the file format is splittable; Spark supports the JSON file format, for instance.
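A rough sketch of that approach with HDFS (the paths and file name here are hypothetical, and it assumes the file format is line-oriented so it can be split):
hdfs dfs -mkdir -p /data
hdfs dfs -put /local/path/to/huge-map-file.json /data/
The application then reads hdfs:///data/huge-map-file.json with a splittable input (for example sc.textFile), so each executor loads only its own partitions instead of the driver holding the whole 5 GB structure in memory.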

Why does a job fail with "No space left on device", but df says otherwise?

When performing a shuffle my Spark job fails and says "no space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?
By default Spark uses the /tmp directory to store intermediate data. If you actually do have space left on some device, you can change this by creating the file SPARK_HOME/conf/spark-defaults.conf and adding the line below, where SPARK_HOME is the root directory of your Spark install.
spark.local.dir SOME/DIR/WHERE/YOU/HAVE/SPACE
You need to also monitor df -i which shows how many inodes are in use.
on each machine, we create M * R temporary files for shuffle, where M = number of map tasks, R = number of reduce tasks.
https://spark-project.atlassian.net/browse/SPARK-751
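A quick way to check (the /tmp argument assumes the default scratch location; point it at your spark.local.dir otherwise):
df -i /tmp
If the IUse% column is near 100%, the filesystem is out of inodes even though df -h still shows free space.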
If you do indeed see that disks are running out of inodes, then to fix the problem you can:
Decrease partitions (see coalesce with shuffle = false).
One can drop the number to O(R) by “consolidating files”. As different file-systems behave differently it’s recommended that you read up on spark.shuffle.consolidateFiles and see https://spark-project.atlassian.net/secure/attachment/10600/Consolidating%20Shuffle%20Files%20in%20Spark.pdf.
Sometimes you may simply find that you need your DevOps to increase the number of inodes the FS supports.
EDIT
File consolidation has been removed from Spark as of version 1.6.
https://issues.apache.org/jira/browse/SPARK-9808
I encountered a similar problem. By default, Spark uses "/tmp" to save intermediate files. While the job is running, you can run df -h and watch the used space of the filesystem mounted at "/" grow. When the device runs out of space, this exception is thrown. To solve the problem, I set SPARK_LOCAL_DIRS in SPARK_HOME/conf/spark-env.sh to a path on a filesystem with enough space.
Another scenario for this error:
I have a Spark job which uses two sources of data (~150GB and ~100GB) and performs an inner join, many group-by, filtering, and mapping operations.
I created a 20-node (r3.2xlarge) Spark cluster using the spark-ec2 scripts.
Problem:
My job was throwing the error "No space left on device". As you can see, my job requires a lot of shuffling, so to counter this problem I used 20 nodes initially and then increased to 40 nodes. Somehow the problem was still happening. I tried everything else I could, such as changing spark.local.dir, repartitioning, custom partitioners, and parameter tuning (compression, spilling, memory, memory fraction, etc.). Also, I used the r3.2xlarge instance type, which has 1 x 160 GB SSD, but the problem was still happening.
Solution:
I logged into one of the nodes and executed df -h /. I found the node had only one mounted EBS volume (8GB) and no SSD (160GB). Then I looked into ls /dev/ and saw that the SSD was attached. The problem was not happening on all the nodes in the cluster; the error "No space left on device" occurred only on those nodes that did not have the SSD mounted, since they were dealing with only 8GB (EBS), of which only ~4GB was available.
I created another bash script that launches the Spark cluster using the spark-ec2 script and then mounts the disk after formatting it.
ec2-script to launch cluster
MASTER_HOST=$(<ec2-script> get-master $CLUSTER_NAME)
ssh -o StrictHostKeyChecking=no root@$MASTER_HOST "cd /root/spark/sbin/ && ./slaves.sh mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sdb && ./slaves.sh mount -o defaults,noatime,nodiratime /dev/sdb /mnt"
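As a possible follow-up check (reusing the same slaves.sh helper as above), something like this could confirm that every worker now sees the 160GB SSD mounted at /mnt:
ssh -o StrictHostKeyChecking=no root@$MASTER_HOST "cd /root/spark/sbin/ && ./slaves.sh df -h /mnt"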
On the worker machine, set the environment variable SPARK_LOCAL_DIRS to a location where you have free space. From Spark 1.0 onward, this environment variable overrides the "spark.local.dir" configuration property, so setting only the property does not work here.
Some other workarounds:
Explicitly removing the intermediate shuffle files. If you don't want to keep the RDD for later computation, you can call .unpersist(), which will flag the intermediate shuffle files for removal (you can also reassign the RDD variable to None).
Use more workers; adding more workers will reduce, on average, the number of intermediate shuffle files needed per worker.
More about the "No space left on device" error in this Databricks thread:
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
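If you want to watch the scratch space grow while a job runs, a simple sketch (the /tmp/spark-* pattern matches the default layout shown earlier; adjust the paths if spark.local.dir points elsewhere):
watch -n 30 'du -sh /tmp/spark-*; df -h /tmp; df -i /tmp'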
What space is this?
Spark writes temporary output files from "map" tasks and RDDs that are stored on disk to local storage called "scratch space"; by default, the scratch space is the local machine's /tmp directory.
/tmp is usually the operating system's (OS) temporary output directory, accessible to OS users, and /tmp is typically small and on a single disk. So when Spark runs lots of jobs, long jobs, or complex jobs, /tmp can fill up quickly, forcing Spark to throw "No space left on device" exceptions.
Because Spark constantly writes to and reads from its scratch space, disk IO can be heavy and can slow down your workload. The best way to resolve this issue and to boost performance is to give as many disks as possible to handle scratch space disk IO. To achieve both, explicitly define parameter spark.local.dir in spark-defaults.conf configuration file, as follows:
spark.local.dir /data1/tmp,/data2/tmp,/data3/tmp,/data4/tmp,/data5/tmp,/data6/tmp,/data7/tmp,/data8/tmp
The above comma-delimited setting spreads Spark scratch space over 8 disks (make sure each /data* directory is configured on a separate physical data disk), under the /data*/tmp directories. You can use any subdirectory name instead of 'tmp'.
Source: https://developer.ibm.com/hadoop/2016/07/18/troubleshooting-and-tuning-spark-for-heavy-workloads/
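One practical detail worth adding as a sketch (the /dataN mount points follow the example above, and "sparkuser" is a placeholder for whatever account runs the Spark daemons): the directories have to exist and be writable before Spark can use them, e.g.:
mkdir -p /data{1..8}/tmp
chown sparkuser: /data{1..8}/tmp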
Change the SPARK_HOME directory to one that has more space available, so that the job can run smoothly.
