I have read about the spark.local.dir configuration property, but I still do not understand how Spark uses this directory.
I have 4 machines in my cluster: one machine acts as both master and worker, and the other machines are workers only. spark.local.dir in my spark-defaults.conf is set to the default value, /tmp.
For every Spark application, a lot of files are created and stored under /tmp on machine 1, while the other machines do not store any files/data.
How can I make the files/data created by a Spark application balanced across all machines?
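For example, I could point spark.local.dir at a dedicated scratch directory instead of /tmp in spark-defaults.conf; the /data/spark-scratch path below is only a placeholder, and as far as I understand the same directory would have to exist on every machine, since each node resolves the setting locally:
spark.local.dir /data/spark-scratch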
Related
In our Spark application, we store the local application cache in the /mnt/yarn/app-cache/ directory, which is shared between application containers on the same EC2 instance.
/mnt/... is chosen because it is a fast NVMe SSD on r5d instances.
This approach worked well for several years on EMR 5.x: /mnt/yarn belongs to the yarn user, app containers run as yarn, and so they can create directories there.
In EMR 6.x things changed: containers now run as the hadoop user, which does not have write access to /mnt/yarn/.
The hadoop user can create directories in /mnt/, but yarn cannot, and I want to keep compatibility: the app should be able to run successfully on both EMR 5.x and 6.x.
java.io.tmpdir also doesn't work, since it is different for each container.
What would be the proper place to store the cache on the NVMe SSDs (/mnt, /mnt1) so that it is accessible by all containers and works on both EMR 5.x and 6.x?
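Roughly, what I need is something along these lines: probe a few candidate roots at startup and use the first one the container user can actually write to (a sketch only; every candidate path except /mnt/yarn/app-cache is a placeholder):

import os

# Candidate cache roots in order of preference; all but the first are placeholders
CANDIDATE_ROOTS = ["/mnt/yarn/app-cache", "/mnt/app-cache", "/mnt1/app-cache"]

def pick_cache_root():
    for root in CANDIDATE_ROOTS:
        try:
            # Succeeds if the current user (yarn on EMR 5.x, hadoop on EMR 6.x)
            # may create or reuse this directory
            os.makedirs(root, exist_ok=True)
            if os.access(root, os.W_OK):
                return root
        except OSError:
            continue
    raise RuntimeError("no writable cache root found on this instance")

cache_root = pick_cache_root()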
On your EMR cluster, you can add the yarn user to the superuser group; by default, this group is called supergroup. You can confirm whether this is the right group by checking the dfs.permissions.superusergroup property in the hdfs-site.xml file.
You could also try modifying the following HDFS properties (in the file named above): dfs.permissions.enabled or dfs.datanode.data.dir.perm.
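To double-check which values are actually in effect, you could parse the client configuration (a sketch; /etc/hadoop/conf/hdfs-site.xml is the usual location on EMR, so adjust the path if your layout differs):

import xml.etree.ElementTree as ET

# The properties mentioned above
WANTED = {"dfs.permissions.superusergroup", "dfs.permissions.enabled", "dfs.datanode.data.dir.perm"}

tree = ET.parse("/etc/hadoop/conf/hdfs-site.xml")
for prop in tree.getroot().iter("property"):
    name = prop.findtext("name")
    if name in WANTED:
        print(name, "=", prop.findtext("value"))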
I am new to Apache Spark.
I have a cluster with a master and one worker. I am connected to the master with pyspark (all are Ubuntu VMs).
I am reading this documentation: RDD external-datasets
In particular, I have executed:
distFile = sc.textFile("data.txt")
I understand that this creates an RDD from the file, which should be managed by the driver, and hence by the pyspark app.
But the doc states:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
The question is: why do the workers need access to the file path if the RDD is created by the driver only (and afterwards distributed to the nodes)?
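To make the question concrete, this is roughly what I am running (the master URL and the path are placeholders):

from pyspark import SparkContext

sc = SparkContext(master="spark://master-vm:7077", appName="textfile-test")

# Nothing is read at this point: textFile() is lazy and only records the path
distFile = sc.textFile("file:///home/user/data.txt")

# The read is triggered by an action; this is where the quoted requirement about
# the path being accessible on the workers seems to come into play
print(distFile.count())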
I'm setting up a Spark cluster of 10 nodes.
Spark creates temp files while running a job. Does it create the temp files for all worker nodes on the master node, or on the respective worker nodes?
What is the path of that temp directory, and where do we set it?
Secondly, if that temp directory fills up, Spark will surely throw an error when it tries to store more. How can I delete those temp files while the Spark job is still running, to avoid this error? Will setting spark.worker.cleanup.enabled to true work?
Spark docs for setting the temp dir:
spark.local.dir can be used:
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
Spark docs for temp dir cleanup configs:
spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a time to live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
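On a standalone cluster these worker-side properties are normally passed to the worker daemon through SPARK_WORKER_OPTS in conf/spark-env.sh, for example (the values shown are the defaults quoted above, with cleanup switched on):
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"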
I need some hints about defining a path to a directory with lots of files in Spark. I have set up a standalone cluster with one machine as worker and another machine as master, and the driver is my local machine. I develop my code on the local machine with Python. I have copied all files to the master and the worker; the path is the same on both machines (e.g. /data/test/). I have set up a SparkSession, but now I do not know how to define the path to the directory in my script. So my problem is: how do I tell Spark that it can find the data in the directory above on both machines?
Another question is how to deal with file formats like .mal: how can I read in such files? Thanks for any hints!
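For example, is something like the following the right way to point Spark at that directory? (The master URL is a placeholder, and reading the .mal files as plain text is only a guess on my part.)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")   # placeholder master URL
         .appName("read-directory-test")
         .getOrCreate())

# /data/test/ would have to be reachable under the same path on the machines
# that read it, which is exactly the part I am unsure about
lines = spark.sparkContext.textFile("file:///data/test/*.mal")
print(lines.count())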
When a Spark job is submitted to the driver (master), a few things happen:
The driver program creates an execution plan. It creates multiple stages, and each stage contains multiple tasks.
The cluster manager allocates resources and launches executors on the workers, based on the arguments passed while submitting the job.
The tasks are given to the executors to be executed, and the driver monitors each task's execution. Resources are deallocated and executors are terminated when the SparkContext is closed or the application program finishes.
The driver or master where the Spark job is submitted needs an accessible data path, as it controls the whole execution plan. The driver program and cluster manager take care of everything needed to run the different kinds of operations on the workers. Since the Spark job is submitted on the master, it is enough to provide a data path that is accessible to Spark from the master machine.
I have a DataProc Spark cluster. Initially, the master and 2 worker nodes are of type n1-standard-4 (4 vCPU, 15.0 GB memory), then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with
spark.executor.cores 2
spark.executor.memory 4655m
in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled set to true, the number of executors would be increased automatically.
Also, the information on the DataProc page of the web console doesn't get updated automatically either. It seems that DataProc still thinks that all nodes are n1-standard-4.
My questions are:
Why are there more executors on the first worker node than on the second?
Why aren't more executors added to each node?
Ideally, I want the whole cluster to be fully utilized. If the Spark configuration needs to be updated, how do I do that?
As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.
To answer your questions:
The Spark ApplicationMaster takes a container in YARN on a worker node, usually the first worker if only a single spark application is running.
When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once machines are advertising more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may want to modify the size of containers requested by Spark by modifying spark.executor.memory and spark.executor.cores.
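As a rough illustration of how the executor count falls out of those numbers (the YARN NodeManager memory below is a hypothetical figure, not one read from your cluster):

# Hypothetical illustration of how YARN memory caps the number of executors per node
yarn_node_memory_mb = 12288   # yarn.nodemanager.resource.memory-mb (assumed value)
executor_memory_mb = 4655     # spark.executor.memory from spark-defaults.conf
overhead_mb = max(384, int(0.10 * executor_memory_mb))  # default executor memory overhead

container_mb = executor_memory_mb + overhead_mb
executors_per_node = yarn_node_memory_mb // container_mb
print(executors_per_node)  # 2 with these numbers; advertising more memory to YARN allows more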
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from your old cluster to the new cluster. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.