How to configure where spark spills to disk? - apache-spark

I am not able to find this configuration anywhere in the official documentation. Say I decide to install spark, or use a spark docker image. I would like to configure where the "spill to disk" happens so that I may mount a volume that can accommodate that. Where does the default location of the spill to disk occur and how is it possible to change it?

Cloud or bare metal worker nodes have spill location per node that is
local file system, not HDFS. This is standardly handled, but not by
you explicitly. A certain amount of the fs is used for spilling,
shuffling and is local fs, the rest for HDFS. You can name a location
or let HDFS handle that for local fs, or the fs can be an NFS, etc.
For Docker, say, you need simulated HDFS or some linux-like fs for Spark intermediate processing. See https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html an excellent guide.
For Spark with YARN, use yarn.nodemanager.local-dirs. See https://spark.apache.org/docs/latest/running-on-yarn.html
For Spark Standalone, use SPARK_LOCAL_DIRS."Scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.

Related

Is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode? if yes, why?

Is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode? if yes, why?
Spark is distributed data processing engine used for computing huge volumes of data. Let's say I have huge volume of data stored in mysql which I want to perform processing on. Spark reads the data from mysql and perform in-memory (or disk) computation on the cluster nodes itself. I am still not able to understand why is distributed file storage needed to run spark in a clustered mode?
is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode?
Pretty Much
if yes, why?
Because the spark workers take input from a shared table, distribute the computation amongst themselves, then are choreographed by the spark driver to write their data back to another shared table.
If you are trying to work exclusively with mysql you might be able to use the local filesystem ("file://) as the cluster FS. However, if any RDD or stage in a spark query does try to use a shared filesystem as a way of committing work, the output isn't going to propagate from the workers (which will have written to their local filesystem) and the spark driver (which can only read its local filesystem)

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system say lustre or anything except from HDFS or S3.
So we are able to create a distributed file system framework using Unix cluster, Can we run spark/flink on a cluster mode rather than standalone.
you can use file:/// as a DFS provided every node has access to common paths, and *your app is configured to use those common paths for sharing source libraries, source data, intermediate data, final data
Things like lustre tend to do that and/or have a specific hadoop filesystem client lib which wraps/extends that.

Local disk configuration in Spark

Hi the official Spark documentation state:
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages. We recommend having 4-8
disks per node, configured without RAID (just as separate mount
points). In Linux, mount the disks with the noatime option to reduce
unnecessary writes. In Spark, configure the spark.local.dir variable
to be a comma-separated list of the local disks. If you are running
HDFS, it’s fine to use the same disks as HDFS.
I wonder what is the purpose of 4-8 per node
Is it for parallel write ? I am not sure to understand the reason why as it is not explained.
I have no clue for this: "If you are running HDFS, it’s fine to use
the same disks as HDFS".
Any idea what is meant here...
Purpose of usage 4-8 RAID disks to mirror partitions adding redundancy to prevent data lost in case of fault on hardware level. In case of HDFS the redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
Reference

Using Apache Spark with HDFS vs. other distributed storage

On the Spark's FAQ it specifically says one doesn't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes storage (for checkpoint, shuffle spill, etc)?
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS allows to view/change files on a remote machines as if they were stored a local machine.
HDFS can also do that, but it is distributed (as opposed to NFS) and also fault-tolerant and scalable.
The advantage of using NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters.
The advantage of HDFS is of course its fault-tolerance but a bigger advantage, IMHO, is the ability to utilize locality when HDFS is co-located with the Spark nodes which provides best performance for checkpoints, shuffle spill, etc.

Distributed storage for Spark

Official guide says:
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or whatever? Or can I just copy input across all nodes and don't bother with NFS, HDFS etc?
Spark does not depend on a distirbuted file system for shuffle. Unlike traditional map reduce, Spark doesn't need to write to HDFS (or similar) system, instead Spark achieves resiliency by tracking the lineage of the data and using that in the event of node failure by re-computing any data which was on that node.

Resources