Using Apache Spark with HDFS vs. other distributed storage

The Spark FAQ specifically says that you don't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for node storage (for checkpoints, shuffle spill, etc.)?

After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS lets you view and change files on a remote machine as if they were stored on a local machine.
HDFS can also do that, but it is distributed (as opposed to NFS), as well as fault-tolerant and scalable.
The advantage of using NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters.
The advantage of HDFS is of course its fault tolerance, but a bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark nodes, which gives the best performance for checkpoints, shuffle spill, etc.

Related


Is distributed file storage (HDFS/Cassandra/S3, etc.) mandatory for Spark to run in clustered mode? If yes, why?
Spark is a distributed data processing engine used for computing huge volumes of data. Let's say I have a huge volume of data stored in MySQL which I want to process. Spark reads the data from MySQL and performs in-memory (or on-disk) computation on the cluster nodes themselves. I still don't understand why distributed file storage is needed to run Spark in clustered mode.
Is distributed file storage (HDFS/Cassandra/S3, etc.) mandatory for Spark to run in clustered mode?
Pretty much.
If yes, why?
Because the Spark workers take input from a shared table, distribute the computation amongst themselves, and are then choreographed by the Spark driver to write their data back to another shared table.
If you are trying to work exclusively with MySQL, you might be able to use the local filesystem (file://) as the cluster FS. However, if any RDD or stage in a Spark query does try to use a shared filesystem as a way of committing work, the output isn't going to propagate from the workers (which will have written to their local filesystems) to the Spark driver (which can only read its own local filesystem).

How to configure where Spark spills to disk?

I am not able to find this configuration anywhere in the official documentation. Say I decide to install Spark, or use a Spark Docker image. I would like to configure where the "spill to disk" happens so that I can mount a volume to accommodate it. What is the default location of the spill to disk, and how can it be changed?
Cloud or bare-metal worker nodes have a per-node spill location on the local file system, not HDFS. This is handled as standard, not explicitly by you. A certain amount of the file system is used for spilling and shuffling on the local fs, and the rest for HDFS. You can name a location for the local fs yourself or leave it to the defaults, and that fs can also be an NFS mount, etc.
For Docker, say, you need a simulated HDFS or some Linux-like fs for Spark's intermediate processing. See https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html, an excellent guide.
For Spark on YARN, use yarn.nodemanager.local-dirs. See https://spark.apache.org/docs/latest/running-on-yarn.html
For Spark Standalone, use SPARK_LOCAL_DIRS, which sets the "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system, say Lustre, or on anything other than HDFS or S3?
We are able to build a distributed file system on a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS provided every node has access to common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
Things like Lustre tend to do that and/or have a specific Hadoop filesystem client library which wraps/extends it.

What is the advantage of using Spark with HDFS as the file storage system and YARN as the resource manager?

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for analyzing big data. Since Spark keeps operations on data in memory, when using HDFS as the storage system for Spark, does it take advantage of HDFS's distributed storage? For instance, suppose I have a 100GB CSV file stored in HDFS and I want to analyze it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, the way MapReduce programs written in Hadoop do? If not, what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if there is a RAM overflow, but does this spill occur per node of the cluster (say, 5GB per node) or for the complete data (100GB)?
Spark jobs can be configured to spill to local executor disk if there is not enough memory to read your files. Alternatively, you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100GB of CSV, you could just as easily have less than half that if written in Parquet or ORC...
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you are moving the execution to the NodeManagers on the datanodes, rather than pulling data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication and decide where data is stored and processed.
If you are convinced that MapReduce v2 can be better than Spark, I would encourage looking at Tez instead.

Distributed storage for Spark

The official guide says:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or anything else? Or can I just copy the input across all nodes and not bother with NFS, HDFS, etc.?
Spark does not depend on a distributed file system for shuffle. Unlike traditional MapReduce, Spark doesn't need to write to an HDFS-like system between stages; instead, Spark achieves resiliency by tracking the lineage of the data and, in the event of node failure, re-computing any data that was on that node.
