Run Spark or Flink on a distributed file system other than HDFS or S3 - apache-spark

Is there a way to run Spark or Flink on a distributed file system, say Lustre, or anything other than HDFS or S3?
We are able to create a distributed file system framework on a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?

You can use file:/// as a DFS provided every node has access to common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
File systems like Lustre tend to do that, and/or provide a specific Hadoop filesystem client library which wraps/extends that behaviour.
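A minimal sketch of what this looks like in practice, assuming every node mounts the same shared file system (e.g. Lustre) at a common path; the mount point, master URL, and job names below are placeholders:

```shell
# Assumption: /lustre/shared is the same Lustre mount on every node.
# Spark then treats file:// URIs under that mount as shared storage.
spark-submit \
  --master spark://master:7077 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///lustre/shared/spark-events \
  my_job.py file:///lustre/shared/input file:///lustre/shared/output
```

The key point is that no HDFS is involved: as long as `file:///lustre/shared/...` resolves to the same data on every executor, Spark can read and write it like any other filesystem URI.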

Related

How to configure where spark spills to disk?

I am not able to find this configuration anywhere in the official documentation. Say I decide to install spark, or use a spark docker image. I would like to configure where the "spill to disk" happens so that I may mount a volume that can accommodate that. Where does the default location of the spill to disk occur and how is it possible to change it?
Cloud or bare-metal worker nodes have a spill location per node that is on the local file system, not HDFS. This is handled as standard, but not by you explicitly. A certain amount of the local file system is used for spilling and shuffling, with the rest for HDFS. You can name a location yourself or let the system handle it; the local file system can also be an NFS mount, etc.
For Docker, say, you need a simulated HDFS or some Linux-like file system for Spark intermediate processing. See https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html, an excellent guide.
For Spark with YARN, use yarn.nodemanager.local-dirs. See https://spark.apache.org/docs/latest/running-on-yarn.html
For Spark Standalone, use SPARK_LOCAL_DIRS: "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
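As a configuration sketch, the three settings above look like this; `/mnt/spark-scratch` is a placeholder for whatever volume you mount for spill space:

```shell
# Spark Standalone: conf/spark-env.sh on each worker
export SPARK_LOCAL_DIRS=/mnt/spark-scratch

# Per-application alternative (conf/spark-defaults.conf); note this
# setting is overridden by SPARK_LOCAL_DIRS and ignored under YARN:
# spark.local.dir  /mnt/spark-scratch

# Spark on YARN: the NodeManager's local dirs in yarn-site.xml
# <property>
#   <name>yarn.nodemanager.local-dirs</name>
#   <value>/mnt/spark-scratch</value>
# </property>
```

In a Docker setup you would then mount your host volume at that path in each worker container so spills land on storage sized for them.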

How to execute custom C++ binary on HDFS file

I have custom C++ binaries which read a raw data file and write a derived data file. The files are around 100 GB each. Moreover, I would like to process multiple 100 GB files in parallel and generate a materialized view of the derived metadata. Hence, the map-reduce paradigm seems more scalable.
I am a newbie in the Hadoop ecosystem. I have used Ambari to set up a Hadoop cluster on AWS. I have built my custom C++ binaries on every data node and loaded the raw data files onto HDFS. What are my options for executing this binary on HDFS files?
Hadoop streaming is the simplest way to run non-Java applications as MapReduce.
Refer to Hadoop Streaming for more details.
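A hedged sketch of what a streaming invocation could look like. It assumes a hypothetical binary `./reader` that consumes records on stdin and writes derived records to stdout, which is the contract Hadoop Streaming requires; paths and the jar location are placeholders:

```shell
# Run the C++ binary as a map-only streaming job over HDFS input.
# -files ships the binary to each task's working directory;
# since the binaries are already built on every data node, an
# absolute path to the installed binary would also work.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files ./reader \
  -D mapreduce.job.reduces=0 \
  -input  /data/raw \
  -output /data/derived \
  -mapper ./reader
```

Setting zero reducers makes this a parallel "run my binary over every input split" job, which matches the one-file-in, one-file-out shape of the original program.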

How to set Cassandra as my Distributed Storage(File System) for my Spark Cluster

I am new to big data and Spark(pyspark).
Recently I set up a Spark cluster and wanted to use the Cassandra File System (CFS) on my Spark cluster to help upload files.
Can anyone tell me how to set it up and briefly introduce how to use the CFS system? (like how to upload files / from where)
BTW I don't even know how to use HDFS (I downloaded pre-built spark-bin-hadoop but I can't find Hadoop in my system though.)
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most distributed-file applications. It's primarily focused as a substitute for HDFS for map/reduce jobs and for small, temporary but distributed files.
To use it, you just use the cfs:// URI and make sure you are using dse spark-submit for your application.
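A rough sketch of the workflow, assuming DataStax Enterprise is installed on each node (file names and paths are placeholders):

```shell
# Upload a local file into CFS using the DSE-wrapped Hadoop CLI
dse hadoop fs -put local_data.csv cfs:///data/local_data.csv

# Submit the job through DSE so the cfs:// scheme resolves correctly
dse spark-submit my_job.py cfs:///data/local_data.csv
```

The `dse` wrapper matters because it puts the DSE client libraries (including the CFS filesystem implementation) on the classpath; a plain `spark-submit` would not know how to resolve `cfs://` URIs.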

Using Apache Spark with HDFS vs. other distributed storage

On Spark's FAQ it specifically says one doesn't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes storage (for checkpoint, shuffle spill, etc)?
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS allows you to view/change files on a remote machine as if they were stored on a local machine.
HDFS can also do that, but it is distributed (as opposed to NFS) and also fault-tolerant and scalable.
The advantage of using NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters.
The advantage of HDFS is of course its fault-tolerance, but a bigger advantage, IMHO, is the ability to exploit data locality: when HDFS is co-located with the Spark nodes, it provides the best performance for checkpoints, shuffle spill, etc.

Distributed storage for Spark

Official guide says:
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or whatever? Or can I just copy input across all nodes and don't bother with NFS, HDFS etc?
Spark does not depend on a distributed file system for shuffle. Unlike traditional MapReduce, Spark doesn't need to write to HDFS (or a similar system); instead, Spark achieves resiliency by tracking the lineage of the data and, in the event of node failure, re-computing any data which was on that node.
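So the copy-everywhere approach from the quoted guide is viable on its own. A sketch, with hostnames and paths as placeholders:

```shell
# No shared filesystem at all: copy the input to the same path on
# every worker, then reference it with a file:// URI.
for host in worker1 worker2 worker3; do
  scp input.csv "$host":/data/input.csv
done

# Each executor reads its local copy at the identical path.
spark-submit --master spark://master:7077 my_job.py file:///data/input.csv
```

Shuffle data never touches this path: each executor spills and shuffles on its own local disk (see `spark.local.dir` / `SPARK_LOCAL_DIRS`), so only the job's input and output need a common location.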
