Is it possible to use GridGain file system with storage other than Hadoop - gridgain

Is there an way to use the GridGain in-memory file system on top of storage other than the Hadoop file system?
Actually my idea it would to use just like a cache on top of a plain file system or shared NFS.

GridGain chose Hadoop FileSystem API as the secondary file system interface. So, the answer is Yes, as long as you wrap another file system into Hadoop FileSystem interface.
Also, you may wish to look at HDFS NFS Gateway project.

Related

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system say lustre or anything except from HDFS or S3.
So we are able to create a distributed file system framework using Unix cluster, Can we run spark/flink on a cluster mode rather than standalone.
you can use file:/// as a DFS provided every node has access to common paths, and *your app is configured to use those common paths for sharing source libraries, source data, intermediate data, final data
Things like lustre tend to do that and/or have a specific hadoop filesystem client lib which wraps/extends that.

How to set Cassandra as my Distributed Storage(File System) for my Spark Cluster

I am new to big data and Spark(pyspark).
Recently I just setup a spark cluster and wanted to use Cassandra File System (CFS) on my spark cluster to help upload files.
Can any one tell me how to set it up and briefly introduce how to use CFS system? (like how to upload files / from where)
BTW I don't even know how to use HDFS(I downloaded pre-built spark-bin-hadoop but I can't find hadoop in my system tho.)
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most Distributed File applications. It's primary focused as a substitute for HDFS for map/reduce jobs and small temporary but distributed files.
To use it you just use the CFS:// uri and make sure you are using dse spark-submit from your application.

Using Apache Spark with HDFS vs. other distributed storage

On the Spark's FAQ it specifically says one doesn't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes storage (for checkpoint, shuffle spill, etc)?
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS allows to view/change files on a remote machines as if they were stored a local machine.
HDFS can also do that, but it is distributed (as opposed to NFS) and also fault-tolerant and scalable.
The advantage of using NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters.
The advantage of HDFS is of course its fault-tolerance but a bigger advantage, IMHO, is the ability to utilize locality when HDFS is co-located with the Spark nodes which provides best performance for checkpoints, shuffle spill, etc.

Distributed storage for Spark

Official guide says:
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or whatever? Or can I just copy input across all nodes and don't bother with NFS, HDFS etc?
Spark does not depend on a distirbuted file system for shuffle. Unlike traditional map reduce, Spark doesn't need to write to HDFS (or similar) system, instead Spark achieves resiliency by tracking the lineage of the data and using that in the event of node failure by re-computing any data which was on that node.

Differences between Linux and Hadoop file system

What are the differences between a Linux and a Hadoop file sytem? I knew few of them, just wanted to know more details.
see this similar question.
First of all you cannot compare Linux file system with HDFS, But
upto my knowledge,
HDFS - Name itself says that its a distributed file system where the data stores into several blocks on different clusters.
HDFS write once read many but Local file system write many, ready many
Local file system is a default storage architecture comes with OS but HDFS is a file system for hadoop framework refer here HDFS.
HDFS is an another layer for Local file system.
Refer the below links to see the difference :
Linux File System
Hadoop File System

Resources