What are the differences between a Linux and a Hadoop file system? I know a few of them; I just want to know more details.
See this similar question.
First of all, you cannot directly compare the Linux file system with HDFS, but
to the best of my knowledge:
HDFS - the name itself says it is a distributed file system, where data is stored as blocks spread across the nodes of a cluster.
HDFS is write once, read many, but the local file system is write many, read many.
The local file system is the default storage architecture that comes with the OS, but HDFS is the file system for the Hadoop framework; refer here: HDFS.
HDFS is another layer on top of the local file system.
Refer to the links below to see the differences:
Linux File System
Hadoop File System
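To make the "another layer" point concrete, here is a minimal Scala sketch showing that both file systems can be reached through the same Hadoop FileSystem API and that only the URI scheme changes; the NameNode address and paths are placeholders, not part of the original answer.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FsLayers {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // The local (Linux) file system, addressed through the Hadoop API.
    val localFs = FileSystem.get(URI.create("file:///"), conf)
    localFs.listStatus(new Path("file:///tmp")).foreach(s => println(s.getPath))

    // HDFS: a distributed layer whose blocks ultimately live on the local
    // file systems of the DataNodes. "namenode-host:8020" is a placeholder.
    val hdfs = FileSystem.get(URI.create("hdfs://namenode-host:8020/"), conf)
    hdfs.listStatus(new Path("/user")).foreach(s => println(s.getPath))
  }
}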
Related
Is there a way to run Spark or Flink on a distributed file system, say Lustre, or anything other than HDFS or S3?
We are able to create a distributed file system framework using a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS provided every node has access to common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
Things like Lustre tend to do that and/or have a specific Hadoop filesystem client library which wraps/extends that.
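As a rough illustration of the file:/// approach, assuming every node mounts the same shared Lustre/NFS storage at a hypothetical /mnt/shared path, a Spark job can then read and write that mount with its normal APIs:

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming every node in the cluster mounts the same shared
// Lustre/NFS storage at /mnt/shared (path and app name are placeholders).
object SharedFsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-fs-example").getOrCreate()

    // file:/// behaves like a DFS here only because the same path resolves
    // to the same shared storage on every executor.
    val df = spark.read.text("file:///mnt/shared/input")
    df.write.mode("overwrite").parquet("file:///mnt/shared/output")

    spark.stop()
  }
}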
I have developed a distributed file system which provides interfaces like a standard Linux file system. Now I want it to be supported by Spark, meaning Spark can read files from it and save files to it just like HDFS. Since I am not familiar with Spark, my question is: what interfaces should I provide to Spark, or what requirements should the file system meet, to be successfully operated by Spark?
Does the output of a Spark job need to be written to HDFS and downloaded from there, or could it be written to the local file system directly?
Fundamentally, no, you cannot use Spark's native writing APIs (e.g. df.write.parquet) to write to local filesystem files. When running in Spark local mode (on your own computer, not a cluster), you will be reading/writing from your local filesystem. However, in a cluster setting (standalone/YARN/etc.), writing to HDFS is the only logical approach, since partitions are [generally] contained on separate nodes.
Writing to HDFS is inherently distributed, whereas writing to the local filesystem would involve at least one of two problems:
1) writing to the node-local filesystem would leave files scattered across different nodes (5 files on one node, 7 files on another, etc.)
2) writing to the driver's filesystem would require sending all the executors' results to the driver, akin to running collect
You can write to the driver's local filesystem using traditional I/O operations built into languages like Python or Scala.
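For example, here is a minimal Scala sketch of that collect-then-write pattern; paths and names are placeholders, and it only makes sense for results small enough to fit on the driver:

import java.io.PrintWriter
import org.apache.spark.sql.SparkSession

// Sketch of writing a small result to the driver's local file system with
// plain JVM I/O instead of Spark's distributed writers.
object DriverLocalWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-local-write").getOrCreate()
    import spark.implicits._

    val result = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // collect() pulls every row to the driver, so this is only sensible
    // when the result comfortably fits in driver memory.
    val rows = result.collect()
    val writer = new PrintWriter("/tmp/output.csv") // driver-local path (placeholder)
    try rows.foreach(r => writer.println(s"${r.getString(0)},${r.getInt(1)}"))
    finally writer.close()

    spark.stop()
  }
}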
Relevant SOs:
How to write to CSV in Spark
Save a spark RDD to the local file system using Java
Spark (Scala) Writing (and reading) to local file system from driver
I have a Spark application which currently runs in local mode and writes its output to a file in a local UNIX directory.
Now, I want to run the same job in yarn cluster mode and still want to write into that UNIX folder.
Can I use the same saveAsTextFile(path)?
Yes, you can. But it is not best practice to do that. Spark itself can run standalone and on a distributed file system. The reason we use a distributed file system is that the data is huge and the expected output might be huge as well.
So, if you are completely sure that the output will fit into your local file system, go for it; otherwise you can write the output to HDFS and then save it to your local storage using the command below.
bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path
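For completeness, here is a rough Scala sketch of the same write-to-HDFS-then-copy-to-local flow done programmatically; all paths are placeholders, and in yarn-cluster mode the local copy lands on whichever node hosts the driver:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Sketch: write the job output to HDFS first, then copy the finished output
// to a local UNIX directory, mirroring the copyToLocal command above.
object SaveThenCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-then-copy").getOrCreate()
    val sc = spark.sparkContext

    sc.parallelize(Seq("line1", "line2"))
      .saveAsTextFile("hdfs:///user/example/output")

    // Programmatic equivalent of `bin/hadoop fs -copyToLocal`.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.copyToLocalFile(new Path("hdfs:///user/example/output"),
                       new Path("file:///data/local/output"))

    spark.stop()
  }
}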
Is there a way to use the GridGain in-memory file system on top of storage other than the Hadoop file system?
Actually, my idea would be to use it just like a cache on top of a plain file system or a shared NFS.
GridGain chose the Hadoop FileSystem API as its secondary file system interface. So, the answer is yes, as long as you wrap the other file system in the Hadoop FileSystem interface.
Also, you may wish to look at HDFS NFS Gateway project.
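As a rough sketch of what "wrapping another file system into the Hadoop FileSystem interface" can look like, the example below exposes plain local/NFS storage under a hypothetical myfs:// scheme by reusing Hadoop's RawLocalFileSystem; a real secondary file system would need to override more of the API, so treat this only as a starting point:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, RawLocalFileSystem}

// Hypothetical wrapper: present local/NFS storage through the Hadoop
// FileSystem API under a custom "myfs://" scheme.
class MyFs extends RawLocalFileSystem {
  override def getScheme: String = "myfs"
  override def getUri: URI = URI.create("myfs:///")
}

object MyFsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // fs.<scheme>.impl tells Hadoop (and anything built on it, such as
    // GridGain's secondary file system or Spark) which class backs the scheme.
    conf.set("fs.myfs.impl", classOf[MyFs].getName)

    val fs = FileSystem.get(URI.create("myfs:///"), conf)
    fs.listStatus(new Path("/tmp")).foreach(s => println(s.getPath))
  }
}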