Can you read/write directly to hard disk from a spark job? - apache-spark

Does the output of a Spark job need to be written to HDFS and downloaded from there, or could it be written to the local file system directly?

Fundamentally, no: you cannot use Spark's native write APIs (e.g. df.write.parquet) to produce files on a local filesystem in a meaningful way on a cluster. When running in Spark local mode (on your own computer, not a cluster), you will be reading from and writing to your local filesystem anyway. In a cluster setting (standalone, YARN, etc.), however, writing to HDFS is the only logical approach, since partitions are generally held on separate nodes.
Writing to HDFS is inherently distributed, whereas writing to a local filesystem would run into at least one of two problems:
1) Writing to each node's local filesystem would leave files scattered across the nodes (5 files on one node, 7 on another, and so on).
2) Writing to the driver's filesystem would require sending all the executors' results to the driver, akin to running collect.
You can write to the driver's local filesystem using the traditional I/O operations built into languages like Python or Scala.
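As a minimal PySpark sketch of that approach (the app name, output path, and stand-in DataFrame below are hypothetical): collect the result to the driver, then write it with plain Python I/O. This only makes sense when the result is small enough to fit in driver memory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-local-write").getOrCreate()  # hypothetical app name
df = spark.range(100)  # stand-in for your actual DataFrame; has a single "id" column

rows = df.collect()  # pulls every partition back to the driver, so the result must fit in driver memory
with open("/tmp/output.csv", "w") as f:  # hypothetical path on the driver's local filesystem
    f.write("id\n")
    for row in rows:
        f.write(f"{row['id']}\n")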
Relevant SOs:
How to write to CSV in Spark
Save a spark RDD to the local file system using Java
Spark (Scala) Writing (and reading) to local file system from driver

Related

Is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode? if yes, why?

Spark is a distributed data processing engine used for computing over huge volumes of data. Let's say I have a huge volume of data stored in MySQL which I want to process. Spark reads the data from MySQL and performs in-memory (or on-disk) computation on the cluster nodes themselves. I am still not able to understand why distributed file storage is needed to run Spark in clustered mode.
is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode?
Pretty Much
if yes, why?
Because the Spark workers take input from a shared table, distribute the computation amongst themselves, and are then choreographed by the Spark driver to write their data back to another shared table.
If you are trying to work exclusively with MySQL, you might be able to use the local filesystem (file://) as the cluster FS. However, if any RDD or stage in a Spark query does try to use a shared filesystem as a way of committing work, the output isn't going to propagate from the workers (which will have written to their local filesystems) to the Spark driver (which can only read its own local filesystem).
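A minimal sketch of that MySQL-only pattern, assuming a reachable JDBC endpoint (the URL, table names, and credentials below are hypothetical): Spark reads from and writes back to MySQL over JDBC, so no shared filesystem is involved in the input or output.

from pyspark.sql import SparkSession

# Requires the MySQL JDBC driver jar on the classpath (e.g. via --jars).
spark = SparkSession.builder.appName("mysql-only").getOrCreate()  # hypothetical app name

jdbc_url = "jdbc:mysql://db-host:3306/mydb"  # hypothetical endpoint
props = {"user": "spark", "password": "secret", "driver": "com.mysql.cj.jdbc.Driver"}  # hypothetical credentials

# Each executor reads its share of the source table over JDBC.
df = spark.read.jdbc(url=jdbc_url, table="events", properties=props)

# Results go straight back into another MySQL table, again over JDBC.
df.groupBy("user_id").count().write.jdbc(url=jdbc_url, table="event_counts", mode="overwrite", properties=props)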

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system, say Lustre, or anything other than HDFS or S3?
We are able to create a distributed file system framework using a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS provided every node has access to common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
Things like Lustre tend to do that and/or have a specific Hadoop filesystem client lib which wraps/extends that.
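For example, a minimal sketch assuming every node mounts the same shared filesystem (e.g. Lustre) at the same path (the /mnt/lustre paths and the "value" column below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-mount").getOrCreate()  # hypothetical app name

# file:/// works as the "cluster FS" only because /mnt/lustre is the same
# shared mount on the driver and on every worker (hypothetical path).
df = spark.read.parquet("file:///mnt/lustre/input/")
df.filter(df["value"] > 0).write.parquet("file:///mnt/lustre/output/")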

Spark without HDFS in cluster mode: Which data is stored where?

I am using Spark 1.5 without HDFS in cluster mode to build an application. I was wondering: when I have a save operation, e.g.,
df.write.parquet("...")
which data is stored where? Is all the data stored at the master, or is each worker storing its local data?
Generally speaking, all worker nodes will write to their own local file systems, with the driver writing only a _SUCCESS file.
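To make that concrete, here is a small sketch (it uses the newer SparkSession API rather than Spark 1.5's SQLContext, and the output path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("where-does-it-land").getOrCreate()  # hypothetical app name
df = spark.range(1000)  # stand-in DataFrame

# Without a shared filesystem, each executor writes its own part-*.parquet files
# under this path on its own node; the driver writes only the _SUCCESS marker
# into its own copy of the same path (path is hypothetical).
df.write.parquet("file:///tmp/spark-out")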

Using Apache Spark with HDFS vs. other distributed storage

The Spark FAQ specifically says one doesn't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes' storage (for checkpoints, shuffle spill, etc.)?
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS allows you to view and change files on remote machines as if they were stored on a local machine.
HDFS can also do that, but it is distributed (as opposed to NFS), as well as fault-tolerant and scalable.
The advantage of NFS is the simplicity of setup, so I would probably use it for QA environments or small clusters.
The advantage of HDFS is of course its fault tolerance, but a bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark nodes, which gives the best performance for checkpoints, shuffle spill, etc.
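As a small illustration of the checkpoint case, here is a sketch assuming either an HDFS cluster or an NFS mount visible at the same path on every node (both paths below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-target").getOrCreate()  # hypothetical app name

# Checkpoints go wherever this directory lives; on HDFS they also benefit from
# locality when the DataNodes run alongside the Spark executors.
spark.sparkContext.setCheckpointDir("hdfs:///user/me/checkpoints")  # hypothetical HDFS path
# Or, on a shared NFS mount:
# spark.sparkContext.setCheckpointDir("file:///mnt/nfs/checkpoints")  # hypothetical NFS path

rdd = spark.sparkContext.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()   # truncates the lineage by persisting the RDD to the checkpoint dir
rdd.count()        # the checkpoint is materialized when the RDD is first computed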

Distributed storage for Spark

Official guide says:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or whatever? Or can I just copy the input across all nodes and not bother with NFS, HDFS, etc.?
Spark does not depend on a distributed file system for shuffle. Unlike traditional MapReduce, Spark doesn't need to write intermediate results to HDFS (or a similar system); instead, Spark achieves resiliency by tracking the lineage of the data and, in the event of a node failure, recomputing any data that was on that node.
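A small sketch of what that lineage tracking looks like (output formatting varies by Spark version): toDebugString() prints the chain of transformations Spark would replay to rebuild lost partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()  # hypothetical app name

rdd = (spark.sparkContext
       .parallelize(range(1000))
       .map(lambda x: x * 2)
       .filter(lambda x: x % 3 == 0))

# The lineage (parallelize -> map -> filter) is what Spark re-runs, for the
# affected partitions only, if a node holding part of this RDD is lost.
print(rdd.toDebugString().decode("utf-8"))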
