Why do Apache Spark nodes need access to datafile path? - apache-spark

I am new to Apache Spark.
I have a cluster with a master and one worker. I am connected to master with pyspark (all are on Ubuntu VM).
I am reading this documentation: RDD external-datasets
in particular I have executed:
distFile = sc.textFile("data.txt")
I understand that this creates an RDD from file, which should be managed by the driver, hence by pyspark app.
But the doc states:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
Question is why do workers need access to the file path if the RDD is created by the driver only (afterwards distributed to the nodes)?

Related

How to specify where to read data from HDFS when submitting Spark application?

I have been trying to deploy a spark multi-node cluster on three machines (master, slave1 and slave2). I have successfully deployed the spark cluster but I am confused about how to distribute my HDFS data over the slaves? Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.
tl;dr Store files to be processed by a Spark application on Hadoop HDFS and Spark executors will be told how to access them.
From HDFS Users Guide:
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
So, HDFS is a mere file system that you can use to store files and use them in a distributed application, incl. a Spark application.
To my great surprise, it's only in HDFS Architecture where you can find a HDFS URI, i.e. hdfs://localhost:8020/user/hadoop/delete/test1 that is a HDFS URL to a resource delete/test1 that belongs to the user hadoop.
The URL that start with hdfs points at a HDFS that in the above example is managed by a NameNode at localhost:8020.
That means that HDFS does not require Hadoop YARN, but is usually used together because they come together and is just simple to use together.
Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?
Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
Wrapping it up, just use hdfs://hostname:port/path/to/directory with to access files on HDFS.

How to load some files into Spark nodes without duplication?

I have some text files on master server to be processed by a Spark cluster for some statistics purpose.
For example, I have 1.txt, 2.txt,3.txt on master server in a specified directory like /data/.I want use a Spark cluster to process all of them one times. If I use sc.textFile("/data/*.txt") to load all files, other node in cluster cannot find these files on local file system. However if I use sc.addFile and SparkFiles.get to achieve them on each node, the 3 text files will be downloaded to each node and all of them will be processed multi times.
How to solve it without HDFS? Thanks.
According to official document, just copy all files to all nodes.
http://spark.apache.org/docs/1.2.1/programming-guide.html#external-datasets
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

Spark: hdfs cluster mode

I'm just getting started using Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file which is kept in Hadoop (hdfs). I am using the textFile method from SparkContext; while the file is being processing I monitorize the nodes and I can see that just the slave2 is working. After processing, slave2 has tasks but slave1 has no task.
If instead of using a hdfs I use a local file then both slaves work simultaneously.
I don't get why this behaviour. Please, can anybody give me a clue?
The main reason of that behavior is the concept of data locality. When Spark's Application Master asks for the creation of new executors, they are tried to be allocated in the same node where data resides.
I.e. in your case, HDFS is likely to have written all the blocks of the file on the same node. Thus Spark will instantiate the executors on that node. Instead, if you use a local file, it will be present in all nodes, so data locality won't be an issue anymore.

Distributed storage for Spark

Official guide says:
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
Does Spark need some sort of distributed file system for shuffle or whatever? Or can I just copy input across all nodes and don't bother with NFS, HDFS etc?
Spark does not depend on a distirbuted file system for shuffle. Unlike traditional map reduce, Spark doesn't need to write to HDFS (or similar) system, instead Spark achieves resiliency by tracking the lineage of the data and using that in the event of node failure by re-computing any data which was on that node.

Using Spark Shell (CLI) in standalone mode on distributed files

I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (CLI - which I started on the master machine) : I simply put a file on the localfs on the master node (workers do NOT have a copy of this file) and I simply run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the words count results correctly.
My Questions are:
1) This contradicts with what spark documentation (on using External Datasets) say as:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to workers, so how does it work ? (Is it because spark-shell does NOT really launch jobs on the cluster, and does the computation locally (It is weird as I do NOT have a worker running on the node, I started shell on)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit into one machine) through Spark's thrift server (like the way beeline or hiveserver2 is used in Hive) , do I need to put the files on NFS so each worker can see the whole file, or is it possible that I create chunks out of the files, and put each smaller chunk (which would fit on a single machine) on each worker, and then use multiple paths (comma separated) to pass them all to the submitted queries ?
The problem is that you are running the spark-shell locally. The default for running a spark-shell is as --master local[*], which will run your code locally on as many cores as you have. If you want to run against your workers, then you will need to run with the --master parameter specifying the master's entry point. If you want to see the possible options you can use with spark-shell, just type spark-shell --help
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes and the manager will handle the fetching as appropriate. I am not as familiar with NFS and if it has this capability, though

Resources