How to load some files into Spark nodes without duplication? - apache-spark

I have some text files on the master server that need to be processed by a Spark cluster for statistics purposes.
For example, I have 1.txt, 2.txt and 3.txt on the master server in a specific directory such as /data/. I want to use the Spark cluster to process all of them exactly once. If I use sc.textFile("/data/*.txt") to load all the files, the other nodes in the cluster cannot find them on their local file systems. However, if I use sc.addFile and SparkFiles.get to fetch them on each node, the three text files are downloaded to every node and all of them get processed multiple times.
How can I solve this without HDFS? Thanks.

According to the official documentation, you just have to copy all the files to all the nodes:
http://spark.apache.org/docs/1.2.1/programming-guide.html#external-datasets
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
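For example, a minimal PySpark sketch of that approach (the /data/ path comes from the question; the app name is just illustrative, and it assumes the three files have already been copied to the same path on every node):
from pyspark import SparkContext

sc = SparkContext(appName="stats")

# The file:// scheme forces the local filesystem; because the same path exists
# on every node, each file is read exactly once and split into partitions.
lines = sc.textFile("file:///data/*.txt")
print(lines.count())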

Related

Why do Apache Spark nodes need access to datafile path?

I am new to Apache Spark.
I have a cluster with a master and one worker. I am connected to the master with pyspark (all machines are Ubuntu VMs).
I am reading this documentation: RDD external-datasets
in particular I have executed:
distFile = sc.textFile("data.txt")
I understand that this creates an RDD from the file, which should be managed by the driver, hence by the pyspark app.
But the doc states:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
The question is: why do the workers need access to the file path if the RDD is created by the driver only (and afterwards distributed to the nodes)?
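One way to see why, sketched in the same pyspark session (the path is purely illustrative): textFile is lazy, so the driver only records the path; the actual reading is done by the executors on the worker nodes when an action runs, which is why the path must also resolve on their filesystems.
distFile = sc.textFile("file:///home/user/data.txt")   # nothing is read yet, only the path is recorded

# The action schedules tasks on the workers; each task opens its own split of
# /home/user/data.txt locally, so the file must exist there as well.
total_chars = distFile.map(lambda line: len(line)).reduce(lambda a, b: a + b)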

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system such as Lustre, or anything other than HDFS or S3?
We are able to create a distributed file system on a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS provided every node has access to the same common paths, and your app is configured to use those common paths for sharing source libraries, source data, intermediate data and final data.
Things like Lustre tend to do that, and/or have a specific Hadoop filesystem client library which wraps/extends that.
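As a sketch (assuming an existing SparkContext sc and a hypothetical Lustre/NFS mount at /lustre/shared that every node sees at the same path), both input and output can then go through plain file:/// URIs:
# /lustre/shared is a hypothetical common mount, identical on every node
raw = sc.textFile("file:///lustre/shared/input/*.log")
counts = raw.flatMap(lambda line: line.split()) \
            .map(lambda w: (w, 1)) \
            .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("file:///lustre/shared/output/wordcounts")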

Unable to save RDD's and DF's in a Spark Cluster

When running in single-node (no-cluster) mode, whenever I do rdd.saveAsTextFile("file://...") or df.write().csv("file://...") it creates a folder at that path containing part-files and a file called _SUCCESS.
But when I use the same code in cluster mode, it doesn't work. It doesn't throw any errors, but no part-files are created in that folder. The folder and the _SUCCESS file are created, but the actual part-file data is not.
I am not sure what exactly the problem is here. Any suggestions on how to solve this are greatly appreciated.
In cluster mode, tasks are performed on the worker machines.
You should save the output to Hadoop, S3 or some file server such as FTP if you are running in cluster mode.
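In other words, each executor writes its part-files to its own local disk, so the driver's machine ends up with only the folder and the _SUCCESS marker. A minimal sketch of writing somewhere every node can see (the /mnt/shared mount is hypothetical; hdfs:// or s3a:// URIs work the same way once configured):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-demo").getOrCreate()
df = spark.range(100)   # toy DataFrame just for the demo

# In cluster mode every executor writes its own part-files, so the target path
# must be visible to all nodes (and to whoever reads the result afterwards).
df.write.csv("file:///mnt/shared/output/")   # /mnt/shared: a shared mount present on every node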

csv data processing in spark standalone mode

I have two nodes; let's call them A (192.168.2.100) and B (192.168.2.200).
A runs both a master and a worker.
On node A:
./bin/spark-class org.apache.spark.deploy.master.Master
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.2.100:7077
B runs a worker only:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.2.100:7077
My app needs to load a CSV file to process.
On node A I run:
./spark-submit --class "myApp" --master spark://192.168.2.100:7077 /spark/app.jar
But it fails with an error saying the CSV file is needed on node B.
Is there any way to share this file with node B?
Do I really need YARN or Mesos to do this?
All the data files you want to process should be accessible from all of your workers (and be sure that your driver can be reached by your workers).
So you need to put your data files in a place from which the workers can read them; in most situations, we put the data files into HDFS.
As stated before, the file has to be available on every node. So you either keep multiple copies, one per node, or you use an external Hadoop data source (HDFS, Cassandra, Amazon S3). There is another, easier solution: you can use NFS and mount a remote drive/partition/location on every node. This way you don't need multiple copies and you don't have to learn about external storage. You can even use sshfs if you want a secure mount point over SSH.
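As a concrete sketch (in PySpark for brevity; /mnt/shared and input.csv are placeholders for an NFS or sshfs export that both A and B mount at the same path):
from pyspark import SparkContext

sc = SparkContext(master="spark://192.168.2.100:7077", appName="csv-demo")

# Both A and B see the same export at /mnt/shared, so whichever worker gets a
# task can read its split of the CSV.
rows = sc.textFile("file:///mnt/shared/data/input.csv") \
         .map(lambda line: line.split(","))
print(rows.count())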

Using Spark Shell (CLI) in standalone mode on distributed files

I am using Spark 1.3.1 in standalone mode (no YARN/HDFS involved, only Spark) on a cluster with 3 machines. I have a dedicated master node (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (the CLI, which I started on the master machine): I simply put a file on the local filesystem of the master node (the workers do NOT have a copy of this file) and run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the correct count.
My Questions are:
1) This contradicts what the Spark documentation (on using external datasets) says:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to the workers, so how does it work? (Is it because spark-shell does not really launch jobs on the cluster and does the computation locally? That would be strange, since I do NOT have a worker running on the node I started the shell on.)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit on one machine) through Spark's Thrift server (the way beeline or HiveServer2 is used in Hive), do I need to put the files on NFS so each worker can see the whole file, or can I split the files into chunks, put each smaller chunk (which fits on a single machine) on a worker, and then pass them all to the submitted queries as multiple comma-separated paths?
The problem is that you are running spark-shell locally. The default master for spark-shell is local[*], which runs your code locally on as many cores as you have. If you want to run against your workers, you need to run with the --master parameter specifying the master's entry point. To see all the options you can use with spark-shell, just type spark-shell --help.
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes and the manager will handle fetching as appropriate. I am not as familiar with NFS and whether it has this capability, though.
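To illustrate the point (a pyspark sketch of the same check; the host name is a placeholder): you can print which master the shell actually connected to, and only when it is the standalone master does the "same path on every worker" requirement bite.
# launched with: $SPARK_HOME/bin/pyspark --master spark://master-host:7077   (placeholder host)
print(sc.master)   # the master URL this shell connected to; 'local[*]' would mean it all runs on the driver

f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
f.count()   # against a real cluster this fails unless the path exists on every worker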
