Unable to save RDDs and DataFrames in a Spark cluster - apache-spark

When running in single-node (non-cluster) mode, whenever I call rdd.saveAsTextFile("file://...") or df.write().csv("file://...") it creates a folder at that path containing part files and a file called _SUCCESS.
But when I use the same code in cluster mode, it doesn't work. It doesn't throw any errors, but no part files are created in that folder. The folder and the _SUCCESS file are created, but the part files with the actual data are not.
I am not sure what exactly the problem is here. Any suggestions on how to solve this are greatly appreciated.

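In cluster mode, tasks run on the worker machines, so a file:// path refers to each worker's own local filesystem rather than to the machine you submitted the job from.
If you are running in cluster mode, save the output to storage that every node can reach, such as HDFS, S3, or a file server (for example FTP).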
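A small guard along these lines could flag a node-local output path before the job runs. It is only a sketch: the function name `is_node_local` and the example URIs (`namenode:8020`, `my-bucket`) are made up for illustration, and it inspects nothing but the URI scheme rather than asking Spark or Hadoop how the path will actually be resolved.

```python
from urllib.parse import urlparse


def is_node_local(output_uri: str) -> bool:
    """Return True when the URI explicitly targets each node's own local disk.

    Hypothetical helper: it only looks at the scheme. A path with no scheme
    is resolved against the cluster's configured default filesystem, which
    this sketch cannot see, so it is reported as not explicitly local.
    """
    return urlparse(output_uri).scheme == "file"


# Example URIs; host, port, and bucket names are placeholders.
for uri in ("file:///tmp/out", "hdfs://namenode:8020/out", "s3a://my-bucket/out"):
    kind = "node-local" if is_node_local(uri) else "not explicitly node-local"
    print(f"{uri} -> {kind}")
```

If the check returns True while the job is submitted to a cluster, the part files can be expected to end up scattered across the workers' disks, which matches the symptom described in the question.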

Related

How to load some files into Spark nodes without duplication?

I have some text files on the master server that I want a Spark cluster to process for statistics.
For example, I have 1.txt, 2.txt, and 3.txt on the master server in a directory such as /data/. I want to use a Spark cluster to process each of them exactly once. If I use sc.textFile("/data/*.txt") to load all the files, the other nodes in the cluster cannot find them on their local file systems. However, if I use sc.addFile and SparkFiles.get to fetch them on each node, all 3 text files are downloaded to every node and each of them is processed multiple times.
How can I solve this without HDFS? Thanks.
According to the official documentation, just copy all the files to all nodes.
http://spark.apache.org/docs/1.2.1/programming-guide.html#external-datasets
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

Output file is getting generated on slave machine in apache spark

I am facing an issue while running a Spark Java program that reads a file, does some manipulation, and then generates an output file at a given path.
Everything works fine when the master and slaves are on the same machine, i.e. in standalone cluster mode.
But the problem started when I deployed the same program in a multi-machine, multi-node cluster setup, where the master is running on x.x.x.102 and the slave is running on x.x.x.104.
The master and slave have shared their SSH keys and are reachable from each other.
Initially the slave was not able to read the input file. I learned that I need to call sc.addFile() before sc.textFile(), and that solved the issue. But now I see that the output is being generated on the slave machine in a _temporary folder under the output path, i.e. /tmp/emi/_temporary/0/task-xxxx/part-00000
In local cluster mode it works fine and generates the output file at /tmp/emi/part-00000.
I learned that I need to use SparkFiles.get(), but I am not able to understand how and where to use this method.
So far I am using:
DataFrame dataObj = ...
dataObj.javaRDD().coalesce(1).saveAsTextFile("file:/tmp/emi");
Can anyone please tell me how to call SparkFiles.get()?
In short, how can I tell the slave to create the output file on the machine where the driver is running?
Please help.
Thanks a lot in advance.
There is nothing unexpected here. Each worker writes its own part of the data separately. Using the file scheme only means that the data is written to a file in the file system that is local from the worker's perspective.
SparkFiles is not applicable in this particular case. It can be used to distribute common files to the worker machines, not to deal with the results.
If for some reason you want to perform the writes on the machine that runs the driver code, you'll have to fetch the data to the driver machine first (either with collect, which requires enough memory to fit all the data, or with toLocalIterator, which collects one partition at a time and requires multiple jobs) and then use standard tools to write the results to the local file system. In general, though, writing to the driver is not good practice and most of the time is simply useless.
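The driver-side write described above can be sketched in Python roughly as follows. The helper name `write_rows_locally` is hypothetical, and the snippet uses a plain list in place of a real RDD so that it runs without a cluster. With PySpark the iterable would come from `rdd.toLocalIterator()` (or `rdd.collect()` if everything fits in driver memory); the Java and Scala RDD APIs have methods of the same names.

```python
import os
import tempfile
from typing import Iterable


def write_rows_locally(rows: Iterable[str], path: str) -> int:
    """Write each row as one line to a file on the current machine.

    Rows are consumed one at a time, so this function never holds the
    whole dataset itself. Returns the number of rows written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(f"{row}\n")
            count += 1
    return count


# Stand-in for data fetched to the driver; with Spark this could be
# rdd.toLocalIterator(), which pulls one partition at a time.
rows = ["a,1", "b,2", "c,3"]
target = os.path.join(tempfile.mkdtemp(), "part-00000")
written = write_rows_locally(rows, target)
print(f"wrote {written} rows to {target}")
```

Because the function must run in the driver process, the resulting file lands on the driver machine only, which is the behaviour the question asks for, at the cost of funnelling all data through a single node.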

Spark: hdfs cluster mode

I'm just getting started with Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file that is stored in Hadoop (HDFS). I am using the textFile method from SparkContext; while the file is being processed I monitor the nodes and can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none.
If I use a local file instead of HDFS, both slaves work simultaneously.
I don't understand this behaviour. Can anybody give me a clue?
The main reason for that behavior is data locality. When Spark's Application Master asks for new executors to be created, it tries to have them allocated on the same node where the data resides.
In your case, HDFS has likely written all the blocks of the file to the same node, so Spark will instantiate the executors on that node. If you use a local file instead, it is present on all nodes, so data locality is no longer an issue.

Spark saveAsNewAPIHadoopFile works on local mode but not on Cluster mode

After upgrading to CDH5.4 and Spark Streaming 1.3, I'm encountering a strange issue where saveAsNewAPIHadoopFile is no longer saving files to HDFS as it's supposed to. I can see the _temp directory being generated, but when the save is complete, _temp is removed, leaving the directory empty with just a SUCCESS file. I have a feeling that the files are generated but could not be moved out of the _temp directory before it was deleted.
This issue only happens when running on the Spark cluster (standalone mode). If I run the job with local Spark, the files are saved as expected.
Some help would be appreciated.
Are you running this on your laptop/desktop?
One way this can happen is if the path you use for your output is a relative path on NFS. In that case, Spark assumes the relative path is hdfs://, not file://, and can't write out to disk.

Using Spark Shell (CLI) in standalone mode on distributed files

I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (the CLI, which I started on the master machine). I put a file on the local filesystem of the master node (the workers do NOT have a copy of this file) and run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the count correctly.
My questions are:
1) This contradicts what the Spark documentation (on using External Datasets) says:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to the workers, so how does it work? Is it because spark-shell does NOT really launch jobs on the cluster and does the computation locally? (That would be odd, as I do NOT have a worker running on the node where I started the shell.)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit on one machine) through Spark's Thrift server (the way beeline or hiveserver2 is used in Hive), do I need to put the files on NFS so each worker can see the whole file? Or can I split the files into chunks, put each smaller chunk (which would fit on a single machine) on a worker, and then pass them all to the submitted queries as multiple comma-separated paths?
The problem is that you are running the spark-shell locally. By default spark-shell runs with --master local[*], which runs your code locally on as many cores as you have. If you want to run against your workers, you need to pass the --master parameter with the master's entry point. To see the options you can use with spark-shell, just type spark-shell --help
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes and the manager will handle the fetching as appropriate. I am not as familiar with NFS and whether it has this capability, though.
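For illustration, the command line for a shell attached to a standalone master can be assembled as below. The helper `spark_shell_command` and the host name `master-host` are placeholders. 7077 is the default port of a standalone master, but the actual URL should be taken from the master's web UI or logs.

```python
from typing import List


def spark_shell_command(master_host: str, port: int = 7077) -> List[str]:
    """Build the argument list for a spark-shell attached to a standalone master.

    Hypothetical helper: it only formats the arguments and does not start Spark.
    """
    master_url = f"spark://{master_host}:{port}"
    return ["spark-shell", "--master", master_url]


# "master-host" is a placeholder for the real master's host name or IP.
print(" ".join(spark_shell_command("master-host")))
# prints: spark-shell --master spark://master-host:7077
```

Once the shell is started this way, a file:// path is read by the workers, so the file then has to exist at the same path on every worker, as the documentation quoted in the question says.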

Resources