I have a Spark application which currently runs in local mode and writes its output to a file in a local UNIX directory.
Now, I want to run the same job in YARN cluster mode and still write into that UNIX folder.
Can I use the same saveAsTextFile(path)?
Yes, you can, but it is not best practice to do that. Spark itself can run standalone or on top of a distributed file system. The reason for using a distributed file system is that the input data is huge and the expected output might be huge as well.
So, if you are completely sure that the output will fit into your local file system, go for it; otherwise you can save the output to HDFS and then copy it to your local storage using the command below.
bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path
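As a minimal sketch (the paths below are hypothetical, and sc is assumed to be the SparkContext as in spark-shell), the job writes its output to HDFS and you copy it down afterwards:
val result = sc.textFile("hdfs:///data/input.txt").map(_.toUpperCase)
result.saveAsTextFile("hdfs:///data/output")
// afterwards, from the shell: bin/hadoop fs -copyToLocal /data/output /local/unix/dir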
Related
Is there a way to run Spark or Flink on a distributed file system, say Lustre, or anything other than HDFS or S3?
We are able to create a distributed file system framework using a Unix cluster; can we run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS, provided every node has access to common paths and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data.
Things like Lustre tend to do that and/or have a specific Hadoop filesystem client library which wraps/extends that.
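As a minimal sketch, assuming /lustre/shared is a hypothetical mount point visible at the same path on every node, file:/// URIs then behave like a shared file system from Spark's point of view:
val events = sc.textFile("file:///lustre/shared/input/events.log")
events.filter(_.contains("ERROR")).saveAsTextFile("file:///lustre/shared/output/errors")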
I want to use local text files in my Spark program, which I am running in the HDP 2.5 Sandbox in VMware.
1) Is there any drag and drop way to directly get it in the HDFS of the VM?
2) Can I import it using Zeppelin? If yes, then how to get the absolute path (location) to use it in Spark?
3) Any other way? What and how, if yes?
To get data into HDFS within your VM, you will need to use the hdfs command to push the files from the VM's local file system into HDFS inside the VM. The command should look something like:
hadoop fs -put filename.log /my/hdfs/path
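You can then verify that the file landed in HDFS with, for example:
hadoop fs -ls /my/hdfs/path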
For more information on HDFS commands, please refer to Hadoop File System Shell Commands.
That said, as you are using Apache Spark, you can also refer to the local file system instead of HDFS. To do this, you would use file:///... instead of hdfs://.... For example, to access a file within HDFS via Spark, you would usually run a command like:
val mobiletxt = sc.textFile("/data/filename.txt")
but you can also access the VM's local file system like:
val mobiletxt = sc.textFile("file:///home/user/data/filename.txt")
As for Apache Zeppelin, this is a notebook interface for working with Apache Spark (and other systems); there currently is no import mechanism within Zeppelin itself. Instead, you will do something like the above within your notebook to access either the VM's HDFS or its local file system.
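For example, a Zeppelin Spark paragraph could look like the following sketch after pushing the file into HDFS with hadoop fs -put (the path is hypothetical):
val mobiletxt = sc.textFile("hdfs:///my/hdfs/path/filename.log")
mobiletxt.take(5).foreach(println)  // quick sanity check of the first few lines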
I am facing an issue while running a Spark Java program that reads a file, does some manipulation, and then generates an output file at a given path.
Everything works fine when the master and slaves are on the same machine, i.e. in standalone-cluster mode.
But the problem started when I deployed the same program on a multi-machine, multi-node cluster setup. That means the master is running at x.x.x.102 and a slave is running on x.x.x.104.
Both the master and slave have shared their SSH keys and are reachable from each other.
Initially the slave was not able to read the input file; for that, I came to know I need to call sc.addFile() before sc.textFile(), and that solved the issue. But now I see the output is being generated on the slave machine in a _temporary folder under the output path, i.e. /tmp/emi/_temporary/0/task-xxxx/part-00000.
In local cluster mode it works fine and generates the output file in /tmp/emi/part-00000.
I came to know that I need to use SparkFiles.get(), but I am not able to understand how and where to use this method.
Till now I am using:
DataFrame dataObj = ...
dataObj.javaRDD().coalesce(1).saveAsTextFile("file:/tmp/emi");
Can anyone please let me know how to call SparkFiles.get()?
In short, how can I tell the slave to create the output file on the machine where the driver is running?
Please help.
Thanks a lot in advance.
There is nothing unexpected here. Each worker writes its own part of the data separately. Using the file scheme only means that the data is written to a file in the file system that is local from the worker's perspective.
Regarding SparkFiles, it is not applicable in this particular case. SparkFiles can be used to distribute common files to the worker machines, not to deal with the results.
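For reference, a small sketch of what SparkFiles is meant for, i.e. shipping a common side file to every node (the file name and path below are hypothetical):
sc.addFile("/local/path/lookup.csv")  // distribute a side file to every executor
val sample = sc.parallelize(Seq(1)).map { _ =>
  // inside a task, SparkFiles.get resolves the worker-local copy of the shipped file
  scala.io.Source.fromFile(org.apache.spark.SparkFiles.get("lookup.csv")).getLines().mkString("\n")
}.first()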
If for some reason you want to perform the writes on the machine used to run the driver code, you'll have to fetch the data to the driver machine first (either with collect, which requires enough memory to fit all the data, or with toLocalIterator, which collects one partition at a time and requires multiple jobs) and use standard tools to write the results to the local file system. In general, though, writing through the driver is not good practice and most of the time it is simply useless.
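A minimal sketch of that approach (not the exact code from the question; it assumes dataObj is a DataFrame small enough to stream through the driver, and the output file name simply mirrors the one from the question):
import java.io.PrintWriter
val writer = new PrintWriter("/tmp/emi/part-00000")
try {
  // toLocalIterator pulls one partition at a time to the driver
  dataObj.rdd.map(_.mkString(",")).toLocalIterator.foreach(line => writer.println(line))
} finally {
  writer.close()
}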
I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (CLI, which I started on the master machine): I simply put a file on the local FS on the master node (the workers do NOT have a copy of this file) and I simply run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the count results correctly.
My Questions are:
1) This contradicts what the Spark documentation (on using External Datasets) says:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to the workers, so how does it work? (Is it because spark-shell does NOT really launch jobs on the cluster and does the computation locally? That would be weird, as I do NOT have a worker running on the node I started the shell on.)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit onto one machine) through Spark's Thrift server (the way beeline or HiveServer2 is used with Hive), do I need to put the files on NFS so each worker can see the whole file, or is it possible to create chunks out of the files, put each smaller chunk (which would fit on a single machine) on a different worker, and then use multiple paths (comma separated) to pass them all to the submitted queries?
The problem is that you are running spark-shell locally. The default for spark-shell is --master local[*], which will run your code locally on as many cores as you have. If you want to run against your workers, then you will need to pass the --master parameter specifying the master's entry point. If you want to see the possible options you can use with spark-shell, just type spark-shell --help.
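As a concrete example (the host is the master address from the question; the default standalone port 7077 is an assumption), you would start the shell against the cluster like:
$SPARKHOME/bin/spark-shell --master spark://x.x.x.102:7077
and then re-run the same textFile/count snippet. Note that with a file:// path the file must then also exist at the same path on the workers, exactly as the documentation quote says.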
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes and the manager will handle the fetching as appropriate. I am not as familiar with NFS and whether it has this capability, though.
What are the differences between a Linux file system and the Hadoop file system? I know a few of them; I just wanted to know more details.
See this similar question.
First of all, you cannot directly compare the Linux file system with HDFS. But, to the best of my knowledge:
HDFS: the name itself says that it is a distributed file system, where the data is stored as blocks spread across the nodes of a cluster.
HDFS is write-once, read-many, but the local file system is write-many, read-many.
The local file system is the default storage that comes with the OS, while HDFS is the file system for the Hadoop framework; refer here: HDFS.
HDFS is another layer on top of the local file system.
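As a small illustration (the paths are hypothetical), the two are browsed through different commands and namespaces:
ls /data/logs
hadoop fs -ls /data/logs
The first lists a directory on a single machine's local disk; the second lists an HDFS directory whose blocks are spread across the DataNodes.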
Refer to the links below to see the differences:
Linux File System
Hadoop File System