I am running a Spark program on my Windows machine. I want to load data from SQL into HDFS. I was able to load the data into a data frame and am using the query below to write it to HDFS (which is on my Cloudera VM). Below is the syntax I found. Can someone please guide me on where to find my Cloudera cluster details in my VM, i.e. what to pass after hdfs://? I expect /user/hdfs/test/ represents the directory structure in the URL below.
df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")
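For example, is the full URL supposed to look something like the line below? The host and port here are only my guess at the Cloudera QuickStart VM defaults (quickstart.cloudera:8020 for the NameNode), so please correct me if that is wrong.
// quickstart.cloudera:8020 is only a guess at the NameNode host:port on the Cloudera QuickStart VM
df.write.csv("hdfs://quickstart.cloudera:8020/user/hdfs/test/example.csv")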
I am new to Apache Spark.
I have a cluster with a master and one worker. I am connected to the master with pyspark (all are Ubuntu VMs).
I am reading this documentation: RDD external-datasets
In particular, I have executed:
distFile = sc.textFile("data.txt")
I understand that this creates an RDD from the file, which should be managed by the driver, hence by the pyspark app.
But the doc states:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
My question is: why do workers need access to the file path if the RDD is created by the driver only (and afterwards distributed to the nodes)?
I have a huge file stored in S3, which I am loading into my Spark cluster, and I want to invoke a custom Java library which takes an input file location, processes the data, and writes to a given output location. However, I cannot rewrite that custom logic in Spark.
I am trying to see whether I can load the file from S3, save each partition to local disk, give that location to the custom Java app, and once it is processed, load all the partitions back and save them to S3.
Is this possible? From what I have read so far it looks like I need to use the RDD API, but I couldn't find more info on how I can save each partition to local disk.
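Here is a rough sketch of what I have in mind, assuming the RDD API; the bucket name, the local directory, and the call into the custom Java library are placeholders only:
import java.io.{File, PrintWriter}

// Sketch: read the S3 file, dump each partition to a local file on the executor,
// then hand that local path to the custom Java library (placeholder call below).
val rdd = sc.textFile("s3a://my-bucket/huge-input-file.txt")

rdd.foreachPartition { partition =>
  val localFile = File.createTempFile("partition-", ".txt", new File("/tmp"))
  val writer = new PrintWriter(localFile)
  try partition.foreach(line => writer.println(line)) finally writer.close()
  // CustomJavaLibrary.process(localFile.getAbsolutePath, "/tmp/output")  // placeholder call
}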
Appreciate any inputs.
I have been trying to deploy a Spark multi-node cluster on three machines (master, slave1 and slave2). I have successfully deployed the Spark cluster, but I am confused about how to distribute my HDFS data over the slaves. Do I need to manually put data on my slave nodes, and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.
tl;dr Store files to be processed by a Spark application on Hadoop HDFS and Spark executors will be told how to access them.
From HDFS Users Guide:
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
So, HDFS is simply a file system that you can use to store files and access them from a distributed application, including a Spark application.
To my great surprise, it's only in HDFS Architecture that you can find an example HDFS URI, i.e. hdfs://localhost:8020/user/hadoop/delete/test1, which is an HDFS URL for the resource delete/test1 that belongs to the user hadoop.
A URL that starts with hdfs points at an HDFS cluster, which in the above example is managed by a NameNode at localhost:8020.
That means that HDFS does not require Hadoop YARN; the two are usually used together simply because they ship together and are easy to use together.
Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?
Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
Wrapping it up, just use hdfs://hostname:port/path/to/directory to access files on HDFS.
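For example, a minimal sketch (the host, port, and path below are placeholders; 8020 is a common default NameNode port):
// Placeholders: replace namenode-host, 8020 and the path with your cluster's actual values
val lines = sc.textFile("hdfs://namenode-host:8020/path/to/directory/data.txt")
println(lines.count())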
I want to use local text files in my Spark program which I am running in HDP 2.5 Sandbox in VMWare.
1) Is there any drag-and-drop way to get it directly into the HDFS of the VM?
2) Can I import it using Zeppelin? If yes, then how to get the absolute path (location) to use it in Spark?
3) Any other way? What and how, if yes?
To get data into HDFS within your VM, you will need to use the HDFS command-line shell to push the files from the VM's local file system into HDFS. The command should look something like:
hadoop fs -put filename.log /my/hdfs/path
For more information on HDFS commands, please refer to Hadoop File System Shell Commands.
That said, as you are using Apache Spark, you can also refer to the local file system instead of HDFS. To do this, you would use file:///... instead of hdfs://.... For example, to access a file within HDFS via Spark, you would usually run a command like:
val mobiletxt = sc.textFile("/data/filename.txt")
but you can also access the VM's local file system like:
val mobiletxt = sc.textFile("file:///home/user/data/filename.txt")
As for Apache Zeppelin, it is a notebook interface for working with Apache Spark (and other systems); there is currently no import mechanism within Zeppelin itself. Instead, you will do something like the above within your notebook to access either the VM's HDFS or its local file system.
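For instance, a Zeppelin paragraph bound to the Spark interpreter might look like the sketch below (the path is a placeholder):
%spark
// Placeholder path inside the sandbox's HDFS
val mobiletxt = sc.textFile("/data/filename.txt")
mobiletxt.take(5).foreach(println)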
I am new to Spark and need help figuring out why my Hive databases are not accessible for performing a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x, on a VirtualBox.
I have downloaded pre-built Spark 1.3.1.
I am using the Hive bundled with the VM and can run Hive queries through spark-shell and the Hive command line without any issue. This includes running the command:
LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into Hive. I have saved the results of the Cassandra read in parquet format in a folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I:
1) Copied hive-site.xml to the Spark conf folder.
2) Made a change to this XML: I noticed that I had two hive-site.xml files, one which was auto-generated and another which had the Hive execution parameters, and I combined both into a single hive-site.xml.
Code used (Java):
// Create a HiveContext from the existing JavaSparkContext
HiveContext hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc));

// List the databases Spark can see in the Hive metastore
hiveContext.sql("show databases").show();

// Load the Parquet output into the partitioned Hive table
hiveContext.sql("LOAD DATA INPATH "
    + "'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' "
    + "INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
So, this worked, and I could load data into Hive. Except that after I restarted my VM, it stopped working.
When I run the show databases Hive query, I get a result saying
result
default
instead of the databases in Hive, which are
default
test_spark
I also notice a folder called metastore_db being created in my project folder. From googling around, I know this happens when Spark can't connect to the Hive metastore, so it creates one of its own. I thought I had fixed that, but clearly not.
What am I missing?