How to write data into a Hive table using Spark remotely? - apache-spark

I'm new to the Hadoop world. I have installed Spark 2.3.1 on my Windows machine and installed Cloudera inside a VM on the same machine. I'm doing some data transformation in the form of a DataFrame using spark-shell. Now I want to put this data into Hive, which is in Cloudera, using Spark. I have googled and did the following steps.
1) Copied all the files from /etc/hive/conf and pasted them into spark/conf on my Windows machine.
2) In the Windows spark/conf, opened hive-site.xml and changed the property as below.
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://MyclouderaIP:9083</value>
</property>
</configuration>
3) Put a host entry in the Windows hosts file C:\Windows\System32\drivers\etc\hosts.
Example: MyclouderaIP quickstart.cloudera
4) In the Cloudera VM, opened /etc/hive/conf/hdfs-site.xml and changed the property as below:
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
After finishing all the steps, I am facing the issues below.
scala> val Main = sc.textFile("D:\\Windows\\CompanyData.txt")
scala> Main.collect
Error :
java.lang.IllegalArgumentException: Pathname /D:/Windows/CompanyData.txt from hdfs://quickstart.cloudera:8020/D:/Windows/CompanyData.txt is not a valid DFS filename.
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
I have removed core-site.xml from spark/conf and it is now able to read the text file on Windows, but Spark is not able to communicate with Cloudera while inserting a record.
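As a side note (an assumption on my part, not something confirmed in the thread): instead of removing core-site.xml, the local path can be given an explicit file:// scheme so it is not resolved against the HDFS default filesystem:
scala> val companyData = sc.textFile("file:///D:/Windows/CompanyData.txt") // explicit local-filesystem scheme
scala> companyData.collect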
scala> import org.apache.spark.sql.hive.HiveContext
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("insert into TestTable select 1")
Error:
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
File /user/hive/warehouse/TestTable/.hive-staging_hive_2018-10-17_00-03-48_369_2112774544260501723-1/-ext-10000/_temporary/0/_temporary/attempt_20181017000351_0000_m_000000_0/part-00000-8fcba81b-8a51-48a6-9c47-ac5f1c9dafdb-c000
could only be replicated to 0 nodes instead of minReplication (=1).
There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
Please, can somebody help me?
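For reference, here is a minimal sketch of the same write using the Spark 2.x SparkSession API from the Windows spark-shell. The metastore host, the file path and the table name TestTable come from the question; the app name, the column name and the client-side spark.hadoop.dfs.client.use.datanode.hostname setting are assumptions, not a confirmed fix:
import org.apache.spark.sql.{SaveMode, SparkSession}
val spark = SparkSession.builder()
.appName("RemoteHiveWrite")
.config("hive.metastore.uris", "thrift://quickstart.cloudera:9083")
// Mirror the step-4 change on the client side as well; otherwise the datanode is
// often advertised under a VM-internal address that the Windows machine cannot reach.
.config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
.enableHiveSupport()
.getOrCreate()
val companyData = spark.read.textFile("file:///D:/Windows/CompanyData.txt").toDF("line")
companyData.write.mode(SaveMode.Append).saveAsTable("TestTable")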

Related

Hive on Spark: org.apache.hadoop.hive.ql.metadata.HiveException

I get a problem while using Hive (version 3.1.2) on Spark (version 3.2.1) on my Mac (Catalina 10.15.7). My Hadoop and Hive run on my Mac in local mode and they both work well (I can insert records into a Hive table and select them out when I set hive.execution.engine=mr).
After I made the configuration listed below in Hive's hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>spark.home</name>
<value>/Users/admin/spark/spark-3.2.1-bin-hadoop3.2/</value>
</property>
<property>
<name>spark.master</name>
<value>spark://localhost:7077</value>
</property>
Then I run Hive in the terminal and it works well (I can run select statements).
But when I insert records into a table, I get the exception:
FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for
Spark session b9704c3a-6932-44b3-a011-600ae80f39e1

Unable to write data to Hive using Spark

I am using Spark 1.6 and creating a HiveContext from the Spark context. When I save the data into Hive it gives an error. I am using the Cloudera VM: Hive is inside the VM and Spark is on my system, and I can access the VM using its IP. I have started the Thrift server and HiveServer2 on the VM, and I have used the Thrift server URI for hive.metastore.uris.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://IP:9083")
............
............
df.write.mode(SaveMode.Append).insertInto("test")
I get the following error:
FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Probably hive-site.xml is not available inside the Spark conf folder; I have added the details below.
Add hive-site.xml inside the Spark configuration folder by creating a symlink which points to hive-site.xml in the Hive configuration folder:
sudo ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
After the above steps, restarting spark-shell should help.
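As a quick sanity check (my addition, not part of the original answer), the metastore connection can be verified from spark-shell after the restart:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("show databases").show() // should list the databases from the remote metastore, not just "default"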

sqlContext.tableNames not fetching any hive tables [duplicate]

I have a Hive 0.13 installation and have created custom databases. I have a Spark 1.1.0 single-node cluster built using mvn with the -Phive option.
I want to access tables in this database in a Spark application using HiveContext, but HiveContext is always reading the local metastore created in the Spark directory. I have copied hive-site.xml into the spark/conf directory.
Do I need to do any other configuration?
Step 1:
Set up Spark with the latest version:
$ cd $SPARK_HOME; ./sbt/sbt -Phive assembly
$ cd $SPARK_HOME; ./sbt/sbt -Phivethriftserver assembly
Executing these downloads the required jar files and adds them by default, so there is no need to add them manually.
Step 2:
Copy hive-site.xml from your Hive cluster to your $SPARK_HOME/conf/ directory, then edit the XML file and add the properties listed below:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://MYSQL_HOST:3306/hive_{version}</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>XXXXXXXX</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>XXXXXXXX</value>
<description>Password to use against metastore database</description>
</property>
Step 3: Download the MySQL JDBC connector and add it to the Spark classpath.
Open the bin/compute-classpath.sh script and add the line below to it:
CLASSPATH="$CLASSPATH:$PATH_TO_mysql-connector-java-5.1.10.jar"
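A quick way to confirm the connector really ended up on the classpath (my addition; the driver class name is the one configured in hive-site.xml above) is to load it from spark-shell:
scala> Class.forName("com.mysql.jdbc.Driver") // throws ClassNotFoundException if the connector jar is missing from the classpath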
How to retrieve data from Hive into Spark:
Step 1:
Start all daemons with the following command:
start-all.sh
Step 2:
Start the Hive Thrift server 2 with the following command:
hive --service hiveserver2 &
Step 3:
Start the Spark server with the following command:
start-spark.sh
Finally, check whether these have started by running the jps command; you should see processes such as:
RunJar
ResourceManager
Master
NameNode
SecondaryNameNode
Worker
Jps
JobHistoryServer
DataNode
NodeManager
Step 4:
Start the master with the following command:
./sbin/start-master.sh
To stop the master, use the command below:
./sbin/stop-master.sh
Step 5:
Open a new terminal.
Start Beeline from the following path:
hadoop#localhost:/usr/local/hadoop/hive/bin$ beeline
When it asks for input, pass the input listed below:
!connect jdbc:hive2://localhost:10000 hadoop "" org.apache.hive.jdbc.HiveDriver
After that, set the Spark properties with the following commands.
Note: put these configurations in a conf file so there is no need to run them every time.
set spark.master=spark://localhost:7077;
set hive.execution.engine=spark;
set spark.executor.memory=2g; -- set the memory depending on your server
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;
When it asks for input, pass the query you want to run to retrieve the data. Then open a browser and go to localhost:8080; you can see the Running Jobs and Completed Jobs there.
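Coming back to the original question (sqlContext.tableNames returning nothing): once hive-site.xml is picked up from $SPARK_HOME/conf, the custom databases should be visible from a HiveContext. A minimal check using the Spark 1.x API from the question, where the database name is only illustrative:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("show databases").show() // databases from the shared metastore, not a local Derby one
hiveContext.tableNames("your_database") // should now return the Hive tables in that database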

How to get rid of derby.log, metastore_db from Spark Shell

When running spark-shell it creates a file derby.log and a folder metastore_db. How do I configure spark to put these somewhere else?
For the Derby log I've tried Getting rid of derby.log, like so: spark-shell --driver-memory 10g --conf "spark.driver.extraJavaOptions=-Dderby.stream.info.file=/dev/null", with a couple of different properties, but Spark ignores them.
Does anyone know how to get rid of these or specify a default directory for them?
The use of hive.metastore.warehouse.dir has been deprecated since Spark 2.0.0; see the docs.
As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property defaulting to . (the current directory).
Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby
where /tmp/derby can be replaced by the directory of your choice.
For spark-shell, to avoid the metastore_db directory without doing it in code (since the context/session is already created, and you won't stop and recreate it with a new configuration each time), you have to set its location in the hive-site.xml file and copy that file into the Spark conf directory.
A sample hive-site.xml file that puts metastore_db in /tmp (refer to my answer here):
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/tmp/</value>
<description>location of default database for the warehouse</description>
</property>
</configuration>
After that you could start your spark-shell as follows to get rid of derby.log as well:
$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp"
Try setting derby.system.home to some other directory as a system property before firing up the spark shell. Derby will create new databases there. The default value for this property is . (the current working directory).
Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html
Use the spark.sql.warehouse.dir property (the replacement for hive.metastore.warehouse.dir). From the docs:
import org.apache.spark.sql.SparkSession
// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "spark-warehouse"
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
For the Derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:
derby.stream.error.file=/path/to/desired/log/file
For me setting the Spark property didn't work, neither on the driver nor the executor. So searching for this issue, I ended up setting the property for my system instead with:
System.setProperty("derby.system.home", "D:\\tmp\\derby")
val spark: SparkSession = SparkSession.builder
.appName("UT session")
.master("local[*]")
.enableHiveSupport
.getOrCreate
[...]
And that finally got me rid of those annoying items.
In case you are using Jupyter/JupyterHub/JupyterLab, or are just setting this conf parameter inside Python, the following will work:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
.setMaster("local[*]")
.set('spark.driver.extraJavaOptions','-Dderby.system.home=/tmp/derby')
)
sc = SparkContext(conf = conf)
I used the configuration below for a PySpark project. I was able to set up the Spark warehouse and the Derby db in a chosen path, and so avoid having them created in the current directory.
from pyspark.sql import SparkSession
from os.path import abspath
location = abspath(r"C:\self\demo_dbx\data\spark-warehouse")  # path where you want to set up the spark-warehouse
local_spark = SparkSession.builder \
.master("local[*]") \
.appName('Spark_Dbx_Session') \
.config("spark.sql.warehouse.dir", location)\
.config("spark.driver.extraJavaOptions",
f"Dderby.system.home='{location}'")\
.getOrCreate()

Setting MySQL as the metastore for built in spark hive

I have a Scala sbt project using Spark. I need to create multiple HiveContexts, which is not allowed by the built-in Derby metastore for Spark Hive. Can someone help me with setting up MySQL as the metastore instead of Derby, which is the default db? I don't have actual Hive or Spark installed; I use sbt dependencies for Spark and Hive.
Copy the hive-site.xml file into Spark's conf directory and change the properties below in that file:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
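A rough sketch of what using this looks like from the sbt project (my addition; the app name and master are illustrative, and the MySQL database and user must already exist as configured above):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
val sc = new SparkContext(new SparkConf().setAppName("mysql-metastore-demo").setMaster("local[*]"))
val hiveContext = new HiveContext(sc)
hiveContext.sql("show databases").show() // served by the MySQL-backed metastore rather than an embedded Derby db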
You need to have the conf files on the classpath. I'm using Hadoop, Hive, and Spark with IntelliJ. In IntelliJ I have file:/usr/local/spark/conf/, file:/usr/local/hadoop/etc/hadoop/, and file:/usr/local/hive/conf/ on my classpath. You can use the following to print your runtime classpath:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
I hope this helps if you haven't already found a fix.
