Using hive external metadata in spark - apache-spark

I have my metastore in external mysql created using hive metastore. My metadata of the table is in external mysql. I would like to connect this to my spark and create dataframe using the metadata so that all column information is populated using this metadata.
How can I do it

You can use Spark-Jdbc connection to connect to Mysql and query hive metastore located in Mysql.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("mysql connect").enableHiveSupport().getOrCreate()
val mysql_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:<port>/<db_name>").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "<table_name/query>").option("user", "<user_name>").option("password", "<password>").load()
mysql_df.show()
Note:
We need to add mysql connector jar and start your spark shell with the jar (or) include jar in your eclipse project.

Related

connect spark with Hive [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I'm running spark locally and want to to access Hive tables, which are located in the remote Hadoop cluster.
I'm able to access the hive tables by lauching beeline under SPARK_HOME
[ml#master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>
how can I access the remote hive tables programmatically from spark?
JDBC is not required
Spark connects directly to the Hive metastore, not through HiveServer2. To configure this,
Put hive-site.xml on your classpath, and specify hive.metastore.uris to where your hive metastore hosted. Also see How to connect to a Hive metastore programmatically in SparkSQL?
Import org.apache.spark.sql.hive.HiveContext, as it can perform SQL query over Hive tables.
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Verify sqlContext.sql("show tables") to see if it works
SparkSQL on Hive tables
Conclusion : If you must go with jdbc way
Have a look connecting apache spark with apache hive remotely.
Please note that beeline also connects through jdbc. from your log it self its evident.
[ml#master spark-2.0.0]$./bin/beeline Beeline version 1.2.1.spark2 by
Apache Hive beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
So please have a look at this interesting article
Method 1: Pull table into Spark using JDBC
Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
Method 3: Fetch dataset on a client side, then create RDD manually
Currently HiveServer2 driver doesn't allow us to use "Sparkling" Method 1 and 2, we can rely only on Method 3
Below is example code snippet though which it can be achieved
Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives aka "domestic") thru HiveServer2 JDBC connection.
import java.sql.Timestamp
import scala.collection.mutable.MutableList
case class StatsRec (
first_name: String,
last_name: String,
action_dtm: Timestamp,
size: Long,
size_p: Long,
size_d: Long
)
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
.executeQuery("SELECT * FROM stats_201512301914")
val fetchedRes = MutableList[StatsRec]()
while(res.next()) {
var rec = StatsRec(res.getString("first_name"),
res.getString("last_name"),
Timestamp.valueOf(res.getString("action_dtm")),
res.getLong("size"),
res.getLong("size_p"),
res.getLong("size_d"))
fetchedRes += rec
}
conn.close()
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()
// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.collect.take(10).foreach(println)
After providing the hive-ste.xml configuration to SPARK and after starting the HIVE Metastore service,
Two things need to be configured in SPARK Session while connecting to HIVE:
Since Spark SQL connects to Hive metastore using thrift, we need to provide the thrift server uri while creating the Spark session.
Hive Metastore warehouse which is the directory where Spark SQL persists tables.
Use Property 'spark.sql.warehouse.dir' which is corresponding to 'hive.metastore.warehouse.dir' (as this is deprecated in Spark 2.0)
Something like:
SparkSession spark=SparkSession.builder().appName("Spark_SQL_5_Save To Hive").enableHiveSupport().getOrCreate();
spark.sparkContext().conf().set("spark.sql.warehouse.dir", "/user/hive/warehouse");
spark.sparkContext().conf().set("hive.metastore.uris", "thrift://localhost:9083");
Hope this was helpful !!
As per documentation:
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
So in SparkSession you need to specify spark.sql.uris instead of hive.metastore.uris
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.uris", "thrift://<remote_ip>:9083") \
.enableHiveSupport() \
.getOrCreate()
spark.sql("show tables").show()

How to connect to hive databases in spark Using Java

I am able to connect to hive using hive.metastore.uris in Sparksession. What I want is to connect to a particular database of hive with this connection so that I don't need to add database name to each table names in queries. Is there any way to achieve this ?
Expecting code something like
SparkSession sparkSession = SparkSession.config("hive.metastore.uris", "thrift://dhdhdkkd136.india.sghjd.com:9083/hive_database")
You can use the catalog API accessible from the SparkSession.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Catalog
You can then call sparkSession.catalog.setCurrentDatabase(<db_name>)

How to connect to remote hive server from spark [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I'm running spark locally and want to to access Hive tables, which are located in the remote Hadoop cluster.
I'm able to access the hive tables by lauching beeline under SPARK_HOME
[ml#master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>
how can I access the remote hive tables programmatically from spark?
JDBC is not required
Spark connects directly to the Hive metastore, not through HiveServer2. To configure this,
Put hive-site.xml on your classpath, and specify hive.metastore.uris to where your hive metastore hosted. Also see How to connect to a Hive metastore programmatically in SparkSQL?
Import org.apache.spark.sql.hive.HiveContext, as it can perform SQL query over Hive tables.
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Verify sqlContext.sql("show tables") to see if it works
SparkSQL on Hive tables
Conclusion : If you must go with jdbc way
Have a look connecting apache spark with apache hive remotely.
Please note that beeline also connects through jdbc. from your log it self its evident.
[ml#master spark-2.0.0]$./bin/beeline Beeline version 1.2.1.spark2 by
Apache Hive beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
So please have a look at this interesting article
Method 1: Pull table into Spark using JDBC
Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
Method 3: Fetch dataset on a client side, then create RDD manually
Currently HiveServer2 driver doesn't allow us to use "Sparkling" Method 1 and 2, we can rely only on Method 3
Below is example code snippet though which it can be achieved
Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives aka "domestic") thru HiveServer2 JDBC connection.
import java.sql.Timestamp
import scala.collection.mutable.MutableList
case class StatsRec (
first_name: String,
last_name: String,
action_dtm: Timestamp,
size: Long,
size_p: Long,
size_d: Long
)
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
.executeQuery("SELECT * FROM stats_201512301914")
val fetchedRes = MutableList[StatsRec]()
while(res.next()) {
var rec = StatsRec(res.getString("first_name"),
res.getString("last_name"),
Timestamp.valueOf(res.getString("action_dtm")),
res.getLong("size"),
res.getLong("size_p"),
res.getLong("size_d"))
fetchedRes += rec
}
conn.close()
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()
// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.collect.take(10).foreach(println)
After providing the hive-ste.xml configuration to SPARK and after starting the HIVE Metastore service,
Two things need to be configured in SPARK Session while connecting to HIVE:
Since Spark SQL connects to Hive metastore using thrift, we need to provide the thrift server uri while creating the Spark session.
Hive Metastore warehouse which is the directory where Spark SQL persists tables.
Use Property 'spark.sql.warehouse.dir' which is corresponding to 'hive.metastore.warehouse.dir' (as this is deprecated in Spark 2.0)
Something like:
SparkSession spark=SparkSession.builder().appName("Spark_SQL_5_Save To Hive").enableHiveSupport().getOrCreate();
spark.sparkContext().conf().set("spark.sql.warehouse.dir", "/user/hive/warehouse");
spark.sparkContext().conf().set("hive.metastore.uris", "thrift://localhost:9083");
Hope this was helpful !!
As per documentation:
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
So in SparkSession you need to specify spark.sql.uris instead of hive.metastore.uris
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.uris", "thrift://<remote_ip>:9083") \
.enableHiveSupport() \
.getOrCreate()
spark.sql("show tables").show()

How to Create a Database in Spark SQL

How do I create a database or multiple databases in sparkSQL. I am executing the SQL from spark-sql CLI. The query like in hive create database sample_db does not work here. I have Hadoop 2.7 and Spark 1.6 installed on my system.
spark.sql("create database test")
//fetch metadata data from the catalog. your database name will be listed here
spark.catalog.listDatabases.show(false)
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name
[COMMENT comment_text]
[LOCATION path]
[WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
Link for your reference

Spark Sql is connecting to Derby DB for Hive meta store in place of my SQL and not able to find the Data Base

I am using Cloudera VM. I have created SQLContext and trying to connect a hive Data base buit not able to. Similarly when I am creating a table from using spark sql , am not able to see in hive also.
I see the hive site xml has configuration for mysql db as metastore but spark is trying to connect derby db. Please let me know if I am missing some thing.
from pyspark.sql import SQLContext
hiveContext=SQLContext(sc)
hiveContext.sql("use test ")
I can see it is connecting to DERBY DB.
16/06/19 18:51:32 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/06/19 18:51:32 INFO metastore.ObjectStore: Initialized ObjectStore
16/06/19 18:51:33 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
hive> show databases;
OK
default
test
Time taken: 0.571 seconds, Fetched: 2 row(s)

Resources