Remote database not found while connecting to remote Hive from Spark using JDBC in Python

I am using a PySpark script to read data from a remote Hive through the JDBC driver. I have tried the other approach, enableHiveSupport with hive-site.xml, but it is not possible for me because of access restrictions (launching YARN jobs from outside the cluster is blocked). The code below is the only way I can connect to Hive.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("hadoop.security.authentication", "kerberos") \
    .getOrCreate()
jdbcdf = spark.read.format("jdbc") \
    .option("url", "urlname") \
    .option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
    .option("user", "username") \
    .option("dbtable", "dbname.tablename") \
    .load()
spark.sql("show tables from dbname").show()
This gives me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please help me access the remote databases and tables using this method? Thanks.

Add .enableHiveSupport() to your SparkSession in order to access the Hive catalog.
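Note that the DataFrame returned by spark.read.format("jdbc") is not registered in Spark's own catalog, so spark.sql("show tables from dbname") still searches the local session catalog rather than the remote Hive. If enabling Hive support is not an option, one possible workaround is to register the JDBC DataFrame as a temporary view and query that view instead. The following is a minimal sketch of that idea, not a confirmed fix for the asker's setup: the helper names jdbc_read_options and load_as_view and the view name are hypothetical, and the URL, driver, user and table values are the placeholders from the question.

```python
def jdbc_read_options(url, driver, user, dbtable):
    """Collect the options passed to Spark's JDBC reader in one dict."""
    return {"url": url, "driver": driver, "user": user, "dbtable": dbtable}


def load_as_view(spark, options, view_name):
    """Load a remote table over JDBC and expose it to spark.sql() as a temp view."""
    df = spark.read.format("jdbc").options(**options).load()
    df.createOrReplaceTempView(view_name)
    return df


# Placeholder values copied from the question; replace them with real ones.
options = jdbc_read_options(
    url="urlname",
    driver="com.cloudera.hive.jdbc41.HS2Driver",
    user="username",
    dbtable="dbname.tablename",
)

# With an existing SparkSession named `spark`:
# load_as_view(spark, options, "tablename_view")
# spark.sql("SELECT * FROM tablename_view LIMIT 10").show()
```

The view only covers the one table named in dbtable; statements such as "show tables from dbname" would still need a Hive-enabled session.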

Related

Spark integration with Vertica Failing

We are using Vertica Community Edition "vertica_community_edition-11.0.1-0" with Spark 3.2 and a local[*] master. When we try to save data to the Vertica database with the following code:
member.write()
.format("com.vertica.spark.datasource.VerticaSource")
.mode(SaveMode.Overwrite)
.option("host", "192.168.1.25")
.option("port", "5433")
.option("user", "Fred")
.option("db", "store")
.option("password", "password")
//.option("dbschema", "store")
.option("table", "Test")
// .option("staging_fs_url", "hdfs://172.16.20.17:9820")
.save();
we get the following exception:
com.vertica.spark.util.error.ConnectorException: Fatal error: spark context did not exist
at com.vertica.spark.datasource.VerticaSource.extractCatalog(VerticaDatasourceV2.scala:76)
at org.apache.spark.sql.connector.catalog.CatalogV2Util$.getTableProviderCatalog(CatalogV2Util.scala:363)
Please let us know how to resolve this exception.
We had a similar case. The root cause was that SparkSession.getActiveSession() returned None because the Spark session had been registered on another thread of the JVM. We could still reach the single session we had through SparkSession.getDefaultSession() and register it manually with SparkSession.setActiveSession(...).
Our case happened in a Jupyter kernel where we were using PySpark.
The workaround code was:
sp = sc._jvm.SparkSession.getDefaultSession().get()
sc._jvm.SparkSession.setActiveSession(sp)
I can't try Scala or Java, but since getDefaultSession() returns a Scala Option, I suppose it should look like this in Java:
SparkSession.setActiveSession(SparkSession.getDefaultSession().get());
Vertica 11.0 does not officially support Spark 3.2. See the documentation:
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm
Please try the Spark connector v2 with a supported version of Spark, and try running the examples from GitHub:
https://github.com/vertica/spark-connector/tree/main/examples

Query remote Hive Metastore from PySpark

I am trying to query a remote Hive metastore from PySpark using a username, password and JDBC URL. I can initialize the SparkSession just fine but am unable to actually query the tables. I would like to keep everything in a Python environment if possible. Any ideas?
from pyspark.sql import SparkSession
url = f"jdbc:hive2://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"
driver = "org.apache.hive.jdbc.HiveDriver"
# initialize
# (also tried .config("javax.jdo.option.ConnectionURL", url) in place of hive.metastore.uris)
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", url) \
    .config("javax.jdo.option.ConnectionDriverName", driver) \
    .config("javax.jdo.option.ConnectionUserName", username) \
    .config("javax.jdo.option.ConnectionPassword", password) \
    .enableHiveSupport() \
    .getOrCreate()
# query
spark.sql("select * from database.tbl limit 100").show()
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Previously I was able to connect to a single table using JDBC but was unable to retrieve any data (see "Errors querying Hive table from PySpark").
The metastore URIs are not JDBC addresses; they are thrift://server:port addresses opened by the Metastore server process, typically on port 9083.
The metastore itself is not reached through a jdbc:hive2 connection either. Its backing store is whichever RDBMS the metastore is configured with (as set in hive-site.xml).
If you want to use Spark with JDBC, you don't need those javax.jdo options, because the JDBC reader has its own user, driver and similar options.
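To make the distinction concrete, here is a minimal sketch that rejects a jdbc:hive2:// URL before it reaches hive.metastore.uris and then builds a Hive-enabled session. The helper names parse_metastore_uri and hive_session, the default application name and the host names in the usage comment are hypothetical; the default port 9083 is the typical value mentioned above.

```python
from urllib.parse import urlparse

DEFAULT_METASTORE_PORT = 9083  # typical Metastore port, as noted above


def parse_metastore_uri(uri):
    """Return (host, port) for a thrift:// metastore URI, rejecting JDBC URLs."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or not parsed.hostname:
        raise ValueError(f"expected thrift://host:port, got {uri!r}")
    return parsed.hostname, parsed.port or DEFAULT_METASTORE_PORT


def hive_session(metastore_uri, app_name="hive-metastore-example"):
    """Create a Hive-enabled SparkSession pointed at a remote metastore."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark

    parse_metastore_uri(metastore_uri)  # fail early on a jdbc:hive2:// URL
    return (
        SparkSession.builder.appName(app_name)
        .config("hive.metastore.uris", metastore_uri)
        .enableHiveSupport()
        .getOrCreate()
    )


# Usage, assuming a reachable metastore at a hypothetical host:
# spark = hive_session("thrift://metastore-host:9083")
# spark.sql("show databases").show()
```

The check is only a convenience: it turns a misconfigured URL into an immediate ValueError instead of a later "Unable to instantiate SessionHiveMetaStoreClient" failure.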

connect spark with Hive [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I'm running Spark locally and want to access Hive tables that are located in a remote Hadoop cluster.
I'm able to access the Hive tables by launching beeline under SPARK_HOME:
[ml#master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>
How can I access the remote Hive tables programmatically from Spark?
JDBC is not required.
Spark connects directly to the Hive metastore, not through HiveServer2. To configure this:
Put hive-site.xml on your classpath and set hive.metastore.uris to the address where your Hive metastore is hosted. Also see "How to connect to a Hive metastore programmatically in SparkSQL?"
Import org.apache.spark.sql.hive.HiveContext, which can run SQL queries over Hive tables.
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Run sqlContext.sql("show tables") to verify that it works.
SparkSQL on Hive tables
If you must go the JDBC way:
Have a look at "Connecting Apache Spark with Apache Hive remotely".
Please note that beeline also connects through JDBC; this is evident from your own log:
[ml#master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
So please have a look at this interesting article, which describes three methods:
Method 1: Pull table into Spark using JDBC
Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
Method 3: Fetch dataset on a client side, then create RDD manually
Currently the HiveServer2 driver doesn't allow us to use the "Sparkling" Methods 1 and 2, so we can rely only on Method 3.
Below is an example code snippet through which this can be achieved.
It loads data from one Hadoop cluster (the "remote" one) into another (the "domestic" one, where my Spark lives) through a HiveServer2 JDBC connection.
import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

case class StatsRec(
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// url, user and password are the HiveServer2 JDBC connection settings
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
  .executeQuery("SELECT * FROM stats_201512301914")
// Collect every row on the client (driver) side
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  val rec = StatsRec(
    res.getString("first_name"),
    res.getString("last_name"),
    Timestamp.valueOf(res.getString("action_dtm")),
    res.getLong("size"),
    res.getLong("size_p"),
    res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()
// Turn the locally fetched rows into an RDD
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()
// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.take(10).foreach(println)
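For readers working in Python, the same idea as Method 3 (fetch on the client, then hand the rows to Spark) can be sketched with any DB-API 2.0 connection. This is a minimal sketch and not the article's code: fetch_rows is a hypothetical helper, the sample table and values are made up for illustration, and an in-memory sqlite3 database stands in for the HiveServer2 connection so that the snippet runs without a cluster. With a real HiveServer2 DB-API driver, the connection object would come from that driver instead.

```python
import sqlite3


def fetch_rows(conn, query):
    """Run a query on a DB-API connection and return the rows as dicts."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Stand-in for the remote connection (assumption: sqlite3 replaces HiveServer2 here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (first_name TEXT, last_name TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?)",
    [("Ada", "Lovelace", 10), ("Alan", "Turing", 20)],
)
rows = fetch_rows(conn, "SELECT first_name, last_name, size FROM stats ORDER BY size")
conn.close()

# With an existing SparkSession named `spark`, the rows could then be parallelized:
# from pyspark.sql import Row
# df = spark.createDataFrame([Row(**r) for r in rows])
```

As in the Scala version, every row passes through the driver's memory, so this only suits result sets that fit on the client.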
After providing the hive-site.xml configuration to Spark and starting the Hive Metastore service, two things need to be configured in the Spark session when connecting to Hive:
Since Spark SQL connects to the Hive metastore using Thrift, we need to provide the Thrift server URI when creating the Spark session.
The Hive metastore warehouse, which is the directory where Spark SQL persists tables.
Use the property 'spark.sql.warehouse.dir', which corresponds to 'hive.metastore.warehouse.dir' (deprecated since Spark 2.0).
Something like:
// Set both properties on the builder; changing the conf after getOrCreate() does not affect the running session.
SparkSession spark = SparkSession.builder().appName("Spark_SQL_5_Save To Hive")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate();
Hope this helps!
As per documentation:
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
So in the SparkSession, use spark.sql.warehouse.dir for the warehouse location. The metastore address is still configured through hive.metastore.uris:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", "thrift://<remote_ip>:9083") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("show tables").show()

Using hive external metadata in spark

My Hive metastore is stored in an external MySQL database, so the table metadata lives there. I would like to connect it to Spark and create a DataFrame using that metadata, so that all column information is populated from it.
How can I do this?
You can use a Spark JDBC connection to connect to MySQL and query the Hive metastore located there.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("mysql connect").enableHiveSupport().getOrCreate()
val mysql_df = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:<port>/<db_name>").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "<table_name/query>").option("user", "<user_name>").option("password", "<password>").load()
mysql_df.show()
Note:
You need to add the MySQL connector JAR and start your Spark shell with it (or include the JAR in your Eclipse project).
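The dbtable option can also take a parenthesized subquery with an alias, which is one way to pull just the column metadata of a single table. The sketch below builds such a subquery in Python. It assumes the commonly seen metastore tables DBS, TBLS, SDS and COLUMNS_V2 and their key columns; these names can differ between Hive versions, so check them against your own metastore schema. The helper name column_metadata_query and the example database and table names are hypothetical.

```python
import re

_NAME = re.compile(r"[A-Za-z0-9_]+")


def column_metadata_query(db_name, table_name):
    """Build a subquery (for the JDBC `dbtable` option) listing a table's columns."""
    for name in (db_name, table_name):
        # Names are interpolated into SQL, so accept only plain identifiers.
        if not _NAME.fullmatch(name):
            raise ValueError(f"unexpected characters in name: {name!r}")
    # Assumed metastore layout: DBS -> TBLS -> SDS -> COLUMNS_V2.
    return (
        "(SELECT c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX "
        "FROM DBS d "
        "JOIN TBLS t ON t.DB_ID = d.DB_ID "
        "JOIN SDS s ON s.SD_ID = t.SD_ID "
        "JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID "
        f"WHERE d.NAME = '{db_name}' AND t.TBL_NAME = '{table_name}') AS hive_columns"
    )


query = column_metadata_query("store", "sales")

# With an existing SparkSession named `spark` (connection values are placeholders):
# cols_df = (spark.read.format("jdbc")
#            .option("url", "jdbc:mysql://localhost:<port>/<db_name>")
#            .option("dbtable", query)
#            .option("user", "<user_name>")
#            .option("password", "<password>")
#            .load())
```

Reading the metastore tables directly gives you the metadata as data; it does not by itself make the Hive tables queryable through spark.sql().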
