How to connect to Hive databases in Spark using Java - apache-spark

I am able to connect to Hive using hive.metastore.uris in SparkSession. What I want is to connect to a particular Hive database with this connection, so that I don't need to prefix each table name in my queries with the database name. Is there any way to achieve this?
Expecting code something like:
SparkSession sparkSession = SparkSession.builder()
        .config("hive.metastore.uris", "thrift://dhdhdkkd136.india.sghjd.com:9083/hive_database")
        .enableHiveSupport()
        .getOrCreate();

You can use the catalog API accessible from the SparkSession.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Catalog
You can then call sparkSession.catalog().setCurrentDatabase("<db_name>") (the catalog is reached via the catalog() method in Java, or the catalog field in Scala).
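Putting it together, a minimal Java sketch (the metastore URI, database name and table name here are placeholders, not values from the question):

import org.apache.spark.sql.SparkSession;

public class HiveCatalogExample {
    public static void main(String[] args) {
        // Connect to the Hive metastore (placeholder URI) with Hive support enabled
        SparkSession spark = SparkSession.builder()
                .appName("hive-catalog-example")
                .config("hive.metastore.uris", "thrift://your-metastore-host:9083")
                .enableHiveSupport()
                .getOrCreate();

        // Make hive_database the session's current database; subsequent queries
        // can then reference its tables without the database prefix
        spark.catalog().setCurrentDatabase("hive_database");
        spark.sql("SELECT * FROM some_table").show(); // resolves to hive_database.some_table
    }
}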

Related

How can we add MySQL details as property in PySpark?

While creating a SparkSession, as there is a property to connect to Cassandra called
.config("spark.cassandra.connection.host", "ip-address")
that can be set directly when creating the SparkSession, can we add the MySQL details in a similar way, so that we don't have to pass them in every Spark function?
No, there is no such option when connecting to MySQL. Cassandra has its own spark-cassandra-connector, while MySQL is accessed through JDBC, which requires the connection parameters to be passed with each read or write, e.g. as Java Properties.
The two differ in configuration options and in how they work.
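To illustrate the difference, a hedged Java sketch of a JDBC read against MySQL (host, database, table and credentials are placeholders; the MySQL connector JAR must be on the classpath):

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MySqlJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("mysql-jdbc-read")
                .getOrCreate();

        // The connection details travel with each read; there is no
        // session-level setting comparable to spark.cassandra.connection.host
        Properties props = new Properties();
        props.setProperty("user", "my_user");
        props.setProperty("password", "my_password");
        props.setProperty("driver", "com.mysql.jdbc.Driver");

        Dataset<Row> df = spark.read()
                .jdbc("jdbc:mysql://localhost:3306/my_db", "my_table", props);
        df.show();
    }
}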

Using hive external metadata in spark

I have my metastore in an external MySQL database, created using the Hive metastore, so the metadata of my tables lives in that external MySQL instance. I would like to connect this to Spark and create a DataFrame using the metadata, so that all column information is populated from it.
How can I do it?
You can use a Spark JDBC connection to connect to MySQL and query the Hive metastore located in MySQL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("mysql connect")
  .enableHiveSupport()
  .getOrCreate()

val mysql_df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:<port>/<db_name>")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "<table_name/query>")
  .option("user", "<user_name>")
  .option("password", "<password>")
  .load()

mysql_df.show()
Note:
You need to add the MySQL connector JAR and start your spark-shell with that JAR (e.g. via --jars), or include the JAR in your Eclipse project.
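To go one step further, column metadata in a stock Hive metastore lives in the TBLS, SDS and COLUMNS_V2 tables, so the column information can be pulled with a join pushed down through the dbtable option. A hedged Java sketch (connection placeholders as above; assumes the standard metastore schema):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MetastoreColumns {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("metastore columns")
                .getOrCreate();

        // Join table -> storage descriptor -> columns (standard metastore layout)
        String columnsQuery =
            "(SELECT t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME" +
            " FROM TBLS t" +
            " JOIN SDS s ON t.SD_ID = s.SD_ID" +
            " JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID) AS hive_columns";

        Dataset<Row> columnsDf = spark.read().format("jdbc")
                .option("url", "jdbc:mysql://localhost:<port>/<db_name>")
                .option("driver", "com.mysql.jdbc.Driver")
                .option("dbtable", columnsQuery)
                .option("user", "<user_name>")
                .option("password", "<password>")
                .load();
        columnsDf.show();
    }
}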

Query my temporary tables outside my java app

I have created a Java application that starts Spark (local[*]) and uses it to read a CSV file as a Dataset<Row> and to create a temporary view with createOrReplaceTempView.
At this point I am able to use SQL to query the view inside my application.
What I would like to do, for development and debugging purposes, is to execute queries interactively from outside my application.
Any hints?
Thanks in advance
You can use Spark's DeveloperApi, HiveThriftServer2:
@DeveloperApi
def startWithContext(sqlContext: SQLContext): Unit = {
  val server = new HiveThriftServer2(sqlContext)
  ...
}
The only thing you need to do in your application is to get the SQLContext and use it as follows:
HiveThriftServer2.startWithContext(sqlContext)
This will start the Hive Thrift server (by default on port 10000), and you can use a SQL client, e.g. beeline, to access and query your data in temp tables.
You will also need to set --conf spark.sql.hive.thriftServer.singleSession=true, which allows connections to see your temp tables. By default it is set to false, so each connection has its own session and they don't see each other's temp tables.
"spark.sql.hive.thriftServer.singleSession" - When set to true, Hive Thrift server is running in a single session
mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
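For completeness, a minimal Java sketch of the whole flow (the CSV path and view name are placeholders; this assumes the spark-hive-thriftserver dependency is on the classpath and that startWithContext is reachable from Java in your Spark version):

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2;

public class DebugThriftServer {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("debug-thrift")
                // share temp views across JDBC/ODBC connections
                .config("spark.sql.hive.thriftServer.singleSession", "true")
                .enableHiveSupport()
                .getOrCreate();

        spark.read().option("header", "true").csv("/path/to/data.csv")
                .createOrReplaceTempView("my_view");

        // Starts the Thrift server (default port 10000); connect with e.g.
        //   beeline -u jdbc:hive2://localhost:10000
        HiveThriftServer2.startWithContext(spark.sqlContext());

        Thread.sleep(Long.MAX_VALUE); // keep the application alive for interactive queries
    }
}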

connect to HiveMetaStore from HiveContext

I'm doing some tests on tables I created via HiveContext.sql(). Is there any way I can connect to the underlying Hive metastore using org.apache.hadoop.hive.metastore.HiveMetaStoreClient?
I tried to initialize HiveMetaStoreClient(hiveContext.hiveconf()), but I can't see the tables from the HiveMetaStoreClient.
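For reference, a hedged Java sketch of a standalone HiveMetaStoreClient (the metastore URI is a placeholder; note that if the HiveContext uses a local embedded Derby metastore rather than a shared metastore service, a separately constructed client will not see its tables, which may explain the behaviour described):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class MetastoreClientSketch {
    public static void main(String[] args) throws Exception {
        // Point at the same metastore service the HiveContext uses (placeholder URI)
        HiveConf conf = new HiveConf();
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://your-metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        System.out.println(client.getAllTables("default")); // tables in the default database
        client.close();
    }
}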

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored it in Hive tables. We want to access this data as RDDs from a web browser.
I read the documentation and understood the steps to do the task.
I am unable to find a way to interact with Spark SQL RDDs via the Thrift server. Examples I found have the below line in the code, but I cannot find the class for it in the Spark 1.2.1 Java API docs.
HiveThriftServer2.startWithContext
On GitHub I saw Scala examples using
import org.apache.spark.sql.hive.thriftserver
but I don't see this in the Java API docs. Not sure if I am missing something.
Has anybody had luck with accessing Spark SQL RDDs from a browser via Thrift? Can you post a code snippet? We are using Java.
I've got most of this working. Let's dissect each part of it (references at the bottom of the post).
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java or from Python using Py4J, and I am no JVM expert, but I ended up switching to Scala. This may have something to do with the annotation @DeveloperApi. This is how I imported it in Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive, a Spark SQL context won't do: you need a Hive context. However, the HiveContext constructor requires a plain SparkContext, so a JavaSparkContext has to be converted first.
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this by persisting them to Hive, or making the RDD available as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write().saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note - temporary tables are isolated to a SQL session. Spark's Hive Thrift server is multi-session by default in version 1.6 (one session per connection). Therefore, for clients to access temporary tables you've registered, you'll need to set the option spark.sql.hive.thriftServer.singleSession to true.
You can test this by querying the tables in beeline, a command-line utility for interacting with the Hive Thrift server. It ships with Spark.
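For example, connecting to the default port and querying the registered table might look something like:
beeline -u jdbc:hive2://localhost:10000 -e "SELECT * FROM someTable LIMIT 10"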
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31#git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73
Thrift is a JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access content through the HiveDriver.
You cannot get RDDs back from it, because HiveContext is not available.
What you referred to is an experimental feature not available for Java.
As a workaround, you could re-parse the results and create your structures for your client.
For example:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";

// Register the Hive JDBC driver and open a connection
Class.forName(driverName);
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from " + tableName;
ResultSet res = stmt.executeQuery(sql);
parseResultsToObjects(res);
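The parseResultsToObjects helper is left to the client; one possible (hypothetical) shape is plain JDBC ResultSet iteration:

// Hypothetical sketch of such a helper: walk the ResultSet row by row
// and map each row into whatever client-side structure you need
private static void parseResultsToObjects(ResultSet res) throws java.sql.SQLException {
    while (res.next()) {
        String firstColumn = res.getString(1); // read columns by index or name
        System.out.println(firstColumn);       // replace with your own mapping
    }
}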
