I am creating a metastore in Azure Databricks for Azure SQL. I have added the options below to the cluster config using the 7.3 runtime, as mentioned in the documentation:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#spark-options
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://xxx.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionUserName xxxx
datanucleus.fixedDatastore false
spark.hadoop.javax.jdo.option.ConnectionPassword xxxx
datanucleus.autoCreateSchema true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 1.2.1
hive.metastore.schema.verification.record.version false
hive.metastore.schema.verification false
--
After this, when I try to create a database, the command gets cancelled automatically.
The error appears in the Data section in Databricks, and I am not able to copy it either.
(Screenshots attached: cluster setting, command.)
--Update
According to the error message added in the comments:
The maximum length allowed when specifying the length of a VARCHAR column is 8000.
Workaround: use either VARCHAR(8000) or VARCHAR(MAX) for the column 'PARAM_VALUE'. I would prefer NVARCHAR(MAX), because an NVARCHAR(MAX) column can store up to 2 GB of characters.
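For reference, a minimal T-SQL sketch of that workaround, assuming the offending column is TABLE_PARAMS.PARAM_VALUE (the same column also exists in other metastore tables such as SERDE_PARAMS and SD_PARAMS, so adjust the table name to whatever your error names):
-- Sketch only: widen PARAM_VALUE so it is no longer capped at 8000 characters.
-- TABLE_PARAMS is an assumption; use the table named in your error message.
ALTER TABLE TABLE_PARAMS
ALTER COLUMN PARAM_VALUE NVARCHAR(MAX);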
Apparently there is an official record of the known issue!
See Error in CREATE TABLE with external Hive metastore
This is a known issue with MySQL 8.0 when the default charset is utf8mb4.
Try running this to confirm
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "<database-name>"
If yes, refer to the solution:
You need to update or recreate the database and set the charset to
latin1.
You have 2 options:
Manually run create statements in the Hive database with DEFAULT CHARSET=latin1 at the end of each CREATE TABLE statement.
Set up the database and user accounts, create the database, and run alter database hive character set latin1; before you launch the metastore (a sketch of this option follows the list). This command sets the default CHARSET for the database; it is applied when the metastore creates its tables.
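If you go with the second option, a minimal sketch (the database name hive is only an example; substitute your metastore database):
-- Confirm the current default charset, then switch it to latin1 so the tables
-- the metastore creates later inherit it.
SELECT default_character_set_name
FROM information_schema.SCHEMATA
WHERE schema_name = 'hive';
ALTER DATABASE hive CHARACTER SET latin1;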
Related
I checked the [documentation][1] on using an external Hive metastore (an Azure SQL database) with Azure Databricks.
I was able to download jars and place them into /dbfs/hive_metastore_jar
My next step is to run a cluster with this init file:
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<host>.database.windows.net:1433;database=<database> #should I add more parameters?
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName admin
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword p#ssword
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version 2.7.3 #I am not sure about this
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jar
I've uploaded the init file to DBFS and launched the cluster. It failed to read the init file. Something is wrong.
[1]: https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
I solved this for now. The problems I faced:
I didn't copy the Hive jars to the cluster's local file system. This is important: spark.sql.hive.metastore.jars cannot point at DBFS directly and should point at a local copy of the Hive jars. The init script copies them.
The connection was good. I also used the Azure template with a VNet, which is preferable. I then allowed traffic to Azure SQL from the VNet that hosts Databricks.
The last issue: I had to create the Hive schema before starting Databricks, by copying the DDL for Hive 1.2 from Git and running it against the Azure SQL database; then I was good to go (see the check just below).
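As a quick sanity check (a sketch, assuming the standard Hive metastore layout), you can confirm the schema landed in Azure SQL by querying the metastore's VERSION table:
-- The Hive schema DDL inserts a single VERSION row; for Hive 1.2 it should
-- report a schema version along the lines of 1.2.0.
SELECT SCHEMA_VERSION, VERSION_COMMENT FROM VERSION;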
There is a useful notebook with the steps to download the jars. It downloads the jars to tmp; we should then copy them to our own folder. Finally, during cluster creation we should refer to the init script that has all the parameters. It includes the step of copying the jars from DBFS to the local file system of the cluster.
// This example is for an init script named `external-metastore_hive121.sh`.
dbutils.fs.put(
"dbfs:/databricks/scripts/external-metastore_hive121.sh",
"""#!/bin/sh
|# A temporary workaround to make sure /dbfs is available.
|sleep 10
|# Copy metastore jars from DBFS to the local FileSystem of every node.
|mkdir -p /databricks/hive_1_2_1_metastore_jars
|cp -r /dbfs/metastore_jars/hive-v1_2/* /databricks/hive_1_2_1_metastore_jars
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://host--name.database.windows.net:1433;database=tcdatabricksmetastore_dev;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "admin"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "P#ssword"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "1.2.1"
| # Skip this one if ${hive-version} is 0.13.x.
| "spark.sql.hive.metastore.jars" = "/databricks/hive_1_2_1_metastore_jars/*"
|}
|EOF
|""".stripMargin,
overwrite = true)
The command will create a file in DBFS and we will use it as a reference for the cluster creation.
According to the documentation, we should use config:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
in order to have the Hive schema created automatically. It didn't work for me, which is why I took the DDL from Git and created the schema and tables myself.
You can test that everything works with the command:
%sql show databases
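For a slightly stronger check (a sketch; the database name below is a throwaway I made up), you can exercise a full metastore round trip, running each statement in its own %sql cell if needed:
-- Create, list and drop a disposable database/table so that writes to the
-- external metastore are exercised, not just the listing.
CREATE DATABASE IF NOT EXISTS metastore_smoke_test;
CREATE TABLE IF NOT EXISTS metastore_smoke_test.t (id INT);
SHOW TABLES IN metastore_smoke_test;
DROP DATABASE metastore_smoke_test CASCADE;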
I'm attempting to set up an external Hive metastore for Azure Databricks. The metastore is in Azure SQL and the Hive version is 1.2.1 (included with Azure HDInsight 3.6).
I have followed the setup instructions on the “External Apache Hive metastore” page in Azure documentation.
I can see all of the databases and tables in the metastore but if I look at a specific table I get the following.
Caused by: javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0."NAME" = ?
NestedThrowables:
com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'IS_REWRITE_ENABLED'.
I was expecting to see errors relating to the underlying storage but this appears to be a problem with the metastore.
Anybody have any idea what’s wrong?
The error message seems to suggest that the column IS_REWRITE_ENABLED doesn't exist on table TBLS with alias A0.
Looking at the hive-schema script for the Derby DB can help you confirm that the column in question does exist there:
metastore script definition
If you've got admin access to the Azure SQL db, you can alter the table and add the column:
ALTER TABLE TBLS
ADD IS_REWRITE_ENABLED char(1) NOT NULL DEFAULT 'N';
I don't believe this is the actual fix but it does work around the error.
I have created a Java application that starts Spark (local[*]) and uses it to read a CSV file as a Dataset<Row> and to create a temporary view with createOrReplaceTempView.
At this point I am able to use SQL to query the view inside my application.
What I would like to do, for development and debugging purposes, is to execute queries in an interactive way from outside my application.
Any hints?
Thanks in advance
You can use Spark's DeveloperApi HiveThriftServer2.
@DeveloperApi
def startWithContext(sqlContext: SQLContext): Unit = {
  val server = new HiveThriftServer2(sqlContext)
  // ... rest of the method body in Spark's source
The only thing you need to do in your application is to get the SQLContext and use it as follows:
HiveThriftServer2.startWithContext(sqlContext)
This will start the Hive Thrift Server (by default on port 10000) and you can use a SQL client, e.g. beeline, to access and query your data in temp tables.
You will also need to set --conf spark.sql.hive.thriftServer.singleSession=true, which allows you to see the temp tables. By default it is set to false, so each connection has its own session and they don't see each other's temp tables.
"spark.sql.hive.thriftServer.singleSession" - When set to true, Hive Thrift server is running in a single session
mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
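Once the Thrift server is up, any JDBC/ODBC SQL client (beeline, for example) connected to port 10000 should see the view. A sketch, where my_csv_view stands in for whatever name you passed to createOrReplaceTempView:
-- my_csv_view is a placeholder for the temp view registered in the application.
SHOW TABLES;
SELECT * FROM my_csv_view LIMIT 10;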
I have configured a 3-node Spark (version 1.4.0) cluster with Hive 0.13.1 and started the Spark Thrift service using ./sbin/start-thriftserver.sh.
Multiple users are using the same Thrift service on the same port with different usernames.
The problem is that when one user executes a query like use mytest, the database change is automatically reflected for the other users.
Have you set the Hive concurrency values as mentioned in the docs?
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-ConfigurationValuestoSetforINSERT,UPDATE,DELETE
Configuration Values to Set for INSERT, UPDATE, DELETE
In addition to the new parameters listed above, some existing parameters need to be set to support INSERT ... VALUES, UPDATE, and DELETE.
Configuration key                    Must be set to
hive.support.concurrency             true (default is false)
hive.enforce.bucketing               true (default is false)
hive.exec.dynamic.partition.mode     nonstrict (default is strict)
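For a quick experiment these can also be set per session (a sketch; production setups normally put them in hive-site.xml):
-- Per-session equivalents of the settings above.
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;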
I just installed presto and when I use the presto-cli to query hive data, I get the following error:
$ ./presto --server node6:8080 --catalog hive --schema default
presto:default> show tables;
Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist
The config.properties is:
coordinator=true
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/root/h2
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://node6:8080
And the hive.properties is:
connector.name=hive-cdh4
hive.metastore.uri=thrift://node6:9083
The hadoop distribution I used is CDH 4.4. I believe it's properly installed and hive can process queries successfully on its own.
Can anyone help me work it out? Any ideas will be appreciated.
As recommended by the Getting Started, I created a controller (jmx only) and a separate worker (jmx,hive), each on separate machines.
What finally solved this for me was to specify the worker's hostname and http-server.http.port as the --server argument to presto. When specifying the controller, it didn't work.
This all makes sense, but I am still wondering what will happen when I have two Presto-Hive workers...
Add one more line to etc/catalog/hive.properties:
"hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml"
Of course, check that the paths are correct before doing it.
presto-metastore.db.filename= <- is this the value for the Hive warehouse directory?
=> No, this is Presto's metastore, not Hive's.
I just figured out what was wrong in my case:
You also have to add the following line to $HIVE_HOME/conf/hive-env.sh (in the conf folder) to tell Hive to open the Thrift port (the same one mentioned under the hive.metastore.uris property in the hive-site.xml file). This port is used by the Hive client to connect to the metastore through RPC:
export METASTORE_PORT=9084
This should sync your Hive with Presto.
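After restarting the metastore service and Presto, a quick check from presto-cli (a sketch) is to rerun the queries that originally failed:
-- These should now return results instead of the
-- "Table hive.information_schema.tables does not exist" error.
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.default;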