I checked the [documentation][1] on using an external Hive metastore (Azure SQL Database) with Azure Databricks.
I was able to download the jars and place them into /dbfs/hive_metastore_jar.
My next step is to start the cluster with an init file:
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<host>.database.windows.net:1433;database=<database> #should I add more parameters?
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName admin
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword p#ssword
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version 2.7.3 #I am not sure about this
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jar
I've uploaded the init file to DBFS and launched the cluster. It failed to read the init file. Something is wrong.
[1]: https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
I solved this for now. The problems I faced:
I hadn't copied the Hive jars to the cluster's local file system. This is important: spark.sql.hive.metastore.jars cannot point to DBFS, it has to point to a local copy of the Hive jars. The init script copies them.
The connection itself was fine. I also used the Azure template with a VNet, which is preferable; I then allowed traffic to Azure SQL from the VNet that contains Databricks.
The last issue: I had to create the Hive schema before starting Databricks. I copied the DDL for Hive version 1.2 from Git, ran it against the Azure SQL Database, and then I was good to go.
There is a useful notebook with steps to download the jars. It downloads them to tmp, after which we should copy them to our own folder. Finally, during cluster creation we should reference the init script that holds all the parameters; it contains the step that copies the jars from DBFS to the local file system of the cluster.
// This example is for an init script named `external-metastore_hive121.sh`.
dbutils.fs.put(
"dbfs:/databricks/scripts/external-metastore_hive121.sh",
"""#!/bin/sh
|# A temporary workaround to make sure /dbfs is available.
|sleep 10
|# Copy metastore jars from DBFS to the local FileSystem of every node.
|cp -r /dbfs/metastore_jars/hive-v1_2/* /databricks/hive_1_2_1_metastore_jars
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://host--name.database.windows.net:1433;database=tcdatabricksmetastore_dev;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "admin"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "P#ssword"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "1.2.1"
| # Skip this one if ${hive-version} is 0.13.x.
| "spark.sql.hive.metastore.jars" = "/databricks/hive_1_2_1_metastore_jars/*"
|}
|EOF
|""".stripMargin,
overwrite = true)
The command creates a file in DBFS, which we then reference during cluster creation.
According to the documentation, we should use the config:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
in order to have Hive create the schema automatically. That didn't work for me, which is why I took the DDL from Git and created the schema and tables myself.
You can test that everything works with the command:
%sql show databases
I am creating a metastore in Azure Databricks for Azure SQL. I have added the below settings to the cluster config on the 7.3 runtime, as mentioned in the documentation:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#spark-options
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://xxx.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionUserName xxxx
datanucleus.fixedDatastore false
spark.hadoop.javax.jdo.option.ConnectionPassword xxxx
datanucleus.autoCreateSchema true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 1.2.1
hive.metastore.schema.verification.record.version false
hive.metastore.schema.verification false
--
After this, when I try to create a database, the command gets cancelled automatically.
The error shows up in the Data section in Databricks, and I am not able to copy it either.
(Screenshots of the cluster settings and the command were attached.)
--Update
According to the error message posted in the comments:
The maximum length allowed when declaring a VARCHAR column is 8000, but a larger length was specified.
Workaround: use either VARCHAR(8000) or VARCHAR(MAX) for column 'PARAM_VALUE'. I would prefer NVARCHAR(MAX), since it can store up to 2 GB of characters.
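The workaround as concrete DDL can be sketched with a small helper (TABLE_PARAMS is the table in the standard Hive metastore schema that holds PARAM_VALUE; whether other tables need the same change is an assumption to check against the actual error message):

```python
# Build the ALTER statement that widens PARAM_VALUE on a Hive metastore table.
# TABLE_PARAMS is the usual offender in the standard Hive schema; extend the
# list if the error message names another table.
def widen_param_value(table):
    return f"ALTER TABLE {table} ALTER COLUMN PARAM_VALUE NVARCHAR(MAX);"

for table in ["TABLE_PARAMS"]:
    print(widen_param_value(table))
# ALTER TABLE TABLE_PARAMS ALTER COLUMN PARAM_VALUE NVARCHAR(MAX);
```

Run the generated statement against the metastore database before retrying the CREATE TABLE.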
Apparently I found an official record of the known issue!
See Error in CREATE TABLE with external Hive metastore
This is a known issue with MySQL 8.0 when the default charset is
utf8mb4.
Try running this to confirm
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "<database-name>"
If so, refer to the solution:
You need to update or recreate the database and set the charset to
latin1.
You have 2 options:
Manually run create statements in the Hive database with DEFAULT CHARSET=latin1 at the end of each CREATE TABLE statement.
Set up the database and user accounts, create the database, and run alter database hive character set latin1; before you launch the metastore. (This command sets the default CHARSET for the database; it is applied when the metastore creates its tables.)
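For illustration, the check and the fix from option 2, built as plain SQL strings in Python (the database name hive is the one used in the quoted command; substitute your own):

```python
# Statements for a MySQL-backed metastore: check the schema's default charset,
# then reset it to latin1 before the metastore creates its tables.
def charset_statements(db_name):
    check = ("SELECT default_character_set_name FROM information_schema.SCHEMATA "
             f"WHERE schema_name = '{db_name}'")
    fix = f"ALTER DATABASE {db_name} CHARACTER SET latin1"
    return check, fix

check_sql, fix_sql = charset_statements("hive")
print(fix_sql)  # ALTER DATABASE hive CHARACTER SET latin1
```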
I have:
An existing Databricks cluster
Azure blob store (wasb) mounted to HDFS
A Database with its LOCATION set to a path on wasb (via mount path)
A Delta table (which ultimately writes Delta-formatted Parquet files to the blob store path)
A Kubernetes cluster that reads and writes data in Parquet and/or Delta format within the same Azure blob store that Databricks uses (writing in Delta format via spark-submit PySpark)
What I want to do:
Utilize the managed Hive metastore in Databricks to act as data catalog for all data within Azure blob store
To this end, I'd like to connect to the metastore from my outside pyspark job such that I can use consistent code to have a catalog that accurately represents my data.
In other words, if I were to prep my db from within Databricks:
dbutils.fs.mount(
  source = "wasbs://container@storage.blob.core.windows.net",
  mount_point = "/mnt/db",
  extra_configs = {..})
spark.sql('CREATE DATABASE db LOCATION "/mnt/db"')
Then from my Kubernetes pyspark cluster, I'd like to execute
df.write.mode('overwrite').format("delta").saveAsTable("db.table_name")
This should write the data to wasbs://container@storage.blob.core.windows.net/db/table_name as well as register the table with Hive (so it can be queried with HiveQL).
How do I connect to the Databricks-managed Hive metastore from a PySpark session outside of the Databricks environment?
This doesn't answer my question (I don't think it's possible), but it mostly solves my problem: Writing a crawler to create tables from delta files.
Mount Blob container and create a DB as in question
Write a file in delta format from anywhere:
df.write.mode('overwrite').format("delta").save("/mnt/db/table")  # equivalently, save to wasb:..../db/table
Create a Notebook, schedule it as a job to run regularly
import os

def find_delta_dirs(ls_path):
    for dir_path in dbutils.fs.ls(ls_path):
        if dir_path.isFile():
            pass
        elif dir_path.isDir() and ls_path != dir_path.path:
            if dir_path.path.endswith("_delta_log/"):
                yield os.path.dirname(os.path.dirname(dir_path.path))
            yield from find_delta_dirs(dir_path.path)

def fmt_name(full_blob_path, mount_path):
    relative_path = full_blob_path.split(mount_path)[-1].strip("/")
    return relative_path.replace("/", "_")

db_mount_path = "/mnt/db"

for path in find_delta_dirs(db_mount_path):
    spark.sql(f"CREATE TABLE IF NOT EXISTS {db_name}.{fmt_name(path, db_mount_path)} USING DELTA LOCATION '{path}'")
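The crawler's discovery logic can be exercised outside Databricks by stubbing dbutils.fs.ls with an in-memory listing (the FakeFileInfo class and the sample tree below are hypothetical; the real FileInfo objects expose path, isFile() and isDir() the same way):

```python
import os

# Hypothetical stand-in for the FileInfo entries returned by dbutils.fs.ls.
class FakeFileInfo:
    def __init__(self, path, is_dir):
        self.path = path
        self._is_dir = is_dir
    def isFile(self):
        return not self._is_dir
    def isDir(self):
        return self._is_dir

# An in-memory directory tree standing in for the mounted blob container.
TREE = {
    "/mnt/db/": [FakeFileInfo("/mnt/db/sales/", True)],
    "/mnt/db/sales/": [
        FakeFileInfo("/mnt/db/sales/_delta_log/", True),
        FakeFileInfo("/mnt/db/sales/part-0000.parquet", False),
    ],
    "/mnt/db/sales/_delta_log/": [],
}

def ls(path):  # stands in for dbutils.fs.ls
    return TREE.get(path, [])

# Same logic as the notebook's find_delta_dirs, with ls() swapped in.
def find_delta_dirs(ls_path):
    for dir_path in ls(ls_path):
        if dir_path.isFile():
            pass
        elif dir_path.isDir() and ls_path != dir_path.path:
            if dir_path.path.endswith("_delta_log/"):
                yield os.path.dirname(os.path.dirname(dir_path.path))
            yield from find_delta_dirs(dir_path.path)

def fmt_name(full_blob_path, mount_path):
    relative_path = full_blob_path.split(mount_path)[-1].strip("/")
    return relative_path.replace("/", "_")

print(list(find_delta_dirs("/mnt/db/")))     # ['/mnt/db/sales']
print(fmt_name("/mnt/db/sales", "/mnt/db"))  # sales
```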
I have my metastore in an external MySQL database, created using the Hive metastore. The metadata for my tables lives in this external MySQL. I would like to connect it to Spark and create a DataFrame using the metadata, so that all the column information is populated from it.
How can I do that?
You can use a Spark JDBC connection to connect to MySQL and query the Hive metastore located there.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("mysql connect").enableHiveSupport().getOrCreate()
val mysql_df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:<port>/<db_name>")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "<table_name/query>")
  .option("user", "<user_name>")
  .option("password", "<password>")
  .load()
mysql_df.show()
Note:
We need to add the MySQL connector jar and start the spark shell with it, or include the jar in the Eclipse project.
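The useful part is knowing which metastore tables to join. In the standard Hive metastore schema, per-column metadata spans DBS, TBLS, SDS, and COLUMNS_V2; a query like the one below can be passed as the dbtable option (Spark treats it as a subquery, so it must be parenthesized and aliased). This is a sketch; verify the table names against your metastore's schema version:

```python
# SQL for the standard Hive metastore schema: one row per (db, table, column).
# Passed as the `dbtable` option, Spark wraps it as a subquery, so it must be
# parenthesized and aliased.
metastore_columns_query = """
(SELECT d.NAME AS db_name,
        t.TBL_NAME,
        c.COLUMN_NAME,
        c.TYPE_NAME,
        c.INTEGER_IDX
 FROM DBS d
 JOIN TBLS t ON t.DB_ID = d.DB_ID
 JOIN SDS  s ON s.SD_ID = t.SD_ID
 JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID) AS metastore_cols
"""
print(metastore_columns_query.strip().endswith("AS metastore_cols"))  # True
```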
I am having trouble accessing a table in the Glue Data Catalog using PySpark in Hue/Zeppelin on EMR. I have tried both emr-5.13.0 and emr-5.12.1.
I tried following https://github.com/aws-samples/aws-glue-samples/blob/master/examples/data_cleaning_and_lambda.md
but when trying to import the GlueContext it fails with No module named awsglue.context.
Another note: spark.sql("SHOW TABLES").show() comes up empty in Hue/Zeppelin, but from the pyspark shell on the master node I am able to see and query the table from the Glue Data Catalog.
Any help is much appreciated, thanks!
OK, I spent some time simulating the issue: I spun up an EMR cluster with "Use AWS Glue Data Catalog for table metadata" enabled. After enabling web connections, I issued a show databases command in Zeppelin, and it worked fine. Here are the command and output from Zeppelin:
%spark
spark.sql("show databases").show
+-------------------+
|airlines-historical|
| default|
| glue-poc-tpch|
| legislator-new|
| legislators|
| nursinghomedb|
| nycitytaxianalysis|
| ohare-airport-2006|
| payments|
| s100g|
| s1g|
| sampledb|
| testdb|
| tpch|
| tpch_orc|
| tpch_parquet|
+-------------------+
As for your other issue, "No module named awsglue.context": I think it may not be possible with an EMR-provisioned Zeppelin. I believe the only way awsglue.context can be accessed is via a Glue development endpoint that you set up in AWS Glue, and then a Glue Jupyter notebook or a locally set up Zeppelin notebook connected to that development endpoint.
I am not sure whether the Glue context can be accessed directly from an EMR-provisioned Zeppelin notebook; maybe I am wrong.
You can still access the Glue catalog, since EMR provides an option for that, so you can access the databases and do your ETL jobs.
Thanks.
Please check out the details in this link from AWS, and see if the EMR is configured correctly as recommended (Configure Glue Catalog in EMR). Also ensure that appropriate permissions are granted to access AWS Glue catalog. Details are in the attached link. Hope this helps.
You can use the function below to check whether a database exists in Glue:
def isDatabasePresent(database_name):
    """
    Check if the Glue database exists.
    :return: Boolean
    """
    client = get_glue_client()
    responseGetDatabases = client.get_databases()
    databaseList = responseGetDatabases['DatabaseList']
    for databaseDict in databaseList:
        if database_name == databaseDict['Name']:
            return True
    return False
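The lookup logic can be exercised without AWS by handing it a stubbed client (the stub below is hypothetical; boto3's real get_databases response carries the same DatabaseList shape, though it is paginated for large accounts):

```python
# Minimal stand-in for the boto3 Glue client used by the check above.
class FakeGlueClient:
    def get_databases(self):
        return {"DatabaseList": [{"Name": "default"}, {"Name": "sampledb"}]}

def is_database_present(client, database_name):
    for databaseDict in client.get_databases()["DatabaseList"]:
        if database_name == databaseDict["Name"]:
            return True
    return False

print(is_database_present(FakeGlueClient(), "sampledb"))  # True
print(is_database_present(FakeGlueClient(), "missing"))   # False
```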
I have created a Java application that starts Spark (local[*]) and uses it to read a CSV file as a Dataset<Row> and to create a temporary view with createOrReplaceTempView.
At this point I am able to exploit SQL to query the view inside my application.
What I would like to do, for development and debugging purposes, is to execute queries in an interactive way from outside my application.
Any hints?
Thanks in advance
You can use Spark's DeveloperApi, HiveThriftServer2.
@DeveloperApi
def startWithContext(sqlContext: SQLContext): Unit = {
  val server = new HiveThriftServer2(sqlContext)
The only thing you need to do in your application is to get the SQLContext and use it as follows:
HiveThriftServer2.startWithContext(sqlContext)
This will start the Hive Thrift server (on port 10000 by default), and you can use a SQL client, e.g. beeline, to access and query your data in temp tables.
You will also need to set --conf spark.sql.hive.thriftServer.singleSession=true so that temp tables are visible. By default it is set to false, so each connection gets its own session and they don't see each other's temp tables.
"spark.sql.hive.thriftServer.singleSession" - When set to true, the Hive Thrift server runs in single-session mode: all JDBC/ODBC connections share the temporary views, function registries, SQL configuration, and the current database.