How to Create a Database in Spark SQL

How do I create a database (or multiple databases) in Spark SQL? I am executing the SQL from the spark-sql CLI. A query like Hive's create database sample_db does not work here. I have Hadoop 2.7 and Spark 1.6 installed on my system.

spark.sql("create database test")
// fetch metadata from the catalog; your database name will be listed here
spark.catalog.listDatabases.show(false)

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name
  [COMMENT comment_text]
  [LOCATION path]
  [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
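For example, a minimal sketch of the full form (the database name, location, and property below are illustrative placeholders, not values from the question):

spark.sql("""
  CREATE DATABASE IF NOT EXISTS sample_db
  COMMENT 'example database'
  LOCATION '/tmp/sample_db'
  WITH DBPROPERTIES ('owner'='etl')
""")
// confirm the new database is listed in the catalog
spark.catalog.listDatabases.show(false)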

Related

Azure Databricks external Hive metastore creation

I am creating a metastore in Azure Databricks for Azure SQL. I have added the options below to the cluster config, using the 7.3 runtime, as mentioned in the documentation:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#spark-options
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://xxx.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionUserName xxxx
datanucleus.fixedDatastore false
spark.hadoop.javax.jdo.option.ConnectionPassword xxxx
datanucleus.autoCreateSchema true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 1.2.1
hive.metastore.schema.verification.record.version false
hive.metastore.schema.verification false
After this, when I try to create a database in the metastore, the command gets cancelled automatically. The error appears in the Data section in Databricks, and I am not able to copy it either.
[Screenshots in the original post: cluster settings and the failing command]
Update:
According to the error message added in the comments, the length specified when declaring the VARCHAR column exceeds 8000, the maximum allowed.
Workaround: use either VARCHAR(8000) or VARCHAR(MAX) for column 'PARAM_VALUE'. I would prefer NVARCHAR(MAX), since an NVARCHAR(MAX) column can store up to 2 GB of characters.
Apparently there is an official record of the known issue!
See Error in CREATE TABLE with external Hive metastore
This is a known issue with MySQL 8.0 when the default charset is utf8mb4.
Try running this to confirm:
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "<database-name>"
If it is, refer to the solution:
You need to update or recreate the database and set the charset to latin1.
You have two options:
1. Manually run the create statements in the Hive database, with DEFAULT CHARSET=latin1 at the end of each CREATE TABLE statement.
2. Set up the database and user accounts, create the database, and run alter database hive character set latin1; before you launch the metastore. (This command sets the default charset for the database; it is applied when the metastore creates its tables.)
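A rough sketch of the second option over plain JDBC (the host, database name, and credentials are placeholders, and the MySQL Connector/J driver is assumed to be on the classpath):

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:mysql://metastore-host:3306/", "hive_user", "hive_password")
try {
  val stmt = conn.createStatement()
  // confirm the current default charset of the metastore database
  val rs = stmt.executeQuery(
    "SELECT default_character_set_name FROM information_schema.SCHEMATA " +
    "WHERE schema_name = 'hive'")
  while (rs.next()) println(rs.getString(1))
  // set the default charset before the metastore creates its tables
  stmt.execute("ALTER DATABASE hive CHARACTER SET latin1")
} finally {
  conn.close()
}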

Setting up Azure SQL External Metastore for Azure Databricks — Invalid column name ‘IS_REWRITE_ENABLED’

I’m attempting to set up an external Hive metastore for Azure Databricks. The metastore is in Azure SQL and the Hive version is 1.2.1 (included with Azure HDInsight 3.6).
I have followed the setup instructions on the “External Apache Hive metastore” page in the Azure documentation.
I can see all of the databases and tables in the metastore, but if I look at a specific table I get the following:
Caused by: javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0."NAME" = ?
NestedThrowables:
com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'IS_REWRITE_ENABLED'.
I was expecting to see errors relating to the underlying storage but this appears to be a problem with the metastore.
Anybody have any idea what’s wrong?
The error message seems to suggest that the column IS_REWRITE_ENABLED doesn't exist on table TBLS with alias A0.
Looking at the hive-schema script for the Derby DB can help you confirm that the column in question does exist there:
metastore script definition
If you've got admin access to the Azure SQL db, you can alter the table and add the column:
ALTER TABLE TBLS
ADD IS_REWRITE_ENABLED char(1) NOT NULL DEFAULT 'N';
I don't believe this is the actual fix but it does work around the error.

Using Hive external metadata in Spark

My Hive metastore is in an external MySQL database, and the table metadata lives there. I would like to connect this to Spark and create a DataFrame using that metadata, so that all column information is populated from it.
How can I do it?
You can use a Spark JDBC connection to connect to MySQL and query the Hive metastore tables stored there.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("mysql connect").enableHiveSupport().getOrCreate()

// read a metastore table (or a query) from MySQL over JDBC
val mysql_df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:<port>/<db_name>")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "<table_name/query>")
  .option("user", "<user_name>")
  .option("password", "<password>")
  .load()

mysql_df.show()
Note:
You need to add the MySQL connector jar when starting the spark-shell (for example, spark-shell --jars mysql-connector-java.jar) or include the jar in your Eclipse project.
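As a concrete (hedged) illustration, the standard metastore tables TBLS and DBS can be joined to list every table per database; the connection details below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("metastore tables").getOrCreate()

// helper: load one Hive-metastore table from MySQL over JDBC
// (host, database, user, and password are placeholders)
def metastoreTable(name: String) = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/metastore")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", name)
  .option("user", "hive_user")
  .option("password", "hive_password")
  .load()

// TBLS holds table metadata and DBS holds database metadata; join on DB_ID
val tbls = metastoreTable("TBLS")
val dbs = metastoreTable("DBS")
tbls.join(dbs, "DB_ID").select("NAME", "TBL_NAME", "TBL_TYPE").show()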

How to connect to Hive databases in Spark using Java

I am able to connect to Hive using hive.metastore.uris in SparkSession. What I want is to connect to a particular Hive database with this connection, so that I don't need to prefix every table name with the database in queries. Is there any way to achieve this?
I am expecting code something like:
SparkSession sparkSession = SparkSession.config("hive.metastore.uris", "thrift://dhdhdkkd136.india.sghjd.com:9083/hive_database")
You can use the catalog API accessible from the SparkSession.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Catalog
You can then call sparkSession.catalog.setCurrentDatabase(<db_name>)
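A minimal Scala sketch (the metastore URI, database, and table names are placeholders; the Java calls are the same, with sparkSession.catalog() as a method):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// make the target database current so unqualified table names resolve against it
spark.catalog.setCurrentDatabase("hive_database")
spark.sql("select * from my_table").show()  // resolves to hive_database.my_table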

Spark SQL is connecting to Derby for the Hive metastore instead of MySQL and is not able to find the database

I am using the Cloudera VM. I have created an SQLContext and am trying to connect to a Hive database, but I am not able to. Similarly, when I create a table using Spark SQL, I am not able to see it in Hive either.
I see that hive-site.xml has the configuration for a MySQL metastore, but Spark is trying to connect to Derby. Please let me know if I am missing something.
from pyspark.sql import SQLContext
hiveContext=SQLContext(sc)
hiveContext.sql("use test ")
I can see it is connecting to the Derby DB:
16/06/19 18:51:32 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/06/19 18:51:32 INFO metastore.ObjectStore: Initialized ObjectStore
16/06/19 18:51:33 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
hive> show databases;
OK
default
test
Time taken: 0.571 seconds, Fetched: 2 row(s)
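For reference, in Spark 1.x the metastore-aware entry point is HiveContext, not a plain SQLContext; without Hive support and a hive-site.xml on the classpath, Spark falls back to an embedded Derby metastore. A minimal Scala sketch, assuming hive-site.xml with the MySQL metastore settings is in Spark's conf directory:

import org.apache.spark.sql.hive.HiveContext

// HiveContext reads hive-site.xml, so it talks to the configured MySQL metastore
val hiveContext = new HiveContext(sc)
hiveContext.sql("show databases").show()
hiveContext.sql("use test")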
