AnalysisException when dropping a hive database using spark - apache-spark

Environment
spark 3.0.0
hive metastore (standalone) 3.0.0
mysql 8 as the metastore db
Problem
Every time I try to drop a database in the metastore via spark, I get AnalysisException and I don't know what is causing it or whether the drop operation is succeeding in it's entirety
Example
spark.sql(f"CREATE DATABASE IF NOT EXISTS myDb LOCATION 'shared-metastore-location/myDb.db'")
################
# DB creation succeeds and I can view the db in the metastore, add tables etc
################
spark.sql(f"DROP DATABASE myDb CASCADE")
################
AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to clean up java.sql.SQLException: The table does not comply with the requirements by an external plugin.\n\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)\n\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)\n\tat com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)\n\tat com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1335)\n\tat com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2108)\n\tat com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1245)\n\tat com.zaxxer.hikari.pool.ProxyStatement.executeUpdate(ProxyStatement.java:117)\n\tat com.zaxxer.hikari.pool.HikariProxyStatement.executeUpdate(HikariProxyStatement.java)\n\tat org.apache.hadoop.hive.metastore.txn.TxnHandler.cleanupRecords(TxnHandler.java:2741)\n\tat org.apache.hadoop.hive.metastore.AcidEventListener.onDropDatabase(AcidEventListener.java:52)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$21.notify(MetaStoreListenerNotifier.java:85)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:264)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:326)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:364)\n\tat org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database_core(HiveMetaStore.java:1537)\n\tat org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database(HiveMetaStore.java:1575)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)\n\tat org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)\n\tat com.sun.proxy.$Proxy32.drop_database(Unknown Source)\n\tat org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_database.getResult(ThriftHiveMetastore.java:14352)\n\tat org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_database.getResult(ThriftHiveMetastore.java:14336)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:111)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:107)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:119)\n\tat org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n);'
Despite the Exception, the database does disappear after I run this code. And if I try to run the drop command a second time, I get a different exception saying that the database doesn't exist. But, I have no idea whether the operation has succeeded in its entirety or whether it's leaving a mess behind. I'm not familiar enough with Hive to know what should be deleted in the metastore to completely delete a db
I seem to get the same result whether I have tables in the db or not. I've also tried dropping the table without cascade. Same result

Related

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.
I don't understand why the latter should not work but couldn't find any docs/hints as all examples are always done with temporary views. Technically, I cannot see why it shouldn't work as the data is properly retrieved from jdbc source in the temporary view and thus it should not matter if I wanted to "store" the query in a persistent view so that whenever calling the view it would retrieve data directly from jdbc source.
Config.
tbl_in = myjdbctable
tbl_out = myview
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A spark sql table over jdbc source behaves like a view and so can be used like one.
It seems my assumptions about jdbc tables in spark sql were flawed. It turns out that a sql table with a jdbc source (i.e. created via using jdbc) is actually a live query against the jdbc source (and not a one-off jdbc query during table creation as I assumed). In my mind it actually behaves like a view then. That means if the underlying jdbc source changes (e.g. new entries in a column) this is reflected in the spark sql table on read (e.g. select from) without having to re-create the table.
It follows that the spark sql table over jdbc source satisfies my requirements of having an always up2date reflection of the underlying table/sql object in the jdbc source. Usually, I would use a view for that. Maybe this is the reason why there is no persistent view over a jdbc source but only temporary views (which of course still make sense as they are session-scoped). It should be noted that the spark sql jdbc table behaves like a view which may be surprising, in particular:
if you add a column in underlying jdbc table, it will not show up in spark sql table
if you remove a column from underlying jdbc table, an error will occur when spark sql table is accessed (assuming the removed column was present during spark sql table creation)
if you remove the underlying jdbc table, an error will occur when spark sql table is accessed
The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements may work, that's not meant to be used this way.
If you want to use DDL, simply connect to your DB using python packages.
If you want to create a temp view in spark, do it using spark syntaxe createTempView

Spark not able to write into a new hive table in partitioned and append mode

Created a new table in hive in partitioned and ORC format.
Writing into this table using spark by using append ,orc and partitioned mode.
It fails with the exception:
org.apache.spark.sql.AnalysisException: The format of the existing table test.table1 is `HiveFileFormat`. It doesn't match the specified format `OrcFileFormat`.;
I change the format to "hive" from "orc" while writing . It still fails with the exception :
Spark not able to understand the underlying structure of table .
So this issue is happening because spark is not able to write into hive table in append mode , because it cant create a new table . I am able to do overwrite successfully because spark creates a table again.
But my use case is to write into append mode from starting. InsertInto also does not work specifically for partitioned tables. I am pretty much blocked with my use case. Any help would be great.
Edit1:
Working on HDP 3.1.0 environment.
Spark Version is 2.3.2
Hive Version is 3.1.0
Edit 2:
// Reading the table
val inputdf=spark.sql("select id,code,amount from t1")
//writing into table
inputdf.write.mode(SaveMode.Append).partitionBy("code").format("orc").saveAsTable("test.t2")
Edit 3: Using insertInto()
val df2 =spark.sql("select id,code,amount from t1")
df2.write.format("orc").mode("append").insertInto("test.t2");
I get the error as:
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN HiveMetastoreCatalog: Unable to infer schema for table test.t1 from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
If I rerun the insertInto command I get the following exception :
20/05/17 19:16:37 ERROR Hive: MetaException(message:The transaction for alter partition did not commit successfully.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
Error in hive metastore logs :
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(907)) - 163: alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(349)) - ugi=X#A.ORG ip=10.10.1.36 cmd=alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions_with_environment_context(5119)) - New partition values:[BR]
2020-05-17T21:17:43,913 ERROR [pool-8-thread-198]: metastore.ObjectStore (ObjectStore.java:alterPartitions(4397)) - Alter failed
org.apache.hadoop.hive.metastore.api.MetaException: Cannot change stats state for a transactional table without providing the transactional write state for verification (new write ID -1, valid write IDs null; current state null; new state {}
I was able to resolve the issue by using external tables in my use case. We currently have an open issue in spark , which is related to acid properties of hive . Once I create hive table in external mode , I am able to do append operations in partitioned/non partitioned table.
https://issues.apache.org/jira/browse/SPARK-15348

Hive: create database fail with 'database already exists'

I have a test suite which runs several spark unit tests. Each of these tests shares the same underlying spark context.
During the running of these tests I check if a db exists and if not I create it:
def dbExists(db: String) = spark.sql(s"show databases like '$db'").count > 0
if (!dbExists(db)) spark.sql(s"create database $db")
For some reasons, one of the tests is failing. Debugging I saw that for a certain db dbExists(db) returns false and the creation command fails with
ERROR RetryingHMSHandler:159 - AlreadyExistsException(message:Database db already exists)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Everytime a test start, I clean the environment running drop database db cascade for each db that is not the default one.
The only explanation I can give is that some corrupted metadata is in the catalogue and spark sql thinks that the db exists, while it is not anymore.
The problem happens also inside a container with a fresh git clone of the project, meaning that it is not a previous run of the application that might pollute environment.
I run with hive support enabled.
Try this:
You are absolutely right, It is important to check the existence of the database before creation. This should work and would be much easier for the hive to check.
def dbExists(db: String) = spark.sql(s"show databases like '$db'").count > 0
spark.sql(s"create database if not exists $db")
This should work for you.

How to use Airflow SparkSQLOperator?

I'd like to run a query using Spark SQL in Airflow, it looks like the SparkSQLOperator is perfect for this (https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py)
However, I can't figure out how the connection must be configured.
In DB Visualizer, I can connect to the Hive database using:
driver : jdbc
database url : jdbc:hive2://myserver.com:10000/default
database userid : me
database password : mypassword
Applying these settings to the spark_sql_default connection gives me:
enter[2017-12-12 11:35:33,774] {models.py:1462} ERROR - Cannot execute
on hive2://myserver.com:10000/default. Error code is: 1. Output: ,
Stderr:
Any ideas?
Spark isn't a database, so you configure your Spark connection just like you would if you were submitting regular Spark jobs, rather than the JDBC-like parameters you're used to.
If you look at the signature of the operator:
def __init__(self,
sql,
conf=None,
conn_id='spark_sql_default',
executor_cores=None,
executor_memory=None,
keytab=None,
master='yarn',
name='default-name',
num_executors=None,
yarn_queue='default',
*args,
**kwargs)
You need to create a connection (see models.Connection) that specifies the Spark master, which would be one of 'yarn', 'local[x]', or 'spark://hostname:port'. That should really be it, since they default the rest for you and it is likely handled by the underlying Spark config.
Your SQL or HQL code/script can be passed into the first parameter.

Azure SQL Data Warehouse: No catalog entry found for partition ID <id> in database <id>. The metadata is inconsistent. Run DBCC CHECKDB

I am working on moving stored procedures from an on-prem SQL Server database to an Azure SQL Data Warehouse (ASDW). Throughout the process I have had to work around a few missing features - time consuming but not impossible. One thing I have had to do is replace CTE's followed by MERGE statements with temp tables followed by UPDATE/INSERT/DELETE statements (since CTE's cannot be followed by these statements). At the beginning of each SP I check for the temp tables and delete them if they exist.
Today, I created another stored procedure in the ASDW without any temp tables (no updates/inserts/deletes so I left the CTE's in there), it "compiled", and I was able to run it without issue (returned an empty result set, as there is no data yet). I created another SP after this, and when I went to execute it, I got the following error:
...No catalog entry found for partition ID (id) in database 26. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption...
I then went back to the first SP that I mentioned, and it gave me the same error, even though it had previously run without flaw.
I tried running DBCC CHECKDB as instructed but alas, it is not supported/doesn't work.
I dug around a lot, and what I ended up doing was scaling my database from 100DWU's to 500DWU's. I am at 0.16% of my database storage size limit, and there is barely any data anywhere (total DB size is <300MB).
Is there an explanation for this? If not, I can't in good conscience use this platform in a production environment.
Full error:
Msg 110802, Level 16, State 1, Line 1
110802;An internal DMS error occurred that caused this operation to fail.
Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException,
Message: SqlNativeBufferReader.Run, error in OdbcExecuteQuery: SqlState:
42000, NativeError: 608, 'Error calling: SQLExecDirect(this->GetHstmt(), (SQLWCHAR *)statementText, SQL_NTS), SQL return code: -1 | SQL Error Info:
SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]No catalog entry found for partition ID
72057594047758336 in database 36. The metadata is inconsistent. Run DBCC
CHECKDB to check for a metadata corruption. | Error calling: pReadConn-
>ExecuteQuery(statementText, bufferFormat) | state: FFFF, number: 134148,
active connections: 100', Connection String: Driver={pdwodbc};APP=TypeC01-
DmsNativeReader:DB196\mpdwsvc (2504)- ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.196-
bb5f9dd884cf\sql\query
I'm sorry to hear about your experience with Azure SQL Data Warehouse. I believe this is a defect related to BIT data type handling for NOT NULL columns. Can you confirm that you have a BIT NOT NULL column (e.g., CREATE TABLE t1 (IsTrue BIT NOT NULL);)?
If so, a fix has been coded and is in testing for release. To mitigate this now, you can either switch to a TINY INT or remove the NOT NULL setting for the column.

Resources