Hive: create database fails with 'database already exists' - apache-spark

I have a test suite which runs several Spark unit tests. Each of these tests shares the same underlying Spark context.
While these tests run, I check whether a db exists and create it if it does not:
def dbExists(db: String) = spark.sql(s"show databases like '$db'").count > 0
if (!dbExists(db)) spark.sql(s"create database $db")
For some reason, one of the tests is failing. While debugging, I saw that for a certain db dbExists(db) returns false, yet the creation command fails with
ERROR RetryingHMSHandler:159 - AlreadyExistsException(message:Database db already exists)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Every time a test starts, I clean the environment by running drop database db cascade for each db other than the default one.
The only explanation I can give is that some corrupted metadata is left in the catalog, and Spark SQL thinks the db still exists when it no longer does.
The problem also happens inside a container with a fresh git clone of the project, so it is not a previous run of the application polluting the environment.
I run with Hive support enabled.

Try this:
Let Hive handle the existence check for you by creating the database with if not exists; the manual dbExists check is then unnecessary for creation:
spark.sql(s"create database if not exists $db")
This should work for you.
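As a rough illustration of the suggestion, here is how an idempotent per-test setup/teardown could look; this is a PySpark sketch with placeholder names, not the original Scala test code:

from pyspark.sql import SparkSession

# Sketch only: the app name and database name are placeholders.
spark = (SparkSession.builder
         .appName("db-setup-sketch")
         .enableHiveSupport()
         .getOrCreate())

db = "test_db"

# Teardown from the previous run: remove the database and all its tables.
spark.sql(f"DROP DATABASE IF EXISTS {db} CASCADE")

# Idempotent creation: Hive creates the database only if it is missing,
# so no separate existence check is needed.
spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")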

Related

AnalysisException when dropping a hive database using spark

Environment
spark 3.0.0
hive metastore (standalone) 3.0.0
mysql 8 as the metastore db
Problem
Every time I try to drop a database in the metastore via Spark, I get an AnalysisException, and I don't know what is causing it or whether the drop operation is succeeding in its entirety.
Example
spark.sql(f"CREATE DATABASE IF NOT EXISTS myDb LOCATION 'shared-metastore-location/myDb.db'")
################
# DB creation succeeds and I can view the db in the metastore, add tables etc
################
spark.sql(f"DROP DATABASE myDb CASCADE")
################
AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to clean up java.sql.SQLException: The table does not comply with the requirements by an external plugin.\n\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)\n\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)\n\tat com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)\n\tat com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1335)\n\tat com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2108)\n\tat com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1245)\n\tat com.zaxxer.hikari.pool.ProxyStatement.executeUpdate(ProxyStatement.java:117)\n\tat com.zaxxer.hikari.pool.HikariProxyStatement.executeUpdate(HikariProxyStatement.java)\n\tat org.apache.hadoop.hive.metastore.txn.TxnHandler.cleanupRecords(TxnHandler.java:2741)\n\tat org.apache.hadoop.hive.metastore.AcidEventListener.onDropDatabase(AcidEventListener.java:52)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$21.notify(MetaStoreListenerNotifier.java:85)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:264)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:326)\n\tat org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:364)\n\tat org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database_core(HiveMetaStore.java:1537)\n\tat org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database(HiveMetaStore.java:1575)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)\n\tat org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)\n\tat com.sun.proxy.$Proxy32.drop_database(Unknown Source)\n\tat org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_database.getResult(ThriftHiveMetastore.java:14352)\n\tat org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_database.getResult(ThriftHiveMetastore.java:14336)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:111)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:107)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)\n\tat org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:119)\n\tat org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n);'
Despite the exception, the database does disappear after I run this code. And if I try to run the drop command a second time, I get a different exception saying that the database doesn't exist. But I have no idea whether the operation has succeeded in its entirety or whether it's leaving a mess behind. I'm not familiar enough with Hive to know what should be deleted in the metastore to completely delete a db.
I seem to get the same result whether I have tables in the db or not. I've also tried dropping the database without CASCADE. Same result.
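If it helps, one way to see what Spark and the metastore still report after the failed-looking drop is to query the catalog; a minimal PySpark sketch (assuming an existing spark session and the example database name myDb from above):

from pyspark.sql.utils import AnalysisException

# Attempt the drop and capture the exception instead of failing the job.
try:
    spark.sql("DROP DATABASE myDb CASCADE")
except AnalysisException as e:
    print(f"Drop raised: {e}")

# If the database is really gone, it should no longer be listed here.
remaining = [db.name for db in spark.catalog.listDatabases()]
print("myDb" in remaining)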

Append database prefix to table name with sequelize

I'm wondering if there is a way to have Sequelize append the database name to a specific query.
When Sequelize runs a query it looks like this:
SELECT "desired_field" FROM "user_account" AS "user_account" WHERE "user_account"."username" = 'jacstrong' LIMIT 1;
And it returns nothing.
However when I run a manual query from the command line it returns the data I want.
SELECT "desired_field" FROM database_name."user_account" AS "user_account" WHERE "user_account"."username" = 'jacstrong' LIMIT 1;
Is there any way to make Sequelize do this?
Note: Everything is running fine in my production environment, but I exported the production db and ran pg_restore on my local machine and the application isn't connecting to it correctly.
I do this all the time. Before you take the backup, make sure you set Dump Options --> Do Not Save --> Owner to yes. Sometimes mine still looks like it fails, but it really doesn't. I also don't bother dropping the whole database every time; I just drop the schema I am restoring. So you could simply create your database locally with whatever credentials your dev environment uses, drop the desired schema/schemas, and restore the db with no owner whenever you want to blow your data away.

WSO2 Stream Processor Connecting to Cassandra gives error

I am trying to connect my Siddhi application to the Cassandra data store by following the instructions given in the example program provided in the editor.
I downloaded the DataStax Java driver JAR (OSGi) and placed it in the WSO2/lib folder, then started the application. Now I get this error:
> [2019-03-10_16-41-17_549] ERROR {org.wso2.siddhi.core.table.Table} - Error on 'Store-cassandra'. . Error while connecting to Table 'SweetProductionTable'. (Encoded)
java.lang.NullPointerException
at org.wso2.extension.siddhi.store.cassandra.CassandraEventTable.connect(CassandraEventTable.java:443)
at org.wso2.siddhi.core.table.Table.connectWithRetry(Table.java:388)
at org.wso2.siddhi.core.SiddhiAppRuntime.startWithoutSources(SiddhiAppRuntime.java:401)
at org.wso2.siddhi.core.SiddhiAppRuntime.start(SiddhiAppRuntime.java:376)
at org.wso2.carbon.siddhi.editor.core.internal.DebugRuntime.start(DebugRuntime.java:68)
at org.wso2.carbon.siddhi.editor.core.internal.DebugProcessorService.start(DebugProcessorService.java:37)
at org.wso2.carbon.siddhi.editor.core.internal.EditorMicroservice.start(EditorMicroservice.java:588)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invokeResource(HttpMethodInfo.java:187)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invoke(HttpMethodInfo.java:143)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.dispatchMethod(MSF4JHttpConnectorListener.java:218)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.lambda$onMessage$57(MSF4JHttpConnectorListener.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2019-03-10_16-41-17_551] ERROR {org.wso2.siddhi.core.SiddhiAppRuntime} - Error starting Siddhi App 'Store-cassandra', triggering shutdown process. null (Encoded)
Below is the corresponding code
define stream SweetProductionStream (id int, name string);
#Store(type='cassandra' , cassandra.host='localhost' ,keyspace='production')
#index('uid')
#primaryKey('uid')
define table SweetProductionTable (uid int, name string);
/* Inserting event into the cassandra keyspace */
#info(name='query1')
from SweetProductionStream
select SweetProductionStream.id as uid, SweetProductionStream.name
insert into SweetProductionTable;
and below are the instructions given in the example:
Prerequisites:
1) Ensure that Cassandra version 3 or above is installed on your machine.
2) Add the DataStax Java driver into {WSO2_SP_HOME}/lib as follows:
a) Download the DataStax Java driver from: http://central.maven.org/maven2/com/datastax/cassandra/cassandra-driver-core/3.3.2/cassandra-driver-core-3.3.2.jar
b) Use the "jartobundle" tool in {WSO2_SP_Home}/bin to extract and convert the above JARs into OSGi bundles.
For Windows: <SP_HOME>/bin/jartobundle.bat <PATH_OF_DOWNLOADED_JAR> <PATH_OF_CONVERTED_JAR>
For Linux: <SP_HOME>/bin/jartobundle.sh <PATH_OF_DOWNLOADED_JAR> <PATH_OF_CONVERTED_JAR>
Note: The driver given in the above link is an OSGi-bundled one. Please skip this step if the jar is already OSGi bundled.
c) Copy the converted bundles to the {WSO2_SP_Home}/lib directory.
3) Create a keyspace named 'production' in the Cassandra store.
4) In the store configuration of this application, replace 'username' and 'password' values with your Cassandra credentials.
5) Save this sample.
Executing the Sample:
1) Start the Siddhi application by clicking on 'Run'.
2) If the Siddhi application starts successfully, the following message is shown on the console
* Store-cassandra.siddhi - Started Successfully!
Note:
If you want to edit this application while it's running, stop the application, make your edits and save the application, and then start it again.
Testing the Sample:
1) Simulate single events:
a) Click on 'Event Simulator' (double arrows on left tab) and click 'Single Simulation'
b) Select 'Store-cassandra' as 'Siddhi App Name' and select 'searchSweetProductionStream' as 'Stream Name'.
c) Provide attribute values, and then click Send.
2) Send at least one event where the name matches a name value in the data you previously inserted into the SweetProductionTable. This will satisfy the 'on' condition of the join query.
3) Optionally, send events to the other corresponding streams to add, delete, update, insert, and search events.
Notes:
- After a change in the store, you can use the search stream to see whether the operation is successful.
- The Primary Key constraint in SweetProductionTable is disabled, because the name cannot be used as a PrimaryKey in a ProductionTable.
- You can use Siddhi functions to create a unique ID for the received events, which can then be used to apply the Primary Key constraint on the data store records. (http://wso2.github.io/siddhi/documentation/siddhi-4.0/#function)
Viewing the Results:
See the output for raw materials on the console. You can use searchSweetProductionStream to check for inserted, deleted, and updated events.
*/
Thanks in advance.
Please provide your Cassandra credentials (username and password) in the store annotation.
Eg: #store(type='cassandra' , cassandra.host='localhost', username='cassandra', password='cassandra',keyspace='production',
column.family='SweetProductionTable')
Please refer to this sample:
Siddhi store cassandra

How to use Airflow SparkSQLOperator?

I'd like to run a query using Spark SQL in Airflow, it looks like the SparkSQLOperator is perfect for this (https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_sql_operator.py)
However, I can't figure out how the connection must be configured.
In DB Visualizer, I can connect to the Hive database using:
driver : jdbc
database url : jdbc:hive2://myserver.com:10000/default
database userid : me
database password : mypassword
Applying these settings to the spark_sql_default connection gives me:
[2017-12-12 11:35:33,774] {models.py:1462} ERROR - Cannot execute on hive2://myserver.com:10000/default. Error code is: 1. Output: , Stderr:
Any ideas?
Spark isn't a database, so you configure your Spark connection just like you would if you were submitting regular Spark jobs, rather than with the JDBC-like parameters you're used to.
If you look at the signature of the operator:
def __init__(self,
             sql,
             conf=None,
             conn_id='spark_sql_default',
             executor_cores=None,
             executor_memory=None,
             keytab=None,
             master='yarn',
             name='default-name',
             num_executors=None,
             yarn_queue='default',
             *args,
             **kwargs)
You need to create a connection (see models.Connection) that specifies the Spark master, which would be one of 'yarn', 'local[x]', or 'spark://hostname:port'. That should really be it, since they default the rest for you and it is likely handled by the underlying Spark config.
Your SQL or HQL code/script can be passed into the first parameter.
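For illustration only, a minimal DAG using the contrib SparkSqlOperator might look like the sketch below; the connection id, master, table name, and schedule are placeholders, not from the original post:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_sql_operator import SparkSqlOperator

# Hypothetical DAG definition for running a Spark SQL query.
dag = DAG(
    dag_id="spark_sql_example",
    start_date=datetime(2017, 12, 1),
    schedule_interval=None,
)

run_query = SparkSqlOperator(
    task_id="run_hive_query",
    sql="SELECT COUNT(*) FROM default.my_table",  # your SQL/HQL goes here
    master="yarn",                                # or local[x] / spark://hostname:port
    conn_id="spark_sql_default",
    dag=dag,
)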

Azure SQL Data Warehouse: No catalog entry found for partition ID <id> in database <id>. The metadata is inconsistent. Run DBCC CHECKDB

I am working on moving stored procedures from an on-prem SQL Server database to an Azure SQL Data Warehouse (ASDW). Throughout the process I have had to work around a few missing features - time-consuming but not impossible. One thing I have had to do is replace CTEs followed by MERGE statements with temp tables followed by UPDATE/INSERT/DELETE statements (since CTEs cannot be followed by these statements). At the beginning of each SP I check for the temp tables and delete them if they exist.
Today, I created another stored procedure in the ASDW without any temp tables (no updates/inserts/deletes, so I left the CTEs in there), it "compiled", and I was able to run it without issue (returned an empty result set, as there is no data yet). I created another SP after this, and when I went to execute it, I got the following error:
...No catalog entry found for partition ID (id) in database 26. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption...
I then went back to the first SP that I mentioned, and it gave me the same error, even though it had previously run without flaw.
I tried running DBCC CHECKDB as instructed but alas, it is not supported/doesn't work.
I dug around a lot, and what I ended up doing was scaling my database from 100 DWUs to 500 DWUs. I am at 0.16% of my database storage size limit, and there is barely any data anywhere (total DB size is <300MB).
Is there an explanation for this? If not, I can't in good conscience use this platform in a production environment.
Full error:
Msg 110802, Level 16, State 1, Line 1
110802;An internal DMS error occurred that caused this operation to fail.
Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException,
Message: SqlNativeBufferReader.Run, error in OdbcExecuteQuery: SqlState: 42000, NativeError: 608, 'Error calling: SQLExecDirect(this->GetHstmt(), (SQLWCHAR *)statementText, SQL_NTS), SQL return code: -1 | SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]No catalog entry found for partition ID 72057594047758336 in database 36. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption. | Error calling: pReadConn->ExecuteQuery(statementText, bufferFormat) | state: FFFF, number: 134148, active connections: 100', Connection String: Driver={pdwodbc};APP=TypeC01-DmsNativeReader:DB196\mpdwsvc (2504)- ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.196-bb5f9dd884cf\sql\query
I'm sorry to hear about your experience with Azure SQL Data Warehouse. I believe this is a defect related to BIT data type handling for NOT NULL columns. Can you confirm that you have a BIT NOT NULL column (e.g., CREATE TABLE t1 (IsTrue BIT NOT NULL);)?
If so, a fix has been coded and is in testing for release. To mitigate this now, you can either switch to a TINYINT or remove the NOT NULL setting for the column.
