Rebuild index failed on Hive on Azure HDInsight with Tez

I am trying to create indexes on Hive on Azure HDInsight with Tez enabled.
I can create the indexes successfully, but I can't rebuild them: the job fails with this output:
Map 1: -/- Reducer 2: 0/1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1421234198072_0091_1_01, diagnostics=[Vertex Input: measures initializer failed.]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1421234198072_0091_1_00, diagnostics=[Vertex received Kill in INITED state.]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
I have created my table and indexes with the following script:
DROP TABLE IF EXISTS Measures;
CREATE TABLE Measures(
topology string,
val double,
date timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE LOCATION 'wasb://<mycontainer>@<mystorage>.blob.core.windows.net/';
CREATE INDEX measures_index_topology ON TABLE Measures (topology) AS 'COMPACT' WITH DEFERRED REBUILD;
CREATE INDEX measures_index_date ON TABLE Measures (date) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX measures_index_topology ON Measures REBUILD;
ALTER INDEX measures_index_date ON Measures REBUILD;
Where am I wrong? And why does my index rebuild fail?
Best regards

It looks like Tez might have a problem with generating an index on an empty table. I was able to get the same error as you (without using the JSON SerDe), and if you look at the application logs for the DAG that fails, you might see something like:
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:299)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getSplits(TezGroupedSplitsInputFormat.java:68)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateOldSplits(MRHelpers.java:263)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:139)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:154)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:146)
...
If you populate the table with a single dummy record, it seems to work fine. I used:
INSERT INTO TABLE Measures SELECT market,0,0 FROM hivesampletable limit 1;
After that, the index rebuild was able to run without error.
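For reference, the full seed-then-rebuild sequence would look like this (hivesampletable is the sample table preloaded on HDInsight clusters):
-- Seed one dummy row so Tez has at least one input split, then rebuild
INSERT INTO TABLE Measures SELECT market, 0, 0 FROM hivesampletable LIMIT 1;
ALTER INDEX measures_index_topology ON Measures REBUILD;
ALTER INDEX measures_index_date ON Measures REBUILD;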

Related

Spark - nondeterministic expressions not allowed using DELETE FROM and EXISTS

The goal of this query is to delete the records whose keys match those of the new dataframe.
This is the query that I'm executing:
val op = spark.sql(s"""
  DELETE FROM TABLE1 AS t
  WHERE EXISTS (
    SELECT 1
    FROM TABLE2 AS s
    WHERE t.DAY = s.DAY
      AND t.DATA_STREAM = s.DATA_STREAM
  )""")
The error that I'm getting is:
AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
exists(t.DAY, t.DATA_STREAM)
in operator DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
;
DeleteCommandEdge com.databricks.sql.transaction.tahoe.DeltaLog#49e1c8fd, exists#107963 [DAY#108138 && DATA_STREAM#108193]
As far as I know, EXISTS is deterministic. Is there another way to achieve the same result with a different query?
The query is running on a Databricks cluster with Spark 3.2.1.
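One possible rewrite (my suggestion, not from the original thread) is to express the delete as a Delta MERGE, which avoids the EXISTS subquery in the DELETE plan. A sketch, assuming TABLE1 is a Delta table:
// Sketch: MERGE with WHEN MATCHED THEN DELETE removes the rows whose keys
// appear in TABLE2, equivalent to the EXISTS-based DELETE above
val op = spark.sql(s"""
  MERGE INTO TABLE1 AS t
  USING (SELECT DISTINCT DAY, DATA_STREAM FROM TABLE2) AS s
  ON t.DAY = s.DAY AND t.DATA_STREAM = s.DATA_STREAM
  WHEN MATCHED THEN DELETE""")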

Getting SyntaxException programmatically creating a table with the Cassandra Python driver

Error:
cassandra.protocol.SyntaxException: \
<Error from server: code=2000 [Syntax error in CQL query] \
message="line 1:36 no viable alternative at input '(' \
(CREATE TABLE master_table(dict_keys[(]...)">
Code:
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect('firstkey')
ColName = {"qty_dot_url": "int",
           "qty_hyphen_url": "int",
           "qty_underline_url": "int",
           "qty_slash_url": "int"}
columns = ColName.keys()
values = ColName.values()
session.execute('CREATE TABLE master_table({ColName} {dataType}),PRIMARY KEY(qty_dot_url)'.format(ColName=columns, dataType=values))
How do I resolve the above error?
So I replaced the session.execute with a print, and it produced this:
CREATE TABLE master_table(dict_keys(['qty_dot_url', 'qty_hyphen_url', 'qty_underline_url', 'qty_slash_url']) dict_values(['int', 'int', 'int', 'int'])),PRIMARY KEY(qty_dot_url)
That is not valid CQL. It needs to look like this:
CREATE TABLE master_table(qty_dot_url int, qty_hyphen_url int,
qty_underline_url int, qty_slash_url int, PRIMARY KEY(qty_dot_url))
I was able to create that by making these adjustments to your code:
createTableCQL = "CREATE TABLE master_table("
for key, value in ColName.items():
    createTableCQL += key + " " + value + ", "
createTableCQL += "PRIMARY KEY(qty_dot_url))"
You could then follow that with a session.execute(createTableCQL).
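Equivalently, the same string can be built in one expression with str.join:
# Same CQL as the loop above, built from the dict in a single expression
createTableCQL = ("CREATE TABLE master_table("
                  + ", ".join(key + " " + value for key, value in ColName.items())
                  + ", PRIMARY KEY(qty_dot_url))")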
Notes:
The PRIMARY KEY definition must be inside the paren list.
Creating schema from inside application code is often problematic, and can create a schema disagreement in the cluster. It's almost always better to create tables outside of code.
The syntax exception is a result of your Python code generating invalid CQL, as Aaron pointed out in his response.
To add to his answer, you need to take additional steps whenever you are programmatically making schema changes. In particular, you need to make sure that you check for schema agreement (i.e. that the schema change has been propagated to all nodes) before moving on to the next bit in your code.
You will need to modify your code to save the result from the schema change, for example:
resultset = session.execute(SimpleStatement("CREATE TABLE ..."))
then call this in your code:
resultset.response_future.is_schema_agreed
You'll need to loop through this check until True is returned. Depending on how long you want to wait (the default max_schema_agreement_wait is 10 seconds), you'll need to implement some logic to do [something] when schema agreement is not achieved (because a node is down, for example) -- this requires manual intervention from an operator to investigate the cluster.
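A minimal sketch of that check (the RuntimeError is an illustrative placeholder; it assumes session and createTableCQL from Aaron's answer):
from cassandra.query import SimpleStatement

resultset = session.execute(SimpleStatement(createTableCQL))

# The driver waits up to max_schema_agreement_wait before setting this flag;
# False means agreement was not reached in time and the cluster needs an
# operator's attention before any schema-dependent code runs.
if not resultset.response_future.is_schema_agreed:
    raise RuntimeError("Schema agreement not reached -- investigate the cluster")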
As Aaron already said, performing schema changes programmatically is very problematic, and we discourage doing it unless you fully understand the pitfalls and know how to handle failures. Cheers!

How to drop partitions from Hive views?

I have a partitioned view and I am trying to drop an existing partition from the view definition using hive CLI. However, when I try to drop a partition, it throws me the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
Here is my create statement for view:
CREATE OR REPLACE VIEW test_view (logrecordtype, datacenter, ts_date, gen_date)
PARTITIONED ON (ts_date, gen_date)
AS SELECT logrecordtype, datacenter, ts_date, gen_date from test_table1 where ts_date <= '20200720'
union all
select logrecordtype, datacenter, ts_date, gen_date from test_table2 where ts_date != '20200720';
The underlying tables test_table1, test_table2 are also partitioned by (ts_date, gen_date).
Drop partition command:
ALTER VIEW test_view DROP IF EXISTS PARTITION (ts_date = '20200720', gen_date = '2020072201')
I am able to add partitions and issue SHOW PARTITIONS on my view, but DROP PARTITION fails.
My show partition command shows:
show partitions test_view;
ts_date=20200720/gen_date=2020072201

Azure SQL Data Warehouse: No catalog entry found for partition ID <id> in database <id>. The metadata is inconsistent. Run DBCC CHECKDB

I am working on moving stored procedures from an on-prem SQL Server database to an Azure SQL Data Warehouse (ASDW). Throughout the process I have had to work around a few missing features -- time-consuming but not impossible. One thing I have had to do is replace CTEs followed by MERGE statements with temp tables followed by UPDATE/INSERT/DELETE statements (since CTEs cannot be followed by these statements). At the beginning of each SP I check for the temp tables and delete them if they exist.
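A minimal sketch of that rewrite (all table and column names here are illustrative):
-- Drop the temp table if a previous run left it behind
IF OBJECT_ID('tempdb..#changes') IS NOT NULL DROP TABLE #changes;

-- Materialize what used to be the CTE body via CTAS
CREATE TABLE #changes WITH (DISTRIBUTION = ROUND_ROBIN) AS
SELECT key_col, val_col FROM dbo.staging_table;

-- Then run the DML that could not follow a CTE, e.g. delete-then-insert
DELETE FROM dbo.target_table WHERE key_col IN (SELECT key_col FROM #changes);
INSERT INTO dbo.target_table (key_col, val_col) SELECT key_col, val_col FROM #changes;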
Today, I created another stored procedure in the ASDW without any temp tables (no updates/inserts/deletes, so I left the CTEs in there); it "compiled", and I was able to run it without issue (it returned an empty result set, as there is no data yet). I created another SP after this, and when I went to execute it, I got the following error:
...No catalog entry found for partition ID (id) in database 26. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption...
I then went back to the first SP that I mentioned, and it gave me the same error, even though it had previously run without flaw.
I tried running DBCC CHECKDB as instructed but alas, it is not supported/doesn't work.
I dug around a lot, and what I ended up doing was scaling my database from 100DWU's to 500DWU's. I am at 0.16% of my database storage size limit, and there is barely any data anywhere (total DB size is <300MB).
Is there an explanation for this? If not, I can't in good conscience use this platform in a production environment.
Full error:
Msg 110802, Level 16, State 1, Line 1
110802;An internal DMS error occurred that caused this operation to fail.
Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException,
Message: SqlNativeBufferReader.Run, error in OdbcExecuteQuery: SqlState: 42000, NativeError: 608,
'Error calling: SQLExecDirect(this->GetHstmt(), (SQLWCHAR *)statementText, SQL_NTS), SQL return code: -1 | SQL Error Info:
SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]No catalog entry found for partition ID 72057594047758336 in database 36. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption. | Error calling: pReadConn->ExecuteQuery(statementText, bufferFormat) | state: FFFF, number: 134148, active connections: 100',
Connection String: Driver={pdwodbc};APP=TypeC01-DmsNativeReader:DB196\mpdwsvc (2504)- ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.196-bb5f9dd884cf\sql\query
I'm sorry to hear about your experience with Azure SQL Data Warehouse. I believe this is a defect related to BIT data type handling for NOT NULL columns. Can you confirm that you have a BIT NOT NULL column (e.g., CREATE TABLE t1 (IsTrue BIT NOT NULL);)?
If so, a fix has been coded and is in testing for release. To mitigate this now, you can either switch to a TINYINT or remove the NOT NULL setting for the column.
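For example (t1 here is just the illustrative table from above; pick one of the two mitigations):
-- Shape that triggers the defect
CREATE TABLE t1 (IsTrue BIT NOT NULL);
-- Mitigation 1: switch the column to TINYINT
CREATE TABLE t1 (IsTrue TINYINT NOT NULL);
-- Mitigation 2: keep BIT but drop NOT NULL
CREATE TABLE t1 (IsTrue BIT NULL);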

Composite key in Cassandra with Pig

We have a CQL table that looks something like this:
CREATE TABLE data (
  occurday text,
  seqnumber int,
  occurtimems bigint,
  unique bigint,
  fields map<text, text>,
  PRIMARY KEY ((occurday, seqnumber), occurtimems, unique)
)
I can query this table from cqlsh like this:
select * from data where seqnumber = 10 AND occurday = '2013-10-01';
This query works and returns the expected data.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27' USING CqlStorage();
gives
InvalidRequestException(why:seqnumber cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:39567)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1625)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1611)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:591)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:621)
Shouldn't these behave the same? Why is the version through Pig failing where the straight cqlsh command works?
Hadoop is using CqlPagingRecordReader to try to load your data. This leads to queries that are not identical to what you entered: the paging record reader obtains small slices of Cassandra data at a time to avoid timeouts.
This means that your query is executed as
SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
token("occurday","seqnumber") <= ? AND occurday='A Great Day'
AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
And this is why you are seeing your repeated key error. I'll submit a bug to the Cassandra Project.
Jira:
https://issues.apache.org/jira/browse/CASSANDRA-6151
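Until that is fixed, one workaround (my sketch, not from the original answer) is to drop the where_clause and apply the restriction in Pig instead, at the cost of pulling more rows from Cassandra:
-- Load without a server-side where_clause, then filter in Pig
data = LOAD 'cql://ks/data' USING CqlStorage();
filtered = FILTER data BY occurday == '2013-10-01' AND seqnumber == 10;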
