Cassandra ArrayIndexOutOfBoundsException

I have created the following schema to represent the association between a user and a set of threads ordered by their last message (which threads the user has read and which ones he has not):
CREATE TABLE table (
    user_id bigint,
    message_id bigint,
    thread_id bigint,
    read boolean,
    PRIMARY KEY (user_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
CREATE INDEX ON table(read);
After inserting some values, I try to run this query to get the most recent read or unread threads for a user:
SELECT thread_id, message_id FROM table WHERE user_id = ? AND message_id < ? AND read = ? LIMIT ?
The query works if run via cqlsh. However, when run through the DataStax client, we get a timeout exception on the client side, and on the server side the Cassandra log shows this exception:
ERROR [ReadStage:4190] 2013-12-10 13:18:03,579 CassandraDaemon.java (line 187) Exception in thread Thread[ReadStage:4190,5,main]
java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1940)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.cassandra.db.filter.SliceQueryFilter.start(SliceQueryFilter.java:261)
at org.apache.cassandra.db.index.composites.CompositesSearcher.makePrefix(CompositesSearcher.java:66)
at org.apache.cassandra.db.index.composites.CompositesSearcher.getIndexedIterator(CompositesSearcher.java:101)
at org.apache.cassandra.db.index.composites.CompositesSearcher.search(CompositesSearcher.java:53)
at org.apache.cassandra.db.index.SecondaryIndexManager.search(SecondaryIndexManager.java:537)
at org.apache.cassandra.db.ColumnFamilyStore.search(ColumnFamilyStore.java:1669)
at org.apache.cassandra.db.PagedRangeCommand.executeLocally(PagedRangeCommand.java:109)
at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1423)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1936)
... 3 more
Does anyone know what the problem is? Thanks!
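For reference, a minimal sketch of how the statement is prepared and executed through the DataStax Java driver from Scala (2.x-era API; the contact point, keyspace name, and bound values are placeholders, not my actual code):
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object ReadQueryRepro {
  def main(args: Array[String]): Unit = {
    // Contact point and keyspace are placeholders for illustration only.
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    // Same statement as above; the secondary index on `read` combined with the
    // clustering-column restriction is what hits the CompositesSearcher path.
    val prepared = session.prepare(
      "SELECT thread_id, message_id FROM table " +
      "WHERE user_id = ? AND message_id < ? AND read = ? LIMIT ?")

    // Example bound values: bigint, bigint, boolean, int.
    val bound = prepared.bind(
      java.lang.Long.valueOf(42L),
      java.lang.Long.valueOf(1000L),
      java.lang.Boolean.FALSE,
      java.lang.Integer.valueOf(20))

    for (row <- session.execute(bound).asScala) {
      println(s"${row.getLong("thread_id")} ${row.getLong("message_id")}")
    }

    cluster.close()
  }
}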

This bug can now be tracked at https://issues.apache.org/jira/browse/CASSANDRA-6470

Related

Spark GroupBy throws StateSchemaNotCompatible exception with a different "Existing key schema"

I am reading and writing events from Event Hub in Spark, aggregating based on a few keys like this:
val df1 = df0
  .groupBy(
    colKey,
    colTimestamp
  )
  .agg(
    collect_list(
      struct(
        colCreationTimestamp,
        colRecordId
      )
    ).as("Records")
  )
But I am getting this error at runtime:
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesn't contain an exact line number referencing my code, so I narrowed it down to this code based on the provided key schema columns; if I change the groupBy key columns, the error changes accordingly.
I tried different things, like an explicit df0.select() of the required columns before the groupBy to ensure the incoming data had them, but got the same error.
Can someone suggest how it is picking the existing key schema, or what I should look for to resolve this?
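As a stopgap, the error message itself says the check can be disabled via spark.sql.streaming.stateStore.stateSchemaCheck; a minimal sketch of setting that when building the session (the app name is illustrative, and the message's own warning about indeterministic behaviour still applies):
import org.apache.spark.sql.SparkSession

// Build (or fetch) the session with state-store schema validation turned off.
// This only silences the check; running with an incompatible schema can still
// lead to indeterministic behaviour, as the error message warns.
val spark = SparkSession.builder()
  .appName("eventhub-aggregation")   // illustrative name
  .config("spark.sql.streaming.stateStore.stateSchemaCheck", "false")
  .getOrCreate()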
Update [solved for me]
While writing the records to Event Hub, the EventHubs Spark library stores its streaming state in the checkpoint directory, which still held state from an older run and caused the StateSchemaNotCompatible issue. Pointing to a new checkpoint directory solved it for me.
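For clarity, a minimal sketch of what pointing at a new checkpoint directory looks like on the writer side (the sink, trigger, and path are illustrative assumptions, not my exact job):
import org.apache.spark.sql.streaming.Trigger

// df1 is the aggregated streaming DataFrame from the groupBy/agg above.
val query = df1.writeStream
  .outputMode("update")
  .format("console")   // illustrative sink; the real job writes back to Event Hubs
  // A fresh checkpoint location avoids picking up state written with the old
  // tuple-style (_1/_2) key schema that triggered StateSchemaNotCompatible.
  .option("checkpointLocation", "/tmp/checkpoints/eventhub-agg-v2")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()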

What are datanucleus DELETEME* tables and when are they CREATED and DROPPED

I am getting the following exception while executing Spark jobs:
org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown obtaining schema column information from datastore
This is caused by:
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'hive_metastore.DELETEME1530184568175' doesn't exist
I had the same problem:
22/04/10 04:10:30 ERROR Datastore: Error thrown executing CREATE TABLE `DELETEME1649563830414`
(
`UNUSED` INTEGER NOT NULL
) ENGINE=INNODB : CREATE command denied to user 'someuser'@'xx.xx.xx.xx' for table 'DELETEME1649563830414'
Since it was for a development environment, I just granted all privileges:
mysql> GRANT ALL PRIVILEGES ON *.* TO 'someuser'@'%';

Presto: error while querying information_schema.columns

When I run select * from hive.information_schema.columns; in the Presto client, I get this error:
Query 20170208_085534_00061_ny9tu failed: outputFormat should not be accessed from a null StorageFormat
However, it succeeds when I select from other tables in information_schema, such as select * from hive.information_schema.tables;
Can anybody help?
Thanks.
This is a bug caused by a table that has metadata we aren't expecting.
I scanned all tables in Hive and found that some tables have a null InputFormat/OutputFormat. Presto gives the same error if I DESCRIBE TABLENAME for those tables with a null InputFormat/OutputFormat.

Rebuild index failed on Hive on Azure HDInsight with Tez

I am trying to create indexes on Hive on Azure HDInsight with Tez enabled.
I can successfully create indexes but I can't rebuild them: the job fails with this output:
Map 1: -/- Reducer 2: 0/1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1421234198072_0091_1_01, diagnostics=[Vertex Input: measures initializer failed.]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1421234198072_0091_1_00, diagnostics=[Vertex received Kill in INITED state.]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
I have created my table and indexes with the following job:
DROP TABLE IF EXISTS Measures;
CREATE TABLE Measures(
topology string,
val double,
date timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE LOCATION 'wasb://<mycontainer>@<mystorage>.blob.core.windows.net/';
CREATE INDEX measures_index_topology ON TABLE Measures (topology) AS 'COMPACT' WITH DEFERRED REBUILD;
CREATE INDEX measures_index_date ON TABLE Measures (date) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX measures_index_topology ON Measures REBUILD;
ALTER INDEX measures_index_date ON Measures REBUILD;
Where am I wrong? And why does my index rebuild fail?
Best regards
It looks like Tez might have a problem with generating an index on an empty table. I was able to get the same error as you (without using the JSON SerDe), and if you look at the application logs for the DAG that fails, you might see something like:
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:299)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getSplits(TezGroupedSplitsInputFormat.java:68)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateOldSplits(MRHelpers.java:263)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:139)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:154)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:146)
...
If you populate the table with a single dummy record, it seems to work fine. I used:
INSERT INTO TABLE Measures SELECT market,0,0 FROM hivesampletable limit 1;
After that, the index rebuild was able to run without error.

Storing CQL3 Map from Pig with CqlStorage

I have created a table in Cassandra with CQL3 and I want to store some data from Pig into it. I created the table with:
CREATE TABLE test
(firstname VARCHAR,
surname VARCHAR,
averagekm DOUBLE,
firstyear INT,
lastyear INT,
km MAP<INT,INT>,
PRIMARY KEY (firstname, surname)
);
My dataset in Pig is:
grunt> describe cqlFormat;
cqlFormat: {((chararray,Firstname: chararray), (chararray,Surname: chararray)), (Averagekm: double,FirstYear: int,LastYear: int,Km: (chararray, (Year: int,Km: int)))}
And I want to store it with
grunt> store cqlFormat into 'cql://test/test?output_query=UPDATE+test.test+set+averagekm+%3D+%3F%2C+firstyear+%3D+%3F%2C+lastyear+%3D+%3F%2C+km+%3D+%3F' using CqlStorage();
I get this error in my log file:
Backend error message
---------------------
java.io.IOException: java.io.IOException: InvalidRequestException(why:Not enough bytes to read a map)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:428)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:408)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:262)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.io.IOException: InvalidRequestException(why:Not enough bytes to read a map)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:248)
Caused by: InvalidRequestException(why:Not enough bytes to read a map)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:41868)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1689)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1674)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:232)
Storing the data works fine when I leave out the map of kilometers.
I am using DataStax DSE 3.1.2 with Pig 0.9.2 (r1234438), cqlsh 3.1.2, Cassandra 1.2.6.5, CQL spec 3.0.0. Additionally, I created the tuples by following these links:
https://issues.apache.org/jira/browse/CASSANDRA-5867?focusedCommentId=13753070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753070
http://www.datastax.com/docs/datastax_enterprise3.1/solutions/about_pig#pig-read-write
Thank you!
Dump cqlFormat to check the data format. Make sure it follows
(((firstname, somename),(surname, somename)),(averagekm, firstyear,lastyear,(map,(key1,value1),(key2,value2))))
and then it should work.
