Storing CQL3 Map from Pig with CqlStorage - cassandra

I have created a table in Cassandra with CQL3 and I want to store some data from Pig in it. I created the table with:
CREATE TABLE test
(firstname VARCHAR,
surname VARCHAR,
averagekm DOUBLE,
firstyear INT,
lastyear INT,
km MAP<INT,INT>,
PRIMARY KEY (firstname, surname)
);
My dataset in Pig is:
grunt> describe cqlFormat;
cqlFormat: {((chararray,Firstname: chararray), (chararray,Surname: chararray)), (Averagekm: double,FirstYear: int,LastYear: int,Km: (chararray, (Year: int,Km: int)))}
And I want to store it with
grunt> store cqlFormat into 'cql://test/test?output_query=UPDATE+test.test+set+averagekm+%3D+%3F%2C+firstyear+%3D+%3F%2C+lastyear+%3D+%3F%2C+km+%3D+%3F' using CqlStorage();
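As a side note, the percent-encoded output_query in the store statement above does not have to be written by hand. A minimal sketch using Python's standard library follows; the helper name build_cql_url is hypothetical and is not part of Pig or Cassandra:

```python
from urllib.parse import quote_plus


def build_cql_url(keyspace, table, output_query):
    """Return a cql:// location with the prepared statement URL-encoded.

    Hypothetical helper for illustration only.
    """
    # quote_plus turns spaces into '+' and encodes '=', '?' and ','
    # as %3D, %3F and %2C respectively.
    return "cql://{}/{}?output_query={}".format(
        keyspace, table, quote_plus(output_query)
    )


url = build_cql_url(
    "test",
    "test",
    "UPDATE test.test set averagekm = ?, firstyear = ?, lastyear = ?, km = ?",
)
print(url)
```

The printed location should match the string passed to store above, which is a quick way to check that the encoding itself is not the cause of the error.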
I get this error in my log file:
Backend error message
---------------------
java.io.IOException: java.io.IOException: InvalidRequestException(why:Not enough bytes to read a map)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:428)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:408)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:262)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.io.IOException: InvalidRequestException(why:Not enough bytes to read a map)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:248)
Caused by: InvalidRequestException(why:Not enough bytes to read a map)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:41868)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1689)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1674)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:232)
Storing the data works fine when I leave out the map of kilometers.
I am using Datastax DSE 3.1.2 with Pig 0.9.2 (r1234438), cqlsh 3.1.2, Cassandra 1.2.6.5, CQL spec 3.0.0. Additionally, I built the tuples based on these links:
https://issues.apache.org/jira/browse/CASSANDRA-5867?focusedCommentId=13753070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753070
http://www.datastax.com/docs/datastax_enterprise3.1/solutions/about_pig#pig-read-write
Thank you!

Dump cqlFormat to see the actual data format.
Make sure it follows the format
(((firstname, somename),(surname, somename)),(averagekm, firstyear,lastyear,(map,(key1,value1),(key2,value2))))
and then it should work.
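To make the expected shape concrete, here is a small Python sketch (not Pig code) that arranges one row into the nested structure quoted in this answer. The function name and the sample values are made up for illustration, and the leading "map" element simply follows the format given above:

```python
def to_cql_storage_shape(firstname, surname, averagekm, firstyear, lastyear, km_by_year):
    """Arrange one row as ((key pairs), (values bound to the output_query)).

    Hypothetical helper that mirrors the tuple layout from the answer.
    """
    # First tuple: one (name, value) pair per primary key column.
    keys = (("firstname", firstname), ("surname", surname))
    # Map column: the string "map" followed by one (key, value) pair per entry.
    km_map = ("map",) + tuple(sorted(km_by_year.items()))
    # Second tuple: values in the same order as the '?' placeholders.
    values = (averagekm, firstyear, lastyear, km_map)
    return (keys, values)


row = to_cql_storage_shape("John", "Doe", 1250.5, 2010, 2012, {2010: 1000, 2011: 1500})
print(row)
```

Comparing this layout with the output of dump cqlFormat should show where the Pig relation differs, for example an extra level of nesting around the map entries.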

Related

Spark not able to write into a new hive table in partitioned and append mode

I created a new table in Hive, partitioned and stored as ORC.
I am writing into this table from Spark using append mode, the ORC format, and partitioning.
It fails with the exception:
org.apache.spark.sql.AnalysisException: The format of the existing table test.table1 is `HiveFileFormat`. It doesn't match the specified format `OrcFileFormat`.;
I changed the format from "orc" to "hive" while writing. It still fails with the exception:
Spark not able to understand the underlying structure of table .
So this issue happens because Spark is not able to write into the Hive table in append mode, as it can't create a new table. Overwrite works because Spark creates the table again.
But my use case is to write in append mode from the start. insertInto also does not work, specifically for partitioned tables. I am pretty much blocked with my use case. Any help would be great.
Edit1:
Working on HDP 3.1.0 environment.
Spark Version is 2.3.2
Hive Version is 3.1.0
Edit 2:
// Reading the table
val inputdf=spark.sql("select id,code,amount from t1")
//writing into table
inputdf.write.mode(SaveMode.Append).partitionBy("code").format("orc").saveAsTable("test.t2")
Edit 3: Using insertInto()
val df2 =spark.sql("select id,code,amount from t1")
df2.write.format("orc").mode("append").insertInto("test.t2");
I get the error as:
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN HiveMetastoreCatalog: Unable to infer schema for table test.t1 from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
If I rerun the insertInto command, I get the following exception:
20/05/17 19:16:37 ERROR Hive: MetaException(message:The transaction for alter partition did not commit successfully.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
Error in the Hive metastore logs:
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(907)) - 163: alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(349)) - ugi=X#A.ORG ip=10.10.1.36 cmd=alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions_with_environment_context(5119)) - New partition values:[BR]
2020-05-17T21:17:43,913 ERROR [pool-8-thread-198]: metastore.ObjectStore (ObjectStore.java:alterPartitions(4397)) - Alter failed
org.apache.hadoop.hive.metastore.api.MetaException: Cannot change stats state for a transactional table without providing the transactional write state for verification (new write ID -1, valid write IDs null; current state null; new state {}
I was able to resolve the issue by using external tables in my use case. There is currently an open issue in Spark related to the ACID properties of Hive. Once I create the Hive table as an external table, I am able to do append operations on both partitioned and non-partitioned tables.
https://issues.apache.org/jira/browse/SPARK-15348

Cassandra CREATE CUSTOM INDEX ERROR java.lang.ClassNotFoundException

-> Table :
cassandra@cqlsh:coba> CREATE TABLE data(
... nim int,
... nama text,
... alamat text,
... PRIMARY KEY (nim, alamat)
... );
-> Make Index :
CREATE CUSTOM INDEX cari_alamat ON coba.data (alamat) USING 'org.apache.cassandra.index.sasi.SASIIndex';
-> Error :
ServerError: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.cassandra.index.sasi.SASIIndex
-> I would be very happy if you could help me
-> Thank you
As initially suspected, I think you are running a Cassandra version lower than 3.4.
(That's why I asked for the version.)
I tried it out and got the same error on 3.0.10:
cqlsh:test> CREATE CUSTOM INDEX cari_alamat ON test.data (alamat) USING 'org.apache.cassandra.index.sasi.SASIIndex';
ConfigurationException: Unable to find custom indexer class 'org.apache.cassandra.index.sasi.SASIIndex'
Theoretically you could implement your own with:
Cassandra Custom Secondary Index
But I guess it's just easier to upgrade.
Also be aware that there might be some bugs with SASI indexes:
SASI Indexes in Cassandra seem to have some bugs
It is probably better to search the Cassandra Jira for this one, though; take this just as a small warning.
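The version requirement mentioned in this answer can be expressed as a simple check. This is only a sketch: the function name is hypothetical, and it assumes a plain dotted version string such as the one reported by cqlsh, without suffixes like "-SNAPSHOT":

```python
def supports_sasi(version):
    """Return True if a Cassandra version string is 3.4 or newer.

    Hypothetical helper; assumes a purely numeric dotted version.
    """
    # Only the major and minor numbers matter here, e.g. "3.0.10" -> (3, 0).
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (3, 4)


print(supports_sasi("3.0.10"))
print(supports_sasi("3.11.4"))
```

Comparing tuples of integers rather than raw strings avoids the trap where "3.11" sorts before "3.4" alphabetically.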

Composite key in Cassandra with Pig and where_clause for part of the key in the where clause

I basically have the same problem as the question "Composite key in Cassandra with Pig". The only difference is that I try to query on only part of the composite key in Pig's where_clause.
The data structure is similar to the one in that question; I'll copy some code and context here so you don't have to read it in full.
We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
Instead of querying on both seqnumber and occurday (as in the previously mentioned question), I try to query on just one of the keys.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=occurday%3D%272013-10-01%27' USING CqlStorage();
gives
java.lang.RuntimeException
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: InvalidRequestException(why:occurday cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result$prepare_cql3_query_resultStandardScheme.read(Cassandra.java:51017)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result$prepare_cql3_query_resultStandardScheme.read(Cassandra.java:50994)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:50933)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1756)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1742)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:605)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:635)
... 7 more
Basically my question is: what am I doing wrong, or what don't I understand?
As I understand from "CqlPagingRecorderReader Used when Partition Key Is Explicitly Stated",
I should be able to query with just part of the partition key?
Also, while reading
"Add CqlRecordReader to take advantage of native CQL pagination"
I get the impression this should be possible, but I am going around in circles with (in my opinion) no clear direction on how to accomplish this.
Any help is very welcome at this point.
Regards,
Lennart Weijl
PS.
I am running on Cassandra 2.0.9 with Pig 0.13.0
According to CASSANDRA-6311, I believe you need to apply the 6331-v2-2.0-branch.txt patch, recompile pig, and then update your LOAD statement to:
data = LOAD 'cql://ks/data?where_clause=occurday%3D%272013-10-01%27' USING CqlInputFormat();
The key change being USING CqlInputFormat() which triggers the use of the new CqlRecordReader that was released in Cassandra 2.0.7.
Edit: Note that the exception is thrown from CqlPagingRecordReader which means you're still using the old record reader.

Hive error finding unmapped keyspaces

I am trying to create an external table in Hive as shown on page 88 of the DataStax Enterprise 3.1 documentation.
The statement is shown further below, together with the error message.
What am I doing wrong?
Regards Hans-Peter
hive> create external table testext (m string, n string, o string, p string)
> STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
> TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks",
> "cassandra.cf.name" = "test",
> "cassandra.cql3.type" = "text, text, text, text");
FAILED: Error in metadata:
com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStoreException:
There was a problem with the Cassandra Hive MetaStore: Problem finding unmapped
keyspaces
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
2013-10-15 12:47:36,657 WARN conf.HiveConf (HiveConf.java:(63)) - DEPRECATED: Ignoring hive-default.xml found on the CLASSPATH at /etc/dse/hive/hive-default.xml
2013-10-15 12:48:41,003 WARN config.DatabaseDescriptor (DatabaseDescriptor.java:loadYaml(253)) - Please rename 'authority' to 'authorizer' in cassandra.yaml
2013-10-15 12:48:42,988 ERROR exec.Task (SessionState.java:printError(400)) - FAILED: Error in metadata: com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStoreException: There was a problem with the Cassandra Hive MetaStore: Problem finding unmapped keyspaces
org.apache.hadoop.hive.ql.metadata.HiveException: com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStoreException: There was a problem with the Cassandra Hive MetaStore: Problem finding unmapped keyspaces
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:544)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3305)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:242)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStoreException: There was a problem with the Cassandra Hive MetaStore: Problem finding unmapped keyspaces
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.createKeyspaceSchemasIfNeeded(SchemaManagerService.java:230)
at com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore.setConf(CassandraHiveMetaStore.java:112)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.hive.metastore.RetryingRawStore.<init>(RetryingRawStore.java:62)
at org.apache.hadoop.hive.metastore.RetryingRawStore.getProxy(RetryingRawStore.java:71)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:346)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:333)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:371)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:278)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:248)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:114)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:538)
... 17 more
Caused by: com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStoreException: There was a problem with the Cassandra Hive MetaStore: There was a problem retrieving column families for keyspace demo
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.createUnmappedTables(SchemaManagerService.java:277)
at com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore.getDatabase(CassandraHiveMetaStore.java:148)
at com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore.getDatabase(CassandraHiveMetaStore.java:136)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.isKeyspaceMapped(SchemaManagerService.java:186)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.finUnmappedKeyspaces(SchemaManagerService.java:137)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.createKeyspaceSchemasIfNeeded(SchemaManagerService.java:224)
... 31 more
Caused by: com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStoreException: There was a problem with the Cassandra Hive MetaStore: Problem creating column mappingsorg.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.buildTable(SchemaManagerService.java:481)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.createUnmappedTables(SchemaManagerService.java:254)
... 36 more
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:247)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at org.apache.cassandra.db.marshal.AbstractCompositeType.getString(AbstractCompositeType.java:226)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.addTypeToStorageDescriptor(SchemaManagerService.java:846)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.buildColumnMappings(SchemaManagerService.java:546)
at com.datastax.bdp.hadoop.hive.metastore.SchemaManagerService.buildTable(SchemaManagerService.java:460)
... 37 more
2013-10-15 12:48:42,990 ERROR ql.Driver (SessionState.java:printError(400)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I am not sure what the actual problem is, but I ran into this while creating a normal table in Hive.
I started Hive with sudo access, and can now run queries as expected.
$ sudo bin/dse hive
What worked for me was to completely wipe the HiveMetaStore keyspace in Cassandra and recreate only the keyspace, using the NetworkTopologyStrategy replication strategy. I made sure to add the Analytics datacenter to the new keyspace as well, so it looked something like:
CREATE KEYSPACE HiveMetaStore WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'Analytics' : 2};
I then restarted DSE on my analytics nodes and they correctly created the MetaStore table within the HiveMetaStore keyspace and everything started working again!

Composite key in Cassandra with Pig

We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
I can query this table from cqlsh like this:
select * from data where seqnumber = 10 AND occurday = '2013-10-01';
This query works and returns the expected data.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27' USING CqlStorage();
gives
InvalidRequestException(why:seqnumber cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:39567)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1625)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1611)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:591)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:621)
Shouldn't these behave the same? Why is the version through Pig failing where the straight cqlsh command works?
Hadoop is using CqlPagingRecordReader to try to load your data. This is leading to queries that are not identical to what you have entered. The paging record reader is trying to obtain small slices of Cassandra data at a time to avoid timeouts.
This means that your query is executed as
SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
token("occurday","seqnumber") <= ? AND occurday='A Great Day'
AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
And this is why you are seeing your repeated key error. I'll submit a bug to the Cassandra Project.
Jira:
https://issues.apache.org/jira/browse/CASSANDRA-6151
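To see why the rewritten query trips the "restricted by more than one relation" check, the following Python sketch mimics how a token-range prefix gets combined with the user-supplied where_clause. It is a simplified illustration based on the query text shown in this answer, not the actual CqlPagingRecordReader code, and the function name is hypothetical:

```python
def paged_query(table, partition_keys, where_clause, page_size=1000):
    """Wrap a user where_clause in a token-range restriction.

    Simplified, hypothetical model of the paging query from the answer.
    """
    # token("k1","k2") over all partition key columns, quoted as in the answer.
    token = "token({})".format(",".join('"{}"'.format(k) for k in partition_keys))
    return (
        'SELECT * FROM "{}" WHERE {} > ? AND {} <= ? AND {} LIMIT {} ALLOW FILTERING'
        .format(table, token, token, where_clause, page_size)
    )


query = paged_query(
    "data",
    ["occurday", "seqnumber"],
    "occurday='A Great Day' AND seqnumber=1",
)
print(query)
# Each partition key column now appears inside token(...) and in an equality.
```

Because the token range already restricts occurday and seqnumber, adding an equality on either column restricts it a second time, which is what the InvalidRequestException reports.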
