We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
I can query this table from cqlsh like this:
select * from data where seqnumber = 10 AND occurday = '2013-10-01';
This query works and returns the expected data.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=seqnumber%3D10%20AND%20occurday%3D%272013-10-01%27' USING CqlStorage();
gives
InvalidRequestException(why:seqnumber cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:39567)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1625)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1611)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:591)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:621)
Shouldn't these behave the same? Why is the version through Pig failing where the straight cqlsh command works?
Hadoop is using CqlPagingRecordReader to load your data, so the query that actually runs is not identical to the one you entered. The paging record reader fetches small slices of Cassandra data at a time to avoid timeouts.
This means that your query is executed as
SELECT * FROM "data" WHERE token("occurday","seqnumber") > ? AND
token("occurday","seqnumber") <= ? AND occurday='A Great Day'
AND seqnumber=1 LIMIT 1000 ALLOW FILTERING
And this is why you are seeing your repeated key error. I'll submit a bug to the Cassandra Project.
Jira:
https://issues.apache.org/jira/browse/CASSANDRA-6151
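To make the duplication easier to see, here is a minimal Python sketch that mimics the shape of the wrapped query shown above. The helper name build_paged_query and its parameters are hypothetical; this is not the actual CqlPagingRecordReader code, only an illustration of how a user-supplied where clause ends up next to the token() restrictions on the same partition key columns.

```python
def build_paged_query(table, partition_keys, where_clause, page_size=1000):
    """Wrap a user-supplied WHERE clause in a token-range restriction.

    Hypothetical helper: it only reproduces the shape of the query quoted
    above and is not taken from the Cassandra source.
    """
    # token("occurday","seqnumber") covers every partition key column
    token = "token(" + ",".join(f'"{k}"' for k in partition_keys) + ")"
    clauses = [f"{token} > ?", f"{token} <= ?"]
    if where_clause:
        # The user's clause is appended as-is, so the partition key
        # columns are now restricted a second time.
        clauses.append(where_clause)
    return (
        f'SELECT * FROM "{table}" WHERE '
        + " AND ".join(clauses)
        + f" LIMIT {page_size} ALLOW FILTERING"
    )


query = build_paged_query(
    "data",
    ["occurday", "seqnumber"],
    "seqnumber=10 AND occurday='2013-10-01'",
)
print(query)
```

In the printed query, occurday and seqnumber each appear inside both token() calls and again in the equality clause, which is the combination the server rejects.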
I am trying to fetch the Primary Key/Clustering Key names for a particular table/entity and implement the same query in my JPA interface (which extends CassandraRepository).
I am not sure whether something like:
@Query("DESCRIBE TABLE <table_name>")
public Object describeTbl();
would work here, since DESCRIBE isn't a valid CQL statement. And if it did work, what would the type of the returned Object be?
Suggestions?
One thing you could try would be to query the system_schema.columns table. It is keyed by keyspace_name and table_name, and might be what you're looking for here:
> SELECT column_name,kind FROM system_schema.columns
WHERE keyspace_name='spaceflight_data'
AND table_name='astronauts_by_group';
 column_name       | kind
-------------------+---------------
           flights |       regular
             group | partition_key
              name |    clustering
 spaceflight_hours |    clustering

(4 rows)
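Once you have those rows, splitting them into partition key and clustering columns is straightforward. Below is a minimal Python sketch under a few assumptions: the rows are represented as plain dicts, the function name primary_key_columns is made up, and the query also selects the position column of system_schema.columns, which gives the order of the components within the key. The position values in the sample are invented for illustration, since the output above does not show them.

```python
def primary_key_columns(rows):
    """Split system_schema.columns rows into partition and clustering columns.

    `rows` is assumed to be a list of dicts with the keys
    "column_name", "kind" and "position".
    """
    def names(kind):
        matching = [r for r in rows if r["kind"] == kind]
        # position gives the order of the component inside the key
        ordered = sorted(matching, key=lambda r: r["position"])
        return [r["column_name"] for r in ordered]

    return {
        "partition_key": names("partition_key"),
        "clustering": names("clustering"),
    }


# Sample rows modelled on the output above; positions are illustrative only.
rows = [
    {"column_name": "flights", "kind": "regular", "position": -1},
    {"column_name": "group", "kind": "partition_key", "position": 0},
    {"column_name": "name", "kind": "clustering", "position": 0},
    {"column_name": "spaceflight_hours", "kind": "clustering", "position": 1},
]
keys = primary_key_columns(rows)
print(keys)
```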
DESCRIBE TABLE is supported only in Cassandra 4, which includes the fix for CASSANDRA-14825. But it may not help you much, because it just returns a text string representing the CREATE TABLE statement, and you'd need to parse that text to extract the primary key definition. That's doable, but it could be tricky, depending on the structure of the primary key.
Alternatively, you can obtain the underlying Session object and, via its getMetadata function, get access to the actual metadata object, which exposes information about keyspaces and tables, including their schema.
My table looks like this:
CREATE TABLE prod_cust (
pid bigint,
cid bigint,
effective_date date,
expiry_date date,
PRIMARY KEY ((pid, cid))
);
The query below fails with a "no viable alternative at input 'OR'" error:
SELECT * FROM prod_cust
where
pid=101 and cid=201
OR
pid=102 and cid=202;
Does Cassandra not support the OR operator? If not, is there an alternative way to achieve my result?
CQL does not support the OR operator. Sometimes you can get around that by using IN. But even IN won't let you do what you're attempting.
I see two options:
Submit each side of your OR as individual queries.
Restructure the table to better suit what you're trying to do. Doing a "port-over" from an RDBMS to Cassandra almost never works as intended.
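For the first option, a minimal Python sketch could look like the following. The function name fetch_any and the execute callable are hypothetical: execute stands for whatever runs a parameterized statement in your client (with the DataStax Python driver, session.execute should fit this shape, but treat that as an assumption to verify). The fake_execute function and its in-memory data exist only so the sketch runs without a cluster.

```python
def fetch_any(execute, key_pairs):
    """Run one single-partition query per (pid, cid) pair and merge the rows.

    `execute` is a hypothetical callable taking (cql, params) and
    returning an iterable of rows.
    """
    cql = "SELECT * FROM prod_cust WHERE pid = %s AND cid = %s"
    rows = []
    for pid, cid in key_pairs:
        # Each side of the OR becomes its own query on a full partition key.
        rows.extend(execute(cql, (pid, cid)))
    return rows


# Stand-in for a real session, for illustration only.
fake_table = {
    (101, 201): [{"pid": 101, "cid": 201}],
    (102, 202): [{"pid": 102, "cid": 202}],
}


def fake_execute(cql, params):
    return fake_table.get(params, [])


rows = fetch_any(fake_execute, [(101, 201), (102, 202)])
print(rows)
```

Because each query names a complete partition key, every lookup goes straight to the replicas that own that partition.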
I basically have the same problem as the one described in "Composite key in Cassandra with Pig". The only difference is that I try to query for only a part of the composite key within the where_clause of Pig.
The data structure is similar to the one in that issue; I'll copy some code/context to minimize the need to read it.
We have a CQL table that looks something like this:
CREATE table data (
occurday text,
seqnumber int,
occurtimems bigint,
unique bigint,
fields map<text, text>,
primary key ((occurday, seqnumber), occurtimems, unique)
)
Instead of querying for both the seqnumber and the occurday (as in the previously mentioned issue), I try to query on only one of the keys.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=occurday%3D%272013-10-01%27' USING CqlStorage();
gives
java.lang.RuntimeException
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: InvalidRequestException(why:occurday cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result$prepare_cql3_query_resultStandardScheme.read(Cassandra.java:51017)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result$prepare_cql3_query_resultStandardScheme.read(Cassandra.java:50994)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:50933)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1756)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1742)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:605)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:635)
... 7 more
Basically my question is, what am I doing wrong or what don't I understand?
As I understand from "CqlPagingRecorderReader Used when Partition Key Is Explicitly Stated", I should be able to query with just part of the partition key?
Also, while reading "Add CqlRecordReader to take advantage of native CQL pagination", I get the impression this should be possible, but I am swimming around with (in my opinion) no clear direction on how to accomplish this.
Any help is very very welcome at this point.
Regards,
Lennart Weijl
PS.
I am running on Cassandra 2.0.9 with Pig 0.13.0
According to CASSANDRA-6311, I believe you need to apply the 6331-v2-2.0-branch.txt patch, recompile pig, and then update your LOAD statement to:
data = LOAD 'cql://ks/data?where_clause=occurday%3D%272013-10-01%27' USING CqlInputFormat();
The key change is USING CqlInputFormat(), which triggers the use of the new CqlRecordReader that was released in Cassandra 2.0.7.
Edit: Note that the exception is thrown from CqlPagingRecordReader which means you're still using the old record reader.
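As a side note, the where_clause value in the LOAD statement is just the percent-encoded form of occurday='2013-10-01'. A minimal Python sketch for producing it, using the standard library's urllib.parse.quote (the helper name encode_where_clause is made up for illustration):

```python
from urllib.parse import quote


def encode_where_clause(where_clause):
    """Percent-encode a CQL where clause for use in a cql:// LOAD URI."""
    # safe="" makes sure "=" and "'" are encoded too
    return quote(where_clause, safe="")


encoded = encode_where_clause("occurday='2013-10-01'")
load_uri = "cql://ks/data?where_clause=" + encoded
print(load_uri)
```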
I'm trying to create a schema that will enable me to access rows with only part of the row key.
For example, the key is of the form user_id:machine_os:machine_arch
An example of a row key: 12242:"windows2000":"x86"
From the documentation I could not understand whether this will enable me to query all rows that have userid=12242, or to query all rows that have "windows2000".
Is there any feasible way to achieve this?
Thanks,
Yadid
Alright, here is what is happening: based on your schema, you are effectively creating a column family with a composite primary key or a composite rowkey. What this means is, you will need to restrict each component of the composite key except the last one with a strict equality relation. The last component of the composite key can use inequality and the IN relation, but not the 1st and 2nd components.
Additionally, you must specify all three parts if you want to utilize any kind of filtering. This is necessary because without all parts of the partition key, the coordinator node will have no idea on which node in the cluster the data exists (remember, Cassandra uses the partition key to determine replicas and data placement).
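To illustrate why the complete partition key is needed, here is a toy Python sketch. Everything in it is an assumption made for illustration: the node names are invented, and an MD5 hash taken modulo the node count stands in for the real partitioner and token ring. The only point is that the token is computed from all key components together, so it cannot be derived from user_id alone.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def owner(partition_key):
    """Map a full partition key (a tuple of all components) to a node.

    MD5 modulo the node count is only a stand-in for the real
    partitioner; it is not how Cassandra assigns tokens.
    """
    raw = ":".join(str(part) for part in partition_key).encode("utf-8")
    token = int.from_bytes(hashlib.md5(raw).digest()[:8], "big")
    return NODES[token % len(NODES)]


full_key = (100012, "windows2000", "x86")
print(owner(full_key))
```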
Effectively, this means you can't do any of these:
select * from datacf where user_id = 100012; -- missing 2nd and 3rd key components
select * from datacf where user_id = 100012 and machine_arch = 'x86'; -- missing machine_os
select * from datacf where machine_arch = 'x86'; -- you have to specify the 1st
select * from datacf where user_id = 100012 and machine_arch in ('x86', 'x64'); -- nope, machine_os is still missing
However, you will be able to run queries like this:
select * from datacf where user_id = 100012 and machine_arch = 'x86'
and machine_os = 'windows2000'; -- yes! all 3 parts are there
select * from datacf where user_id = 100012 and machine_os = 'windows2000'
and machine_arch in ('x86', 'x64'); -- the last part of the key can use IN as well as plain equality
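The rule behind these examples can be written down as a small check. This is a minimal Python sketch of the rule as described in this answer, not of how Cassandra validates queries internally; the function name is_queryable and the dict representation of the restrictions are made up for illustration.

```python
KEY_COMPONENTS = ["user_id", "machine_os", "machine_arch"]


def is_queryable(restrictions):
    """Check a WHERE clause against the rule described above.

    `restrictions` maps a column name to the relation used on it,
    e.g. {"user_id": "=", "machine_arch": "IN"}.
    """
    *leading, last = KEY_COMPONENTS
    # Every component except the last needs a strict equality.
    if any(restrictions.get(col) != "=" for col in leading):
        return False
    # The last component must be present and may use "=" or IN.
    return restrictions.get(last) in ("=", "IN")


print(is_queryable({"user_id": "="}))
print(is_queryable({"user_id": "=", "machine_os": "=", "machine_arch": "IN"}))
```

The first call corresponds to the first rejected query above, and the second to the last accepted one.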
To answer your initial question: with your existing data model, you will be able to query neither the rows with userid = 12242 nor all rows that have "windows2000" as the machine_os.
If you can tell me exactly what kind of query you will be running, I can probably help in trying to design the table accordingly. Cassandra data models usually work better when looked at from the data retrieval perspective. Long story short: use only user_id as your primary key and use secondary indexes on the other columns you want to query on.
I am very new to Cassandra and have not yet done my part in reading much about the architecture. I have a simple question that I have not been able to find an answer to.
This is some sample data from running list abcColumnFamily:
RowKey:Message_1
=> (column=word, value=Message_1, timestamp=1373976339934001)
RowKey:Message_2
=> (column=word, value=Message_2, timestamp=1373976339934001)
How can I search for the row key with a value of, say, Message_1?
In the SQL world: Select * from Table where Rowkey = 'Message_1' (= or LIKE). I simply want to search on the full string.
My intention is just to check whether a particular piece of data I am interested in exists as a row key or not.
For CQL try:
select * from abcColumnFamily where KEY = 'Message_1';
If you want to query that data using the CLI, try the following:
assume abcColumnFamily keys as utf8;
get abcColumnFamily['Message_1'];