I am new to Cassandra and am trying to get a simple CONTAINS query working.
My table looks like this:
CREATE TABLE events (
timekey text,
id timeuuid,
event_types list<text>,
PRIMARY KEY ((timekey), id)
)
My query:
cqlsh> select count(1) from events where event_types contains 'foo';
**Bad Request: line 1:46 no viable alternative at input 'contains'**
Any thoughts about the error?
Also, is it possible to query for multiple event_types in a single query? I could not see any way to do this with CONTAINS. I am looking for something equivalent to the following in regular SQL.
Relational SQL example:
select count(1) from events where event_types in ('foo', 'bar')
A couple of things. First of all, when I create your schema and insert a row, I get a different error message than you do:
aploetz#cqlsh:stackoverflow2> CREATE TABLE events (
... timekey text,
... id timeuuid,
... event_types list<text>,
... PRIMARY KEY ((timekey), id)
... );
aploetz#cqlsh:stackoverflow2> INSERT INTO events (timekey, id, event_types)
VALUES ('1', now(),['foo','bar']);
aploetz#cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
To get this to work, you will need to create a secondary index on your event_types collection. Note that secondary indexes on collections are a new feature as of Cassandra 2.1. Since your error is a parse error at 'contains' rather than the missing-index error I get, I'm going to guess that you are on an older version and would need to upgrade to 2.1.
I'm using 2.1.5 in my sandbox right now, so when I create an index on event_types this works:
aploetz#cqlsh:stackoverflow2> CREATE INDEX eventTypeIdx ON events(event_types);
aploetz#cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
count
-------
1
(1 rows)
Even though this may work, secondary indexes on large tables or in large clusters are known not to perform well. I would expect that secondary indexes on collections would perform even worse, so just take that as a warning.
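To make the performance warning concrete, here is a rough Python sketch. It assumes (this is not shown in the answer) that each node keeps an index only for the rows it stores, so a CONTAINS query without a partition key has to ask every node. The Node class and count_contains function are hypothetical stand-ins, not Cassandra APIs.

```python
from collections import defaultdict


class Node:
    """Hypothetical stand-in for one node holding a node-local index."""

    def __init__(self):
        self.rows = {}                 # (timekey, id) -> event_types list
        self.index = defaultdict(set)  # event type -> primary keys on this node

    def insert(self, key, event_types):
        self.rows[key] = event_types
        for value in event_types:
            self.index[value].add(key)


def count_contains(nodes, value):
    """Return (matching row count, number of nodes contacted)."""
    total = 0
    contacted = 0
    for node in nodes:  # no partition key in the query, so no node can be skipped
        contacted += 1
        total += len(node.index.get(value, ()))
    return total, contacted


nodes = [Node() for _ in range(4)]
nodes[0].insert(("1", "a"), ["foo", "bar"])
nodes[2].insert(("2", "b"), ["bar"])
foo_result = count_contains(nodes, "foo")
```

In this toy model the single matching row still costs one request per node, which is the kind of fan-out that hurts as the cluster grows.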
Also Is it possible to query for multiple event_types in one single query?
There are ways to accomplish this, but I recommend against it for the aforementioned performance issues. I answered a similar question here, if you are interested: Cassandra CQL where clause with multiple collection values?
Related
I'm working on creating a Scheduler service which requires a Cassandra table structure as below.
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler (
id timeuuid,
router_id text,
account_id text,
mode text,
triggered_by text,
retry_count smallint,
PRIMARY KEY ((triggered_by,retry_count),id)
)WITH CLUSTERING ORDER BY (id ASC);
When I query with the primary key columns, I get the error below. May I know the reason why?
select count(*) from spc_cmd_scheduler where triggered_by = 'ROUTER_ONBOARD' and retry_count < 3;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
I understand "ALLOW FILTERING" will solve my problem here, but I wanted to know what is wrong with the table structure.
What is the optimal way to design this table so that it suits my requirement?
Just to give some background on my requirement: I need to run a scheduler that scans this table, issues a command, and deletes the entry once it's successful. If the command fails, I need to retry up to 3 times.
So this table requires SELECT, UPDATE and DELETE operations.
In your case, the problem is that the retry_count column is part of the partition key, and only equality operators (= or IN) can be used on partition key columns. Inequality operators (<, >, etc.) are supported only on clustering columns, and all preceding clustering columns need to be specified.
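As a small illustration of that restriction, the Python sketch below treats the table as a dictionary keyed by the full partition key (triggered_by, retry_count). It assumes retry_count only takes a few small non-negative values, so retry_count < 3 can be replaced by enumerating the keys, which corresponds to retry_count IN (0, 1, 2). The helper names and sample rows are made up for the example.

```python
def partition_keys_for_retry_below(triggered_by, limit):
    """List every (triggered_by, retry_count) partition key with retry_count < limit."""
    return [(triggered_by, retry_count) for retry_count in range(limit)]


def count_rows(table, keys):
    """Sum the rows of the named partitions; each key is one exact lookup."""
    return sum(len(table.get(key, [])) for key in keys)


# Toy table: full partition key -> row ids stored in that partition.
table = {
    ("ROUTER_ONBOARD", 0): ["id-1", "id-2"],
    ("ROUTER_ONBOARD", 2): ["id-3"],
    ("ROUTER_ONBOARD", 3): ["id-4"],
    ("MANUAL", 1): ["id-5"],
}

keys = partition_keys_for_retry_below("ROUTER_ONBOARD", 3)
pending = count_rows(table, keys)
```

A lookup by exact key works because the whole key decides where the partition lives; a range over part of the key gives no single place to look.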
I'd like to filter rows of following table in cassandra.
CREATE TABLE mids_test_db.defect_data (
wafer_id text,
defect_id text,
document_id text,
fields list<double>,
PRIMARY KEY (wafer_id, defect_id)
)
...
CREATE INDEX defect_data_fields_idx ON mids_test_db.defect_data (values(fields));
I first tried something like fields[0] > 0.5, but it failed.
cqlsh:mids_test_db> select fields from defect_data where wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' and fields[0] > 0.5;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Indexes on list entries (fields[index] = value) are not currently supported."
After searching Google for a while, I get the feeling that this kind of job cannot easily be done in Cassandra. The data model is something like a collection of field values. Mostly I want to query defect data by its field values as shown above, which is quite important for my business.
What approach should I consider? Application-side filtering? Any hint or advice will be appreciated.
It's not possible to do this directly with Cassandra, but you have the following alternatives:
if your Cassandra is DataStax Enterprise, then you can use DSE Search;
you can add an additional table to perform lookup:
CREATE TABLE mids_test_db.defect_data_lookup (
wafer_id text,
defect_id text,
field double,
PRIMARY KEY (wafer_id, field, defect_id)
);
After that, you should be able to do a range scan inside the partition to fetch at least the defect_id, and then fetch all field values via a second query.
Depending on your Cassandra version, you may be able to use a materialized view to maintain that lookup table for you.
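The range scan on the lookup table can be pictured with a short Python sketch. It assumes each partition (wafer_id) keeps its rows sorted by the clustering columns (field, defect_id), so rows above a threshold form one contiguous slice. The sample values and the function name are invented for illustration. Note that the lookup table as written does not record the position inside the list, so this matches any field value, not specifically fields[0].

```python
from bisect import bisect_right

# wafer_id -> rows kept sorted by clustering columns (field, defect_id)
lookup = {
    "wafer-A": sorted([(0.2, "d1"), (0.7, "d2"), (0.9, "d1"), (0.5, "d3")]),
}


def defects_with_field_above(lookup, wafer_id, threshold):
    """Return the defect ids having at least one field value > threshold."""
    rows = lookup.get(wafer_id, [])
    field_values = [field for field, _ in rows]
    start = bisect_right(field_values, threshold)  # first row strictly above threshold
    return sorted({defect_id for _, defect_id in rows[start:]})


matching = defects_with_field_above(lookup, "wafer-A", 0.5)
```

The ids returned here would then drive the second query against defect_data to fetch the full fields list.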
I want to use the Cassandra database system to create tables. The original data is in the picture.
So I create these tables and insert the value
Create table course(
Course_ID text PRIMARY KEY,
Course_Name text,
student_id text
);
However, when I want to select all the student IDs from the course American History:
select * from course where Course_Name = 'Biology';
Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Then, when I print the whole table, I find that rows sharing a duplicate value are missing. Is it because my table design is wrong? How can I change it and select all the student IDs from one course?
Thanks!!
The issue is that your query on the course table is not using the primary key. Unlike in relational databases, tables in Cassandra are designed around the queries you are going to execute. In this case, you can include the course name as part of the composite key:
Create table course(
Course_ID text,
Course_Name text,
student_id text,
PRIMARY KEY (Course_Name, Course_ID)
);
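The missing rows in the question are consistent with Cassandra treating an insert with an existing primary key as an overwrite. The Python sketch below models a table as a dictionary keyed by the primary key columns. The sample rows are hypothetical, since the original picture is not available. Adding student_id to the key is my own assumption about what is needed to keep several students per course; it is not part of the schema above.

```python
def upsert(table, key_columns, row):
    """Insert a row, silently replacing any row with the same primary key."""
    key = tuple(row[column] for column in key_columns)
    table[key] = row


# Hypothetical sample data: two students enrolled in the same course.
rows = [
    {"course_id": "C1", "course_name": "Biology", "student_id": "S1"},
    {"course_id": "C1", "course_name": "Biology", "student_id": "S2"},
    {"course_id": "C2", "course_name": "American History", "student_id": "S1"},
]


def load(key_columns):
    table = {}
    for row in rows:
        upsert(table, key_columns, row)
    return table


by_course_id = load(["course_id"])                     # original key
by_name_and_id = load(["course_name", "course_id"])    # key from the answer
with_student = load(["course_name", "course_id", "student_id"])  # assumed extension
```

In this model the first two keys keep only one Biology row, while the assumed three-column key keeps both students.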
There are already answers explaining the difference between the keys, like this one. You may also want to read this article from DataStax.
A tutorial on Cassandra that I followed mentioned that if I declare the following:
PRIMARY KEY(id, name), then id is the partition key and hence it is indexed. The name column is the clustering column and hence it is also indexed. This means I can do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc'; //this works!
I can also do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc' AND name = 'name_123'; // this works!
However, I cannot do the following query:
SELECT * FROM my_table WHERE name = 'name_123'; // this does not work
Why does the last statement not work if the clustering column is indexed? Why does the first query work and not the second?
The error I get for the last query is the following:
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Thanks in advance!
Being named the primary key does not mean there is an index on it in Cassandra. id is your partition key: it determines which node in the cluster is responsible for that id. The clustering column name defines the order of rows inside the partition.
Therefore SELECT * FROM my_table WHERE name = 'name_123'; would require all partitions to be scanned, which Cassandra refuses to do by default.
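A toy Python model of that layout may help. It assumes a table is a map from partition key to a map of clustering key to row; the data and function names are invented for the example. Each function also reports how many partitions it had to read.

```python
# partition key (id) -> clustering key (name) -> row
partitions = {
    "id_abc": {"name_123": "row 1", "name_456": "row 2"},
    "id_def": {"name_123": "row 3"},
}


def select_by_id(partitions, id_value, name=None):
    """WHERE id = ? [AND name = ?]: exactly one partition is read."""
    partition = partitions.get(id_value, {})
    if name is None:
        rows = list(partition.values())
    else:
        rows = [partition[name]] if name in partition else []
    return rows, 1


def select_by_name_only(partitions, name):
    """WHERE name = ?: every partition has to be visited."""
    rows, partitions_read = [], 0
    for partition in partitions.values():
        partitions_read += 1
        if name in partition:
            rows.append(partition[name])
    return rows, partitions_read


rows_for_name, scanned = select_by_name_only(partitions, "name_123")
```

With two partitions the scan is harmless, but the count of partitions read grows with the table, which is why the query is rejected unless ALLOW FILTERING is added.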
I'm using Cassandra 2.1 and have a model that roughly looks as follows:
CREATE TABLE events (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((client_id, bucket), timestamp)
);
CREATE INDEX events_ticket ON events(ticket_id);
As you can see, I've created a secondary index on ticket_id. This index works ok. events contains around 100 million rows, while only 5 million of these rows have around 50,000 distinct tickets. So a ticket - on average - has 100 events.
Querying the secondary index works without supplying the partition key, which is convenient in our situation, as the bucket column is sometimes hard to determine beforehand (i.e. you would need to know the date of the events; bucket is currently the date).
cqlsh> select * from events where ticket_id = 123;
client_id | bucket | timestamp | ... | ticket_id
-----------+--------+-----------+-----+-----------
(0 rows)
How do I solve the problem when all events of a ticket should be moved to another ticket? I.e. the following query won't work:
cqlsh> UPDATE events SET ticket_id = 321 WHERE ticket_id = 123;
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY ticket_id found in where clause"
Does this imply secondary indexes cannot be used in UPDATE queries?
What model should I use to support these changes?
First of all, UPDATE and INSERT operations are treated the same in Cassandra. They are colloquially known as "UPSERTs."
Does this imply secondary indexes cannot be used in UPDATE queries?
Correct. You cannot perform an UPSERT in Cassandra without specifying the complete PRIMARY KEY. Even UPSERTs with a partial PRIMARY KEY will not work. And (as you have discovered) UPSERTing by an indexed value does not work, either.
How do I solve the problem when all events of a ticket should be moved to another ticket?
Unfortunately, the only way to accomplish this is to query the keys of each row in events (with a particular ticket_id) and UPSERT ticket_id by those keys. The nice thing is that you don't have to DELETE them first, because ticket_id is not part of the PRIMARY KEY.
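In outline, that two-step approach looks like the Python sketch below. It models events as a dictionary keyed by the full primary key (client_id, bucket, timestamp); the linear scan stands in for the secondary index lookup, and all names and sample values are hypothetical.

```python
# (client_id, bucket, timestamp) -> non-key columns
events = {
    (2112, 20150422, "ts-1"): {"ticket_id": 123},
    (2112, 20150423, "ts-2"): {"ticket_id": 123},
    (7, 20150422, "ts-3"): {"ticket_id": 999},
}


def keys_for_ticket(events, ticket_id):
    """Step 1: find the primary keys of all rows carrying this ticket_id."""
    return [key for key, row in events.items() if row["ticket_id"] == ticket_id]


def move_ticket(events, old_ticket, new_ticket):
    """Step 2: rewrite ticket_id row by row, addressing each row by its full key."""
    keys = keys_for_ticket(events, old_ticket)
    for key in keys:
        events[key]["ticket_id"] = new_ticket
    return len(keys)


moved = move_ticket(events, 123, 321)
```

No delete is needed in this model either: the key of each row is unchanged, only the ticket_id value is overwritten.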
What model should I use to support these changes?
I think your best plan here would be to forgo a secondary index altogether, and create a query table to work alongside your events table:
CREATE TABLE eventsbyticketid (
client_id bigint,
bucket int,
timestamp timeuuid,
...
ticket_id bigint,
PRIMARY KEY ((ticket_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
This would allow you to query by ticket_id quickly (to obtain your client_id, bucket, and timestamp). This would give you the information you need to UPSERT the new ticket_id on your events table.
You could also then perform a DELETE by ticket_id (on the eventsbyticketid table). Cassandra does allow a DELETE operation with a partial PRIMARY KEY, as long as you have the full partition key (ticket_id). So removing old ticket_ids from the query table would be easy. And to ensure write atomicity, you could batch the UPSERTs together:
BEGIN BATCH
UPDATE events SET ticket_id = 321 WHERE client_id=2112 AND bucket='2015-04-22 14:53' AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
UPDATE eventsbyticketid SET client_id=2112, bucket='2015-04-22 14:53' WHERE ticket_id=321 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
APPLY BATCH;
Which is actually the same as performing:
BEGIN BATCH
INSERT INTO events (client_id,bucket,timestamp,ticket_id) VALUES(2112,'2015-04-22 14:53',4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
INSERT INTO eventsbyticketid (client_id,bucket,timestamp,ticket_id) VALUES(2112,'2015-04-22 14:53',4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
APPLY BATCH;
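The bookkeeping across the two tables can be sketched in Python as follows. This is only an in-memory model under stated assumptions: events is keyed by (client_id, bucket, timestamp), eventsbyticketid is keyed by ticket_id and then timestamp, and the integer bucket values and "ts-..." identifiers are placeholders. It does not reproduce the atomicity that a batch is meant to provide.

```python
def move_ticket(events, events_by_ticket, old_ticket, new_ticket):
    """Move every event of old_ticket to new_ticket in both tables."""
    # DELETE by full partition key (ticket_id) on the query table.
    moved = events_by_ticket.pop(old_ticket, {})
    target = events_by_ticket.setdefault(new_ticket, {})
    for timestamp, (client_id, bucket) in moved.items():
        # UPSERT on events, addressed by its complete primary key.
        events[(client_id, bucket, timestamp)]["ticket_id"] = new_ticket
        # Matching UPSERT on the query table under the new ticket_id.
        target[timestamp] = (client_id, bucket)
    return len(moved)


events = {
    (2112, 20150422, "ts-1"): {"ticket_id": 123},
    (2112, 20150423, "ts-2"): {"ticket_id": 123},
    (7, 20150422, "ts-3"): {"ticket_id": 999},
}
events_by_ticket = {
    123: {"ts-1": (2112, 20150422), "ts-2": (2112, 20150423)},
    999: {"ts-3": (7, 20150422)},
}

moved_count = move_ticket(events, events_by_ticket, 123, 321)
```

The query table supplies the keys without any index scan, and removing the old ticket is a single partition delete.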
Side note: timestamp is actually a (reserved word) data type in Cassandra. This makes it a pretty lousy name for a timeuuid column.
You can use the secondary index to query the events for the old ticket, and then use the primary key from those retrieved events to update the events.
I'm not sure why you need to do this manually, seems like something Cassandra should be able to do under the hood.