Filtering on Primary Key in Cassandra

I have the following table:
CREATE TABLE tab1 (id int PRIMARY KEY, price int, name text);
The following queries return errors:
SELECT name FROM tab1 WHERE id > 5;
SELECT name FROM tab1 WHERE id > 5 ALLOW FILTERING;
How can I fix it?

SELECT name FROM tab1 WHERE id > 5 ALLOW FILTERING; will not give an error on recent Cassandra versions, since you are using ALLOW FILTERING (older versions reject a range restriction on the partition key even then). If your queries require ALLOW FILTERING, you need to redesign your tables around those queries. ALLOW FILTERING is not an efficient way of querying your tables, especially in production. Please check here.
SELECT name FROM tab1 WHERE id > 5; will give you an error:
[Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
The reason is that Cassandra doesn't work the way a relational database does. The table structure doesn't allow you to run any query you want, so you model your tables according to your queries.
Please check here for WHERE clause details. As stated in the documentation, the partition key columns support only two operators: = and IN. In your case you are using a greater-than comparison, which causes the error.
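One way to model the table around the range query is sketched below. This is only an illustration under assumptions: the table name tab1_by_bucket and the bucket column are hypothetical, and how rows are assigned to a bucket depends on your data. The idea is that id becomes a clustering column, where range operators are allowed, while the partition key is still restricted with =.

```cql
-- Hypothetical redesign: `bucket` is an assumed grouping column, not part of the original table.
CREATE TABLE tab1_by_bucket (
    bucket int,
    id int,
    price int,
    name text,
    PRIMARY KEY ((bucket), id)   -- bucket = partition key, id = clustering column
);

-- The partition key is restricted with =, the clustering column with a range.
SELECT name FROM tab1_by_bucket WHERE bucket = 0 AND id > 5;
```

With this layout the query reads a slice of a single partition, so no ALLOW FILTERING is needed.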

Related

Cassandra query failed to exec - Want to know the reason

I'm working on creating a Scheduler service which requires a Cassandra table structure as below.
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler (
id timeuuid,
router_id text,
account_id text,
mode text,
triggered_by text,
retry_count smallint,
PRIMARY KEY ((triggered_by,retry_count),id)
)WITH CLUSTERING ORDER BY (id ASC);
When I query with the PK I get the error below. May I know the reason why?
select count(*) from spc_cmd_scheduler where triggered_by = 'ROUTER_ONBOARD' and retry_count < 3;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
I understand "ALLOW FILTERING" will solve my problem here, but I wanted to know what is wrong with the table structure.
What is the optimal way to design this table to suit my requirement?
Just to give some background on my requirement: I need to run a scheduler that scans this table, issues a command, and deletes the entry once it is successful. If the command fails, I need to retry up to 3 times.
So this table requires SELECT, UPDATE and DELETE operations.
In your case, the problem is that the retry_count column is part of the partition key, and only equality operators (= or IN) can be used on partition key columns. Inequality operators (<, >, etc.) are supported only on clustering columns, and all preceding clustering columns need to be specified.
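As a rough sketch of what that implies for this table, retry_count could be moved out of the partition key and made the first clustering column. This is one possible layout, not necessarily the best one for your workload; in particular, it assumes that a partition per triggered_by value stays reasonably small.

```cql
-- Sketch: retry_count is now a clustering column, so range operators are allowed on it.
CREATE TABLE IF NOT EXISTS spc_cmd_scheduler (
    id timeuuid,
    router_id text,
    account_id text,
    mode text,
    triggered_by text,
    retry_count smallint,
    PRIMARY KEY ((triggered_by), retry_count, id)
) WITH CLUSTERING ORDER BY (retry_count ASC, id ASC);

-- Equality on the partition key, range on the first clustering column.
SELECT count(*) FROM spc_cmd_scheduler
WHERE triggered_by = 'ROUTER_ONBOARD' AND retry_count < 3;
```

One trade-off to keep in mind: primary key columns cannot be changed with UPDATE, so incrementing retry_count in this layout means deleting the old row and inserting a new one.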

Filtering by `list<double>` column's element value range

I'd like to filter rows of the following table in Cassandra.
CREATE TABLE mids_test_db.defect_data (
wafer_id text,
defect_id text,
document_id text,
fields list<double>,
PRIMARY KEY (wafer_id, defect_id)
)
...
CREATE INDEX defect_data_fields_idx ON mids_test_db.defect_data (values(fields));
I first tried something like fields[0] > 0.5, but it failed.
cqlsh:mids_test_db> select fields from defect_data where wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' and fields[0] > 0.5;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Indexes on list entries (fields[index] = value) are not currently supported."
After searching Google for a while, I get the feeling that this kind of job cannot easily be done in Cassandra. The data model is something like a collection of field values. Mostly I want to query defect data by its field values, as above, which is quite important for my business.
What approach should I consider? Application-side filtering? Any hint or advice will be appreciated.
It's not possible to do this directly in Cassandra, but you have the following alternatives:
if your Cassandra is DataStax Enterprise, you can use DSE Search;
you can add an additional table to perform the lookup:
CREATE TABLE mids_test_db.defect_data_lookup (
wafer_id text,
defect_id text,
field double,
PRIMARY KEY (wafer_id, field, defect_id)
);
After that you should be able to do a range scan inside a partition to fetch at least the defect_id, and then fetch all field values via a second query.
Depending on your Cassandra version, you may be able to use a materialized view to maintain that lookup table for you.
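The two-step lookup described above might look roughly like this. The defect IDs 'D1' and 'D2' are made-up placeholders for whatever the first query returns. Note also that the lookup table as defined does not store the position within the list, so the range matches a value at any position; recording the position would need an extra column, which is an assumption beyond the table shown.

```cql
-- Step 1: range scan on the clustering column `field` inside one wafer partition.
SELECT defect_id FROM mids_test_db.defect_data_lookup
WHERE wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' AND field > 0.5;

-- Step 2: fetch the full field lists for the returned defects.
-- 'D1' and 'D2' are hypothetical IDs standing in for the results of step 1.
SELECT defect_id, fields FROM mids_test_db.defect_data
WHERE wafer_id = 'MIDS_1_20170101_023000_30000_1548100671'
  AND defect_id IN ('D1', 'D2');
```

The application has to keep both tables in sync on every write, since each element of fields becomes one row in the lookup table.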

Cassandra insert value disappear

I want to use the Cassandra database system to create tables. The original data is in the picture.
So I created this table and inserted the values:
Create table course(
Course_ID text PRIMARY KEY,
Course_Name text,
student_id text
);
However, when I want to select all the student IDs for one course, I get an error:
select * from course where Course_Name = 'Biology';
Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Then, when I print out the whole table, I find that the rows that share some duplicate value are missing... Is it because the way I designed the table is wrong? How can I change it so that I can select all the student IDs for one course?
Thanks!!
The issue is that your query on the table course is not using the primary key. Unlike relational databases, tables in Cassandra are designed around the queries you are going to execute. In this case, you can include the course name as part of the composite key:
Create table course(
Course_ID text,
Course_Name text,
student_id text,
PRIMARY KEY (Course_Name, Course_ID)
);
There are already answers explaining the difference between the keys, like this one. You may also want to read this article from DataStax.
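On the missing rows: in Cassandra an INSERT with a primary key that already exists overwrites the existing row instead of adding a second one, so every student of a course needs a distinct primary key. A minimal sketch of one way to get that, assuming the goal is "all students of one course", is shown below; the table name course_students and the sample values are hypothetical.

```cql
-- Hypothetical table: one row per (course, student) pair.
CREATE TABLE course_students (
    course_name text,
    student_id text,
    course_id text,
    PRIMARY KEY ((course_name), student_id)
);

-- Sample values are made up for illustration.
INSERT INTO course_students (course_name, student_id, course_id) VALUES ('Biology', 'S1', 'C1');
INSERT INTO course_students (course_name, student_id, course_id) VALUES ('Biology', 'S2', 'C1');

-- Both rows survive because their primary keys differ in student_id.
SELECT student_id FROM course_students WHERE course_name = 'Biology';
```

Here student_id is the clustering column, so two students in the same course no longer overwrite each other.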

Performance impact of Allow filtering on same partition query in cassandra

I have table like this.
CREATE TABLE posts (
topic text,
country text,
bookmarked text,
id uuid,
PRIMARY KEY (topic,id)
);
First query on single partition with allow filtering.
select * from posts where topic='cassandra' allow filtering;
Second query on single partition without allow filtering.
select * from posts where topic='cassandra';
My question is: what is the performance difference between the first query and the second? Will the first query (with ALLOW FILTERING) fetch results from all partitions before filtering, even though we have requested a single partition?
Thanks.
ALLOW FILTERING lets you run queries without specifying the partition key. But if you do specify one, only that specific partition is read.
In this specific example you should see no difference.
Ran both queries on my test table with tracing on, got single partition in both execution plans:
Executing single-partition query on table_name
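A check like the one above can be reproduced in cqlsh roughly as follows; the exact trace output depends on the Cassandra version, so none is shown here.

```cql
-- TRACING is a cqlsh command; it prints a trace after each statement.
TRACING ON;

SELECT * FROM posts WHERE topic = 'cassandra';
SELECT * FROM posts WHERE topic = 'cassandra' ALLOW FILTERING;

TRACING OFF;
```

Comparing the two traces shows whether both statements are executed as a single-partition read.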
You don't need to use ALLOW FILTERING when you are querying with a partition key. So for the two queries you mentioned there will be no performance difference.
For Cassandra version 3.0 and up, ALLOW FILTERING can be used to query on any field other than the partition key. For example, you can run a query like this:
SELECT * FROM posts WHERE country='Bangladesh' ALLOW FILTERING;
For Cassandra versions below 3.0, ALLOW FILTERING can be used only on primary key columns.
It is not wise to query using ALLOW FILTERING, though.
The only way Cassandra can execute this query is by retrieving all the rows from the table posts and then filtering out the ones that do not have the requested value for the country column.
So you should use ALLOW FILTERING at your own risk.
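If lookups by country are a regular access pattern, a common alternative is a second table keyed by country. The sketch below is an assumption about how that could look (the table name posts_by_country is hypothetical), and the application would have to write to both tables.

```cql
-- Hypothetical query table: same data as posts, partitioned by country.
CREATE TABLE posts_by_country (
    country text,
    id uuid,
    topic text,
    bookmarked text,
    PRIMARY KEY ((country), id)
);

-- Single-partition read, no ALLOW FILTERING required.
SELECT * FROM posts_by_country WHERE country = 'Bangladesh';
```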

Cassandra asks for ALLOW FILTERING even though column is clustering key

Very new to Cassandra so apologies if the question is simple.
I created a table:
create table ApiLog (
LogId uuid,
DateCreated timestamp,
ClientIpAddress varchar,
primary key (LogId, DateCreated));
This works fine:
select * from apilog
If I try to add a where clause with the DateCreated like this:
select * from apilog where datecreated <= '2016-07-14'
I get this:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
From other questions here on SO and from the tutorials on DataStax, it is my understanding that since the datecreated column is a clustering key, it can be used to filter data.
I also tried to create an index but I get the same message back. And I tried to remove the DateCreated from the primary key and have it only as an index and I still get the same back:
create index ApiLog_DateCreated on dotnetdemo.apilog (datecreated);
The partition key LogId determines on which node each partition will be stored. So if you don't specify the partition key, then Cassandra has to filter all the partitions of this table on all the nodes to find matching data. That's why you have to say ALLOW FILTERING, since that operation is very inefficient and is discouraged.
If you specify a specific LogId, then Cassandra can find the partition on a single node and efficiently do a range query by the clustering key.
So you need to plan your schema such that you can do your range queries within a single partition and not have to do a full table scan like you're trying to do.
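Both points can be sketched in CQL. The UUID literal and the apilog_by_day table with its day column are hypothetical; bucketing by day is just one assumed way to keep range queries inside a single partition.

```cql
-- With the partition key specified, a range on the clustering key is allowed.
-- The UUID below is a made-up example value.
SELECT * FROM apilog
WHERE logid = 123e4567-e89b-12d3-a456-426614174000
  AND datecreated <= '2016-07-14';

-- Hypothetical redesign: one partition per day, ordered by creation time.
CREATE TABLE apilog_by_day (
    day date,
    datecreated timestamp,
    logid uuid,
    clientipaddress varchar,
    PRIMARY KEY ((day), datecreated, logid)
);

-- Range query within a single day partition.
SELECT * FROM apilog_by_day
WHERE day = '2016-07-14' AND datecreated <= '2016-07-14 12:00:00+0000';
```

A query spanning several days would then be issued once per day partition and merged in the application.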
When your query is rejected by Cassandra because it needs filtering, you should resist the urge to just add ALLOW FILTERING to it. You should think about your data, your model and what you are trying to do. You always have multiple options.
You can change your data model, add an index, use another table or use ALLOW FILTERING.
You have to make the right choice for your specific use case.
Anyway, if you just want to make it work:
select * from dev."3" where "column" = '' limit 1000 ALLOW FILTERING;

Resources