Filtering by `list<double>` column's element value range - cassandra

I'd like to filter rows of the following table in Cassandra.
CREATE TABLE mids_test_db.defect_data (
wafer_id text,
defect_id text,
document_id text,
fields list<double>,
PRIMARY KEY (wafer_id, defect_id)
)
...
CREATE INDEX defect_data_fields_idx ON mids_test_db.defect_data (values(fields));
I first tried something like fields[0] > 0.5, but it failed:
cqlsh:mids_test_db> select fields from defect_data where wafer_id = 'MIDS_1_20170101_023000_30000_1548100671' and fields[0] > 0.5;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Indexes on list entries (fields[index] = value) are not currently supported."
After searching Google for a while, I get the impression that this kind of query can't easily be done in Cassandra. The data model is essentially a collection of field values, and querying defect data by those field values, as above, is important for my business.
What approach should I consider? Application-side filtering? Any hint or advice will be appreciated.

This isn't possible directly in Cassandra, but you have the following alternatives:
if you are running DataStax Enterprise, you can use DSE Search;
you can add an additional table to perform the lookup:
CREATE TABLE mids_test_db.defect_data_lookup (
wafer_id text,
defect_id text,
field double,
PRIMARY KEY (wafer_id, field, defect_id)
);
After that you should be able to do a range scan inside a partition to fetch at least the defect_id values, and then fetch all the field values with a second query.
Depending on your Cassandra version, you may be able to use a materialized view to maintain that lookup table for you.
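The two-step read described above can be sketched in plain Python, with dictionaries standing in for the two tables. This is only an illustration under stated assumptions: the sample rows, the helper names build_lookup and defects_with_field_above, and the choice to fill the lookup table from a single list position are all made up here and are not part of the answer.

```python
from bisect import bisect_right

# Illustrative rows only; the ids and values are made up.
# Stand-in for defect_data: (wafer_id, defect_id) -> fields
defect_data = {
    ("w1", "d1"): [0.2, 0.9],
    ("w1", "d2"): [0.7, 0.1],
    ("w1", "d3"): [0.6, 0.4],
}


def build_lookup(data, position):
    """Stand-in for defect_data_lookup, filled from one list position.

    Each partition (wafer_id) holds rows sorted by (field, defect_id),
    which mirrors PRIMARY KEY (wafer_id, field, defect_id).
    """
    lookup = {}
    for (wafer_id, defect_id), fields in data.items():
        lookup.setdefault(wafer_id, []).append((fields[position], defect_id))
    for rows in lookup.values():
        rows.sort()
    return lookup


def defects_with_field_above(lookup, data, wafer_id, threshold):
    """Return {defect_id: fields} for rows whose indexed field is > threshold."""
    rows = lookup.get(wafer_id, [])
    field_values = [field for field, _ in rows]
    # Step 1: range scan inside one partition (field > threshold).
    start = bisect_right(field_values, threshold)
    defect_ids = [defect_id for _, defect_id in rows[start:]]
    # Step 2: fetch the full rows from the main table by primary key.
    return {defect_id: data[(wafer_id, defect_id)] for defect_id in defect_ids}


lookup = build_lookup(defect_data, position=0)
result = defects_with_field_above(lookup, defect_data, "w1", 0.5)
print(result)
```

Because the rows inside a partition are kept sorted by the field value, the range condition becomes a single slice instead of a scan over every row, which is the property the lookup table is meant to provide.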

Related

Cassandra insert value disappear

I want to use the Cassandra database system to create tables. The original data is in the picture.
So I created this table and inserted the values:
Create table course(
Course_ID text PRIMARY KEY,
Course_Name text,
student_id text
);
However, when I try to select all the student IDs for one course (for example American History), using a query such as select * from course where Course_Name = 'Biology'; I get:
Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Then, when I printed the whole table, I found that rows sharing some duplicate value were missing. Is it because my table design is wrong? How can I change it so that I can select all the student IDs for one course?
Thanks!!
The issue is that your query on the course table does not use the primary key. Unlike in relational databases, tables in Cassandra are designed around the queries you are going to execute. In this case, you can include the course name in the composite key:
Create table course(
Course_ID text,
Course_Name text,
student_id text,
PRIMARY KEY (Course_Name, Course_ID)
);
There are already answers explaining the difference between the keys, like this one; you may also want to read this article from DataStax.
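As for the rows that seemed to disappear: a Cassandra INSERT overwrites any existing row with the same primary key, so rows that share a key collapse into one. The sketch below models that with a Python dictionary. The sample rows are made up (the original picture is not available), and the three-column key is a hypothetical variation used only to show the effect; whether student_id belongs in the key depends on the real data.

```python
def insert(table, key_columns, row):
    """Apply a row like a Cassandra INSERT: same primary key means overwrite."""
    key = tuple(row[column] for column in key_columns)
    table.setdefault(key, {}).update(row)


# Made-up sample rows: two students in Biology, one in American History.
rows = [
    {"Course_ID": "C1", "Course_Name": "Biology", "student_id": "S1"},
    {"Course_ID": "C1", "Course_Name": "Biology", "student_id": "S2"},
    {"Course_ID": "C2", "Course_Name": "American History", "student_id": "S1"},
]

original_key = {}     # PRIMARY KEY (Course_ID), as in the question
per_student_key = {}  # hypothetical PRIMARY KEY (Course_Name, Course_ID, student_id)
for row in rows:
    insert(original_key, ["Course_ID"], row)
    insert(per_student_key, ["Course_Name", "Course_ID", "student_id"], row)

print(len(original_key), len(per_student_key))

# All students of one course, read from a single partition (Course_Name).
biology_students = sorted(
    row["student_id"]
    for key, row in per_student_key.items()
    if key[0] == "Biology"
)
print(biology_students)
```

With the key from the question, the second Biology row replaces the first, which matches the symptom of values going missing after the inserts.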

Delete whole row based on one of clustering column value in cassandra

Schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id),name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records with the following command:
DELETE FROM mytable WHERE name = 'Jhon';
But it gave this error:
[Invalid query] message="Some partition key parts are missing: name"
While looking for the reason, I learned that a delete is not possible with clustering columns only.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried setting the TTL to 0 to delete the row, but a TTL can be set only for a particular column, not for the entire row.
What are feasible alternatives for performing this operation?
In Cassandra, you need to design your data model to support your query. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind of request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT ID from id_by_name WHERE name='John';
let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='John' and id=4;
You could try to use a materialized view (instead of maintaining id_by_name yourself), but materialized views are currently marked as unstable.
Now, there are still a few issues you need to address in your data model, in particular how you handle multiple users with the same name, etc.
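The select-then-delete sequence with id_by_name can be sketched as follows, using Python dictionaries as stand-ins for the two tables. The rows and the helper name delete_by_name are made up for illustration; in a real application each commented step would be a CQL statement sent through a driver.

```python
# Stand-ins for the two tables; the rows are made up for illustration.
mytable = {4: ["John"], 7: ["Ann"]}      # id -> names stored in that partition
id_by_name = {"John": [4], "Ann": [7]}   # name -> ids


def delete_by_name(name):
    """Delete every partition of mytable that id_by_name lists for this name."""
    # SELECT id FROM id_by_name WHERE name = ?
    ids = list(id_by_name.get(name, []))
    for row_id in ids:
        # DELETE FROM mytable WHERE id = ?  (removes the whole partition)
        mytable.pop(row_id, None)
        # DELETE FROM id_by_name WHERE name = ? AND id = ?
        id_by_name[name].remove(row_id)
    if name in id_by_name and not id_by_name[name]:
        del id_by_name[name]
    return ids


deleted_ids = delete_by_name("John")
print(deleted_ids, mytable, id_by_name)
```

The application is responsible for keeping both tables in step, so every write to mytable needs a matching write to id_by_name.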
You cannot delete a row without the complete primary key. Primary key decisions are about sharding and load balancing. Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer: although it is good, it complicates your solution. If you are thinking relationally and getting lost in Cassandra, I suggest using something that simplifies things and maps your thinking to relational views.

Filter on the partition and the clustering key with an additional criteria

I want to filter a table that has a partition key and a clustering key, with an additional criterion on a regular column. I got the following warning:
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable(
name text,
id uuid,
deleted boolean,
PRIMARY KEY((name),id)
);
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra you can't filter on a non-primary-key column unless you create an index on it.
From Cassandra 3.0 on, filtering on a non-primary-key column is allowed, but with unpredictable performance.
From Cassandra 3.0 on, if you provide the full primary key (as in your query), you can run the query with ALLOW FILTERING and ignore the warning.
Otherwise, filter on the client side, or remove the deleted field and create another table:
Instead of updating the field to deleted = true, move the row to another table, say mytable_deleted:
CREATE TABLE mytable_deleted (
name text,
id uuid,
PRIMARY KEY (name, id)
);
Now mytable holds only the non-deleted data, and mytable_deleted holds the deleted data.
or
Create an index on it:
Keep in mind that deleted is a low-cardinality column:
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read more: When not to use an index

Cassandra Contains query error

I am new to Cassandra and trying to figure out how to get a simple contains query working with Cassandra.
My table looks like this
CREATE TABLE events (
timekey text,
id timeuuid,
event_types list<text>,
PRIMARY KEY ((timekey), id)
)
My query:
cqlsh> select count(1) from events where event_types contains 'foo';
Bad Request: line 1:46 no viable alternative at input 'contains'
Any thoughts about the error?
Also, is it possible to query for multiple event_types in a single query? I could not see any way to do this with CONTAINS. I'm after something equivalent to this in regular SQL:
Relational SQL example:
select count(1) from events where event_types in ('foo', 'bar')
A couple of things. First of all, when I create your schema and insert a row, I get a different error message than you do:
aploetz#cqlsh:stackoverflow2> CREATE TABLE events (
... timekey text,
... id timeuuid,
... event_types list<text>,
... PRIMARY KEY ((timekey), id)
... );
aploetz#cqlsh:stackoverflow2> INSERT INTO events (timekey, id, event_types)
VALUES ('1', now(),['foo','bar']);
aploetz#cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted
columns support the provided operators: "
To get this to work, you will need to create a secondary index on your event_types collection. Of course secondary indexes on collections are a new feature as of Cassandra 2.1. By virtue of the fact that your error message is different, I'm going to guess that you would need to upgrade to 2.1.
I'm using 2.1.5 in my sandbox right now, so when I create an index on event_types this works:
aploetz#cqlsh:stackoverflow2> CREATE INDEX eventTypeIdx ON events(event_types);
aploetz#cqlsh:stackoverflow2> select count(1) from events where event_types contains 'foo';
count
-------
1
(1 rows)
Even though this may work, secondary indexes on large tables or in large clusters are known not to perform well. I would expect that secondary indexes on collections would perform even worse, so just take that as a warning.
Also Is it possible to query for multiple event_types in one single query?
There are ways to accomplish this, but I recommend against it for the aforementioned performance issues. I answered a similar question here, if you are interested: Cassandra CQL where clause with multiple collection values?

Why cassandra/cql restrict to use where clause on a column that not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
)
When I try to search with "select * from emp where first_name='John';"
cql shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue, and everywhere it says to add a secondary index on the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
Only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because that would involve scanning all the rows.
The data are internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value and the key of the inner map is a kind of concatenation of all clustering columns values and a name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
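The HashMap[X, SortedMap[Y, Z]] picture can be made concrete with a small Python model. It is a simplified sketch, not Cassandra's actual storage format: the class name Table, its methods, and the sample rows are all hypothetical, and the inner map is keyed by the clustering value alone rather than by clustering values combined with column names.

```python
from bisect import bisect_left, bisect_right


class Table:
    """Toy model: partition key -> list of (clustering key, row), kept sorted."""

    def __init__(self):
        self.partitions = {}

    def insert(self, pk, ck, row):
        rows = self.partitions.setdefault(pk, [])
        keys = [key for key, _ in rows]
        index = bisect_left(keys, ck)
        if index < len(rows) and rows[index][0] == ck:
            rows[index] = (ck, row)        # same primary key: overwrite
        else:
            rows.insert(index, (ck, row))  # keep clustering order

    def select(self, pk, ck_low=None, ck_high=None):
        """Hash lookup of the partition, then a slice of its sorted rows."""
        rows = self.partitions.get(pk, [])
        keys = [key for key, _ in rows]
        low = 0 if ck_low is None else bisect_left(keys, ck_low)
        high = len(rows) if ck_high is None else bisect_right(keys, ck_high)
        return [row for _, row in rows[low:high]]

    def scan(self, predicate):
        """What a WHERE on an arbitrary column needs: a visit to every row."""
        return [
            row
            for rows in self.partitions.values()
            for _, row in rows
            if predicate(row)
        ]


# emp: empid is the partition key, deptid the clustering column.
emp = Table()
emp.insert(1, 10, {"first_name": "John", "last_name": "Smith"})
emp.insert(1, 20, {"first_name": "John", "last_name": "Smith"})
emp.insert(2, 10, {"first_name": "Mary", "last_name": "Jones"})

by_key = emp.select(1, ck_low=20)
johns = emp.scan(lambda row: row["first_name"] == "John")
print(len(by_key), len(johns))
```

In this model select touches a single partition, while scan has to walk all of them, which is the cost the answer refers to when it says an arbitrary-column search would mean scanning every row.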
You may read about what is allowed and what is not here
Alternatively you can create an index in Cassandra, but that will hamper your write performance.

Resources