Is it possible to delete data older than 'x' in Cassandra using only a timestamp PK field and w/o ALLOW FILTERING and the TTL option?

Title says all. I have a table timestampTEST
create table timestampTEST ( timestamp timestamp, test text, PRIMARY KEY(timestamp));
When trying to
select * from timestampTEST where timestamp > '2021-01-03' and timestamp < '2021-01-04';
I got error
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
The sample I saw at https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/refTimeUuidFunctions.html is this (though I assume it is just part of the CQL query):
SELECT * FROM myTable
WHERE t > maxTimeuuid('2013-01-01 00:05+0000')
AND t < minTimeuuid('2013-02-02 10:00+0000')
I know the above relates to timeuuid, but I have tried it as well and it yields the same error.

It's not possible to do this in CQL without ALLOW FILTERING. The primary reason is that in your table the primary key is the same as the partition key, so fulfilling your query requires scanning data on all servers. This happens because the partition key is not stored in order - its value is hashed and used to select the server where the row is stored. So CurrentTime-1sec may land on one server, CurrentTime-10sec on another, and so on.
Usually, for such queries people use external tools, such as DSBulk, or Spark with the Spark Cassandra Connector. You can refer to the following answers that I have already provided on that topic:
Data model in Cassandra and proper deletion Strategy
Delete records in Cassandra table based on time range
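The scattering effect described above can be illustrated with a small sketch (pure Python; MD5 is used here only as a stand-in for Cassandra's Murmur3 partitioner, and the four-node cluster is hypothetical):

```python
import hashlib
from datetime import datetime, timedelta

NUM_NODES = 4  # hypothetical cluster size

def owning_node(partition_key: str) -> int:
    """Map a partition key to a node, mimicking hash-based partitioning.
    MD5 stands in for Cassandra's Murmur3Partitioner; only the
    'hash decides placement' behavior matters for the illustration."""
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return token % NUM_NODES

base = datetime(2021, 1, 3, 12, 0, 0)
# Timestamps seconds apart hash to unrelated tokens, so a time-range
# query cannot be served from a single node's contiguous data.
for seconds_ago in (1, 10, 60, 3600):
    ts = (base - timedelta(seconds=seconds_ago)).isoformat()
    print(ts, "-> node", owning_node(ts))
```

This is why a range predicate on a hashed partition key forces a cluster-wide scan, and why bulk tools like DSBulk or Spark are the usual answer for time-based cleanup.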

Cassandra CQL issue: "Batch too large","code":8704

I am getting the below error in select query.
{"error":{"name":"ResponseError","info":"Represents an error message from the server","message":"Batch too large","code":8704,"coordinator":"10.29.96.106:9042"}}
Ahh, I get it; you're using Dev Center.
If the result has more than 1000 rows, it shows this error.
Yes, that's Dev Center preventing you from running queries that can hurt your cluster. Like this:
select * from user_request_by_country_by_processworkflow
WHERE created_on <= '2022-01-08T16:19:07+05:30' ALLOW FILTERING;
ALLOW FILTERING is a way to force Cassandra to read multiple partitions in one query, even though it is designed to warn you against doing that. If you really need to run a query like this, then you'll want to build a table with a PRIMARY KEY designed to specifically support that.
In this case, I'd recommend "bucketing" your table data by whichever time component keeps the partitions within a reasonable size. For example, if bucketing by day keeps the rows per partition below 50k, the primary key definition would look like this:
PRIMARY KEY (day,created_on)
WITH CLUSTERING ORDER BY (created_on DESC);
Then, a query that would work and be allowed would look like this:
SELECT * FROM user_request_by_country_by_processworkflow
WHERE day=20220108
AND created_on <= '2022-01-08T16:19:07+05:30';
In summary:
Don't run multi-partition queries.
Don't use ALLOW FILTERING.
Do build tables to match queries.
Do use time buckets to keep partitions from growing unbounded.
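With bucketing, writers and readers must agree on how a bucket is derived from a timestamp. A minimal sketch of the day-bucket computation (pure Python; the integer yyyymmdd format matches the example above, but day granularity is an assumption you would tune to your partition size):

```python
from datetime import datetime

def day_bucket(ts: str) -> int:
    """Derive the integer day bucket (yyyymmdd) used as the partition key."""
    dt = datetime.fromisoformat(ts)
    return int(dt.strftime("%Y%m%d"))

# Both the INSERT path and the SELECT path must compute the same bucket,
# otherwise reads will look in the wrong partition.
bucket = day_bucket("2022-01-08T16:19:07+05:30")
print(bucket)  # 20220108
```

A query spanning several days then becomes several single-partition queries, one per bucket, issued from the client.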

Cassandra CQL allow filtering

I created a table in a Cassandra database, but when I query it I get an allow filtering error:
CREATE TABLE device_check (
device_id int,
checked_at bigint,
is_power boolean,
is_locked boolean,
PRIMARY KEY ((device_id), checked_at)
);
When I make a query
SELECT * FROM device_check where checked_at > 1234432543
But it gives an allow filtering error. I tried removing the brackets from device_id, but it gives the same error. Even when I set only checked_at as the primary key, it still won't work with the > operator. With the = operator it works.
PRIMARY KEY in Cassandra consists of two types of keys:
Partition key
Clustering key
It is expressed as `PRIMARY KEY ((partition key), clustering keys)`
Cassandra is a distributed database where data can reside on any node, depending on the partition key. To find data fast, Cassandra asks users to supply the partition key, which identifies the node where the data resides, and queries that node. If you don't give the partition key in your query, Cassandra would have to search all the nodes, so it raises the ALLOW FILTERING error when you try to query without one.
As for > not being supported on the partition key, the answer is the same: a range search on the partition key would force Cassandra to scan all nodes, which is not the right way to use Cassandra.
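The rule can be made concrete with a small client-side check: a query routes to a single node only when every partition-key column is restricted by equality. This is a sketch (pure Python; the function and the restriction map are hypothetical illustrations, not a driver API):

```python
def is_partition_restricted(partition_keys, where_restrictions):
    """Return True if every partition-key column has an equality
    restriction - the condition that lets Cassandra route the query
    to a single replica set instead of scanning all nodes."""
    return all(where_restrictions.get(col) == "=" for col in partition_keys)

# device_check from the question:
# partition key (device_id), clustering key checked_at
pk = ["device_id"]

# Range on checked_at alone -> would require ALLOW FILTERING
print(is_partition_restricted(pk, {"checked_at": ">"}))
# Equality on device_id plus a range on the clustering key -> allowed
print(is_partition_restricted(pk, {"device_id": "=", "checked_at": ">"}))
```

This mirrors why `WHERE device_id = ? AND checked_at > ?` works while `WHERE checked_at > ?` alone does not.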

Is there any way to use LIKE in a NoSQL command on a non-primary key?

I am selecting from a Cassandra database using the LIKE operator on a non-primary-key column.
select * from "TABLE_NAME" where "Column_name" LIKE '%SpO%' ALLOW FILTERING;
Error from server: code=2200 [Invalid query] message="LIKE restriction is only
supported on properly indexed columns. parameter LIKE '%SpO%' is not valid."
Simply put, "yes," there is a way to query with LIKE on a non-primary-key component. You can do this with a SASI (SSTable Attached Secondary Index). Here is a quick example:
CREATE TABLE testLike (key TEXT PRIMARY KEY, value TEXT) ;
CREATE CUSTOM INDEX valueIdx ON testLike (value)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS={'mode':'CONTAINS'};
As your query requires matching a string within a column, not just a prefix or suffix, you'll want to pass the CONTAINS option on index creation.
After writing some data, your query works for me:
> SELECT * FROM testlike WHERE value LIKE '%SpO%';
key | value
-----+--------------
C | CSpOblahblah
D | DSpOblahblah
(2 rows)
WARNING!!!
This query is extremely inefficient and will probably time out in a large cluster, unless you also filter by a partition key in your WHERE clause. It's important to understand that while this functionality works similarly to a relational database, Cassandra is definitely not a relational database. It is simply not designed to handle queries which incur a large amount of network time polling multiple nodes for data.
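What CONTAINS-mode matching does logically is a substring test over stored values, which is easy to emulate and shows where the cost comes from: without an index local to each node, every value must be inspected. A sketch using the example rows above (pure Python; rows A and B are made-up non-matching entries):

```python
# The table's data as a simple key -> value map; C and D mirror the
# example output above, A and B are hypothetical non-matching rows.
rows = {
    "A": "Ablahblah",
    "B": "Bblahblah",
    "C": "CSpOblahblah",
    "D": "DSpOblahblah",
}

def like_contains(rows: dict, needle: str) -> dict:
    """Emulate LIKE '%needle%': a full scan over every stored value."""
    return {k: v for k, v in rows.items() if needle in v}

print(like_contains(rows, "SpO"))
```

On a real cluster this scan happens per node, which is why restricting by partition key first keeps the query cheap.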

Cassandra query collection while specifying a partition key

I've been reading about indexes in Cassandra but I'm a little confused when it comes to creating an index on a collection like a set, list or map.
Let's say I have the following table and an index on users, like this:
CREATE TABLE chatter.channels (
id text PRIMARY KEY,
users set<text>
);
CREATE INDEX channels_users_idx ON chatter.channels (values(users));
INSERT INTO chatter.channels (id, users) VALUES ('ch1', {'jeff', 'jenny'});
The docs, at least what I've found so far, say that this can have a huge performance hit because the indexes are created locally on each node. And all the examples given query the tables like below:
SELECT * FROM chatter.channels WHERE users CONTAINS 'jeff';
From my understanding this would have the performance hit because the partition key is not specified and all nodes must be queried. However, if I was to issue a query like below
SELECT * FROM chatter.channels WHERE id = 'ch1' AND users CONTAINS 'jeff';
(giving the partition key) then would I still have the performance hit?
How would I be able to check this for myself? In SQL I can run EXPLAIN and get some useful information. Is there something similar in Cassandra?
Cassandra provides a tracing capability that helps trace the progression of reads and writes of queries in Cassandra.
To view traces, open cqlsh on one of your Cassandra nodes and run the following commands:
cqlsh> tracing on;
Now tracing requests.
cqlsh> use [KEYSPACE];
Then run your query; the trace output lists each step, which node performed it, and the elapsed time, so you can see whether the query was served from a single partition or fanned out across the cluster.
I hope this helps in checking the performance of the query.

Query all and consistency

This is a question regarding the behavior of Cassandra for a select * query.
It's more for understanding; I know that normally I should not execute such a query.
Assuming I have 4 Nodes with RF=2.
Following table (column family):
create table test_storage (
id text,
created_on TIMESTAMP,
location int,
data text,
PRIMARY KEY(id)
);
I inserted 100 entries into the table.
Now I do a select * from test_storage via cqlsh. Running the query multiple times, I get different results, so not all entries. When changing the consistency to local_quorum I always get back the complete result. Why is this so?
I assumed, performance aside, that I would also get all entries with consistency one, since it must query the whole token range.
Second issue: when I add a secondary index, in this case on location, and do a query like select * from test_storage where location=1, I also get random results with consistency one, and always correct results when changing to consistency level local_quorum. I don't understand why this happens either.
When changing consistency to local_quorum I always get back the complete result. Why is this so?
Welcome to the eventual consistency world. To understand it, read my slides: http://www.slideshare.net/doanduyhai/cassandra-introduction-2016-60292046/31
I assumed, performance aside, that I would also get all entries with consistency one, since it must query the whole token range
Yes, Cassandra will query all token ranges because of the unrestricted SELECT *, but it will only request data from one replica out of 2 (RF=2)
and do a query like select * from test_storage where location=1 I also get random results with consistency one
Same answer as above: a native Cassandra secondary index is just a Cassandra table under the hood that stores the reverse index, so the same eventual consistency rules apply there too
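The replica arithmetic behind this can be sketched directly: with RF=2, CL=ONE contacts a single replica that may not yet have every row, while LOCAL_QUORUM contacts floor(RF/2)+1 = 2 replicas, i.e. both, so every acknowledged write is seen. (Pure Python; the formula is Cassandra's standard quorum definition.)

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM/LOCAL_QUORUM read."""
    return replication_factor // 2 + 1

for rf in (1, 2, 3, 5):
    print(f"RF={rf}: ONE reads 1 replica, QUORUM reads {quorum(rf)}")

# With RF=2, quorum(2) == 2: a quorum read touches both replicas, so a
# write that reached either replica is always included in the result.
```

This is why the complete result only appears at local_quorum in the question above: at ONE, the single replica chosen per token range may be the one that has not yet received some of the 100 rows.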