Cassandra CQL allow filtering - cassandra

I am creating a table in a Cassandra database, but I am getting an ALLOW FILTERING error:
CREATE TABLE device_check (
    device_id int,
    checked_at bigint,
    is_power boolean,
    is_locked boolean,
    PRIMARY KEY ((device_id), checked_at)
);
When I make a query
SELECT * FROM device_check where checked_at > 1234432543
But it is giving an ALLOW FILTERING error. I tried removing the brackets from device_id, but it gives the same error. Even when I tried setting only checked_at as the primary key, it still won't work with the > operator. With the = operator it works.

A PRIMARY KEY in Cassandra consists of two types of keys:
Partition key
Clustering Key
It is expressed as PRIMARY KEY ((partition key), clustering keys).
Cassandra is a distributed database where a row can live on any node, depending on its partition key. To find data fast, Cassandra expects queries to supply the partition key, so it can identify the node where the data resides and query only that node. If you don't give the partition key in your query, Cassandra would have to search every node, which is not the intended way to use it, so it refuses and asks you to add ALLOW FILTERING if you really want to query without the partition key.
As for > not being supported on the partition key, the answer remains the same: a range search on the partition key would force Cassandra to scan all the nodes to respond, which is not the right way to use Cassandra.
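For illustration, with the table above, where device_id is the partition key and checked_at is the clustering column, a query that pins the partition with = and applies the range to the clustering column runs without ALLOW FILTERING (the value 42 is just an example):
-- Equality on the partition key, range on the clustering column: single-partition read
SELECT * FROM device_check
WHERE device_id = 42
  AND checked_at > 1234432543;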

Related

Is it possible to delete data older than 'x' in Cassandra using only timestamp PK field and w/o ALLOW FILTERING and TTL option?

Title says all. I have a table timestampTEST
create table timestampTEST ( timestamp timestamp, test text, PRIMARY KEY(timestamp));
When trying to
select * from timestampTEST where timestamp > '2021-01-03' and timestamp < '2021-01-04';
I got error
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
What I saw here https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/refTimeUuidFunctions.html is this sample (but I assume it is just part of the CQL query):
SELECT * FROM myTable
WHERE t > maxTimeuuid('2013-01-01 00:05+0000')
AND t < minTimeuuid('2013-02-02 10:00+0000')
I know the above is related to timeuuid, but I have tried it as well and it yields the same error.
It's not possible to do in CQL without ALLOW FILTERING. The primary reason is that in your table the primary key is the same as the partition key, so to fulfill your query Cassandra would need to scan data on all servers. This happens because the partition key is not ordered: the value is hashed and used to select the server on which the row will be stored. So CurrentTime-1sec will be on one server, CurrentTime-10sec on another, etc.
Usually, for such queries, people use external tools such as DSBulk, or Spark with the Spark Cassandra Connector. You can refer to the following answers that I have already provided on that topic:
Data model in Cassandra and proper deletion Strategy
Delete records in Cassandra table based on time range
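One schema-level workaround, not part of the answer above but a common Cassandra pattern, is to add a coarse time bucket (for example one partition per day) as the partition key, so a time range becomes a handful of single-partition operations; the table and column names below are illustrative:
-- Hypothetical day-bucketed variant of the table
CREATE TABLE timestamptest_by_day (
    day date,
    ts timestamp,
    test text,
    PRIMARY KEY ((day), ts)
);
-- The range now stays inside a known partition
SELECT * FROM timestamptest_by_day
WHERE day = '2021-01-03' AND ts >= '2021-01-03' AND ts < '2021-01-04';
-- Range deletes on the clustering column work too (Cassandra 3.0 and later)
DELETE FROM timestamptest_by_day
WHERE day = '2021-01-03' AND ts < '2021-01-03 12:00+0000';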

Cassandra query collection while specifying a partition key

I've been reading about indexes in Cassandra but I'm a little confused when it comes to creating an index on a collection like a set, list or map.
Let's say I have the following table and an index on users:
CREATE TABLE chatter.channels (
    id text PRIMARY KEY,
    users set<text>
);
CREATE INDEX channels_users_idx ON chatter.channels (values(users));
INSERT INTO chatter.channels (id, users) VALUES ('ch1', {'jeff', 'jenny'});
The docs, at least what I've found so far, say that this can have a huge performance hit because the indexes are created locally on the nodes. And all the examples that are given query the table like below:
SELECT * FROM chatter.channels WHERE users CONTAINS 'jeff';
From my understanding this would have the performance hit because the partition key is not specified and all nodes must be queried. However, if I was to issue a query like below
SELECT * FROM chatter.channels WHERE id = 'ch1' AND users CONTAINS 'jeff';
(giving the partition key) then would I still have the performance hit?
How would I be able to check this for myself? In SQL I can run EXPLAIN and get some useful information. Is there something similar in Cassandra?
Cassandra provides a tracing capability; it helps to trace the progression of reads and writes of queries in Cassandra.
To view traces, open cqlsh on one of your Cassandra nodes and run the following commands:
cqlsh> tracing on;
cqlsh> use [KEYSPACE];
Tracing is now enabled for your requests: run your query and cqlsh prints a trace of each internal step with its timing, which is the closest Cassandra equivalent to EXPLAIN.
I hope this helps in checking the performance of the query.
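For example, to compare the two queries from the question you might run something like this; with tracing enabled, cqlsh prints after each result a trace of the internal steps, the nodes contacted and the elapsed time per step, which is how you can see whether the second query stays on a single partition:
cqlsh> TRACING ON;
cqlsh> SELECT * FROM chatter.channels WHERE users CONTAINS 'jeff';
cqlsh> SELECT * FROM chatter.channels WHERE id = 'ch1' AND users CONTAINS 'jeff';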

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have to following table in Cassandra:
CREATE TABLE time_data (
    id int,
    secondary_id int,
    timestamp timestamp,
    value bigint,
    PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this is querying a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go, if you know all secondary_id values (you could add a table to track them if necessary), is to do the job in your application: query every (id, secondary_id) pair and process the results afterwards, as sketched below. This has the disadvantage of being more complex, but the advantage that it can be done with async queries and in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
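A minimal sketch of that approach, assuming a hypothetical tracking table (the table name and the literal values are illustrative):
-- Hypothetical lookup table recording which secondary_ids exist for each id
CREATE TABLE time_data_secondary_ids (
    id int,
    secondary_id int,
    PRIMARY KEY ((id), secondary_id)
);
-- Step 1: fetch the secondary_ids for the id of interest (a single partition)
SELECT secondary_id FROM time_data_secondary_ids WHERE id = 1;
-- Step 2: in the application, issue one query per pair (ideally asynchronously);
-- each query hits exactly one partition of time_data
SELECT * FROM time_data WHERE id = 1 AND secondary_id = 17;
SELECT * FROM time_data WHERE id = 1 AND secondary_id = 42;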

How to understand primary key in Apache Cassandra?

I'm new to Apache Cassandra; I have installed Cassandra and use cqlsh on my laptop.
I created a table using:
create table userpageview( created_at timestamp, hit int, userid int, variantid int, primary key (created_at, hit, userid, variantid) );
and inserted several rows, but when I tried to select using a condition on each column (I mean one by one) it gives an error.
Maybe my data modelling is wrong; can anyone tell me how to do data modelling in Cassandra?
Thanks
You need to read about partition keys and clustering keys. Cassandra works very differently from relational databases, and the types of queries you can run are much more restricted.
Some information to get you started: here and here.
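To make this concrete for the table in the question: with primary key (created_at, hit, userid, variantid), created_at is the partition key and hit, userid and variantid are clustering columns, so a query has to restrict the partition key first (the literal values below are just examples):
-- Works: the partition key is fully specified
SELECT * FROM userpageview WHERE created_at = '2016-01-01 00:00:00+0000';
-- Works: clustering columns restricted in declaration order after the partition key
SELECT * FROM userpageview
WHERE created_at = '2016-01-01 00:00:00+0000' AND hit = 1 AND userid = 10;
-- Rejected without ALLOW FILTERING: no partition key, and a later clustering
-- column is restricted while the earlier ones are not
SELECT * FROM userpageview WHERE userid = 10;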

Cassandra: selecting first entry for each value of an indexed column

I have a table of events and would like to extract the first timestamp (column unixtime) for each user.
Is there a way to do this with a single Cassandra query?
The schema is the following:
CREATE TABLE events (
    id VARCHAR,
    unixtime bigint,
    u bigint,
    type VARCHAR,
    payload map<text, text>,
    PRIMARY KEY (id)
);
CREATE INDEX events_u
ON events (u);
CREATE INDEX events_unixtime
ON events (unixtime);
CREATE INDEX events_type
ON events (type);
According to your schema, each user will have a single timestamp. If you want one event per entry, consider:
PRIMARY KEY (id, unixtime).
Assuming that is your schema, the entries for a user will be stored in ascending unixtime order. Be careful though...if it's an unbounded event stream and users have lots of events, the partition for the id will grow and grow. It's recommended to keep partition sizes to tens or hundreds of megs. If you anticipate larger, you'll need to start some form of bucketing.
Now, on to your query. In a word, no. If you don't hit a partition (by specifying the partition key), your query becomes a cluster-wide operation. With little data it'll work, but with lots of data you'll get timeouts. If you do have the data in its current form, then I recommend you use Apache Spark with the Spark Cassandra Connector to do your query. An added benefit of the Spark connector is that if you have Cassandra nodes as Spark worker nodes, then, due to locality, you can efficiently hit a secondary index without specifying the partition key (which would normally cause a cluster-wide query with timeout issues, etc.). You could even use Spark to get the required data and store it into another Cassandra table for fast querying.
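A sketch of the PRIMARY KEY (id, unixtime) suggestion above, assuming, as the answer does, that id identifies the user (the table name events_by_id is hypothetical):
-- Hypothetical table: id as partition key, unixtime as clustering column
CREATE TABLE events_by_id (
    id VARCHAR,
    unixtime bigint,
    u bigint,
    type VARCHAR,
    payload map<text, text>,
    PRIMARY KEY (id, unixtime)
);
-- Rows within a partition are stored in ascending unixtime order,
-- so the earliest timestamp for one id is a cheap single-partition read:
SELECT unixtime FROM events_by_id WHERE id = 'some-id' LIMIT 1;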
