IN query on secondary index in cassandra when partition key is specified - cassandra

I'm working with a system that uses a secondary index in cassandra, along with a composite primary key, e.g.
CREATE TABLE table (
a bigint,
b bigint,
c bigint,
PRIMARY KEY (a, b, c)
) WITH CLUSTERING ORDER BY (b ASC, c ASC);
CREATE INDEX secondary_index ON table (c);
One of the operations in the application using the table is to fetch a number of rows (typically tens) specifying the partition key and the secondary index key. Currently, it performs one query for each (partition key, secondary key) pair, in parallel, which works fine, e.g.:
select * from table where a = ? and c = ?;
However, I've noticed that the system's workload is such that, most of the time, there is significant overlap in the partition keys across the requested rows; sometimes more than half of them share the same partition key. So I thought it might be more efficient to perform one query per partition key, with an IN clause on the secondary key, reducing the overall number of queries to single digits in most cases and reducing read query overhead on the cluster.
However, at least executed from cqlsh, this does not seem to be allowed:
select * from table where a = ? and c in (...);
InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "c" cannot be restricted as preceding column "b" is not restricted"
Is this just not allowed, and I'll have to continue making individual queries? Is there some reason it wouldn't actually be more efficient? Or is this just a limitation of CQL, and IN queries cannot use the secondary index? Perhaps there is an issue because the secondary index key is also in the primary key, and Cassandra attempts to use that instead of the secondary index?
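The grouping step behind the "one query per partition key" idea can be sketched client-side in Python. This is only an illustration of the batching logic, not a statement about what CQL accepts; `group_by_partition` and the sample pairs are hypothetical names and data introduced for the example.

```python
from collections import defaultdict


def group_by_partition(pairs):
    """Group (a, c) pairs so each partition key a maps to its distinct c values.

    Hypothetical helper: `a` is the partition key, `c` the indexed column.
    Insertion order is preserved for both keys and values.
    """
    grouped = defaultdict(list)
    for a, c in pairs:
        if c not in grouped[a]:
            grouped[a].append(c)
    return dict(grouped)


# Made-up request: five rows spread over only two partitions.
requested = [(1, 10), (1, 11), (2, 10), (1, 12), (2, 13)]
by_partition = group_by_partition(requested)

# One entry per partition key, i.e. one candidate query per entry.
for a, c_values in by_partition.items():
    print(a, c_values)
```

With this grouping, five individual lookups collapse into two per-partition batches, which is the reduction the question is hoping for.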

You are not allowed to execute
select * from table where a = ? and c = ?;
This is because Cassandra would have to scan the whole partition 'a' just to find all the rows where c equals your value. Without a value for b, Cassandra cannot pinpoint the row directly.
This page has good explanations for most query patterns:
https://www.datastax.com/blog/deep-look-cql-where-clause

Related

Why Secondary Index ( = ?) and Clustering Columns (order by) CANNOT be used together for CQL Query?

EDIT: a related jira ticket
A query in pattern select * from <table> where <partition_keys> = ? and <secondary_index_column> = ? order by <first_clustering_column> desc does not work, with error msg:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY with 2ndary indexes is not supported."
Given the structure of the index table, the above query includes the partition key and the first two clustering columns of the index table. Also note that without the ORDER BY clause, the result is sorted by the clustering columns according to CLUSTERING ORDER.
Is there any way to make the query work? If not, why?
Data in Cassandra is naturally stored based on the sort order of Clustering Columns.
A secondary index in Cassandra is very different from an index in a relational database. It is local to each node, which means that its contents aren't known to the other nodes of the cluster, so sorting by this index is not possible. Also, within a node, the secondary index holds just pointers to the corresponding partition keys.
If you need sorting to be performed by Cassandra, have them as clustering columns. Otherwise you can sort them in code, after you retrieve the results.
Also, secondary indexes aren't ideal for Cassandra, and a better model is definitely to not have them in the first place, to save some headaches in the future.
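The "sort them in code" option can be sketched in Python as follows. The rows are made-up stand-ins for what a driver might return from the secondary index query, represented here as plain dictionaries; the column names a, b, c are assumptions for the example.

```python
from operator import itemgetter

# Hypothetical result rows for: WHERE a = 1 AND c = 7
rows = [
    {"a": 1, "b": 3, "c": 7},
    {"a": 1, "b": 1, "c": 7},
    {"a": 1, "b": 2, "c": 7},
]

# Sort by the first clustering column, descending, on the client side.
sorted_desc = sorted(rows, key=itemgetter("b"), reverse=True)

for row in sorted_desc:
    print(row)
```

This works well when the result set for a single partition is small enough to hold in memory, which is typically the case for a query restricted to one partition key.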

Are all values in a Primary Key Indexed?

Following a Tutorial on Cassandra, it was mentioned that if I do the following:
PRIMARY KEY(id, name) that id is the partition key and hence it is indexed. The name is the clustering column and hence it is also indexed. This means I can do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc'; //this works!
I can also do a query such as:
SELECT * FROM my_table WHERE id = 'id_abc' AND name = 'name_123'; // this works!
However, I cannot do the following query:
SELECT * FROM my_table WHERE name = 'name_123'; // this does not work
Why does the last statement not work if the clustering column is indexed? Why do the first two queries work but not the last?
The error I get for the last query is the following:
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Thanks in advance!
Just because it is named a primary key does not mean there is an index on it in Cassandra. id is your partition key: it defines which node in Cassandra is responsible for your id. The clustering column name defines the order inside the partition.
Therefore SELECT * FROM my_table WHERE name = 'name_123'; // this does not work would require all partitions to be scanned, which Cassandra refuses by default.
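A toy in-memory model may help show why the three queries behave differently. This is a simplified sketch, not how Cassandra is implemented: a dictionary stands in for partitions, each holding rows sorted by the clustering column. All function names and data are hypothetical.

```python
from bisect import bisect_left

# partition key (id) -> rows sorted by clustering column (name)
table = {
    "id_abc": [("name_001", "x"), ("name_123", "y")],
    "id_def": [("name_123", "z")],
}


def by_id(table, id_):
    """WHERE id = ?: go straight to one partition."""
    return table.get(id_, [])


def by_id_and_name(table, id_, name):
    """WHERE id = ? AND name = ?: one partition, then a search in sorted rows."""
    rows = table.get(id_, [])
    names = [n for n, _ in rows]
    i = bisect_left(names, name)
    return [rows[i]] if i < len(rows) and names[i] == name else []


def by_name_only(table, name):
    """WHERE name = ?: nothing identifies the partition, so every one is visited."""
    return [(pk, row) for pk, rows in table.items() for row in rows if row[0] == name]
```

The first two lookups touch a single partition, while the last one has to walk all of them, which is the full scan that the ALLOW FILTERING error message warns about.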

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of which has a business entity id, a timestamp and a value. I need to get only the latest value, in terms of the in-event timestamp, for each business key, but events may arrive out of order.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC );
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require issuing such a query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using version 3.6 or later, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017), which you can set to 1. It won't auto-complete in cqlsh until 3.10, with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
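To make the intended result concrete, here is a small Python sketch of "latest row per id" computed on the client from unordered events. It illustrates the expected outcome only and is not a replacement for the query above; `latest_per_id` and the sample events are hypothetical.

```python
from datetime import datetime

# Made-up events: (id, in-event timestamp, value), arriving out of order.
events = [
    ("id-1", datetime(2020, 1, 1), "old"),
    ("id-2", datetime(2020, 1, 3), "only"),
    ("id-1", datetime(2020, 1, 2), "new"),
]


def latest_per_id(events):
    """Keep, for each id, the (timestamp, value) pair with the greatest timestamp."""
    latest = {}
    for id_, ts, value in events:
        if id_ not in latest or ts > latest[id_][0]:
            latest[id_] = (ts, value)
    return latest


result = latest_per_id(events)
for id_, (ts, value) in result.items():
    print(id_, ts.isoformat(), value)
```

With time as a clustering column in DESC order, the first row of each partition corresponds to the entry this function keeps for each id.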
In a word: no.
The partitioning key is why Cassandra can work with essentially any amount of data: it decides where to put and look for data using the hash of the partitioning key. That is why CQL SELECTs always need an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for every partition of the data, then perform a complex operation on each of them. Relational databases allow this; Cassandra does not. All it allows are full table scans (SELECT * from table1) or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

Order by with Cassandra No Sql Db

I'm starting to use Cassandra, but I'm running into some problems with "ordering" or "selecting".
CREATE TABLE functions (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_function, sort, id_subfunction)
);
This is my table.
If I execute this query
SELECT * FROM functions WHERE id_subfunction = 0 ORDER BY sort;
this is what I get.
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
What am I doing wrong?
Thanks
PRIMARY KEY (id_function, sort, id_subfunction)
In Cassandra CQL the columns in a compound PRIMARY KEY are either partitioning keys or clustering keys. In your case, id_function (the first key listed) is the partitioning key. This is the key value that is hashed so that your data for that key can be evenly distributed on your cluster.
The remaining columns (sort and id_subfunction) are known as clustering columns, which determine the sort order of your data within a partition. This essentially means that your data will only be sorted by your clustering key(s) when a partitioning key is first designated in your WHERE clause.
You have two options:
1) Query this table by id_function instead:
SELECT * FROM functions WHERE id_function = 0 ORDER BY sort;
This will technically work, although I'm guessing that it won't give you the results that you are looking for.
2) The better option is to create a "query table." This is a table designed specifically to handle your query by id_subfunction. It differs from the original functions table only in that the PRIMARY KEY is defined with id_subfunction as the partitioning key:
CREATE TABLE functionsbysubfunction (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_subfunction, sort, id_function)
);
This query table will allow this query to function as expected:
SELECT * FROM functionsbysubfunction WHERE id_subfunction = 0;
And you shouldn't need to specify ORDER BY, unless you want to choose between ASCending and DESCending order.
Remember with Cassandra, it is important to design your data model according to how you want to query your data. And that may not necessarily be the way that it originally makes sense to store it.

Is Cassandra secondary index optimized if the partition key specified?

For secondary index queries where the partition key is specified in the WHERE clause, does the secondary index lookup hit all cluster nodes, or just the node that owns the specified partition key?
If the latter is correct, then a secondary index will also be a good fit for high-cardinality fields (only for queries that specify the partition key).
EDIT: For example, for the following feed schema, query of a specific feed (feed_id specified) to retrieve existing or deleted feed items should be very efficient:
CREATE TABLE my_feed (
feed_id int,
item_id timeuuid,
is_deleted boolean,
data text,
PRIMARY KEY (feed_id, item_id)
) WITH CLUSTERING ORDER BY (item_id DESC);
CREATE INDEX my_feed_is_deleted_idx ON my_feed (is_deleted);
==> SELECT * FROM my_feed WHERE feed_id=1 AND is_deleted=false; --efficient?
If you hit a partition key first, then it won't be a cluster wide operation. Only the target partition will be hit. If you have wide rows with many rows in a partition, a secondary index will be an efficient way to filter them down once a partition is hit.
When and when not to use a secondary index and why is covered here: https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useWhenIndex.html

Resources