How do range queries work for clustering keys in Cassandra?

According to the official doc:
Clustering columns order data within a partition. When a table has multiple clustering columns the data is stored in nested sort order.
Suppose we have a simple time-series table:
CREATE TABLE alerts_by_year(
year int,
ts timestamp,
alert text,
PRIMARY KEY ((year), ts)
);
A simple query that gets events for some range:
SELECT * FROM alerts_by_year
WHERE year=2022
AND ts >'2022-06-24 03:11:00'
AND ts <'2022-06-24 04:11:00'
What is the algorithmic complexity of finding this range through the "ts" clustering key?
Is it constant time or O(n) time?
Does it depend on the type of storage used: memtable or sstable?
How does it work then? Are we simply iterating through "ts" clustering keys until we find the required range?

Clustering columns are stored in sorted order. If you don't explicitly specify the order when you create a table, the clustering columns will be sorted in ascending order.
In your case, the following table option is automatically added to your table's definition:
WITH CLUSTERING ORDER BY (ts ASC)
In your case, the table schema looks like:
CREATE TABLE alerts_by_year(
year int,
ts timestamp,
alert text,
PRIMARY KEY ((year), ts)
) WITH CLUSTERING ORDER BY (ts ASC)
Since the rows in each partition are sorted in chronological order from oldest to newest timestamp, a range query on the ts column is done sequentially, iterating one row at a time until the range condition is no longer satisfied.
Note that the drivers will automatically page through the results. For example, the Java driver will return the first 5000 rows by default. Your app will then need to retrieve the "next page" to get the next set of rows. Cheers!
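If the application only needs a bounded slice, the result size can also be capped in CQL itself (this is separate from driver-level paging); a minimal sketch against the table above:
SELECT * FROM alerts_by_year
WHERE year = 2022
  AND ts > '2022-06-24 03:11:00'
  AND ts < '2022-06-24 04:11:00'
LIMIT 100;   -- LIMIT bounds the number of rows returned for this slice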

Related

Can I change the order of rows without specifying a column as the clustering key?

I know I can change the on-disk sorting to descending by defining a table like this:
create table timeseries (
event_type text,
insertion_time timestamp,
event blob,
PRIMARY KEY (event_type, insertion_time)
)
WITH CLUSTERING ORDER BY (insertion_time DESC);
But the problem is that now insertion_time is part of a composite unique constraint (event_type + insertion_time), which is not desired.
Is there any way to change the order without making the column a primary key? Something like this:
create table timeseries (
event_type text,
insertion_time timestamp,
event blob,
PRIMARY KEY (event_type)
)
WITH CLUSTERING ORDER BY (insertion_time DESC);
What you want is not possible because in the second table schema, there will only ever be one row for every single partition. Another way of putting it: there are no rows to sort, since there is only ever one row per partition.
You can only specify the CLUSTERING ORDER if there is a clustering column to sort. Cheers!
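As an illustration (the values here are made up), with a table keyed only on event_type (and without the CLUSTERING ORDER clause, which would be rejected), two inserts for the same event_type collapse into a single row, because event_type is the entire primary key:
-- the second INSERT overwrites the first, since both share the primary key 'login'
INSERT INTO timeseries (event_type, insertion_time, event)
VALUES ('login', '2022-06-24 03:00:00', 0x01);
INSERT INTO timeseries (event_type, insertion_time, event)
VALUES ('login', '2022-06-24 04:00:00', 0x02);
-- SELECT * FROM timeseries WHERE event_type = 'login' now returns only one row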

Why does querying based on the first clustering key require an ALLOW FILTERING?

Say I have this Cassandra table:
CREATE TABLE orders (
customerId int,
datetime date,
amount int,
PRIMARY KEY (customerId, datetime)
);
Then why would the following query require an ALLOW FILTERING:
SELECT * FROM orders WHERE datetime >= '2020-01-01'
Cassandra could just go to all the individual partitions (i.e. customers) and filter on the clustering key datetime. Since datetime is sorted, there is no need to retrieve all the rows in orders and filter out the ones that match my WHERE clause (as far as I understand it).
I hope someone can enlighten me.
Thanks
This happens because, for normal operation, Cassandra needs the partition key: it is used to find which machine(s) store the data. If you don't have a partition key, as in your example, Cassandra needs to scan all the data to find the rows that match your query, and this requires the use of ALLOW FILTERING.
P.S. Data is sorted only inside the individual partitions, not globally.
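For comparison, a query that restricts the partition key can use the clustering order directly and does not need ALLOW FILTERING; a minimal sketch against the orders table above (the customer id is made up):
SELECT * FROM orders
WHERE customerId = 42
  AND datetime >= '2020-01-01';   -- served from one partition, rows already sorted by datetime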

Cassandra CLUSTERING ORDER with updates [performance]

With Cassandra it is possible to specify the cluster ordering on a table with a particular column.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
Note: In this example, there is one message per user_id (intended)
Given this table, my understanding is that the query's performance will be better in cases where recent data is queried.
However, if one were to make updates to the "modified" column, does it add extra overhead on the server to "re-order", and is that overhead significant compared to the query performance gain?
In other words, given this table, would it perform better if the "CLUSTERING ORDER BY (modified DESC)" were dropped?
UPDATE: Updated the invalid CQL by adding modified to the primary key; however, the original questions still stand.
In order to make modified a clustering column, it needs to be defined in the primary key.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
This way, your data will be partitioned by the hashed value of user_id, and within each user_id sorted by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)".
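For example (the query value is illustrative), the descending clustering order makes "latest message for a user" a cheap read of the first row in the partition:
-- the newest 'modified' value comes back first, so LIMIT 1 returns the latest message
SELECT * FROM myTable WHERE user_id = 1 LIMIT 1;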
Moving the comment to an answer, in reply to the updated question:
if one were to make updates to the "modified" column does it add extra overhead on the server to "re-order" and is that overhead vs query performance significant?
If modified is defined as part of the clustering key, you won't be able to update that record, but you will be able to add as many records as needed, each time with a different modified date.
Cassandra is an append-only database engine: any update to a record writes a new version with a newer write timestamp, and a SELECT returns the version with the latest timestamp. This means that there is no "re-order" operation.
Whether to drop or keep the clustering order should be decided based on how the information will be retrieved: if you are mostly going to read the latest records for a user_id, it makes sense to keep the clustering order as you defined it.
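To make the append-only point concrete (the values are made up), writing the same user_id with a different modified value adds a new row instead of reordering an existing one:
INSERT INTO myTable (user_id, message, modified) VALUES (1, 'first draft', '2022-01-01');
INSERT INTO myTable (user_id, message, modified) VALUES (1, 'second draft', '2022-01-02');
-- the partition for user_id = 1 now holds two rows, returned newest-first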
In your data model, user_id is the row key/shard key/partition key, which is important for data locality, and the clustering column (modified) specifies the order in which the data is arranged inside the partition. The combination of these two keys makes up the primary key.
Even in the RDBMS world, updating a primary key is avoided for the sake of data integrity. In Cassandra, however, there are no constraints or relations between column families/tables.
Writing the exact same values for the primary key fields (user_id, modified) will update the existing record; otherwise a new row is added.
Reference:
https://www.datastax.com/dev/blog/we-shall-have-order

Error creating table in cassandra - Bad Request: Only clustering key columns can be defined in CLUSTERING ORDER directive

I get the above error when I try to use the following CQL statement; not sure what's wrong with it.
CREATE TABLE Stocks(
  id uuid,
  market text,
  symbol text,
  value text,
  time timestamp,
  PRIMARY KEY(id)
) WITH CLUSTERING ORDER BY (time DESC);
Bad Request: Only clustering key columns can be defined in CLUSTERING ORDER directive
But this works fine. Can't I use some column which is not part of the primary key to arrange my rows?
CREATE TABLE timeseries (
... event_type text,
... insertion_time timestamp,
... event blob,
... PRIMARY KEY (event_type, insertion_time)
... )
... WITH CLUSTERING ORDER BY (insertion_time DESC);
"can't I use some column which is not part of primary key to arrange my rows?"
No, you cannot. From the DataStax documentation on the SELECT command:
ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY. This also applies to tables with more than two column components in the primary key.
Therefore, for your first CREATE to work, you will need to adjust your PRIMARY KEY to this:
PRIMARY KEY(id,time)
The second column in a compound primary key is known as the "clustering column." This is the column that determines the on-disk sort order of data within a partitioning key. Note that last part in italics, because it is important. When you query your Stocks column family (table) by id, all "rows" of column values for that id will be returned, sorted by time. In Cassandra you can only specify order within a partitioning key (and not for your entire table), and your partitioning key is the first key listed in a compound primary key.
Of course, the problem with this is that you probably want id to be unique (which means that CQL will only ever return one "row" of column values per partitioning key). Requiring time to be part of the primary key negates that, and makes it possible to store multiple values for the same id. This is the problem with partitioning your data by a unique id. It might be a good idea in the RDBMS world, but it can make querying in Cassandra more difficult.
Essentially, you are going to need to revisit your data model here. For instance, if you wanted to query prices over time, you could name the table something like "StockPriceEvents" with a primary key of (id,time) or (symbol,time). Querying that table would give you the prices recorded for each id or symbol, sorted by time. Now that may or may not be of any value to your use case. Just trying to explain how primary keys and sort order work in Cassandra.
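A rough sketch of that suggestion (the table and column names are only placeholders, not a fixed recommendation):
CREATE TABLE StockPriceEvents (
  symbol text,
  time timestamp,
  value text,
  PRIMARY KEY (symbol, time)
) WITH CLUSTERING ORDER BY (time DESC);
-- all recorded prices for one symbol, newest first:
SELECT * FROM StockPriceEvents WHERE symbol = 'ACME';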
Note: You should really use column names that have more meaning. Things like "id," "time," and "timeseries" are pretty vague and don't really describe anything about the context in which they are used.
When creating a table in Cassandra with the "CLUSTERING ORDER BY" option, make sure the clustering column is part of the primary key.
The table below was created with a clustering order, but the clustering column "Datetime" is not a primary key column, hence the error below.
ERROR_SCRIPT
cqlsh> CREATE TABLE IF NOT EXISTS cpdl3_spark_cassandra.log_data (
... IP text,
... URL text,
... Status text,
... UserAgent text,
... Datetime timestamp,
... PRIMARY KEY (IP)
... ) WITH CLUSTERING ORDER BY (Datetime DESC);
ERROR:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
CORRECTED_SCRIPT (Where the "Datetime" is added into the Primary Key columns)
cqlsh> CREATE TABLE IF NOT EXISTS cpdl3_spark_cassandra.log_data (
... IP text,
... URL text,
... Status text,
... UserAgent text,
... Datetime timestamp,
... PRIMARY KEY (IP,Datetime)
... ) WITH CLUSTERING ORDER BY (Datetime DESC);

Cassandra CQL SELECT sorting

If I define a table like this using cql:
CREATE TABLE scores (
name text,
age int,
score int,
date timestamp,
PRIMARY KEY (name, age, score)
);
And do a SELECT in cqlsh like this:
select * from mykeyspace.scores;
The displayed result always seems to be sorted by 'age', then 'score', in ascending order, regardless of input-data ordering (as expected, the returned rows are not sorted by the partition key 'name'). I have the following questions:
Does SELECT automatically sort the returned rows by the clustering keys?
If yes, what's the purpose of using the ORDER BY clause when using SELECT?
If no, how do I get the returned rows sorted by the clustering keys, since CQL doesn't allow ORDER BY on a SELECT *?
Your clustering columns define the order (in your case, age then score).
http://cassandra.apache.org/doc/cql3/CQL.html#createTableStmt
On a given physical node, rows for a given partition key are stored in the order induced by the clustering columns, making the retrieval of rows in that clustering order particularly efficient (see SELECT).
http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt
The ORDER BY option allows to select the order of the returned results. It takes as argument a list of column names along with the order for the column (ASC for ascendant and DESC for descendant, omitting the order being equivalent to ASC). Currently the possible orderings are limited (which depends on the table CLUSTERING ORDER):
if the table has been defined without any specific CLUSTERING ORDER, then the allowed orderings are the order induced by the clustering columns and the reverse of that one.
otherwise, the orderings allowed are the order of the CLUSTERING ORDER option and the reversed one.
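For instance (the partition value is illustrative), within a single partition the clustering order can be reversed with ORDER BY; a minimal sketch against the scores table above:
-- the reverse of the default (age ASC, score ASC); ORDER BY requires the
-- partition key to be restricted, which is why it cannot be applied to a
-- bare SELECT * across all partitions
SELECT * FROM scores WHERE name = 'alice' ORDER BY age DESC, score DESC;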
