COUNT(*) vs. COUNT(1) vs. COUNT(column) with WHERE condition performance in Cassandra

I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
Table is :
create table tableA
(
pk1 uuid,
pk2 int,
pk3 text,
pk4 int,
fc1 int,
fc2 bigint,
....
fcn blob,
primary key (pk1, pk2, pk3, pk4)
);
The query is executed often and takes up to 2s to execute.
I am wondering whether there would be any performance gain from refactoring it to:
select count(1) from tableA where pk1 = xyz

Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
COUNT(1) counts a constant, non-null expression, so it counts every row that matches the query.
COUNT(column_name), by contrast, counts only the rows where that column is non-null.
In your case the WHERE clause restricts pk1, which is a primary key column and can never be null, so the non-null rule does not change the result. I don't think there will be any difference in performance in using one over the other. This answer tried confirming the same using some performance tests.
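The counting rules above can be illustrated outside Cassandra. This is only a sketch of the semantics in plain Python, not of how Cassandra executes the query; the helper names and the sample rows are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def count_star(rows: list[Row]) -> int:
    """COUNT(*) or COUNT(1): every matching row is counted."""
    return len(rows)


def count_column(rows: list[Row], column: str) -> int:
    """COUNT(column): only rows whose value for `column` is not null."""
    return sum(1 for row in rows if row.get(column) is not None)


# Hypothetical rows that already passed `WHERE pk1 = xyz`.
rows = [
    {"pk1": "xyz", "pk2": 1, "fc1": 10},
    {"pk1": "xyz", "pk2": 2, "fc1": None},
    {"pk1": "xyz", "pk2": 3, "fc1": 7},
]

print(count_star(rows))           # 3
print(count_column(rows, "pk1"))  # 3, a key column is never null
print(count_column(rows, "fc1"))  # 2, the null fc1 is skipped
```

Counting a key column gives the same number as counting every row, which is why the three forms return the same result here.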

In general, COUNT is not recommended in Cassandra, as it may have to scan through multiple nodes to get your answer back. I'm also not sure the count you get is really consistent.

Related

Cassandra data model for intersection of ranges

Assume data with pk (text), start (int), end (int), extra_data(text).
Query is: given a pk (e.g. 'pk1') and a range (e.g [1000, 2000]), find all rows for 'pk1' which intersect that range. This (sql) logically translates to WHERE pk=pk1 AND end>=1000 AND start<=2000 (intersection condition)
Notice this is NOT the same as the more conventional query of:
all rows for pk1 where start>1000 and start<2000
If I define a table with end as part of the clustering key:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start, end)
)...
Then Cassandra does not allow the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Clustering column "end" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
Why does Cassandra not allow further filtering to limit the ranged rows? (This forces me to filter the results application-side.)
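For reference, the application-side filter could look roughly like the sketch below. It assumes the client has already fetched the rows for 'pk1' using only the restriction Cassandra accepts (the partition key plus a bound on start); the Interval type, the intersecting helper and the sample rows are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Interval(NamedTuple):
    start: int
    end: int
    extra_data: str


def intersecting(rows: Iterable[Interval], lo: int, hi: int) -> Iterator[Interval]:
    """Yield rows whose [start, end] range overlaps [lo, hi] (inclusive)."""
    for row in rows:
        # Intersection condition from the question: end >= lo AND start <= hi.
        if row.start <= hi and row.end >= lo:
            yield row


# Hypothetical rows returned for pk = 'pk1'.
fetched = [
    Interval(100, 500, "a"),    # ends before the range
    Interval(900, 1100, "b"),   # overlaps the left edge
    Interval(1200, 1800, "c"),  # entirely inside
    Interval(1900, 2500, "d"),  # overlaps the right edge
]

hits = list(intersecting(fetched, 1000, 2000))
print([h.extra_data for h in hits])  # ['b', 'c', 'd']
```

The cost of this approach is that every row with start <= 2000 in the partition is transferred to the client, including those that end long before 1000.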
A second try would be to remove 'end' from clustering columns:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start)
)...
Then Cassandra rejects the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Here I would like to understand if I can safely add the ALLOW FILTERING and be assured Cassandra will perform the scan only of 'pk1'.
Using cqlsh 5.0.1 | Cassandra 3.11.3
Actually, I think you made the fatal mistake of designing your table first and then trying to adapt the application query to fit the table design.
In Cassandra data modelling, the primary principle is to always start by listing all your application queries THEN design a table for each of those application queries -- not the other way around.
Let's say I have an IoT use case where I have sensors collecting temperature readings once a day. If my application needs to retrieve the readings from the last 7 days from a sensor, the app query is:
Get the temperature for the last 7 days for sensor X
Assuming today is October 25, a more SQL-like representation of this app query is:
SELECT temperature FROM table
WHERE sensor = X
AND reading_date >= 2022-10-18
AND reading_date < 2022-10-25
This means that we need to design the table such that:
it is partitioned by sensor, and
the data is clustered by date.
The table schema would look like:
CREATE TABLE readings_by_sensor (
sensor text,
reading_date date,
temperature float,
PRIMARY KEY (sensor, reading_date)
)
We can then perform a range query on the date:
SELECT temperature FROM readings_by_sensor
WHERE sensor = ?
AND reading_date >= '2022-10-18'
AND reading_date < '2022-10-25'
You don't need two separate columns to represent the start and end of the range, because a range query on the single clustering column reading_date covers both bounds. Cheers!
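The two bounds for the "last 7 days" query can be computed in the application before binding them to the statement. A minimal sketch, assuming the lower bound is inclusive and the upper bound exclusive as in the query above; last_n_days_bounds is a hypothetical helper.

```python
from datetime import date, timedelta


def last_n_days_bounds(today: date, days: int) -> tuple[date, date]:
    """Return (inclusive lower bound, exclusive upper bound) for the last `days` days."""
    return today - timedelta(days=days), today


lower, upper = last_n_days_bounds(date(2022, 10, 25), 7)
print(lower.isoformat(), upper.isoformat())  # 2022-10-18 2022-10-25
```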

Set primary key for range query in Cassandra

I want to create a table with these columns: id1, id2, type, time, data, version.
The frequent queries are:
select * from table_name where id1 = ... and id2 =... and type = ...
select * from table_name where id1= ... and type = ... and time > ... and time < ...
How should I set the primary key so that these queries are fast?
As you have two different queries, you will likely need to have two different tables for them to perform well. This is not unusual for Cassandra data models. Keep in mind that for both of these, the PRIMARY KEY definition in Cassandra is largely dependent on the cardinalities and anticipated query patterns. As you have only provided the latter, you may need to make adjustments based on the cardinalities of id1, id2, and type.
select * from table_name where id1 = X and id2 = Y and type = Z;
So here I'm going to make an educated guess that id1 and id2 are nigh unique (high cardinality), as IDs usually are. I don't know how many types are available in your application, but as long as there aren't more than 10,000 this should work:
CREATE TABLE table_name_by_ids (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1,id2),type));
This will key your partitions on a joint hash of id1 and id2, sorting the rows inside by type (default ascending).
select * from table_name where id1= X and type = Z and time > A and time < B;
Likewise, the table to support this query will look like this:
CREATE TABLE table_name_by_id1_time (
id1 TEXT,
id2 TEXT,
type TEXT,
time TIMESTAMP,
data TEXT,
version TEXT,
PRIMARY KEY ((id1),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Again, this should work as long as you don't have more than several thousand type/time combinations.
One final adjustment that I would make though, would be around judging just how many type/time combinations you expect to have over the life of the application. If this data will grow over time, then the above will cause the partitions to grow to an unmaintainable point. To keep that from happening, I'd also recommend adding a time "bucket."
version TEXT,
month_bucket TEXT,
PRIMARY KEY ((id1,month_bucket),type,time))
WITH CLUSTERING ORDER BY (type ASC, time DESC);
Likewise for this, the query will need to be adjusted as well:
select * from table_name_by_id1_time
where id1= 'X' and type = 'Z'
and month_bucket='201910'
and time > '2019-10-07 00:00:00' and time < '2019-10-07 16:22:12';
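Because the bucket is now part of the partition key, the application has to derive it from the timestamp on every write, and a read that spans several months needs one query per bucket. A minimal sketch, assuming the 'YYYYMM' format shown above; month_bucket and buckets_between are hypothetical helper names.

```python
from datetime import datetime


def month_bucket(ts: datetime) -> str:
    """Format a timestamp as the 'YYYYMM' bucket used in the partition key."""
    return ts.strftime("%Y%m")


def buckets_between(start: datetime, end: datetime) -> list[str]:
    """List every 'YYYYMM' bucket touched by the range [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return buckets


print(month_bucket(datetime(2019, 10, 7, 12, 0, 1)))  # 201910
print(buckets_between(datetime(2019, 11, 20), datetime(2020, 1, 5)))
# ['201911', '201912', '202001']
```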
Hope this helps.
How do I guarantee the atomicity of these two insertions?
Simply put, you can run the two INSERTs together in an atomic batch.
BEGIN BATCH
INSERT INTO table_name_by_ids (
id1, id2, type, time, data, version
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0'
) ;
INSERT INTO table_name_by_id1_time (
id1, id2, type, time, data, version, month_bucket
) VALUES (
'X', 'Y', 'Z', '2019-10-07 12:00:01','stuff','1.0','201910'
);
APPLY BATCH;
For more info, check out the DataStax docs on atomic batches: https://docs.datastax.com/en/dse/6.7/cql/cql/cql_using/useBatchGoodExample.html

cassandra error when using select and where in cql

I have a cassandra table defined like this:
CREATE TABLE test.test(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (id, time, tag)
)
And I want to select one column using select.
I tried:
select * from test where lonumb = 4231;
It gives:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
Also I cannot do
select * from test where mstatus = true;
Doesn't Cassandra support WHERE as a part of CQL? How can I correct this?
You can only use WHERE on the indexed or primary key columns. To correct your issue you will need to create an index.
CREATE INDEX iname
ON keyspacename.tablename(columnname);
You can see more info here.
But you have to keep in mind that this query will have to run against all nodes in the cluster.
Alternatively you might rethink your table structure if the lonumb is something you'll do the most queries on.
Jny is correct in that WHERE is only valid on columns in the PRIMARY KEY, or on those for which a secondary index has been created. One way to solve this issue is to create a specific query table for lonumb queries.
CREATE TABLE test.testbylonumb(
id text,
time bigint,
tag text,
mstatus boolean,
lonumb int,
PRIMARY KEY (lonumb, time, id)
)
Now, this query will work:
select * from testbylonumb where lonumb = 4231;
It will return all CQL rows where lonumb = 4231, sorted by time. I put id on the PRIMARY KEY to ensure uniqueness.
select * from test where mstatus = true;
This one is trickier. Indexes and keys on low-cardinality columns (like booleans) are generally considered a bad idea. See if there's another way you could model that. Otherwise, you could experiment with a secondary index on mstatus, but only use it when you specify a partition key (lonumb in this case), like this:
select * from testbylonumb where lonumb = 4231 AND mstatus = true;
Maybe that wouldn't perform too badly, as you are restricting it to a specific partition. But I definitely wouldn't ever do a SELECT * on mstatus.

Can an index be created on a UUID Column?

Is it possible to create an index on a UUID/TIMEUUID column in Cassandra? I'm testing out a model design which would have an index on a UUID column, but queries on that column always return 0 rows found.
I have a table like this:
create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
I create an index with this command:
create index idx on some_data (run_id) ;
No errors are thrown by CQL when I create this index.
I have a small bit of test data in the table:
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-----------------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
However, when I run the query:
select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
CQLSH just returns: (0 rows)
If I use an int for the run_id then the index behaves as expected.
Yes, you can create a secondary index on a UUID. The real question is "should you?"
In any case, I followed your steps, and got it to work.
Connected to Test Cluster at 192.168.23.129:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
aploetz#cqlsh> use stackoverflow ;
aploetz#cqlsh:stackoverflow> create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
aploetz#cqlsh:stackoverflow> create index idx on some_data (run_id) ;
aploetz#cqlsh:stackoverflow> INSERT INTO some_data (site_id, user_id, run_id, value) VALUES (1,1,9e118af0-ac92-11e4-81ae-8d1bc921f26d,3);
aploetz#cqlsh:stackoverflow> select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
code=2200 [Invalid query] message="unconfigured columnfamily usr_rec3"
aploetz#cqlsh:stackoverflow> select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
(1 rows)
Notice though, that when I ran this command, it failed:
select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
Are you sure that you didn't mean to select from some_data instead?
Also, creating secondary indexes on high-cardinality columns (like a UUID) is generally not a good idea. If you need to query by run_id, then you should revisit your data model and come up with an appropriate query table to serve that.
Clarification:
Using secondary indexes in general is not considered good practice. In the new book Cassandra High Availability, Robbie Strickland identifies their use as an anti-pattern, due to poor performance.
Just because a column is of the UUID data type doesn't necessarily make it high-cardinality. That's more of a data model question for you. But knowing the nature of UUIDs and their underlying purpose toward being unique, is setting off red flags.
Put these two points together, and there isn't anything about creating an index on a UUID that sounds appealing to me. If it were my cluster, and (more importantly) I had to support it later, I wouldn't do it.

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table created to handle those queries. It would order only on date and not on user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you are trying to get is "return the most recent inserted date", you may not need an additional table. You can use a static column to store the last update time per partition (see CASSANDRA-6561).
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.
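The "one query per partition, run in parallel" idea can be sketched as below. This is not driver code: a thread pool stands in for the driver's asynchronous API, and fetch_all, fake_fetch and FAKE_TABLE are hypothetical names, with fake_fetch playing the role of a single-partition SELECT.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

Key = tuple[int, int]  # (website_id, item_id), the composite partition key
Row = dict


def fetch_all(
    partition_keys: Iterable[Key],
    fetch_one: Callable[[Key], list[Row]],
    max_workers: int = 8,
) -> list[Row]:
    """Run one single-partition query per key in parallel and merge the rows."""
    keys = list(partition_keys)
    if not keys:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(keys))) as pool:
        results = pool.map(fetch_one, keys)  # results come back in input order
        return [row for rows in results for row in rows]


# Hypothetical in-memory stand-in for the table, keyed by partition key.
FAKE_TABLE = {
    (30, 10): [{"user_id": 0, "date": "2015-01-02"}],
    (30, 11): [{"user_id": 4, "date": "2015-01-03"},
               {"user_id": 5, "date": "2015-01-04"}],
}


def fake_fetch(key: Key) -> list[Row]:
    """Stand-in for `SELECT ... WHERE website_id = ? AND item_id = ?`."""
    return FAKE_TABLE.get(key, [])


rows = fetch_all([(30, 10), (30, 11), (30, 12)], fake_fetch)
print(len(rows))  # 3
```

Each call touches exactly one partition, so no single coordinator has to gather results from many nodes on behalf of one large IN query.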
