Cassandra slow SELECT MAX(x) query - cassandra

I have a dev machine with Cassandra 3.9 and 2 tables: one has about 400,000 records, the other about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?

In Cassandra, running aggregation functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifying a partition key is bad practice. Instead, you can have another table that stores the max value of the x field for both tables.
However, you have to add some client side logic to maintain this max value in the other table when you run INSERT or UPDATE statements.
Table structures:
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
So with this structure you can get the max value for a table:
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
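The client-side logic mentioned above could look roughly like the following. This is only a minimal in-memory sketch: plain dicts stand in for t1, t2 and agg_table, and the class name MaxTracker and its methods are made up for illustration, not part of any driver.

```python
# Minimal in-memory sketch of the client-side bookkeeping. Plain dicts stand
# in for the Cassandra tables, so no driver calls are shown here.

class MaxTracker:
    def __init__(self):
        self.tables = {"t1": {}, "t2": {}}   # table_name -> {pk: x}
        self.agg_table = {}                  # table_name -> max_value

    def insert(self, table_name, pk, x):
        """Write the row, then raise the stored maximum if x exceeds it."""
        self.tables[table_name][pk] = x
        current = self.agg_table.get(table_name)
        if current is None or x > current:
            self.agg_table[table_name] = x

    def max_value(self, table_name):
        """Stand-in for: SELECT max_value FROM agg_table WHERE table_name = ?"""
        return self.agg_table.get(table_name)


tracker = MaxTracker()
tracker.insert("t1", "a", 7)
tracker.insert("t1", "b", 42)
tracker.insert("t1", "c", 13)
tracker.insert("t2", "a", 5)
print(tracker.max_value("t1"))  # 42
print(tracker.max_value("t2"))  # 5
```

Note that this sketch only ever raises the maximum. Overwriting or deleting the row that currently holds the max is not handled, and concurrent writers against a real cluster would need some guard around the compare-and-update step.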
Hope this can help you.

Related

Cassandra data model for intersection of ranges

Assume data with pk (text), start (int), end (int), extra_data(text).
Query is: given a pk (e.g. 'pk1') and a range (e.g [1000, 2000]), find all rows for 'pk1' which intersect that range. This (sql) logically translates to WHERE pk=pk1 AND end>=1000 AND start<=2000 (intersection condition)
Notice this is NOT the same as the more conventional query of:
all rows for pk1 where start>1000 and start<2000
If I define a table with end as part of the clustering key:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start, end)
)...
Then Cassandra does not allow the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Clustering column "end" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
Why does Cassandra not allow further filtering to limit the ranged rows? This forces me to filter the results application-side.
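For reference, the application-side filter mentioned above can be sketched as follows. The rows are made-up sample data, as if they had been fetched with a query restricted only on pk and start; Row and intersecting are hypothetical names, and the inclusive bounds follow the intersection condition stated earlier (end >= 1000 and start <= 2000).

```python
from typing import NamedTuple


class Row(NamedTuple):
    pk: str
    start: int
    end: int
    extra_data: str


def intersecting(rows, lo, hi):
    """Keep rows whose [start, end] overlaps [lo, hi]: end >= lo and start <= hi."""
    return [r for r in rows if r.end >= lo and r.start <= hi]


# Hypothetical rows, as if returned by a query on pk = 'pk1' AND start <= 2000.
fetched = [
    Row("pk1", 100, 500, "ends before the range"),
    Row("pk1", 900, 1500, "overlaps the left edge"),
    Row("pk1", 1200, 1300, "inside the range"),
    Row("pk1", 1900, 2500, "overlaps the right edge"),
]

for r in intersecting(fetched, 1000, 2000):
    print(r.start, r.end, r.extra_data)
```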
A second try would be to remove 'end' from clustering columns:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start)
)...
Then Cassandra rejects the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Here I would like to understand if I can safely add the ALLOW FILTERING and be assured Cassandra will perform the scan only of 'pk1'.
Using cqlsh 5.0.1 | Cassandra 3.11.3
Actually, I think you made the fatal mistake of designing your table first and then trying to adapt the application query to fit the table design.
In Cassandra data modelling, the primary principle is to always start by listing all your application queries THEN design a table for each of those application queries -- not the other way around.
Let's say I have an IoT use case where I have sensors collecting temperature readings once a day. If my application needs to retrieve the readings from the last 7 days from a sensor, the app query is:
Get the temperature for the last 7 days for sensor X
Assuming today is October 25, a more SQL-like representation of this app query is:
SELECT temperature FROM table
WHERE sensor = X
AND reading_date >= 2022-10-18
AND reading_date < 2022-10-25
This means that we need to design the table such that:
it is partitioned by sensor, and
the data is clustered by date.
The table schema would look like:
CREATE TABLE readings_by_sensor (
sensor text,
reading_date date,
temperature float,
PRIMARY KEY (sensor, reading_date)
)
We can then perform a range query on the date:
SELECT temperature FROM readings_by_sensor
WHERE sensor = ?
AND reading_date >= '2022-10-18'
AND reading_date < '2022-10-25'
You don't need two separate columns to represent the start and end of the range, because both bounds can be applied to the single clustering column reading_date. Cheers!

Storing time specific data in cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
Logic of retrieving current value is like following.
Find all rows with start_time<=current_time.
Then find the value with maximum start_time from the rows obtained in the first step.
PS:- Edited the question to make it more clear
The exact requirement cannot be met, but we can get close with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering column of your table.
Then, you need a different partition key. You could choose a fixed value, but that causes problems once the partition holds too many rows. It is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key:
Find all rows with start_time<=current_time
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
select the value with maximum start_time
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
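To make the "current value" lookup concrete, here is a small in-memory sketch of the same idea. A dict keyed by year stands in for the partitions, each list is kept in descending start_time order like the table above, and the sample data and the oldest_year parameter are assumptions made for illustration. It also shows the cost of bucketing by year: when the current year has no row at or before the requested time, earlier partitions have to be checked.

```python
from datetime import datetime

# year -> rows sorted by start_time descending, mirroring
# CLUSTERING ORDER BY (start_time DESC). The data is made up.
partitions = {
    2023: [(datetime(2023, 9, 1), "c"), (datetime(2023, 2, 1), "b")],
    2022: [(datetime(2022, 6, 1), "a")],
}


def current_value(partitions, now, oldest_year):
    """Return the value with the greatest start_time <= now, or None."""
    # Walk the year buckets from the newest candidate back to the oldest.
    for year in range(now.year, oldest_year - 1, -1):
        for start_time, value in partitions.get(year, []):
            # Rows are in descending order, so the first hit is the latest one.
            if start_time <= now:
                return value
    return None


print(current_value(partitions, datetime(2023, 5, 1), 2022))   # b
print(current_value(partitions, datetime(2023, 1, 15), 2022))  # a
print(current_value(partitions, datetime(2022, 1, 1), 2022))   # None
```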
Create two separate tables like below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both tables. To insert data into the second table, use a static partition value such as 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
In the current_value table each insert is an upsert, so a select always returns the latest value.

Ranges (intervals) request in Cassandra DB - CQL

Excuse me if this is a duplicate. I've found a few questions about time ranges here, but my case seems a little different and has not been discussed yet.
I would like to store quite big chunks (bins) of data (blobs - 2-4Mb, this is the “black-box data”, I can't change its layout) to access with interval keys:
...
primary key ( bin_id int, from_item_id int, to_item_id int )
...
with the ability to select by item ranges, as in this pseudo-code that selects all chunks containing the item interval [110, 200]:
select chunk from tb1 where bin_id = 100500 and from_item_id >= 110 and to_item_id <= 200;
Attempt to run such a query directly ended with error:
code=2200 [Invalid query] message="PRIMARY KEY column "to_item_id" cannot be restricted (preceding column "from_item_id" is restricted by a non-EQ relation)"
Currently the only solution I've found is to add a second table (tb_map) with a reverse mapping from item_id to bin_id, and query it like this:
...
-- in tb_map
primary key (dummy_id, item_id)
...
select bin_id from tb_map where dummy_id = SOME_MAGIK and item_id >= 110 and item_id <= 200;
And then use bin_id to retrieve chunks from tb1 with EQ or IN operator like here:
select * from tb1 where bin_id in (...);
But I can't use this model due to insert performance issues (the application should avoid many inserts into an additional table and should avoid maintaining additional data structures; it should be "as simple as a nail").
Is there any simple solution that stays within one table (or several simple tables)? I'm stuck with no idea how to model such behaviour in C* (maybe slices should be used?). Could local C* gurus provide any hints?
I'm using CQL 3.1
From CQL3 reference:
Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows.
In your case the query doesn't select a contiguous set of rows, so Cassandra refuses to process it.
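A small illustration of why the selection is not contiguous, using made-up clustering keys within one partition. Rows are sorted by (from_item_id, to_item_id), the way clustering columns order them, and the two helper functions are hypothetical names used only for this sketch.

```python
# Made-up clustering keys (from_item_id, to_item_id) within one partition,
# sorted the way the clustering columns would order them.
rows = sorted([(100, 150), (110, 180), (120, 400), (130, 190), (150, 500)])


def matching_positions(rows, lo, hi):
    """Positions, in clustering order, of rows with from_item_id >= lo and to_item_id <= hi."""
    return [i for i, (start, end) in enumerate(rows) if start >= lo and end <= hi]


def is_contiguous(positions):
    """True when the positions form one unbroken run."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


hits = matching_positions(rows, 110, 200)
print(hits)                 # [1, 3]
print(is_contiguous(hits))  # False
```

Restricting only from_item_id (for example from_item_id >= 110) would select positions 1 through 4, which is a single slice. Adding the second range on to_item_id punches holes in that slice.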

cassandra 2.0.9: query for undefined column

Using Cassandra 2.0.9 CQL, how does one query for rows that don't have a particular column defined? For example:
create table testtable ( id int primary key, thing int );
create index on testtable ( thing );
-- can now select rows by thing
insert into testtable( id, thing ) values ( 100, 100 );
-- row values will be persistent
update testtable using TTL 30 set thing=1 where id=100;
-- wait 30 seconds, thing column will go away for the row
select * from testtable;
Ideally I'd like to be able to do something like this:
select * from testtable where NOT DEFINED thing;
or some such and have the row with id==100 returned. Is there any way to search for rows that do not have a particular column value assigned?
I'm afraid I've been through the Datastax 2.0 manual, as well as the CQLSH help with no luck trying to find an operator or syntax for this. Thanks.
Doesn't appear to be available yet
https://issues.apache.org/jira/browse/CASSANDRA-3783

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
) WITH CLUSTERING ORDER BY (user_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
then you need an additional table created to handle those queries. It would be ordered only on date and not on user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you want is to "return the most recent inserted date", you may not need an additional table. You can use a static column to store the last update time per partition (CASSANDRA-6561).
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.
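The "multiple asynchronous queries in parallel" pattern can be sketched like this. It is a standard-library-only illustration: fetch_partition is a hypothetical stand-in for one single-partition query, and FAKE_PARTITIONS is made-up data. With a real driver you would issue the per-partition queries through its asynchronous API rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in data: (website_id, item_id) partition key -> rows.
FAKE_PARTITIONS = {
    (30, 10): ["row-a", "row-b"],
    (30, 11): ["row-c"],
    (30, 12): [],
}


def fetch_partition(key):
    """Hypothetical helper: one single-partition query per call."""
    return FAKE_PARTITIONS.get(key, [])


def fetch_many(keys, max_workers=8):
    """Run one query per partition key concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_partition, keys))  # keeps input order
    return [row for rows in results for row in rows]


print(fetch_many([(30, 10), (30, 11), (30, 12)]))  # ['row-a', 'row-b', 'row-c']
```

Each call targets exactly one partition, so no single coordinator has to gather data from many nodes on behalf of one large IN query.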
