How do I count distinct values based on a non-primary key filter in Cassandra? - cassandra

relational databases and Cassandra. With two tables like the following:
TABLE 1: PRIMARY KEY (ID, DATE)
 ID     | DATE       | TRIP_TIME
--------+------------+-----------
 B03291 | 2022-01-01 | 5
 B03291 | 2022-01-02 | 6
 ZR7875 | 2022-01-01 | 2
 ZR7875 | 2022-01-02 | 0
TABLE 2: PRIMARY KEY ((ID, TYPE), DATE)
 TYPE | ID     | DATE       | TRIP_TIME
------+--------+------------+-----------
 A    | B03291 | 2022-01-01 | 5
 A    | B03291 | 2022-01-02 | 6
 B    | ZR7875 | 2022-01-01 | 2
 B    | ZR7875 | 2022-01-02 | 0
 A    | GF4589 | 2022-01-01 | 7
The two tables have the same data but aggregated in a different way.
Using whichever table better suits this query, I need to get the COUNT of all the IDs that have a trip_time greater than 0 on DATE = '2022-01-01', but I can't use ALLOW FILTERING or create another table.
I have been using the query:
SELECT COUNT(ID)
FROM table1
WHERE date = '2022-01-01'
AND trip_time > 0;
But it raises an error and asks me to allow filtering. If I can't specify an ID, because I want the COUNT for all of them, is there any way to do this?
Thank you for your help and sorry if it is too obvious.

Cause
You are getting this error because your query doesn't restrict the partition key:
InvalidRequest: Error from server: code=2200 [Invalid query] \
message="Cannot execute this query as it might involve data filtering and thus may have \
unpredictable performance. If you want to execute this query despite the performance \
unpredictability, use ALLOW FILTERING"
Neither the trip date nor the trip time is a partition key column in either table (the date is only a clustering column and the trip time is a regular column), so it is not possible to query using these columns alone.
Warning
The ALLOW FILTERING clause enables filtering on non-primary key columns by performing a full table scan, querying every single partition on all nodes so it is very expensive and unpredictable.
The ALLOW FILTERING clause is only recommended for use when the query is restricted to a single partition.
Workaround
In order to query against non-primary key columns, you need to index the columns. To illustrate with an example, here's my table which has the trip id as the primary key:
CREATE TABLE stackoverflow.trips_by_id (
id text PRIMARY KEY,
tripdate date,
triptime int
);
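If I want to run queries using either tripdate or triptime, I need to index these columns. Note that CREATE CUSTOM INDEX requires a USING clause naming the index implementation; the statements below assume Storage-Attached Indexing (SAI), which is only available in Cassandra 5.0+ and some DataStax distributions:
CREATE CUSTOM INDEX tripdate_idx ON stackoverflow.trips_by_id (tripdate) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX triptime_idx ON stackoverflow.trips_by_id (triptime) USING 'StorageAttachedIndex';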
Now that I have indexed them, I can execute queries like:
SELECT ... FROM trips_by_id
WHERE tripdate = ?
AND triptime = ?
WARNING: Indexing has its own issues, so be aware of the pros and cons. Have a look at "When to use and not use an index" for details.
Solution
Cassandra is designed for high-throughput, high-velocity online transaction processing (OLTP) use cases where you are retrieving data one partition at a time (queries filtered by partition key).
In contrast, your query is analytics (OLAP) in nature because you are not reading just one partition -- you are scanning through the whole table. As such, the best way to run analytics queries is to use Apache Spark with the Spark Cassandra connector. Cheers!
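To make the shape of that computation concrete, here is a minimal sketch in plain Python of what such an analytics job has to compute once it has the rows in hand. The rows are copied from TABLE 1 in the question; the function name is hypothetical, and in practice the rows would come from a Spark job or a paged scan rather than an in-memory list.

```python
from datetime import date

# Sample rows copied from TABLE 1 in the question: (id, date, trip_time).
rows = [
    ("B03291", date(2022, 1, 1), 5),
    ("B03291", date(2022, 1, 2), 6),
    ("ZR7875", date(2022, 1, 1), 2),
    ("ZR7875", date(2022, 1, 2), 0),
]


def count_ids_with_trips(rows, day):
    """Count distinct IDs that have trip_time > 0 on the given day."""
    return len({
        row_id
        for row_id, row_date, trip_time in rows
        if row_date == day and trip_time > 0
    })


print(count_ids_with_trips(rows, date(2022, 1, 1)))  # prints 2
```

Every row has to be inspected to answer the question, which is why this belongs in an analytics engine rather than in a single CQL request.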

Related

Cassandra DB Query for System Date

I have a table, customer_info, in a Cassandra DB. It contains a column billing_due_date, which is a date field (dd-MMM-yy, e.g. 17-AUG-21). I need to fetch certain fields from the customer_info table based on billing_due_date, where billing_due_date should be equal to the system date + 1.
Can anyone suggest a Cassandra DB query for this?
fetch the certain fields from customer_info table based on billing_due_date
transaction_id is the primary key; it is just generated through uuid()
Unfortunately, there really isn't going to be a good way to do this. Right now, the data in the customer_info table is distributed across all nodes in the cluster based on a hash of the transaction_id. Essentially, any query based on something other than transaction_id is going to read from multiple nodes, which is a query anti-pattern in Cassandra.
In Cassandra, you need to design your tables based on the queries that they need to support. For example, choosing transaction_id as the sole primary key may distribute well, but it doesn't offer much in the way of query flexibility.
Therefore, the best way to solve for this query is to create a query table containing the data from customer_info with a key definition of PRIMARY KEY (billing_due_date, transaction_id). Then, a query like this should work:
> SELECT * FROM customer_info_by_date
  WHERE billing_due_date = toDate(now()) + 2d;
 billing_due_date | transaction_id                       | name
------------------+--------------------------------------+---------
 2021-08-20       | 2fe82360-e314-4d5b-aa33-5deee9f03811 | Rinzler
 2021-08-20       | 92cb9ee5-dee6-47fe-b372-0829f2e384cd | Clu
(2 rows)
Note that for this example, I am using the system date plus 2 days out. So in your case, you'll want to adjust the "duration" aspect from 2d down to 1d. Cassandra 4.0 allows date arithmetic, so this should work just fine if you are on that version. If you are not, you'll have to do the "system date plus one" calculation on the app side.
Another way to go about this would be to create a secondary index on billing_due_date, but I don't recommend that path, as it will query multiple nodes to build the result set.
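For clusters older than 4.0, the app-side "system date plus one" calculation could look roughly like the following Python sketch. The function name is hypothetical; the resulting date would be bound as the parameter of a query such as SELECT ... FROM customer_info_by_date WHERE billing_due_date = ?.

```python
from datetime import date, timedelta


def billing_due_date_to_query(today, days_ahead=1):
    """Return the billing_due_date value to bind in the WHERE clause."""
    return today + timedelta(days=days_ahead)


# In the application you would pass date.today(); a fixed date is used here
# so that the example is reproducible.
target = billing_due_date_to_query(date(2021, 8, 19))
print(target.isoformat())  # prints 2021-08-20
```

Doing the arithmetic with date objects rather than strings keeps month and year rollovers correct.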

Optimization of a query which uses arithmetic operations in WHERE clause

I need to retrieve records where the expiration date is today. The expiration date is calculated dynamically using two other fields (startDate and durationDays):
SELECT * FROM subscription WHERE startDate + durationDays < currentDate()
Does it make sense to add two indexes for these two columns? Or should I consider adding a new column expirationDate and create an index for it only?
I'm wondering how Cassandra handles a filter like the one in my example. Does it do a full scan?
First of all, your question is predicated on CQL's ability to perform (date) arithmetic. It cannot.
> SELECT * FROM subscription WHERE startDate + durationDays < currentDate();
SyntaxException: line 1:43 no viable alternative at input '+' (SELECT * FROM subscription WHERE [startDate] +...)
Secondly, the currentDate() function does not exist in Cassandra 3.11.4.
> SELECT currentDate() FROM system.local;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Unknown function 'currentdate'"
It does work in Cassandra 4.0, but as 4.0 has not been released yet, you really shouldn't be using it.
So let's assume that you've created your secondary indexes on startDate and durationDays and you're just querying on those, without any arithmetic.
Does it execute a full table scan?
ABSOLUTELY.
The reason is that a query solely on secondary index columns has no partition key. Therefore, it has to search for these values on all partitions on all nodes. In a large cluster, your query would likely time out.
Also, when it finds matching data, it has to keep querying. As those values are not unique, it's entirely possible that there are several results to be returned. Carlos is 100% correct in advising you to rebuild your table based on what you want to query.
Recommendations:
Try not to build a table with secondary indexes. Like ever.
If you have to build a table with secondary indexes, try to have a partition key in your WHERE clause to keep the query isolated to a single node.
Any filtering on dynamic (computed) values needs to be done on the application side.
In your case, it might make more sense to create a column called expirationDate, do your date arithmetic in your app, and then INSERT that value into your table.
You'll also want to follow the "time bucket" pattern for handling time series data (which is what this appears to be). Say that month works as a "bucket" (it may or may not for your use case). PRIMARY KEY ((month), expirationDate, id) would be a good key. This way, all the subscriptions for a particular month are stored together, clustered by expirationDate, with id on the end to act as a tie-breaker for uniqueness.
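As a rough illustration of the last two recommendations, the application could compute both the expirationDate and its month bucket before the INSERT. This is only a sketch: the function name is hypothetical, and representing the month bucket as the first day of the month is an assumption, since the key above does not fix a type for month.

```python
from datetime import date, timedelta


def expiration_row(sub_id, start_date, duration_days):
    """Build the values to INSERT for one subscription."""
    expiration = start_date + timedelta(days=duration_days)
    # Assumption: the month bucket is stored as the first day of the month.
    month = expiration.replace(day=1)
    return {"month": month, "expirationDate": expiration, "id": sub_id}


row = expiration_row("sub-1", date(2021, 1, 20), 30)
print(row["month"], row["expirationDate"])  # prints 2021-02-01 2021-02-19
```

A query for subscriptions expiring today then only needs the current month as the partition key and today's date as the clustering restriction.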
One of the main differences between Cassandra and relational databases is that the definition of the tables depends on the queries that will be used. The conditions used to retrieve the data (the WHERE clause) should be covered by the primary key, as this will perform better than an index on the table.
There are multiple resources regarding the read path and the quirks of primary keys vs indexes; this talk from the Cassandra Summit may be useful.

Cassandra use aggregate function and then order by that aggregate

I have a cassandra database with a table that has the following columns:
itemid
userid
rating
itemid and userid are the primary key. My query looks like this:
SELECT itemid, avg(rating) as avgRating from mytable GROUP BY itemid order by avgRating asc;
I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
How can I fix this?
I need to order by the average rating afterwards so that I can get the top 10 movies based on their average rating.
Cassandra can only order results by clustering column(s). It cannot order results by an aggregate function.
There are a couple of options you could look at in order to accomplish this.
Make the query and then re-order the results in your application.
This option may work if you only expect a limited number of rows to be returned from each query.
Note that it is recommended that you only use aggregate functions (like avg()) when you know that it will only apply to a limited number of rows. Ideally you should only use them when operating on a single partition (use a WHERE clause to limit to a single partition). If you don't have any limit you may see very slow queries, or query timeouts if Cassandra needs to read a large number of rows in order to calculate the aggregate.
Store a pre-calculated average in the table, or cache it in your application.
This is the best option if you need calculated averages over a larger data set.
If you make average_rating a clustering column Cassandra will store the averages for each partition in sorted order. This is very efficient from Cassandra's perspective.
The downside is that you'll need to calculate the average in your application each time you insert into or update a row, because it will be a primary key column in your Cassandra table.
One thing you could look into is using a Cassandra trigger to calculate the average for you. This may make life easier for you if you have multiple applications writing to this table, however I am not actually sure if it is possible to modify a primary key column via a custom trigger. I would recommend doing some research & testing if you decide to look at this option. You can read about triggers here.
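For the first option, the re-ordering step in the application can be quite small. The sketch below assumes the rows returned by the GROUP BY query (without the ORDER BY) have already been fetched as (itemid, avgRating) pairs; the function name and the sample values are hypothetical.

```python
import heapq


def top_rated(avg_rows, limit=10):
    """Return the `limit` rows with the highest average rating, best first."""
    return heapq.nlargest(limit, avg_rows, key=lambda row: row[1])


# Hypothetical (itemid, avgRating) pairs as returned by the aggregate query.
avg_rows = [("item-a", 3.5), ("item-b", 4.8), ("item-c", 4.1)]
print(top_rated(avg_rows, limit=2))  # prints [('item-b', 4.8), ('item-c', 4.1)]
```

heapq.nlargest avoids sorting the full result when only the top few rows are needed, but the aggregate query itself still has to read every rating, so the caveat about large row counts applies.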

What is best possible way out to sort records by aggregate value in Cassandra?

I have the following data model for car production data.
CREATE TABLE IF NOT EXISTS mytable (
date date,
color varchar,
modelid varchar,
PRIMARY KEY ((color), date, modelid)
) WITH CLUSTERING ORDER BY (date DESC);
I want to sort it by total column in cassandra, which I was expecting to be generated as follows:
SELECT color, count(*) AS total
FROM cars
WHERE date<='2017-12-07' AND date >'2017-11-30'
GROUP BY color
ORDER BY total
ALLOW FILTERING;
But as I have come to know, Cassandra only supports sorting by clustering columns, and I can't keep the aggregate value in the table a priori. What is the best possible way to do this sorting?
First thing: the query that you're using is very inefficient. By using ALLOW FILTERING you're scanning data on all servers; this may work for small datasets, but won't work for big ones. You need to model your tables around the queries that you're planning to execute.
Coming to your question: you need to either use Spark or do the sorting inside your application.
You shouldn't think of Cassandra as a SQL-like database; to use it you need to follow some rules about data modelling, querying, etc. I would recommend taking the DS220 course on DataStax Academy to learn about modelling for Cassandra.
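A minimal sketch of the application-side variant, assuming the application already knows the list of colors and has obtained one total per color (for example from a single-partition count restricted by color and the date range). The function name and the sample totals are hypothetical.

```python
def sort_color_totals(color_counts):
    """Order (color, total) pairs by total, smallest first, like ORDER BY total."""
    return sorted(color_counts.items(), key=lambda pair: pair[1])


# Hypothetical per-color totals gathered by the application.
color_counts = {"red": 12, "blue": 4, "white": 9}
print(sort_color_totals(color_counts))
# prints [('blue', 4), ('white', 9), ('red', 12)]
```

Because color is the partition key, each per-color count stays within one partition and does not need ALLOW FILTERING.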

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to perform. The reason that Cassandra is trying to keep you from performing queries on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response times. Unfortunately, Cassandra isn't able to determine automatically the best way to partition data. The end user must determine a schema which fits into the C* data model and allows the queries they want at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT date FROM active_accounts WHERE customer_name = ? AND account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
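The application-side code for working out which monthly partitions a date range touches could be sketched as follows. The function name is hypothetical, and plain dates are used for the month buckets for readability even though the schema above declares start_month as a timestamp.

```python
from datetime import date


def month_buckets(start, end):
    """List the first day of every month touched by the range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


for bucket in month_buckets(date(2014, 11, 12), date(2015, 1, 11)):
    print(bucket)  # prints 2014-11-01, 2014-12-01, 2015-01-01 on separate lines
```

The application would then run one range query per bucket, restricting start_month to that bucket and date to the requested interval, and add up the per-partition counts.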
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for use by those who run analytics jobs or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily require contacting every node in the cluster. This will become less and less performant as the cluster grows and is definitely not something you want for a frequent query.
