What is the best way to sort records by an aggregate value in Cassandra? - cassandra

I have the following data model for car production data.
CREATE TABLE IF NOT EXISTS cars (
date date,
color varchar,
modelid varchar,
PRIMARY KEY ((color), date, modelid)
)WITH CLUSTERING ORDER BY (date desc);
I want to sort it by a total column in Cassandra, which I expected to generate as follows:
SELECT color, count(*) AS total
FROM cars
WHERE date<='2017-12-07' AND date >'2017-11-30'
GROUP BY color
ORDER BY total
ALLOW FILTERING;
However, I have learned that Cassandra only supports sorting by clustering columns, and I can't store the aggregate value in the table a priori. What is the best way to do this sorting?

First, the query you're using is very inefficient. With ALLOW FILTERING you scan the data on all servers. This may work for small datasets, but it won't work for big ones. You need to model your tables around the queries that you plan to execute.
As for your question: you need to either use Spark or do the sorting inside your application.
You shouldn't think of Cassandra as a SQL-like database. To use it, you need to follow some rules about data modelling, querying, etc. I would recommend taking the DS220 course on DataStax Academy to learn about modelling for Cassandra.
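As a rough illustration of the "sort inside your application" option, the sketch below counts rows per color and sorts the totals on the client. It assumes the rows have already been fetched as (color, date, modelid) tuples; sort_colors_by_total and the sample rows are hypothetical names and data, not part of any driver API.

```python
from collections import Counter
from datetime import date


def sort_colors_by_total(rows, start, end):
    """Count rows per color with start < date <= end, sorted by ascending count."""
    totals = Counter(color for color, day, _modelid in rows if start < day <= end)
    # Sort by count, then by color name so that ties come out in a stable order.
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


# Hypothetical sample rows standing in for query results.
rows = [
    ("red", date(2017, 12, 1), "m1"),
    ("red", date(2017, 12, 2), "m2"),
    ("blue", date(2017, 12, 3), "m3"),
    ("red", date(2017, 11, 30), "m4"),  # excluded: date must be > 2017-11-30
]
result = sort_colors_by_total(rows, date(2017, 11, 30), date(2017, 12, 7))
```

This only stays cheap while the number of distinct colors is small enough to hold in memory; for larger aggregations, the Spark route mentioned above is the more realistic choice.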

Related

Cassandra DB Query for System Date

I have a table customer_info in a Cassandra DB. It contains a column billing_due_date, which is a date field (dd-MMM-yy, e.g. 17-AUG-21). I need to fetch certain fields from the customer_info table based on billing_due_date, where billing_due_date should be equal to the system date + 1.
Can anyone suggest a Cassandra DB query for this?
transaction_id is the primary key; it is simply generated with uuid().
Unfortunately, there really isn't going to be a good way to do this. Right now, the data in the customer_info table is distributed across all nodes in the cluster based on a hash of the transaction_id. Essentially, any query based on something other than transaction_id is going to read from multiple nodes, which is a query anti-pattern in Cassandra.
In Cassandra, you need to design your tables based on the queries that they need to support. For example, choosing transaction_id as the sole primary key may distribute well, but it doesn't offer much in the way of query flexibility.
Therefore, the best way to solve for this query is to create a query table containing the data from customer_info with a key definition of PRIMARY KEY (billing_due_date, transaction_id). Then, a query like this should work:
> SELECT * FROM customer_info_by_date
WHERE billing_due_date = toDate(now()) + 2d;
billing_due_date | transaction_id | name
------------------+--------------------------------------+---------
2021-08-20 | 2fe82360-e314-4d5b-aa33-5deee9f03811 | Rinzler
2021-08-20 | 92cb9ee5-dee6-47fe-b372-0829f2e384cd | Clu
(2 rows)
Note that for this example, I am using the system date plus 2 days out. So in your case, you'll want to adjust the "duration" aspect from 2d down to 1d. Cassandra 4.0 allows date arithmetic, so this should work just fine if you are on that version. If you are not, you'll have to do the "system date plus one" calculation on the app side.
Another way to go about this would be to create a secondary index on billing_due_date, but I don't recommend that path, as it will query multiple nodes to build the result set.
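For versions without date arithmetic, the "system date plus one" calculation on the app side could look roughly like the sketch below. The billing_due_date helper is a hypothetical name, and the %s placeholder is an assumption about the driver in use; placeholder syntax differs between drivers and between simple and prepared statements.

```python
from datetime import date, timedelta


def billing_due_date(today, days_ahead=1):
    """Return the date `days_ahead` days after `today`."""
    return today + timedelta(days=days_ahead)


# The CQL text stays fixed; the computed date is passed as a bound parameter.
QUERY = "SELECT * FROM customer_info_by_date WHERE billing_due_date = %s"

# In a real application `today` would be date.today(); a fixed date is used here
# so that the month rollover is easy to see.
params = (billing_due_date(date(2021, 8, 31)),)
```

Passing the date as a parameter rather than formatting it into the query string avoids having to match the literal date format that CQL expects.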

Cassandra use aggregate function and then order by that aggregate

I have a cassandra database with a table that has the following columns:
itemid
userid
rating
itemid and userid are the primary key. My query looks like this:
SELECT itemid, avg(rating) as avgRating from mytable GROUP BY itemid order by avgRating asc;
I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
How can I fix this?
I need to order by the average ratings so that I can get the top 10 movies based on their average rating.
Cassandra can only order results by clustering column(s). It cannot order results by an aggregate function.
There are a couple of options you could look at in order to accomplish this.
1. Make the query and then re-order the results in your application.
This option may work if you only expect a limited number of rows to be returned from each query.
Note that it is recommended that you only use aggregate functions (like avg()) when you know that it will only apply to a limited number of rows. Ideally you should only use them when operating on a single partition (use a WHERE clause to limit to a single partition). If you don't have any limit you may see very slow queries, or query timeouts if Cassandra needs to read a large number of rows in order to calculate the aggregate.
2. Store a pre-calculated average in the table, or cache it in your application.
This is the best option if you need calculated averages over a larger data set.
If you make average_rating a clustering column Cassandra will store the averages for each partition in sorted order. This is very efficient from Cassandra's perspective.
The downside is that you'll need to calculate the average in your application each time you insert into or update a row, because it will be a primary key column in your Cassandra table.
One thing you could look into is using a Cassandra trigger to calculate the average for you. This may make life easier for you if you have multiple applications writing to this table, however I am not actually sure if it is possible to modify a primary key column via a custom trigger. I would recommend doing some research & testing if you decide to look at this option. You can read about triggers here.
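The first option, re-ordering in the application, might be sketched as follows. It assumes the rows arrive as (itemid, userid, rating) tuples; top_items_by_average and the sample data are hypothetical, and the averages are computed client-side rather than with avg().

```python
import heapq
from collections import defaultdict


def top_items_by_average(rows, n=10):
    """Return the n (itemid, average) pairs with the highest average rating."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for itemid, _userid, rating in rows:
        sums[itemid] += rating
        counts[itemid] += 1
    averages = ((itemid, sums[itemid] / counts[itemid]) for itemid in sums)
    # nlargest avoids sorting every item when only the top n are needed.
    return heapq.nlargest(n, averages, key=lambda pair: pair[1])


# Hypothetical sample rows standing in for query results.
rows = [("a", 1, 4), ("a", 2, 2), ("b", 1, 5), ("c", 1, 1), ("c", 2, 2)]
top_two = top_items_by_average(rows, n=2)
```

The same running sums and counts are what an application would need to maintain if it chose the second option and stored a pre-calculated average.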

Allow filtering function in Cassandra (which choice is correct?)

I am currently trying to model some time series data in base of Cassandra.
For example i have a table bigint_table, which was created by following query
CREATE TABLE bigint_table (name_id int, tuuid timeuuid, timestamp timestamp, value text, PRIMARY KEY ((name_id), tuuid, timestamp)) WITH CLUSTERING ORDER BY (tuuid asc, timestamp asc)
The tuuid column was added because without it I had problems and lost some data while inserting into the DB. name_id represents the ID of the channel the data comes from. In one table there is a lot of data with the same ID, but rows are unique by timestamp and tuuid (values can also be the same sometimes).
I consistently execute 2 different queries to get values and timestamps
1.
select value from bigint_table where name_id=6 and timestamp>'2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30 8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
2.
select timestamp from bigint_table where name_id=6 and timestamp>'2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30 8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
In this post the author says one needs to resist the urge to just add ALLOW FILTERING to it, and that one should think about the data, the model, and what one is trying to do.
I thought a lot about whether to use ALLOW FILTERING, and I concluded that I have no choice in my case and need to use it. But the words in the post I mentioned above keep me in doubt. I would like to know your advice and what you think about my problem. Is there another way to model my tables so that the queries do not require ALLOW FILTERING? I would be very thankful for any advice.
The reason you need ALLOW FILTERING is that you have the clustering columns (tuuid, timestamp) in the wrong order. In this case the data is stored first by tuuid and then by timestamp. But you're actually selecting data by timestamp and then ordering by tuuid, so Cassandra can't use the clustering order that you have specified. The order in which you define the primary key columns matters.
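A toy model of why the clustering order matters: if a partition is kept sorted by timestamp first, a time range is one contiguous slice that a binary search can find; if it is sorted by tuuid first, matching rows can sit anywhere. The sketch below is an illustration only, not how Cassandra is implemented; range_slice is a hypothetical helper, small integers stand in for timestamps, and the range is inclusive for simplicity.

```python
from bisect import bisect_left, bisect_right


def range_slice(sorted_keys, low, high):
    """Return keys whose first component lies in [low, high] via binary search.

    Only valid when sorted_keys is sorted by that first component.
    """
    firsts = [key[0] for key in sorted_keys]
    return sorted_keys[bisect_left(firsts, low):bisect_right(firsts, high)]


# Hypothetical rows as (timestamp, tuuid) pairs.
rows = [(10, "u2"), (20, "u1"), (30, "u3"), (40, "u4")]

by_timestamp = sorted(rows)                             # order (timestamp, tuuid)
by_tuuid = sorted(rows, key=lambda r: (r[1], r[0]))     # order (tuuid, timestamp)

# With timestamp first, the matching rows form one contiguous slice.
contiguous = range_slice(by_timestamp, 20, 30)
# With tuuid first, matches are scattered, so every row has to be inspected.
scattered_positions = [i for i, r in enumerate(by_tuuid) if 20 <= r[0] <= 30]
```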

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) Cassandra nosql dbms to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make - is how many distinct customer accounts are active within the last month say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kind of queries that you are trying to perform. The reason that Cassandra is trying to keep you from performing queries on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response time. Unfortunately Cassandra isn't able to determine automatically what the best way to partition data is. The end user must determine a schema which fits into the C* data model and allows the queries they want at a high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT date FROM active_accounts where customer_name = ? and account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job since it requires hitting multiple nodes and filtering over all the data in the table.
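The placement rule can be pictured with a small sketch. This is an illustration only: Cassandra's real partitioner and token ring work differently, MD5 is used here purely as a convenient deterministic hash, and NUM_NODES and node_for are hypothetical names.

```python
import hashlib

NUM_NODES = 3  # hypothetical cluster size


def node_for(customer_name, account_name):
    """Map a composite partition key to a node index via a hash of both parts."""
    key = f"{customer_name}:{account_name}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


# Both parts of the key are needed to find the node; the date plays no role.
node = node_for("Bob", "1")
```

Because the date never enters the hash, a query that only knows a date range has no way to narrow down which nodes to ask.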
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partitioning key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition making it very fast for doing range slices between particular dates. Unfortunately you would have to add code on the application side to pull down the multiple months that a query might be spanning. This schema takes a lot of (dev) work so if these queries are very infrequent you should use a distributed analytics platform instead.
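The application-side code for pulling down multiple months could start from something like the sketch below, which lists the start_month value of every partition a date range touches. month_buckets is a hypothetical helper; it assumes partitions are keyed by the first day of each month, as in the schema above.

```python
from datetime import date


def month_buckets(start, end):
    """Return the first day of every month touched by the range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(date(year, month, 1))
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


buckets = month_buckets(date(2014, 11, 12), date(2015, 1, 12))
```

The application would then issue one range query per bucket, restricting start_month to the bucket and date to the requested range, and combine the results.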
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for use by those who run analytics jobs or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily require contacting every node in the cluster. This will become less and less performant as the cluster grows and is definitely not something you want for a frequent query.

allow filtering, data modeling in cql

I'm currently using and researching data modeling practices in Cassandra. So far, I understand that you need to base the data model on the queries you execute. However, multiple select requirements make data modeling harder, or even impossible to handle with 1 table. So, when you can't handle these requirements with 1 table, you need to insert into 2-3 tables. In other words, you need to make multiple inserts for 1 operation.
Currently, I'm dealing with a data model for a campaign structure. I have a campaign table in Cassandra with the following CQL:
CREATE TABLE campaign_users
(
created_at timeuuid,
campaign_id int,
uid bigint,
updated_at timestamp,
PRIMARY KEY (campaign_id, uid),
INDEX(campaign_id, created_at)
);
In this model, I need to be able to make incremental exports given only a timestamp. Cassandra has an ALLOW FILTERING mode that enables select queries on secondary indexes. So, my CQL statement for the incremental export is the following:
select campaign_id, uid
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;
However, if ALLOW FILTERING is used, there is a warning saying that the statement has unpredictable performance. So, is it good practice to rely on ALLOW FILTERING? What other alternatives are there?
The ALLOW FILTERING warning is because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. You could be scanning through all your data to return zero rows, in the worst case. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the data read through scales linearly with the amount of data returned.
This is OK if you're returning most of the data, so the data skipped over doesn't cost very much. But if you were skipping over most of your data a lot of work will be wasted.
The alternative is to include time in the first component of your primary key, in buckets. E.g. you could have day buckets and duplicate your queries for each day that contains data you need. This method guarantees that most of the data Cassandra reads over is data that you want. The problem is that all data for the bucket (e.g. day) needs to fit in one partition. You can fix this by sharding the partition somehow e.g. include some aspect of the uid within it.
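A minimal sketch of the day-bucket idea with sharding, assuming the partition key becomes (day, shard) and the shard is derived from the uid. NUM_SHARDS, partition_key, and partitions_to_read are hypothetical names chosen for illustration.

```python
from datetime import date, timedelta

NUM_SHARDS = 4  # hypothetical number of shards per day bucket


def partition_key(day, uid):
    """Return the (day, shard) partition a row would be written to."""
    return (day, uid % NUM_SHARDS)


def partitions_to_read(start, end):
    """List every (day, shard) partition covering start to end inclusive."""
    days = (end - start).days + 1
    return [
        (start + timedelta(days=offset), shard)
        for offset in range(days)
        for shard in range(NUM_SHARDS)
    ]


key = partition_key(date(2013, 8, 14), 1234567)
keys = partitions_to_read(date(2013, 8, 14), date(2013, 8, 15))
```

An incremental export would then run one query per listed partition. More shards keep each partition smaller, at the cost of more queries per day.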
