Cassandra use aggregate function and then order by that aggregate - cassandra

I have a cassandra database with a table that has the following columns:
itemid
userid
rating
itemid and userid are the primary key. My query looks like this:
SELECT itemid, avg(rating) as avgRating from mytable GROUP BY itemid order by avgRating asc;
I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
How can I fix this?
I need to order by the average ratings after so I can get the top 10 movies based on their average rating.

Cassandra can only order results by clustering column(s). It cannot order results by an aggregate function.
There are a couple of options you could look at in order to accomplish this.
Make the query and then re-order the results in your application.
This option may work if you only expect a limited number of rows to be returned from each query.
Note that it is recommended that you only use aggregate functions (like avg()) when you know that it will only apply to a limited number of rows. Ideally you should only use them when operating on a single partition (use a WHERE clause to limit to a single partition). If you don't have any limit you may see very slow queries, or query timeouts if Cassandra needs to read a large number of rows in order to calculate the aggregate.
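As a rough illustration of the first option, here is a minimal Python sketch of re-ordering in the application. It assumes the rows have already been fetched as (itemid, userid, rating) tuples; the function name `top_items_by_average` and the sample data are made up for this example and are not part of any driver API.

```python
from collections import defaultdict
import heapq


def top_items_by_average(rows, n=10):
    """Return the n (itemid, average) pairs with the highest average rating."""
    totals = defaultdict(lambda: [0.0, 0])  # itemid -> [sum of ratings, count]
    for itemid, _userid, rating in rows:
        acc = totals[itemid]
        acc[0] += rating
        acc[1] += 1
    averages = ((itemid, s / c) for itemid, (s, c) in totals.items())
    # nlargest avoids sorting every item when only the top n are needed
    return heapq.nlargest(n, averages, key=lambda pair: pair[1])


# Hypothetical sample rows standing in for a query result
sample_rows = [
    ("m1", "u1", 4), ("m1", "u2", 2),
    ("m2", "u1", 5),
    ("m3", "u1", 1), ("m3", "u2", 2),
]
print(top_items_by_average(sample_rows, n=2))
```

This only makes sense when the number of rows read is small enough to hold in application memory, which is the same caveat as above.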
Store a pre-calculated average in the table, or cache it in your application.
This is the best option if you need calculated averages over a larger data set.
If you make average_rating a clustering column Cassandra will store the averages for each partition in sorted order. This is very efficient from Cassandra's perspective.
The downside is that you'll need to calculate the average in your application each time you insert into or update a row, because it will be a primary key column in your Cassandra table.
One thing you could look into is using a Cassandra trigger to calculate the average for you. This may make life easier for you if you have multiple applications writing to this table, however I am not actually sure if it is possible to modify a primary key column via a custom trigger. I would recommend doing some research & testing if you decide to look at this option. You can read about triggers here.
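For the pre-calculated average option, one way the application-side bookkeeping might look is sketched below in Python. It assumes the application keeps a running sum and count per item, so the average can be recomputed on every write without re-reading all ratings. The class and method names are hypothetical. Each method returns the old and new average, since a primary key column cannot be updated in place and the application would have to delete the row under the old average and insert one under the new average.

```python
from dataclasses import dataclass


@dataclass
class RatingSummary:
    """Running totals for one item (hypothetical helper, not a Cassandra API)."""
    rating_sum: float = 0.0
    rating_count: int = 0

    @property
    def average(self):
        # None until the first rating arrives
        return self.rating_sum / self.rating_count if self.rating_count else None

    def add_rating(self, rating):
        """Record a new rating; return (old_average, new_average)."""
        old = self.average
        self.rating_sum += rating
        self.rating_count += 1
        return old, self.average

    def change_rating(self, old_rating, new_rating):
        """A user changed an existing rating; return (old_average, new_average)."""
        old = self.average
        self.rating_sum += new_rating - old_rating
        return old, self.average
```

The sketch ignores concurrent writers; with several applications writing at once, the sum and count would need to be kept somewhere that handles concurrent updates safely.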

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried creating a secondary index, but without success.
This is a query I've created
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data, you need to read all of it and find the highest value. This requires scanning data on multiple nodes and will be very slow.
A materialized view also won't help much, as ordering only exists inside an individual partition. You would need to put all your data into a single partition, which could be huge, and the data would be imbalanced.
I can only think of following workaround:
Have an additional table that will have all columns of the original table, but with a fake partition key and no clustering columns
You do inserts into that table in parallel with the normal inserts, but use a fixed value for the fake partition key and explicitly set the write timestamp of the record equal to start_time (don't forget to multiply by 1000, as the write timestamp uses microseconds). In this case the stored row is guaranteed to be the one with the highest timestamp, as Cassandra won't overwrite it with other data that has a lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
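To make the timestamp trick more concrete, here is a small Python simulation of the idea. It is only a model of last-write-wins behaviour under the assumptions stated in the comments, not Cassandra code, and the names `write_timestamp_micros` and `LatestEventRow` are invented for this sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def write_timestamp_micros(start_time):
    """Milliseconds since the epoch, multiplied by 1000 to get microseconds."""
    millis = (start_time - EPOCH) // timedelta(milliseconds=1)
    return millis * 1000


class LatestEventRow:
    """Models the single row under the fixed (fake) partition key."""

    def __init__(self):
        self.event = None
        self.write_ts = None

    def write(self, event, start_time):
        ts = write_timestamp_micros(start_time)
        # Last write wins: data with a lower write timestamp does not replace
        # what is stored. Tie-breaking on equal timestamps is ignored here.
        if self.write_ts is None or ts > self.write_ts:
            self.event, self.write_ts = event, ts
```

Writing events in any order leaves the one with the highest start_time in the row, which is what reading that single partition would return.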
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions, each with thousands of rows, spread across hundreds of nodes. A full table scan in a large cluster would take a very long time if it were allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Optimization of a query which uses arithmetic operations in WHERE clause

I need to retrieve records where the expiration date is today. The expiration date is calculated dynamically using two other fields (startDate and durationDays):
SELECT * FROM subscription WHERE startDate + durationDays < currentDate()
Does it make sense to add two indexes for these two columns? Or should I consider adding a new column expirationDate and create an index for it only?
SELECT * FROM subscription WHERE startDate + durationDays < currentDate()
I'm wondering how Cassandra handles a filter like the one in my example. Does it do a full scan?
First of all, your question is predicated on CQL's ability to perform (date) arithmetic. It cannot.
> SELECT * FROM subscription WHERE startDate + durationDays < currentDate();
SyntaxException: line 1:43 no viable alternative at input '+' (SELECT * FROM subscription WHERE [startDate] +...)
Secondly the currentDate() function does not exist in Cassandra 3.11.4.
> SELECT currentDate() FROM system.local;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Unknown function 'currentdate'"
That does work in Cassandra 4.0, but as it has not been released yet, you really shouldn't be using it.
So let's assume that you've created your secondary indexes on startDate and durationDays and you're just querying on those, without any arithmetic.
Does it execute a full table scan?
ABSOLUTELY.
The reason is that a query solely on secondary index columns does not have a partition key. Therefore, it has to search for these values on all partitions on all nodes. In a large cluster, your query would likely time out.
Also, when it finds matching data, it has to keep querying, as those values are not unique; it's entirely possible that there are several results to be returned. Carlos is 100% correct in advising you to rebuild your table based on what you want to query.
Recommendations:
Try not to build a table with secondary indexes. Like ever.
If you have to build a table with secondary indexes, try to have a partition key in your WHERE clause to keep the query isolated to a single node.
Any filtering on dynamic (computed) values needs to be done on the application side.
In your case, it might make more sense to create a column called expirationDate, do your date arithmetic in your app, and then INSERT that value into your table.
You'll also want to follow the "time bucket" pattern for handling time series data (which is what this appears to be). Say that month works as a "bucket" (it may or may not for your use case). PRIMARY KEY ((month),expirationDate,id) would be a good key. This way, all the subscriptions for a particular month are stored together, clustered by expirationDate, with id on the end to act as a tie-breaker for uniqueness.
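A minimal Python sketch of the application-side part of these recommendations follows. It assumes startDate is a date and durationDays an integer, and it picks a "YYYY-MM" string as the month bucket format; that format and the function names are assumptions for this example only.

```python
from datetime import date, timedelta


def expiration_date(start_date, duration_days):
    """Do the date arithmetic in the application, since CQL cannot do it."""
    return start_date + timedelta(days=duration_days)


def month_bucket(d):
    """Bucket value for the partition key; the 'YYYY-MM' format is an assumption."""
    return d.strftime("%Y-%m")


def build_row(subscription_id, start_date, duration_days):
    """Values the application would INSERT for PRIMARY KEY ((month), expirationDate, id)."""
    exp = expiration_date(start_date, duration_days)
    return {"month": month_bucket(exp), "expirationDate": exp, "id": subscription_id}


print(build_row("sub-1", date(2019, 1, 20), 30))
```

With rows keyed this way, finding what expires today means querying the current month's partition with a restriction on expirationDate, rather than filtering on a computed value.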
One of the main differences between Cassandra and relational databases is that the definition of the tables depends on the queries that will be used. The conditions by which the data will be retrieved (the WHERE clause) should be included in the primary key, as this will perform better than an index on the table.
There are multiple resources regarding the read path and the quirks of primary keys vs. indexes; this talk from the Cassandra Summit may be useful.

What is best possible way out to sort records by aggregate value in Cassandra?

I have following data model for cars production data.
CREATE TABLE IF NOT EXISTS mytable (
date date,
color varchar,
modelid varchar,
PRIMARY KEY ((color), date, modelid)
)WITH CLUSTERING ORDER BY (date desc);
I want to sort it by total column in cassandra, which I was expecting to be generated as follows:
SELECT color, count(*) AS total
FROM cars
WHERE date<='2017-12-07' AND date >'2017-11-30'
GROUP BY color
ORDER BY total
ALLOW FILTERING;
But as I have come to know, Cassandra only supports sorting by clustering columns, and I can't keep the aggregate value in the table a priori. What is the best possible way to do this sorting?
First thing: the query that you're using is very inefficient. By using ALLOW FILTERING you're scanning data on all servers. This may work for small datasets, but won't work for big ones. You need to model your tables around the queries that you're planning to execute.
Coming to your question - you need to use either Spark to do it, or do a sorting inside your application.
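For the "sorting inside your application" route, a minimal Python sketch could look like this. It assumes the rows have already been read as (color, date, modelid) tuples; the function name `totals_by_color` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def totals_by_color(rows, after, up_to):
    """Count rows per color with after < date <= up_to, sorted by total ascending."""
    counts = Counter(
        color for color, d, _modelid in rows if after < d <= up_to
    )
    return sorted(counts.items(), key=lambda item: item[1])


# Hypothetical sample rows standing in for the data read from Cassandra
sample = [
    ("red", date(2017, 12, 1), "a"),
    ("red", date(2017, 12, 2), "b"),
    ("red", date(2017, 11, 30), "c"),   # excluded: not after 2017-11-30
    ("blue", date(2017, 12, 3), "d"),
    ("green", date(2017, 12, 4), "e"),
    ("green", date(2017, 12, 5), "f"),
    ("green", date(2017, 12, 7), "g"),
]
print(totals_by_color(sample, date(2017, 11, 30), date(2017, 12, 7)))
```

Since color is the partition key in the table above, the application could read one partition per color with the date range restriction instead of using ALLOW FILTERING, then combine the counts like this.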
You shouldn't think about Cassandra as a SQL-like database - to use it you need to follow some rules about data modelling, querying, etc. I would recommend taking the DS220 course on DataStax Academy to learn about modelling for Cassandra.

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL databases and have just started using Apache Cassandra. I created a simple table "emp" with a primary key on the "empno" column. This is a simple table like the one we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued the query Select * from emp order by empno, but I was surprised that CQL did not allow ORDER BY on the empno column (which is the PK). Also, when I used a WHERE condition, it did not allow any inequality operations on the empno column (it said only EQ or IN conditions are allowed). It also did not allow WHERE and ORDER BY on any other column, as they were not used in the PK and did not have an index.
Can someone please tell me what I should do if I want to keep empno unique in the table and want query results in sorted order of empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
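To picture why ordering only exists inside a partition, here is a toy Python model of the emp_by_dept table. It is a simplification for illustration (one in-memory list per partition, kept sorted by the clustering key), not how Cassandra is implemented, and the class name `EmpByDept` is made up.

```python
import bisect
from collections import defaultdict


class EmpByDept:
    """Toy model: dept is the partition key, empno the clustering key (ASC)."""

    def __init__(self):
        # dept -> list of (empno, name), always kept sorted by empno
        self._partitions = defaultdict(list)

    def insert(self, dept, empno, name):
        rows = self._partitions[dept]
        keys = [r[0] for r in rows]
        i = bisect.bisect_left(keys, empno)
        if i < len(rows) and rows[i][0] == empno:
            rows[i] = (empno, name)       # same primary key: overwrite (upsert)
        else:
            rows.insert(i, (empno, name))  # keep the partition sorted on write

    def select(self, dept, descending=False):
        """Like SELECT ... WHERE dept = ?; ORDER BY can only reverse the stored order."""
        rows = self._partitions.get(dept, [])
        return list(reversed(rows)) if descending else list(rows)
```

Each partition is sorted on its own, so there is no stored order across departments; that is why a query has to name the partition before ordering means anything.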
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no way of getting a result set which is only ordered by a column that has been defined as Unique.
(2) When we define a PK (partition-key+clustering-key), then the results will always be ordered by the clustering columns within any fixed partition key (we must restrict to one partition-key value). That means there is no need for an ORDER BY clause, since it cannot ever change the order of rows (the order in which rows are actually stored), i.e. ORDER BY is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept).
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on the "empno" column of the "emp" table, but it still does not allow ORDER BY empno. So what are indexes for? Are they only for searching records for a specific value of the index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting different result sets with different conditions and different sorting order.
Correct.
Hence for each new requirement, we need to create a new table. It means if we have a billion rows in a table (say Sales table), and we need sum of sales (1) Product-wise, (2) Region-wise, then we will duplicate all those billion rows in 2 tables with one in clustering order of Product, the other in clustering order of Region, and even if we need to sum sales per Salesman_id, then we build a 3rd table, again putting all those billion rows? Is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But I found even that is not correct. Cassandra only allows ORDER BY in the same direction as we define in the "CLUSTERING ORDER BY" clause of CREATE TABLE. If in that clause we define ASC, it allows only ORDER BY ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little oddly if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same as the CLUSTERING ORDER BY definition, or both reversed from it. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
Adding a redux answer as the accepted one is quite long.
Order by is currently only supported on the clustered columns of the PRIMARY KEY
and when the partition key is restricted by an Equality or an IN operator in where clause.
That is if you have your primary key defined like this :
PRIMARY KEY ((a,b),c,d)
Then you will be able to use ORDER BY if and only if your query has:
a WHERE clause in which every partition key column is restricted by either an equality operator (=) or an IN operator, such as:
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two column orderings are the only valid ones.
Also this query would not work :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because ORDER BY currently only supports columns in the order they were declared in the PRIMARY KEY. In the primary key definition, c is declared before d, and the query violates that order by placing d first.
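The ordering rules described in these answers can be summed up in a small Python check. This is only a model of the rules as stated in this thread (columns must follow the declared clustering order with no skipping, and directions must either all match or all mirror the CLUSTERING ORDER BY definition); it is not Cassandra's actual validation logic, and `order_by_is_valid` is a made-up name.

```python
def order_by_is_valid(clustering, order_by):
    """Check a requested ORDER BY against the declared clustering order.

    clustering: declared [(column, direction)] from CLUSTERING ORDER BY
    order_by:   requested [(column, direction)] from the query
    """
    if len(order_by) > len(clustering):
        return False
    pairs = list(zip(order_by, clustering))
    # Columns must be a prefix of the clustering columns, in declared order
    if any(req_col != decl_col for (req_col, _), (decl_col, _) in pairs):
        return False
    matches = [req_dir == decl_dir for (_, req_dir), (_, decl_dir) in pairs]
    # Either every direction matches the definition, or every one is reversed
    return all(matches) or not any(matches)


declared = [("c", "ASC"), ("d", "ASC")]
print(order_by_is_valid(declared, [("c", "ASC"), ("d", "ASC")]))
print(order_by_is_valid(declared, [("d", "ASC"), ("c", "ASC")]))
```

The check covers only the ORDER BY clause itself; the requirement that the partition key be restricted by = or IN is a separate condition on the WHERE clause.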

How do secondary indexes work in Cassandra?

Suppose I have a column family:
CREATE TABLE update_audit (
scopeid bigint,
formid bigint,
time timestamp,
record_link_id bigint,
ipaddress text,
user_zuid bigint,
value text,
PRIMARY KEY ((scopeid, formid), time)
) WITH CLUSTERING ORDER BY (time DESC);
With two secondary indexes, where record_link_id is a high-cardinality column:
CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);
CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);
According to my knowledge Cassandra will create two hidden column families like so:
CREATE TABLE update_audit_id_idx(
record_link_id bigint,
scopeid bigint,
formid bigint,
time timestamp,
PRIMARY KEY ((record_link_id), scopeid, formid, time)
);
CREATE TABLE update_audit_user_zuid_idx(
user_zuid bigint,
scopeid bigint,
formid bigint,
time timestamp,
PRIMARY KEY ((user_zuid), scopeid, formid, time)
);
Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.
Consider the following query:
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How will this query execute 'under the hood' in Cassandra?
How will a high-cardinality column index (record_link_id) affect its performance?
Will Cassandra touch all nodes for the above query? Why?
Which criteria will be executed first, base table partition_key or secondary index partition_key? How will Cassandra intersect these two results?
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How the above query will work internally in cassandra?
Essentially, all data for partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. It will look for the record_link_id entry for 9897, and attempt to match-up entries that match the rows returned where scopeid=35 and formid=78005. The intersection of the rows for the partition keys and the index keys will be returned.
How high-cardinality column (record_link_id)index will affect the query performance for the above query?
High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected, because Cassandra is designed to perform sequential reads for query results. An index query essentially forces Cassandra to perform random reads. As cardinality of your indexed value increases, so does the time it takes to find the queried value.
Does cassandra will touch all nodes for the above query? WHY?
No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes are likewise stored locally, and only contain entries that are valid for the local node.
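A toy Python simulation may help show why the partition key decides how many nodes are touched. It assumes one replica per partition and uses Python's built-in hash as a stand-in for the partitioner; the `Node` and `Cluster` classes are invented for this sketch and do not reflect Cassandra's internals.

```python
from collections import defaultdict


class Node:
    """One node: stores its own partitions plus an index over only those rows."""

    def __init__(self):
        self.partitions = defaultdict(list)   # (scopeid, formid) -> rows
        self.local_index = defaultdict(list)  # record_link_id -> [(partition key, row)]

    def store(self, pk, row):
        self.partitions[pk].append(row)
        self.local_index[row["record_link_id"]].append((pk, row))


class Cluster:
    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def _owner(self, pk):
        # Stand-in for the partitioner: maps a partition key to a single node
        return self.nodes[hash(pk) % len(self.nodes)]

    def insert(self, scopeid, formid, row):
        pk = (scopeid, formid)
        self._owner(pk).store(pk, row)

    def query(self, record_link_id, pk=None):
        """Return (matching rows, number of nodes contacted)."""
        # With a partition key, only the owning node is asked; without one,
        # every node has to consult its local index.
        targets = [self._owner(pk)] if pk is not None else self.nodes
        matches = [
            row
            for node in targets
            for row_pk, row in node.local_index.get(record_link_id, [])
            if pk is None or row_pk == pk
        ]
        return matches, len(targets)
```

Calling `query(9897, pk=(35, 78005))` contacts one node in this model, while `query(9897)` contacts all of them, which mirrors the difference between the indexed query with and without the partition key.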
creating index over high-cardinality columns will be the fastest and best data model
The problem here is that this approach does not scale, and will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes (The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:
If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.
...
In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.
Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.
Edit
How does having index on low cardinality index when there are millions of users scale even when we provide the primary key
It will depend on how wide your rows are. The tricky thing about extremely low cardinality indexes, is that the % of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter-out about half of those rows, which won't perform well.
Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that most of those users don't all live in the same country or state).
Edit 20180913
For your answer to 1st question "How the above query will work internally in cassandra?", do you know what's the behavior when query with pagination?
Consider the following diagram, taken from the Java Driver documentation (v3.6):
Basically, paging will cause the query to break itself up and return to the cluster for the next iteration of results. It'd be less likely to timeout, but performance will trend downward, proportional to the size of the total result set and the number of nodes in the cluster.
TL;DR; The more requested results spread over more nodes, the longer it will take.
Query with only secondary index is also possible in Cassandra 2.x
select * from update_audit where record_link_id=9897;
But this has a large impact on fetching data, because it reads all partitions across the distributed environment. The data fetched by this query is also not guaranteed to be consistent, so it cannot be relied upon.
Suggestion:
Using a secondary index is considered a dirty query from a NoSQL data-modeling point of view.
To avoid a secondary index, we could create a new table and copy the data to it. Since this is a query the application needs, the table should be derived from the query.
