order by clause not working in Cassandra query - cassandra

I have created a table layer using the following code:
CREATE TABLE layer (
layer_name text,
layer_position text,
PRIMARY KEY (layer_name, layer_position)
) WITH CLUSTERING ORDER BY (layer_position DESC)
I use the query below to fetch data from the layer table in descending order of layer_position:
$select = new Cassandra\SimpleStatement(<<<EOD
select * from layer ORDER BY layer_position DESC
EOD
);
$result = $session->execute($select);
But this query is not working. Please can anyone help me?

Simply put, Cassandra only enforces sort order within a partition.
PRIMARY KEY (layer_name, layer_position)
) WITH CLUSTERING ORDER BY (layer_position DESC)
In this case, layer_name is your partition key. If you specify layer_name in your WHERE clause, your results for that value of layer_name will be ordered by layer_position.
SELECT * FROM layer WHERE layer_name = 'layer1';
You don't need to specify ORDER BY. All ORDER BY really can do at the query level is apply a different sort direction (ascending vs. descending).
Cassandra works this way, because it is designed to read data in whatever order it is sorted on disk. Your partition keys are sorted by hashed token value, which is why results from an unbound WHERE clause appear to be ordered randomly.
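To make the two levels of ordering concrete, here is a minimal Python sketch (not Cassandra code). It assumes a toy token function built on MD5 as a stand-in for the real partitioner, and the sample rows are made up. The only point is that partitions come back in token order, while rows inside one partition come back in clustering order.

```python
import hashlib
from collections import defaultdict

def token(partition_key: str) -> int:
    """Toy stand-in for the partitioner: hash the key to a signed 64-bit number."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def full_scan(rows):
    """Mimic an unbound SELECT: partitions in token order, and rows inside
    each partition in clustering order (layer_position DESC)."""
    partitions = defaultdict(list)
    for layer_name, layer_position in rows:
        partitions[layer_name].append(layer_position)
    result = []
    for layer_name in sorted(partitions, key=token):
        for layer_position in sorted(partitions[layer_name], reverse=True):
            result.append((layer_name, layer_position))
    return result

# Hypothetical sample data: (layer_name, layer_position)
rows = [("layer1", "a"), ("layer2", "c"), ("layer1", "c"),
        ("layer3", "b"), ("layer2", "a"), ("layer1", "b")]

scan = full_scan(rows)
layer1_positions = [pos for name, pos in scan if name == "layer1"]
print(scan)              # partition order depends on the hash, not on the names
print(layer1_positions)  # ['c', 'b', 'a']
```

Restricting to one layer_name is what makes the result order predictable; the order of the partitions themselves is an artifact of the hash.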
EDIT
I have to fetch data using state_id column and it should be order by layer_position.
Cassandra tables are optimized for a specific query. While this results in high performance, the drawback is that query flexibility is limited. The way to solve this is to duplicate your data into an additional table designed to serve that particular query.
CREATE TABLE layer_by_state_id (
layer_name text,
layer_position text,
state_id text,
PRIMARY KEY (state_id, layer_position, layer_name)
) WITH CLUSTERING ORDER BY (layer_position DESC, layer_name ASC);
This table will allow queries like this to work:
SELECT * FROM layer_by_state_id WHERE state_id='thx1138';
And the results will be sorted by layer_position, within the requested state_id.
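Here is a rough Python simulation of what that query table does with its rows, assuming some made-up sample data. One thing it also shows: because layer_position is declared as text, it is compared as a string, so '10' sorts below '9' and '2'. If the positions are really numbers, an int column may be what you want (that is my assumption about your data, not something the question states).

```python
from collections import defaultdict

# Hypothetical sample rows: (state_id, layer_position, layer_name)
rows = [
    ("thx1138", "2", "roads"),
    ("thx1138", "10", "labels"),
    ("thx1138", "9", "water"),
    ("thx1138", "9", "parks"),
    ("ab1234", "1", "roads"),
]

def build_query_table(rows):
    """Group rows by partition key (state_id) and keep each partition sorted
    by (layer_position DESC, layer_name ASC), like the CLUSTERING ORDER BY."""
    partitions = defaultdict(list)
    for state_id, layer_position, layer_name in rows:
        partitions[state_id].append((layer_position, layer_name))
    for clustered in partitions.values():
        # Two stable sorts: tie-breaker first, then the leading column.
        clustered.sort(key=lambda r: r[1])                # layer_name ASC
        clustered.sort(key=lambda r: r[0], reverse=True)  # layer_position DESC
    return partitions

table = build_query_table(rows)
result = table["thx1138"]
print(result)
# [('9', 'parks'), ('9', 'water'), ('2', 'roads'), ('10', 'labels')]
```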
Now I am making a couple of assumptions that you will want to investigate:
I am assuming that state_id is a good partitioning key. Meaning that it has high-enough cardinality to offer good distribution in the cluster, but low-enough cardinality that it returns enough CQL rows to make sorting worthwhile.
I am assuming that the combination of state_id and layer_position is not enough to uniquely identify each row. Therefore I am ensuring uniqueness by adding layer_name as an additional clustering key. You may or may not need this, but I'm guessing that you will.
I am assuming that using state_id as a partitioning key will not exhibit unbound growth so as to approach Cassandra's limit of 2 billion cells per partition. If that is the case, you may need to add an additional partition "bucket."

You can't use order by directly in Cassandra.
You can apply order by on clustering columns only when your partition key will be restricted by EQ or IN.

You can use an ORDER BY clause in Cassandra by creating a materialized view.

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried to create a secondary index, but with no success.
This is a query I've created
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all of it and find the highest value. This requires scanning data on multiple nodes and will take a very long time.
A materialized view won't help much either, as ordering only exists inside an individual partition, so you would need to put all your data into a single partition, which could be huge and would leave the data imbalanced.
I can only think of the following workaround:
Have an additional table that has all the columns of the original table, but with a fake partition key and no clustering columns.
You do inserts into that table in parallel with the normal inserts, but use a fixed value for that fake partition key, and explicitly set the write timestamp of each record equal to start_time (don't forget to multiply by 1000, as the write timestamp uses microseconds). In this case the stored row is guaranteed to be the one with the highest start_time, as Cassandra won't overwrite it with data that has a lower timestamp.
But this doesn't solve a problem with data skew, and all traffic will be handled by fixed number of nodes equal to RF.
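The last-write-wins behaviour this workaround relies on can be sketched in a few lines of Python. This is a simulation, not driver code: the class name and the start_time_ms field are hypothetical, and it assumes every insert writes all columns of the row and that start_time is available as epoch milliseconds.

```python
class LatestEventPartition:
    """Simulates the single row stored under the fixed (fake) partition key.
    A write replaces the stored row only if its write timestamp is higher."""

    def __init__(self):
        self.write_ts_micros = None
        self.event = None

    def insert(self, event):
        # start_time is in milliseconds; write timestamps are in microseconds.
        ts_micros = event["start_time_ms"] * 1000
        if self.write_ts_micros is None or ts_micros > self.write_ts_micros:
            self.write_ts_micros = ts_micros
            self.event = event

latest = LatestEventPartition()
# Events arrive out of order; the one with the highest start_time must survive.
for event in [
    {"event_id": "e1", "start_time_ms": 1_600_000_000_000},
    {"event_id": "e3", "start_time_ms": 1_600_000_900_000},
    {"event_id": "e2", "start_time_ms": 1_600_000_300_000},
]:
    latest.insert(event)

print(latest.event["event_id"])  # e3
```

Reading the event with the highest start_time is then a single-partition lookup on the fixed key.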
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Why does querying based on the first clustering key require an ALLOW FILTERING?

Say I have this Cassandra table:
CREATE TABLE orders (
customerId int,
datetime date,
amount int,
PRIMARY KEY (customerId, datetime)
);
Then why would the following query require an ALLOW FILTERING:
SELECT * FROM orders WHERE datetime >= '2020-01-01'
Cassandra could just go to all the individual partitions (i.e. customers) and filter on the clustering key datetime. Since datetime is sorted, there is no need to retrieve all the rows in orders and filter out the ones that match my WHERE clause (as far as I understand it).
I hope someone can enlighten me.
Thanks
This happens because, in normal operation, Cassandra needs the partition key: it is used to find which machine(s) store the data. If you don't supply the partition key, as in your example, Cassandra needs to scan all the data to find the rows that match your query. That is what requires ALLOW FILTERING.
P.S. Data is sorted only inside the individual partitions, not globally.
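A small Python model of the orders table may help show why. The data is hypothetical, and the dict stands in for partitions spread over many nodes. Inside one partition a binary search on the sorted clustering column is enough, but without the partition key every partition still has to be visited.

```python
from bisect import bisect_left
from datetime import date

# customerId -> rows as (datetime, amount), sorted by the clustering column.
orders = {
    1: [(date(2019, 12, 30), 10), (date(2020, 1, 5), 20)],
    2: [(date(2019, 6, 1), 5)],
    3: [(date(2020, 2, 1), 7), (date(2020, 3, 1), 8)],
}

def query_with_partition_key(customer_id, since):
    """One partition is located directly, then sliced by binary search."""
    rows = orders.get(customer_id, [])
    start = bisect_left(rows, (since,))
    return rows[start:], 1  # (matching rows, partitions visited)

def query_without_partition_key(since):
    """No partition key: every partition must be visited, even empty-handed."""
    matches, visited = [], 0
    for customer_id, rows in orders.items():
        visited += 1
        start = bisect_left(rows, (since,))
        matches.extend((customer_id, *row) for row in rows[start:])
    return matches, visited

print(query_with_partition_key(1, date(2020, 1, 1)))
print(query_without_partition_key(date(2020, 1, 1)))
```

With three customers the difference is trivial; with millions of partitions on many nodes, the second function is the full scan that ALLOW FILTERING makes you opt into.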

Cassandra CLUSTERING ORDER with updates [performance]

With Cassandra it is possible to specify the cluster ordering on a table with a particular column.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
Note: In this example, there is one message per user_id (intended)
Given this table my understanding is that the query's performance will be better in cases where recent data is queried.
However, if one were to make updates to the "modified" column, does it add extra overhead on the server to "re-order", and is that overhead significant compared with the query performance?
In other words given this table would it perform better if the "CLUSTERING ORDER BY (modified DESC)" was dropped?
UPDATE: Updated the invalid CQL by adding modified to primary key, however, the original questions still stand.
In order to make modified a clustering column, it needs to be defined in the primary key.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
This way, your data will be sorted primarily by the hashed value of the user_id, and within each user_id by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)".
Moving my comment to an answer, in reply to the updated question:
if one were to make updates to the "modified" column does it add
extra overhead on the server to "re-order" and is that overhead vs
query performance significant?
If modified is defined as part of the clustering key, you won't be able to update that record, but you will be able to add as many records as needed, each time with a different modified date.
Cassandra is an append-only database engine: any update to a record adds a new record with a different write timestamp, and a select considers the record with the latest timestamp. This means that there is no "re-order" operation.
Whether to define a clustering order should be decided based on how the information will be queried. If you are only going to use the latest records of that user_id, it makes sense to keep the clustering order as you defined it.
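As a rough illustration of that append-only behaviour, here is a Python sketch under simplifying assumptions: one in-memory log, integer write timestamps, ISO date strings for modified, and no compaction or tombstones. The function names are hypothetical.

```python
write_log = []  # append-only: ((user_id, modified), write_timestamp, message)

def write(user_id, modified, message, write_timestamp):
    """Inserts and updates are the same operation: append to the log."""
    write_log.append(((user_id, modified), write_timestamp, message))

def read_partition(user_id):
    """Merge on read: keep the newest write per primary key, then return
    rows in clustering order (modified DESC)."""
    latest = {}
    for primary_key, ts, message in write_log:
        if primary_key[0] != user_id:
            continue
        if primary_key not in latest or ts > latest[primary_key][0]:
            latest[primary_key] = (ts, message)
    ordered = sorted(latest.items(), key=lambda kv: kv[0][1], reverse=True)
    return [(pk[1], msg) for pk, (ts, msg) in ordered]

write(1, "2020-01-01", "first", write_timestamp=1)
write(1, "2020-01-02", "second", write_timestamp=2)
write(1, "2020-01-01", "first edited", write_timestamp=3)  # same primary key

rows = read_partition(1)
print(rows)            # [('2020-01-02', 'second'), ('2020-01-01', 'first edited')]
print(len(write_log))  # 3
```

Writing the same (user_id, modified) pair again overwrites the message; writing a new modified value simply adds another row to the partition.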
In your data model, user_id is the row key/shard key/partition key, which is important for data locality, and the clustering column (modified) specifies the order in which the data is arranged inside the partition. The combination of these two keys makes up the primary key.
Even in the RDBMS world, updating a PK is avoided for the sake of data integrity.
However, in Cassandra there are no constraints or relations between column families/tables.
Assigning exactly the same values to the PK fields (user_id, modified) will update the existing record; otherwise it will add a new record.
Reference:
https://www.datastax.com/dev/blog/we-shall-have-order

Allow filtering function in Cassandra (which choice is correct?)

I am currently trying to model some time series data in Cassandra.
For example, I have a table bigint_table, which was created by the following query:
CREATE TABLE bigint_table (
name_id int,
tuuid timeuuid,
timestamp timestamp,
value text,
PRIMARY KEY ((name_id), tuuid, timestamp)
) WITH CLUSTERING ORDER BY (tuuid ASC, timestamp ASC);
The tuuid column was added because without it I had problems and lost some data while inserting into the DB. name_id represents the ID of the channel the data comes from. In one table there is a lot of data with the same ID, but the rows are unique by timestamp and tuuid (values can also be the same sometimes).
I consistently execute 2 different queries to get values and timestamps:
1.
select value from bigint_table where name_id=6 and timestamp>'2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30 8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
2.
select timestamp from bigint_table where name_id=6 and timestamp>'2017-11-01 8:26:47.970+0000' and timestamp<'2017-11-30 8:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
In this post the author says one needs to resist the urge to just add ALLOW FILTERING, and that one should think about the data, the model, and what one is trying to do.
I thought a lot about whether or not to use ALLOW FILTERING, and I figured that I have no choice in my case and need to use it. But the words in the post I mentioned above keep me in doubt. I would like to know your advice and what you think about my problem. Is there another way to model my tables so that the queries do not require ALLOW FILTERING? I would be very thankful for any advice.
The reason you need ALLOW FILTERING is that you have the clustering columns (tuuid, timestamp) in the wrong order. In this case the data is stored sorted first by tuuid and then by timestamp. But you are actually selecting data by timestamp and then ordering by tuuid, so Cassandra can't use the clustering order that you have specified. The order in which you define the primary key columns matters.
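A toy Python comparison of the two clustering orders, with short strings standing in for timeuuids and small integers for timestamps (all values hypothetical). Sorted by (tuuid, timestamp), a timestamp range touches every row in the partition; sorted by (timestamp, tuuid), which is my reading of the suggested fix, the same range is one contiguous slice.

```python
from bisect import bisect_left, bisect_right

# Rows of one partition (one name_id): (tuuid, timestamp)
rows = [("u1", 50), ("u2", 10), ("u3", 30), ("u4", 20), ("u5", 40)]

by_tuuid_first = sorted(rows)                            # current clustering order
by_timestamp_first = sorted((ts, u) for u, ts in rows)   # reordered clustering key

def range_with_filtering(lo, hi):
    """tuuid leads the sort, so every row has to be examined and filtered."""
    examined = len(by_tuuid_first)
    matches = [(u, ts) for u, ts in by_tuuid_first if lo < ts < hi]
    return matches, examined

def range_with_slice(lo, hi):
    """timestamp leads the sort, so the range is a contiguous slice."""
    timestamps = [ts for ts, _ in by_timestamp_first]
    start = bisect_right(timestamps, lo)  # first row with timestamp > lo
    end = bisect_left(timestamps, hi)     # first row with timestamp >= hi
    return by_timestamp_first[start:end], end - start

print(range_with_filtering(15, 45))  # ([('u3', 30), ('u4', 20), ('u5', 40)], 5)
print(range_with_slice(15, 45))      # ([(20, 'u4'), (30, 'u3'), (40, 'u5')], 3)
```

Note that the slice comes back ordered by timestamp first, so the ORDER BY in the queries would have to follow the new clustering order as well.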

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL databases and have just started using Apache Cassandra. I created a simple table "emp" with a primary key on the "empno" column. This is the same simple table we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued the query Select * from emp order by empno, but I was surprised that CQL did not allow ORDER BY on the empno column (which is the PK). Also, when I used a WHERE condition, it did not allow any inequality operations on the empno column (it said only EQ or IN conditions are allowed). It also did not allow WHERE and ORDER BY on any other column, as they were not part of the PK and did not have an index.
Can someone please tell me what I should do if I want to keep empno unique in the table and want query results sorted by empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no
way of getting a result set which is only order by a column that has
been defined as Unique.
(2) When we define a PK
(partition-key+clustering-key), then the results will always be order
by Clustering columns within any fixed partition key (we must restrict
to one partition-key value), that means there is no need of ORDER BY
clause, since it cannot ever change the order of rows (the order in
which rows are actually stored), i.e. Order By is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept).
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on "empno" column of "emp" table, it is still not
allowing ORDER BY empno. So, what Indexes are for? are they only for
searching records for specific value of index key?
You cannot order a result set by an indexed column. Secondary indexes are not the same as their relational counterparts; they are really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
It means if we have a billion rows in a table (say Sales table), and
we need sum of sales (1) Product-wise, (2) Region-wise, then we will
duplicate all those billion rows in 2 tables with one in clustering
order of Product, the other in clustering order of Region, and even
if we need to sum sales per Salesman_id, then we build a 3rd table,
again putting all those billion rows? Is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But I found even that is not correct. Cassandra only allows ORDER BY in the same direction as we define in the "CLUSTERING ORDER BY" clause of CREATE TABLE. If in that clause we define ASC, it allows only ORDER BY ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little odd if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same (as the CLUSTERING ORDER BY definition) or both different. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
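The "match or mirror" rule described above can be written down as a small Python check. This is a sketch of the rule as observed here, not of Cassandra's actual validation code, and the function name is made up. It takes the declared directions of the clustering columns and the directions requested for the leading columns in ORDER BY.

```python
def directions_allowed(declared, requested):
    """Return True if the requested ORDER BY directions either all match
    or all mirror the declared CLUSTERING ORDER BY directions.

    Both arguments are lists of 'ASC'/'DESC', in clustering-column order;
    `requested` covers only the leading columns named in ORDER BY."""
    n = len(requested)
    if n == 0 or n > len(declared):
        return False
    pairs = list(zip(declared[:n], requested))
    all_match = all(d == r for d, r in pairs)
    all_mirrored = all(d != r for d, r in pairs)
    return all_match or all_mirrored

declared = ["ASC", "DESC"]  # e.g. CLUSTERING ORDER BY (c ASC, d DESC)

print(directions_allowed(declared, ["DESC"]))         # True: first column only
print(directions_allowed(declared, ["ASC", "DESC"]))  # True: matches
print(directions_allowed(declared, ["DESC", "ASC"]))  # True: mirrors
print(directions_allowed(declared, ["ASC", "ASC"]))   # False: mixed
```

Both allowed cases correspond to reading the partition sequentially, either forwards or backwards.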
Adding a condensed answer, as the accepted one is quite long.
ORDER BY is currently only supported on the clustering columns of the PRIMARY KEY,
and only when the partition key is restricted by an equality or an IN operator in the WHERE clause.
That is, if you have your primary key defined like this:
PRIMARY KEY ((a,b),c,d)
Then you will be able to use ORDER BY when, and only when, your query has a WHERE clause in which every partition key column is restricted by either an equality operator (=) or an IN operator, such as:
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two forms of ORDER BY are the only valid ones.
This query, for example, would not work:
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because ORDER BY currently only supports ordering by the clustering columns in the order they were declared in the PRIMARY KEY. In the primary key definition, c is declared before d, and the query violates that ordering by placing d first.
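As a quick way to check a query against these rules, here is a hedged Python sketch. The function and parameter names are hypothetical, and it models only the two conditions from this answer (partition key fully restricted by = or IN, and ORDER BY columns forming a leading run of the clustering columns), not sort directions or anything else Cassandra validates.

```python
def order_by_columns_allowed(partition_key, clustering_key, restricted, order_by):
    """partition_key / clustering_key: column names in declaration order.
    restricted: set of columns restricted by = or IN in the WHERE clause.
    order_by: column names in the order they appear in ORDER BY."""
    if not order_by:
        return False
    # Every partition key column must be restricted by = or IN.
    if not all(col in restricted for col in partition_key):
        return False
    # ORDER BY must follow the declared clustering order, with no skipping.
    return list(order_by) == list(clustering_key[:len(order_by)])

# PRIMARY KEY ((a, b), c, d)
pk, ck = ["a", "b"], ["c", "d"]

print(order_by_columns_allowed(pk, ck, {"a", "b"}, ["c", "d"]))  # True
print(order_by_columns_allowed(pk, ck, {"a", "b"}, ["c"]))       # True
print(order_by_columns_allowed(pk, ck, {"a", "b"}, ["d", "c"]))  # False
print(order_by_columns_allowed(pk, ck, {"a"}, ["c"]))            # False
```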
