Order by with Cassandra NoSQL DB - cassandra

I'm starting to use Cassandra, but I'm running into problems with "ordering" or "selecting".
CREATE TABLE functions (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_function, sort, id_subfunction)
);
This is my table.
If I execute this query
SELECT * FROM functions WHERE id_subfunction = 0 ORDER BY sort;
this is what I get.
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
What am I doing wrong?
Thanks

PRIMARY KEY (id_function, sort, id_subfunction)
In Cassandra CQL the columns in a compound PRIMARY KEY are either partitioning keys or clustering keys. In your case, id_function (the first key listed) is the partitioning key. This is the key value that is hashed so that your data for that key can be evenly distributed on your cluster.
The remaining columns (sort and id_subfunction) are known as clustering columns, which determine the sort order of your data within a partition. This essentially means that your data will only be sorted by your clustering key(s) when a partitioning key is first designated in your WHERE clause.
You have two options:
1) Query this table by id_function instead:
SELECT * FROM functions WHERE id_function = 0 ORDER BY sort;
This will technically work, although I'm guessing that it won't give you the results that you are looking for.
2) The better option is to create a "query table." This is a table designed specifically to handle your query by id_subfunction. It differs from the original functions table only in that the PRIMARY KEY is defined with id_subfunction as the partitioning key:
CREATE TABLE functionsbysubfunction (
id_function int,
sort int,
id_subfunction int,
php_class varchar,
php_function varchar,
PRIMARY KEY (id_subfunction, sort, id_function)
);
This query table will allow this query to function as expected:
SELECT * FROM functionsbysubfunction WHERE id_subfunction = 0;
And you shouldn't need to indicate ORDER BY, unless you want to specify either ASCending or DESCending order.
Remember that with Cassandra it is important to design your data model according to how you want to query your data, which may not be the way it originally made sense to store it.
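The query-table idea can be sketched with a small in-memory model. This is only an illustration in Python, not Cassandra code: the Table class and the sample rows are hypothetical, and a real cluster stores partitions on disk across nodes. It shows the same rows written to two tables that differ only in their key, so that a read by id_subfunction comes back already sorted by sort.

```python
from collections import defaultdict


class Table:
    """Hypothetical stand-in for a table: partition key -> rows sorted by clustering columns."""

    def __init__(self, partition_key, clustering_keys):
        self.partition_key = partition_key
        self.clustering_keys = clustering_keys
        self.partitions = defaultdict(list)

    def insert(self, row):
        rows = self.partitions[row[self.partition_key]]
        rows.append(row)
        # Rows inside one partition stay ordered by the clustering columns.
        rows.sort(key=lambda r: tuple(r[c] for c in self.clustering_keys))

    def select(self, partition_value):
        # A read names the partition; its rows come back already sorted.
        return list(self.partitions[partition_value])


functions = Table("id_function", ["sort", "id_subfunction"])
functions_by_subfunction = Table("id_subfunction", ["sort", "id_function"])

# Made-up sample data for the illustration.
sample_rows = [
    {"id_function": 1, "sort": 3, "id_subfunction": 0},
    {"id_function": 2, "sort": 1, "id_subfunction": 0},
    {"id_function": 1, "sort": 2, "id_subfunction": 5},
]
for row in sample_rows:
    # The application writes every row to both tables (denormalization).
    functions.insert(row)
    functions_by_subfunction.insert(row)

result = functions_by_subfunction.select(0)
print([(r["sort"], r["id_function"]) for r in result])  # [(1, 2), (3, 1)]
```

The cost of this design is that every write goes to both tables, which is the usual trade-off for a query table.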

Related

Order of column in composite partitioning key

I am using Scylla database and I have created a partitioning key composite of two columns.
Does the order of keys matter in this case?
Table definition
create table X(
user_id text,
city text,
name text,
PRIMARY KEY ((user_id, city))
);
Will anything change if I write
PRIMARY KEY ((city, user_id))?
In a composite partition key, the order of the columns does not matter for how you query the table or for how well the data is distributed.
Switching the order of the keys may result in different hash values, but it shouldn't reduce the efficiency of data distribution.
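A rough way to see this is to hash the same keys in both column orders and count how many land on each node. The sketch below is in Python and makes several assumptions: md5 stands in for the real partitioner hash, a simple modulo stands in for token ranges, and the node count and the generated user and city values are invented.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def token(*parts):
    # Stand-in hash: md5 keeps the sketch runnable, it is not the real partitioner.
    joined = "\x00".join(parts).encode("utf-8")
    return int.from_bytes(hashlib.md5(joined).digest()[:8], "big")


def distribution(order):
    # Count how many of 10,000 made-up keys fall on each node for a column order.
    counts = Counter()
    for i in range(10_000):
        key = {"user_id": f"user{i}", "city": f"city{i % 50}"}
        counts[token(*(key[c] for c in order)) % NODES] += 1
    return counts


by_user_city = distribution(("user_id", "city"))
by_city_user = distribution(("city", "user_id"))

print(sorted(by_user_city.items()))
print(sorted(by_city_user.items()))
# The same logical key hashes to a different token in each column order.
print(token("user1", "city1") != token("city1", "user1"))
```

Both orders should print roughly equal counts per node, while individual keys end up with different tokens.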

Cassandra: How to query primary key using mathematical operators like >,< which is of type text?

I have a table in Cassandra whose primary key is of type text (string) and stores only numbers.
Now I want to run a CQL query on this column using mathematical operators like > and <.
Any idea on how to accomplish this?
select * from ynapanalyticsteam.df_tran_order_info where order_header_key>'2018';
Why do you store it as text when your values are numbers?
Cassandra has some important concepts: partition key and clustering key.
Let's say you have a table like this:
TABLE A (
...,
PRIMARY KEY ((pk1, pk2), ck1, ck2, ck3, ck4, ck5)
)
pk1 and pk2 are the partition keys, and your query must restrict them using = (or IN). The partition key is used to determine the nodes to which the data belongs.
The clustering columns (ck1, ck2, ..., ck5) are used for ordering the data and for additional filtering with the =, < and > operators. They control how the data is sorted within the partition.
You need to change your data model so that order_header_key is a clustering key and another column is your partition key.
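The question about storing numbers as text matters because text values compare character by character rather than by numeric value. The Python snippet below is only an illustration with made-up keys. It relies on Python's string comparison, which for plain ASCII digits gives the same order as a byte-wise text comparison.

```python
keys = ["9", "10", "2018", "300", "2019"]

# Compared as text: character by character, so "300" sorts after "2018".
as_text = sorted(keys)

# Compared as numbers: what a range such as > 2018 is usually meant to express.
as_numbers = sorted(int(k) for k in keys)

print(as_text)      # ['10', '2018', '2019', '300', '9']
print(as_numbers)   # [9, 10, 300, 2018, 2019]

# A text range > '2018' also matches "9" and "300".
text_matches = [k for k in keys if k > "2018"]
numeric_matches = [k for k in keys if int(k) > 2018]
print(text_matches)     # ['9', '300', '2019']
print(numeric_matches)  # ['2019']
```

If the column has to stay text, zero-padding the values to a fixed width is one way to make both orders agree.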

Order by created date In Cassandra

I have a problem with ordering data in a Cassandra database.
This is my table structure:
CREATE TABLE posts (
id uuid,
created_at timestamp,
comment_enabled boolean,
content text,
enabled boolean,
meta map<text, text>,
post_type tinyint,
summary text,
title text,
updated_at timestamp,
url text,
user_id uuid,
PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
When I run this query, I get the following message:
Query:
select * from posts order by created_at desc;
message:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
This query, on the other hand, returns data without sorting:
select * from posts;
There are a couple of things you need to understand.
In your case the partition key is "id" and the clustering key is "created_at".
What that essentially means is that each row is stored in a partition chosen by the hash of "id" (the partitioner's hash, Murmur3 by default). Inside that partition the data is sorted by your clustering key, in your case "created_at".
So if you query data from that table, the results are sorted by your clustering key by default, in the sort order you specified when creating the table. However, there is a gotcha.
If you do not specify the partition key in the WHERE clause, the actual order of the result set depends on the hashed values of the partition key (in your case id).
So to get the posts in that specific order, you have to specify the partition key, like this (id is a uuid column, so the value below is only a placeholder):
select * from posts WHERE id = 123e4567-e89b-12d3-a456-426614174000 order by created_at desc;
Note:
It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (“ASCending/DESCending”) already matches the CLUSTERING ORDER in the table definition.
So essentially the above query is the same as
select * from posts WHERE id = 123e4567-e89b-12d3-a456-426614174000;
You can read more about this here http://www.datastax.com/dev/blog/we-shall-have-order
The error message is pretty clear: you cannot ORDER BY unless the WHERE clause restricts the partition key with an EQ or an IN. This is by design.
The data you get when running without a WHERE clause is actually ordered, not by your clustering key, but by the result of applying the token function to your partition key. You can verify the order by issuing:
SELECT token(id), id, created_at, user_id FROM posts;
where the token function arguments exactly match your PARTITION KEY.
I suggest you read this and this to understand what you can and can't do.
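The difference between a full scan and a single-partition read can be modelled in a few lines. This Python sketch is only an approximation: md5 stands in for the Murmur3 partitioner, the post ids are shortened placeholders rather than real uuids, and created_at is a plain integer.

```python
import hashlib
from collections import defaultdict


def token(partition_key):
    # Stand-in for the partitioner hash (the real default is Murmur3).
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# (id, created_at) pairs; made-up sample data.
posts = [("post-a", 3), ("post-b", 1), ("post-c", 2), ("post-a", 5)]

partitions = defaultdict(list)
for post_id, created_at in posts:
    partitions[post_id].append(created_at)
for rows in partitions.values():
    rows.sort(reverse=True)  # models CLUSTERING ORDER BY (created_at DESC)


def full_scan():
    # No WHERE clause: partitions are walked in token order, not by created_at.
    for post_id in sorted(partitions, key=token):
        for created_at in partitions[post_id]:
            yield token(post_id), post_id, created_at


def select_partition(post_id):
    # WHERE id = ...: one partition, already in clustering order.
    return [(post_id, created_at) for created_at in partitions[post_id]]


for row in full_scan():
    print(row)
print(select_partition("post-a"))  # [('post-a', 5), ('post-a', 3)]
```

Only the single-partition read has a meaningful created_at order. The full scan is sorted by token first, which is why its rows look unsorted.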

What do nested parentheses indicate in a PRIMARY KEY definition

What is the difference between these two kinds of tables in Cassandra?
First :
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY (sensor_id, collected_at,volts )
)
and Second:
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY ((sensor_id, collected_at),volts )
)
My questions:
What is the difference between these two tables?
When would we use the first table, and when would we use the second table?
The difference is the primary key. A Cassandra primary key is divided into (Partition Key, Clustering Key).
The Partition Key decides where a row goes within the ring, and the Clustering Key determines how rows with the same partition key are stored, so that your queries can make use of the on-disk sort order.
First table:
sensor_id is your Partition Key, so you know every row with the same sensor_id will go to the same node.
You have two clustering keys, collected_at and volts, so data with the same sensor_id will be stored ordered by collected_at in ascending order, and data with the same sensor_id and collected_at will be stored ordered by volts in ascending order.
Second table:
You will have a compound Partition Key (sensor_id, collected_at), so you know every row with the same sensor_id and collected_at will go to the same node.
Your clustering key is volts, so data with the same (sensor_id, collected_at) will be stored ordered by volts in ascending order.
Imagine you have billions of rows for the same sensor_id. With the first approach they are all stored in the same partition on the same node, so you will probably run out of space. With the second approach you have to query with an exact sensor_id and collected_at timestamp, which probably doesn't make sense. That is why, in Cassandra modeling, you must know which queries you are going to execute before you create the model.
The first table partitions data on sensor_id only, meaning that all data underneath each sensor_id is stored in the same data partition. The hashed token value of sensor_id also determines which node(s) in the cluster the data partition is stored on. Data within each partition is sorted by collected_at and volts.
The second table uses a composite key on both sensor_id and collected_at to determine data partitioning. Data in each partition is sorted by volts.
When we use first table and when we use the second table?
Because a query has to supply every partition key column, the first table offers more query flexibility. That is, you can query on sensor_id alone, and then choose whether or not to also restrict collected_at and then volts. In the second table, you have to query by both sensor_id and collected_at. So you have less query flexibility, but you get better data distribution out of the second model.
And actually, partitioning on a timestamp value (second table) is typically not very useful, because you would need to have that exact timestamp before executing your query. Where timestamp components do appear in a partition key, it is typically in a technique called "date bucketing," in which you use something with less precision, like month or day. That way, you can still query for an entire month, day, or whatever your bucket is.
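Date bucketing amounts to truncating the timestamp before it goes into the partition key. The Python sketch below assumes a day-sized bucket in UTC; the helper names day_bucket and partition_key, the sensor id and the sample readings are all hypothetical.

```python
from datetime import datetime, timezone


def day_bucket(collected_at):
    # Truncate the timestamp to a calendar day (UTC) for use in the partition key.
    return collected_at.astimezone(timezone.utc).strftime("%Y-%m-%d")


def partition_key(sensor_id, collected_at):
    # Hypothetical bucketed key: (sensor_id, day) instead of (sensor_id, collected_at).
    return (sensor_id, day_bucket(collected_at))


# Made-up readings for sensor 7.
readings = [
    datetime(2024, 3, 1, 0, 15, tzinfo=timezone.utc),
    datetime(2024, 3, 1, 23, 59, tzinfo=timezone.utc),
    datetime(2024, 3, 2, 0, 0, tzinfo=timezone.utc),
]
keys = [partition_key(7, ts) for ts in readings]
print(keys)  # [(7, '2024-03-01'), (7, '2024-03-01'), (7, '2024-03-02')]
```

A query for one sensor and one day then only needs the sensor id and the day string, while the full timestamp can remain a clustering column for ordering within the bucket.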

Select Cassandra row key

What criteria should be considered when selecting a rowid for a column family in Cassandra? I want to migrate a relational database which does not contain any primary key. In that case, what would be the best rowid selection?
Use natural keys that can be derived from the dataset if possible (e.g. phone_number for a phone book, user_name for a user table). If that's not possible, use a UUID.
There are many things to consider when choosing the primary key in Cassandra.
Understand the difference between the primary key and the partition key:
CREATE TABLE users (
user_name varchar PRIMARY KEY,
password varchar,
);
In the above case primary and partition keys are the same.
CREATE TABLE users (
user_name varchar,
user_email varchar,
password varchar,
PRIMARY KEY (user_name, user_email)
);
Here the primary key is user_name and user_email together, whereas user_name alone is the partition key.
CREATE TABLE users (
user_name varchar,
user_email varchar,
password varchar,
PRIMARY KEY ((user_name, user_email))
);
Here the primary key and the partition key are both equal to (user_name, user_email).
Carefully define your partition key. Partition keys are used for lookups by Cassandra, so you must define your partition key by looking at your select queries.
Cassandra organizes data so that partition keys are used for lookups. Using the previous examples:
For the first case:
user_name ---> email:password email:date_of_birth
ABC --> abc#gmail.com:abc123 abc#gmail.com:22/02/1950 abc#yahoo.com:def123...
In the second case:
user_name,email ---> password date_of_birth
ABC,abc#gmail.com --> abc123 22/02/1950
Making the partition key more complex, so that it contains more columns, ensures that you have many rows instead of a single row with many columns. It can be beneficial to balance the number of rows you create against the number of columns each row has. Having incredibly large or incredibly small rows might not be good for reads.
Partition keys indicate how data is distributed across nodes, so consider whether you have hotspots and decide whether you want to break it further.
Case 1:
All users named ABC will be on a single node.
Case 2:
Users named ABC might or might not be on the same node, depending on the key that is generated together with their email.
Your partition key(s) should be how you want to store the data and how you will always look it up. You can only retrieve data by partition key, so it's important to choose something that you will naturally look up (this is why sometimes data is denormalized in Cassandra by storing it in multiple tables that mimic materialized views).
The clustering column key(s), if any, are mostly useful if you sometimes want to retrieve all the data in a partition and sometimes only want some of it. This is great for things like timeseries data because you can cluster the data on a timeuuid, store it sorted, and then do efficient range queries over the data.
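The efficiency of range queries over clustered data comes from the rows already being sorted, so a lookup can jump to the start of the range. The Python sketch below models a single partition as a sorted list and uses binary search. Plain integers stand in for timeuuid values, and the insert and range_query helpers are hypothetical.

```python
import bisect

# One partition's rows as (timestamp, value), kept sorted by the clustering column.
partition = []


def insert(ts, value):
    # Keep the partition sorted on every write.
    bisect.insort(partition, (ts, value))


def range_query(start, end):
    # Binary search for the slice boundaries: start <= ts < end.
    lo = bisect.bisect_left(partition, (start,))
    hi = bisect.bisect_left(partition, (end,))
    return partition[lo:hi]


# Made-up rows, inserted out of order.
for ts, value in [(40, "d"), (10, "a"), (30, "c"), (20, "b")]:
    insert(ts, value)

print(partition)            # [(10, 'a'), (20, 'b'), (30, 'c'), (40, 'd')]
print(range_query(15, 40))  # [(20, 'b'), (30, 'c')]
```

Reading the whole partition is just the full list, and reading part of it is a contiguous slice, which matches the two access patterns described above.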
