Cassandra clustering key uniqueness - cassandra

In the book Cassandra the definitive guide it is said that the combination of partition key and clustering key guarantees a unique record in the data base... i understand that the partition key is the one that guarantees unique of record - the node where the record is stored. And the clustering key is for the sorting of the records. Can someone help me understand this?
thank and sorry for the question...

Single partition key (without clustering key) is primary key which has to be unique.
A partition key + clustering key has to be unique but it doesn't mean that either partition key or a clustering key has to be unique alone.
You can insert
(a,b) (first record)
(a,c) (same partition key with the first record)
(d,b) (same clustering key with the first record)
When you insert (a,b) again then it will update the non primary key values for existing primary key.
In the following example userid is partition key and date is clustering key.
cqlsh:play> CREATE TABLE example (userid int, date int, name text, PRIMARY KEY (userid, date));
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200530, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200531, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'a');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | a
(3 rows)
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'b');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | b
(3 rows)
cqlsh:play>

Related

Cassandra Token or Hash Values

I have a student table in Cassandra with column named StudentId as primary key. Can two values from this column have same token/hash value?
Table structure
|-----------|-------------|
| StudentId | Primary Key |
| FName | |
| FName | |
|-----------|-------------|
So I think that I get what you're trying to ask here. When determining data distribution, the partition key (first part of the PRIMARY KEY) is hashed to obtain a token. That row is then written to the node(s) responsible for that particular token range.
As for having the same hash value, it is important to note that PRIMARY KEYs in Cassandra are unique. Therefore, to have the same hashed token value, rows would have to have identical partition keys, which is not possible.
To demonstrate this, I have re-created your table and INSERTed a few rows:
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT);
INSERT INTO student (studentid, fname, lname) VALUES ('aploetz','Aaron','Ploetz');
INSERT INTO student (studentid, fname, lname) VALUES ('aploetz','Avery','Ploetz');
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO student (studentid, fname, lname) VALUES ('mgin','Micah','Gin');
Now I will query that table, utilizing the token function on the partition key (studentid):
SELECT token(studentid),studentid,fname,lname FROM student ;
system.token(studentid) | studentid | fname | lname
-------------------------+-----------+-------+----------
-5626264886876159064 | janderson | Jordy | Anderson
-1472930629430174260 | aploetz | Avery | Ploetz
8993000853088610283 | mgin | Micah | Gin
(3 rows)
Notes:
In using the token function on the partition key, I can see the hashed token values, thus I can determine which node(s) in the cluster will contain this data.
The first two students I inserted had different names, but they ended up with the same studentid of aploetz. As PRIMARY KEYs are unique, only one persisted.
The row for Avery Ploetz "won," as it was written last.
Let me know if you require any further explanation, but I hope this helps to answer your question.

Cassandra data modeling for range queries using timestamp

I need to create a table with 4 columns:
timestamp BIGINT
name VARCHAR
value VARCHAR
value2 VARCHAR
I have 3 required queries:
SELECT *
FROM table
WHERE timestamp > xxx
AND timestamp < xxx;
SELECT *
FROM table
WHERE name = 'xxx';
SELECT *
FROM table
WHERE name = 'xxx'
AND timestamp > xxx
AND timestamp < xxx;
The result needs to be sorted by timestamp.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (timestamp)
);
the result is never sorted.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (name, timestamp)
);
the result is sorted by name > timestamp which is wrong.
name | timestamp
------------------------
a | 20170804142825729
a | 20170804142655569
a | 20170804142650546
a | 20170804142645516
a | 20170804142640515
a | 20170804142620454
b | 20170804143446311
b | 20170804143431287
b | 20170804143421277
b | 20170804142920802
b | 20170804142910787
How do I do this using Cassandra?
Cassandra order data by clustering key group by partition key
In your case first table have only partition key timestamp, no clustering key. So data will not be sorted.
And For the second table partition key is name and clustering key is timestamp. So your data will sorted by timestamp group by name. Means data will be first group by it's name then each group will be sorted separately by timestamp.
Edited
So you need to add a partition key like below :
CREATE TABLE table (
year BIGINT,
month BIGINT,
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY ((year, month), timestamp)
);
here (year, month) is the composite partition key. You have to insert the year and month from the timestamp. So your data will be sorted by timestamp within a year and month

Are there any performance penalties when using a TEXT as a Primary Key?

If yes, what would the data model look like if I want to have a unique TEXT field?
No. Regardless of data type used, Cassandra stores all data on disk (including primary key values) as hex byte arrays. In terms of performance, the datatype of the primary key really doesn't matter.
The only case where it would matter, is in token/node distribution. This is because the generated token for "12345" as text will be different from the token generated for 12345 as a bigint:
aploetz#cqlsh:stackoverflow> CREATE TABLE textaskey (key text PRIMARY KEY, value text);
aploetz#cqlsh:stackoverflow> CREATE TABLE longaskey (key bigint PRIMARY KEY, value text);
aploetz#cqlsh:stackoverflow> INSERT INTO textaskey (key, value) VALUES ('12345','12345');
aploetz#cqlsh:stackoverflow> INSERT INTO longaskey (key, value) VALUES (12345,'12345');
aploetz#cqlsh:stackoverflow> SELECT token(key),value FROM textaskey ;
token(key) | value
---------------------+-------
2375712675693977547 | 12345
(1 rows)
aploetz#cqlsh:stackoverflow> SELECT token(key),value FROM longaskey;
token(key) | value
---------------------+-------
3741197147323682197 | 12345
(1 rows)
But even in this example, one shouldn't perform faster/different than the other.

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in table. I want to retrieve post from all users the user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
)WITH CLUSTERING ORDER BY (time DESC)
I have the data about who all user follows in another table
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am making query like
select * from posts where userid in(1,2,3,4....n);
2 questions:
why I still get data in random order, though CLUSTERING ORDER BY is specified in posts. ?
Is model correct to satisfy the query optimally (user can have n number of followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because ORDER BY only works for rows within a particular partitioning key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That return your results ordered by time, as all of the rows within the userid=1 partitioning key would be clustered by it.
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
)WITH CLUSTERING ORDER BY (time DESC)
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case, is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, and inserted some sample data, and can see that this query works:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
Although, if you want to query by just userid you can see posts from all of your followers. But in that case, the posts will only be ordered within each followerid, like this:
aploetz#cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema,
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it and follows represents userid for his one of the follower. Say user x follows 10 other people , i am making 10+1 inserts. Definitely there is too much data duplication. However now its easier to get timeline for one of the user with following query
select * from posts where follows=?

what is the reason for composite column, there must be at least one column which is not part of the primary key

From online document:
A CQL 3 table’s primary key can have any number (1 or more) of component columns, but there must be at least one column which is not part of the primary key.
What is the reason for that?
I tried to insert a row only with the columns in the composite key in CQL. I can't see it when I do SELECT
cqlsh:demo> CREATE TABLE DEMO (
user_id bigint,
dep_id bigint,
created timestamp,
lastupdated timestamp,
PRIMARY KEY (user_id, dep_id)
);
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id)
... VALUES (100, 1);
cqlsh:demo> select * from demo;
cqlsh:demo>
But when I use cli, it shows up something:
default#demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
1 Row Returned.
Elapsed time: 27 msec(s).
But can't see the values of the columns.
After I add the column which is not in the primary key, the value shows up in CQL
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id, created)
... VALUES (100, 1, '7943-07-23');
cqlsh:demo> select * from demo;
user_id | dep_id | created | lastupdated
---------+--------+--------------------------+-------------
100 | 1 | 7943-07-23 00:00:00+0000 | null
Result from CLI:
[default#demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
invalid UTF8 bytes 0000ab7240ab7580
[default#demo]
Any idea?
update: I found the reason why CLI returns invalid UTF8 bytes 0000ab7240ab7580, it's not compatible with the table created for from CQL3, if I use compact storage option, the value shows up correctly for CLI.
What's really happening under the covers is that the non-key values are being saved using the primary key values which make up the row key and column names. If you don't insert any non-key values then you're not really creating any new column family columns. The row key comes from the first primary key, so that's why Cassandra was able to create a new row for you, even though no columns were created with it.
This limitation is fixed in Cassandra 1.2, which is in beta now.

Resources