Cassandra Token or Hash Values - cassandra

I have a student table in Cassandra with a column named StudentId as the primary key. Can two values from this column have the same token/hash value?
Table structure
|-----------|-------------|
| StudentId | Primary Key |
| FName | |
| LName | |
|-----------|-------------|

So I think that I get what you're trying to ask here. When determining data distribution, the partition key (first part of the PRIMARY KEY) is hashed to obtain a token. That row is then written to the node(s) responsible for that particular token range.
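As for having the same hash value, it is important to note that PRIMARY KEYs in Cassandra are unique. Two rows in this table can therefore only share a token if they have identical partition keys, and that cannot happen: a second write with the same key overwrites the first. (Two different keys hashing to the same token is theoretically possible but extremely unlikely, and Cassandra can still tell such keys apart because it stores the key itself.)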
To demonstrate this, I have re-created your table and INSERTed a few rows:
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT);
INSERT INTO student (studentid, fname, lname) VALUES ('aploetz','Aaron','Ploetz');
INSERT INTO student (studentid, fname, lname) VALUES ('aploetz','Avery','Ploetz');
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO student (studentid, fname, lname) VALUES ('mgin','Micah','Gin');
Now I will query that table, utilizing the token function on the partition key (studentid):
SELECT token(studentid),studentid,fname,lname FROM student ;
system.token(studentid) | studentid | fname | lname
-------------------------+-----------+-------+----------
-5626264886876159064 | janderson | Jordy | Anderson
-1472930629430174260 | aploetz | Avery | Ploetz
8993000853088610283 | mgin | Micah | Gin
(3 rows)
Notes:
In using the token function on the partition key, I can see the hashed token values, thus I can determine which node(s) in the cluster will contain this data.
The first two students I inserted had different names, but they ended up with the same studentid of aploetz. As PRIMARY KEYs are unique, only one persisted.
The row for Avery Ploetz "won," as it was written last.
Let me know if you require any further explanation, but I hope this helps to answer your question.
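The hashing step described above can be sketched outside Cassandra. This is only a toy model, not Cassandra's implementation: MD5 stands in for the cluster's partitioner so that the example runs with the standard library, and the four-node ring with its token boundaries is hypothetical.

```python
import bisect
import hashlib

def token(partition_key: str) -> int:
    """Toy partitioner: map a key to a signed 64-bit token.

    MD5 is only a stand-in so the sketch runs with the standard library.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical ring: each node owns the tokens up to and including its own.
RING = [(-2**62, "node1"), (0, "node2"), (2**62, "node3"), (2**63 - 1, "node4")]
RING_TOKENS = [t for t, _ in RING]

def owner(partition_key: str) -> str:
    """Return the node whose token range contains the key's token."""
    idx = bisect.bisect_left(RING_TOKENS, token(partition_key))
    return RING[idx][1]

for student_id in ("aploetz", "janderson", "mgin"):
    print(student_id, token(student_id), owner(student_id))
```

The same studentid always hashes to the same token, so it always lands on the same node. The token values printed here will not match the ones cqlsh shows, because the hash function is different.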

Related

Cassandra clustering key uniqueness

In the book Cassandra: The Definitive Guide it is said that the combination of partition key and clustering key guarantees a unique record in the database. I understood that the partition key alone guarantees the uniqueness of a record (and determines the node where the record is stored), and that the clustering key is only for sorting the records. Can someone help me understand this?
Thanks, and sorry for the question.
A primary key that consists of a single partition key (no clustering key) has to be unique.
With a clustering key, the combination of partition key + clustering key has to be unique, but neither the partition key nor the clustering key has to be unique on its own.
You can insert
(a,b) (first record)
(a,c) (same partition key with the first record)
(d,b) (same clustering key with the first record)
When you insert (a,b) again, it updates the non-primary-key values of the existing row with that primary key.
In the following example userid is partition key and date is clustering key.
cqlsh:play> CREATE TABLE example (userid int, date int, name text, PRIMARY KEY (userid, date));
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200530, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200531, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'a');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | a
(3 rows)
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'b');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | b
(3 rows)
cqlsh:play>
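The behaviour in the cqlsh session above can be imitated with a plain dictionary keyed by the full primary key. This is a rough sketch of the upsert semantics only; the table and insert names are made up for illustration and say nothing about how Cassandra stores the data.

```python
# Toy model of the example table: rows are keyed by (userid, date).
table = {}

def insert(userid, date, name):
    """Write a row; an existing (userid, date) is overwritten, not rejected."""
    table[(userid, date)] = {"name": name}

insert(1, 20200530, "a")
insert(1, 20200531, "a")  # same partition key as the first row
insert(2, 20200531, "a")  # same clustering key as the second row
insert(2, 20200531, "b")  # same full primary key: replaces the previous row

for (userid, date), row in sorted(table.items()):
    print(userid, date, row["name"])
```

Three rows remain, and the row for (2, 20200531) carries the name from the last write.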

Primary key in cassandra is unique?

This may be a basic question, but does the primary key in Cassandra have to be unique?
For example in the following table:
CREATE TABLE users (
name text,
surname text,
age int,
adress text,
PRIMARY KEY(name, surname)
);
So is it possible to have two persons in my database with the same name and surname but different ages, which means the same primary key?
Yes the primary key has to be unique. Otherwise there would be no way to know which row to return when you query with a duplicate key.
In your case you can have 2 rows with the same name or with the same surname but not both.
By definition, the primary key has to be unique. But that doesn't mean you can't accomplish your goals. You just need to change your approach/terminology.
First of all, if you relax your goal of having the name+surname be a primary key, you can do the following:
CREATE TABLE users ( name text, surname text, age int, address text, PRIMARY KEY((name, surname),age) );
insert into users (name,surname,age,address) values ('name1','surname1',10,'address1');
insert into users (name,surname,age,address) values ('name1','surname1',30,'address2');
select * from users where name='name1' and surname='surname1';
name | surname | age | address
-------+----------+-----+----------
name1 | surname1 | 10 | address1
name1 | surname1 | 30 | address2
If, on the other hand, you wanted to ensure that the address is shared as well, then you probably just want to store a collection of ages in the user record. That could be achieved by:
CREATE TABLE users2 ( name text, surname text, age set<int>, address text, PRIMARY KEY(name, surname) );
insert into users2 (name,surname,age,address) values ('name1','surname1',{10,30},'address2');
select * from users2 where name='name1' and surname='surname1';
name | surname | address | age
-------+----------+----------+----------
name1 | surname1 | address2 | {10, 30}
So it comes back to what you actually need to accomplish. Hopefully the above examples give you some ideas.
The primary key is unique. With your data model, you can only have one age per (name, surname) combination.
Yes, as mentioned in the comments above, you can use a composite key of name, surname, and age to achieve your goal, but that still won't solve the problem when all three values are duplicated. Instead, you can consider adding a new column userId and making it the primary key. Then even when name, surname, and age are all duplicated, you don't have to revisit your data model.
CREATE TABLE users (
userId int,
name text,
surname text,
age int,
adress text,
PRIMARY KEY(userid)
);
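Cassandra does not generate userId values on its own, so the application has to supply them. One possible approach, sketched below, is a random UUID per user. Note that this is an assumption that departs from the table above, which declares userId as int, and the add_user helper is hypothetical.

```python
import uuid

# Toy stand-in for the users table, keyed by a generated surrogate id.
users = {}

def add_user(name, surname, age, address):
    """Store a user under a freshly generated id and return that id."""
    user_id = uuid.uuid4()
    users[user_id] = {"name": name, "surname": surname, "age": age, "address": address}
    return user_id

# Two people with identical name, surname, and age still get separate rows.
first = add_user("name1", "surname1", 10, "address1")
second = add_user("name1", "surname1", 10, "address1")
print(len(users))
```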
I would state specifically that the partition key should be unique. I could not find this stated in one place, but it follows from these statements:
Cassandra needs all the partition key columns to be able to compute the hash that will allow it to locate the nodes containing the partition.
The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database.
Please note that there will not be any error if you insert same partition key again and again as there is no constraint check.
Queries that you'll run equality searches on should be in a partition key.
References
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
how Cassandra chooses the coordinator node and the replication nodes?
Insert query replaces rows having same data field in Cassandra clustering column

Cassandra: can you add dynamic columns within existing column clustering?

I'm using Cassandra 1.2.12 with CQL 3, and am having trouble modeling my column family.
I currently store snapshots of customer data at particular times. Works great:
CREATE TABLE data (
cust_id varchar,
time timeuuid,
data_text text,
PRIMARY KEY (cust_id, time)
);
The cust_id is the partition key and time is the clustering id, so, as I understand it, I can think of each row in the table like:
| cust_id | timeuuid1 : data_text | timeuuid2 : data_text |
| CUST1 | data at this time | data at this time |
Now I'd like to store another group of metrics for each snapshot - but the name of each of these columns isn't fixed. So something like:
| cust_id | timeuuid1 : data_text | timeuuid1 : dynamicCol1 | timeuuid1 : dynamicCol2 | timeuuid1 : dynamicColN |
| CUST1 | data |{some value} |{some value} |{some value} |
I've achieved dynamic columns for timestamp by using a composite primary key, but I can't see how to achieve this within each cluster of columns, if you see what I mean.
If I add, say, "dynamicColumnName" to the existing composite key, I'll end up with customer data stored for each dynamic column, which is not what I want.
Is this possible, without using a Map column? Hope you can help, thanks!
I am not a CQL user... With the thrift API you dynamically add a column to a column family by inserting/updating a record with a value for a column with name X. The column X will start to exist right there and then for that record.
Have you tried an INSERT statement specifying a column that you have not explicitly defined? I would expect that to have the same effect (column is created).

time series data, selecting range with maxTimeuuid/minTimeuuid in cassandra

I recently created a keyspace and a column family in cassandra. I have the following
CREATE TABLE reports (
id timeuuid PRIMARY KEY,
report varchar
)
I want to select reports within a time range, so my query is the following:
select dateOf(id), id
from keyspace.reports
where token(id) > token(maxTimeuuid('2013-07-16 16:10:48+0300'));
It returns;
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
So, it's wrong.
When I try to use the following cql;
select dateOf(id), id from keyspace.reports
where token(id) > token(minTimeuuid('2013-07-16 16:12:48+0300'));
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
select dateOf(id), id from keyspace.reports
where token(id) > token(minTimeuuid('2013-07-16 16:13:48+0300'));
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
Is it random? Why isn't it giving meaningful output?
What's the best solution for this in Cassandra?
You are using the token function, which isn't really useful in your context (querying between times using mintimeuuid and maxtimeuuid); it is what produces the random-looking, incorrect output.
From the CQL documentation:
The TOKEN function can be used with a condition operator on the partition key column to query. The query selects rows based on the token of their partition key rather than on their value. The token of a key depends on the partitioner in use. The RandomPartitioner and Murmur3Partitioner do not yield a meaningful order.
If you want to retrieve all records between two dates, it might make more sense to model your data as a wide row, with one record per column, rather than one record per row, e.g., creating the table:
CREATE TABLE reports2 (
reportname text,
id timeuuid,
report text,
PRIMARY KEY (reportname, id)
);
populating the data:
insert into reports2(reportname,id,report) VALUES ('report', 1b3f6d00-ee19-11e2-8734-8d331d938752, 'a');
insert into reports2(reportname,id,report) VALUES ('report', 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b, 'b');
insert into reports2(reportname,id,report) VALUES ('report', 1b275870-ee19-11e2-b3f3-af3e3057c60f, 'c');
insert into reports2(reportname,id,report) VALUES ('report', 21f9a390-ee19-11e2-89a2-97143e6cae9e, 'd');
and querying (no token calls!):
select dateOf(id),id from reports2 where reportname='report' and id>maxtimeuuid('2013-07-16 16:10:48+0300');
which returns the expected result:
dateOf(id) | id
--------------------------+--------------------------------------
2013-07-16 14:10:48+0100 | 21f9a390-ee19-11e2-89a2-97143e6cae9e
The downside to this is that all reports with the same reportname are stored in one row. On the other hand, you can now store lots of different reports (keyed by reportname here). To get all reports called mynewreport in August 2013 you could query using:
select dateOf(id),id from reports2 where reportname='mynewreport' and id>=mintimeuuid('2013-08-01+0300') and id<mintimeuuid('2013-09-01+0300');
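To see why comparing the timeuuid values themselves gives a meaningful time order, the timestamps embedded in the four ids from the question can be decoded with Python's standard uuid module. This is a sketch only: ids_after is a hypothetical helper that approximates the effect of id > maxtimeuuid(...) at millisecond precision, not a reimplementation of Cassandra's comparison.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

REPORT_IDS = [
    "1b3f6d00-ee19-11e2-8734-8d331d938752",
    "0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b",
    "1b275870-ee19-11e2-b3f3-af3e3057c60f",
    "21f9a390-ee19-11e2-89a2-97143e6cae9e",
]

def unix_millis(timeuuid: str) -> int:
    """Milliseconds since the Unix epoch encoded in a version 1 UUID."""
    ticks = uuid.UUID(timeuuid).time
    return (ticks - UUID_EPOCH_OFFSET) // 10_000

def to_datetime(timeuuid: str) -> datetime:
    """Decode the embedded timestamp as a UTC datetime."""
    return UNIX_EPOCH + timedelta(milliseconds=unix_millis(timeuuid))

def ids_after(cutoff: datetime) -> list:
    """Ids whose timestamp is later than the cutoff millisecond, oldest first."""
    cutoff_ms = int(cutoff.timestamp() * 1000)
    later = [i for i in REPORT_IDS if unix_millis(i) > cutoff_ms]
    return sorted(later, key=unix_millis)

cutoff = datetime(2013, 7, 16, 16, 10, 48, tzinfo=timezone(timedelta(hours=3)))
for report_id in ids_after(cutoff):
    print(to_datetime(report_id), report_id)
```

Only the id starting with 21f9a390 is later than the cutoff, which agrees with the single row returned by the reports2 query above. Its time is printed in UTC here, whereas cqlsh displays the local time zone.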

How Cassandra stores multicolumn primary key (CQL)

I have a little misunderstanding about composite row keys with CQL in Cassandra.
Let's say I have the following
cqlsh:testcql> CREATE TABLE Note (
... key int,
... user text,
... name text
... , PRIMARY KEY (key, user)
... );
cqlsh:testcql> INSERT INTO Note (key, user, name) VALUES (1, 'user1', 'name1');
cqlsh:testcql> INSERT INTO Note (key, user, name) VALUES (1, 'user2', 'name1');
cqlsh:testcql>
cqlsh:testcql> SELECT * FROM Note;
key | user | name
-----+-------+-------
1 | user1 | name1
1 | user2 | name1
How is this data stored? Are there two rows or one?
If two, then how is it possible to have more than one row with the same key?
If one, and I have records with key=1 and user from "user1" to "user1000", does that mean there will be one row with key=1 and 1000 columns containing the names for each user?
Can someone explain what's going on in the background? Thanks.
So, after digging a bit more and reading an article suggested by Lyuben Todorov (thank you), I found the answer to my question.
Cassandra stores data in data structures called rows, which are totally different from rows in relational databases. Rows have a unique key.
Now, what's happening in my example... In table Note I have a composite key defined as PRIMARY KEY (key, user). Only the first element of this key acts as the row key, and it's called the partition key. Internally, the rest of this key is used to build composite columns.
In my example
key | user | name
-----+-------+-------
1 | user1 | name1
1 | user2 | name1
This will be represented in Cassandra in one row as
-------------------------------------
| | user1:name | user2:name |
| 1 |--------------------------------
| | name1 | name1 |
-------------------------------------
Knowing that, it's clear that it's not a good idea to add a column with a huge (and growing) number of unique values to the composite key, because all of those values will be stored in one row. It's even worse if you have multiple columns like this in a composite primary key.
Update: Later I found this blog post by Aaron Morton that explains the same in more detail.
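The layout described above can be mimicked with a nested dictionary: one entry per partition key, holding one composite column per clustering value. This is a simplified sketch of the picture drawn in this answer, with hypothetical helper names, not Cassandra's actual on-disk format.

```python
from collections import defaultdict

# partition key -> {"<user>:name": value}, mirroring the diagram above
storage = defaultdict(dict)

def insert_note(key, user, name):
    """Add one CQL row as a composite column inside its partition."""
    storage[key][f"{user}:name"] = name

def cells_in_partition(key):
    """Composite column names stored under one partition key, in sorted order."""
    return sorted(storage[key])

insert_note(1, "user1", "name1")
insert_note(1, "user2", "name1")

print(len(storage))           # number of storage rows (partitions)
print(cells_in_partition(1))  # composite columns inside partition 1
```

Both CQL rows end up under the single key 1. Inserting user1 through user1000 with key=1 would keep growing that one entry, which is the wide-row concern raised above.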

Resources