Are there any performance penalties when using TEXT as a Primary Key? - cassandra

If yes, what would the data model look like if I want to have a unique TEXT field?

No. Regardless of the data type used, Cassandra stores all data on disk (including primary key values) as hex byte arrays. In terms of performance, the data type of the primary key really doesn't matter.
The only case where it would matter is in token/node distribution, because the token generated for "12345" as text will differ from the token generated for 12345 as a bigint:
aploetz@cqlsh:stackoverflow> CREATE TABLE textaskey (key text PRIMARY KEY, value text);
aploetz@cqlsh:stackoverflow> CREATE TABLE longaskey (key bigint PRIMARY KEY, value text);
aploetz@cqlsh:stackoverflow> INSERT INTO textaskey (key, value) VALUES ('12345','12345');
aploetz@cqlsh:stackoverflow> INSERT INTO longaskey (key, value) VALUES (12345,'12345');
aploetz@cqlsh:stackoverflow> SELECT token(key),value FROM textaskey;

 token(key)          | value
---------------------+-------
 2375712675693977547 | 12345

(1 rows)

aploetz@cqlsh:stackoverflow> SELECT token(key),value FROM longaskey;

 token(key)          | value
---------------------+-------
 3741197147323682197 | 12345

(1 rows)
But even in this example, neither one should perform faster than the other.
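As for the second part of the question (making a TEXT field unique): uniqueness in Cassandra is enforced only through the primary key, so the usual model is to make the text value the partition key and guard writes with a lightweight transaction. A minimal sketch, with hypothetical table and column names:

-- the unique text value is the partition key
CREATE TABLE IF NOT EXISTS users_by_username (
    username text PRIMARY KEY,  -- the field that must stay unique
    userid uuid,
    email text
);

-- IF NOT EXISTS turns the insert into a lightweight transaction (Paxos),
-- so the write is rejected if the username is already taken
INSERT INTO users_by_username (username, userid, email)
VALUES ('jdoe', uuid(), 'jdoe@example.com')
IF NOT EXISTS;

The result set includes an [applied] column telling you whether the row was actually written. Lightweight transactions cost extra round trips, so reserve them for writes where uniqueness truly matters.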

Related

Cassandra clustering key uniqueness

In the book Cassandra: The Definitive Guide it is said that the combination of partition key and clustering key guarantees a unique record in the database... I understand that the partition key is the one that guarantees the uniqueness of a record and determines the node where the record is stored, and that the clustering key is for the sorting of records. Can someone help me understand this?
Thanks, and sorry for the question...
A single partition key (without a clustering key) is the primary key, and it has to be unique.
A partition key + clustering key combination has to be unique, but that doesn't mean that either the partition key or the clustering key has to be unique on its own.
You can insert:
(a,b) (first record)
(a,c) (same partition key as the first record)
(d,b) (same clustering key as the first record)
If you insert (a,b) again, it will update the non-primary-key values for the existing primary key.
In the following example, userid is the partition key and date is the clustering key.
cqlsh:play> CREATE TABLE example (userid int, date int, name text, PRIMARY KEY (userid, date));
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200530, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200531, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'a');
cqlsh:play> SELECT * FROM example;

 userid | date     | name
--------+----------+------
      1 | 20200530 |    a
      1 | 20200531 |    a
      2 | 20200531 |    a

(3 rows)

cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'b');
cqlsh:play> SELECT * FROM example;

 userid | date     | name
--------+----------+------
      1 | 20200530 |    a
      1 | 20200531 |    a
      2 | 20200531 |    b

(3 rows)

cqlsh:play>

Cassandra migrate int to bigint

What would be the easiest way to migrate an int to a bigint in Cassandra? I thought of creating a new column of type bigint and then running a script to basically set the value of that column = the value of the int column for all rows, and then dropping the original column and renaming the new column. However, I'd like to know if someone has a better alternative, because this approach just doesn't sit quite right with me.
You could ALTER your table and change your int column to the varint type. Check the documentation on ALTER TABLE and the data type compatibility matrix.
The only other alternative is what you said: add a new column and populate it row by row. Dropping the original column is entirely optional: if you don't assign it a value when inserting, everything will stay as it is, and new records won't consume space for it.
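For the copy approach, the schema side is just two ALTER statements; the per-row copy itself has to be driven by a script or driver, since CQL has no UPDATE ... SELECT. A rough sketch, with hypothetical keyspace, table, and column names:

-- add the wider column alongside the existing int column
ALTER TABLE mykeyspace.mytable ADD value_big bigint;

-- for each row, from a driver script: copy the old value across
UPDATE mykeyspace.mytable SET value_big = 12345 WHERE id = 'some-key';

-- once every row is backfilled, optionally drop the old column
ALTER TABLE mykeyspace.mytable DROP value;

One caveat on the renaming step you mentioned: ALTER TABLE ... RENAME only works on primary key columns, so the new column will generally have to keep its new name.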
You can ALTER your table to store bigint-sized values in Cassandra by changing the column to varint. See the example:
cassandra@cqlsh:demo> CREATE TABLE int_test (id int, name text, PRIMARY KEY (id));
cassandra@cqlsh:demo> SELECT * FROM int_test;

 id | name
----+------

(0 rows)

cassandra@cqlsh:demo> INSERT INTO int_test (id, name) VALUES (215478936541111, 'abc');
cassandra@cqlsh:demo> SELECT * FROM int_test;

 id              | name
-----------------+------
 215478936541111 |  abc

(1 rows)

cassandra@cqlsh:demo> ALTER TABLE demo.int_test ALTER id TYPE varint;
cassandra@cqlsh:demo> INSERT INTO int_test (id, name) VALUES (9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999, 'abcd');
cassandra@cqlsh:demo> SELECT * FROM int_test;

 id                                                                                                                           | name
------------------------------------------------------------------------------------------------------------------------------+------
                                                                                                              215478936541111 |  abc
 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999 | abcd

(2 rows)

cassandra@cqlsh:demo>

CQL IN set query

I have a table:
CREATE TABLE IF NOT EXISTS tabletest (uuid text, uuidHotel text, uuidRoom text, uuidGuest text, bookedTimeStampSet set<text>, PRIMARY KEY (uuidHotel, uuidRoom));
I tried to select with IN:
select * from tabletest where uuidhotel = 'uuidHotel' and bookedtimestampset IN ('1460710800000');
and got:
'bookedtimestampset' (set<text>) cannot be restricted by a 'IN' relation
Can I select elements with an IN filter on a set?
Can I select elements with an IN filter on a set?
No, but you can put a secondary index on bookedtimestampset and use the CONTAINS operator:
aploetz@cqlsh:stackoverflow> CREATE INDEX timeset_idx ON tabletest(bookedtimestampset);
aploetz@cqlsh:stackoverflow> SELECT uuidhotel, uuidroom FROM tabletest
                         ... WHERE uuidhotel = 'uuidHotel1' AND bookedtimestampset CONTAINS '1460710800000';

 uuidhotel  | uuidroom
------------+----------
 uuidHotel1 | uuidroom1

(1 rows)
Normally I wouldn't recommend a secondary index, but as long as you are filtering by a partition key (uuidhotel) it should perform ok.
Can I select elements with an IN filter on a set?
You can't use an IN clause on that column. It is very important to understand how significantly the data model influences query performance. Of course, you can add a secondary index on the bookedtimestampset column, but in that case be ready for performance degradation.
CREATE TABLE IF NOT EXISTS tabletest (uuid text, uuidHotel text, uuidRoom text, uuidGuest text, bookedTimeStampSet set<text>, PRIMARY KEY (uuidHotel, uuidRoom));
Your compound primary key consists of one partition key (uuidHotel) and one clustering key (uuidRoom), which means that all the rooms for a hotel are physically stored together on the same node, in clustering order, so retrieving those rows is very efficient. bookedTimeStampSet is a regular column whose values are spread throughout the whole cluster, and it is simply impossible to restrict a query by this column without a secondary index.
Consequently, I would recommend that you design your primary key according to your future queries, even if you have to duplicate some data, which is common practice for a NoSQL database such as Cassandra.
e.g.
CREATE TABLE IF NOT EXISTS tabletest (uuid text, uuidHotel text, uuidRoom text, uuidGuest text, bookedTimeStamp timestamp, PRIMARY KEY (uuidHotel, bookedTimeStamp, uuidRoom));
This allows you to make a query like:
select * from tabletest where uuidhotel = 'uuidHotel' and
bookedtimestamp > '1460710800000' and bookedtimestamp < '1460710900000';

Timestamp with auto increment in Cassandra

I want Cassandra to write System.currentTimeMillis() into the table for each row. For example, calling
writeToCassandra(name, email)
should result in this row in the Cassandra table:
--------------------------------
name | email | currentMiliseconds
Can Cassandra populate the currentMiliseconds column automatically, like an auto increment?
BR!
Cassandra has a bit of a columnar-database flavor inside. If you read the docs on how columns are stored inside an SSTable, you'll notice that each column has its own write timestamp appended (used for conflict resolution, e.g. the last-write-wins strategy). You can query that timestamp using the writetime() function:
cqlsh:so> create table ticks (id text primary key, value int);
cqlsh:so> insert into ticks (id, value) values ('foo', 1);
cqlsh:so> insert into ticks (id, value) values ('bar', 2);
cqlsh:so> insert into ticks (id, value) values ('baz', 3);
cqlsh:so> select id, value from ticks;

 id  | value
-----+-------
 bar |     2
 foo |     1
 baz |     3

(3 rows)

cqlsh:so> select id, writetime(value) from ticks;

 id  | writetime(value)
-----+------------------
 bar | 1448282940862913
 foo | 1448282937031542
 baz | 1448282945591607

(3 rows)
As requested, I did not explicitly insert the write timestamp into the DB, but I can still query it. Note that you cannot use the writetime() function on primary key columns.
You can try with: dateof(now())
e.g.
INSERT INTO YOUR_TABLE (NAME, EMAIL, DATE)
VALUES ('NAME', 'EMAIL', dateof(now()));
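Note that dateof() has been deprecated since Cassandra 2.2 in favor of toTimestamp(), so on a newer cluster the equivalent insert would be:

INSERT INTO YOUR_TABLE (NAME, EMAIL, DATE)
VALUES ('NAME', 'EMAIL', toTimestamp(now()));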

What is the reason that a composite-key table must have at least one column which is not part of the primary key?

From the online documentation:
A CQL 3 table’s primary key can have any number (1 or more) of component columns, but there must be at least one column which is not part of the primary key.
What is the reason for that?
I tried to insert a row with only the columns of the composite key in CQL, and I can't see it when I do a SELECT:
cqlsh:demo> CREATE TABLE DEMO (
        ... user_id bigint,
        ... dep_id bigint,
        ... created timestamp,
        ... lastupdated timestamp,
        ... PRIMARY KEY (user_id, dep_id)
        ... );
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id)
        ... VALUES (100, 1);
cqlsh:demo> select * from demo;
cqlsh:demo>
But when I use the CLI, something does show up:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100

1 Row Returned.
Elapsed time: 27 msec(s).
I just can't see the values of the columns.
After I add a column which is not in the primary key, the value shows up in CQL:
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id, created)
        ... VALUES (100, 1, '7943-07-23');
cqlsh:demo> select * from demo;

 user_id | dep_id | created                  | lastupdated
---------+--------+--------------------------+-------------
     100 |      1 | 7943-07-23 00:00:00+0000 |        null
Result from the CLI:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
invalid UTF8 bytes 0000ab7240ab7580
[default@demo]
Any idea?
Update: I found the reason why the CLI returns invalid UTF8 bytes 0000ab7240ab7580: the CLI is not compatible with tables created from CQL 3. If I use the COMPACT STORAGE option, the value shows up correctly in the CLI.
What's really happening under the covers is that the non-key values are saved using the primary key values, which make up the row key and the column names. If you don't insert any non-key values, then you're not really creating any new column family columns. The row key comes from the first primary key column, so that's why Cassandra was able to create a new row for you, even though no columns were created with it.
This limitation is fixed in Cassandra 1.2, which is in beta now.
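For what it's worth, here is a sketch of the expected behavior on Cassandra 1.2 or later (not a captured session): the same key-only insert now produces a visible row, because CQL 3 writes an internal row-marker cell for it.

cqlsh:demo> INSERT INTO demo (user_id, dep_id) VALUES (100, 1);
cqlsh:demo> SELECT * FROM demo;

 user_id | dep_id | created | lastupdated
---------+--------+---------+-------------
     100 |      1 |    null |        null

(1 rows)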
