Is a frozen list as partition key in Cassandra good practice, or are there alternatives? [duplicate]

The official documentation tells us to not use UDTs for primary keys. Is there a particular reason for this? What would the potential downsides be in doing this?

That sentence was intended to discourage users from using UDTs for PK columns indiscriminately. The main motivation for UDTs in their current incarnation (that is, given that Cassandra supports "frozen" UDTs) is storing more complex values inside collections. Outside collections, a UDT can have its uses, but it's worth asking yourself twice whether you need it. For example:
CREATE TYPE myType (a text, b int);
CREATE TABLE myTable (id uuid PRIMARY KEY, v frozen<myType>);
is often not very judicious, because you lose the ability to update v.a without also updating v.b. It is actually more flexible to do this directly:
CREATE TABLE myTable (id uuid PRIMARY KEY, a text, b int);
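To make the difference concrete, here is a minimal sketch. The table names my_table_udt and my_table_flat and the literal values are made up for illustration; they stand in for the two myTable definitions above.

```cql
// Hypothetical names: my_table_udt has the column v frozen<myType>,
// my_table_flat has the plain columns a and b.

// Frozen UDT: the value is written as a whole, so b must be supplied
// again even if only a changes.
UPDATE my_table_udt SET v = {a: 'new text', b: 1}
WHERE id = 123e4567-e89b-12d3-a456-426614174000;

// Plain columns: a can be updated on its own.
UPDATE my_table_flat SET a = 'new text'
WHERE id = 123e4567-e89b-12d3-a456-426614174000;
```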
This trivial example points out that UDT outside of collections is not necessarily a good thing, and this also extends to primary key columns. It's not necessarily better to do:
CREATE TYPE myType (a text, b int);
CREATE TABLE myTable (id frozen<myType> PRIMARY KEY);
than more simply:
CREATE TABLE myTable (a text, b int, PRIMARY KEY ((a, b)));
Furthermore, regarding the primary key, any complex UDT probably doesn't make sense. Consider even a moderately complex type like:
CREATE TYPE address (
number int,
street text,
city text,
phones set<text>
);
Using such a type inside a primary key almost surely isn't very useful since the PK identifies rows and so 2 addresses that are the same except for the set of phones wouldn't identify the same row. There are not many situations where that would be desirable. More generally, a PK tends to be relatively simple, and you might want to have fine-grained control over the clustering columns, and so UDT are rarely good candidates.
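As a rough sketch of that point, assume the address type above and a hypothetical table by_address keyed on it (the table name, the note column and all values are invented for illustration):

```cql
// Hypothetical table whose only key column is the frozen UDT.
CREATE TABLE by_address (addr frozen<address> PRIMARY KEY, note text);

// Same number, street and city; only the phone set differs.
INSERT INTO by_address (addr, note)
VALUES ({number: 1, street: 'Main St', city: 'Springfield', phones: {'555-0100'}}, 'first');

INSERT INTO by_address (addr, note)
VALUES ({number: 1, street: 'Main St', city: 'Springfield', phones: {'555-0100', '555-0101'}}, 'second');

// The two frozen values are different keys, so the second INSERT
// creates a second row instead of overwriting the first.
```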
In summary, UDTs in PK columns are not always bad, just not often useful in that context, so users should not go looking for ways to use UDTs in PK columns just because it's allowed.

Related

Better way to define UDT's in Cassandra database

We are trying to remove 2 columns from a table with 3 types and turn them into a UDT instead of keeping them as columns. We came up with the two options below. Is there any difference between these two UDT designs in Cassandra?
First option is:
CREATE TYPE test_type (
cid int,
type text,
hid int
);
and then using it like this in a table definition:
test_types set<frozen<test_type>>,
vs
Second option is:
CREATE TYPE test_type (
type text,
hid int
);
and then using it like this in a table definition:
test_types map<int, frozen<test_type>>
So I am curious which option is preferred for performance reasons, or whether they are generally the same.
It really depends on how you will use it. With the first solution you won't be able to select an element by cid, because to access a set element you need to specify the full UDT value, with all fields.
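A small sketch of that difference follows. Because both options call their type test_type, the names test_type_v1, test_type_v2, demo_set and demo_map are hypothetical stand-ins, and the values are invented.

```cql
// Option 1: cid lives inside the UDT, elements are kept in a set.
CREATE TYPE test_type_v1 (cid int, type text, hid int);
CREATE TABLE demo_set (pk int PRIMARY KEY, test_types set<frozen<test_type_v1>>);

// Removing one element requires the complete UDT value.
UPDATE demo_set SET test_types = test_types - {{cid: 1, type: 'a', hid: 10}}
WHERE pk = 1;

// Option 2: cid is the map key, the UDT holds the remaining fields.
CREATE TYPE test_type_v2 (type text, hid int);
CREATE TABLE demo_map (pk int PRIMARY KEY, test_types map<int, frozen<test_type_v2>>);

// The key alone is enough to overwrite or delete an element.
UPDATE demo_map SET test_types[1] = {type: 'b', hid: 11} WHERE pk = 1;
DELETE test_types[1] FROM demo_map WHERE pk = 1;
```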
A better solution would be the following, assuming that you have only one collection column:
CREATE TYPE test_type (
type text,
hid int
);
create table test (
pk int,
cid int,
udt frozen<test_type>,
primary key(pk, cid)
);
In this case:
you can easily select an individual element by specifying the full primary key. The ability to select individual elements from a map only arrives in Cassandra 4.0 (see CASSANDRA-7396). Until then you have to read the full map back even if you need only one element, which limits how large the map can be
you can even select the range of the values, using the range query
you can get all values by specifying only partition key (pk in this example)
you can select multiple non-consecutive values by doing select * from test where pk = ... and cid in (..., ..., ...);
See the "Check use of collection types" section in the data model checks best practices doc.
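The access patterns listed above could look like this against the test table. The inserted values and the chosen cid numbers are only illustrative.

```cql
// One row per former collection element.
INSERT INTO test (pk, cid, udt) VALUES (1, 10, {type: 'a', hid: 100});
INSERT INTO test (pk, cid, udt) VALUES (1, 25, {type: 'b', hid: 200});

// Single element: full primary key.
SELECT * FROM test WHERE pk = 1 AND cid = 10;

// Range of elements: range query on the clustering column.
SELECT * FROM test WHERE pk = 1 AND cid >= 10 AND cid < 20;

// All elements of one partition.
SELECT * FROM test WHERE pk = 1;

// Several non-consecutive elements.
SELECT * FROM test WHERE pk = 1 AND cid IN (10, 25, 40);
```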

What is the Difference between tuple and user defined type in Cassandra

Can someone tell me the difference between tuples and user-defined types in Cassandra?
The DataStax documentation says:
You can use a tuple as an alternative to a user-defined type when you
don't need to add new fields.
User-defined types gives you more flexibility with altering the number of fields in case you need to update the data later on, as well as allowing you to give meaningful names to each field. The classic example of how UDTs work is an address.
CREATE TYPE mykeyspace.address (
street_number int,
street text,
city text,
zip_code int,
phones set<text>
);
and creating the table
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen<address>
);
The tuple equivalent would be
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen<tuple<int, text, text, int, set<text>>>
);
So tuples would be best for a fixed amount of collected data where field names aren't important (an address column is definitely not a good use case; fields matter--street_number and zip_code could potentially be confused--and you wouldn't be able to add detailed fields later on). UDT would allow this, and also let you query by field name.
Furthermore, there is no significant difference in performance.
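A brief sketch of the practical difference. It assumes the UDT-based users table above, and that the tuple-based variant was created under the hypothetical name users_tuple; the country field and all values are invented.

```cql
// UDT: a field can be added later, and fields are read by name.
ALTER TYPE mykeyspace.address ADD country text;
SELECT address.city FROM users
WHERE id = 123e4567-e89b-12d3-a456-426614174000;

// Tuple: values are purely positional, so the two int fields
// (street number and zip code) are told apart only by their order.
INSERT INTO users_tuple (id, address)
VALUES (123e4567-e89b-12d3-a456-426614174000,
        (12, 'Main St', 'Springfield', 12345, {'555-0100'}));
```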

What are the pros and cons of grouping multiple table together in Cassandra?

The problem is that Cassandra cannot handle a large number of tables per cluster (> 1000). I was looking for ways to reduce the number of tables, and one of them was grouping multiple tables that share the same structure together.
Let's say we have two tables, A and B:
create table A (
key text,
value text,
primary key(key)
);
and
create table B (
key text,
value text,
primary key(key)
);
We can group them together by adding one more partition key
create table Shared (
original_table_name text, // either 'A' or 'B'
key text,
value text,
primary key(original_table_name, key)
);
My question is, is it a good pattern and what are the consequences of modelling data this way?
Please elaborate on what you mean by a lot of tables. Our production cluster runs with 50+ tables and I don't see any issue with that.
Anyway, if your application uses a lot of tables, the most probable cause is a normalized data model. In Cassandra you should always create denormalized tables, because there are no joins. Cassandra is built for very fast writes, so you can count on that and not worry about the extra writes.
As for the new design, I don't see any problem with it. The only thing is that your partition key should be the combination (table_name, key) and not just table_name, so that the data is evenly distributed across nodes.
And of course, every query will then have to specify both table_name and key.
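A minimal sketch of that suggestion, reusing the column names from the question (the inserted values are made up). The inner parentheses make both columns part of the partition key.

```cql
CREATE TABLE Shared (
  original_table_name text, // either 'A' or 'B'
  key text,
  value text,
  PRIMARY KEY ((original_table_name, key))
);

INSERT INTO Shared (original_table_name, key, value) VALUES ('A', 'k1', 'v1');

// Both partition key columns must be given on every read.
SELECT value FROM Shared WHERE original_table_name = 'A' AND key = 'k1';
```

The trade-off is that listing every row of one original table is no longer a single-partition query.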

Cassandra Defining Primary key and alternatives

Here is a simple example of a user table in Cassandra. What is the best strategy for creating a primary key?
My requirements are
search by uuid
search by username
search by email
All the keys mentioned are high-cardinality keys. Also, at any given moment I will have only one of them to search by.
PRIMARY KEY(uid,username,email)
What if I have only the username? Then the above primary key is not useful. I am not able to see how to achieve this with a compound primary key.
What are the other options? Should we add a new table mapping username to uid, and then search the user table?
All the articles I have found recommend against creating a secondary index on high-cardinality keys.
CREATE TABLE medicscity.user (
uid uuid,
fname text,
lname text,
user_id text,
email_id text,
password text,
city text,
state_id int,
country_id int,
dob timestamp,
zipcode text,
PRIMARY KEY (??)
)
How do we solve this kind of situation?
Yes, you need to go with duplicate tables.
Whenever you have to query a Cassandra table by column1, column2, or column3 independently, you have to duplicate the table.
How much you duplicate is your choice.
In this example, you can duplicate the table with the full data.
Or you can keep the full data only in the main table, with column1 as the partition key and column1, column2, column3 as the primary key.
Then create a new table with a primary key of column1, column2, column3 and column2 as the partition key.
Create another one with the same primary key columns and column3 as the partition key.
That way you duplicate only the key columns rather than the whole row, but you end up querying twice: once against the duplicate table and once against the full table.
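The lookup-table variant could be sketched like this. The table names user_id_to_uid and email_to_uid are hypothetical, the sketch keeps only the two columns each lookup needs, and it assumes the main medicscity.user table uses PRIMARY KEY (uid).

```cql
// Small lookup tables: only the search column and the uid.
CREATE TABLE medicscity.user_id_to_uid (user_id text PRIMARY KEY, uid uuid);
CREATE TABLE medicscity.email_to_uid (email_id text PRIMARY KEY, uid uuid);

// Query 1: resolve the username to a uid.
SELECT uid FROM medicscity.user_id_to_uid WHERE user_id = 'alice';

// Query 2: fetch the full row from the main table with the uid
// returned by query 1 (placeholder value shown here).
SELECT * FROM medicscity.user
WHERE uid = 123e4567-e89b-12d3-a456-426614174000;
```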
Big data technology is there to speed up computation and let your system scale horizontally, and that comes at the expense of disk storage. Even at its base, the replication factor already duplicates data.
Your PRIMARY KEY(uid,username,email) doesn't fit your requirement, because you can't search by a clustering column without specifying the partition key, nor by the second clustering column without specifying the first.
E.g. you cannot search by username without uid in the WHERE clause, and you cannot search by email without both uid and username.
What you need is denormalization and duplicated data.
Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.
In your case, you need to create 3 tables that have the same columns (the data you want to read) but different primary keys: one with uuid as the PK, one with username as the PK, and one with email as the PK. :)
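One possible shape for those three tables, as a sketch only. The table names are hypothetical, the column list is trimmed to a few columns from the question for brevity, and the logged batch is one assumed way to write all three copies together, not something the answer prescribes.

```cql
CREATE TABLE medicscity.user_by_uid (
  uid uuid PRIMARY KEY, user_id text, email_id text, fname text, lname text);
CREATE TABLE medicscity.user_by_username (
  user_id text PRIMARY KEY, uid uuid, email_id text, fname text, lname text);
CREATE TABLE medicscity.user_by_email (
  email_id text PRIMARY KEY, uid uuid, user_id text, fname text, lname text);

// Write the same user to all three tables.
BEGIN BATCH
  INSERT INTO medicscity.user_by_uid (uid, user_id, email_id, fname, lname)
  VALUES (123e4567-e89b-12d3-a456-426614174000, 'alice', 'alice@example.com', 'Alice', 'Smith');
  INSERT INTO medicscity.user_by_username (user_id, uid, email_id, fname, lname)
  VALUES ('alice', 123e4567-e89b-12d3-a456-426614174000, 'alice@example.com', 'Alice', 'Smith');
  INSERT INTO medicscity.user_by_email (email_id, uid, user_id, fname, lname)
  VALUES ('alice@example.com', 123e4567-e89b-12d3-a456-426614174000, 'alice', 'Alice', 'Smith');
APPLY BATCH;

// Each search is then a single-partition read on its own table.
SELECT * FROM medicscity.user_by_email WHERE email_id = 'alice@example.com';
```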
