What is the difference between a tuple and a user-defined type in Cassandra

Can someone tell me the difference between tuples and user-defined types in Cassandra?

The DataStax documentation says:
You can use a tuple as an alternative to a user-defined type when you don't need to add new fields.
User-defined types give you more flexibility to alter the number of fields if you need to update the data later on, and they let you give each field a meaningful name. The classic example of how UDTs work is an address:
CREATE TYPE mykeyspace.address (
street_number int,
street text,
city text,
zip_code int,
phones set<text>
);
and creating the table
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen <address>
);
The tuple equivalent would be
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen<tuple<int, text, text, int, frozen<set<text>>>>
);
So tuples are best for a fixed set of values where field names aren't important. An address column is definitely not a good use case: the fields matter (street_number and zip_code could easily be confused), and you wouldn't be able to add more detailed fields later on. A UDT allows this, and also lets you select individual fields by name.
Furthermore, there is no significant difference in performance.
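As a rough illustration of the points above, the statements below sketch how the two variants differ in use. They assume the address type and the UDT-based users table from this answer already exist in the current keyspace; the last statement assumes the tuple-based users table instead. All values and the country field are made up for the example.

```cql
-- UDT variant: fields are written and read by name.
INSERT INTO users (id, address)
VALUES (uuid(), {street_number: 12, street: 'Main St', city: 'Springfield',
                 zip_code: 12345, phones: {'555-0100'}});

-- Individual UDT fields can be selected by name.
SELECT id, address.city, address.zip_code FROM users;

-- A UDT can gain a field later (country is a hypothetical addition).
ALTER TYPE address ADD country text;

-- Tuple variant: values are purely positional, with no field names.
INSERT INTO users (id, address)
VALUES (uuid(), (12, 'Main St', 'Springfield', 12345, {'555-0100'}));
```

In the tuple insert nothing stops the two int positions from being swapped, which is the confusion between street_number and zip_code mentioned above.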

Related

Better way to define UDT's in Cassandra database

We are trying to remove 2 columns from a table with 3 types and model them as a UDT instead of keeping them as columns. We came up with the two options below. I just want to understand whether there is any difference between these two UDT designs in Cassandra.
First option is:
CREATE TYPE test_type (
cid int,
type text,
hid int
);
and then using like this in a table definition
test_types set<frozen<test_type>>,
vs
Second option is:
CREATE TYPE test_type (
type text,
hid int
);
and then using like this in a table definition
test_types map<int, frozen<test_type>>
So I am just curious: which option is preferred for performance, or are they both the same in general?
It really depends on how you will use it. With the first solution you won't be able to select an element by cid, because to access a set element you need to specify the full UDT value, with all fields.
A better solution would be the following, assuming that you have only one collection column:
CREATE TYPE test_type (
type text,
hid int
);
create table test (
pk int,
cid int,
udt frozen<test_type>,
primary key(pk, cid)
);
In this case:
you can easily select an individual element by specifying the full primary key. The ability to select individual elements from a map only arrives in Cassandra 4.0 (see CASSANDRA-7396). Until then you need to fetch the full map back, even if you need only one element, and this limits the practical size of the map
you can even select a range of values, using a range query
you can get all values by specifying only the partition key (pk in this example)
you can select multiple non-consecutive values by doing select * from test where pk = ... and cid in (..., ..., ...);
See the "Check use of collection types" section in the data model checks best practices doc.
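The access patterns listed above could look roughly like this against the test table from this answer. This is only a sketch; the sample rows and values are invented for illustration.

```cql
-- Sample rows; all values are made up.
INSERT INTO test (pk, cid, udt) VALUES (1, 10, {type: 'a', hid: 100});
INSERT INTO test (pk, cid, udt) VALUES (1, 20, {type: 'b', hid: 200});
INSERT INTO test (pk, cid, udt) VALUES (1, 30, {type: 'c', hid: 300});

-- One element, addressed by the full primary key.
SELECT * FROM test WHERE pk = 1 AND cid = 20;

-- A range of elements within the partition.
SELECT * FROM test WHERE pk = 1 AND cid >= 10 AND cid < 30;

-- Every element of the partition.
SELECT * FROM test WHERE pk = 1;
```

Each former collection element is now its own row, so none of these reads has to pull back an entire map or set.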

frozen list as partition key in Cassandra is a good practice or any alternates [duplicate]

The official documentation tells us to not use UDTs for primary keys. Is there a particular reason for this? What would the potential downsides be in doing this?
That sentence was intended to discourage users from using UDTs for PK columns indiscriminately. The main motivation for UDTs in their current incarnation (that is, given that Cassandra supports only "frozen" UDTs) is storing more complex values inside collections. Outside collections, UDTs can have their uses, but it's worth asking yourself twice whether you need them. For example:
CREATE TYPE myType (a text, b int);
CREATE TABLE myTable (id uuid PRIMARY KEY, v frozen<myType>);
is often not very judicious, because you lose the ability to update v.a without also updating v.b. It is actually more flexible to do this directly:
CREATE TABLE myTable (id uuid PRIMARY KEY, a text, b int);
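To illustrate the difference, here is a sketch of the same logical change against each of the two myTable variants above (they are alternatives, not meant to coexist). The id and the values are arbitrary placeholders.

```cql
-- Variant 1: v is frozen<myType>, so the whole value is rewritten
-- even when only field a changes.
UPDATE myTable SET v = {a: 'new value', b: 42}
WHERE id = 123e4567-e89b-12d3-a456-426614174000;

-- Variant 2: a and b are regular columns, so a can be updated alone.
UPDATE myTable SET a = 'new value'
WHERE id = 123e4567-e89b-12d3-a456-426614174000;
```

With the frozen column the client has to know the current value of b (or read it first) in order to change a.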
This trivial example points out that UDT outside of collections is not necessarily a good thing, and this also extends to primary key columns. It's not necessarily better to do:
CREATE TYPE myType (a text, b int);
CREATE TABLE myTable (id frozen<myType> PRIMARY KEY);
than more simply:
CREATE TABLE myTable (a text, b int, PRIMARY KEY ((a, b)));
Furthermore, regarding the primary key, any complex UDT probably doesn't make sense. Consider even a moderately complex type like:
CREATE TYPE address (
number int,
street text,
city text,
phones set<text>
);
Using such a type inside a primary key almost surely isn't very useful since the PK identifies rows and so 2 addresses that are the same except for the set of phones wouldn't identify the same row. There are not many situations where that would be desirable. More generally, a PK tends to be relatively simple, and you might want to have fine-grained control over the clustering columns, and so UDT are rarely good candidates.
In summary, UDTs in PK columns are not always bad, just not often useful in that context, so users should not go looking for ways to use UDTs in PK columns just because it's allowed.
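To make the primary-key comparison above concrete, this is roughly what a lookup looks like against each myTable variant (again, the two definitions are alternatives). The values are placeholders.

```cql
-- UDT as the primary key: the complete value has to be supplied.
SELECT * FROM myTable WHERE id = {a: 'foo', b: 1};

-- Composite partition key: each component is addressed by its own name.
SELECT * FROM myTable WHERE a = 'foo' AND b = 1;
```

Both queries identify a row by the pair (a, b), so the UDT version buys little here.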

Selecting data based on specific column in user-defined type

So I have the following columnfamily with its respective types:
CREATE TYPE subudt (
id varint,
-- snipped --
);
CREATE TYPE udt (
name text,
-- snipped --
subudt frozen <subudt>
);
CREATE COLUMNFAMILY tablename (
id varint,
-- snipped --
udt frozen <udt>,
PRIMARY KEY (id)
);
How can I perform a select query on the name field in the udt type? I was looking around and it seems that you cannot use CREATE INDEX on the udt fields, but only on the entire user defined type itself.
Ideally you model the data in Cassandra based on the queries it will serve. Querying for specific fields within a UDT defeats the purpose of combining them in the first place.
Whether a secondary index helps depends on the type of query it is meant to serve. The performance varies with the query structure, as explained here.
In short, you can't create indexes on specific fields, and ideally you should look at modeling the data differently. It's absolutely okay to duplicate the data to serve different query patterns. For example, maintaining a whole new table to serve queries based on "name" is common.
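A minimal sketch of such a lookup table follows. The table name tablename_by_name and the choice of id as a clustering column are assumptions made for illustration; the udt type is the one from the question.

```cql
-- Hypothetical lookup table; tablename_by_name is an assumed name.
-- The name field is copied out of the UDT into the partition key.
CREATE TABLE tablename_by_name (
    name text,
    id varint,
    udt frozen<udt>,
    PRIMARY KEY (name, id)
);

-- The application writes to both tables, then reads by name here.
SELECT id, udt FROM tablename_by_name WHERE name = 'some name';
```

Keeping id as a clustering column lets several rows share the same name without overwriting each other.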

Cassandra Hierarchy Data Model

I'm new to designing Cassandra data models and I need some help thinking outside the box.
Basically I need a hierarchical table, something pretty standard when talking about employees.
You have an employee, say Big Boss, who has a list of employees under him.
Something like:
create table employee(id timeuuid, name text, employees list<employee>, primary key(id));
So, is there a way to model a hierarchy in Cassandra by using the table type itself, or is there another approach?
When I try the line above it gives me
Bad Request: line 1:61 no viable alternative at input 'employee'
EDITED
I was thinking about 2 possibilities:
Store a uuid instead, and have my Java application look up each Employee by uuid when loading the "boss".
Work with a map, where the key is the uuid and the text value is the entire row; then in my Java application get the map, convert each "text" employee into an Employee entity, and finally return the whole object.
It really depends on your queries. One particular model will only be good for one set of queries, but not for others.
You can store ids and look them up again on the client side. This means n extra queries for each "query". This may or may not be a problem, as queries that hit a single partition are fast. Using a map from id to name is also an option. This means you do extra work and denormalise the names into the map values. That's also valid. A third option is to use a UDT (user-defined type). You could then have a list, a set, or even a map of them. In Cassandra 2.1 you can index the map keys or values as well, allowing for some quite flexible querying.
https://www.datastax.com/documentation/cql/3.1/cql/cql_using/cqlUseUDT.html
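As a sketch of the map option with an index on its keys: the reports column, the index name and the sample id below are assumptions made for illustration, not part of the question.

```cql
-- Hypothetical schema: reports maps each direct report's id to their name.
CREATE TABLE employee (
    id timeuuid PRIMARY KEY,
    name text,
    reports map<timeuuid, text>
);

-- Index the map keys (Cassandra 2.1 and later).
CREATE INDEX employee_reports_keys ON employee (KEYS(reports));

-- Find the boss of a given employee from that employee's id.
SELECT id, name FROM employee
WHERE reports CONTAINS KEY 123e4567-e89b-12d3-a456-426614174000;
```

The names in the map are denormalised copies, so renaming an employee means updating the boss's row as well.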
One more approach could be to store a person's details as id, static columns for their attributes, and have "children" as columns in wide row format.
This could look like
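-- employeeudt stands for a UDT describing one employee (not defined in this answer)
create table person(
id int,
name text static,
age int static,
employee_id int,
employee frozen<employeeudt>,
primary key (id, employee_id)
);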
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
Querying this will give you rows with the static properties repeated, but on disk, it's still held once. You can resolve the rest client side.
