Better way to define UDTs in Cassandra database

We are trying to remove 2 columns from a table with 3 types and turn them into a UDT instead of keeping them as separate columns. We came up with the two options below. Is there any difference between these two UDT designs in Cassandra?
First option is:
CREATE TYPE test_type (
cid int,
type text,
hid int
);
and then using it like this in a table definition:
test_types set<frozen<test_type>>,
vs
Second option is:
CREATE TYPE test_type (
type text,
hid int
);
and then using it like this in a table definition:
test_types map<int, frozen<test_type>>
So I am curious which option is preferred from a performance point of view, or whether they are the same in general.

It really depends on how you will use it. With the first solution you won't be able to select an element by cid, because to access a set element you need to specify the full UDT value, with all fields.
A better solution would be the following, assuming that you have only one collection column:
CREATE TYPE test_type (
type text,
hid int
);
create table test (
pk int,
cid int,
udt frozen<test_type>,
primary key(pk, cid)
);
In this case:
You can easily select an individual element by specifying the full primary key. The ability to select individual elements from a map only arrives in Cassandra 4.0 (see CASSANDRA-7396). Until then you have to read the whole map back even if you need a single element, which limits how large the map can be.
You can select a range of values with a range query on cid.
You can get all values by specifying only the partition key (pk in this example).
You can select multiple non-consecutive values with select * from test where pk = ... and cid in (..., ..., ...);
See the "Check use of collection types" section in the data model checks best practices doc.
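The access patterns compared above can be mimicked with plain Python containers. This is only an in-memory analogy, not Cassandra code: the namedtuple names and the sample values are made up for illustration, a Python set stands in for set<frozen<test_type>>, and a dict keyed by cid stands in for both the map option and the clustering-column table.

```python
from collections import namedtuple

# Option 1 analogy: cid is a field of the value, values live in a set.
TestTypeWithCid = namedtuple("TestTypeWithCid", ["cid", "type", "hid"])
option1 = {TestTypeWithCid(1, "a", 10), TestTypeWithCid(2, "b", 20)}

# A set can only be probed with the complete value (all fields known).
full_value_found = TestTypeWithCid(1, "a", 10) in option1
# Knowing only cid forces a scan over every element.
by_cid_scan = [v for v in option1 if v.cid == 1]

# Option 2 / clustering-column analogy: cid is the key, the UDT is the value.
TestType = namedtuple("TestType", ["type", "hid"])
option2 = {1: TestType("a", 10), 2: TestType("b", 20), 5: TestType("c", 30)}

by_cid = option2[1]                                        # direct lookup by cid
cid_range = [c for c in sorted(option2) if 2 <= c <= 5]    # range over cid
picked = [option2[c] for c in (1, 5)]                      # like cid in (1, 5)
```

In the real table the lookup, range, and IN cases are served by the clustering column cid, which is what the dict keys imitate here.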

Related

Cassandra dynamic column family

I am new to Cassandra and I have read some articles about static and dynamic column families.
They mention that from Cassandra 3 onwards, table and column family mean the same thing.
I created key space, some tables and inserted data into that table.
CREATE TABLE subscribers(
id uuid,
email text,
first_name text,
last_name text,
PRIMARY KEY(id,email)
);
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test#123.com','Test1','User1');
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test2#222.com','Test2','User2');
INSERT INTO subscribers(id,email,first_name,last_name)
VALUES(now(),'Test3#333.com','Test3','User3');
It all seems to work fine.
But what I need is to create a dynamic column family with only data types and no predefined columns.
The insert query should be able to take different arguments each time and still be accepted by the table.
The articles mention that for a dynamic column family there is no need to create a schema (predefined columns).
I am not sure whether this is possible in Cassandra or whether my understanding is wrong.
Please let me know if this is possible.
If it is, kindly provide some examples.
Thanks in advance.
I think the articles you're referring to were written in the first years of Cassandra, when it was based on the Thrift protocol. Cassandra Query Language was introduced many years ago, and it is now the way to work with Cassandra. Thrift is deprecated in Cassandra 3.x and fully removed in 4.0 (not yet released at the time of writing).
If you really need to have fully dynamic stuff, then you can try to emulate this by using table with columns as maps from text to specific type, like this:
create table abc (
id int primary key,
imap map<text,int>,
tmap map<text,text>,
... more types
);
But you need to be careful: there are limitations and performance effects when using collections, especially if you want to store more than a few hundred elements.
Another approach is to store the data as individual rows:
create table xxxx (
id int,
col_name text,
ival int,
tval text,
... more types
primary key(id, col_name));
Then you can insert individual values as separate rows, one per emulated column:
insert into xxxx(id, col_name, ival) values (1, 'col1', 1);
insert into xxxx(id, col_name, tval) values (1, 'col2', 'text');
and select all columns as:
select * from xxxx where id = 1;
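On the client side, the rows returned by that select can be folded back into one record per id. The sketch below assumes each row arrives as a dict with the columns of the xxxx table above and that exactly one of the typed value columns is set per row. The to_record helper and the sample rows are hypothetical; no driver call is shown.

```python
# Rows as they might come back from: select * from xxxx where id = 1;
rows = [
    {"id": 1, "col_name": "col1", "ival": 1, "tval": None},
    {"id": 1, "col_name": "col2", "ival": None, "tval": "text"},
]

# Typed value columns of the table; extend this when more types are added.
VALUE_COLUMNS = ("ival", "tval")


def to_record(result_rows, wanted_id):
    """Fold one-row-per-column results into a {col_name: value} dict."""
    record = {}
    for row in result_rows:
        if row["id"] != wanted_id:
            continue
        # Pick the single typed column that actually holds a value.
        value = next(row[c] for c in VALUE_COLUMNS if row[c] is not None)
        record[row["col_name"]] = value
    return record


record = to_record(rows, 1)
```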

Is a frozen list as partition key in Cassandra a good practice, or are there alternatives? [duplicate]

The official documentation tells us to not use UDTs for primary keys. Is there a particular reason for this? What would the potential downsides be in doing this?
That sentence was intended to discourage users from using UDT for PK columns indiscriminately. The main motivation for UDT in its current incarnation (that is, given that Cassandra supports the "frozen" UDT) is for storing more complex values inside collections. Outside collections, UDT can have its uses, but it's worth asking yourself twice if you need it. For example:
CREATE TYPE myType (a text, b int);
CREATE TABLE myTable (id uuid PRIMARY KEY, v frozen<myType>);
is often not very judicious, because you lose the ability to update v.a without also updating v.b. It is actually more flexible to do this directly:
CREATE TABLE myTable (id uuid PRIMARY KEY, a text, b int);
This trivial example points out that UDT outside of collections is not necessarily a good thing, and this also extends to primary key columns. It's not necessarily better to do:
CREATE TYPE myType (a text, b int);
CREATE TABLE myTable (id frozen<myType> PRIMARY KEY);
than more simply:
CREATE TABLE myTable (a text, b int, PRIMARY KEY ((a, b)));
Furthermore, regarding the primary key, any complex UDT probably doesn't make sense. Consider even a moderately complex type like:
CREATE TYPE address (
number int,
street text,
city text,
phones set<text>
);
Using such a type inside a primary key almost surely isn't very useful since the PK identifies rows and so 2 addresses that are the same except for the set of phones wouldn't identify the same row. There are not many situations where that would be desirable. More generally, a PK tends to be relatively simple, and you might want to have fine-grained control over the clustering columns, and so UDT are rarely good candidates.
In summary, UDT in PK columns are not always bad, just not often useful in that context, and so users should not be looking hard for ways to use UDT for PK columns just because it's allowed.
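The earlier point about a frozen value being replaceable only as a whole can be illustrated with a frozen dataclass. This is a Python analogy, not Cassandra behaviour: MyType and the two dict "rows" are made-up stand-ins for the myType UDT and the two table layouts shown above.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class MyType:
    a: str
    b: int


# Layout 1: one frozen value per row, like v frozen<myType>.
frozen_row = {"id": 1, "v": MyType(a="x", b=1)}

# Changing a single field in place is rejected...
try:
    frozen_row["v"].a = "y"
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

# ...so the whole value has to be rebuilt and written back.
frozen_row["v"] = replace(frozen_row["v"], a="y")

# Layout 2: plain columns a and b, each updatable on its own.
flat_row = {"id": 1, "a": "x", "b": 1}
flat_row["a"] = "y"
```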

Selecting data based on specific column in user-defined type

So I have the following columnfamily with its respective types:
CREATE TYPE subudt (
id varint,
-- snipped --
);
CREATE TYPE udt (
name text,
-- snipped --
subudt frozen <subudt>
);
CREATE COLUMNFAMILY tablename (
id varint,
-- snipped --
udt frozen <udt>,
PRIMARY KEY (id)
);
How can I perform a select query on the name field in the udt type? I was looking around and it seems that you cannot use CREATE INDEX on the udt fields, but only on the entire user defined type itself.
Ideally you model the data in Cassandra based on the queries it will serve. Querying for specific fields within a UDT defeats the purpose of having them combined in the first place.
Whether a secondary index helps depends on the type of query it's trying to solve. The performance varies depending on different query structures explained here.
In short, you can't create indexes on specific fields, and ideally you should look at modeling the data differently. It's absolutely okay to duplicate the data to serve different query patterns. For example, maintaining a whole new table to serve queries based on "names" is common.

What is the Difference between tuple and user defined type in Cassandra

Can someone tell me the difference between tuples and user-defined types in Cassandra?
Datastax documents that
You can use a tuple as an alternative to a user-defined type when you
don't need to add new fields.
User-defined types give you more flexibility with altering the number of fields in case you need to update the data later on, as well as allowing you to give meaningful names to each field. The classic example of how UDTs work is an address.
CREATE TYPE mykeyspace.address (
street_number int,
street text,
city text,
zip_code int,
phones set<text>
);
and creating the table
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen <address>
);
The tuple equivalent would be
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen<tuple<int, text, text, int, set<text>>>
);
So tuples would be best for a fixed amount of collected data where field names aren't important (an address column is definitely not a good use case; fields matter--street_number and zip_code could potentially be confused--and you wouldn't be able to add detailed fields later on). UDT would allow this, and also let you query by field name.
Furthermore, there is no significant difference in performance.

Why does Cassandra/CQL restrict the where clause to indexed columns?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
);
When I try to search with "select * from emp where first_name='John';"
the CQL shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue and everywhere it says to add a secondary index on the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
Only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because that would involve scanning all the rows, which is not supported.
The data are internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value and the key of the inner map is a kind of concatenation of all clustering columns values and a name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
You may read about what is allowed and what is not here
Alternatively you can create an index in Cassandra, but that will hamper your write performance.
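The HashMap[X, SortedMap[Y, Z]] picture above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's storage engine: the table dict, the sample rows, and the three helper functions are invented for illustration, with empid as the partition key and deptid as the clustering column, as in the emp table from the question.

```python
from bisect import bisect_left, bisect_right

# partition key (empid) -> list of (deptid, row) pairs kept sorted by deptid
table = {
    1: [(10, {"first_name": "John", "last_name": "Doe"})],
    2: [(10, {"first_name": "Jane", "last_name": "Roe"}),
        (20, {"first_name": "Jane", "last_name": "Roe"})],
    3: [(30, {"first_name": "John", "last_name": "Smith"})],
}


def by_partition(empid):
    """Full partition read: one lookup in the outer map."""
    return table.get(empid, [])


def by_clustering_range(empid, low, high):
    """Range on the clustering column inside a single partition."""
    rows = table.get(empid, [])
    keys = [deptid for deptid, _ in rows]
    return rows[bisect_left(keys, low):bisect_right(keys, high)]


def by_first_name(name):
    """No key path to the data, so every row in every partition is visited."""
    matches, visited = [], 0
    for empid, rows in table.items():
        for deptid, row in rows:
            visited += 1
            if row["first_name"] == name:
                matches.append((empid, deptid))
    return matches, visited
```

The first two helpers follow a path through the keys. The third has to touch all rows, which is the kind of scan the error message in the question is protecting against.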
