Selecting data based on specific column in user-defined type - cassandra

So I have the following columnfamily with its respective types:
CREATE TYPE subudt (
id varint,
-- snipped --
);
CREATE TYPE udt (
name text,
-- snipped --
subudt frozen <subudt>
);
CREATE COLUMNFAMILY tablename (
id varint,
-- snipped --
udt frozen <udt>,
PRIMARY KEY (id)
);
How can I perform a select query on the name field in the udt type? I was looking around and it seems that you cannot use CREATE INDEX on the udt fields, but only on the entire user defined type itself.

Ideally, you model data in Cassandra around the queries it will serve. Querying on specific fields inside a UDT defeats the purpose of combining them in the first place.
Whether a secondary index helps depends on the kind of query it has to serve. The performance varies with the query structure, as explained here.
In short, you can't create indexes on individual fields of the UDT, so ideally you should look at modeling the data differently. It's absolutely okay to duplicate data to serve different query patterns. Maintaining a whole new table just to serve queries by name, for example, is common.
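As a rough sketch of the "separate table" idea, a lookup table could be keyed by the name and written alongside the original one. The table name tablename_by_name and the literal 'some-name' are made up for illustration; the snipped columns are left out.

```cql
-- Hypothetical lookup table: partitioned by the value of udt.name,
-- with the original id as a clustering column to keep rows unique.
CREATE TABLE tablename_by_name (
    name text,
    id varint,
    udt frozen<udt>,
    PRIMARY KEY ((name), id)
);

-- The application writes to both tables; reads by name hit this one.
SELECT id, udt FROM tablename_by_name WHERE name = 'some-name';
```

The cost is that every write has to go to both tables, which is the usual trade-off for query-driven modeling in Cassandra.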

Related

What are the performance implications of sparsely populated Frozen User Defined Type?


We have a frozen UDT with ~2000 fields as one of the columns in a table.
We use this table to implement append-only writes so that the data is auditable and not overwritten.
We are seeing degradation in write performance when only 1 (out of 2000) field in the UDT is populated.
Trying to understand the performance implication of using sparsely populated frozen UDTs. How are UDTs serialized/deserialized internally? Any documentation of this will be highly appreciated.
We tried to gather some metrics from cass session, but couldn't get much information.
edit: Using the C++ Cassandra driver with prepared statements for writes
Cassandra version: 3.11.6
Data Model:
CREATE TYPE udt_xyx (
field1 bigint,
field2 ..
..
..
field2000
);
CREATE TABLE table_xyz (
key_1 text,
txn_id int,
fields frozen<udt_xyx>,
PRIMARY KEY ((key_1), txn_id)
);
Workflow:
A request comes in from the caller to write n fields (out of 2000) for a given key_1.
We assign a unique txn_id (transaction_id) to the request.
Then we create a UDT object which has 2000 fields but only populate n of those fields and persist it in the table.
The new request that comes in for the same key_1 with different (or same) fields will be assigned a new txn_id and written to the table as a new record.
That way we are not updating any currently written UDT, but always creating a new record in the table (associated with new txn_id).
When the UDT is sparsely populated, we are experiencing write performance degradation.
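In CQL terms, the workflow above could look roughly like the sketch below. The key, the txn_id values, and the assumption that field3 and field7 are bigint like field1 are all made up, since the real field types are elided. Fields left out of a UDT literal are null. As far as the native protocol specification describes it, a UDT value is serialized as one length-prefixed entry per field, with null encoded as a negative length, so unset fields are probably not free on the wire.

```cql
-- First request for this key: only 2 of the ~2000 fields are set.
INSERT INTO table_xyz (key_1, txn_id, fields)
VALUES ('key-a', 1001, { field1: 42, field7: 7 });

-- A later request for the same key gets a new txn_id and becomes a new row;
-- the row written above is left untouched.
INSERT INTO table_xyz (key_1, txn_id, fields)
VALUES ('key-a', 1002, { field3: 99 });
```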
EDIT:
After doing some analysis we narrowed down the slowness to this:
https://github.com/datastax/cpp-driver/blob/master/src/data_type.hpp#L352-L380
Basically, every time we bind a UDT, the "check" method runs and compares the string names of every field in the UDT.
Since we have ~2000 fields and do over 100,000 binds, that is on the order of 200 million string comparisons (2,000 × 100,000).
What performance are you measuring here? Comparing performance to inserting data using non-UDT columns into a table versus inserting data using both non-UDT columns and UDT-type columns?
A column whose type is a frozen collection (set, map, or list) or UDT can only have its value replaced as a whole. In other words, we can't add, update, or delete individual elements of the collection as we can with non-frozen collection types. So the frozen keyword can be useful, for example, when we want to protect collections against single-value updates.
For example, in case of the below snippet,
CREATE TYPE IF NOT EXISTS race (
race_title text,
race_date date
);
CREATE TABLE IF NOT EXISTS race_data (
id INT PRIMARY KEY,
races frozen<list<race>>
...
);
the UDT nested in the list is frozen, so the entire list will be read when querying the table.
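A small sketch of what "replaced as a whole" means for the race_data table above. The race titles, dates, and id are invented, and the elided columns are ignored.

```cql
-- Accepted: the frozen list is written as one complete value.
UPDATE race_data
SET races = [ { race_title: 'Spring Classic', race_date: '2021-04-11' } ]
WHERE id = 1;

-- Rejected for a frozen column: appending a single element in place.
-- UPDATE race_data
-- SET races = races + [ { race_title: 'Autumn Classic', race_date: '2021-10-03' } ]
-- WHERE id = 1;
```

To add a race, the application has to send the full new list, which for an existing row means knowing the current contents first.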
Since you did not say how you're updating the frozen collection, it is hard to triage why there is a performance concern here.
References for exploration:
Freezing collections
Essentially, you cannot append to a frozen type in place: to change part of an existing value, you always have to read it first and then write the whole value back.

Better way to define UDT's in Cassandra database

We are trying to remove 2 columns in a table with 3 types and turn them into a UDT instead of keeping them as columns. We came up with the two options below. I just wanted to understand whether there is any difference between these two UDTs in Cassandra.
First option is:
CREATE TYPE test_type (
cid int,
type text,
hid int
);
and then using it like this in a table definition:
test_types set<frozen<test_type>>,
vs
Second option is:
CREATE TYPE test_type (
type text,
hid int
);
and then using it like this in a table definition:
test_types map<int, frozen<test_type>>
So I am just curious: is one of them the preferred option for performance reasons, or are they generally the same?
It really depends on how you will use it. In the first solution you won't be able to select an element by cid, because to access a set element you need to specify the full UDT value, with all fields.
A better solution would be the following, assuming that you have only one collection column:
CREATE TYPE test_type (
type text,
hid int
);
create table test (
pk int,
cid int,
udt frozen<test_type>,
primary key(pk, cid)
);
In this case:
you can easily select an individual element by specifying the full primary key. The ability to select individual elements from a map only arrives in Cassandra 4.0; see CASSANDRA-7396. Until then you need to fetch the whole map back even if you need just one element, and this limits how large the map can be
you can even select a range of values, using a range query
you can get all values by specifying only the partition key (pk in this example)
you can select multiple non-consecutive values by doing select * from test where pk = ... and cid in (..., ..., ...);
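The four access patterns listed above could look like this against the test table. The pk and cid values are placeholders chosen only for illustration.

```cql
-- One element, addressed by the full primary key.
SELECT * FROM test WHERE pk = 1 AND cid = 10;

-- A range of elements within one partition.
SELECT * FROM test WHERE pk = 1 AND cid >= 10 AND cid < 20;

-- Every element of the partition.
SELECT * FROM test WHERE pk = 1;

-- Several non-consecutive elements.
SELECT * FROM test WHERE pk = 1 AND cid IN (10, 25, 40);
```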
See the "Check use of collection types" section in the data model checks best practices doc.

Understanding Cassandra Data Model

I have recently started learning No-SQL and Cassandra through this article. The author explains the data model through this diagram:
The author also gives the below column family example:
Book {
key: 9352130677 { name: "Hadoop The Definitive Guide", author: "Tom White", publisher: "Oreilly", priceInr: 650, category: "hadoop", edition: 4 },
key: 8177228137 { name: "Hadoop in Action", author: "Chuck Lam", publisher: "manning", priceInr: 590, category: "hadoop" },
key: 8177228137 { name: "Cassandra: The Definitive Guide", author: "Eben Hewitt", publisher: "Oreilly", priceInr: 600, category: "cassandra" },
}
But in that tutorial, and in every other tutorial I have gone through, they end up creating regular tables in Cassandra. I am unable to connect the Cassandra model with what I am creating.
For example, I created a column family called Employee as below:
create columnfamily Employee(empid int primary key,empName text,age int);
Now I inserted some data and my column family looks as this:
To me this looks like a regular relational table and not like the data model the author has explained. How do I create an Employee column family where each row represents an employee with different attributes? Something like:
Employee{
101:{name:Emp1,age:20}
102:{name:Emp2,salary:1000}
102:{manager_name:Emp3,age:45}
}
You need to understand that although the CQL representation may look like a regular relational table, the internal structure of the rows in Cassandra is completely different. It saves a different set of attributes for each employee, and the nulls you see when querying with CQL are just a representation of empty/nonexistent cells.
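A quick way to see this with the Employee column family from the question (the values are made up): insert rows that set different subsets of the declared columns and then read them back.

```cql
-- Employee 101 gets a name and an age.
INSERT INTO Employee (empid, empName, age) VALUES (101, 'Emp1', 20);

-- Employee 102 gets only a name; no age cell is written at all.
INSERT INTO Employee (empid, empName) VALUES (102, 'Emp2');

-- cqlsh shows age as null for 102, but that is only how the missing cell is displayed.
SELECT * FROM Employee;
```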
What you are trying to achieve is an unstructured data model. Cassandra started with this model, and everything worked as described in the tutorial you've read, but there is an opinion that unstructured data design is unhealthy for development and creates more problems than it solves. So, after some time, Cassandra moved to a "structured" data model (and from Thrift to CQL). It doesn't mean that you have to store all attributes for all keys/rows, and it doesn't mean that all rows have the same number of attributes; it just means that you have to declare attributes before you use them.
You can achieve some kind of unstructured data modeling using the Map, List, Set, etc. data types, UDTs (user-defined types), or by just saving your data as a JSON string and parsing it on the application side.
What you have understood is correct. Just believe it. Internally, Cassandra stores columns exactly like the image in your question.
Now, what you are expecting is to insert a column that was not defined when creating the Employee table. For dynamic columns, you can always use the Map data type.
For example
create table Employee(
empid int primary key,
empName text,
age int,
attributes Map<text,text>);
To add new attributes you can use below queries.
UPDATE Employee SET attributes = { 'manager_name' : 'Emp3', 'age' : '45' } WHERE empid = 102;
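Note that a plain assignment to a map column replaces the whole map. If the goal is to add or change single attributes while keeping the rest, the map can be updated incrementally. A short sketch against the Employee table above, with illustrative values:

```cql
-- Merge new entries into the existing map instead of replacing it.
UPDATE Employee SET attributes = attributes + { 'manager_name' : 'Emp3' } WHERE empid = 102;

-- Set or overwrite a single entry by key.
UPDATE Employee SET attributes['age'] = '45' WHERE empid = 102;

-- Remove a single entry by key.
DELETE attributes['age'] FROM Employee WHERE empid = 102;
```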
Update -
Another way to create a dynamic column model is as below:
create table Employee(
empid int,
empName text,
attribute text,
attributevalue text,
primary key (empid,empName,attribute)
);
Let's take a few inserts -
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','age','25') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','manager','emp2') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','department','hr') ;
This data structure creates a wide row and behaves like dynamic columns. You can see that the primary key parts empid and empName are the same for all three rows; only the attribute and its value change.
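Reading the attributes back could then look like this, assuming the key is (empid, empName, attribute) so that empid is the partition key and the other two are clustering columns:

```cql
-- All attributes stored for employee 102 (one partition, several rows).
SELECT empName, attribute, attributevalue FROM Employee WHERE empid = 102;

-- A single attribute, addressed by the full primary key.
SELECT attributevalue FROM Employee
WHERE empid = 102 AND empName = 'Emp1' AND attribute = 'age';
```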
Hope this will help
Cassandra uses a special primary key called a composite key. Its partition key part identifies a partition and is used to determine the nodes on which the rows are stored. This is also one reason why Cassandra scales well.
The result in your console may be a result set of rows, but Cassandra's internal organization is different from that. Have you ever tried to query a table without the primary key? You will quickly see that you can't query that flexibly (because of the partitioning).
After that you will understand why we have to use a query-first design approach for Cassandra. This is completely different from an RDBMS.

What is the Difference between tuple and user defined type in Cassandra

Can someone tell me the difference between tuples and user-defined types in Cassandra?
Datastax documents that
You can use a tuple as an alternative to a user-defined type when you
don't need to add new fields.
User-defined types give you more flexibility to alter the number of fields in case you need to update the data later on, and they let you give meaningful names to each field. The classic example of how UDTs work is an address.
CREATE TYPE mykeyspace.address (
street_number int,
street text,
city text,
zip_code int,
phones set<text>
);
and creating the table
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen <address>
);
The tuple equivalent would be
CREATE TABLE users (
id uuid PRIMARY KEY,
address frozen<tuple<int, text, text, int, set<text>>>
);
So tuples would be best for a fixed amount of collected data where field names aren't important (an address column is definitely not a good use case; fields matter--street_number and zip_code could potentially be confused--and you wouldn't be able to add detailed fields later on). UDT would allow this, and also let you query by field name.
Furthermore, there is no significant difference in performance.
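To make the two UDT advantages concrete, here is a short sketch against the UDT version of the users table. It assumes the session is using mykeyspace; the UUID and the new country field are placeholders, not part of the original schema.

```cql
-- Fields of a UDT column can be selected by name.
SELECT address.city, address.zip_code
FROM users
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- A UDT can grow later; a tuple's shape is fixed when the column is declared.
ALTER TYPE mykeyspace.address ADD country text;
```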

Cassandra Hierarchy Data Model

I'm new to designing Cassandra data models and I need some help to think outside the box.
Basically I need a hierarchical table, something pretty standard when talking about Employee.
You have an employee, say Big Boss, who has a list of employees under him.
Something like:
create table employee(id timeuuid, name text, employees list<employee>, primary key(id));
So, is there a way to model a hierarchical model in Cassandra adding the table type itself, or even another approach?
When I try the line above, it gives me
Bad Request: line 1:61 no viable alternative at input 'employee'
EDITED
I was thinking about 2 possibilities:
Store a UUID instead, and in my Java application look up each employee by UUID when loading the "boss".
Work with a map, where the UUID is the id itself and the text holds the entire row; then in my Java application read the map, convert each "text" employee into an Employee entity, and finally return the whole object.
It really depends on your queries...one particular model would only be good for a set of queries, but not others.
You can store ids and look them up again on the client side. This means n extra queries for each "query". This may or may not be a problem, as queries that hit a single partition are fast. Using a map from id to name is also an option. This means you do extra work and denormalise the names into the map values. That's also valid. A third option is to use a UDT (user-defined type). You could then have a list, a set, or even a map of them. In Cassandra 2.1, you can index the map keys/values as well, allowing for some quite flexible querying.
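A sketch of the map option with an index on the map keys. The column name reports, the index name, and the timeuuid literal are made up for illustration; the syntax is the Cassandra 2.1+ form.

```cql
-- Each employee row carries a map of direct reports: id -> denormalised name.
CREATE TABLE employee (
    id timeuuid PRIMARY KEY,
    name text,
    reports map<timeuuid, text>
);

-- Index the map keys so a boss can be found from one of their reports.
CREATE INDEX employee_reports_keys ON employee (KEYS(reports));

SELECT id, name FROM employee
WHERE reports CONTAINS KEY 50554d6e-29bb-11e5-b345-feff819cdc9f;
```

Whether the index is worth it depends on the queries; looking up the boss by their own id needs no index at all.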
https://www.datastax.com/documentation/cql/3.1/cql/cql_using/cqlUseUDT.html
One more approach could be to store a person's details as id, static columns for their attributes, and have "children" as columns in wide row format.
This could look like
create table person(
id int,
name text static,
age int static,
employee_id int,
employee frozen<employeeudt>,
primary key (id, employee_id)
);
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
Querying this will give you rows with the static properties repeated, but on disk, it's still held once. You can resolve the rest client side.
