How can we do CRUD operations on complex data models in Cassandra? - cassandra

I have a project using NoSQL.
I have a column family for my customers.
At first the column family has just an "id" column.
It will then be updated by adding new columns.
The number and types of columns could be different for each customer.
Also, each column can contain sub-columns, again with ids, and those will be altered too. So they should be indexed. A document store is not useful for this problem.
I've read about NoSQL and decided to use Cassandra. I would be thankful if you could answer these questions:
Is the above possible?
How can we create this column family and run CRUD operations on it?
If the answer to the previous question is yes, what is the type of a query result?
Will it return several rows for each primary key (id)?
How can we manage that so the data can be accessed like a table with no redundancy? I don't know whether this summarizing should be handled on the DB side or on the code side.
Thank you for your help.

Related

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra and found the following on Wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[29]
It says that 'different rows in the same column family do not have to share the same set of columns', but how is that implemented? I have read almost all the documents on the official site.
I can create a table and insert data as below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
This is very much like what I did in a relational database. So how do I create multiple rows with different columns?
I also found that the official documentation mentions 'Flexible schema'; how should I understand it here?
Thanks very much in advance.
Column family comes from the original design of Cassandra, when the data model looked like Google BigTable or Apache HBase, and the Thrift protocol was used for communication. But this required that the schema be defined inside the application, which makes access to the data from many applications more problematic, as you need to update the schema inside all of them...
CREATE TABLE and INSERT are part of the Cassandra Query Language (CQL), which was introduced a long time ago and replaced the Thrift-based implementation (Cassandra 4.0 removed Thrift support completely). In CQL you need to have a schema defined for a table, providing each column's name and type. If you really need dynamic columns, there are several approaches (I'll link answers that I have already written over time, so there won't be duplicates):
If you have values of the same type, you can use one column for the name of the attribute/column and another to store the value, as described here
If you have values of different types, you can also use one column for the name of the attribute/column and define multiple columns for the values, one for each data type (int, text, ...), inserting the value into the corresponding column only (described here)
You can use maps (described here). This is similar to the first or second approach, but is mostly designed for a very small number of "dynamic columns" and has other limitations; for example, you need to read the full map to fetch one value, etc.
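The second approach above (attribute name as part of the key, one value column per type) can be modelled in memory to show roughly what CRUD on such a table looks like. This is only a sketch under stated assumptions: the table name customer_attrs, its column names and the AttrStore class are hypothetical, and the CQL in the comments is an assumed analogue rather than something taken from the linked answers.

```python
from typing import Dict, Optional, Tuple, Union

Value = Union[int, str]


class AttrStore:
    """In-memory stand-in for a table assumed to look like:

        CREATE TABLE customer_attrs (
            id int, attr_name text, int_value int, text_value text,
            PRIMARY KEY (id, attr_name));
    """

    def __init__(self) -> None:
        # (id, attr_name) -> {"int_value": ..., "text_value": ...}
        self.rows: Dict[Tuple[int, str], Dict[str, Optional[Value]]] = {}

    def upsert(self, cid: int, name: str, value: Value) -> None:
        # Assumed analogue of INSERT/UPDATE: only the column matching
        # the value's type is filled, the other one stays null.
        row: Dict[str, Optional[Value]] = {"int_value": None, "text_value": None}
        if isinstance(value, int):
            row["int_value"] = value
        else:
            row["text_value"] = value
        self.rows[(cid, name)] = row

    def read(self, cid: int) -> Dict[str, Value]:
        # Assumed analogue of SELECT ... WHERE id = ?: one row per
        # attribute, ordered by attr_name.
        out: Dict[str, Value] = {}
        for (rid, name), row in sorted(self.rows.items()):
            if rid == cid:
                if row["int_value"] is not None:
                    out[name] = row["int_value"]
                else:
                    out[name] = row["text_value"]
        return out

    def delete(self, cid: int, name: str) -> None:
        # Assumed analogue of DELETE ... WHERE id = ? AND attr_name = ?
        self.rows.pop((cid, name), None)


store = AttrStore()
store.upsert(1, "city", "Noida")
store.upsert(1, "score", 85)
store.upsert(2, "name", "ankur")  # a different set of "columns" for another id
store.upsert(1, "score", 90)      # update an existing attribute
store.delete(1, "city")
```

In this model a query for one id returns one entry per attribute, so ids with different attribute sets coexist without any schema change.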

Cassandra Table Modeling

Imagine a table with thousands of columns, where most data in the row record is null. One of the columns is an ID, and this ID is known upfront.
select id,SomeRandomColumn
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
SomeRandomColumn is one of thousands, and in most cases the only column with data. SomeRandomColumn is NOT known upfront as the one that contains data.
Is there a CQL query that can do something like this.
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong unless there is a CQL query that looks something like this with one query;
select ColumnHint.{DataColumnName}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
In MongoDB I would just have a collection, and the document I got back would have a "Type" attribute describing the data. So perhaps my real question is how do I replicate what I can do with MongoDB in Cassandra. My Cassandra journey so far has been to create a UDT for each unique document, followed by altering the table to add this new UDT as a column. My starter table looks like this, where ColumnDataName is the hint:
CREATE TABLE IF NOT EXISTS WideProductInstance (
Id uuid,
ColumnDataName text,
PRIMARY KEY (Id)
);
Thanks
Is there a CQL query that can do something like this.
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
No, you cannot do that, and it's pretty easy to explain why. To know whether a column contains data, Cassandra needs to read it. And once it has read the data, the disk effort is already spent, so it just returns the data to the client.
The only saving you would get if Cassandra were capable of filtering out null columns is network bandwidth ...
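Since the whole row comes back anyway, dropping the empty columns is left to the client. A minimal sketch, assuming the driver hands the row back as a mapping from column name to value; the helper name and the sample row are made up for illustration:

```python
from typing import Any, Dict


def non_null_columns(row: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the columns that actually hold a value."""
    return {name: value for name, value in row.items() if value is not None}


# Hypothetical row: most columns are null, one carries data.
row = {
    "id": "92e72b9e-7507-4c83-9207-c357df57b318",
    "col_a": None,
    "some_random_column": 42,
    "col_b": None,
}
populated = non_null_columns(row)
```

The check is against None specifically, so legitimate values such as 0 or an empty string are kept.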
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong unless there is a CQL query that looks something like this with one query;
Your idea amounts to storing, in another table, a list of all the columns that actually contain real data rather than null. It sounds like a JOIN, which is bad and not supported. And if you need to read this reference table before reading the original table, you'll have to read from several places, which is going to be expensive.
So perhaps my real question is how do I replicate what I can do with MongoDB in Cassandra.
Don't try to replicate the same feature from Mongo in Cassandra. The two databases have fundamentally different architectures. What you have to do is reason about your functional use case: "How do I want to fetch my data from Cassandra?" From that point, design a proper data model. A Cassandra data model is designed around its queries.
The best advice for you is to watch some Cassandra data modeling videos (they're free) at http://academy.datastax.com

data organization in cassandra

I am moving from an RDBMS to Cassandra. The documentation says that Cassandra is a column-family-based data store. I take that to mean that a row will be divided into multiple column families, and a particular column family of all the rows will be stored in one place for fast access. At the same time, it is written that a row belongs to only one column family in Cassandra, and that one should think of the Cassandra model as Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. So how is that a column family structure? As row keys are used as the first-level map, all the columns of a particular row will be close together on disk, rather than the column families of all the rows. What am I getting wrong? An example or a link to some clear documents would be much appreciated, as most of the blogs have copied a page from NoSQL Distilled.
There are a bunch of good articles on the DataStax site about data modeling, primary key structures and so on.
You can think of column families as tables in RDBMS terms, but with a different set of capabilities and limitations.
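The Map<RowKey, SortedMap<ColumnKey, ColumnValue>> picture from the question can be made concrete with a toy model. This is a sketch of the logical view only, not of Cassandra's storage engine; the ColumnFamily class and the sample keys are hypothetical.

```python
from typing import Dict, List, Tuple


class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self) -> None:
        # First level: row key -> that row's own columns.
        self.rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        self.rows.setdefault(row_key, {})[column] = value

    def get_row(self, row_key: str) -> List[Tuple[str, str]]:
        # Second level: columns come back ordered by column key.
        return sorted(self.rows.get(row_key, {}).items())


cf = ColumnFamily()
cf.put("user1", "name", "ashish")
cf.put("user1", "city", "Noida")
cf.put("user2", "score", "90")  # a row with a different set of columns
```

In this view all columns of one row sit together under its row key, which matches the question's reading: the row key is the first-level lookup, and each row carries its own sorted columns.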

how to implement fixed number of (timeuuid) columns in cassandra (with CQL)?

Here is an example use case:
You need to store the last N (let's say 1000 as a fixed bucket size) user actions with all details in timeuuid-based columns.
Normally, each user's actions are already in a "UserAction" column family, with the user id as row key and the actions in timeuuid columns. You may also have an "AllActions" column family which stores all actions, with the same timeuuid as column name and the user id as column value. It's basically a relationship column family, but unfortunately without any details of the user actions. Querying with this column family is expensive, I guess, because of the random partitioner. On the other hand, if you store all details in the "AllActions" CF, then at some point Cassandra can't handle that big a row properly. This is why I want to store the last N user actions with all details in a fixed number of timeuuid-based columns.
Maybe you have a better design solution for this use case... I'd like to hear it...
If not, the question is how to implement a fixed number of (timeuuid) columns in Cassandra (with CQL) effectively.
After insertion we could delete old (overflow) columns if we had some sort of range support in CQL's DELETE. AFAIK there is no support for this.
So, any ideas? Thanks in advance...
IMHO, this is something that C* must handle itself, like compaction. It's not a good idea to handle this on the client side.
Maybe we need some configuration (storage) options for column families to make them suitable for "most recent data".
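Purely to illustrate the behaviour being asked for (insert, then drop the overflow so only the newest N remain), here is an in-memory model. The class name is hypothetical and plain integers stand in for timeuuids; it says nothing about how Cassandra would or should implement this.

```python
import bisect
from typing import List, Tuple


class LastNActions:
    """Keeps only the newest `limit` actions, ordered by timestamp."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.actions: List[Tuple[int, str]] = []  # kept sorted, oldest first

    def add(self, ts: int, detail: str) -> None:
        # Insert in timestamp order, then trim the oldest overflow entries.
        bisect.insort(self.actions, (ts, detail))
        del self.actions[:-self.limit]

    def newest_first(self) -> List[Tuple[int, str]]:
        return list(reversed(self.actions))


bucket = LastNActions(3)
for ts in (1, 2, 3, 4, 5):
    bucket.add(ts, f"action-{ts}")
```

After the five inserts only the three most recent actions remain, which is the fixed-size bucket described in the question.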

how to define dynamic columns in a column family in Cassandra?

We don't want to fix the column definitions when creating a column family, as we might have to insert new columns into the column family later. Is it possible to achieve this? I am wondering whether it is possible not to define the column metadata when creating a column family, but to specify the column when the client updates data, for example:
CREATE COLUMN FAMILY products WITH default_validation_class= UTF8Type AND key_validation_class=UTF8Type AND comparator=UTF8Type;
set products['1001']['brand'] = 'Sony';
Thanks,
Fan
Yes... it is possible to achieve this, without even taking any special effort. Per the DataStax documentation of the Cassandra data model (a good read, by the way, along with the CQL spec):
The Cassandra data model is a schema-optional, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.

Resources