Data organization in Cassandra

I am moving from an RDBMS to Cassandra. The documentation says that Cassandra uses a column-family-based data structure. I understood this to mean that a row is divided into multiple column families, and that a particular column family of all the rows is stored in one place for fast access. At the same time, it is written that a row belongs to only one column family in Cassandra, and that one should think of the Cassandra model as Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. So in what sense is that a column family structure? Since row keys form the first-level map, all the columns of a particular row will be close together on disk, rather than a column family of all the rows. What am I getting wrong? An example or a link to some clear documentation would be much appreciated, as most of the blogs have copied a page from NoSQL Distilled.

There are a number of good articles on the DataStax site about data modeling, primary key structures, and related topics.
You can think of column families as tables in RDBMS terms, but with a different set of capabilities and limitations.
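
To make the Map<RowKey, SortedMap<ColumnKey, ColumnValue>> picture from the question concrete, here is a minimal Python sketch of a single column family: a map from row key to that row's own columns, kept sorted by column key. The `ColumnFamily` class, its methods, and the sample data are hypothetical and only illustrate the shape of the model; this is not Cassandra's API, and the real on-disk storage is considerably more involved.

```python
from bisect import insort


class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self, name):
        self.name = name
        self._values = {}   # row key -> {column key: value}
        self._order = {}    # row key -> column keys kept in sorted order

    def put(self, row_key, column, value):
        cols = self._values.setdefault(row_key, {})
        if column not in cols:
            # New column for this row: keep the key list sorted.
            insort(self._order.setdefault(row_key, []), column)
        cols[column] = value

    def row(self, row_key):
        """Return all columns of one row, in column-key order."""
        cols = self._values.get(row_key, {})
        return [(c, cols[c]) for c in self._order.get(row_key, [])]


users = ColumnFamily("users")        # made-up sample data
users.put("u1", "name", "alice")
users.put("u1", "city", "Pune")
users.put("u2", "name", "bob")       # u2 has no "city" column at all

print(users.row("u1"))  # [('city', 'Pune'), ('name', 'alice')]
print(users.row("u2"))  # [('name', 'bob')]
```

In this model a row lives in exactly one column family, and all of its columns sit together under its row key, which matches the "table with different capabilities" reading rather than a column-by-column layout.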

Related

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra, and found the following on Wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[29]
It says that 'different rows in the same column family do not have to share the same set of columns', but how do I implement that? I have read almost all the documents on the official site.
I can create a table and insert data like below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
This is very much like what I did in a relational database. So how do I create multiple rows with different columns?
I also found that the official documentation mentions a 'Flexible schema'. How should I understand it here?
Thanks very much in advance.
The term "column family" comes from the original design of Cassandra, when the data model resembled Google Bigtable or Apache HBase and the Thrift protocol was used for communication. But this required the schema to be defined inside the application, which makes access to the data from multiple applications more problematic, because you need to update the schema inside all of them...
The CREATE TABLE and INSERT statements are part of the Cassandra Query Language (CQL), which was introduced a long time ago and replaced the Thrift-based implementation (Cassandra 4.0 removed Thrift support completely). In CQL you need to define a schema for a table, providing the name and type of each column. If you really need dynamic columns, there are several approaches (I'll link answers that I have already written over time, so there won't be duplicates):
If you have values of the same type, you can use one column for the name of the attribute/column and another to store the value, as described here.
If you have values of different types, you can also use one column for the name of the attribute/column and define multiple value columns, one for each data type (int, text, ...), inserting each value into the corresponding column only (described here).
You can use maps (described here). This is similar to the first or second approach, but is mostly designed for a very small number of "dynamic columns" and has other limitations; for example, you need to read the full map to fetch one value.
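
The second approach can be illustrated without a cluster. The sketch below is a plain-Python model, not CQL and not a driver call: it imitates a table whose primary key is (entity id, attribute name) and which has one value column per data type. The names `AttributeTable`, `int_value`, and `text_value` are hypothetical; the sample values are borrowed from the Emp_record example in the question.

```python
class AttributeTable:
    """In-memory stand-in for a table keyed by (entity_id, attr_name)
    with one value column per supported type."""

    # Python type -> name of the typed value column (hypothetical names)
    TYPED_COLUMNS = {int: "int_value", str: "text_value"}

    def __init__(self):
        self._rows = {}  # (entity_id, attr_name) -> {typed column: value}

    def set_attr(self, entity_id, attr_name, value):
        column = self.TYPED_COLUMNS.get(type(value))
        if column is None:
            raise TypeError(f"no value column for {type(value).__name__}")
        # Only the column matching the value's type is populated.
        self._rows[(entity_id, attr_name)] = {column: value}

    def get_attrs(self, entity_id):
        """Collect every attribute stored for one entity."""
        result = {}
        for (eid, attr), cols in sorted(self._rows.items()):
            if eid == entity_id:
                (value,) = cols.values()
                result[attr] = value
        return result


table = AttributeTable()
table.set_attr(101, "E_name", "ashish")
table.set_attr(101, "E_score", 85)
table.set_attr(102, "E_city", "meerut")   # a different set of "columns"

print(table.get_attrs(101))  # {'E_name': 'ashish', 'E_score': 85}
print(table.get_attrs(102))  # {'E_city': 'meerut'}
```

The table schema itself stays fixed (an id, an attribute name, and the typed value columns), while each entity ends up with its own set of attributes.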

How we can do CRUD operations on complex data models in Cassandra?

I have a project using NOSQL.
I have a column family for my customers.
The column family has just "id" at first.
Then it will be updated by adding new columns.
The count and type of columns could be different for each customer.
Also, each column can contain sub-columns, which have ids as well and would also be altered, so they should be indexed. Documents are not useful for this issue.
I've read about NoSQL and decided to use Cassandra. I would be thankful if you could answer these questions:
Is the above possible?
How can we create this column family and run CRUD operations on it?
If the answer to the previous question is yes, what is the type of a query result?
Will it return some rows for each primary key (id)?
How can we manage that so that the table is accessed with no redundancy? I don't know whether this summarizing should be managed on the DB side or on the code side.
Thank you for your help.

Is Cassandra a column oriented or columnar database

A columnar database should store groups of columns together. But Cassandra stores data row-wise.
An SSTable holds multiple rows of data mapped to their corresponding partition key. So I feel like Cassandra is a row-wise data store like MySQL, but with other benefits such as "wide rows" and the fact that not every column has to be present in every row, and of course it's in memory. Please correct me if I'm wrong.
If you go to the Apache Cassandra project on GitHub, and scroll down to the "Executive Summary," you will get your answer:
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
"So I feel like Cassandra is a row wise data store"
And that would be correct.
In a column-oriented (columnar) database, data is stored on disk column by column.
For example, take this Bonuses table:
ID  Last   First  Bonus
1   Doe    John   8000
2   Smith  Jane   4000
3   Beck   Sam    1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
Cassandra is basically a column-family store.
Cassandra would store the above data as:
Bonuses: { row1: { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000}, row2: { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000} ... }
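
The three layouts described above can be reproduced with a short Python sketch. It only mimics the textual serializations used in this answer; the function names are made up for illustration, and none of this reflects an actual on-disk format.

```python
# The Bonuses table from the example above.
header = ["ID", "Last", "First", "Bonus"]
rows = [
    [1, "Doe", "John", 8000],
    [2, "Smith", "Jane", 4000],
    [3, "Beck", "Sam", 1000],
]


def row_oriented(rows):
    """Serialize one whole row after another."""
    return "".join(",".join(map(str, r)) + ";" for r in rows)


def column_oriented(rows):
    """Serialize one whole column after another."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(map(str, c)) + ";" for c in columns)


def column_family(header, rows):
    """Nested map: row key -> {column name: value}."""
    return {f"row{i}": dict(zip(header, r)) for i, r in enumerate(rows, start=1)}


print(row_oriented(rows))     # 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
print(column_oriented(rows))  # 1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
print(column_family(header, rows)["row2"])
# {'ID': 2, 'Last': 'Smith', 'First': 'Jane', 'Bonus': 4000}
```

Note that in the nested-map form every value of a row still sits under that row's key, which is closer to the row-oriented string than to the column-oriented one.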
Vertica, VectorWise, and MonetDB are some column-oriented databases that I've heard of.
Read this for more details.
Hope this helps.
A good way of thinking about Cassandra is as a map of maps, where the inner maps are sorted by key. A partition has many columns, and they are always stored together. They are sorted by clustering keys: first by the first key, then by the next, and so on. Partitions are then replicated among replicas. Data is not necessarily stored as "rows", since different rows are stored on different nodes depending on the replication strategy and the active hashing algorithm. In other words, the partition for ProductId 1 is likely not stored next to the one for ProductId 2 if ProductId is the partition key. However, the columns for ProductId 1 are always stored together.
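
As a rough illustration of that description, the sketch below places partitions on nodes by hashing the partition key and keeps the rows inside each partition sorted by a clustering key. Everything here is an assumption made for the example: the node names, the `review_date` clustering key, the `rating` column, and the use of SHA-256 modulo the node count, which is only a stand-in and not Cassandra's partitioner, token ring, or replication logic.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Pick a node from a hash of the partition key (stand-in hashing)."""
    digest = hashlib.sha256(str(partition_key).encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# node -> partition key -> list of (clustering key, columns)
cluster = defaultdict(lambda: defaultdict(list))


def insert(product_id, review_date, columns):
    partition = cluster[node_for(product_id)][product_id]
    partition.append((review_date, columns))
    partition.sort(key=lambda row: row[0])  # keep rows in clustering order


insert(1, "2021-03-01", {"rating": 4})
insert(2, "2021-01-15", {"rating": 5})
insert(1, "2020-12-24", {"rating": 3})

# All rows of ProductId 1 live on one node, sorted by clustering key.
print(cluster[node_for(1)][1])
# [('2020-12-24', {'rating': 3}), ('2021-03-01', {'rating': 4})]
```

Whether ProductId 1 and ProductId 2 end up on the same node depends only on the hash, which is the point of the paragraph above: adjacency of partition keys says nothing about where the partitions are stored.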
As for definitions, most NoSQL stores are blurring the lines one way or the other. They usually span multiple categories. I'll leave it up to you to decide whether this qualifies as a columnar database or not :)
It is a wide-column database; such databases are also known as column family databases.
The definition from Wikipedia also helps further:
Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout. In genuine column stores, a columnar data layout is adopted such that each column is stored separately on disk. Wide-column stores do often support the notion of column families that are stored separately. However, each such column family typically contains multiple columns that are used together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately. Wide-column stores that support column families are also known as column family databases.
Reference: https://en.wikipedia.org/wiki/Wide-column_store

how to define dynamic columns in a column family in Cassandra?

We don't want to fix the column definitions when creating a column family, as we might have to insert new columns into it later. Is it possible to achieve this? I am wondering whether it is possible not to define the column metadata when creating a column family, but to specify the column when the client updates data, for example:
CREATE COLUMN FAMILY products WITH default_validation_class= UTF8Type AND key_validation_class=UTF8Type AND comparator=UTF8Type;
set products['1001']['brand'] = 'Sony';
Thanks,
Fan
Yes... it is possible to achieve this, without even taking any special effort. Per the DataStax documentation of the Cassandra data model (a good read, by the way, along with the CQL spec):
The Cassandra data model is a schema-optional, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.

Are column families within a single keyspace related in any way?

What's the difference between having keyspace Foo and column families A and B in it vs. having two keyspaces FooA and FooB with one column family in each?
The API makes it look as if these two were pretty much equivalent.
As a bonus question, how do supercolumns fit into this picture?
Keyspace: a namespace for ColumnFamilies, typically one per application. A keyspace is the first dimension of the Cassandra hash, and is the container for column families. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and are also the structure on which batch inserts are applied.
ColumnFamilies contain multiple columns, each of which has a name, value, and a timestamp, and which are referenced by row keys.
SuperColumns can be thought of as columns that themselves have subcolumns (columnfamily within columnfamily).
A more fine grained explanation of the Cassandra data model is found here
