Are column families within a single keyspace related in any way? - cassandra

What's the difference between having keyspace Foo and column families A and B in it vs. having two keyspaces FooA and FooB with one column family in each?
The API makes it look as if these two were pretty much equivalent.
As a bonus question, how do supercolumns fit into this picture?

Keyspace: a namespace for ColumnFamilies, typically one per application. A keyspace is the first dimension of the Cassandra hash, and is the container for column families. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and are also the structure on which batch inserts are applied.
ColumnFamilies contain multiple columns, each of which has a name, value, and a timestamp, and which are referenced by row keys.
SuperColumns can be thought of as columns that themselves have subcolumns (columnfamily within columnfamily).
A more fine grained explanation of the Cassandra data model is found here
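To make the nesting concrete, here is a minimal Python sketch of the model described above. It is only an in-memory illustration under simplifying assumptions, not Cassandra's API: the names Column, keyspaces and insert, and the sample data, are hypothetical.

```python
from dataclasses import dataclass
import time


@dataclass
class Column:
    name: str
    value: object
    timestamp: int  # write time, used to decide which write is newest


# keyspace name -> column family name -> row key -> column name -> Column
keyspaces = {"Foo": {"A": {}, "B": {}}}


def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Store one column under a row key, keeping the newest timestamp."""
    if timestamp is None:
        timestamp = time.time_ns()
    row = keyspaces[keyspace][column_family].setdefault(row_key, {})
    current = row.get(name)
    if current is None or timestamp >= current.timestamp:
        row[name] = Column(name, value, timestamp)


# Rows in the same column family need not share the same columns.
insert("Foo", "A", "user1", "email", "u1@example.com", timestamp=1)
insert("Foo", "A", "user2", "city", "Noida", timestamp=1)
# In this sketch, the same row key in another column family is a separate row.
insert("Foo", "B", "user1", "score", 85, timestamp=1)
```

In this sketch the only thing column families A and B share is the enclosing keyspace entry, which is where keyspace-level settings would live.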

Related

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra, and found below in the wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[29]
It says that 'different rows in the same column family do not have to share the same set of columns', but how is this implemented? I have read almost all the documents on the official site.
I can create table and insert data like below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
This is very much like what I did in a relational database. So how do I create multiple rows with different columns?
I also found that the official documentation mentions 'Flexible schema'; how should I understand it here?
Thanks very much in advance.
Column family is from the original design of Cassandra, when the data model looked like the Google BigTable or Apache HBase, and Thrift protocol was used for communication. But this required that schema was defined inside the application, and that makes access to data from many applications more problematic, as you need to update the schema inside all of them...
The CREATE TABLE and INSERT statements are part of the Cassandra Query Language (CQL), which was introduced a long time ago and replaced the Thrift-based implementation (Cassandra 4.0 completely removed Thrift support). In CQL you need to have a schema defined for a table, where you need to provide column names & types. If you really need to have dynamic columns, there are several approaches to that (I'll link answers that I have already written over time, so there won't be duplicates):
If you have values of the same type, you can use one column as the name of the attribute/column, and another to store the value, as described here
If you have values of different types, you can also use one column as the name of the attribute/column, and define multiple columns for values - one for each of the data types: int, text, ... - and insert the value into the corresponding column only (described here)
You can use maps (described here) - this is similar to the first or second approach, but is mostly designed for a very small number of "dynamic columns", and has other limitations: for example, you need to read the full map to fetch one value.
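The second approach can be sketched in Python as follows. This is only an in-memory stand-in, assuming a table whose primary key is (entity_id, attr_name) with one value column per data type; the names rows, set_attr, get_attrs, int_value and text_value are hypothetical and not taken from the linked answers.

```python
# Hypothetical in-memory stand-in for a table whose primary key is
# (entity_id, attr_name) and which has one value column per data type.
rows = {}


def set_attr(entity_id, attr_name, value):
    """Write the value into the column that matches its type."""
    row = {"int_value": None, "text_value": None}
    if isinstance(value, bool):
        raise TypeError("only int and text are modelled in this sketch")
    if isinstance(value, int):
        row["int_value"] = value
    elif isinstance(value, str):
        row["text_value"] = value
    else:
        raise TypeError("only int and text are modelled in this sketch")
    rows[(entity_id, attr_name)] = row


def get_attrs(entity_id):
    """Collect all dynamic attributes of one entity into a dict."""
    result = {}
    for (eid, attr_name), row in rows.items():
        if eid == entity_id:
            if row["int_value"] is not None:
                result[attr_name] = row["int_value"]
            else:
                result[attr_name] = row["text_value"]
    return result


set_attr(101, "score", 85)
set_attr(101, "name", "ashish")
set_attr(102, "city", "meerut")
```

Here entity 101 ends up with the attributes score and name while entity 102 only has city, even though the "table" itself has a fixed set of columns.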

Conceptual difference concerning column families in Cassandras data model compared to Bigtable?

I am currently trying to dig into Cassandra's data model and its relation to Bigtable, but ended up with a strong headache concerning the Column Family concept.
Mainly my question was asked and already answered. However, I'm not satisfied with the answers :)
Firstly I've read the Bigtable paper especially concerning its data model, i.e. how data is stored. As far as I understood each table in Bigtable basically relies on a multi-dimensional sparse map with the dimensions row, column and time. The map is sorted by rows. Columns can be grouped with the name convention family:qualifier to a column family. Therefore, a single row can contain multiple column families (see the example figure in the paper).
Although it is stated that Cassandra relies on Bigtable data model, I read multiple times that in Cassandra a column family contains multiple rows and is to some extent comparable to a table in relational data stores. Isn't this contrary to Bigtable's approach, where a row could contain multiple column families? What comes first, the column family or row :)? Are these concepts even comparable?
The answer you linked to was from 6 years ago, and a lot has changed in Cassandra since. When Cassandra started out, its data model was indeed based on BigTable's. A row of data could include any number of columns, each of these columns has a name and a value. A row could have a thousand different columns, and a different row could have a thousand other columns - rows do not have to have the same columns. Such a database is called "schema-less", because there is no schema that each row needs to adhere to.
But Toto, we're not in Kansas any more - and Cassandra's model has changed in focus (though not in essence) since, and I'll try to explain how and why:
As Cassandra matured, its developers started to realize that schema-less isn't as great as they once thought it was. Schemas are valuable in ensuring application correctness. Moreover, one doesn't normally get to 1000 columns in a single row just because there are 1000 individually-named fields in one record. Rather, the more common case is that the record actually contains 200 entries, each with 5 fields. The schema should fix these 5 fields that every one of these entries should have, and what defines each of these separate entries is called a "clustering key". So around the time of Cassandra 0.8, six years ago, these ideas were introduced to Cassandra as "CQL" (the Cassandra Query Language).
For example, in CQL one declares that a column-family (which was dutifully renamed "table") has a schema, with a known list of fields:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
PRIMARY KEY (groupname, username)
)
This schema says that each wide row in the table (in modern Cassandra, this was renamed a "partition") with the key "groupname" is a possibly long list of users, each with username, email and age fields. The first name in the "PRIMARY KEY" specifier is the partition key (it determines the key of the wide rows), and the second is called the clustering key (it determines the key of the small rows that together make up the wide rows).
Despite the new CQL dressup, Cassandra continued to implement these new concepts using the good-old-BigTable-wide-row-without-schema implementation. For example, consider that our data has a group "mygroup" with two people, (john, john#somewhere.com, 27) and (joe, joe#somewhere.com, 38). Cassandra adds the following four column names->values to the wide row:
john:email -> john#somewhere.com
john:age -> 27
joe:email -> joe#somewhere.com
joe:age -> 38
Note how we ended up with a wide row with 4 columns - 2 non-key fields per row (email and age), multiplied by the number of rows in the partition (2). The clustering key field "username" no longer appears anywhere as a value, but rather as part of the column's name! So if we have two username values "john" and "joe", we have some columns prefixed "john" and some columns prefixed "joe", and when we read the column "joe:email" we know this is the value of the email field of the row which has username=joe.
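The mapping between CQL rows and composite column names can be sketched in a few lines of Python. This is a simplified illustration using the "mygroup" data from above, not Cassandra's actual storage code; the function names to_wide_row and from_wide_row are hypothetical, and the sketch does not model the sorting of columns by name or the partition key itself.

```python
def to_wide_row(cql_rows, clustering_key):
    """Flatten the CQL rows of one partition into 'clustering:field' columns."""
    wide_row = {}
    for row in cql_rows:
        prefix = row[clustering_key]
        for field, value in row.items():
            if field != clustering_key:
                wide_row[f"{prefix}:{field}"] = value
    return wide_row


def from_wide_row(wide_row, clustering_key):
    """Rebuild the CQL rows from the composite column names."""
    rows = {}
    for column_name, value in wide_row.items():
        prefix, field = column_name.split(":", 1)
        rows.setdefault(prefix, {clustering_key: prefix})[field] = value
    return list(rows.values())


mygroup = [
    {"username": "john", "email": "john#somewhere.com", "age": 27},
    {"username": "joe", "email": "joe#somewhere.com", "age": 38},
]
wide = to_wide_row(mygroup, "username")
```

The resulting dict has four entries, and from_wide_row(wide, "username") recovers the two original rows, which is the duality between CQL rows and the old-style wide row.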
Cassandra still has this internal duality - converting the user-facing CQL rows and clustering keys into old-style wide rows. Until recently, Cassandra's on-disk format known as "SSTables" was still schema-less and used composite names as shown above for column names. I wrote a detailed description of the SSTable format on Scylla's site https://github.com/scylladb/scylla/wiki/SSTables-Data-File (Scylla is a more efficient C++ re-implementation of Cassandra to which I contribute). However, column names are very inefficient in this format so Cassandra recently (in version 3.0) switched to a different file format, which for the first time, accepts clustering keys and schema-full rows as first class citizens. This was the last nail in the coffin of the schema-less Cassandra from 7 years ago. Cassandra is now schema-full, all the way.

data organization in cassandra

I am moving from RDBMS to Cassandra. The documentation says that Cassandra is a column family based data structure. It means that a row will be divided into multiple column families and a particular column family of all the rows will be stored in one place for fast access. At the same time it is written that a row belongs to only one column family in Cassandra, and that one should think of the Cassandra model as Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. So how is that a column family structure? As row keys are used as the first-level map, all the columns of a particular row will be close on disk, rather than the column families of all the rows. What am I getting wrong? An example or a link to some clear documents would be much appreciated, as most of the blogs have copied a page from NoSQL Distilled.
There are a number of good articles on the DataStax site about data modeling, PK structures and so on.
You can think about column families as tables in RDBMS terms but with another set of capabilities and limitations.

Is Cassandra a column oriented or columnar database

A columnar database should store groups of columns together. But Cassandra stores data row-wise.
An SSTable will hold multiple rows of data mapped to their corresponding partition key. So I feel like Cassandra is a row-wise data store like MySQL, but with other benefits such as "wide rows", not every column having to be present in every row, and of course being in memory. Please correct me if I'm wrong.
If you go to the Apache Cassandra project on GitHub, and scroll down to the "Executive Summary," you will get your answer:
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
"So I feel like Cassandra is a row wise data store"
And that would be correct.
In a Column oriented or a columnar database data are stored on disk in a column wise manner.
e.g. the Bonuses table:
ID  Last   First  Bonus
1   Doe    John   8000
2   Smith  Jane   4000
3   Beck   Sam    1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
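The two layouts can be reproduced with a short Python sketch. It is only an illustration of the serialization order for the Bonuses table above, not how any particular database writes its files; the function names row_oriented and column_oriented are hypothetical.

```python
bonuses = [
    (1, "Doe", "John", 8000),
    (2, "Smith", "Jane", 4000),
    (3, "Beck", "Sam", 1000),
]


def row_oriented(table):
    """Serialize record by record: all fields of a row are adjacent."""
    return "".join(",".join(str(v) for v in row) + ";" for row in table)


def column_oriented(table):
    """Serialize column by column: all values of a field are adjacent."""
    columns = zip(*table)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)
```

Calling row_oriented(bonuses) and column_oriented(bonuses) yields the two strings shown above; the only difference is whether the table is transposed before it is written out.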
Cassandra is basically a column-family store
Cassandra would store the above data as:
Bonuses: { row1: { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000}, row2: { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000} ... }
Vertica, VectorWise, MonetDB are some column oriented databases that I've heard of.
Read this for more details.
Hope this helps.
A good way of thinking about Cassandra is as a map of maps, where the inner maps are sorted by key. A partition has many columns, and they are always stored together. They are sorted by clustering keys - first by the first key, then the next, then the next, and so on. Partitions are then replicated amongst replicas. The data is not necessarily stored as "rows", as different rows are stored on different nodes based on the replication strategy and the active hashing algorithm. In other words, the partition for ProductId 1 is likely not stored next to the one for ProductId 2 if ProductId is the partition key. However, the columns for ProductId 1 are always stored together.
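A rough Python sketch of this "map of maps" view follows. It makes several simplifying assumptions that should be stated loudly: the node list, the MD5-modulo placement and the names node_for, write and read_partition are all hypothetical stand-ins. Real Cassandra uses a partitioner (Murmur3 by default) with a token ring and replication, none of which is modelled here.

```python
import hashlib

NODES = ["node1", "node2", "node3"]  # hypothetical cluster

# partition key -> {clustering key tuple -> row}
partitions = {}


def node_for(partition_key):
    """Pick a node from a hash of the partition key (simplified placement)."""
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


def write(partition_key, clustering_key, row):
    """Store a row inside its partition under its clustering key."""
    partitions.setdefault(partition_key, {})[clustering_key] = row


def read_partition(partition_key):
    """Return the rows of one partition ordered by clustering key."""
    items = partitions.get(partition_key, {}).items()
    return [row for _, row in sorted(items, key=lambda item: item[0])]


# ProductId is the partition key; (year, month) is the clustering key.
write(1, (2024, 5), {"sales": 10})
write(1, (2023, 12), {"sales": 7})
write(1, (2024, 1), {"sales": 3})
write(2, (2024, 1), {"sales": 4})
```

All rows for ProductId 1 live in one inner map and come back sorted first by year and then by month, while node_for(1) and node_for(2) are computed independently and may differ.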
As for definitions, most NoSQL stores are blurring the lines one way or the other. They usually span multiple categories. I'll leave it up to you to decide whether this qualifies as a columnar database or not :)
It is a wide-column database; such databases are also known as column family databases.
The definition from Wikipedia also helps further:
Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout. In genuine column stores, a columnar data layout is adopted such that each column is stored separately on disk. Wide-column stores do often support the notion of column families that are stored separately. However, each such column family typically contains multiple columns that are used together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately. Wide-column stores that support column families are also known as column family databases.
Reference: https://en.wikipedia.org/wiki/Wide-column_store

What's the difference between creating a table and creating a columnfamily in Cassandra?

I need details from both the performance and the query aspects. I learnt from some site that only a key can be given when using a columnfamily; if so, what would you suggest for my keyspace? I need to use group by, order by, count, sum, ifnull, concat, joins, and sometimes nested queries.
To answer the original question you posed: a column family and a table are the same thing.
The name "column family" was used in the older Thrift API.
The name "table" is used in the newer CQL API.
More info on the APIs can be found here:
http://wiki.apache.org/cassandra/API
If you need to use "group by, order by, count, sum, ifnull, concat, joins, and sometimes nested queries" as you state, then you probably don't want to use Cassandra, since it doesn't support most of those.
CQL supports COUNT, but only up to 10000. It supports ORDER BY, but only on clustering keys. The other things you mention are not supported at all.
Refer to the document: https://cassandra.apache.org/doc/old/CQL-3.0.html
It specifies that the LRM of CQL supports the TABLE keyword wherever COLUMNFAMILY is supported.
This shows that TABLE and COLUMNFAMILY are synonyms.
In Cassandra there is no difference between a table and a columnfamily. They are one concept.
For Cassandra 3+ and cqlsh 5.0.1
To verify, enter into a cqlsh prompt within keyspace (ksp):
CREATE COLUMNFAMILY myTable (
id text PRIMARY KEY,
name int
);
And type 'desc myTable'.
You'll see (table options omitted):
CREATE TABLE ksp.mytable (
id text PRIMARY KEY,
name int
) ...
They are synonyms, and Cassandra uses table by default.
Here is a small example to help understand the concept.
A keyspace is an object that holds the column families, user defined types.
CREATE KEYSPACE University
WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 3};
CREATE TABLE University.student(roll int PRIMARY KEY,
dept text,
name text,
semester int);
With 'Create table', the table 'student' is created in the keyspace 'University' with the columns roll, dept, name and semester. roll is the primary key, and it is also the partition key.
All the data for a given roll value will be in a single partition.
Key aspects while altering Keyspace in Cassandra
Keyspace Name: Keyspace name cannot be altered in Cassandra.
Strategy Name: Strategy name can be altered by specifying new strategy name.
Replication Factor: Replication factor can be altered by specifying new replication factor.
DURABLE_WRITES: The DURABLE_WRITES value can be altered by specifying true or false. By default, it is true. If set to false, no updates will be written to the commit log; if set to true, they will be.
Execution: the "Alter Keyspace" command can, for example, change the keyspace strategy from 'SimpleStrategy' to 'NetworkTopologyStrategy' and the replication factor from 3 to 1 for DataCenter1.
A column family is somewhat related to a relational database's table, with differences in distribution and perhaps even in philosophy.
Imagine you have a user entity that might contain 15 columns. In a relational DB you might want to divide the columns into small structures of related columns, which we all know as tables. In a distributed DB such as Cassandra you'll be able to concatenate the entries of all those tables into a single long row, so if you use a profiler/DB manager you'll see a single table with 15 columns instead of 2 or 3 tables. Another interesting thing is that every column family is written to different nodes, maybe on a different cluster, and is recognized by the row key, meaning that you'll have a single key for all the column families and won't need to maintain a PK or FK for every table, or maintain the relationships between them with 1-1, 1-n, n-n relations. Easy!
