Is Cassandra a column oriented or columnar database - cassandra

Columnar database should store group of columns together. But Cassandra stores data row-wise.
SS Table will hold multiple rows of data mapped to their corresponding partition key. So I feel like Cassandra is a row wise data store like MySQL but has other benefits like "wide rows" and every columns are not necessarily to be present for all the rows and of course it's in memory . Please correct me if I'm wrong.

If you go to the Apache Cassandra project on GitHub, and scroll down to the "Executive Summary," you will get your answer:
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
"So I feel like Cassandra is a row wise data store"
And that would be correct.

In a Column oriented or a columnar database data are stored on disk in a column wise manner.
e.g: Table Bonuses table
ID Last First Bonus
1 Doe John 8000
2 Smith Jane 4000
3 Beck Sam 1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
Cassandra is basically a column-family store
Cassandra would store the above data as:
Bonuses: { row1: { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000}, row2: { "ID":2, "Last":"Smith", "Jane":"John", "Bonus":4000} ... }
Vertica, VectorWise, MonetDB are some column oriented databases that I've heard of.
Read this for more details.
Hope this helps.

A good way of thinking about cassandra is as a map of maps, where the inner maps are sorted by key. A partition has many columns, and they are always stored together. They are sorted by clustering keys - first by the first key, then the next, then next...and so on. Partitions are then replicated amongst replicas. It's not necessarily stored as "rows" as different rows are stored on different nodes based on replication strategy and active hashing algorithm. In other words, a partition for ProductId 1 is likely not stored next to ProductId 2 if ProductId is the partition key. However the coloumns for Product Id 1, are always stored together.
As for definitions, most NoSQL stores are blurring the lines one way or the other. They usually span multiple categories. I'll leave it up to you to decide whether this qualifies as a columnar database or not :)

It is a wide column database and is also known as column family databases.
The definition from Wikipedia also helps further:
Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout. In genuine column stores, a columnar data layout is adopted such that each column is stored separately on disk. Wide-column stores do often support the notion of column families that are stored separately. However, each such column family typically contains multiple columns that are used together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately. Wide-column stores that support column families are also known as column family databases.
Reference: https://en.wikipedia.org/wiki/Wide-column_store

Related

How Cassandra stores the column data on disk?

Say I insert three rows in cassandra in below order one by one
ID,firstname, lastname, websitename
1:fname1, lname1, site1
2:fname2, lname2, site2
3:fname3, lname3, site3
The column store stores columns together, like this:
1:fname1,2:fname2,3:fname3
1:lname1,2:lname2,3:lname3
1:site1,2:site2,3:site3
Does it mean when I insert the first row i.e 1:fname1, lname1, site1, it will each column in separate disk block for all three columns so that
during firstname column has to be read in some query. all related column data is on single block ?
Will it not make write slow as it cassandra has to store the data in 3 blocks instead of one to ensure column data is tored together ?
Cassandra is not a column-oriented database, it is a partition-row store, this means that the data in your example will be stored like this:
"YourTable" : {
row1 : { "ID":1, "firstname":"fname1", "lastname":"lname1", "websitename":"site1", "timestamp":1582988571},
row2 : { "ID":2, "firstname":"fname2", "lastname":"lname2", "websitename":"site2", "timestamp":1582989563}
row3 : { "ID":3, "firstname":"fname3", "lastname":"lname3", "websitename":"site3", "timestamp":1582989572}
...
}
The data is grouped and searched based on the primary key (which is the partition key and could include one or several clustering keys).
Some things to consider:
Cassandra is an append-only store, this means that when you try to update or delete a record, internally it will create a new record with the new value and a different timestamp; for the delete operation it will add a meta-data called "tombstone" that identifies the records that will be removed
Adding or removing nodes to the cluster will trigger a rearrangement of the tokens distribution, this means that the instance or server where a record can be located or maintained may change
Cassandra isn't a classical column store. It stores all inserted/updated data together, organized first by partition key, and then inside partition by clustering columns/primary keys. Data could be in different SSTables when you update them at different time point, but the compaction process will eventually try to merge them together.
If you're interested, you can use sstabledump against data files and see how data is stored. There is also a very good blog post from The Last Pickle about storage engine in the Cassandra 3.0 (it's different from previous versions).
Cassandra is basically a column-family database or row partitioned database along with column information not column based/columnar/column oriented database. When insert/fetch we need to mention partition(aka row key , aka primary key) column information. We can add any column at any point of time.
Column-family stores, like Cassandra, is great if you have high throughput writes and want to be able to linearly scale horizontally.
The term "column-family" comes from the original storage engine that was a key/value store, where the value was a "family" of column/value tuples. There was no hard limit on the number of columns that each key could have.

Cassandra - Same partition key in different tables - when it is right?

I modeled my Cassandra in a way that i have couple of tables with the same partition key - Uuid.
Each table has it's partition key and others column representing data for specific query i would like to ask.
For example - 1 table have Uuid and column regarding it's status (no other clustering keys in this table) and table 2 will contain the same Uuid (Also without clustering keys) but with different columns representing the data for this Uuid.
Is it the right modeling? Is it wrong to duplicate the same partition key around tables in order to group each table to hold relevant column for specific use case? or it preferred to use only 1 table and query them and taking the relevant data for the specific use case in the code?
There's nothing wrong with this modeling. Whether it is better, or worse, than the obvious alternative of having just one table with both pieces of data, depends on your workload:
For example, if you commonly need to read both status and data columns of the same uuid, then these reads will be more efficient if both things are in the same table, which only needs to be looked up once. If you always read just one but not both, then reads will be more efficient from separate tables. Also, if this workload is not read-mostly but rather write-mostly, then writing to just one table instead of two will be more efficient.

data organization in cassandra

I am moving from RDBMS to Cassandra.Documentation saya that Cassandra is a column family based data structure. It means that a row will be divided in multiple column families and particular column family of all the rows will be stored at one place for fast access. At the same time it is written that a row belongs to only one column family in Cassandra and think of Cassandra model like Map<RowKey, SortedMap<ColumnKey, ColumnValue>> . So how does is that column family structure now ? As row keys are used as first level map, all the columns of a particular row will be close on disk, rather than column families of all the rows. What I am getting wrong ? An example or link to some clear documents will be much appreciated as most of the blogs have copied a page from Nosql Distilled
There is a bunch of good articles on DataStax site: about data modeling, PK structures and stuff.
You can think about column families as tables in RDBMS terms but with another set of capabilities and limitations.

Is cassandra a row column database?

Im trying to learn cassandra but im confused with the terminology.
Many instances it says the row stores key/value pairs.
but, when I define a table its more like declaring a SQL table ie; you create a table and specify the column names and data types.
Can someone clarify this?
Cassandra is a column based NoSQL database. While yes at its lowest level it does store simple key-value pairs it stores these key-value pairs in collections. This grouping of keys and collections is analogous to rows and columns in a traditional relational model. Cassandra tables contain a schema and can be referenced (with restrictions) using a SQL-like language called CQL.
In your comment you ask about Apples being stored in a different table from oranges. The answer to that specific question is No it will be in the same table. However Cassandra tables have an additional concept call the Partition Key that doesn't really have an analgous concept in the relational world. Take for example the following table definition
CREATE TABLE fruit_types {
fruit text,
location text,
cost float,
PRIMARY KEY ((fruit), location)
}
In this table definition you will notice that we are defining the schema for the table. You will also notice that we are defining a PRIMARY KEY. This primary key is similar but not exactly like a relational concept. In Cassandra the PRIMAY KEY is made up of two parts the PARTITION KEY and CLUSTERING COLUMNS. The PARTITION KEY is the first fields specified in the PRIMARY KEY and can contain one or more fields delimitated by parenthesis. The purpose of the PARTITION KEY is to be hashed and used to define the node that owns the data and is also used to physically divide the information on the disk into files. The CLUSTERING COLUMNS make up the other columns listed in the PRIMARY KEY and amongst other things are used for defining how the data is physically stored on the disk inside the different files as specified by the PARTITION KEY. I suggest you do some additional reading on the PRIMARY KEY here if your interested in more detail:
https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html
Basically cassandra storage is like sparse matrix, earlier version has a command line tool called cqlsh which can show the exact storage foot print of your columnfamily(aka table in latest version). Later community decided to keep RDBMS kind of syntax for better understanding coz the query language(CQL) syntax is similar to sql.
main storage is key(partition) (which is hash function result of chosen partition column in your table and rest of the columns will be tagged to it like sparse matrix.

Why many refer to Cassandra as a Column oriented database?

Reading several papers and documents on internet, I found many contradictory information about the Cassandra data model. There are many which identify it as a column oriented database, other as a row-oriented and then who define it as a hybrid way of both.
According to what I know about how Cassandra stores file, it uses the *-Index.db file to access at the right position of the *-Data.db file where it is stored the bloom filter, column index and then the columns of the required row.
In my opinion, this is strictly row-oriented. Is there something I'm missing?
If you take a look at the Readme file at Apache Cassandra git repo, it says that,
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
Column oriented or columnar databases are stored on disk column wise.
e.g: Table Bonuses table
ID Last First Bonus
1 Doe John 8000
2 Smith Jane 4000
3 Beck Sam 1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
Cassandra is basically a column-family store
Cassandra would store the above data as,
"Bonuses" : {
row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
...
}
Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.
Read this for more details.
Yes, the "column-oriented" terminology is a bit confusing.
The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.
So in a columnfamily called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.
apple -> colour weight price variety
"red" 100 40 "Cox"
orange -> colour weight price origin
"orange" 120 50 "Spain"
One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):
temperature -> 2012-09-01 2012-09-02 2012-09-03 ...
40 41 39 ...
which is quite different from a relational model, where one would have to model the entries of a time series as rows not columns. This type of usage is often referred to as "wide rows".
You both make good points and it can be confusing. In the example where
apple -> colour weight price variety
"red" 100 40 "Cox"
apple is the key value and the column is the data, which contains all 4 data items. From what was described it sounds like all 4 data items are stored together as a single object then parsed by the application to pull just the value required. Therefore from an IO perspective I need to read the entire object. IMHO this is inherently row (or object) based not column based.
Column based storage became popular for warehousing, because it offers extreme compression and reduced IO for full table scans (DW) but at the cost of increased IO for OLTP when you needed to pull every column (select *). Most queries don't need every column and due to compression the IO can be greatly reduced for full table scans for just a few columns. Let me provide an example
apple -> colour weight price variety
"red" 100 40 "Cox"
grape -> colour weight price variety
"red" 100 40 "Cox"
We have two different fruits, but both have a colour = red. If we store colour in a separate disk page (block) from weight, price and variety so the only thing stored is colour, then when we compress the page we can achieve extreme compression due to a lot of de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colour's. Now to read everything with the colour red it might be 1 IO instead of thousands of IO's which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row since the row might have hundreds of columns and a single update (or insert) could require hundreds of IO's.
Unless I'm missing something I wouldn't call this columnar based, I'd call it object based. It's still not clear on how objects are arranged on disk. Are multiple objects placed into the same disk page? Is there any way of ensuring objects with the same meta data go together? To the point that one fruit might contain different data than another fruit since its just meta data or xml or whatever you want to store in the object itself, is there a way to ensure certain matching fruit types are stored together to increase efficiency?
Larry
The most unambiguous term I have come across is wide-column store.
It is a kind of two-dimensional key-value store, where you use a row key and a column key to access data.
The main difference between this model and the relational ones (both row-oriented and column-oriented) is that the column information is part of the data.
This implies data can be sparse. That means different rows don't need to share the same column names nor number of columns. This enables semi-structured data or schema free tables.
You can think of wide-column stores as tables that can hold an unlimited number of columns, and thus are wide.
Here's a couple of links to back this up:
This mongodb article
This Datastax article mentions it too, although it classifies Cassandra as a key-value store.
This db-engines article
This 2013 article
Wikipedia
Column Family does not mean it is column-oriented. Cassandra is column family but not column-oriented. It stores the row with all its column families together.
Hbase is column family as well as stores column families in column-oriented fashion. Different column families are stored separately in a node or they can even reside in different node.
IMO that's the wrong term used for Cassandra. Instead, it is more appropriate to call it row-partition store. Let me provide you some details on it:
Primary Key, Partitioning Key, Clustering Columns, and Data Columns:
Every table must have a primary key with unique constraint.
Primary Key = Partition key + Clustering Columns
# Example
Primary Key: ((col1, col2), col3, col4) # primary key uniquely identifies a row
# we need to choose its components partition key
# and clustering columns so that each row can be
# uniquely identified
Partition Key: (col1, col2) # decides on which node to store the data
# partitioning key is mandatory, and it
# can be made up of one column or multiple
Clustering Columns: col3, col4 # decides arrangement within a partition
# clustering columns are optional
Partition key is the first component of Primary key. Its hashed value is used to determine the node to store the data. The partition key can be a compound key consisting of multiple columns. We want almost equal spreads of data, and we keep this in mind while choosing primary key.
Any fields listed after the Partition Key in Primary Key are called Clustering Columns. These store data in ascending order within the partition. The clustering column component also helps in making sure the primary key of each row is unique.
You can use as many clustering columns as you would like. You cannot use the clustering columns out of order in the SELECT statement. You may choose to omit using a clustering column in you SELECT statement. That's OK. Just remember to sue them in order when you are using the SELECT statement. But note that, in your CQL query, you can not try to access a column or a clustering column if you have not used the other defined clustering columns. For example, if primary key is (year, artist_name, album_name) and you want to use city column in your query's WHERE clause, then you can use it only if your WHERE clause makes use of all of the columns which are part of primary key.
Tokens:
Cassandra uses tokens to determine which node holds what data. A token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes so that each possible token is owned by a node. Adding more nodes to the cluster or removing old ones leads to redistributing these token among nodes.
A row's partition key is used to calculate a token using a given partitioner (a hash function for computing the token of a partition key) to determine which node owns that row.
Cassandra is Row-partition store:
Row is the smallest unit that stores related data in Cassandra.
Don't think of Cassandra's column family (that is, table) as a RDBMS table, but think of it as a dict of a dict (here dict is data structure similar to Python's OrderedDict):
the outer dict is keyed by a row key (primary key): this determines which partition and which row in partition
the inner dict is keyed by a column key (data columns): this is data in dict with column names as keys
both dict are ordered (by key) and are sorted: the outer dict is sorted by primary key
This model allows you to omit columns or add arbitrary columns at any time, as it allows you to have different data columns for different rows.
Cassandra has a concept of column families(table), which originally comes from BigTable. Though, it is really misleading to call them column-oriented as you mentioned. Within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented.

Resources