Cassandra: choosing a Partition Key

I'm undecided whether it's better, performance-wise, to use a very commonly shared column value (like Country) as partition key for a compound primary key or a rather unique column value (like Last_Name).
Looking at Cassandra 1.2's documentation about indexes I get this:
"When to use an index:
Cassandra's built-in indexes are best on a table
having many rows that contain the indexed value. The more unique
values that exist in a particular column, the more overhead you will
have, on average, to query and maintain the index. For example,
suppose you had a user table with a billion users and wanted to look
up users by the state they lived in. Many users will share the same
column value for state (such as CA, NY, TX, etc.). This would be a
good candidate for an index."
"When not to use an index:
Do not use an index to query a huge volume of records for a small
number of results. For example, if you create an index on a column
that has many distinct values, a query between the fields will incur
many seeks for very few results. In the table with a billion users,
looking up users by their email address (a value that is typically
unique for each user) instead of by their state, is likely to be very
inefficient. It would probably be more efficient to manually maintain
the table as a form of an index instead of using the Cassandra
built-in index. For columns containing unique data, it is sometimes
fine performance-wise to use an index for convenience, as long as the
query volume to the table having an indexed column is moderate and not
under constant load."
Looking at the examples from CQL's SELECT for
"Querying compound primary keys and sorting results", I see something like a UUID being used as partition key... which would indicate that it's preferable to use something rather unique?

The indexing described in the documentation you quoted refers to secondary indexes. In Cassandra there is a difference between primary and secondary indexes. For a secondary index it would indeed be bad to have very unique values; for the components of a primary key, however, it depends on which component we are focusing on. The primary key has these components:
PRIMARY KEY(partitioning key, clustering key_1 ... clustering key_n)
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible. That is why the example you have uses UUIDs.
The clustering key is used for ordering, so that querying columns with a particular clustering key can be more efficient. That is where you want your values to be non-unique, and where frequent unique values would cost you performance.
The cql docs have a good explanation of what is going on.
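The split between distribution (partition key) and ordering (clustering key) can be sketched in Python. This is a toy model, with a made-up node count and MD5 standing in for Cassandra's real partitioner:

```python
import hashlib

NODES = 4  # hypothetical cluster size


def node_for(partition_key):
    # Toy stand-in for a partitioner: hash the key, map it to a node.
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return digest % NODES


# A near-unique partition key (like a UUID) spreads rows over all nodes...
unique_keys = ["user-%d" % i for i in range(1000)]
print(len({node_for(k) for k in unique_keys}))  # 4

# ...while a low-cardinality key like Country can only ever reach a few.
print({node_for(c) for c in ["Italy", "Spain"]})
```

Cassandra actually uses the Murmur3 partitioner over a 64-bit token ring, but the effect is the same: the randomness of the partition key decides how evenly data lands on the nodes.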

If you use CQL3, given a column family:
CREATE TABLE table1 (
    a1 text,
    a2 text,
    b1 text,
    b2 text,
    c1 text,
    c2 text,
    PRIMARY KEY ((a1, a2), b1, b2)
);
by defining a
primary key ( (a1, a2, ...), b1, b2, ... )
This implies that:
a1, a2, ... are fields used to craft the row key in order to:
determine how the data is partitioned
determine what is physically stored in a single row
they are referred to as the row key or partition key
b1, b2, ... are column family fields used to cluster within a row key in order to:
create logical sets inside a single row
allow more flexible search schemes such as range scans
they are referred to as the column key or clustering key
All the remaining fields are effectively multiplexed / duplicated for every possible combination of column keys. Below is an example of how composite keys with partition keys and clustering keys work.
If you want to use range queries, you can use secondary indexes or (starting with CQL3) declare those fields as clustering keys. Having them as clustering keys creates a single wide row, which has an impact on speed when you fetch multiple clustering key values, as in:
select * from accounts where Country>'Italy' and Country<'Spain'
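Because cells within a wide row are kept sorted by the clustering key, such a range predicate becomes a contiguous slice rather than a scan. A minimal Python sketch of that idea, with a hypothetical accounts row clustered by Country:

```python
from bisect import bisect_left, bisect_right

# One wide row: cells sorted by the clustering key (Country).
cells = sorted([("Brazil", 10), ("France", 7), ("Italy", 3),
                ("Japan", 5), ("Norway", 2)])
countries = [c for c, _ in cells]


def range_slice(lower, upper):
    # Two binary searches bound the slice: Country > lower AND Country < upper.
    return cells[bisect_right(countries, lower):bisect_left(countries, upper)]


print(range_slice("Italy", "Spain"))  # [('Japan', 5), ('Norway', 2)]
```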

I am sure you have already got your answer, but this may help your understanding.
CREATE TABLE table1 (
    a1 text,
    a2 text,
    b1 text,
    b2 text,
    c1 text,
    c2 text,
    PRIMARY KEY ((a1, a2), b1, b2)
);
Here the partition key is (a1, a2) and the clustering keys are b1, b2.
The combination of partition keys and clustering keys must be unique for each new record.
The above primary key can be modelled like this:
Node<key, value>
Node<(a1a2), Map<b1b2, otherColumnValues>>
As we know, the partition key is responsible for data distribution across your nodes.
So if you insert 100 records into table1 with the same partition key and different clustering keys, the data will be stored on the same node but in different columns.
Logically we can represent it like this:
Node<(a1a2), Map<string1, otherColumnValues>, Map<string2, otherColumnValues> ... Map<string100, otherColumnValues>>
So the records will be stored sequentially in memory.

Related

How does range query work in Cassandra?

Query I would like to fire
select * from t1 where c1 > 1000 and c2 > 1Million and c3 > 8Million
Data model of table t1
create table t1 (
    c1 int,
    c2 int,
    c3 int,
    c4 text
)
Which columns should I use as the partition key and which as clustering keys?
c1, c2, c3 can have values between 1 and 10 million.
If I do PRIMARY KEY ((c1,c2,c3)) then the values will be spread across the cluster. But since I fire > queries on the c1, c2, c3 columns, how does Cassandra know which nodes to contact, or does it do a full cluster scan?
It won't allow you to make that query without ALLOW FILTERING, which lets it read the entire dataset spread throughout the cluster. It would read everything, throwing away whatever doesn't match. It's highly recommended never to use ALLOW FILTERING outside dev/test unless you're really sure what you're doing.
Partition keys can only be filtered with equalities, not inequalities such as the ones you have. Inequalities can only be used with clustering keys.
If your table does not have that many rows, you can use the bucket strategy. With it you create an auxiliary column to be the only partition key, with a predefined value (such as 1).
create table t1 (
    bucket int,
    c1 int,
    c2 int,
    c3 int,
    c4 text,
    PRIMARY KEY (bucket, c1, c2, c3)
)
Because you have a single partition, it is not adequate for scaling tables with many rows.
If you do have many rows, which you need to partition, then you have to rethink your strategy, and think about:
Finding some kind of key (or keys) in the data that is able to partition the data and at the same time help filtering it when needed. Then you would use it as the partition key in the example above. Maybe denormalizing the data can help bring that key (Ex.: Creating a column called Status for Low/Medium/High numbers, which you could filter better later in the inequality filtering of the clustering keys).
Plan a table (or tables) to be queried by an analytics framework such as Spark. In analytics it is common to need to query by any column, with equalities or inequalities.
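The denormalization idea above can be sketched in Python. The Status buckets and thresholds here are hypothetical, following the Low/Medium/High example:

```python
def status_for(c1):
    # Derive a coarse, low-cardinality partition key from a numeric column.
    if c1 < 1_000_000:
        return "Low"
    if c1 < 5_000_000:
        return "Medium"
    return "High"


# Rows grouped by the derived partition key, as PRIMARY KEY (status, c1) would.
partitions = {}
for c1 in [500, 2_000_000, 9_000_000, 9_500_000]:
    partitions.setdefault(status_for(c1), []).append(c1)

# A query for c1 > 8 million becomes an equality on the partition key
# plus an inequality on the clustering key: only "High" is touched.
print(sorted(v for v in partitions["High"] if v > 8_000_000))
```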

Duplicate partitioning key performance impact in Cassandra

I've read in some posts that having duplicate partitioning key can have a performance impact. I've two tables like:
CREATE TABLE "Test1" (
    key text,
    column1 text,
    value text,
    PRIMARY KEY (key, column1)
)

CREATE TABLE "Test2" (
    key text,
    name text,
    age text,
    ...
    PRIMARY KEY (key, name, age)
)
In Test1, column1 will contain a column name and value will contain its corresponding value. The main advantage of Test1 is that I can add any number of column/value pairs without altering the table, just by providing the same partitioning key each time.
Now my question is how each of these table schemas will impact read/write performance if I have millions of rows, with up to 50 columns in each row. How will it impact compaction/repair time if I write duplicate entries frequently?
For efficient queries, you want to hit a partition (i.e. have the first key of your primary key in your query). Inside a partition, each column is stored in sorted order by the respective clustering keys. Cassandra stores data as a "map of sorted maps".
Your Test1 schema will let you fetch all columns for a key, or a specific column for a key. Each "entry" will be in a separate partition.
For Test2, you can query by key, (key and name), or (key, name and age). But you won't be able to get the age for a key without also specifying the name (without adding a secondary index). For this schema too, each "entry" will be in its own partition.
Cross-partition queries are more expensive than those that hit a single partition. If you're looking for simple key-value lookups, then either schema will suffice. I wouldn't be worried using either for 50 columns. The first will give you direct access to a particular column; the latter will give you access to the whole data for an entry.
What you should focus more on is which structure allows you to do the queries you want. The first won't be very useful for secondary indexes, but the second will, for example.

Why does Cassandra/CQL restrict using a WHERE clause on a column that is not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
)
when I try to search by: "select * from emp where first_name='John';"
cql shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue, and everywhere it says to add a secondary index for the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
The only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because doing so would require scanning all of the rows.
The data are internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value and the key of the inner map is a kind of concatenation of all clustering columns values and a name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
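A small Python sketch of that HashMap[X, SortedMap[Y, Z]] layout (with hypothetical emp data) shows why a primary-key lookup is cheap while a predicate on a regular column like first_name forces a scan of everything:

```python
# {partition key: {clustering key: regular columns}}
emp = {
    1: {10: {"first_name": "John", "last_name": "Doe"}},
    2: {20: {"first_name": "Jane", "last_name": "Smith"}},
}

# Primary-key path: two direct lookups, no scanning.
print(emp[1][10]["first_name"])  # John

# Regular-column predicate: every partition and every row must be visited.
matches = [(empid, deptid)
           for empid, depts in emp.items()
           for deptid, cols in depts.items()
           if cols["first_name"] == "John"]
print(matches)  # [(1, 10)]
```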
You may read about what is allowed and what is not in the CQL SELECT documentation.
Alternatively you can create an index in Cassandra, but that will hamper your write performance.

Does Cassandra Store Columns from Composite Keys on Different Nodes

I'm reading documentation on the Datastax site at http://www.datastax.com/documentation/cassandra/1.2/cassandra/cql_reference/create_table_r.html
and I see:
"When you use a composite partition key, Cassandra treats the columns in nested parentheses as partition keys and stores columns of a row on more than one node. "
The example given is:
CREATE TABLE Cats (
block_id uuid,
breed text,
color text,
short_hair boolean,
PRIMARY KEY ((block_id, breed), color, short_hair)
);
I understand how the cluster columns (in this case, color and short_hair) work in regard to how they are actually stored on disk as contiguous "columns" for the given row. What I don't understand is the line "...stores columns of a row on more than one node". Is this right?
For a given block_id and breed, doesn't this composite key just make a partition key similar to "block_id + breed", in which case the columns/clusters would be in the same row, whose physical location is determined by the partition key (block_id + breed) ?
Or is there some kind of splitting in this row going on because the primary key is based on two fields?
EDIT:
I think Richard's answer below is probably right, but I've also come across this in the Datastax documentation for 1.2 which enforces the first quote I posted:
"composite partition key - Stores columns of a row on more than one node using partition keys declared in nested parentheses of the PRIMARY KEY definition of a table."
Why would it say using plural partition key*s*... The fields that make up the composite key make up the only row key, as far as I know, and they are all used to make the key.
Then they say, the columns of a row can be split, which to me means a single row (with a given partition key) could have its columns split up on different nodes, which would mean the fields of the composite key are being handled separately.
Still a little confused on the Datastax documentation and whether it's actually right.
I think what it means is that rows with the same block_id are stored on different nodes. As you say, the partition key is like "block_id + breed", so columns with the same block_id but different breed will in general be stored on different nodes. But columns with the same block_id and breed will be stored on the same node.
Basically, the nodes that store a partition are found by a function of the partition key only. Whether it is composite or not, nothing else can join together or split rows.
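A toy sketch of that point in Python: the composite partition key is hashed as a single unit (SHA-1 and five nodes here are arbitrary choices, not Cassandra's partitioner):

```python
import hashlib

NODES = 5  # hypothetical


def owner(block_id, breed):
    # The whole composite partition key is hashed together.
    key = "%s:%s" % (block_id, breed)
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NODES


# Same (block_id, breed) always maps to the same node...
print(owner("b1", "siamese") == owner("b1", "siamese"))  # True

# ...but the same block_id with a different breed may land elsewhere.
print(owner("b1", "siamese"), owner("b1", "persian"))
```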

Why do many refer to Cassandra as a column-oriented database?

Reading several papers and documents on the internet, I found much contradictory information about the Cassandra data model. Many identify it as a column-oriented database, others as row-oriented, and still others define it as a hybrid of both.
According to what I know about how Cassandra stores files, it uses the *-Index.db file to seek to the right position in the *-Data.db file, where the bloom filter, column index, and then the columns of the required row are stored.
In my opinion, this is strictly row-oriented. Is there something I'm missing?
If you take a look at the README file in the Apache Cassandra git repo, it says:
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
Column oriented or columnar databases are stored on disk column wise.
e.g.: a Bonuses table
ID Last First Bonus
1 Doe John 8000
2 Smith Jane 4000
3 Beck Sam 1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
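Both layouts can be produced mechanically from the Bonuses rows, which makes the contrast concrete:

```python
rows = [(1, "Doe", "John", 8000),
        (2, "Smith", "Jane", 4000),
        (3, "Beck", "Sam", 1000)]

# Row-oriented: each record is serialized contiguously.
row_layout = "".join("%s,%s,%s,%s;" % r for r in rows)
print(row_layout)  # 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;

# Column-oriented: each column's values are serialized contiguously.
col_layout = "".join(",".join(str(v) for v in col) + ";" for col in zip(*rows))
print(col_layout)  # 1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
```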
Cassandra is basically a column-family store
Cassandra would store the above data as,
"Bonuses" : {
row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
...
}
Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.
Read this for more details.
Yes, the "column-oriented" terminology is a bit confusing.
The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.
So in a columnfamily called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.
apple -> colour weight price variety
"red" 100 40 "Cox"
orange -> colour weight price origin
"orange" 120 50 "Spain"
One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):
temperature -> 2012-09-01 2012-09-02 2012-09-03 ...
40 41 39 ...
which is quite different from a relational model, where one would have to model the entries of a time series as rows not columns. This type of usage is often referred to as "wide rows".
You both make good points and it can be confusing. In the example where
apple -> colour weight price variety
"red" 100 40 "Cox"
apple is the key value and the column is the data, which contains all 4 data items. From what was described it sounds like all 4 data items are stored together as a single object then parsed by the application to pull just the value required. Therefore from an IO perspective I need to read the entire object. IMHO this is inherently row (or object) based not column based.
Column-based storage became popular for warehousing because it offers extreme compression and reduced IO for full table scans (DW), but at the cost of increased IO for OLTP when you need to pull every column (select *). Most queries don't need every column, and due to compression the IO can be greatly reduced for full table scans over just a few columns. Let me provide an example.
apple -> colour weight price variety
"red" 100 40 "Cox"
grape -> colour weight price variety
"red" 100 40 "Cox"
We have two different fruits, but both have colour = red. If we store colour in a separate disk page (block) from weight, price and variety, so that the only thing stored there is colour, then when we compress the page we can achieve extreme compression due to heavy de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colour values. Now reading everything with colour red might take 1 IO instead of thousands of IOs, which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row, since the row might have hundreds of columns and a single update (or insert) could require hundreds of IOs.
Unless I'm missing something, I wouldn't call this columnar based; I'd call it object based. It's still not clear how objects are arranged on disk. Are multiple objects placed in the same disk page? Is there any way of ensuring objects with the same metadata go together? Given that one fruit might contain different data than another fruit, since it's just metadata or XML or whatever you want to store in the object itself, is there a way to ensure certain matching fruit types are stored together to increase efficiency?
Larry
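The de-duplication described above can be sketched as a simple run-length encoding of a colour-only page. This is a toy encoding to show the effect, not any real database's format:

```python
from itertools import groupby


def rle(values):
    # Run-length encode consecutive duplicates: (value, run length) pairs.
    return [(v, len(list(group))) for v, group in groupby(values)]


# A page holding only the colour column of many rows.
colour_page = ["red"] * 9_998 + ["green", "red"]

encoded = rle(sorted(colour_page))  # sorting groups the duplicates together
print(encoded)  # [('green', 1), ('red', 9999)]
print(len(colour_page), "values compress to", len(encoded), "runs")
```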
The most unambiguous term I have come across is wide-column store.
It is a kind of two-dimensional key-value store, where you use a row key and a column key to access data.
The main difference between this model and the relational ones (both row-oriented and column-oriented) is that the column information is part of the data.
This implies data can be sparse. That means different rows don't need to share the same column names nor number of columns. This enables semi-structured data or schema free tables.
You can think of wide-column stores as tables that can hold an unlimited number of columns, and thus are wide.
Here's a couple of links to back this up:
This MongoDB article
This Datastax article mentions it too, although it classifies Cassandra as a key-value store.
This db-engines article
This 2013 article
Wikipedia
Column family does not mean column-oriented. Cassandra is a column-family store but is not column-oriented: it stores a row with all its column families together.
HBase is also a column-family store, but it stores column families in a column-oriented fashion: different column families are stored separately within a node, or they can even reside on different nodes.
IMO that's the wrong term used for Cassandra. Instead, it is more appropriate to call it row-partition store. Let me provide you some details on it:
Primary Key, Partitioning Key, Clustering Columns, and Data Columns:
Every table must have a primary key with unique constraint.
Primary Key = Partition key + Clustering Columns
# Example
# The primary key uniquely identifies a row; choose its partition key
# and clustering columns so that each row can be uniquely identified.
Primary Key: ((col1, col2), col3, col4)

# Decides on which node to store the data; the partition key is
# mandatory and can be made up of one column or multiple.
Partition Key: (col1, col2)

# Decides the arrangement within a partition; clustering columns are optional.
Clustering Columns: col3, col4
Partition key is the first component of Primary key. Its hashed value is used to determine the node to store the data. The partition key can be a compound key consisting of multiple columns. We want almost equal spreads of data, and we keep this in mind while choosing primary key.
Any fields listed after the Partition Key in Primary Key are called Clustering Columns. These store data in ascending order within the partition. The clustering column component also helps in making sure the primary key of each row is unique.
You can use as many clustering columns as you would like, but you cannot use them out of order in a SELECT statement. You may choose to omit a clustering column from your SELECT statement; that's OK. Just remember to use them in order. Note that in your CQL query you cannot restrict a column or a clustering column if you have not used the clustering columns defined before it. For example, if the primary key is (year, artist_name, album_name) and you want to use the city column in your query's WHERE clause, you can do so only if the clause also makes use of all of the columns that are part of the primary key.
Tokens:
Cassandra uses tokens to determine which node holds what data. A token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes so that each possible token is owned by a node. Adding more nodes to the cluster or removing old ones leads to redistributing these tokens among nodes.
A row's partition key is used to calculate a token using a given partitioner (a hash function for computing the token of a partition key) to determine which node owns that row.
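A toy token ring in Python makes this concrete. The tiny token space, the four nodes, and the hand-picked range boundaries are all made up; real Cassandra uses 64-bit Murmur3 tokens:

```python
import bisect
import hashlib

TOKEN_SPACE = 1000  # toy token space; Cassandra's is 64-bit

# Each node owns the range of tokens up to (and including) its boundary.
ring = [(250, "node-A"), (500, "node-B"), (750, "node-C"), (999, "node-D")]
boundaries = [t for t, _ in ring]


def token(partition_key):
    # Toy partitioner: hash the partition key into the token space.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % TOKEN_SPACE


def owner(partition_key):
    # The first node whose boundary is >= the token owns the row.
    i = bisect.bisect_left(boundaries, token(partition_key))
    return ring[i][1]


print(token("user-42"), owner("user-42"))
```

Adding a boundary to ring is the analogue of adding a node: only the split range changes owners, which is why Cassandra can repartition incrementally as nodes join and leave.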
Cassandra is a row-partition store:
A row is the smallest unit that stores related data in Cassandra.
Don't think of Cassandra's column family (that is, table) as an RDBMS table; think of it as a dict of dicts (here dict means a data structure similar to Python's OrderedDict):
the outer dict is keyed by a row key (primary key): this determines which partition the row lives in and which row it is within that partition
the inner dict is keyed by a column key (data columns): this is the row's data, in a dict with column names as keys
both dicts are ordered: the inner dict is sorted by the clustering columns, and the outer dict is ordered by the token of the partition key
This model allows you to omit columns or add arbitrary columns at any time, as it allows you to have different data columns for different rows.
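The dict-of-dicts model with sparse rows can be written out directly (data borrowed from the fruit example earlier; the sorting mimics the ordered inner dict):

```python
from collections import OrderedDict

# Outer dict keyed by row key; inner dict keyed by column name, kept sorted.
table = OrderedDict()
table["apple"] = OrderedDict(
    sorted({"colour": "red", "weight": 100, "variety": "Cox"}.items()))
table["orange"] = OrderedDict(
    sorted({"colour": "orange", "origin": "Spain"}.items()))

# Rows need not share columns: orange omits variety and adds origin.
print(sorted(set(table["apple"]) - set(table["orange"])))  # ['variety', 'weight']
print(list(table["orange"]))  # ['colour', 'origin']
```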
Cassandra has a concept of column families (tables), which originally comes from Bigtable. However, it is really misleading to call them column-oriented, as you mentioned. Within each column family, all columns of a row are stored together along with the row key, and no column compression is used. Thus, the Bigtable model is still mostly row-oriented.
