How to solve 'Secondary indexes cardinality' for cfs.inode? - cassandra

In OpsCenter 6.0.3, I got the following problem
The above figure appeared after clicking 'Services' -> 'Best Practice Service' -> 'Performance Service - Table Metrics Advisor' -> 'Secondary indexes cardinality' in turn.
The inode table viewed in DevCenter looks as follows:
As far as I know, the inode table tracks each file's metadata and block locations. But what can I do to fix this problem?
OpsCenter Version: 6.0.3
Cassandra Version: 2.1.15.1423
DataStax Enterprise Version: 4.8.10

Don't use a secondary index on a high-cardinality column.
High-cardinality refers to columns with values that are very uncommon or unique. High-cardinality column values are typically identification numbers, email addresses, or user names. An example of a data table column with high-cardinality would be a USERS table with a column named USER_ID.
Problems with using a high-cardinality column index (from the DataStax docs):
If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Solution:
Create another table that has that column as its partition key, and query that table instead of the index.
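As a rough illustration of the "manually maintained table as an index" idea, here is a minimal Python sketch in which plain dicts stand in for Cassandra tables keyed by their partition key. The table names (users_by_id, users_by_email) and the helper functions are hypothetical and only model the access pattern; this is not driver code.

```python
# Each dict stands in for a table keyed by its partition key (hypothetical names).
users_by_id = {}      # base table: user_id -> row
users_by_email = {}   # lookup table: email -> user_id (the manually maintained "index")

def insert_user(user_id, email, name):
    """Write to both tables so the lookup table stays in sync with the base table."""
    users_by_id[user_id] = {"email": email, "name": name}
    users_by_email[email] = user_id

def find_by_email(email):
    """Two reads by partition key instead of one secondary-index query."""
    user_id = users_by_email.get(email)
    return None if user_id is None else users_by_id[user_id]

insert_user(1, "a@example.com", "Ann")
insert_user(2, "b@example.com", "Bob")
print(find_by_email("b@example.com"))
```

The cost of this design is that every write touches two tables; the benefit is that each read goes straight to a single partition.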

Related

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra, and found below in the wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.[29]
It says that 'different rows in the same column family do not have to share the same set of columns', but how do I implement that? I have read almost all the documents on the official site.
I can create table and insert data like below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
This is very much like what I would do in a relational database. So how do I create multiple rows with different columns?
I also found that the official documentation mentions 'Flexible schema'. How should I understand that here?
Thanks very much in advance.
Column family comes from the original design of Cassandra, when the data model looked like Google BigTable or Apache HBase, and the Thrift protocol was used for communication. But this required that the schema be defined inside the application, which made access to the data from many applications more problematic, as you need to update the schema inside all of them...
CREATE TABLE and INSERT are part of the Cassandra Query Language (CQL), which was introduced a long time ago and replaced the Thrift-based implementation (Cassandra 4.0 completely removed Thrift support). In CQL you need to have a schema defined for a table, where you provide each column's name and type. If you really need dynamic columns, there are several approaches (I'll link answers that I have already written over time, so there won't be duplicates):
If you have values of the same type, you can use one column as the name of the attribute/column, and another to store the value, as described here
If you have values of different types, you can also use one column as the name of the attribute/column, and define multiple columns for values, one for each of the data types (int, text, ...), and insert the value into the corresponding column only (described here)
You can use maps (described here). This is similar to the first or second approach, but it is mostly designed for a very small number of "dynamic columns", and it has other limitations (for example, you need to read the full map to fetch one value)
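A minimal Python sketch of the first approach (same-typed values), assuming a hypothetical table whose primary key is (entity_id, attr_name). A dict of dicts stands in for the table; the names attributes, set_attr and get_row are made up for illustration.

```python
from collections import defaultdict

# Stand-in for a table with PRIMARY KEY (entity_id, attr_name):
# partition key -> {clustering key (attribute name) -> text value}
attributes = defaultdict(dict)

def set_attr(entity_id, attr_name, value):
    """One row per attribute: the attribute name is data, not part of the schema."""
    attributes[entity_id][attr_name] = value

def get_row(entity_id):
    """Return the attributes of one entity, ordered by name as a clustering column would be."""
    return dict(sorted(attributes[entity_id].items()))

set_attr(101, "name", "ashish")
set_attr(101, "city", "Noida")
set_attr(102, "name", "ankur")
set_attr(102, "score", "90")   # stored as text, since all values share one type

print(get_row(101))
print(get_row(102))
```

Entities 101 and 102 end up with different sets of "columns" even though the table schema itself is fixed.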

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
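The concern can be reproduced with a small simulation. This is only a sketch under stated assumptions: token() here is an md5-based stand-in rather than Cassandra's partitioner, the data is made up, and page_after() merely emulates the semantics of the token query with a LIMIT.

```python
import hashlib

def token(key):
    # Stand-in hash (assumption: md5 truncated to 64 bits, not Murmur3).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)

# Made-up data: partition "a" holds 5 rows, partition "b" holds 2 rows.
rows = [("a", i) for i in range(5)] + [("b", i) for i in range(2)]
# Rows come back in token order, then in clustering order within a partition.
rows.sort(key=lambda r: (token(r[0]), r[1]))

def page_after(last_key, limit):
    """Emulates: SELECT ... WHERE token(pk) > token(last_key) LIMIT limit"""
    if last_key is None:
        matching = rows
    else:
        matching = [r for r in rows if token(r[0]) > token(last_key)]
    return matching[:limit]

first = page_after(None, 3)
second = page_after(first[-1][0], 3)   # resume from the last partition key seen
seen = set(first) | set(second)
missed = [r for r in rows if r not in seen]
print("missed rows:", missed)
```

Because the second query starts strictly after the token of the last partition seen, any rows of that partition that did not fit on the first page are never returned.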
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
But I'm not certain if it's good practice to have a unique partition key for a complex table.
How you should choose your partition key depends on your requirements and data model. If the primary key is just the partition key, it has to be unique; otherwise an insert with an existing key is an upsert (the existing data is overwritten with the new data). If you have wide rows (a clustering key), then making your partition key unique (a key that appears only once in the table) defeats the purpose of wide rows. In CQL, “wide rows” just means that there can be more than one row per partition, but here there would be one row per partition. It would be better if you could provide the schema.
Please see the links below about pagination in Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use pagingState object that represents where you are in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem. Maybe adding this here quickly.
First there is a table with two fields. Just for illustration we use only few fields.
1.Say we insert a million rows with this
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assume there are a hundred entries, shown as 10 pages of 10 entries each.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
Code is self explanatory and here - https://github.com/alexcpn/testgo
Note that cautions about using secondary indexes properly abound. Please check them. In this use case I hope that I am using the index properly. I have not tested this with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

Conceptual difference concerning column families in Cassandras data model compared to Bigtable?

I am currently trying to dig into Cassandra's data model and its relation to Bigtable, but ended up with a strong headache concerning the Column Family concept.
Mainly my question was asked and already answered. However, I'm not satisfied with the answers :)
Firstly I've read the Bigtable paper especially concerning its data model, i.e. how data is stored. As far as I understood each table in Bigtable basically relies on a multi-dimensional sparse map with the dimensions row, column and time. The map is sorted by rows. Columns can be grouped with the name convention family:qualifier to a column family. Therefore, a single row can contain multiple column families (see the example figure in the paper).
Although it is stated that Cassandra relies on Bigtable data model, I read multiple times that in Cassandra a column family contains multiple rows and is to some extent comparable to a table in relational data stores. Isn't this contrary to Bigtable's approach, where a row could contain multiple column families? What comes first, the column family or row :)? Are these concepts even comparable?
The answer you linked to was from 6 years ago, and a lot has changed in Cassandra since. When Cassandra started out, its data model was indeed based on BigTable's. A row of data could include any number of columns, each of these columns has a name and a value. A row could have a thousand different columns, and a different row could have a thousand other columns - rows do not have to have the same columns. Such a database is called "schema-less", because there is no schema that each row needs to adhere to.
But Toto, we're not in Kansas any more - and Cassandra's model changed in focus (though not in essence) since, and I'll try to explain how and why:
As Cassandra matured, its developers started to realize that schema-less isn't as great as they once thought it was. Schemas are valuable in ensuring application correctness. Moreover, one doesn't normally get to 1000 columns in a single row just because there are 1000 individually-named fields in one record. Rather, the more common case is that the record actually contains 200 entries, each with 5 fields. The schema should fix these 5 fields that every one of these entries should have, and what defines each of these separate entries is called a "clustering key". So around the time of Cassandra 0.8, six years ago, these ideas were introduced to Cassandra as the "CQL" (Cassandra Query Language).
For example, in CQL one declares that a column-family (which was dutifully renamed "table") has a schema, with a known list of fields:
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
PRIMARY KEY (groupname, username)
)
This schema says that each wide row in the table (now, in modern Cassandra, this was renamed a "partition") with the key "groupname" is a possibly long list of users, each with username, email and age fields. The first name in the "PRIMARY KEY" specifier is the partition key (it determines the key of the wide rows), and the second is called the clustering key (it determines the key of the small rows that together make up the wide rows).
Despite the new CQL dressup, Cassandra continued to implement these new concepts using the good-old-BigTable-wide-row-without-schema implementation. For example, consider that our data has a group "mygroup" with two people, (john, john#somewhere.com, 27) and (joe, joe#somewhere.com, 38). Cassandra adds the following four column names->values to the wide row:
john:email -> john#somewhere.com
john:age -> 27
joe:email -> joe#somewhere.com
joe:age -> 38
Note how we ended up with a wide row with 4 columns - 2 non-key fields per row (email and age), multiplied by the number of rows in the partition (2). The clustering key field "username" no longer appears anywhere as a value, but rather as part of the column's name! So if we have two username values "john" and "joe", we have some columns prefixed "john" and some columns prefixed "joe", and when we read the column "joe:email" we know this is the value of the email field of the row which has username=joe.
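The mapping from CQL rows to wide-row cells can be sketched in a few lines of Python. This only illustrates the naming scheme described above, using the "mygroup" data from the example; the function name to_wide_row is made up, and the real storage format carries more than this (timestamps, markers, and so on).

```python
def to_wide_row(rows, clustering_key):
    """Flatten the CQL rows of one partition into 'clustering:column' -> value cells."""
    cells = {}
    for row in rows:
        prefix = row[clustering_key]
        for column, value in row.items():
            if column != clustering_key:
                # The clustering key value becomes part of the cell name.
                cells[f"{prefix}:{column}"] = value
    # Cells are kept sorted by name within the wide row.
    return dict(sorted(cells.items()))

mygroup = [
    {"username": "john", "email": "john#somewhere.com", "age": 27},
    {"username": "joe", "email": "joe#somewhere.com", "age": 38},
]

wide = to_wide_row(mygroup, "username")
for name, value in wide.items():
    print(name, "->", value)
```

Two CQL rows with two non-key fields each produce four cells, and "username" itself never appears as a cell value.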
Cassandra still has this internal duality - converting the user-facing CQL rows and clustering keys into old-style wide rows. Until recently, Cassandra's on-disk format known as "SSTables" was still schema-less and used composite names as shown above for column names. I wrote a detailed description of the SSTable format on Scylla's site https://github.com/scylladb/scylla/wiki/SSTables-Data-File (Scylla is a more efficient C++ re-implementation of Cassandra to which I contribute). However, column names are very inefficient in this format so Cassandra recently (in version 3.0) switched to a different file format, which for the first time, accepts clustering keys and schema-full rows as first class citizens. This was the last nail in the coffin of the schema-less Cassandra from 7 years ago. Cassandra is now schema-full, all the way.

what's the difference among row key, primary key and index in cassandra?

I'm so confused.
When to use them and how to determine which one to use?
If a column is index/primary key/row key, could it be duplicated?
I want to create a column family to store some many-to-many info. For example, one column is the given name and the other is the surname. One given name can relate to many surnames, and one surname can have different given names.
I need to query surnames by a given name, and the given names by a specified surname too.
How to create the table?
Thanks!
Cassandra is a NoSQL database, and as such has no concept of many-to-many relationships. Ideally a table should not have anything other than a primary key. In your case the right way to model it in Cassandra is to create two tables, one with the given name as the primary key and the other with the surname as the primary key.
When you need to query by either key, you query the table that has that key as the primary key.
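A minimal Python sketch of the two-table layout, with dicts of sets standing in for the two tables. The names surnames_by_given and givens_by_surname are hypothetical, and the second column of each pair is assumed to act like a clustering column so that one key can hold many values.

```python
from collections import defaultdict

# One "table" per query direction (hypothetical names).
surnames_by_given = defaultdict(set)   # given name -> surnames
givens_by_surname = defaultdict(set)   # surname -> given names

def add_person(given, surname):
    """Every write goes to both tables so that both lookups stay consistent."""
    surnames_by_given[given].add(surname)
    givens_by_surname[surname].add(given)

add_person("John", "Smith")
add_person("John", "Doe")
add_person("Jane", "Doe")

print(sorted(surnames_by_given["John"]))
print(sorted(givens_by_surname["Doe"]))
```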
EDIT:
From the Cassandra docs:
Cassandra's built-in indexes are best on a table having many rows that
contain the indexed value. The more unique values that exist in a
particular column, the more overhead you will have, on average, to
query and maintain the index. For example, suppose you had a races
table with a billion entries for cyclists in hundreds of races and
wanted to look up rank by the cyclist. Many cyclists' ranks will share
the same column value for race year. The race_year column is a good
candidate for an index.
Do not use an index in these situations:
On high-cardinality columns for a query of a huge volume of records for a small number of results.
In tables that use a counter column.
On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.

Why many refer to Cassandra as a Column oriented database?

Reading several papers and documents on the internet, I found a lot of contradictory information about the Cassandra data model. Many identify it as a column-oriented database, others as row-oriented, and some define it as a hybrid of both.
According to what I know about how Cassandra stores files, it uses the *-Index.db file to find the right position in the *-Data.db file, where the bloom filter, the column index, and then the columns of the required row are stored.
In my opinion, this is strictly row-oriented. Is there something I'm missing?
If you take a look at the Readme file at Apache Cassandra git repo, it says that,
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
Column oriented or columnar databases are stored on disk column wise.
For example, take a Bonuses table:
ID Last First Bonus
1 Doe John 8000
2 Smith Jane 4000
3 Beck Sam 1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
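The two layouts can be reproduced with a short Python sketch that serializes the same Bonuses table both ways. The function names are made up for illustration; real storage engines use binary formats, not comma-separated text.

```python
# The Bonuses table from the example: (ID, Last, First, Bonus)
table = [
    (1, "Doe", "John", 8000),
    (2, "Smith", "Jane", 4000),
    (3, "Beck", "Sam", 1000),
]

def row_oriented(rows):
    """Store all values of one row next to each other."""
    return "".join(",".join(map(str, row)) + ";" for row in rows)

def column_oriented(rows):
    """Store all values of one column next to each other (zip transposes the table)."""
    return "".join(",".join(map(str, col)) + ";" for col in zip(*rows))

print(row_oriented(table))
print(column_oriented(table))
```

Reading one whole record is a single contiguous read in the first layout, while reading one whole column is a single contiguous read in the second.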
Cassandra is basically a column-family store
Cassandra would store the above data as,
"Bonuses" : {
row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
...
}
Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.
Read this for more details.
Yes, the "column-oriented" terminology is a bit confusing.
The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.
So in a columnfamily called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.
apple -> colour weight price variety
"red" 100 40 "Cox"
orange -> colour weight price origin
"orange" 120 50 "Spain"
One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):
temperature -> 2012-09-01 2012-09-02 2012-09-03 ...
40 41 39 ...
which is quite different from a relational model, where one would have to model the entries of a time series as rows not columns. This type of usage is often referred to as "wide rows".
You both make good points and it can be confusing. In the example where
apple -> colour weight price variety
"red" 100 40 "Cox"
apple is the key value and the column is the data, which contains all 4 data items. From what was described it sounds like all 4 data items are stored together as a single object then parsed by the application to pull just the value required. Therefore from an IO perspective I need to read the entire object. IMHO this is inherently row (or object) based not column based.
Column based storage became popular for warehousing, because it offers extreme compression and reduced IO for full table scans (DW) but at the cost of increased IO for OLTP when you needed to pull every column (select *). Most queries don't need every column and due to compression the IO can be greatly reduced for full table scans for just a few columns. Let me provide an example
apple -> colour weight price variety
"red" 100 40 "Cox"
grape -> colour weight price variety
"red" 100 40 "Cox"
We have two different fruits, but both have a colour = red. If we store colour in a separate disk page (block) from weight, price and variety so the only thing stored is colour, then when we compress the page we can achieve extreme compression due to a lot of de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colours. Now to read everything with the colour red it might be 1 IO instead of thousands of IOs, which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row, since the row might have hundreds of columns and a single update (or insert) could require hundreds of IOs.
Unless I'm missing something I wouldn't call this columnar based, I'd call it object based. It's still not clear how objects are arranged on disk. Are multiple objects placed into the same disk page? Is there any way of ensuring objects with the same metadata go together? Given that one fruit might contain different data than another fruit, since it's just metadata or XML or whatever you want to store in the object itself, is there a way to ensure certain matching fruit types are stored together to increase efficiency?
Larry
The most unambiguous term I have come across is wide-column store.
It is a kind of two-dimensional key-value store, where you use a row key and a column key to access data.
The main difference between this model and the relational ones (both row-oriented and column-oriented) is that the column information is part of the data.
This implies data can be sparse. That means different rows don't need to share the same column names nor number of columns. This enables semi-structured data or schema free tables.
You can think of wide-column stores as tables that can hold an unlimited number of columns, and thus are wide.
Here's a couple of links to back this up:
This mongodb article
This Datastax article mentions it too, although it classifies Cassandra as a key-value store.
This db-engines article
This 2013 article
Wikipedia
Column family does not mean column-oriented. Cassandra is a column-family store but not column-oriented: it stores a row with all its column families together.
HBase is a column-family store too, and it stores column families in a column-oriented fashion: different column families are stored separately on a node, or they can even reside on different nodes.
IMO that's the wrong term used for Cassandra. Instead, it is more appropriate to call it row-partition store. Let me provide you some details on it:
Primary Key, Partitioning Key, Clustering Columns, and Data Columns:
Every table must have a primary key with unique constraint.
Primary Key = Partition key + Clustering Columns
# Example
Primary Key: ((col1, col2), col3, col4)   # primary key uniquely identifies a row
                                          # we need to choose its components, partition key
                                          # and clustering columns, so that each row can be
                                          # uniquely identified
Partition Key: (col1, col2)               # decides on which node to store the data
                                          # partitioning key is mandatory, and it
                                          # can be made up of one column or multiple
Clustering Columns: col3, col4            # decides arrangement within a partition
                                          # clustering columns are optional
The partition key is the first component of the primary key. Its hashed value determines the node that stores the data. The partition key can be a compound key consisting of multiple columns. We want data to be spread almost equally across nodes, and we keep this in mind when choosing the partition key.
Any fields listed after the partition key in the primary key are called clustering columns. These store data in ascending order (the default) within the partition. The clustering column component also helps in making sure the primary key of each row is unique.
You can use as many clustering columns as you would like. You cannot use the clustering columns out of order in the SELECT statement. You may choose to omit a clustering column in your SELECT statement; that's OK. Just remember to use them in order when you are using the SELECT statement. But note that, in your CQL query, you cannot try to access a column or a clustering column if you have not used the other defined clustering columns. For example, if the primary key is (year, artist_name, album_name) and you want to use the city column in your query's WHERE clause, then you can use it only if your WHERE clause makes use of all of the columns which are part of the primary key.
Tokens:
Cassandra uses tokens to determine which node holds what data. A token is a 64-bit integer (with the default Murmur3Partitioner), and Cassandra assigns ranges of these tokens to nodes so that each possible token is owned by a node. Adding more nodes to the cluster or removing old ones leads to redistributing these tokens among nodes.
A row's partition key is used to calculate a token using a given partitioner (a hash function for computing the token of a partition key) to determine which node owns that row.
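To make the token-to-node mapping concrete, here is a minimal Python sketch. It rests on several assumptions: the hash is md5 truncated to 64 bits (a stand-in, not Murmur3), the three-node ring and its token values are made up, and each node is taken to own the range that ends at its token.

```python
import bisect
import hashlib

def token(partition_key):
    """Stand-in partitioner: hash the key into a signed 64-bit integer."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical ring: (token, node). A node owns the range (previous token, its token].
ring = sorted([(-(2**62), "node1"), (0, "node2"), (2**62, "node3")])
ring_tokens = [t for t, _ in ring]

def owner(partition_key):
    """Find the first node whose token is >= the key's token, wrapping around the ring."""
    idx = bisect.bisect_left(ring_tokens, token(partition_key))
    return ring[idx % len(ring)][1]

for key in ("alice", "bob", "carol"):
    print(key, token(key), owner(key))
```

The same key always hashes to the same token, so every node can compute which node owns a row without consulting a central directory.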
Cassandra is Row-partition store:
Row is the smallest unit that stores related data in Cassandra.
Don't think of Cassandra's column family (that is, table) as a RDBMS table, but think of it as a dict of a dict (here dict is data structure similar to Python's OrderedDict):
the outer dict is keyed by a row key (primary key): this determines which partition and which row in partition
the inner dict is keyed by a column key (data columns): this is data in dict with column names as keys
both dicts are ordered (by key) and are sorted: the outer dict is sorted by primary key
This model allows you to omit columns or add arbitrary columns at any time, as it allows you to have different data columns for different rows.
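The dict-of-dicts picture can be written down directly in Python. This is only a mental model, not how Cassandra is implemented; the names table, upsert and scan are made up, and sorting is done at read time for simplicity.

```python
# Outer dict: row key -> inner dict; inner dict: column name -> value.
table = {}

def upsert(row_key, **columns):
    """Insert or update: new columns are merged into whatever the row already has."""
    table.setdefault(row_key, {}).update(columns)

def scan():
    """Yield rows ordered by row key, each with its columns ordered by name."""
    for key in sorted(table):
        yield key, dict(sorted(table[key].items()))

upsert("orange", colour="orange", origin="Spain")
upsert("apple", colour="red", variety="Cox")
upsert("apple", price=40)          # add an arbitrary column later

for key, columns in scan():
    print(key, columns)
```

The two rows hold different sets of columns, and a column can be added to a single row without touching the others.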
Cassandra has a concept of column families (tables), which originally comes from BigTable. However, it is really misleading to call them column-oriented, as you mentioned. Within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented.

Resources