what's the difference among row key, primary key and index in cassandra? - cassandra

I'm so confused.
When to use them and how to determine which one to use?
If a column is index/primary key/row key, could it be duplicated?
I want to create a column family to store some many-to-many info, for example, one column is the given name and the other is surname. One given name can related to many surnames, and one surname could have different given names.
I need to query surnames by a given name, and the given names by a specified surname too.
How to create the table?
Thanks!

Cassandra is a NoSQL database, and as such has no such concept of many-to-many relationships. Ideally a table should not have anything other than a primary key. In your case the right way to model it in Cassandra is to create two tables, one with name as the primary key and the other with surname as the primary key
When you need to query by either key, you need to query the table that has that key as the primary key
EDIT:
From the Cassandra docs:
Cassandra's built-in indexes are best on a table having many rows that
contain the indexed value. The more unique values that exist in a
particular column, the more overhead you will have, on average, to
query and maintain the index. For example, suppose you had a races
table with a billion entries for cyclists in hundreds of races and
wanted to look up rank by the cyclist. Many cyclists' ranks will share
the same column value for race year. The race_year column is a good
candidate for an index.
Do not use an index in these situations:
On high-cardinality columns for a query of a huge volume of records for a small number of results.
In tables that use a counter column On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.

Related

Regarding suggestion of best schema for a cassandra table?

I want to have a table in Cassandra that has a partition key say column 'A', and a column say 'B' which is of 'set' type and can have up to 10000 elements in the set.
But when i retrieve a row from this table then the whole set is retrieved at once and because of that the JVM heap increases rapidly. So should i stick to this schema or go with other schema where 'A' is partition key and i make dynamic columns for each element in the set in my other schema say 'B1', 'B2' ..... 'B10,000'where each of this column is a clustering key.
Which schema is suited best and will give the optimal performance please recommend.
NOTE: cqlsh 5.0.1v
Based off of what you've described, and the documentation I've read, I would not create a collection with 10k elements. Instead I would have two tables, one with everything but the collection, and then use the primary key values of the first table, as the partition key columns of the second table; adding the element name (or whatever you can use to identify an individual element) as a clustering column.
So for a given query, if you wanted everything for a particular primary key value (including all elements), you'd query the first table with the primary key, grab whatever you need, then hit the second table as well, looping/fetching through all elements.
If the query only provides a filter on the partition key (not the primary key - i.e. retrieving multiple rows) , the first query would have to retrieve all columns that make up the primary key for each row, and then query the second table looping for all elements - nested loop here - one loop for each primary key record retrieved from the first table, and a second loop to grab all elements for each pk record.
Probably the best way to go with this. That's how I would probably tackle this.
Does that make sense?
-Jim

How to have unique key except primary key in cassandra?

I am not good in English!
There is a table in Cassandra 3.5 which all columns of a row don't come at same time. Unique of table is some columns that are unique in a row together, but some of them are null at first. I can not set them the primary key because of null value. I have identify a column with name id and type uuid in Cassandra.
How can I have a unique key with that columns together in Cassandra?
Is my data model true?
How can I solve this problem?
You can't. It's not a relational DB. Use clustering and/or partitioning keys to add an unique constraint.
See this answer
To store unique values, create a separate table having your unique value as a key. Check if it exists by requesting this table before inserting a row. But beware, even doing this, you cannot ensure it will be unique in your final table if you have two concurrent inserts.
Basically, I would recommend using Cassandra as it really is: A data store. And find a way to implement your business logic where it belongs: in your code.

Get last record in Cassandra

Have a table with about 20 million rows in Cassandra.
The table is ordered by a primary_key column, which is a string. We are using 'ByteOrderedPartitioner', so the rows are ordered by the primary_key and not a hash of the primary_key column.
What is a good way to get the very last record in the table?
Thanks so much!
If for "very last record" you mean the one ordered as last I don't think you can do it like a "GET", you have to scan rows. The best you can do, afaik, is select a good range to scan (good start key) according to your primary key.
From datastax docs:
"Using the ordered partitioner allows ordered scans by primary key.
This means you can scan rows as though you were moving a cursor
through a traditional index. For example, if your application has user
names as the row key, you can scan rows for users whose names fall
between Jake and Joe. This type of query is not possible using
randomly partitioned row keys because the keys are stored in the order
of their MD5 hash (not sequentially)."
If you find better solution let me know.
Regards,
Carlo

Why many refer to Cassandra as a Column oriented database?

Reading several papers and documents on internet, I found many contradictory information about the Cassandra data model. There are many which identify it as a column oriented database, other as a row-oriented and then who define it as a hybrid way of both.
According to what I know about how Cassandra stores file, it uses the *-Index.db file to access at the right position of the *-Data.db file where it is stored the bloom filter, column index and then the columns of the required row.
In my opinion, this is strictly row-oriented. Is there something I'm missing?
If you take a look at the Readme file at Apache Cassandra git repo, it says that,
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent matter. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
Column oriented or columnar databases are stored on disk column wise.
e.g: Table Bonuses table
ID Last First Bonus
1 Doe John 8000
2 Smith Jane 4000
3 Beck Sam 1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
Cassandra is basically a column-family store
Cassandra would store the above data as,
"Bonuses" : {
row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
...
}
Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.
Read this for more details.
Yes, the "column-oriented" terminology is a bit confusing.
The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.
So in a columnfamily called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.
apple -> colour weight price variety
"red" 100 40 "Cox"
orange -> colour weight price origin
"orange" 120 50 "Spain"
One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):
temperature -> 2012-09-01 2012-09-02 2012-09-03 ...
40 41 39 ...
which is quite different from a relational model, where one would have to model the entries of a time series as rows not columns. This type of usage is often referred to as "wide rows".
You both make good points and it can be confusing. In the example where
apple -> colour weight price variety
"red" 100 40 "Cox"
apple is the key value and the column is the data, which contains all 4 data items. From what was described it sounds like all 4 data items are stored together as a single object then parsed by the application to pull just the value required. Therefore from an IO perspective I need to read the entire object. IMHO this is inherently row (or object) based not column based.
Column based storage became popular for warehousing, because it offers extreme compression and reduced IO for full table scans (DW) but at the cost of increased IO for OLTP when you needed to pull every column (select *). Most queries don't need every column and due to compression the IO can be greatly reduced for full table scans for just a few columns. Let me provide an example
apple -> colour weight price variety
"red" 100 40 "Cox"
grape -> colour weight price variety
"red" 100 40 "Cox"
We have two different fruits, but both have a colour = red. If we store colour in a separate disk page (block) from weight, price and variety so the only thing stored is colour, then when we compress the page we can achieve extreme compression due to a lot of de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colour's. Now to read everything with the colour red it might be 1 IO instead of thousands of IO's which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row since the row might have hundreds of columns and a single update (or insert) could require hundreds of IO's.
Unless I'm missing something I wouldn't call this columnar based, I'd call it object based. It's still not clear on how objects are arranged on disk. Are multiple objects placed into the same disk page? Is there any way of ensuring objects with the same meta data go together? To the point that one fruit might contain different data than another fruit since its just meta data or xml or whatever you want to store in the object itself, is there a way to ensure certain matching fruit types are stored together to increase efficiency?
Larry
The most unambiguous term I have come across is wide-column store.
It is a kind of two-dimensional key-value store, where you use a row key and a column key to access data.
The main difference between this model and the relational ones (both row-oriented and column-oriented) is that the column information is part of the data.
This implies data can be sparse. That means different rows don't need to share the same column names nor number of columns. This enables semi-structured data or schema free tables.
You can think of wide-column stores as tables that can hold an unlimited number of columns, and thus are wide.
Here's a couple of links to back this up:
This mongodb article
This Datastax article mentions it too, although it classifies Cassandra as a key-value store.
This db-engines article
This 2013 article
Wikipedia
Column Family does not mean it is column-oriented. Cassandra is column family but not column-oriented. It stores the row with all its column families together.
Hbase is column family as well as stores column families in column-oriented fashion. Different column families are stored separately in a node or they can even reside in different node.
IMO that's the wrong term used for Cassandra. Instead, it is more appropriate to call it row-partition store. Let me provide you some details on it:
Primary Key, Partitioning Key, Clustering Columns, and Data Columns:
Every table must have a primary key with unique constraint.
Primary Key = Partition key + Clustering Columns
# Example
Primary Key: ((col1, col2), col3, col4) # primary key uniquely identifies a row
# we need to choose its components partition key
# and clustering columns so that each row can be
# uniquely identified
Partition Key: (col1, col2) # decides on which node to store the data
# partitioning key is mandatory, and it
# can be made up of one column or multiple
Clustering Columns: col3, col4 # decides arrangement within a partition
# clustering columns are optional
Partition key is the first component of Primary key. Its hashed value is used to determine the node to store the data. The partition key can be a compound key consisting of multiple columns. We want almost equal spreads of data, and we keep this in mind while choosing primary key.
Any fields listed after the Partition Key in Primary Key are called Clustering Columns. These store data in ascending order within the partition. The clustering column component also helps in making sure the primary key of each row is unique.
You can use as many clustering columns as you would like. You cannot use the clustering columns out of order in the SELECT statement. You may choose to omit using a clustering column in you SELECT statement. That's OK. Just remember to sue them in order when you are using the SELECT statement. But note that, in your CQL query, you can not try to access a column or a clustering column if you have not used the other defined clustering columns. For example, if primary key is (year, artist_name, album_name) and you want to use city column in your query's WHERE clause, then you can use it only if your WHERE clause makes use of all of the columns which are part of primary key.
Tokens:
Cassandra uses tokens to determine which node holds what data. A token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes so that each possible token is owned by a node. Adding more nodes to the cluster or removing old ones leads to redistributing these token among nodes.
A row's partition key is used to calculate a token using a given partitioner (a hash function for computing the token of a partition key) to determine which node owns that row.
Cassandra is Row-partition store:
Row is the smallest unit that stores related data in Cassandra.
Don't think of Cassandra's column family (that is, table) as a RDBMS table, but think of it as a dict of a dict (here dict is data structure similar to Python's OrderedDict):
the outer dict is keyed by a row key (primary key): this determines which partition and which row in partition
the inner dict is keyed by a column key (data columns): this is data in dict with column names as keys
both dict are ordered (by key) and are sorted: the outer dict is sorted by primary key
This model allows you to omit columns or add arbitrary columns at any time, as it allows you to have different data columns for different rows.
Cassandra has a concept of column families(table), which originally comes from BigTable. Though, it is really misleading to call them column-oriented as you mentioned. Within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented.

Cassandra super column structure

I'm new to Cassandra, and I'm not familiar with super columns.
Consider this scenario: Suppose we have a some fields of a customer entity like
Name
Contact_no
address
and we can store all these values in a normal column. I want to arrange that when a person moves from one location to another location (the representative field could store the longitude and latitude) that values will be stored consecutively with respect to customer location. I think we can do this with super columns but I'm confused how to design the schema to accomplish this.
Please help me to create this schema and come to understand the concepts behind super columns.
supercolumns are really not recommended anymore...still used but more and more have switched to composite columns. For example playOrm uses this concept for indexing. If I am indexing an integer, and indexing row may look like this
rowkey = 10.pk56 10.pk39 11.pk50
Where the column name type is a composite integer and string in this case. These rows can be up to about 10 million columns though I have only run expirements up to 1 million my self. For example, playOrm's queries use these types of indexes to do a query that took 60 ms on 1,000,000 rows.
With playOrm, you can do scalable relational models in noSQL....you just need to figure out how to partition your data correctly as you can have as many partitions as you want in each table, but a partition should really not be over 10 million rows.
Back to the example though, if you have a table with columns numShares, price, username, age, you may wnat to index numShares and the above row would be that index so you could grab the index by key OR better yet, grab all column names with numShares > 20 and numShares < 50
Once you have those columns, you can then get the second half of the column name which is the primary key. The reason primary key is NOT a value is because as in the example above there is two rows pk56 and pk39 with the same 10 and you can't have two columns named 10, but you can have a 10.pk56 and 10.pk39.
later,
Dean

Resources