Cassandra Partition and Clustering Keys to Store an Inverted Index

I need to use Cassandra to store an inverted index, in which words and their frequencies in articles are stored as follows:
word, article_title, frequency
The number of unique words is about 40M, and the number of Cassandra nodes is 2.
Which is better to use as the partition key: the first character of the word, or the word itself?
And what about the primary key?

TL;DR: With regard to your query, I would definitely say to use the word as the partition key.
If you used only the first character, you would have only 26 partitions. You do not want that; if nothing else, you will get hot-spotting. Some rows will be pretty short, since there are not a lot of words starting with a particular letter, and others will be very, very long, maybe even beyond the point where they are performant to use. Yes, Cassandra has a two-billion-columns-per-row limit, but the recommendation is to keep the size of a row in the millions. You also do not want to access all the words starting with 'A' if you want only 'AIRPORT'.
You need a high-cardinality, as-random-as-possible partition key so that the rows are easily dispersed throughout the cluster. On the other hand, it has to reflect your access patterns. In your case, you want to see stats for a word or a set of words. Accessing by partition/primary key is basically as fast as it gets with Cassandra.
As for the clustering key, it is more or less obvious: you can use the article title, OR, what I would do is actually use an article identifier (a UUID or such) as the clustering key. Article titles might change (a typo?), and you certainly do not want to iterate through all your rows changing the title.
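As an illustration, a minimal CQL sketch of such a table (table and column names are hypothetical, not from the question):
CREATE TABLE inverted_index (
    word text,
    article_id uuid,
    frequency int,
    PRIMARY KEY (word, article_id)   -- word: partition key, article_id: clustering key
);
-- look up a single word without touching any other partition
SELECT article_id, frequency FROM inverted_index WHERE word = 'AIRPORT';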

Related

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, e.g.:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value of my page and retrieve the next page with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
How you choose your partition key depends on your requirements and data model. If your partition key is the only column in the primary key, it has to be unique; otherwise data will be upserted (overwritten by new data). If you have wide rows (i.e., a clustering key), then making your partition key unique (a key that appears only once in the table) will defeat the purpose of wide rows. In CQL, "wide rows" just means that there can be more than one row per partition, but here there would be only one row per partition. It would be better if you could provide the schema.
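For illustration only (the question did not include a schema, so these names are made up), a wide-row table has one or more clustering columns after the partition key:
CREATE TABLE events_by_user (
    user_id uuid,
    event_time timestamp,
    payload text,
    PRIMARY KEY (user_id, event_time)   -- user_id: partition key, event_time: clustering key
);
-- many rows can share the same user_id partition (a "wide row");
-- with PRIMARY KEY (user_id) alone, a second insert for the same user_id would just upsert the single row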
Please see the links below about pagination in Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+. Cassandra 2.0 has automatic paging: instead of using the token function to build paging yourself, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use the pagingState object, which represents where you are in the result set when the last page was fetched.
EDITED:
Please check the link below:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem, so let me quickly add it here.
First, there is a table with two fields; just for illustration we use only a few fields.
Say we insert a million rows into this table.
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI, say a hundred entries per page.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one-time update of this column with page numbers: page number 10 means a contiguous block of rows updated with page_no = 10.
Since we can query on a secondary index, each page can be queried independently, as sketched below.
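A minimal CQL sketch of these steps (table, index and column names are illustrative and not taken from the linked repo):
CREATE TABLE test_data (
    id uuid PRIMARY KEY,
    payload text,
    page_no int
);
CREATE INDEX page_no_idx ON test_data (page_no);
-- one-time backfill: repeat for every row that belongs to page 10
UPDATE test_data SET page_no = 10 WHERE id = ?;
-- each page can then be fetched independently through the secondary index
SELECT * FROM test_data WHERE page_no = 10;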
The full code is self-explanatory and is here - https://github.com/alexcpn/testgo
Note that cautions on how to use secondary indexes properly abound; please check them. In this use case I am hoping that I am using one properly. I have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

cassandra: `sstabledump` output questions

I'm inspecting the output of sstabledump to gain a better understanding of the Cassandra data model, and I have some questions.
From the output of sstabledump it seems that:
a table is a list of partitions (split by partition key)
a partition is a list of rows (split according to clustering key)
a row is a map of key-value pairs, where the keys belong in a predefined list
Question 1: For each partition, as well as for each row inside a partition, there is a position key. What does this value correspond to? Physical storage details? And how exactly?
Question 2: Each row inside each partition has a type: row key-value pair. Could this type be anything else? If yes, what? If not:
why have a value that is always the same?
why is Cassandra classified as wide-column (and similar terms)? It looks more like two-level row storage.
The partition token is the Murmur3 hash of whatever you assigned as the partition key (the first part of your primary key). Consistent hashing is used with that hash to determine which node in the cluster a partition, and its replicas, belongs to. Within each partition, data is sorted by clustering key, and then by cell name within the row. This structure is used so that redundant things like timestamps, when a row is inserted all at once, are stored only once, as a vint-encoded delta from a partition-level value, to save space.
On disk, the partitions are sorted in order of this hashed key. The position value in the output just refers to where the entry is located in the sstable's data file (a decompressed byte offset). The type can also identify that spot as a static block, which is located at the beginning of a partition and holds any static cells, or as a range tombstone marker (start or end bound). Note that sstabledump sometimes repeats values in the JSON for readability even if they are not physically written on disk (e.g., repeated timestamps).
You can have many of these rows inside a partition; a common data model for time series, for example, is to use a timestamp as the clustering key, which makes very wide partitions with millions of rows. Pre-3.0, the data storage was closer to Bigtable's design: it was essentially a Map<byte[], SortedMap<byte[], Cell>> where the Comparator of the sorted map was changed based on the schema. It did not differentiate rows and columns within a partition, which led to massive amounts of redundant data, so it was redesigned to fit the query language better.
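For example, a hypothetical time-series table whose clustering key is a timestamp produces exactly these wide partitions:
CREATE TABLE sensor_readings (
    sensor_id uuid,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)   -- one partition per sensor, one row per reading
) WITH CLUSTERING ORDER BY (reading_time DESC);
-- all readings for a sensor live in a single partition, sorted by reading_time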
Some more references:
Explanation of motivation of 3.0 change by DataStax here
Blog post by TLP has a good detailed explanation of the new disk format
CASSANDRA-8099

What is the cardinality of a partition key?

If I use a randomly generated unique id, is it correct that the cardinality would be rather large?
If I have a key with low cardinality, like 5 category values that the partition key can take, and I want to distribute it, the recommended approach seems to be to make the partition key a composite key.
But this requires that I have to specify all the parts of a composite key in my query to retrieve all records of that key.
Even then the generated token might end up being for the same node.
Is there any way to decide on the additional column for the composite key that would guarantee that the data is distributed?
The thing is that with Cassandra you actually want to have partition keys "known" so that you can access the data when you need it. I'm not sure what you mean by large cardinality on the partition key; you would get a lot of partitions in the cluster, and that is usually OK.
If you want to distribute the data around the cluster, you can use artificial columns; this approach is sometimes called bucketing. Basically, if you want to keep 100k+ (or, in newer versions, 1 million+) columns per partition, it's OK to split this data across multiple partitions.
Some people simply use a trick: when they insert the data, they add an artificial bucket column to the partition key, say random(1-10), and when they read the data back out they simply issue 10 queries, or use an IN operator, and merge the data on the client side. This approach has the benefit of preventing "hot rows" in the cluster.
For every key, the chance of it ending up on any given node is more or less 1/NUM_NODES, so most of the time this is not something you should worry about too much, unless the number of partitions is smaller than the number of nodes in the cluster.
Basically, there are two choices for the additional column: a random value (already described) or some function of the input data. For example, with time-series data, if you decide to bucket by month, you can always calculate the month from the data you are about to insert and put the row in that bucket. When you retrieve the data, you know you are looking for something in, say, May 2016, so you know how to select the appropriate bucket.
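A rough CQL sketch of the random-bucket variant (names are made up for illustration):
CREATE TABLE user_events (
    user_id uuid,
    bucket int,                         -- random value 1-10 chosen at insert time
    event_time timestamp,
    payload text,
    PRIMARY KEY ((user_id, bucket), event_time)
);
-- read side: hit all 10 buckets and merge the results on the client
SELECT * FROM user_events WHERE user_id = ? AND bucket IN (1,2,3,4,5,6,7,8,9,10);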

Cassandra schema design: should more columns go into partition vs. cluster?

In my case I have a table structure like this:
CREATE TABLE table_1 (
    entity_uuid text,
    fk1_uuid text,
    fk2_uuid text,
    int_timestamp bigint,
    cnt counter,
    PRIMARY KEY (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
);
The text columns are made up of random strings. However, only entity_uuid is truly random and evenly distributed. fk1_uuid and fk2_uuid have much lower cardinality and may be sparse (sometimes fk1_uuid=null or fk2_uuid=null).
In this case, I can either define only entity_uuid as the partition key or entity_uuid, fk1_uuid, fk2_uuid combination as the partition key.
This is a LOOKUP-type table, meaning we don't plan to do any aggregations/slice-and-dice based on it, and the rows will be rotated out since we will be inserting with a TTL defined for each row.
Can someone enlighten me:
What is the downside of having too many partitions with very few rows in each? Is there a hit/cost at the storage-engine level?
My understanding is that clustering keys are ALWAYS sorted. Does that mean having text columns as clustering keys will always incur a tree-balancing cost?
Well, you can tell where my heart lies by now. However, when all rows in a partition have TTL'd out, does that partition still live, or will it be removed by the DB engine as well?
Thanks,
Bing
The major and possibly most significant difference between having big partitions and small partitions is the ability to do range scans. If you want to be able to do scan queries like
SELECT * FROM table_1 WHERE entity_uuid = x AND fk1_uuid > something
Then you'll need the clustering column for performance; otherwise this query would be difficult (a multi-get at best, a full table scan at worst). I've never heard of any cases where having too many partitions is a drag on performance, but having too wide a partition (i.e., lots of clustering column values) can cause issues when you get into the 1B+ cell range.
In terms of the cost of clustering, it is basically free at write time (in-memory sorting is very, very fast), but you can incur costs at read time as partitions become spread amongst various SSTables. Small partitions which are written once will not incur the merge penalty, since they will most likely only exist in one SSTable.
TTL'd partitions will be removed but be sure to read up on GC_GRACE_SECONDS to see how Cassandra actually deals with removing data.
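As a small illustration (the option values here are arbitrary examples, and the table is simplified without the counter column), a TTL can be set per write or as a table default, and gc_grace_seconds is a table option:
CREATE TABLE lookup_data (
    entity_uuid text PRIMARY KEY,
    value text
) WITH default_time_to_live = 604800    -- rows expire after 7 days
  AND gc_grace_seconds = 864000;        -- how long tombstones are kept before they can be collected
INSERT INTO lookup_data (entity_uuid, value) VALUES ('abc', 'x') USING TTL 3600;   -- per-write TTL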
TL;DR
Everything is dependent on your read/write pattern
No Range Scans? No need for clustering keys
Yes Range Scans? Clustering keys a must
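For reference, the two primary-key layouts discussed above would look roughly like this (sketch only, based on the question's table_1):
-- entity_uuid alone as the partition key; fk1_uuid, fk2_uuid, int_timestamp are clustering columns, so range scans within an entity are possible
PRIMARY KEY (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
-- composite partition key; partitions are tiny, but range scans on fk1_uuid within an entity are no longer possible
PRIMARY KEY ((entity_uuid, fk1_uuid, fk2_uuid), int_timestamp)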

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know, a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends; there will probably be fewer than 50 types. I need a way to show all the messages of each type that were sent (e.g., show recent messages of type 0 regardless of who sent them). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages, though, I don't want to let each partition grow indefinitely.
Unfortunately, you need to know the precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve entities from storage if you know the precise RowKeys, but you do need to have them in order to issue a batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that each represent one month of data, i.e. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges to a minimum, the number of unions will not be that big. Basically, this approach allows you to use the table name as another partitioning opportunity.
