cassandra primary key design - cassandra

I am using DataStax Enterprise 4.5. Is there any disadvantage, in terms of performance, to defining a composite partition key rather than a single-column partition key? What if one column of the composite partition key has high cardinality but the other column has low cardinality?

A composite key is used to increase the cardinality of your partitions. For example, a key like PRIMARY KEY ((x, y)) with 5 values of x and 10 values of y will end up creating 50 different partitions. This is useful if you need to distribute your data more, but it is unnecessary if you have a single column with high enough cardinality.
A more realistic example might be a composite key of PRIMARY KEY ((Gender, ZipCode), age, userid). If you used only Gender as the partition key you would end up with only 2 partitions to store your data! Adding ZipCode allows for all 99999 zip codes (or zip+4 to get even more) while still letting you segregate your data by gender. This would be ideal for looking up demographic information by location or something like that.
Basically the rule of thumb is that you want a large number of partitions to avoid hotspots in your cluster, and composite keys are an easy way of increasing the number of partitions by combining the cardinality of your fields.
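A rough sketch of that example in CQL (the table name and column types are my own assumptions, not from the answer):
CREATE TABLE users_by_location (
    gender text,
    zipcode text,
    age int,
    userid uuid,
    PRIMARY KEY ((gender, zipcode), age, userid)
);
-- Every read must then supply both partition key columns:
SELECT * FROM users_by_location WHERE gender = 'F' AND zipcode = '10001';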

Related

cassandra: `sstabledump` output questions

I'm inspecting the output of sstabledump to gain a better understanding of the Cassandra data model, and I have some questions.
From the output of sstabledump it seems that
a table is a list of partitions (split by partition key)
a partition is a list of rows (split according to clustering key)
a row is a map of key-value pairs, where the keys belong in a predefined list
Question 1: For each partition, as well as for each row inside a partition, there is a position key. What does this value correspond to? Physical storage details? And how exactly?
Question 2: Each row inside each partition has a type: row key-value pair. Could this type be anything else? If yes, what? If not:
why have a value that is always the same?
why is Cassandra classified as wide-column (and similar terms)? It looks more like a two-level row storage.
The partition's position on the ring is the murmur3 hash of whatever you assigned as the partition key. Consistent hashing is used with that hash to determine which node in the cluster the partition (and its replicas) belongs to. Within each partition, data is sorted by clustering key, and then by cell name within the row. This structure is used so that redundant things, like the timestamps of cells inserted for a row at once, are only stored once, as a vint delta sequence relative to the partition, to save space.
On disk the partitions are sorted in order of this hashed key. The position value just refers to where the partition or row is located in the sstable's data file (decompressed byte offset). The type can also identify that spot as a static block, which is located at the beginning of each partition for any static cells, or as a range tombstone marker (beginning or end). Note that sstabledump sometimes repeats values in the json for readability even if they are not physically written on disk (i.e. repeated timestamps).
You can have many of these rows inside a partition; a common data model for time series, for example, is to use the timestamp as the clustering key, which makes very wide partitions with millions of rows. Pre 3.0 the data storage was closer to Bigtable's design. It was essentially a Map<byte[], SortedMap<byte[], Cell>> where the Comparator of the sorted map was changed based on the schema. It did not differentiate rows and columns within a partition, which led to massive amounts of redundant data, so it was redesigned to fit the query language better.
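As a rough illustration (the table below is hypothetical, not from the question): for a schema like this, each distinct sensor_id becomes one partition in the dump, each ts value one row inside that partition, and value one cell within that row.
CREATE TABLE readings (
    sensor_id text,   -- partition key: its murmur3 hash places the partition on the ring
    ts timestamp,     -- clustering key: rows within a partition are sorted by this
    value double,     -- regular column: shows up as a cell inside each row
    PRIMARY KEY (sensor_id, ts)
);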
Some more references:
Explanation of motivation of 3.0 change by DataStax here
Blog post by TLP has a good detailed explanation of the new disk format
CASSANDRA-8099

What is the cardinality of a partition key?

If I use a randomly generated unique Id, is it correct that the cardinality would be rather large?
If I have a key with low cardinality, like 5 category values that the partition key can take, and I want to distribute it, the recommended approach seems to be to make the partition key into a composite key.
But this requires that I specify all the parts of the composite key in my query to retrieve all records for that key.
Even then, the generated token might end up being for the same node.
Is there any way to decide on the additional column for the composite key that would guarantee that the data is distributed?
The thing is that with Cassandra you actually want your partitioning keys to be "known" so that you can access the data when you need it. I'm not sure what you mean by large cardinality on the partitioning key; you would simply get a lot of partitions in the cluster, and this is usually o.k.
If you want to distribute the data around the cluster, you can use artificial columns. This approach is sometimes also called bucketing. Basically, once you want to keep 100k+ (or in newer versions 1 million+) columns in a single partition, it's a good idea to split the data into multiple partitions.
Some people simply use a trick: when they insert the data they add some artificial bucket column to the partition key, let's say random(1-10), and when they read the data out they simply issue 10 queries (or use an IN operator), fetch the data and merge it on the client side. This approach has many benefits in that it prevents the appearance of "hot rows" in the cluster.
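A minimal sketch of that trick, with hypothetical table and column names:
CREATE TABLE events_by_category (
    category text,
    bucket int,          -- artificial column, e.g. picked as random(1-10) at insert time
    event_id timeuuid,
    payload text,
    PRIMARY KEY ((category, bucket), event_id)
);
-- Read side: hit all buckets with IN (or 10 separate queries) and merge client-side
SELECT * FROM events_by_category
WHERE category = 'books' AND bucket IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);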
The chance for any given key to end up on a particular node is more or less 1/NUM_NODES, so most of the time this is not something you should worry about too much, unless the number of partitions is smaller than the number of nodes in the cluster.
Basically there are two choices for the additional column: random (already described), or some function of the data itself. For instance, with time series data you might decide to bucket by month: you can always calculate the month from the data you are about to insert and put the row into that bucket. When you retrieve the data you know "o.k., I'm looking for something in May 2016", so you know which bucket to select.
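A sketch of the time-based variant, assuming month-level buckets (names are illustrative):
CREATE TABLE measurements (
    series_id text,
    month text,          -- derived from the event time at insert, e.g. '2016-05'
    event_time timestamp,
    value double,
    PRIMARY KEY ((series_id, month), event_time)
);
-- "Looking for something in May 2016" => the bucket is known up front
SELECT * FROM measurements
WHERE series_id = 'sensor-1' AND month = '2016-05'
  AND event_time >= '2016-05-01' AND event_time < '2016-06-01';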

Trying to visualize how wide and skinny rows are laid out

Can someone show me how the data is laid out when you design your tables for wide vs. skinny rows?
I'm not sure I fully grasp how the data is spread out with a "wide" row.
Is there a difference in how you can fetch the data, or will it be the same, i.e. if it is ordered, does it matter whether the data is organized vertically (skinny) or horizontally (wide)?
Update
Is a table considered wide if the primary key consists of more than one column?
Or will a table have wide rows only if the partition key is a composite partition key?
Wide... Skinny... Terms that make your head explode... I prefer to oversimplify the thing as such:
All the tables have wide rows
You simply need to take care of how wide the rows gets
This allows me to think of it as follows (mangling the C* terminology a bit):
Number of RECORDS in a partition
1 <----------------------------------------> 2 billion
^                                            ^
Skinny rows                                  Wide rows
The fewer records in a partition, the skinnier the "partition", and vice versa.
When designing for C* I always keep in mind a couple of things:
I want to use "skinny partitions" when my data can be fetched with one query and it is fully contained in one record of one partition. Typical example is something along SELECT * FROM table WHERE username = 'xmas79'; where the table has a primary key in the form of PRIMARY KEY (username)that let me get all the data belonging to a particular username.
I want to use "wide rows" when my data can be fetched with one query and it is fully contained on multiple records of one partition. Typical examples are range queries like SELECT * FROM table WHERE sensor = 'pressure' AND time >= '2016-09-22';, where the table has a primary key in the form of PRIMARY KEY (sensor, time).
So, first approach for one-shot queries, second approach for range queries. Beware that this second approach has the (major) drawback that you can keep adding data to the partition, and it will get wider and wider, hurting performance.
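For illustration, the two shapes above might be declared like this (the non-key columns are my own assumptions):
-- Skinny: one record per partition, fetched with a point query
CREATE TABLE users (
    username text PRIMARY KEY,
    email text
);
-- Wide: many records per partition, ordered by the clustering column
CREATE TABLE measures (
    sensor text,
    time timestamp,
    value double,
    PRIMARY KEY (sensor, time)
);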
In order to control how wide your partitions are, you need to add something to the partition key. In the sensor example above, provided it doesn't violate your query requirements, you can "group" some measurements by date, e.g. splitting the measures into day-by-day groups, making the primary key PRIMARY KEY ((sensor, day), time), where the partition key has been transformed to (sensor, day). With this approach you have full (well, let's say good at least) control over the wideness of your partitions.
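A sketch of that bucketed variant (again with assumed column types):
CREATE TABLE measures_by_day (
    sensor text,
    day text,            -- e.g. '2016-09-22', derived from the measurement time at insert
    time timestamp,
    value double,
    PRIMARY KEY ((sensor, day), time)
);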
You only need to find a good compromise between your query capabilities and the desired performance.
I suggest these three readings for further investigation on the details:
Wide Rows in Cassandra CQL
Does CQL support dynamic columns / wide rows?
CQL3 for Cassandra experts
Beware that in the first link there's a mistake in the second-to-last picture: the primary key should be
PRIMARY KEY ((user_id, tweet_id))
with double parentheses around the columns instead of one.

Cassandra schema design: should more columns go into partition vs. cluster?

In my case I have a table structure like this:
CREATE TABLE table_1 (
    entity_uuid text,
    fk1_uuid text,
    fk2_uuid text,
    int_timestamp bigint,
    cnt counter,
    PRIMARY KEY (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
);
The text columns are made up of random strings. However, only entity_uuid is truly random and evenly distributed. fk1_uuid and fk2_uuid have much lower cardinality and may be sparse (sometimes fk1_uuid=null or fk2_uuid=null).
In this case, I can either define only entity_uuid as the partition key or entity_uuid, fk1_uuid, fk2_uuid combination as the partition key.
And this is a LOOKUP-type of table, meaning we don't plan to do any aggregations/slice-dice based on this table. And the rows will be rotated out since we will be inserting with TTL defined for each row.
Can someone enlighten me:
What is the downside of having too many partition keys with very few rows in each? Is there a hit/cost at the storage engine level?
My understanding is that clustering keys are ALWAYS sorted. Does that mean having text columns as clustering keys will always incur a tree-balancing cost?
Well, you can tell where my heart lies by now. However, when all rows in a partition have TTL-ed out, does that partition still live, or will it be removed by the DB engine as well?
Thanks,
Bing
The major and possibly most significant difference between having big partitions and small partitions is the ability to do range scans. If you want to be able to do scan queries like
SELECT * FROM table_1 WHERE entity_uuid = x AND fk1_uuid > something
Then you'll need to have the clustering column for performance; otherwise this query would be difficult (a multi-get at best, a full table scan at worst). I've never heard of any cases where having too many partitions is a drag on performance, but having too wide a partition (i.e. lots of clustering column values) can cause issues when you get into the 1B+ cell range.
In terms of the cost of clustering, it is basically free at write time (the in-memory sort is very, very fast) but you can incur costs at read time as partitions become spread amongst various SSTables. Small partitions which are written once will not incur the merge penalty since they will most likely exist in only one SSTable.
TTL'd partitions will be removed but be sure to read up on GC_GRACE_SECONDS to see how Cassandra actually deals with removing data.
TL;DR
Everything is dependent on your read/write pattern
No Range Scans? No need for clustering keys
Yes Range Scans? Clustering keys a must
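To make the trade-off concrete, here is a sketch of the opposite layout, with everything moved into the partition key (this is my own illustration, not from the answer): the partitions become tiny, but only point lookups are possible and every key part must be supplied.
CREATE TABLE table_1_lookup (
    entity_uuid text,
    fk1_uuid text,
    fk2_uuid text,
    int_timestamp bigint,
    cnt counter,
    PRIMARY KEY ((entity_uuid, fk1_uuid, fk2_uuid, int_timestamp))
);
-- No range scans; every query must pin all four key columns:
SELECT * FROM table_1_lookup
WHERE entity_uuid = 'x' AND fk1_uuid = 'y' AND fk2_uuid = 'z' AND int_timestamp = 0;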

Need recommendation on appropriate primary key structure

I have a lot of time series data that I would like to store in a Cassandra database. Since I can only do WHERE clauses on fields in the primary key, I need some recommendations on how to lay this out based on the way that I will need to query it.
My data is in this format:
SYSTEM_SERIAL_NUMBER,DEVICE_ID,TIMESTAMP,...OTHER COLUMNS
Each serial number has multiple devices, and I will have thousands of timestamps for every device, so my primary key to uniquely identify each set of data has to include all three.
There are basically two types of queries I will do on this data.
SELECT * FROM TABLE WHERE system_serial_number = 'X' and device_id = 'x' and timestamp (is in a range)
or
SELECT * FROM TABLE WHERE system_serial_number = 'X' and timestamp (is in a range)
The second one is the more likely query, because I am typically going to input a time range in the application and I want to see data from every single device for a given serial number. But I can't leave the device ID out of the key because you need serial/device/timestamp to be able to uniquely identify an entire row.
I've tried to create my tables as follows:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
...,
PRIMARY KEY ((system_serial_number,device_id),time_stamp)
);
And also as:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
...,
PRIMARY KEY (system_serial_number,device_id,time_stamp)
);
The first one I think would keep me from hitting column limitations, but it always requires me to enter a Device ID along with the Serial every time I query. The second one is less column efficient (based on my understanding), and it allows me to search by serial only. Neither one of them lets me search by just serial/timestamp, which is actually the most common search that I am going to do, but isn't unique enough to be a primary key.
The only way I've even been able to get a query to work is by using the first one with the compound key and then adding a secondary index for just serial number, which then allows me to search by serial/timestamp, but I have to use the inefficient ALLOW FILTERING.
Any suggestions on the best way to get what I need?
The simplest answer is:
PRIMARY KEY (system_serial_number, time_stamp, device_id)
system_serial_number will be the partition key that identifies which replicas (nodes) will contain the data. All data for a single serial number will need to fit in the same partition. For efficient access, all queries will be required to specify a serial number. If partition size is a concern, there may be ways to further subdivide if the use case allows.
time_stamp will be the clustering key used to sort the rows within the partition. That is, all logical rows for the same serial number will be ordered by the timestamp, irrespective of the device. The first PK column that is not a part of the partition key determines the sort order.
device_id is an additional PK column to distinguish your logical rows, but does not help you sort or do other range scans.
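Put together (column types taken from the question, the time range is just an example), that suggestion looks like:
CREATE TABLE devices (
    system_serial_number text,
    device_id int,
    time_stamp timestamp,
    PRIMARY KEY (system_serial_number, time_stamp, device_id)
);
-- The common query: all devices for a serial number within a time range
SELECT * FROM devices
WHERE system_serial_number = 'X'
  AND time_stamp >= '2016-01-01' AND time_stamp < '2016-02-01';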
Since you mentioned that each device would generate thousands of timestamps, and each serial number will have many devices, you may also need to be concerned about the size of your partitions if you take the above approach. A common approach is to break the data for a single serial number across multiple partitions, but that can make querying your data either more efficient or more troublesome, depending on how you decide to subdivide the data.
You will have to use some imagination and knowledge of your specific use cases to decide on the proper partitioning layout. Off the top of my head, I can think of some ideas:
PRIMARY KEY ((system_serial_number, device_hash_modulus), time_stamp, device_id)
Idea: hash your device IDs and apply a modulus to split the data across a fixed number of "buckets"
Advantage: with an even hash distribution, spreads data evenly across a known number of nodes
Disadvantage: querying across "all devices" for a given serial number requires making N queries, one for each "bucket" based on the number chosen for the modulo operation
Disadvantage: may need to adjust bucketing scheme (and migrate data) if initial choice is too small for eventual data size
PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
Idea: split the data over time into different partitions, with the size determined by how coarse you make the partitioning timestamp (year? year+month? year+day? etc.); see the sketch after this list. The decision should be made based on how many unique records are expected within a given time period.
Advantage: assuming the cluster is configured with a random partitioner, the data will be evenly distributed around the cluster as time moves forward.
Disadvantage: querying for records across a range of time may involve making separate queries to different partitions, making the program logic more complex. If the partition timestamp isn't coarse enough, or the timestamp range to be searched is too wide, performance will be impacted.
There may be other options available to you, but it will all depend on how well you understand your current use cases (and how well you can predict the future behavior of your data set).
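As a sketch of the second option above, assuming a month-level partitioning timestamp (the granularity and the derived column are illustrative):
CREATE TABLE devices_by_month (
    system_serial_number text,
    coarse_time_stamp text,   -- e.g. '2016-01', derived from time_stamp at insert
    time_stamp timestamp,
    device_id int,
    PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
);
-- A time range that spans months requires one query per month bucket:
SELECT * FROM devices_by_month
WHERE system_serial_number = 'X' AND coarse_time_stamp = '2016-01'
  AND time_stamp >= '2016-01-15';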
