What are the rules-of-thumb for choosing the right partition key in Cassandra? - cassandra

I just start learning about Cassandra and going deeper to understand what is happening backstage that makes Cassandra too much faster. I go through the following docs1 & docs2 but was still confused about choosing the right partition key for my table.
I'm designing the Model for a test application like Slack and creating a message table like:
CREATE TABLE messages (
id uuid,
work_space_id text,
user_id text,
channel_id text,
body text,
edited boolean,
deleted boolean,
year text,
created_at TIMESTAMP,
PRIMARY KEY (..................)
);
My query is to fetch all the messages by a channel_id and work_space_id. So following are the options in my mind to choose the Primary Key:
PRIMARY KEY ((work_space_id, year), channel_id, created_at)
PRIMARY KEY ((channel_id, work_space_id), created_at)
If I go with option 1, so each workspace has a separate partition by a year. This will might create Hotspot if one workspace has 100 Million messages and other has few hundreds in a year.
If I go with option 2, so each workspace channel has seprate partition. What if there are 1Million workspaces & each has 1K channels. This will create about 1B partitions. I know the limit is of 2Billion.
So what is the rool of thumb to choose the right partition key that will distribute data evenly and not create hotspots in a data center?

The primary rule of data modeling for Cassandra is that you must design a table for each application query. In your case, the app query needs to retrieve all messages based on the workspace and channel IDs.
The two critical things from your app query which should guide you are:
Retrieve multiple messages.
Filter by workspace + channel IDs.
The filter determines the partition key for the table which is (workspace_id, channel_id) and since each partition contains rows of messages, we'll use the created_at column as the clustering key so it can be sorted in descending chronological order so we have:
CREATE TABLE messages_by_workspace_channel_ids (
workspace_id text,
channel_id text,
created_at timestamp,
user_id text,
body text,
edited boolean,
deleted boolean,
PRIMARY KEY ((workspace_id, channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
Ordinarily we would stop there but as you pointed out correctly, each channel could potentially have millions of messages which would lead to very large partitions. To avoid that, we need to group the messages into "buckets" to make the partitions smaller.
You attempted to do that by grouping messages by year but it may not be enough. The general recommendation is to keep partitions to 100MB for optimum performance -- smaller partitions are faster to read. We can make the partitions smaller by also grouping them into months:
CREATE TABLE messages_by_workspace_channel_ids_yearmonth (
workspace_id text,
channel_id text,
year int,
month int,
created_at timestamp,
...
PRIMARY KEY ((workspace_id, channel_id, year, month), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
You could make them even smaller by further grouping them into dates:
CREATE TABLE messages_by_workspace_channel_ids_createdate (
workspace_id text,
channel_id text,
createdate date,
created_at timestamp,
...
PRIMARY KEY ((workspace_id, channel_id, createdate), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
The more "buckets" you use, the more partitions you will have in the table which is ideal since more partitions means greater distribution of data across the cluster. Cheers!
๐Ÿ‘‰ Please support the Apache Cassandra community by hovering over the cassandra tag then click on the Watch tag button. ๐Ÿ™ Thanks!

Related

Cassandra Data modelling with multiple tables for same data

Cassandra Data Modeling Query
Hello,
The data model i am working on is as below with different tables for same data data set for satisfying different kinds of query. The data mainly stores event data of some campaigns sent out on multiple channels like email, web, mobile app, sms etc. Events can include page visits, email opens, link clicks etc for different subscribers.
Table 1:
(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)
PRIMARY KEY ((enterprise_id,campaign_id),domain_id, event_category, event_action, datetime, subscription_id))
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 1:
I have partition key as enterprise_id + campaign_id . Each enterprise can have several campaigns . The datastore may have data for few hundred campaigns. Each campaign can have upto 2-3 million records. Hence there may be 3000 partitions across 100 enterprises and each partition having 2-3 miilion records.
Cassandra Queries: Query always with partition key + primary key including the datetime field. The subscription id is included in primary key to keep each record unique as we can have multiple records with similar values for rest of the keys in primary key. enterprise_id +c ampaign_id is always available as a filter in the queries.
Table 2:
(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id))
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 2) : I have partition key as enterprise_id only. Each enterprise can have several campaigns . May be few hundred campaigns. Each campaign can have upto 2-3 Mn records. In this case the partition is quite big with data for all campaigns in a single partition. can have upto 800 - 900 million entries
Cassandra Queries: Query always with partition key + primary key upto datetime. The subscription id is included in primary key to keep each record unique as we can have multiple records with similar values for rest of the keys in primary key. In this case, data has to be queries across campaigns and we may not have campaign_id as a filter in the queries.
Table 3:
(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime, ))
CLUSTERING ORDER BY ( subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC,)
Keys and Data size for Table 3) : I have partition key as enterprise_id. Each enterprise can have several campaigns . May be few hundred campaigns. Each campaign can have upto 2-3 Mn records. In this case the partition is quite big with data for all campaigns in a single partition. can have upto 800 -900 million entries
Cassandra Queries: Query always with partition key + primary key as subscription_id only. Should be able to query directly on enterprise_id + subscription_id.
My Queries:
Size of data on each partition: With Table 2) and Table 3) i may end up with more than 800 -900 million rows per partition. As per my reading it is not ok to have so many entries per partition. How can i achieve my use case in this scenario? Even if i create multiple partitions based on some data like a week_number (1-52 in a year), the query will need to query across all partitions and end up using a IN clause with all week numbers which is as good as scanning all data.
Is it ok to have multiple tables with same partition key and different primary keys with Clustering order change? For example in Table 2 and Table 3 the hash will be on enterprise_id and will lead to same node. However only the clustering key order has changed and will allow me to query directly on the required key. Will the data be in different physical partitions for Table2 and Table3 in such a scenario? Or if it maps to same partition number how will cassandra internally distinguish between the two tables?
Is it ok to use ALLOW FILTERING if i specify the partition key. For example i can avoid the need for creating Table 3 and use table 2 for query on subscription_id directly if i use ALLOW FILTERING on Table 2. What will be the impact again.
First of all, please only as one question per question. Given the length and detail required for your answers, this post is unlikely to provide long term value for future users.
As per my reading it is not ok to have so many entries per partition. How can I achieve my use case in this scenario?
Unfortunately, if partitioning on a time component will not work, then you'll have to find some other column to partition the data by. I've seen rows-per-partition work ok in the range of 50k to 20k. Most of those use cases on the higher end had small partitions. It looks like your model has many columns, so I'd be curious as to the average partition size. Essentially, find a column to partition on which keeps your partition sizes in the 10MB to 1MB range.
Is it ok to have multiple tables with same partition key and different primary keys with Clustering order change?
Yes, this is perfectly fine.
Will the data be in different physical partitions for Table2 and Table3 in such a scenario? Or if it maps to same partition number how will cassandra internally distinguish between the two tables?
The partition is hashed into a number ranging from +/- 2^63. That number will then be compared to the partition ranges mapped to all nodes, and then the query will be sent to that node. So all the partition does, is determine which node is responsible for the data.
The tables have their data files written to different directories, based on table name. So Cassandra distinguishes between the tables by the table name provided in the query. Nothing you need to worry about.
Is it ok to use ALLOW FILTERING if I specify the partition key.
I would still recommend against it if you're concerned about performance. But the good thing about using the ALLOW FILTERING directive while specifying a full partition key, will indeed prevent Cassandra from reading multiple nodes to build the result set. So that should be ok. The only drawback here, is that Cassandra stores/reads data from disk by the defined CLUSTERING ORDER, and using ALLOW FILTERING obviously complicates that process (forcing random reads vs. sequential reads).

If not MaterializedViews and not secondary indices then what else is the recommended way to query data in cassandra

I have some data in Cassandra. Say
create table MyTable {
id text PRIMARY KEY,
data text,
updated_on timestamp
}
My application in addition to querying this data by primary key id, needs to query it by updated_on timestamp as well. To fulfil the query by time use case I have tried the following.
create table MyTable {
id text PRIMARY KEY,
data text,
updated_on timestamp,
updated_on_minute timestamp
}
Secondary index on the updated_on_minute field. As I understand, secondary indexes are not recommended for high cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover I have data that gets frequently updated, which means the updated_on_minute will keep revving.
MaterializedView with updated_on_minute as the partition key and a id as the clustering key. I am on version 3.9 of cassandra and had just begun using these, but alas I find these release notes for 3.11x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As always have been the case, create additional table to query by a different partition key.
In your case the table would be
create table MyTable_by_timestamp {
id text,
data text,
updated_on timestamp,
Primary key(updated_on, id)
}
Write to both tables mytable_by_timetamp and mytable_by_id. Use the corresponding table to READ from based on the partition key either updated_on or id.
Itโ€™s absolutely fine to duplicate data based on the use case (query) itโ€™s trying solve.
Edited:
In case there is a fear about huge partition, you can always bucket into smaller partitions. For example the table above could be broken down into
create table MyTable_by_timestamp {
id text,
data text,
updated_on timestamp,
updated_min timestamp,
Primary key(updated_min, id)
}
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.

How to model inbox

How would ago about modelling the data if I have a web app for messaging and I expect the user to either see all the messages ordered by date, or see the messages exchanged with a specific contact, again ordered by date.
Should I have two tables, called "global_inbox" and "contacts_inbox" where I would add each message to both?
For example:
CREATE TABLE global_inbox(user_id int, timestamp timestamp,
message text, PRIMARY KEY(user_id, timestamp)
CREATE TABLE inbox(user_id int, contact_id int,
timestamp timestapm, message text,
PRIMARY KEY(user_id, contact_id, timestamp)
This means that every message should be copied 4 times, 2 for sender and 2 for receiver. Does it sound reasonable?
Yes, It's reasonable.
You need some modification.
Inbox table : If a user have many contact and every contact send message, then a huge amount of data will be inserted into a single partition (user_id). So add contact_id to partition key.
Updated Schema :
CREATE TABLE inbox (
user_id int,
contact_id int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id, contact_id), timestamp)
);
global_inbox : Though It's global inbox, a huge amount of data can be inserted into a single partition (user_id). So add more key to partition key to more distribution.
Updated Schema :
CREATE TABLE global_inbox (
user_id int,
year int,
month int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id,year,month), timestamp)
);
Here you can also add also add week to partition key, if you have huge data in a single partition in a week. Or remove month from partition key if you think not much data will insert in a year.
In term of queries performance, Yes it sounds good for me. Apache cassandra is really built in for this kind of data modeling. We build table to satisfy queries. This is the process called 'Denormalization' in Cassandra paradigm. This will increase queries performance. You have duplicated data but the main goal is to have fast queries.

cassandra ordering or sorting

Need to fetch latest result in a table without mentioning partion key . For example need latest tweets .Problems facing as follows,
create table test2.threads(
thread text ,
created_date timestamp,
forum_name text,
subject text,
posted_by text,
last_reply_timestamp timestamp,
PRIMARY KEY (thread,last_reply_timestamp)
)
WITH CLUSTERING ORDER BY (last_reply_timestamp DESC);
Only if i know the partion key , I can retrive data .
select * from test2.threads where thread='one' order by last_reply_timestamp DESC;
How can i get latest threads sort by desc without mentioning where condition?
Your data model is not suited for that purpose. The partitions are not ordered. You'd have to loop over the partition keys, fetch a few and then see which ones are the most recent at the application level.

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I got a query problem.
Here is my tab :
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I indicate my partition key, doodle_id.
For example I can make without any problems :
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But on the last request I made :
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error :
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new with Cassandra, but correct me if I'm wrong, in a composite primary key, the first part is the PARTITION KEY which is mandatory to allow Cassandra to know where to look for data.
Then the others parts are CLUSTERING KEY to sort data.
But I still don't get why my first request is working and not the second one ?
If anyone could help it will be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarilly with user_id), is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
You can see your primary key as a file path doodle_id#123/user_id#456/schedule_id#789 where all data is stored in the deepest folder (ie schedule_id#789). When you're querying you have to indicate the subfolder/subtree from where you start searching.
Your 2nd query doesn't work because of how columns are organized within partition. Cassandra can not get a continuous slice of columns in the partition because they are interleaved.
You should invert the primary key order (doodle_id, schedule_id, user_id) to be able to run your query.

Resources