Cassandra data modelling with multiple tables for the same data

Cassandra Data Modeling Query
Hello,
The data model I am working on is below, with different tables for the same data set to satisfy different kinds of queries. The data mainly stores event data for campaigns sent out on multiple channels like email, web, mobile app, SMS, etc. Events can include page visits, email opens, link clicks, etc., for different subscribers.
Table 1:
(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)
PRIMARY KEY ((enterprise_id, campaign_id), domain_id, event_category, event_action, datetime, subscription_id)
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 1:
The partition key is enterprise_id + campaign_id. Each enterprise can have several campaigns. The datastore may have data for a few hundred campaigns. Each campaign can have up to 2-3 million records. Hence there may be 3,000 partitions across 100 enterprises, with each partition holding 2-3 million records.
Cassandra queries: Queries always use the partition key plus the clustering columns up to and including the datetime field. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with identical values for the rest of the keys. enterprise_id + campaign_id is always available as a filter in the queries.
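For illustration, a typical read against Table 1 might look like this (the table name and literal values are hypothetical):
SELECT * FROM events_by_campaign
WHERE enterprise_id = 42
AND campaign_id = 1001
AND domain_id = 'example.com'
AND event_category = 'email'
AND event_action = 'open'
AND datetime >= '2022-01-01' AND datetime < '2022-02-01';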
Table 2:
(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id)
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and data size for Table 2: The partition key is enterprise_id only. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra queries: Queries always use the partition key plus the clustering columns up to datetime. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with identical values for the rest of the keys. In this case, data has to be queried across campaigns, and we may not have campaign_id as a filter in the queries.
Table 3:
(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime)
CLUSTERING ORDER BY (subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC)
Keys and data size for Table 3: The partition key is enterprise_id. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra queries: Queries always use the partition key plus subscription_id only. It should be possible to query directly on enterprise_id + subscription_id, as in the sketch below.
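A hypothetical example of such a query against Table 3 (the table name and values are illustrative):
SELECT * FROM events_by_subscriber
WHERE enterprise_id = 42
AND subscription_id = 'sub-123';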
My Queries:
Size of data on each partition: With Table 2 and Table 3, I may end up with more than 800-900 million rows per partition. As per my reading, it is not OK to have so many entries per partition. How can I achieve my use case in this scenario? Even if I create multiple partitions based on something like a week_number (1-52 in a year), the query would need to hit all partitions and end up using an IN clause with all week numbers, which is as good as scanning all the data.
Is it OK to have multiple tables with the same partition key but different primary keys and clustering orders? For example, in Table 2 and Table 3 the hash will be on enterprise_id and will lead to the same node. Only the clustering key order has changed, which allows me to query directly on the required key. Will the data be in different physical partitions for Table 2 and Table 3 in such a scenario? Or, if it maps to the same partition number, how will Cassandra internally distinguish between the two tables?
Is it OK to use ALLOW FILTERING if I specify the partition key? For example, I could avoid creating Table 3 and query Table 2 on subscription_id directly if I use ALLOW FILTERING. What would the impact be?

First of all, please ask only one question per post. Given the length and detail required for the answers, this post is unlikely to provide long-term value for future users.
As per my reading, it is not OK to have so many entries per partition. How can I achieve my use case in this scenario?
Unfortunately, if partitioning on a time component will not work, then you'll have to find some other column to partition the data by. I've seen rows-per-partition work OK in the range of 20k to 50k. Most of the use cases on the higher end of that range had small rows. It looks like your model has many columns, so I'd be curious about the average partition size. Essentially, find a column to partition on that keeps your partition sizes in the 1MB to 10MB range.
Is it OK to have multiple tables with the same partition key but different primary keys and clustering orders?
Yes, this is perfectly fine.
Will the data be in different physical partitions for Table 2 and Table 3 in such a scenario? Or, if it maps to the same partition number, how will Cassandra internally distinguish between the two tables?
The partition key is hashed into a token, a number in the range of roughly -2^63 to +2^63. That token is then compared to the token ranges assigned to the nodes, and the query is sent to the node responsible for that range. So all the partition key does is determine which node is responsible for the data.
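You can observe this mapping with the built-in token() function; here is a sketch against the Table 2 schema above (the table name is hypothetical):
SELECT token(enterprise_id), enterprise_id
FROM events_by_enterprise
LIMIT 5;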
The tables have their data files written to different directories, based on table name. So Cassandra distinguishes between the tables by the table name provided in the query. Nothing you need to worry about.
Is it OK to use ALLOW FILTERING if I specify the partition key?
I would still recommend against it if you're concerned about performance. That said, using the ALLOW FILTERING directive while specifying the full partition key will indeed prevent Cassandra from reading multiple nodes to build the result set, so that should be OK. The only drawback is that Cassandra stores and reads data from disk in the defined CLUSTERING ORDER, and ALLOW FILTERING complicates that process (forcing random reads instead of sequential reads).
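For example, the single-partition filtered read described in the question might look like this against Table 2 (the table name and values are hypothetical):
SELECT * FROM events_by_enterprise
WHERE enterprise_id = 42
AND subscription_id = 'sub-123'
ALLOW FILTERING;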

Related

What are the rules-of-thumb for choosing the right partition key in Cassandra?

I have just started learning about Cassandra and am digging deeper to understand what happens behind the scenes that makes Cassandra so fast. I went through the following docs1 & docs2 but was still confused about choosing the right partition key for my table.
I'm designing the model for a test application like Slack and am creating a messages table like:
CREATE TABLE messages (
id uuid,
work_space_id text,
user_id text,
channel_id text,
body text,
edited boolean,
deleted boolean,
year text,
created_at TIMESTAMP,
PRIMARY KEY (..................)
);
My query is to fetch all the messages by a channel_id and work_space_id. So the following are the options in my mind for choosing the primary key:
PRIMARY KEY ((work_space_id, year), channel_id, created_at)
PRIMARY KEY ((channel_id, work_space_id), created_at)
If I go with option 1, each workspace has a separate partition per year. This might create a hotspot if one workspace has 100 million messages in a year and another has only a few hundred.
If I go with option 2, each workspace channel has a separate partition. What if there are 1 million workspaces, each with 1K channels? That would create about 1 billion partitions. I know the limit is 2 billion.
So what is the rule of thumb for choosing the right partition key so that data is distributed evenly and no hotspots are created in a data center?
The primary rule of data modeling for Cassandra is that you must design a table for each application query. In your case, the app query needs to retrieve all messages based on the workspace and channel IDs.
The two critical things from your app query which should guide you are:
Retrieve multiple messages.
Filter by workspace + channel IDs.
The filter determines the partition key for the table, which is (workspace_id, channel_id). Since each partition contains rows of messages, we'll use the created_at column as the clustering key so they can be sorted in descending chronological order, giving us:
CREATE TABLE messages_by_workspace_channel_ids (
workspace_id text,
channel_id text,
created_at timestamp,
user_id text,
body text,
edited boolean,
deleted boolean,
PRIMARY KEY ((workspace_id, channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
Ordinarily we would stop there, but as you correctly pointed out, each channel could potentially have millions of messages, which would lead to very large partitions. To avoid that, we need to group the messages into "buckets" to make the partitions smaller.
You attempted to do that by grouping messages by year, but it may not be enough. The general recommendation is to keep partitions under 100MB for optimum performance -- smaller partitions are faster to read. We can make the partitions smaller by also grouping them into months:
CREATE TABLE messages_by_workspace_channel_ids_yearmonth (
workspace_id text,
channel_id text,
year int,
month int,
created_at timestamp,
...
PRIMARY KEY ((workspace_id, channel_id, year, month), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
You could make them even smaller by further grouping them into dates:
CREATE TABLE messages_by_workspace_channel_ids_createdate (
workspace_id text,
channel_id text,
createdate date,
created_at timestamp,
...
PRIMARY KEY ((workspace_id, channel_id, createdate), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
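As a usage sketch, reading one day's messages from the date-bucketed table would then require the bucket in the WHERE clause (values are hypothetical):
SELECT * FROM messages_by_workspace_channel_ids_createdate
WHERE workspace_id = 'acme'
AND channel_id = 'general'
AND createdate = '2022-01-20';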
The more "buckets" you use, the more partitions you will have in the table which is ideal since more partitions means greater distribution of data across the cluster. Cheers!

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried to create a secondary index, but with no success.
This is a query I've created
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data, Cassandra would need to read all of it and find the highest value. That would require scanning data on multiple nodes and would take a very long time.
A materialized view also won't help much, as ordering only exists inside an individual partition, so you would need to put all your data into a single partition, which could be huge and would leave the data imbalanced.
I can only think of the following workaround (a sketch follows these steps):
Have an additional table with all the columns of the original table, but with a fake partition key and no clustering columns.
Do inserts into that table in parallel with the normal inserts, but use a fixed value for the fake partition key, and explicitly set the record's write timestamp equal to start_time (don't forget to multiply by 1000, as the write timestamp is in microseconds). That way the stored row is guaranteed to be the one with the highest timestamp, as Cassandra won't overwrite it with data carrying a lower timestamp.
But this doesn't solve the data-skew problem, and all traffic will be handled by a fixed number of nodes equal to the RF.
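A minimal sketch of this workaround, assuming a companion table named latest_event and a row whose start_time is 1529632009872 ms (all names and values are illustrative):
CREATE TABLE latest_event (
dummy int,
event_id varchar,
number varchar,
start_time timestamp,
PRIMARY KEY (dummy)
);
-- Written alongside every normal insert; last-write-wins keeps the row with the highest start_time
INSERT INTO latest_event (dummy, event_id, number, start_time)
VALUES (0, 'evt-1', '555-0100', 1529632009872)
USING TIMESTAMP 1529632009872000;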
Another alternative - use another database.
This type of query isn't viable in big data systems because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions, each with thousands of rows, spread across hundreds of nodes. A full table scan in a large cluster would take a very long time if it were allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Cassandra partition key

Currently, I am exploring Cassandra and have a special use case: designing a support view for an application.
My access patterns.
To fetch specific transactions:
select * from purchase_by_user where user_id='Tom' and transaction_date='1/20/22'
select * from purchase_by_user where user_id='Jerry' and transaction_date >= '1/16/22' and transaction_date <= '1/21/22'
select * from purchase_by_user where user_id='Tom' and amount=100
select * from purchase_by_user where user_id='Jerry' and amount >= 50
CREATE TABLE purchase_by_user (
order_id uuid,
amount decimal,
transaction_ts timestamp,
user_id text,
PRIMARY KEY ((user_id), order_id)
);
Let's say Tom is making millions of orders. With the above partition key, the data will not be evenly spread across the cluster, and searches will also be expensive.
Can anyone help with what a better partition key would be here?
I'd go with a PRIMARY KEY definition like this:
PRIMARY KEY ((user_id, transaction_year), transaction_date, order_id)
) WITH CLUSTERING ORDER BY (transaction_date DESC, order_id ASC);
This makes use of the "bucketing" concept that Manish mentioned. In this case, if Tom is creating an order every single day, there will only be 365 rows in each partition.
Let's say Tom is making millions of orders
In fact, even if Tom placed two orders per day, that's still only 730 rows per partition. So while thinking about throughput extremes is a good exercise, a single user placing even one million orders is probably not realistic.
Also, some of the queries above are using transaction_date in a range query. I've added transaction_date as the first clustering key to support those queries. And if transaction_date is in DESCending order, the most-recent transactions will be at the "top" of the partition (they'll be read first), which is usually how most date/time-driven applications tend to function.
You can use the concept of bucketing to reduce the number of rows in a single partition. For example, you can create a partition key like (user_id, bucket_number). You can determine the maximum value of bucket_number from your expected data size; if you expect a user to make millions of orders, you could use bucket values up to 1000. The main idea is to make sure you don't end up creating partitions with a very large number of rows. A sketch of such a table follows.
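A minimal sketch of such a bucketed table, assuming the application computes bucket_number (for example, a hash of the order id modulo the bucket count) before each write (all names are illustrative):
CREATE TABLE purchase_by_user_bucketed (
user_id text,
bucket_number int,
transaction_ts timestamp,
order_id uuid,
amount decimal,
PRIMARY KEY ((user_id, bucket_number), transaction_ts, order_id)
) WITH CLUSTERING ORDER BY (transaction_ts DESC, order_id ASC);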

Apache Cassandra stock data model design

I have a lot of data on stock prices and I want to try Apache Cassandra out for this purpose. But I'm not quite familiar with primary/partition/clustering keys.
My database columns would be:
Stock_Symbol
Price
Timestamp
My users will always filter on Stock_Symbol (WHERE stock_symbol = XX) and then might filter on a certain time range (greater/less than (or equal)). There will be around 30,000 stock symbols.
Also, what is the big difference when using another "filter", e.g. exchange_id (only two stock exchanges are available).
Exchange_ID
Stock_Symbol
Price
Timestamp
So my users would first filter on the stock exchange (which is more or less a foreign key), then on the stock symbol (which is also more or less a foreign key). The data would be inserted/written in this order as well.
How do I have to choose the keys?
The Quick Answer
Based on your use-case and predicted query pattern, I would recommend one of the following for your table:
PRIMARY KEY (Stock_Symbol, Timestamp)
The partition key is made of Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those two fields. If either is to be filtered on, filtering on Stock_Symbol will be required in the query and must come as the first condition in WHERE.
Or, for the second case you listed:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
The partition key is composed of Exchange_ID and Stock_Symbol, and Timestamp is the only clustering column. This will allow WHERE to be used with those three fields. If any of those three are to be filtered on, filtering on both Exchange_ID and Stock_Symbol will be required in the query and must come in that order as the first two conditions to WHERE.
See the last section of this answer for a few other variations that could also be applied based on your needs.
Long Answer & Explanation
Primary Keys, Partition Keys, and Clustering Columns
Primary keys in Cassandra, similar to their role in relational databases, serve to identify records and index them in order to access them quickly. However, due to the distributed nature of records in Cassandra, they serve the secondary purpose of also determining which node a given record should be stored on.
The primary key in a Cassandra table is further broken down into two parts - the Partition Key, which is mandatory and by default is the first column in the primary key, and optional clustering column(s), which are all fields that are in the primary key that are not a part of the partition key.
Here are some examples:
PRIMARY KEY (Exchange_ID)
Exchange_ID is the sole field in the primary key and is also the partition key. There are no additional clustering columns.
PRIMARY KEY (Exchange_ID, Timestamp, Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is Exchange_ID, and Timestamp and Stock_Symbol are both clustering columns.
PRIMARY KEY ((Exchange_ID, Timestamp), Stock_Symbol)
Exchange_ID, Timestamp, and Stock_Symbol together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. The extra parentheses around Exchange_ID and Timestamp group them into a single composite partition key, and Stock_Symbol is a clustering column.
PRIMARY KEY ((Exchange_ID, Timestamp))
Exchange_ID and Timestamp together form a composite primary key. The partition key is composed of both Exchange_ID and Timestamp. There are no clustering columns.
But What Do They Do?
Internally, the partitioning key is used to calculate a token, which determines on which node a record is stored. The clustering columns are not used in determining which node to store the record on, but they are used in determining order of how records are laid out within the node - this is important when querying a range of records. Records whose clustering columns are similar in value will be stored close to each other on the same node; they "cluster" together.
Filtering in Cassandra
Due to the distributed nature of Cassandra, fields can only be filtered on if they are indexed. This can be accomplished in a few ways, usually by being a part of the primary key or by having a secondary index on the field. Secondary indexes can cause performance issues according to DataStax Documentation, so it is typically recommended to capture your use-cases using the primary key if possible.
Any field in the primary key can have a WHERE clause applied to it (unlike unindexed fields which cannot be filtered on in the general case), but there are some stipulations:
Order Matters - The primary key fields in the WHERE clause must be in the order that they are defined; if you have a primary key of (field1, field2, field3), you cannot do WHERE field2 = 'value', but rather you must include the preceding fields as well: WHERE field1 = 'value' AND field2 = 'value'.
The Entire Partition Key Must Be Present - If applying a WHERE clause to the primary key, the entire partition key must be given so that the cluster can determine which node the requested data is located on; if you have a primary key of ((field1, field2), field3), you cannot do WHERE field1 = 'value', but rather you must include the full partition key: WHERE field1 = 'value' AND field2 = 'value'. (Both rules are illustrated below.)
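Both rules can be shown against a hypothetical table with PRIMARY KEY ((field1, field2), field3):
-- Valid: full partition key given, clustering column follows in order
SELECT * FROM example_table WHERE field1 = 'a' AND field2 = 'b' AND field3 = 'c';
-- Invalid: partition key incomplete (field2 is missing)
SELECT * FROM example_table WHERE field1 = 'a' AND field3 = 'c';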
Applied to Your Use-Case
With the above info in mind, you can take the analysis of how users will query the database, as you've done, and use that information to design your data model, or more specifically in this case, the primary key of your table.
You mentioned that you will have about 30k unique values for Stock_Symbol, and further that it will always be included in WHERE clauses. That sounds initially like a reasonable candidate for a partition key, as long as queries will include only a single value that they are searching for in Stock_Symbol (e.g. WHERE Stock_Symbol = 'value' as opposed to WHERE Stock_Symbol < 'value'). If a query is intended to return multiple records with multiple values in Stock_Symbol, there is a danger that the cluster will need to retrieve data from multiple nodes, which may result in performance penalties.
Further, if your users wish to filter on Timestamp, it should also be a part of the primary key, though wanting to filter on a range indicates to me that it probably shouldn't be a part of the partition key, so it would be a good candidate for a clustering column.
This brings me to my recommendation:
PRIMARY KEY (Stock_Symbol, Timestamp)
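As a concrete sketch of this recommendation, assuming a table named stock_prices (the name and column types are illustrative):
CREATE TABLE stock_prices (
stock_symbol text,
timestamp timestamp,
price decimal,
PRIMARY KEY (stock_symbol, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
A query like SELECT * FROM stock_prices WHERE stock_symbol = 'XX' AND timestamp >= '2022-01-01'; would then be served from a single partition.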
If it were important to distribute data based on both the Stock_Symbol and the Timestamp, you could introduce a pre-calculated time-bucketed field that is based on the time but with less cardinality, such as Day_Of_Week or Month or something like that:
PRIMARY KEY ((Stock_Symbol, Day_Of_Week), Timestamp)
If you wanted to introduce another field to filtering, such as Exchange_ID, it could be a part of the partition key, which would mandate it being included in filters, or it could be a part of the clustering column, which would mean that it wouldn't be required unless subsequent fields in the primary key needed to be filtered on. As you mentioned that users will always filter by Exchange_ID and then by Stock_Symbol, it might make sense to do:
PRIMARY KEY ((Exchange_ID, Stock_Symbol), Timestamp)
Or to make it non-mandatory:
PRIMARY KEY (Stock_Symbol, Exchange_ID, Timestamp)

Cassandra data model for time series data

To monitor some distributed software, I insert its monitoring data into a Cassandra table. The columns are metric_type, metric_value, host_name, component_type, and time_stamp. The scenario is that I collect all the metrics for all the nodes every second. The time is uniform across all nodes and their metrics. The keys (the ones that differentiate rows) are host_name, component_type, metric_type, and time_stamp. I designed my table like below:
CREATE TABLE metrics (
component_type text,
host_name text,
metric_type text,
time_stamp bigint,
metric_value text,
PRIMARY KEY ((component_type, host_name, metric_type), time_stamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC);
where component_type, host_name, and metric_type form the partition key and time_stamp is the clustering key.
The metrics table is suitable for queries that fetch data by timestamp for a given host_name, metric_type, and component_type: using the partition key, Cassandra finds the partition where the data is stored, and using the clustering key it fetches the requested range from that partition. This is the optimal case for Cassandra queries.
Besides that, I need a query that fetches all data using just time_stamp. For example:
SELECT * from metrics WHERE time_stamp >= 1529632009872 and time_stamp < 1539632009872 ;
I know the metrics table is not optimal for the above query, because it would have to search every partition to fetch the data. I guess in this situation we should design another table with time_stamp as the partition key, so data can be fetched from one partition or a limited number of partitions. But I am not certain about some aspects:
Is it optimal to set time_stamp as the partition key? I insert data into the database every second, so the number of partitions would be huge!
I need my queries to be ranges over time_stamp, and I know range conditions are not allowed on partition keys, only on clustering keys!
So what is the best Cassandra data model for such time series data and query?
Using time_stamp as partition key is not optimal in my opinion, as it would create a lot of partitions.
I would propose 2 solutions:
1) Go with a "week_first_day" as partition key. You would have to compute the correct week_first_day keys on your application side and then emit multiple select queries.
2) You could use Elasticsearch on top of Cassandra. Cassandra remains the primary data source, but you have the freedom to do complex selects. If you are interested, I would recommend taking a look at Elassandra.
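A minimal sketch of option 1, assuming a week_first_day date column computed by the application (all names are illustrative):
CREATE TABLE metrics_by_week (
week_first_day date,
time_stamp bigint,
component_type text,
host_name text,
metric_type text,
metric_value text,
PRIMARY KEY (week_first_day, time_stamp, component_type, host_name, metric_type)
) WITH CLUSTERING ORDER BY (time_stamp DESC);
-- The application issues one query per week in the requested interval:
SELECT * FROM metrics_by_week
WHERE week_first_day = '2018-06-18'
AND time_stamp >= 1529632009872 AND time_stamp < 1539632009872;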
