How would I go about modelling the data if I have a web app for messaging, and I expect the user to either see all of their messages ordered by date, or see the messages exchanged with a specific contact, again ordered by date?
Should I have two tables, called "global_inbox" and "contacts_inbox" where I would add each message to both?
For example:
CREATE TABLE global_inbox(user_id int, timestamp timestamp,
message text, PRIMARY KEY(user_id, timestamp));
CREATE TABLE inbox(user_id int, contact_id int,
timestamp timestamp, message text,
PRIMARY KEY(user_id, contact_id, timestamp));
This means every message would be copied 4 times: twice for the sender and twice for the receiver. Does that sound reasonable?
Yes, it's reasonable, but you need some modifications.
inbox table: if a user has many contacts and every contact sends messages, a huge amount of data will be inserted into a single partition (user_id). So add contact_id to the partition key.
Updated schema:
CREATE TABLE inbox (
user_id int,
contact_id int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id, contact_id), timestamp)
);
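With this key, a conversation with one contact lives in a single partition and can be read with a straightforward query, for example (a sketch; the literal values are hypothetical):
SELECT message, timestamp
FROM inbox
WHERE user_id = 1 AND contact_id = 42;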
global_inbox table: even though it's the global inbox, a huge amount of data can still end up in a single partition (user_id). So add more columns to the partition key for better distribution.
Updated schema:
CREATE TABLE global_inbox (
user_id int,
year int,
month int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id,year,month), timestamp)
);
Here you could also add a week column to the partition key if a single week's data is huge, or remove month from the partition key if you don't expect much data within a year.
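Note that once year and month are in the partition key, every read must name the bucket. For example, fetching a user's global inbox for June 2018 (hypothetical values):
SELECT message, timestamp
FROM global_inbox
WHERE user_id = 1 AND year = 2018 AND month = 6;
Reading across several months means issuing one such query per bucket from the application.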
In terms of query performance, yes, this sounds good to me. Apache Cassandra is built precisely for this kind of data modeling: we build a table to satisfy each query. This process is called 'denormalization' in the Cassandra paradigm, and it improves query performance. You have duplicated data, but the main goal is to have fast queries.
I have just started learning Cassandra and am digging into what happens behind the scenes that makes it so fast. I went through the following docs1 & docs2 but was still confused about choosing the right partition key for my table.
I'm designing the model for a test application like Slack and creating a messages table like:
CREATE TABLE messages (
id uuid,
work_space_id text,
user_id text,
channel_id text,
body text,
edited boolean,
deleted boolean,
year text,
created_at TIMESTAMP,
PRIMARY KEY (..................)
);
My query is to fetch all the messages by channel_id and work_space_id. So the following are the options in my mind for choosing the primary key:
PRIMARY KEY ((work_space_id, year), channel_id, created_at)
PRIMARY KEY ((channel_id, work_space_id), created_at)
If I go with option 1, each workspace gets a separate partition per year. This might create hotspots if one workspace has 100 million messages in a year and another has a few hundred.
If I go with option 2, each workspace channel gets a separate partition. But what if there are 1 million workspaces, each with 1K channels? That would create about 1 billion partitions. I know the limit is 2 billion.
So what is the rule of thumb for choosing a partition key that distributes data evenly and doesn't create hotspots in a data center?
The primary rule of data modeling for Cassandra is that you must design a table for each application query. In your case, the app query needs to retrieve all messages based on the workspace and channel IDs.
The two critical things from your app query which should guide you are:
Retrieve multiple messages.
Filter by workspace + channel IDs.
The filter determines the partition key for the table, which is (workspace_id, channel_id). Since each partition contains rows of messages, we'll use the created_at column as the clustering key so the rows are sorted in descending chronological order. That gives us:
CREATE TABLE messages_by_workspace_channel_ids (
workspace_id text,
channel_id text,
created_at timestamp,
user_id text,
body text,
edited boolean,
deleted boolean,
PRIMARY KEY ((workspace_id, channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
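A read for the latest messages in a channel then needs only the two partition key columns, for example (a sketch; the IDs are made up):
SELECT * FROM messages_by_workspace_channel_ids
WHERE workspace_id = 'acme' AND channel_id = 'general'
LIMIT 50;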
Ordinarily we would stop there, but as you correctly pointed out, each channel could potentially have millions of messages, which would lead to very large partitions. To avoid that, we need to group the messages into "buckets" to keep the partitions small.
You attempted to do that by grouping messages by year, but it may not be enough. The general recommendation is to keep partitions to around 100MB for optimum performance; smaller partitions are faster to read. We can make the partitions smaller by also grouping them into months:
CREATE TABLE messages_by_workspace_channel_ids_yearmonth (
workspace_id text,
channel_id text,
year int,
month int,
created_at timestamp,
...
PRIMARY KEY ((workspace_id, channel_id, year, month), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
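A query against the month-bucketed table must now include the bucket columns as well (hypothetical values):
SELECT * FROM messages_by_workspace_channel_ids_yearmonth
WHERE workspace_id = 'acme' AND channel_id = 'general'
  AND year = 2023 AND month = 6;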
You could make them even smaller by further grouping them into dates:
CREATE TABLE messages_by_workspace_channel_ids_createdate (
workspace_id text,
channel_id text,
createdate date,
created_at timestamp,
...
PRIMARY KEY ((workspace_id, channel_id, createdate), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
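With date buckets, each query targets a single day, and a multi-day range becomes one query per bucket, issued from the application (hypothetical values):
SELECT * FROM messages_by_workspace_channel_ids_createdate
WHERE workspace_id = 'acme' AND channel_id = 'general'
  AND createdate = '2023-06-15';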
The more "buckets" you use, the more partitions you will have in the table which is ideal since more partitions means greater distribution of data across the cluster. Cheers!
I have been tasked with coming up with a schema to store our application logs using Cassandra. I am quite new to Cassandra, but from what I have read and learned so far, it could be the best approach for our use case.
Our application sends thousands of SMS messages each day (through 3 local service providers), and we would like to log each SMS that is sent (for reconciliation purposes at each month's end, among other things). We intend to store the information below:
id text, // uuid
phone_number text, // recipient of the SMS
message text, // Message sent
status boolean, // if the SMS was sent or not
response text, // Request response
service_provider text, // e.g Twilio, Telnyx, Venmo etc
date timestamp, // Time SMS is sent
We would like to query the following reports at any one time:
Total number of SMS sent
Total SMS sent for a given period of time (between 2 dates)
Total SMS sent by a specific service provider (also within a given time period)
Total SMS sent to a specific recipient phone number (also within a given time period)
Total failed or successful SMS sent (also within a given period of time)
I have come up with the following four tables, but I feel like I am over-engineering or over-thinking it. Perhaps it could be done more simply? I would appreciate any advice on getting this to work efficiently.
create table sms_logs_by_id
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (id, date)
) with clustering order by (date DESC);
create table sms_logs_by_service_provider
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (service_provider, date)
) with clustering order by (date DESC);
create table sms_logs_by_phone_number
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (phone_number, date)
) with clustering order by (date DESC);
create table sms_logs_by_status
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (status, date)
) with clustering order by (date DESC);
Queries run pretty well so far, but I am not sure this is the most optimal way of modelling the data. I would appreciate any advice on how I can improve this data model. Thank you!
The only potential issue I see is with the last 3 tables (logs by status, phone number, and provider): their partitions will get larger over time. It's important to remember that Cassandra has a mathematical limit of 2 billion cells per partition (where a "cell" == a column value or key). But you want to model your data so that you don't get anywhere near that limit, because your table will start getting slow long before that.
For these three, I'd recommend a "bucketing" approach:
sms_logs_by_service_provider
...
primary key (service_provider, date)
For this one, my other concern is that you're tracking only 3 service providers. So in addition to the partitions growing with each message, there are only 3 partitions, which means the data isn't distributing very well. With thousands of messages sent per day, your "bucket" is going to need to be fairly precise, probably by day. Maybe you could get away with a week_bucket, but I'll use day for this example:
id text,
provider text,
service_provider text,
day_bucket int,
date timestamp,
PRIMARY KEY ((service_provider, day_bucket), date, id)
This way, you're creating a partition for each combination of service_provider and day. That will give you plenty of data distribution, plus your partitions won't grow beyond the activity which happens in a single day. I've kept date as a DESC clustering key (good idea) but added id as a "tie-breaker," just in case two messages have the exact same timestamp.
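A read of one provider's traffic for a given day would then look something like this (a sketch that assumes day_bucket is stored as an integer such as yyyymmdd; the values are made up):
SELECT * FROM sms_logs_by_service_provider
WHERE service_provider = 'Twilio' AND day_bucket = 20230615;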
create table sms_logs_by_phone_number (
...
primary key (phone_number, date)
So for this one, I'd take a similar approach. But as we're talking about individual users, we can use a much larger bucket. Based on a quick Google search, the average person sends 85 text messages per day, which comes out to 31,025 per year. That's probably ok to store by year.
id text,
phone_number text,
year_bucket int,
date timestamp,
PRIMARY KEY ((phone_number, year_bucket), date, id)
Partitioning by phone_number already gives you some good distribution.
Adding year_bucket in there will ensure that the partition won't have unbound growth. Also, id for a tie-breaker.
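A read then supplies the phone number and the year, and can still range-filter on date within the partition (hypothetical values):
SELECT * FROM sms_logs_by_phone_number
WHERE phone_number = '+15551234567' AND year_bucket = 2023
  AND date >= '2023-01-01' AND date < '2023-04-01';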
create table sms_logs_by_status(
...
primary key (status, date)
Logs by status is going to have a similar problem to the "provider" table, in that you probably only have a few statuses, so the data distribution will be limited. For this one, you're probably also going to want to use a bucket that's small, like by day.
id text,
status text,
day_bucket int,
date timestamp,
PRIMARY KEY ((status, day_bucket), date, id)
The unfortunate part is that these changes complicate your query patterns, but they're necessary to save you from problems down the road.
I'm working on very high-volume time-series data. I have two Kafka topics:
1) Real-time time-series data from moving vehicles, sent every 5 seconds.
2) Historical time-series data for about 10% of vehicles: when a vehicle travels in a remote area, its data is sent once it comes back into network coverage, which may be after a few hours, days, or weeks.
So my Cassandra table is somewhat like this:
CREATE TABLE locationinfo (
imei text,
date text,
entrydt timestamp,
gpsdt timestamp,
lastgpsdt timestamp,
latitude text,
longitude text,
odo int,
speed int,
PRIMARY KEY ((imei, date), gpsdt)
) WITH CLUSTERING ORDER BY (gpsdt ASC);
& I'm using Spark Streaming to fetch data from Kafka and inserting into Cassandra, here clustering key is gpsdt. Whenever History data comes from Kafka, lot of shuffle happens in table as we know the architecture of Cassandra. Data is nothing but stored in sequential order on the partition defined & for history entries records comes from between the lines. So, What happens is after a certain period of time, spark streaming application gets hang. After lot of search I found that there might be some problem with my table strategy, So if I create a table schema like this -
CREATE TABLE locationinfo (
imei text,
date text,
entrydt timestamp,
gpsdt timestamp,
lastgpsdt timestamp,
latitude text,
longitude text,
odo int,
speed int,
PRIMARY KEY ((imei, date), entrydt)
) WITH CLUSTERING ORDER BY (entrydt ASC);
Here the order is defined by insertion time, so whenever history data arrives it will always be appended at the end, and there will be no reshuffling overhead. But in this case I won't be able to make range queries on gpsdt. So what would be the best strategy to handle this scenario? My load from Kafka is more than 2k records/sec.
I have some data in Cassandra. Say
create table MyTable (
  id text PRIMARY KEY,
  data text,
  updated_on timestamp
);
My application, in addition to querying this data by the primary key id, needs to query it by the updated_on timestamp as well. To fulfil the query-by-time use case I have tried the following.
create table MyTable (
  id text PRIMARY KEY,
  data text,
  updated_on timestamp,
  updated_on_minute timestamp
);
A secondary index on the updated_on_minute field. As I understand it, secondary indexes are not recommended for high-cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover, my data gets updated frequently, which means updated_on_minute will keep changing.
A materialized view with updated_on_minute as the partition key and id as the clustering key. I am on Cassandra 3.9 and had just begun using these, but alas I found the release notes for 3.11.x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As has always been the case, create an additional table to query by a different partition key.
In your case the table would be:
create table MyTable_by_timestamp (
  id text,
  data text,
  updated_on timestamp,
  primary key (updated_on, id)
);
Write to both tables, MyTable_by_timestamp and the original MyTable. Read from whichever table matches the partition key you have: updated_on or id.
It's absolutely fine to duplicate data based on the use case (query) it's trying to solve.
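If you want the two writes to succeed or fail together, one option is a logged batch (a sketch; the values are placeholders). Note that logged batches spanning multiple partitions add coordinator overhead, so many applications simply issue the two writes separately:
BEGIN BATCH
INSERT INTO MyTable (id, data, updated_on)
VALUES ('abc123', 'some data', '2023-06-15 10:42:37+0000');
INSERT INTO MyTable_by_timestamp (id, data, updated_on)
VALUES ('abc123', 'some data', '2023-06-15 10:42:37+0000');
APPLY BATCH;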
Edited:
In case there is a fear of huge partitions, you can always bucket into smaller partitions. For example, the table above could be broken down into:
create table MyTable_by_timestamp (
  id text,
  data text,
  updated_on timestamp,
  updated_min timestamp,
  primary key (updated_min, id)
);
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.
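On write, the application would truncate updated_on down to the minute and store that as the bucket, for example (a sketch with placeholder values):
INSERT INTO MyTable_by_timestamp (id, data, updated_on, updated_min)
VALUES ('abc123', 'some data', '2023-06-15 10:42:37+0000', '2023-06-15 10:42:00+0000');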
I have a fairly simple data model. I am tracking events for users based on timestamps. I'm converting a JSON object which has this schema:
userID:{
event: [
{ timestamp: data },
{ timestamp: data }
]
}
I have come up with two Cassandra schemas.
The first:
CREATE TABLE users ( guid uuid, date timestamp, events varchar, PRIMARY KEY(guid, date) );
The second:
CREATE TABLE users ( guid uuid PRIMARY KEY, date timestamp, events map<text, text> );
Either one would work, requiring the data to be a stringified JSON object. My query will return all data for a user in a given time range. Which model makes more sense, or is there a better way to go about this?
The second approach won't allow you to do queries by time range since you don't have date as a clustering column. So you might want to do this:
CREATE TABLE users (
guid uuid,
date timestamp,
events map<text, text>,
PRIMARY KEY(guid, date) );
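With date as a clustering column, a time-range query becomes possible, for example (the guid and dates are hypothetical):
SELECT date, events FROM users
WHERE guid = 5201c7a2-dd57-41d4-9b77-9a1b5bfdf7ab
  AND date >= '2023-06-01' AND date < '2023-07-01';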
How you want to define the events field depends on what's in there and how you need to access it. If you access small parts of it often, you might want to break events in the map out into separate rows by making the event key another clustering column like this:
CREATE TABLE users (
guid uuid,
date timestamp,
event_type text,
event_value text,
PRIMARY KEY(guid, date, event_type) );
It's hard to give more specific advice since you didn't describe your use case in terms of what queries you want to run and the volume of data, number of users, etc.
As Jim was saying, the second schema does not allow querying by timestamp, since date is not part of the primary key.
He suggested a valid solution, but I would also suggest using a timeuuid rather than a separate uuid and timestamp (a timeuuid provides both an id and a timestamp at the same time), if you can. However, if you sometimes need to fetch users by id only, then Jim's solution is probably the best:
PRIMARY KEY(guid, date, event_type)
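For example, a variant of Jim's table using a timeuuid clustering column in place of the plain timestamp might look like this (a sketch of the idea, not a drop-in schema):
CREATE TABLE users (
  guid uuid,
  event_id timeuuid,
  event_type text,
  event_value text,
  PRIMARY KEY (guid, event_id, event_type)
);
The event time can be recovered from event_id with toTimestamp(), and time-range queries can bound event_id with minTimeuuid() and maxTimeuuid().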