Recommended Cassandra Schema to store application logs

I have been tasked with coming up with a schema to store our application logs in Cassandra. I am quite new to Cassandra, but from what I have read and learned so far, it could be the best fit for our use case.
Our application sends thousands of SMS messages each day (through 3 local service providers), and we would like to keep a log each time an SMS is sent (for reconciliation purposes at each month's end, among other things). We intend to store the information below:
id text, // uuid
phone_number text, // recipient of the SMS
message text, // Message sent
status boolean, // if the SMS was sent or not
response text, // Request response
service_provider text, // e.g. Twilio, Telnyx, Venmo, etc.
date timestamp, // Time SMS is sent
We would like to query the following reports at any one time:
Total number of SMS sent
Total SMS sent for a given period of time (between 2 dates)
Total SMS sent by a specific service provider (also within a given time period)
Total SMS sent to a specific recipient phone number (also within a given time period)
Total failed or successful SMS sent (also within a given period of time)
I have come up with the following four tables, but I feel like I am overengineering or overthinking it. Perhaps it could be done more simply? I would appreciate any advice on getting this to work efficiently.
create table sms_logs_by_id
(
    id text,
    phone_number text,
    message text,
    status boolean,
    response text,
    service_provider text,
    date timestamp,
    primary key (id, date)
) with clustering order by (date DESC);

create table sms_logs_by_service_provider
(
    id text,
    phone_number text,
    message text,
    status boolean,
    response text,
    service_provider text,
    date timestamp,
    primary key (service_provider, date)
) with clustering order by (date DESC);

create table sms_logs_by_phone_number
(
    id text,
    phone_number text,
    message text,
    status boolean,
    response text,
    service_provider text,
    date timestamp,
    primary key (phone_number, date)
) with clustering order by (date DESC);

create table sms_logs_by_status
(
    id text,
    phone_number text,
    message text,
    status boolean,
    response text,
    service_provider text,
    date timestamp,
    primary key (status, date)
) with clustering order by (date DESC);
Queries run pretty well so far, but I am not sure this is the optimal way of modelling the data. I would appreciate any advice on how I can improve this data model. Thank you!

The only potential issue I see is with the last 3 tables (logs by status, phone number, and provider): the partitions will get larger over time. It's important to remember that Cassandra has a hard limit of 2 billion cells per partition (where a "cell" == a column value or key). But you want to model your data so that you don't get anywhere near that limit, because your table will start getting slow long before that.
For these three, I'd recommend a "bucketing" approach:
sms_logs_by_service_provider
...
primary key (service_provider, date)
For this one, my other concern is that you're tracking only 3 service providers. So in addition to the partitions growing with each message, there are only 3 partitions, which means the data isn't distributing very well. With thousands of messages sent per day, your "bucket" is going to need to be fairly precise...probably by "day." Maybe you could get away with a "week_bucket," but I'll use day for this example:
CREATE TABLE sms_logs_by_service_provider (
    id text,
    service_provider text,
    day_bucket int,
    date timestamp,
    -- plus the other payload columns (phone_number, message, status, response)
    PRIMARY KEY ((service_provider, day_bucket), date, id)
) WITH CLUSTERING ORDER BY (date DESC, id ASC);
This way, you're creating a partition for each combination of service_provider and day. That will give you plenty of data distribution, plus your partitions won't grow beyond the activity which happens in a single day. I've kept date as a DESC clustering key (good idea) but added id as a "tie-breaker," just in case two messages have the exact same timestamp.
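To illustrate the read side, here is a sketch of two queries against this table, assuming a hypothetical YYYYMMDD integer encoding for day_bucket and 'Twilio' as a provider name:
-- Total SMS sent by one provider on one day:
SELECT count(*) FROM sms_logs_by_service_provider
WHERE service_provider = 'Twilio'
  AND day_bucket = 20240115;

-- Narrowing to a time window within that day via the date clustering key:
SELECT * FROM sms_logs_by_service_provider
WHERE service_provider = 'Twilio'
  AND day_bucket = 20240115
  AND date >= '2024-01-15 08:00:00'
  AND date < '2024-01-15 12:00:00';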
create table sms_logs_by_phone_number (
...
primary key (phone_number, date)
So for this one, I'd take a similar approach. But as we're talking about individual users, we can use a much larger bucket. Based on a quick Google search, the average person sends 85 text messages per day, which comes out to 31,025 per year. That's probably ok to store by year.
CREATE TABLE sms_logs_by_phone_number (
    id text,
    phone_number text,
    year_bucket int,
    date timestamp,
    -- plus the other payload columns (message, status, response, service_provider)
    PRIMARY KEY ((phone_number, year_bucket), date, id)
) WITH CLUSTERING ORDER BY (date DESC, id ASC);
Partitioning by phone_number already gives you some good distribution.
Adding year_bucket in there will ensure that the partition won't have unbounded growth. Again, id acts as a tie-breaker.
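As a sketch with a hypothetical phone number, the per-recipient report for a period inside one year bucket would look like:
SELECT count(*) FROM sms_logs_by_phone_number
WHERE phone_number = '+15555550123'
  AND year_bucket = 2024
  AND date >= '2024-03-01'
  AND date < '2024-04-01';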
create table sms_logs_by_status(
...
primary key (status, date)
Logs by status is going to have a similar problem to the "provider" table, in that you probably only have a few statuses, so the data distribution will be limited. For this one, you're probably also going to want to use a bucket that's small, like by day.
CREATE TABLE sms_logs_by_status (
    id text,
    status boolean,
    day_bucket int,
    date timestamp,
    -- plus the other payload columns (phone_number, message, response, service_provider)
    PRIMARY KEY ((status, day_bucket), date, id)
) WITH CLUSTERING ORDER BY (date DESC, id ASC);
The unfortunate part is that these changes complicate your query patterns, but they're necessary to save you from problems down the road.
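For instance (a sketch, assuming the YYYYMMDD bucket encoding above), a report spanning several days now becomes one query per day bucket, with the counts summed in the application:
SELECT count(*) FROM sms_logs_by_status
WHERE status = true AND day_bucket = 20240115;

SELECT count(*) FROM sms_logs_by_status
WHERE status = true AND day_bucket = 20240116;

-- ...and so on for each day in the range; sum the counts client-side.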

Related

What are the rules-of-thumb for choosing the right partition key in Cassandra?

I have just started learning Cassandra and am digging deeper to understand what happens behind the scenes that makes it so fast. I went through the following docs1 & docs2 but was still confused about choosing the right partition key for my table.
I'm designing the model for a test application like Slack, and am creating a message table like:
CREATE TABLE messages (
    id uuid,
    work_space_id text,
    user_id text,
    channel_id text,
    body text,
    edited boolean,
    deleted boolean,
    year text,
    created_at timestamp,
    PRIMARY KEY (..................)
);
My query is to fetch all the messages by channel_id and work_space_id. So the following are the options in my mind for choosing the primary key:
PRIMARY KEY ((work_space_id, year), channel_id, created_at)
PRIMARY KEY ((channel_id, work_space_id), created_at)
If I go with option 1, each workspace has a separate partition per year. This might create a hotspot if one workspace has 100 million messages in a year while another has a few hundred.
If I go with option 2, each workspace channel has a separate partition. But what if there are 1 million workspaces and each has 1K channels? That would create about 1 billion partitions. I know the limit is 2 billion.
So what is the rule of thumb for choosing a partition key that will distribute data evenly and not create hotspots in a data center?
The primary rule of data modeling for Cassandra is that you must design a table for each application query. In your case, the app query needs to retrieve all messages based on the workspace and channel IDs.
The two critical things from your app query which should guide you are:
Retrieve multiple messages.
Filter by workspace + channel IDs.
The filter determines the partition key for the table, which is (workspace_id, channel_id). Since each partition contains rows of messages, we'll use the created_at column as the clustering key, sorted in descending chronological order, so we have:
CREATE TABLE messages_by_workspace_channel_ids (
    workspace_id text,
    channel_id text,
    created_at timestamp,
    user_id text,
    body text,
    edited boolean,
    deleted boolean,
    PRIMARY KEY ((workspace_id, channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
Ordinarily we would stop there but as you pointed out correctly, each channel could potentially have millions of messages which would lead to very large partitions. To avoid that, we need to group the messages into "buckets" to make the partitions smaller.
You attempted to do that by grouping messages by year, but it may not be enough. The general recommendation is to keep partitions under 100 MB for optimum performance -- smaller partitions are faster to read. We can make the partitions smaller by also grouping them by month:
CREATE TABLE messages_by_workspace_channel_ids_yearmonth (
    workspace_id text,
    channel_id text,
    year int,
    month int,
    created_at timestamp,
    ...
    PRIMARY KEY ((workspace_id, channel_id, year, month), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
You could make them even smaller by further grouping them into dates:
CREATE TABLE messages_by_workspace_channel_ids_createdate (
    workspace_id text,
    channel_id text,
    createdate date,
    created_at timestamp,
    ...
    PRIMARY KEY ((workspace_id, channel_id, createdate), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
The more "buckets" you use, the more partitions you will have in the table which is ideal since more partitions means greater distribution of data across the cluster. Cheers!

Can you "WITH CLUSTERING ORDER BY" with a time uuid

CREATE TABLE IF NOT EXISTS .views (
    uuid timeuuid,
    country text,
    ip inet,
    region text,
    city text,
    lat text,
    long text,
    metro text,
    zip text,
    video_id int,
    date_created timestamp,
    PRIMARY KEY (video_id, uuid)
) WITH CLUSTERING ORDER BY (uuid DESC);
My question is: can I use a TimeUUID to reliably define my table's clustering order, or do I need to use a timestamp?
I originally used the timestamp field to cluster my views. However, I want to avoid storing the extra data and am curious whether I can sort by my timeuuid instead.
My limited tests have confirmed this so far, but I want to make sure it will always work.
Yes, a TimeUUID will reliably order your data, and it's universally unique.
A TimeUUID (also known as a v1 UUID) is a combination of the machine's MAC address and a time component. The included MAC address ensures that the value will be unique across machines.
But if you use a timestamp and multiple users concurrently view the same video, the same timestamp can be generated, and you will lose some of the views (writes with an identical primary key overwrite each other).
Note: you should generate the timeuuid using a standard library, e.g. UUIDs.timeBased() in the Java driver, or the CQL function now().
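A minimal sketch, assuming the views table above sits in the current keyspace: the write uses now(), and since the time component can be recovered from the timeuuid with toTimestamp() (Cassandra 2.2+; older versions use dateOf()), the separate date_created column becomes optional:
-- Insert a view with a server-generated timeuuid:
INSERT INTO views (video_id, uuid, ip) VALUES (42, now(), '10.0.0.1');

-- Recover the time component from the timeuuid when reading:
SELECT toTimestamp(uuid) AS viewed_at FROM views WHERE video_id = 42;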

How to model inbox

How would I go about modelling the data if I have a web app for messaging, and I expect the user to either see all their messages ordered by date, or see the messages exchanged with a specific contact, again ordered by date?
Should I have two tables, called "global_inbox" and "contacts_inbox", and add each message to both?
For example:
CREATE TABLE global_inbox(user_id int, timestamp timestamp,
message text, PRIMARY KEY(user_id, timestamp));
CREATE TABLE inbox(user_id int, contact_id int,
timestamp timestamp, message text,
PRIMARY KEY(user_id, contact_id, timestamp));
This means that every message would be copied 4 times: 2 for the sender and 2 for the receiver. Does that sound reasonable?
Yes, it's reasonable, but you need some modifications.
Inbox table: if a user has many contacts and every contact sends messages, a huge amount of data will be inserted into a single partition (user_id). So add contact_id to the partition key.
Updated schema:
CREATE TABLE inbox (
    user_id int,
    contact_id int,
    timestamp timestamp,
    message text,
    PRIMARY KEY ((user_id, contact_id), timestamp)
);
global_inbox: though it's a global inbox, a huge amount of data can still be inserted into a single partition (user_id). So add more columns to the partition key for better distribution.
Updated schema:
CREATE TABLE global_inbox (
    user_id int,
    year int,
    month int,
    timestamp timestamp,
    message text,
    PRIMARY KEY ((user_id, year, month), timestamp)
);
You could also add week to the partition key if you get a huge amount of data in a single partition within a week, or remove month from the partition key if you don't expect much data in a year.
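As a sketch with hypothetical values, reading one month of the global inbox then becomes:
SELECT timestamp, message
FROM global_inbox
WHERE user_id = 123
  AND year = 2024
  AND month = 5;

-- A multi-month view requires one such query per (year, month) bucket,
-- merged in the application.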
In terms of query performance, yes, it sounds good to me. Apache Cassandra is really built for this kind of data modeling: we build a table to satisfy each query. This is the process called "denormalization" in the Cassandra paradigm, and it increases query performance. You have duplicated data, but the main goal is fast queries.

Cassandra data modeling timestamps

I have a fairly simple data model. I am tracking events for users based on timestamps. I'm converting a JSON object which has this schema:
userID: {
    event: [
        { timestamp: data },
        { timestamp: data }
    ]
}
I have come up with two Cassandra schemas.
The first:
CREATE TABLE users ( guid uuid, date timestamp, events varchar, PRIMARY KEY(guid, date) );
The second:
CREATE TABLE users ( guid uuid PRIMARY KEY, date timestamp, events map<text, text> );
Either one would work, requiring the data to be a stringified JSON object. My query will be returning all data from a user in a given time range. Which model makes more sense, or is there a better way to go about this?
The second approach won't allow you to do queries by time range since you don't have date as a clustering column. So you might want to do this:
CREATE TABLE users (
    guid uuid,
    date timestamp,
    events map<text, text>,
    PRIMARY KEY (guid, date)
);
How you want to define the events field depends on what's in there and how you need to access it. If you access small parts of it often, you might want to break events in the map out into separate rows by making the event key another clustering column like this:
CREATE TABLE users (
    guid uuid,
    date timestamp,
    event_type text,
    event_value text,
    PRIMARY KEY (guid, date, event_type)
);
It's hard to give more specific advice since you didn't describe your use case in terms of what queries you want to run and the volume of data, number of users, etc.
As Jim said, the second schema does not allow queries by time range, since the timestamp is not a clustering column.
He suggested a valid solution, but I would also suggest using not a uuid plus a timestamp but a TimeUUID (which provides both an id and a timestamp at the same time) if you can. However, if you sometimes need to get the users by id only, then Jim's solution is probably best:
PRIMARY KEY(guid, date, event_type)
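A minimal sketch of the timeuuid variant (the table name and the literals are hypothetical):
CREATE TABLE users_by_event_time (
    guid uuid,
    time timeuuid,
    event_type text,
    event_value text,
    PRIMARY KEY (guid, time, event_type)
);

-- Time-range queries use timeuuid bounds:
SELECT * FROM users_by_event_time
WHERE guid = 5132b130-ae79-11e4-ab27-0800200c9a66
  AND time >= minTimeuuid('2015-05-01')
  AND time < minTimeuuid('2015-06-01');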

Using Cassandra for time series data

I'm researching how to store logs in Cassandra.
The schema for the logs would be something like this.
EDIT: I've changed the schema in order to make some clarifications.
CREATE TABLE log_date (
    userid bigint,
    time timeuuid,
    reason text,
    item text,
    price int,
    count int,
    PRIMARY KEY ((userid), time)                              -- option #1
    PRIMARY KEY ((userid), time, reason, item, price, count)  -- option #2
);
A new table will be created every day, so a table contains logs for only one day.
My querying condition is as follows:
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price, and count will not be used as hints or conditions for queries at all.
My Question is which PRIMARY KEY design suits better.
EDIT: And the key here is that I want to store the logs in a schematic way.
If I choose #1, many columns would be created per log, and the possibility of having more values per log is very high. The schema above is just an example; a log can contain values like subreason, friendid, and so on.
If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any valuable information about the overhead of composite columns.
Which one should I choose? Please help.
My advice is that neither of your two options seems ideal for your time series, and creating a table per day doesn't seem optimal either.
Instead, I'd recommend creating a single table partitioned by userid and day, with a timeuuid as the clustering column for the event. An example would look like:
CREATE TABLE log_per_day (
    userid bigint,
    date text,
    time timeuuid,
    value text,
    PRIMARY KEY ((userid, date), time)
);
This will allow you to have all of a day's events in a single partition, so you can run your query per day per user.
Declaring time as the clustering column gives you a wide partition where you can insert as many events as you need in a day.
So the partition key is a composite of the userid plus the date as text, e.g.
insert into log_per_day (userid, date, time, value) values (1000, '2015-05-06', aTimeUUID1, 'my value');
insert into log_per_day (userid, date, time, value) values (1000, '2015-05-06', aTimeUUID2, 'my value2');
The two inserts above will land in the same partition, and therefore you will be able to read them with a single query.
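For example, the single-partition read (using the placeholder values above):
SELECT time, value
FROM log_per_day
WHERE userid = 1000
  AND date = '2015-05-06';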
Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.
Hope it helps,
José Luis
