Can you "WITH CLUSTERING ORDER BY" with a time uuid - cassandra

CREATE TABLE IF NOT EXISTS views (
uuid timeuuid,
country text,
ip inet,
region text,
city text,
lat text,
long text,
metro text,
zip text,
video_id int,
date_created timestamp,
PRIMARY KEY(video_id, uuid)
) WITH CLUSTERING ORDER BY (uuid DESC);
My question is: can I use a TimeUUID to reliably cluster-order my table, or do I need to use a timestamp?
I originally used the timestamp field to cluster my views. However, I want to avoid storing the extra column and am curious whether I can sort by the timeuuid instead.
My limited tests have confirmed this so far, but I want to make sure it will always work.

Yes, a TimeUUID will reliably order your data, and it is universally unique.
A TimeUUID (also known as a version 1 UUID) is a combination of the machine's MAC address and a time component. The included MAC address ensures that the value will be unique across machines.
If you use a plain timestamp instead, and multiple users view the same video concurrently, the same timestamp can be generated, so the later write overwrites the earlier one and you lose some of the views.
Note: you should generate the timeuuid using a standard library, e.g. UUIDs.timeBased() in the Java driver, or the CQL function now().
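As a minimal sketch (using the views schema above; the values are hypothetical), here is how you would let Cassandra generate the timeuuid server-side and read the views back newest-first:
-- now() generates a timeuuid server-side at write time
INSERT INTO views (video_id, uuid, country, ip, date_created)
VALUES (42, now(), 'US', '203.0.113.10', toTimestamp(now()));
-- CLUSTERING ORDER BY (uuid DESC) returns rows newest-first;
-- dateOf() extracts the timestamp embedded in the timeuuid
SELECT uuid, country, dateOf(uuid) AS viewed_at
FROM views
WHERE video_id = 42;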

Related

Recommended Cassandra Schema to store application logs

I have been tasked to come up with a schema to store our application logs using Cassandra. I am quite new to Cassandra but from what I have read and learned so far, it could be the best approach for our use case.
Our application sends thousands of SMS messages each day (through 3 local service providers), and we would love to keep a log each time an SMS is sent (for reconciliation purposes at each month's end, among other things). We intend to store the information below:
id text, // uuid
phone_number text, // recipient of the SMS
message text, // Message sent
status boolean, // if the SMS was sent or not
response text, // Request response
service_provider text, // e.g. Twilio, Telnyx, Venmo, etc.
date timestamp, // Time SMS is sent
We would like to query the following reports at any one time:
Total number of SMS sent
Total SMS sent for a given period of time (between 2 dates)
Total SMS sent by a specific service provider (also within a given time period)
Total SMS sent to a specific recipient phone number (also within a given time period)
Total failed or successful SMS sent (also within a given period of time)
I have come up with the following tables, but I feel like I am over-engineering or over-thinking it. Perhaps it could be done more simply? I would appreciate any advice on getting this to work efficiently.
create table sms_logs_by_id
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (id, date)
) with clustering order by (date DESC);
create table sms_logs_by_service_provider
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (service_provider, date)
) with clustering order by (date DESC);
create table sms_logs_by_phone_number
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (phone_number, date)
) with clustering order by (date DESC);
create table sms_logs_by_status
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (status, date)
) with clustering order by (date DESC);
Queries run pretty well so far. I am not sure if this is the optimal way of modelling the data. I would appreciate any advice on how I can improve this data model. Thank you!
The only potential issue I see is that for the last three tables (logs by status, phone number, and provider), the partitions will get larger over time. It's important to remember that Cassandra has a hard limit of 2 billion cells per partition (where a "cell" == a column value or key). But you want to model your data so that you don't get anywhere near that limit, because the table will start getting slow long before that.
For these three, I'd recommend a "bucketing" approach:
sms_logs_by_service_provider
...
primary key (service_provider, date)
For this one, my other concern is the fact that you're tracking only 3 service providers. So in addition to the partitions growing with each message, there are only 3 partitions, which means the data isn't distributing very well. With thousands of messages sent per day, I'm thinking that your "bucket" is going to need to be fairly precise...probably by "day." Maybe you could get away with a "week_bucket," but I'll use day for this example:
id text,
provider text,
service_provider text,
day_bucket int,
date timestamp,
PRIMARY KEY ((service_provider, day_bucket), date, id)
This way, you're creating a partition for each combination of service_provider and day. That will give you plenty of data distribution, plus your partitions won't grow beyond the activity which happens in a single day. I've kept date as a DESC clustering key (good idea) but added id as a "tie-breaker," just in case two messages have the exact same timestamp.
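As a sketch of the resulting read pattern (assuming day_bucket is encoded as an int like YYYYMMDD, which is my assumption, and that the remaining columns from the original table are kept):
-- All messages sent via one provider on one day, newest first
SELECT id, date, phone_number, status
FROM sms_logs_by_service_provider
WHERE service_provider = 'Twilio'
AND day_bucket = 20200115;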
create table sms_logs_by_phone_number (
...
primary key (phone_number, date)
So for this one, I'd take a similar approach. But as we're talking about individual users, we can use a much larger bucket. Based on a quick Google search, the average person sends 85 text messages per day, which comes out to 31,025 per year. That's probably ok to store by year.
id text,
phone_number text,
year_bucket int,
date timestamp,
PRIMARY KEY ((phone_number, year_bucket), date, id)
Partitioning by phone_number already gives you some good distribution.
Adding year_bucket in there will ensure that the partition won't have unbound growth. Also, id for a tie-breaker.
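A sketch of reading it back (hypothetical values; again assuming the remaining columns from the original table are kept). With the partition key fully specified, a range on the date clustering column is allowed:
-- Messages to one recipient within a date range, inside one year bucket
SELECT id, date, service_provider, status
FROM sms_logs_by_phone_number
WHERE phone_number = '+254700000001'
AND year_bucket = 2020
AND date >= '2020-01-01' AND date < '2020-02-01';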
create table sms_logs_by_status(
...
primary key (status, date)
Logs by status is going to have a similar problem to the "provider" table, in that you probably only have a few statuses, so the data distribution will be limited. For this one, you're probably also going to want to use a bucket that's small, like by day.
id text,
status text,
day_bucket int,
date timestamp,
PRIMARY KEY ((status, day_bucket), date, id)
The unfortunate part is that these changes complicate your query patterns, but they're necessary to save you from problems down the road.
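For example, a report covering a period now means one query per day bucket, which the driver can issue asynchronously and in parallel (hypothetical values), summing the counts client-side:
SELECT count(*) FROM sms_logs_by_service_provider
WHERE service_provider = 'Twilio' AND day_bucket = 20200115;
SELECT count(*) FROM sms_logs_by_service_provider
WHERE service_provider = 'Twilio' AND day_bucket = 20200116;
-- ...one query per day in the reporting period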

If not MaterializedViews and not secondary indices then what else is the recommended way to query data in cassandra

I have some data in Cassandra. Say
create table MyTable (
id text PRIMARY KEY,
data text,
updated_on timestamp
);
My application, in addition to querying this data by the primary key id, needs to query it by the updated_on timestamp as well. To fulfil the query-by-time use case, I have tried the following.
create table MyTable (
id text PRIMARY KEY,
data text,
updated_on timestamp,
updated_on_minute timestamp
);
Secondary index on the updated_on_minute field: as I understand it, secondary indexes are not recommended for high-cardinality cases (which is my case, because I could have a lot of data at the same minute mark). Moreover, my data gets updated frequently, which means the updated_on_minute value will keep changing.
MaterializedView with updated_on_minute as the partition key and id as the clustering key: I am on Cassandra 3.9 and had just begun using these, but alas I found the release notes for 3.11.x (https://github.com/apache/cassandra/blob/cassandra-3.11/NEWS.txt), which declare them purely experimental and not meant for production clusters.
So then what are my options? Do I just need to maintain my own tables to track data that comes in timewise? Would love some input on this.
Thanks in advance.
As has always been the case, create an additional table to query by a different partition key.
In your case the table would be:
create table MyTable_by_timestamp (
id text,
data text,
updated_on timestamp,
PRIMARY KEY (updated_on, id)
);
Write to both tables, mytable_by_timestamp and mytable_by_id. Use the corresponding table to read from based on the partition key, either updated_on or id.
It's absolutely fine to duplicate data based on the use case (the query) it's trying to solve.
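A minimal sketch of the double write (using MyTable as the id-keyed table from the question; the literal values are hypothetical), wrapped in a logged batch so that both writes are guaranteed to eventually apply together:
BEGIN BATCH
INSERT INTO MyTable (id, data, updated_on)
VALUES ('abc123', 'some data', '2020-06-01 10:30:00+0000');
INSERT INTO MyTable_by_timestamp (id, data, updated_on)
VALUES ('abc123', 'some data', '2020-06-01 10:30:00+0000');
APPLY BATCH;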
Edited:
If you are worried about huge partitions, you can always bucket the data into smaller partitions. For example, the table above could be broken down into:
create table MyTable_by_timestamp (
id text,
data text,
updated_on timestamp,
updated_min timestamp,
PRIMARY KEY (updated_min, id)
);
Here I have chosen every minute as the bucket size. Depending on how many updates you receive, you can change it to seconds (updated_sec) to reduce the partition size further.
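Reading then means specifying the minute bucket, which the application computes by truncating updated_on to the minute (values hypothetical):
-- All rows updated during one particular minute
SELECT id, data, updated_on
FROM MyTable_by_timestamp
WHERE updated_min = '2020-06-01 10:30:00+0000';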

How to model inbox

How would I go about modelling the data if I have a web app for messaging, and I expect the user to either see all their messages ordered by date, or see the messages exchanged with a specific contact, again ordered by date?
Should I have two tables, called "global_inbox" and "contacts_inbox" where I would add each message to both?
For example:
CREATE TABLE global_inbox(user_id int, timestamp timestamp,
message text, PRIMARY KEY(user_id, timestamp));
CREATE TABLE inbox(user_id int, contact_id int,
timestamp timestamp, message text,
PRIMARY KEY(user_id, contact_id, timestamp));
This means that every message should be copied 4 times, 2 for sender and 2 for receiver. Does it sound reasonable?
Yes, it's reasonable, but you need some modifications.
Inbox table: if a user has many contacts and every contact sends messages, then a huge amount of data will be inserted into a single partition (user_id). So add contact_id to the partition key.
Updated schema:
CREATE TABLE inbox (
user_id int,
contact_id int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id, contact_id), timestamp)
);
global_inbox: even though it's a global inbox, a huge amount of data can still be inserted into a single partition (user_id). So add more columns to the partition key for better distribution.
Updated schema:
CREATE TABLE global_inbox (
user_id int,
year int,
month int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id,year,month), timestamp)
);
Here you can also add week to the partition key if you get a huge amount of data in a single partition within a week, or remove month from the partition key if you think not much data will be inserted in a year.
In terms of query performance, yes, it sounds good to me. Apache Cassandra is really built for this kind of data modeling: we build tables to satisfy queries, a process called 'denormalization' in the Cassandra paradigm. You end up with duplicated data, but the main goal is fast queries.
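A sketch of the read side against the updated schemas (the ids and dates are hypothetical):
-- Conversation with one contact, newest first
SELECT timestamp, message FROM inbox
WHERE user_id = 1 AND contact_id = 2
ORDER BY timestamp DESC;
-- Global inbox for one month
SELECT timestamp, message FROM global_inbox
WHERE user_id = 1 AND year = 2018 AND month = 5
ORDER BY timestamp DESC;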

table definition statement for cassandra for range queries?

Here is the table data
video_id uuid
user_id timeuuid
added_year int
added_date timestamp
title text
description text
I want to construct a table based on the following query:
select * from video_by_year where added_year<2013;
create table videos_by_year (
video_id uuid,
user_id timeuuid,
added_year int,
added_date timestamp,
title text,
description text,
PRIMARY KEY ((added_year), added_year)
);
NOTE: I have used added_year as both the partition key and the clustering key, which is not correct I suppose.
So one of the issues with data modeling in Cassandra is that the first component, the partition key, must be queried with "=". The reason for this is pretty clear if you realize what Cassandra is doing: it takes that value, hashes it (MD5 or Murmur3), and uses the hash to determine which servers in the cluster own that partition.
For that reason, you can't use an inequality - it would require scanning every row in the cluster.
If you need to get videos added before 2013, consider a system where you use some portion of the date as partition key, and then SELECT from each of those date 'buckets', which you can do asynchronously and in parallel. For example:
create table videos_by_year (
video_id uuid,
user_id timeuuid,
added_date_bucket text,
added_date timestamp,
title text,
description text,
PRIMARY KEY ((added_date_bucket), added_date, video_id)
);
I used text for added_date_bucket so you could use 'YYYY', or 'YYYY-MM' or similar. Note that depending on how quickly you add videos to the system, you may even want 'YYYY-MM-DD' or 'YYYY-MM-DD-HH:ii:ss', because you'll hit a practical limit of a few million videos per bucket.
You could get clever and have the video_id be a timeuuid, then you get added_date and video_id in a single column.
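For example, with 'YYYY' buckets, "videos added before 2013" becomes one query per year bucket, issued asynchronously and merged client-side (assuming, hypothetically, that the system started in 2010):
SELECT * FROM videos_by_year WHERE added_date_bucket = '2010';
SELECT * FROM videos_by_year WHERE added_date_bucket = '2011';
SELECT * FROM videos_by_year WHERE added_date_bucket = '2012';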

Cassandra data modeling timestamps

I have a fairly simple data model. I am tracking events for users based on timestamps. I'm converting a JSON object which has this schema:
userID:{
event: [
{ timestamp: data },
{ timestamp: data }
]
}
I have come up with two Cassandra schemas.
The first:
CREATE TABLE users ( guid uuid, date timestamp, events varchar, PRIMARY KEY(guid, date) );
The second:
CREATE TABLE users ( guid uuid PRIMARY KEY, date timestamp, events map<text, text> );
Either one would work, requiring the data to be a stringified JSON object. My query will be returning all data from a user in a given time range. Which model makes more sense, or is there a better way to go about this?
The second approach won't allow you to do queries by time range since you don't have date as a clustering column. So you might want to do this:
CREATE TABLE users (
guid uuid,
date timestamp,
events map<text, text>,
PRIMARY KEY(guid, date) );
How you want to define the events field depends on what's in there and how you need to access it. If you access small parts of it often, you might want to break events in the map out into separate rows by making the event key another clustering column like this:
CREATE TABLE users (
guid uuid,
date timestamp,
event_type text,
event_value text,
PRIMARY KEY(guid, date, event_type) );
It's hard to give more specific advice since you didn't describe your use case in terms of what queries you want to run and the volume of data, number of users, etc.
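For reference, a sketch of the time-range query the last schema enables (the guid and dates are hypothetical):
SELECT date, event_type, event_value
FROM users
WHERE guid = 550e8400-e29b-41d4-a716-446655440000
AND date >= '2020-01-01' AND date < '2020-02-01';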
As Jim said, the second schema does not allow querying on the timestamp, since it is not contained in the primary key.
He suggested a valid solution, but I would also suggest that, if you can, you use not a uuid plus a timestamp but a single TimeUUID (which provides both an id and a timestamp at the same time). However, if you sometimes need to get users by id only, then Jim's solution is probably the best:
PRIMARY KEY(guid, date, event_type)
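One way to read that suggestion, as a sketch (the event_id column name is mine, not from the original post): keep guid as the partition key but replace the date column with a timeuuid clustering column, which is unique per event and still sorts by time:
CREATE TABLE users (
guid uuid,
event_id timeuuid,
event_type text,
event_value text,
PRIMARY KEY(guid, event_id, event_type) );
-- dateOf(event_id) recovers the timestamp embedded in the timeuuid
SELECT dateOf(event_id) AS event_time, event_type, event_value
FROM users
WHERE guid = 550e8400-e29b-41d4-a716-446655440000;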
