Sliding Window TTL in Cassandra

I'm looking into Cassandra for a potential upcoming project that I think it could be a good fit for. The one place where it is stumping me is a requirement around data retention. Basically we have a schema like this:
CREATE TABLE Things (
user_id int,
thing_id int,
a text static,
b text static,
-- .... more static fields
updated_at timestamp static,
type text,
subthing_id int,
PRIMARY KEY (user_id, thing_id, subthing_id)
);
In relational database terms I would say that a Thing belongs to a User and a Thing has many Subthings.
A Thing has various subthings associated with it that arrive at later times. Each one is a new insert that in turn updates the appropriate static fields. We need to store each Thing for 30 days after the last time a subthing was inserted for that Thing. For example, Thing A and Thing B get inserted. A subthing for Thing B is inserted a week later. Thing A is deleted 30 days after initial insertion. Thing B (and all associated subthings) is deleted 7 days later.
As far as I can tell, I can't just insert with a TTL since I need to update the TTL of the other Thing rows sharing the same user_id and thing_id. I'm also not entirely sure how I would just run a DELETE command here since I'm not deleting by any of the keys. I believe the primary key is correct here since ALL queries will be based on the user_id (except the deletion which is determined by the updated_at).
My other concern is the idea of the tombstones. I have only read about them but the concern here is that I would be deleting potentially millions of these Things each day. Is that going to require daily compaction after the daily deletes are performed?
Update:
An alternative I have thought of since the original posting was having a second table that gets inserted to each time a subthing is added. It would look like:
CREATE TABLE Expirations (
expiry date,
user_id int,
thing_id int,
PRIMARY KEY (expiry, user_id, thing_id)
);
Where expiry is the date of the given user_id and thing_id to be deleted. This table would have to be updated as necessary as things are inserted into the Things table and then I would have to run something each day to query for values where expiry is today and iterate over them to delete things from the Things table. I am not sure if this is considered the "Cassandra way" but it seems like it could work.
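The bookkeeping for this second table can be sketched in Python. This is only an illustration under stated assumptions: the two dictionaries are hypothetical in-memory stand-ins for the Things and Expirations tables, and the function names are made up. The point it shows is that older Expirations rows become stale once a later subthing arrives, so the daily job has to check that a row still reflects the latest expiry before deleting.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=30)

# Hypothetical in-memory stand-ins for the two tables.
last_expiry = {}   # (user_id, thing_id) -> current expiry date
expirations = {}   # expiry date -> set of (user_id, thing_id)


def record_subthing(user_id, thing_id, inserted_on):
    """Slide the expiry window forward whenever a subthing arrives."""
    key = (user_id, thing_id)
    expiry = inserted_on + RETENTION
    last_expiry[key] = expiry
    expirations.setdefault(expiry, set()).add(key)
    return expiry


def due_for_deletion(today):
    """Return the things expiring today, skipping stale expiration rows."""
    candidates = expirations.get(today, ())
    return sorted(k for k in candidates if last_expiry.get(k) == today)


# Thing 1 and Thing 2 are inserted together; Thing 2 gets a subthing a week later.
record_subthing(1, 1, date(2024, 1, 1))
record_subthing(1, 2, date(2024, 1, 1))
record_subthing(1, 2, date(2024, 1, 8))
```

With this data, only Thing 1 is due on 2024-01-31, and Thing 2 becomes due seven days later, which matches the example in the question.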

This is an interesting challenge. I would use a map data type to map each thing_id to all its subthing_ids. I'd go for something like:
CREATE TABLE Things (
partition_date timestamp,
insertion_date timestamp,
user_id int,
thing_map map<int,int>,
a text static,
b text static,
-- .... more static fields
updated_at timestamp static,
type text,
PRIMARY KEY (partition_date, insertion_date, user_id)
) WITH CLUSTERING ORDER BY (insertion_date DESC, user_id ASC);
Here I added a new field, insertion_date, that should hold the exact insertion date, and a new field, partition_date, that becomes the new and only PARTITION KEY. It should store a truncation of the insertion_date field (to the day, for example), just to avoid some hotspots. I'm assuming that you can simply query based on a day field due to your TTL requirements; if you need to query on the user_id field, things are a bit different. I recently answered similar questions about this modeling problem here and here, so have a look at those for more information about the technique used (it's called bucketing).
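As a rough illustration of the bucketing step, the partition_date value can be derived from insertion_date on the application side. The sketch below assumes day-sized buckets in UTC and timezone-aware timestamps; the function name day_bucket is hypothetical.

```python
from datetime import datetime, timezone


def day_bucket(insertion_date: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to midnight UTC (the bucket key)."""
    utc = insertion_date.astimezone(timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)


# Two inserts on the same UTC day land in the same partition.
morning = datetime(2024, 3, 5, 8, 15, tzinfo=timezone.utc)
evening = datetime(2024, 3, 5, 17, 42, 10, tzinfo=timezone.utc)
same_bucket = day_bucket(morning) == day_bucket(evening)
```

Smaller buckets (for example hourly) would spread the write load further, at the cost of more partitions to read per day.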
Then there's the thing_map, which is the core of your problem. Pushing a new object into the map should reset the TTL for that map entirely, which could give you exactly the desired behavior. This is worth verifying on your version, though, because Cassandra tracks TTLs per collection element rather than per collection. Note that the TTL will remove the field only, not the entire row, so you'll simply need to test whether it's null or not.
Finally, the tombstone behavior is a problem you're going to have to face. If you can afford a complete row rewrite, that is, upserting the whole row at once instead of updating only the map field, you'd get a delete at partition level, and the "reverse time-series" I've modeled with the clustering key should take care of that without too many problems.

Related

Data modelling to facilitate pruning/bulk update/delete in scylladb/cassandra

Let's say I have a table like the one below with a composite partition key.
CREATE TABLE heartrate (
pet_chip_id uuid,
date text,
time timestamp,
heart_rate int,
PRIMARY KEY ((pet_chip_id, date), time)
);
Let's say there is a batch job to prune all the data older than X. I can't run the query below since it's missing the other partition key column.
DELETE FROM heartrate WHERE date < '2020-01-01';
How do you model your data in such a way that this can be achieved in Scylla? I understand that internally Scylla creates partitions based on the partition key, but in this case it's impossible to query the whole list of pet_chip_id values and issue N queries to delete.
Just wanted to know how people do this outside RDBMS world.
The recommended way to delete old data automatically in Scylla is using the Time-to-live (TTL) feature:
When you write a row, you add "USING TTL 864000" if you want that data to be deleted automatically in 10 days. You can also specify a default TTL for a given table, so that every piece of data written to the table will expire after (say) 10 days.
Scylla's TTL feature is separate from the data itself, so it doesn't matter which columns you used as partition keys or clustering keys. In particular, the "date" column no longer needs to be part of the primary key (or exist at all, for that matter), unless you also need it for something else.
As @nadav-harel said in his answer, if you can define a TTL, that's always the best solution. If you can't, a possible solution is to create a materialized view so that you can list the primary keys of the main table based on the field that you need to use in the delete query. In the prune job you can first select from the MV and then delete from the main table using the values that you got from the MV.
Example:
CREATE TABLE my_table (
a uuid,
b text,
c text,
d int,
e timestamp,
PRIMARY KEY ((a, b), c)
);
CREATE MATERIALIZED VIEW my_mv AS
SELECT a,
b,
c
FROM my_table
WHERE a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL
PRIMARY KEY (b, a, c);
Then in your prune job you could select from my_mv based on b and then delete from my_table based on the values returned from the select query.
Note that this solution might not be effective depending on your model and the amount of data you have, but keep in mind that deleting data is also a way of querying your data and your model should be defined based on your queries needs, i.e. before defining your model, you need to think about every way you will query it (including how you will prune your data).
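The prune job described above could look roughly like the following Python sketch. It assumes a session object that behaves like the DataStax Python driver's Session, meaning execute(query, parameters) with %s placeholders and rows that expose columns as attributes. The function name prune is hypothetical, and a real job would probably want paging and throttling as well.

```python
def prune(session, b_value):
    """Delete every my_table row whose b column equals b_value.

    `session` is assumed to behave like a DataStax Python driver Session:
    execute(query, params) returns an iterable of rows with attribute access.
    """
    # Step 1: list the full primary keys through the materialized view.
    rows = session.execute(
        "SELECT a, b, c FROM my_mv WHERE b = %s", (b_value,)
    )
    deleted = 0
    # Step 2: delete from the base table using each complete primary key.
    for row in rows:
        session.execute(
            "DELETE FROM my_table WHERE a = %s AND b = %s AND c = %s",
            (row.a, row.b, row.c),
        )
        deleted += 1
    return deleted
```

Each delete still writes a tombstone, so this only moves the problem of finding the keys; it does not make the deletes themselves cheaper.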

Cassandra sort not by primary key

I'm trying to model a table in Cassandra, I'm quite new and stumbled upon one problem. I've got the following:
CREATE TABLE content_registry (
service text,
file text,
type_id tinyint,
container text,
status_id tinyint,
source_location text,
expiry_date timestamp,
modify_date timestamp,
create_date timestamp,
to_overwrite boolean,
PRIMARY KEY ((service), file, type_id)
);
So as I understand:
service is my partition key; its value is hashed to decide how rows are distributed across the cluster
file is a clustering key
type_id is a clustering key
Together these three columns form a composite (compound) primary key
What I've figured out is that whenever I insert new data, Cassandra will upsert (either insert or update if the value with that compound primary key exists)
What I'm struggling with is that I want my data to come back sorted by create_date in descending order, however create_date is not part of the primary key.
If I add create_date to my primary key, I won't be able to upsert data, because create_date means timestamp when record was inserted, so if I add it to primary key every time there's an insert, I'll end up with multiple records.
What are the other options? Order in application? That doesn't seem very efficient.
> What I've figured out is that whenever I insert new data, Cassandra will upsert (either insert or update if the value with that compound primary key exists)
Totally right.
> What I'm struggling with is that I want my data to come back sorted by create_date in descending order, however create_date is not part of the primary key.
> If I add create_date to my primary key, I won't be able to upsert data, because create_date means timestamp when record was inserted, so if I add it to primary key every time there's an insert, I'll end up with multiple records.
These two statements actually contradict each other.
If create_date isn't part of your key but a regular property and the data is upserted, the record is always the same one. Therefore, when querying by the key and fetching create_date, you always get the latest value. If you actually want the date when the record was first created, you should simply not overwrite that value after the first time you insert the record.
If you want to represent a series of data, you do need to avoid upserting. This could be done by using create_date as an additional key column. I'd rather prefer using timeuuid, which comes with quite handy functions.
Last but not least, the most interesting question is what use case you actually want to reflect. When modelling data in Cassandra, you should always know in advance the queries you need to run.
The key concept in Cassandra is that you have to decide what your PRIMARY KEY is, that is, what in your rows is unique and known at query time. This is a very basic requirement, since failing to recognize this will lead to a bad model.
From what I can see, you identified service as your PARTITION KEY, so I'm thinking that this field is what "rules" your data. This is something you must really know to perform even a single query (ignoring the inefficient table scan SELECT * FROM content_registry;). Within each service, you currently have your rows ordered by file and then by type_id. I don't know the exact meaning of the latter field, but you can currently have two rows identified by ('service1', 'a.jpg', 1) and ('service1', 'a.jpg', 2). So if type_id is somehow related to the file, the model is a bit incorrect.
Now, assuming you want to fetch the same records for each service in another order, what you really need to do is create another table that includes create_date as the first clustering column, e.g. (service, create_date, file, type_id). This will allow you to fetch records ordered by creation date, and when two records are created on the same date, they will be further ordered by file, and then by type_id.
A second approach is to attach a secondary index to the create_date field of your original table. This will allow you to query by creation date.
A third approach, probably better than the second, is the use of a Materialized View. It will hide a lot of burdens for you and will probably scale better than secondary indexes.
Please note that secondary indexes and materialized views usually don't scale well. Check whether these approaches are enough for your use case.
> If I add create_date to my primary key, I won't be able to upsert data.
Why not? Suppose your key was PRIMARY KEY (service, create_date, file, type_id)? That will let you sort by create_date for each service but not globally.
If you want to do it globally (that is, you want all services and all files sorted by create date) then things are probably more complex if you still want to be able to shard your data. One option would be to make the primary key PRIMARY KEY (create_date, service, file, type_id) and use one of the order preserving partitioners.
Also, a bit more information here: http://www.datastax.com/dev/blog/we-shall-have-order

Cassandra data modeling

So I'm designing this data model for product price tracking.
A product can be followed by many users and a user can follow many products, so it's a many-to-many relation.
The products are under constant tracking, but a new price is inserted only if it has varied from the previous one.
The users have set an upper price limit for their followed products, so every time a price varies, the preferences are checked and the users will be notified if the price has dropped below their threshold.
So initially I thought of the following product model:
However "subscriberEmails" is a list collection that will handle up to 65536 elements. But being a big data solution, it's a boundary that we don't want to have. So we end up writing a separate table for that:
So now "usersByProduct" can have up to 2 billion columns, fair enough. And the user preferences are stored in a "Map" which is again limited but we think it's a good maximum number of products to follow by user.
Now the problem we're facing is the following:
Every time we want to update a product's price we would have to make a query like this:
INSERT INTO products("Id", date, price) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, dateof(now()), 24.87); // Example only
But INSERT operations don't admit other conditional clauses than (IF NOT EXISTS) and that isn't what we want. We need to update the price only if it's different from the previous one, so this forces us to make two queries (one for reading current value and another to update it if necessary).
P.S. UPDATE operations do have IF conditions, but that doesn't fit our case because we need an INSERT.
UPDATE products SET date = dateof(now()) WHERE "Id" = 7dacedd2-c09b-46c5-8686-00c2a03c71dd IF price != 20.3; // example only
Don't try to apply a normal model on a cassandra database. It may work but you'll end up with terrible performance and scalability.
The recommended approach to Cassandra data modeling is to first figure out your read queries against the database and structure your data so that these reads are cheap. You'll probably need to duplicate writes somewhat but it's OK because writes are pretty cheap in Cassandra.
For your specific use case, the key query seems to be able to get all users interested in a price change in a product, so you create a table for this, for example:
create table productSubscriptions (
productId uuid,
priceLimit float,
createdAt timestamp,
email text,
primary key (productId,priceLimit,createdAt)
);
but since you also need to know all product subscriptions for a user, you also need a user-keyed table of the same data:
create table userProductSubscriptions (
email text,
productId uuid,
priceLimit float,
primary key (email, productId)
);
With these 2 tables, I guess you can see that all your main queries can be done with a single-row select and your insert/delete are straightforward but will require you to modify both tables in sync.
Obviously, you'll need to flesh out a bit more the schema for your complete need but this should give you an example on how to think about your cassandra schema.
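To illustrate why productSubscriptions clusters on priceLimit, here is a small Python sketch of the lookup that layout supports: finding the subscribers whose limit is above a new price. The in-memory list of (priceLimit, email) pairs and the function name are hypothetical; against Cassandra this would be a range restriction on the priceLimit clustering column within one productId partition.

```python
def subscribers_to_notify(subscriptions, new_price):
    """Return emails whose price limit is above new_price.

    `subscriptions` is a list of (price_limit, email) pairs for one product,
    standing in for the rows of a single productSubscriptions partition.
    """
    # Sorting mirrors the on-disk clustering order by priceLimit.
    ordered = sorted(subscriptions)
    return [email for limit, email in ordered if new_price < limit]


subs = [(20.0, "a@example.com"), (30.0, "b@example.com"), (25.0, "c@example.com")]
to_notify = subscribers_to_notify(subs, 24.87)
```

Only the subscribers with limits of 25.0 and 30.0 are returned here, since 24.87 has not dropped below the 20.0 limit.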
Conditional update issue
For your conditional insert issue, the easiest answer is: do it with an UPDATE if you really need it (update and insert are nearly identical in CQL) but it's a very expensive operation so avoid it if you can.
For your use case, I would split your product table into three:
create table products (
category uuid,
productId uuid,
url text,
price float,
primary key (category, productId)
);
create table productPricingAudit (
productId uuid,
date timestamp,
price float,
primary key (productId, date)
);
create table priceScheduler (
day text,
checktime timestamp,
productId uuid,
url text,
primary key (day, checktime)
);
The products table can hold the full catalog, optionally split into categories (so that listing all products in a single category is a single-row select).
productPricingAudit would get an insert with the latest price retrieved, whatever it is, since this will let you debug any pricing issue you may have.
priceScheduler holds all the checks to be made for a given day, ordered by check time. Your scheduler simply has to make a column range query on a single row whenever it runs.
With such a schema, you don't care about conditional updates; you simply issue 3 inserts whenever you update a product price, even if it doesn't change.
Okay, I will try to answer my own question: conditional inserts other than "IF NOT EXISTS" are not supported in Cassandra to date, period.
The closest thing is a conditional update, but that doesn't work in our scenario. So there's one simple option left: application side logic. This means that you have to read the previous entry and perform the decision on your application. The obvious downside is that 2 queries are performed (one SELECT and one INSERT) which obviously adds latency.
However, this suits our application because every time we perform a query to enqueue all items that should be checked, we can select the items' URLs and their current prices too. So the workers that check the latest price can then decide whether to insert or not, because they have the current price to compare with.
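The application-side decision described above can be sketched in Python as follows. The dictionary is a hypothetical in-memory stand-in for the stored price history, and record_price is a made-up name; in a real worker the "previous price" would come from the SELECT and the append would be the INSERT.

```python
def record_price(history, product_id, new_price):
    """Store new_price only if it differs from the last stored price.

    Returns True when a new entry was written, False when it was skipped.
    """
    prices = history.setdefault(product_id, [])
    if prices and prices[-1] == new_price:
        return False  # unchanged: skip the INSERT
    prices.append(new_price)
    return True


history = {}
results = [
    record_price(history, "p1", 24.87),  # first price: written
    record_price(history, "p1", 24.87),  # same price: skipped
    record_price(history, "p1", 20.3),   # changed price: written
]
```

Note that this read-then-write sequence is not atomic, so two workers checking the same product at the same time could both write.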
So... A query similar to this would be performed every X minutes:
SELECT id, url, price FROM products WHERE "nextCheckTime" < now();
// example only, wouldn't even work if nextCheckTime is not part of the PK or index
This is a very costly operation to perform on a Cassandra cluster because it has to go through all rows that are stored randomly in different nodes by default. Another downside is that we need some advanced and specific statistics regarding products and users.
So we've decided that a relational database will serve us better than Cassandra in this particular case.
We sadly leave all of Cassandra's advantages (fast inserts, easy scaling, built in sharding...) and look towards a MySQL Cluster or master-slave implementation.

Cassandra CQL SELECT/DELETE issue due to primary key constraints

I need to store the latest updates that need to be pushed to users' newsfeed pages in a Cassandra table for later retrieval, and my table's schema is as follows:
CREATE TABLE newsfeed (
user_name text,
post_id bigint,
post_type text,
favorited boolean,
shared boolean,
own boolean,
date timestamp,
PRIMARY KEY (user_name, date, post_id, post_type)
);
The first three columns (user_name, post_id, and post_type) in combination form the actual primary key of the table. However, since I wanted to ORDER the SELECT queries on this table by the "date" of each row, I placed the date column into the primary key as the second entry (did I have to do this?).
When I want to delete a row by giving only user_name, post_id, and post_type, as follows:
DELETE FROM newsfeed WHERE user_name='pooria' and post_id=36 and post_type='p';
I will get the following error:
Bad Request: Missing PRIMARY KEY part date since post_id is set
I need the date-column to be part of the primary key since I want to use it in my ORDER BY clauses and on the other hand I have to delete some rows without knowing their "date" values!
So how are such problems tackled in Cassandra? Should I fix my data model and use a different schema for the job?
DataStax's Chief Evangelist Patrick McFadin posted an article demonstrating a few time series modeling patterns. Definitely makes for a good read, and should be of some help to you: Getting Started with Time Series Data Modeling.
I think your table is just fine. However, because of the way composite primary keys work in Cassandra, you cannot skip primary key components in a query. So if you do end up needing to query data by user_name, post_id, and/or post_type (without date), you should create a table specifically for that query, one that does not include date in the primary key.
I will however say that in general, creating a table which will process regular delete operations is not a good idea. In fact, I'm pretty sure that has been classified as a Cassandra "anti-pattern." Data really isn't deleted from Cassandra; it is tombstoned. Tombstones are reconciled at compaction time (assuming that the tombstone threshold time has been met), and having too many of them has been known to cause performance issues.
If you read the article I linked above, go down to the section named "Time Series Pattern 3." You will notice that the INSERT statements are run with the USING TTL clause. This gives the data a time-to-live in seconds, after which it will "quietly disappear." For instance, if you wanted to keep your data around for 24 hours (86400 seconds) you could do something like this:
INSERT INTO newsfeed (...) VALUES (...) USING TTL 86400
Using the TTL feature is a preferable alternative to regular cleansing by DELETE.

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying row keys?
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files and they define keyspaces beforehand, and it seems pretty inflexible as to how to make it dynamic, i.e. allow it to have additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name beforehand anyway? Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid is the name of a variable. It acts as a placeholder for the variable column key. Note that in the INSERT statements below, only the values change: each new tid value adds another column key under the same row key.
CREATE TABLE IF NOT EXISTS results (
test text,
tid timeuuid,
data blob,
result text,
PRIMARY KEY (test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591711, textAsBlob('blob_1'));
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591712, textAsBlob('blob_2'));
-- notice that the column_key changed but the row_key remained the same
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_2', a4a70900-24e1-11df-8924-001ff3591711, textAsBlob('blob_3'));
See here
Have you thought of exploring the collection support in Cassandra for handling such relations in a colocated way (e.g. on the same data node)?
Not sure if it helps, but what about keeping user id as row key and a map containing item id as key and some value?
-Vivel

Resources