Purge old data strategy for Cassandra DB - cassandra

We store events in multiple tables depending on category.
Each event have an id but contains multiple subelements.
We have a lookup table to find events using the subelement_id.
Each subelement can participate at max in 7 events.
Hence the partition will hold max 7 rows.
We will have 30-50 BILLIONS of rows in eventlookup over a period of 5 years.
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
Problem: How do we delete old data once we reach the 5 (or some other number) year mark.
We want to purge the "tail" at some specific intervals, say every week or month.
Approaches investigated so far:
TTL of X years (performs well, but TTL needs to be known before hand, 8 extra bytes for each column)
NO delete - simply ignore the problem (somebody else's problem :0)
Rate limited single row delete (do complete table scan and potentially billions of delete statements)
Split the table to multiple tables -> "CREATE TABLE eventlookupYYYY". Once a year is not needed, simply drop it. (Problem is every read should potentially query all tables)
Is there any other approaches we can consider ?
Is there a design decision we can make now ( we are not in production yet) that will mitigate the future problem?

If it's worth the extra space, track for ranges of recordtimes your subelement_id in a seperate table / columnfamiliy.
Then you can easily get the ids to delete for records having a specific age if you do not want to set a ttl a priori.
But keep in mind to make this tracking distribute well, just a single date will generate hotspots in your cluster and very wide rows, so think about some partition key like (date,chunk) where I uses a random number from 0-10 in the past for chunk. Also you might look at TimeWindowCompactionStrategy - here is a blog post about it: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Your partition key is only set to subelement_id, so all tuples of 7 events for all recordtimes will be in one partition.

Given your table structure, you need to know all the subelement_id of all your data just to fetch a single row. So, with this assumption, your table structure can be improved a bit by sorting your data by recordtime DESC:
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
eventtype int,
parentid text,
partition bigint,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
WITH CLUSTERING ORDER BY (recordtime DESC);
Now all of your data is in descending order and this will give you a big advantage.
Suppose that you have multiple years of data (eg from 2000 to 2018). Assuming you need to keep only the last 5 years, you'd need to fetch data by something like:
SELECT * FROM eventlookup WHERE subelement_id = 'mysub_id' AND recordtime >= '2013-01-01';
This query is efficient because C* will retrieve your data and will stop scanning the partition exactly where you wanted to: 5 years ago. The big plus is that if you have tombstones after that point, well, they won't impact your reads at all. That means you can "safely" trim after that point safely by issuing a delete with
WHERE subelement_id = 'mysub_id' AND recordtime < '2013-01-01';
Beware that this delete will create tombstones that will be skipped by your reads, BUT they will be read during compactions, so keep it in mind.
Alternatively, you can simply skip the delete part if you don't need to reclaim your storage space, your system will always run smooth because you will always retrieve your data efficiently.

Related

Best Cassandra data model for maintaining bounded lists per user

I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website is constantly experiencing bot / heavy users that is why we want to cap events, just to consider "normal" users.
I currently have the current data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time and until we don't clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do, is to use a TTL on that table set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descinding) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
Problem with your existing implementation is that deletes create tombstones which eventually cause latencies in the read. Creating too many tombstones is not recommended.
FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table
2) Using Spark , read from the orignal table , migrate all required records (filter extra records) and write to new temp table.
3) Truncate the orignal table. Note that truncate operation do not create Tombstones.
4) Migrate everything from the temp table back to orignal table using Spark.
5) Truncate the temp table.
You can do this in maintenance window of your application ( something like once in a month) until then you can restrict reads with Limit 100 per partition.

Cassandra simple primary key queries

We would like to create a Cassandra table with Simple Primary Key that is consisted of UUID column.
The table will look like:
CREATE TABLE simple_table(
id UUID PRIMARY KEY,
col1 text,
col2 text,
col3 UUID
);
This table will potentially store few billions of rows, and the rows should expire after some time (few months) using the TTL feature.
I have few questions regarding the efficiency of this table:
What is the efficiency of a query against this table using the primary key? Meaning, how Cassandra finds a specific row after resolving in which partition it resides?
Considering that the rows will expire and create many tombstones, how does this will effect the reads and writes to this table? Let's say that we expire the data after 180 days, if I am not mistaken, the ratio of tombstones would be 10/180~=0.056 (when 10 is the gc_grace_periods in days).
In your case, the primary key is equal to the partition key, so you have so-called "skinny" partitions, consisting of one row. If you remove data, then instead of data inside partition you'll have only tombstone, and it's not a problem. If the data is expired, then it will be simply removed during compaction - gc_grace_period isn't applied here - it's required only when you explicitly remove the data - we need to keep tombstone because other nodes may need to "catch up" with changes if they weren't able to receive delete operation. You can find more details about data deletion in following document.
Problem with tombstones arise when you have many (thousands) of rows inside the same partition, for example, if you use several clustering keys. And when such data is deleted, then the tombstone is generated, and should be skipped when we read data inside partition.
P.S. Have you seen this blog post that explains how deletions happen?
After reading the blog (and the comments) that #Alex referred me to, I concluded that tombstones are created for expired rows due to default_time_to_live of the table.
Those tombstones will be cleaned only after gc_grace_periods have passed. See this stack overflow question.
Regarding my first questions this datastax page describes it pretty well.

Cassandra Time Series Data Modelling and Limiting Partition Size

We are currently investigating Cassandra as the database for a large time series system.
I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.
What we have is high velocity timeseries data coming in for many weather stations. Each weather station has a number of "sensors" that each collect three metrics: temperature, humidity, and light.
We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.
We would like there to be a single row for each (weather_station_id, year, day_of_year), that is, a new row for every day. However, we still want the partition key to be weather_station_id - that is, we want all readings for a station to be on the same node.
We currently have the following schema, but I would like to get some feedback.
CREATE TABLE weather_station_data (
weather_station_id int,
year int,
day_of_year int,
time timestamp,
sensor_id int,
temperature int,
humidity int,
light int,
PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC, sensor_id DESC);
In the aforementioned document, they make use of this "limit partition row by date" concept. However, it is unclear to me whether or not the date in their examples is part of the partition key.
According to the tutorial, if we choose to have weather_station_id as the only partition, the row will be exhausted.
i.e C* has a practical limitation of 2 billion columns per partition.
So IMO, your data-model is bad.
However, it is unclear to me whether or not the date in their examples is part of the partition key.
The tutorial used
PRIMARY KEY ((weatherstation_id,date),event_time)
So, yes they considered data to be part of partition key.
we want all readings for a station to be on the same node.
I am not sure, why you wan't such a requirement. You can always fetch weather data using multiple queries for more than one year.
select * from weather_station_data where weather_station_id=1234 and year= 2013;
select * from weather_station_data where weather_station_id=1234 and year= 2014;
So consider changing your structure to
PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)
Hope it helps!
In my opinion the datastax model isn't really great. The problem with this model:
They are using the weather station as partition key. All rows with the same partition key are stored on the same machine. This means: If you have 10 years raw data (100ms steps), you will break cassandras limit really fast. 10 years × 365 days × 24 hours × 60 min × 60 seconds x 10 (for 100ms steps) x 7 columns. The limit is 2 Billion. In my opinion you will not use the benefits of cassandra if you build this data model. You can also use, for each weather station, a mongo, mysql or another database.
The better solution: Ask yourself how you will query this data. If you say: I query all data per year, use the year also as partion key. If you need also query data of more than one year, you can create two queries with a different year. This works and the performance is better. (The bottleneck is maybe only the network to your client)
One little more tipp: Cassandra isn't like mysql. It's a denormalized database. This means: It's not dirty to save your data more than one time. This means: It is important for your to query your data per year, it's also important to query your data per hour, per day of year or per sensor_id, you can create column families with different partition key and parimary key order. It's okay to duplicate your data. Cassandra is optimized for write performance, not for read. This means: It's often better to write the data in the right order instead of reading it in the right order. In cassandra 3.0 there is a new feature, called materialized views, for automatic duplicating. And if you think: Ohhh no, i will duplicate the needed storage. Remember: Storage is really cheap. It's okay to buy ten HDDs with 1tb. It cost nothing. The performance is important.
I have one question to your: Can you aggregate your data? Cassandra has a column type called counter. You can create a java/scala application where your aggregate your data while they are produced. You can use a streaming framework for this: Flink or Spark. (If you need a bit more than only counting.). One scenario: You aggregating your data for each hour and day. You got your data in your streaming app. Now: You have an variable for hourly data. You count up or down or whatever. If the hour is finishes, your put this row in your hourly column family and daily column family. In your daily column family your using a counter. I hope, you understand what i mean.

how to avoid sorting on clustering key columns in cassandra

I am a bit new to cassandra.
I have created a table like below
create table events(day text, hour text, sip text, dip text, count, counter,
primary key((day,hour), sip,dip));
our use case is, application receives many events per second. we would like to have a seprate partition per hour of a day and we need to update the counter if the same event is received again. and also we would like to have unique entries for the combination of dip and sip columns hence I have included those as part of the primary key.
Here as dip, sip columns are forming a clustering key, sorting is taking place while inserting the records into the table. In our case sorting is not required for these columns, sorting is a overhead while we include millions of rows in a table. How to avoid this sorting overhead, Can any one help me?
Ordering by clustering columns is needed for Cassandra to function correctly. It needs to store the data that way to keep the row keys unique and to support things like range queries on clustering columns. As Arun says, this allows your subsequent updates to run quickly.
You could reduce the amount of sorting by inserting rows in sorted order, for example by having the first clustering column be a time stamp. But then you'd lose the benefit of being able to increment your counter since you wouldn't know the time stamp key of the earlier event. To get the final counts you'd need to do a roll up operation after each hour to aggregate matching events.
Another way would be to make sip and/or dip part of your partition key. Each event would then hash to a different partition bucket and no sorting would be required. But then you'd lose the grouping of events into one hour partitions. This could be good or bad depending on your needs. If you have a very high rate of events, grouping them all into the same one hour partition could create hot spots since all events will hash to the same node, so making events separate partitions would spread out the write load. If reading the events later as a one hour chunk is more important to you, then having them grouped into one partition will make reading them more efficient at the cost of more expensive writes due to the sorting.
So in general, if you keep your partitions to a reasonable size, the sorting overhead should not be too large since it is done in memory. If your partitions are so large that they are causing performance problems, decrease their size by adding another field to the partition key to break the partitions into smaller chunks to spread out the load on more nodes.

how to get last n results by updated time in cassandra?

I want to fetch last n, say last 5 updated rows i.e. order by updated_time desc in cassandra. Is there any good way of doing it?
Exact use case is like, I want to update the count of event whenever it occurs in the event table and fetch the last five events by updated time along with the count.
table structure:-
event_name text, updated_time timestamp, count counter
In Cassandra you can retrieve the editing time with writetime (cell_name). But as you have multiple columns and Cassandra is fast-reads only you may consider doing another view providing exactly the data needed in an ordered manner. On that new table you want to limit read results and periodically trim it down.
It may be possible doing it with writetime() -- but this was not the Cassandra way as it is too slow in production. Another table with just your data is the denormalized Cassandra way of solving it.

Resources