Cassandra : Does deleting a whole partition create tombstone? - cassandra

I'm new to Cassandra. I had a situation where delete per partition is performed. Does deleting the entire partition create tombstones? Right now space is not getting released after the deletion.

Yes, deletion of the whole partition creates a special type of the tombstone that "shadows" the all data in the partition. But like the other tombstones, it's kept for gc_grace_seconds, and only after that collected.
There is a great blog post from the The Last Pickle that explains tombstones in great details

As mentioned you can update gc_grace_seconds to 0 but I wouldn't recommend that unless you only have one node in your cluster or that your RF=1. You could try to reduce GC grace to an acceptable time for you. I'd like to put the maximum time I think a Cassandra node could stay down.
An other option to immediately releasing space is to change your data model to use truncate/drop. For instance if you only need your data for 24h you could create one table per day and at some point drop the tables that you don't need.

I made test with insert new data after delete by the same partition key.
create table message_routes (
user_id bigint,
route_id bigint,
primary key ((user_id), service_id)
)
insert into message_routes (user_id, route_id) values (1, 2)
delete from message_routes where user_id = 1
insert info message_routes (user_Id, route_id) values (1, 3)
After each stage was executed nodetool flush & nodetool compact but tombstone from stage 2 was't evicted as shown by sstablemetadata. After delete was executed new insert. I was hoping that Cassandra has optimizations for such cases.
It's interesting how this tombstones affect select queries by partition key if delete will be frequents?
select * from message_routes where user_id = 1

Related

Best Cassandra data model for maintaining bounded lists per user

I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website is constantly experiencing bot / heavy users that is why we want to cap events, just to consider "normal" users.
I currently have the current data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time and until we don't clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do, is to use a TTL on that table set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descinding) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
Problem with your existing implementation is that deletes create tombstones which eventually cause latencies in the read. Creating too many tombstones is not recommended.
FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table
2) Using Spark , read from the orignal table , migrate all required records (filter extra records) and write to new temp table.
3) Truncate the orignal table. Note that truncate operation do not create Tombstones.
4) Migrate everything from the temp table back to orignal table using Spark.
5) Truncate the temp table.
You can do this in maintenance window of your application ( something like once in a month) until then you can restrict reads with Limit 100 per partition.

Can we restrict in cassandra that a table only have limited number of records or rows?

Can we restrict in cassandra that a table only have limited number of records or rows? If we want to insert maximum 20 rows in a table then how do we do?
Cassandra does not support this kind of operation. This is part of the business logic in your application and it should be done on application level.
No, but you can make a PER PARTITION LIMIT on the query, then issue a delete periodically to create a range tombstone for everything past that range. ie in a
CREATE TABLE mytable (
primary text
clustering timestamp
value text
PRIMARY KEY ((primary), clustering)
You can SELECT * FROM mytable WHERE primary = 'mykey' PER PARTITION LIMIT 20 which then the last one has a clustering of 1548857236000 can then DELETE FROM mytable WHERE primary = 'mykey' and clustering > 1548857236000. For most part id just issue that delete very infrequently (like once an hour or a day depending on load in order to keep partition size down) and use LeveledCompactionStrategy. If enough load include a date component to the primary key like ((primary, yyyyMMdd), clustering) to prevent too much tombstone much buildup in the partition.

Cassandra simple primary key queries

We would like to create a Cassandra table with Simple Primary Key that is consisted of UUID column.
The table will look like:
CREATE TABLE simple_table(
id UUID PRIMARY KEY,
col1 text,
col2 text,
col3 UUID
);
This table will potentially store few billions of rows, and the rows should expire after some time (few months) using the TTL feature.
I have few questions regarding the efficiency of this table:
What is the efficiency of a query against this table using the primary key? Meaning, how Cassandra finds a specific row after resolving in which partition it resides?
Considering that the rows will expire and create many tombstones, how does this will effect the reads and writes to this table? Let's say that we expire the data after 180 days, if I am not mistaken, the ratio of tombstones would be 10/180~=0.056 (when 10 is the gc_grace_periods in days).
In your case, the primary key is equal to the partition key, so you have so-called "skinny" partitions, consisting of one row. If you remove data, then instead of data inside partition you'll have only tombstone, and it's not a problem. If the data is expired, then it will be simply removed during compaction - gc_grace_period isn't applied here - it's required only when you explicitly remove the data - we need to keep tombstone because other nodes may need to "catch up" with changes if they weren't able to receive delete operation. You can find more details about data deletion in following document.
Problem with tombstones arise when you have many (thousands) of rows inside the same partition, for example, if you use several clustering keys. And when such data is deleted, then the tombstone is generated, and should be skipped when we read data inside partition.
P.S. Have you seen this blog post that explains how deletions happen?
After reading the blog (and the comments) that #Alex referred me to, I concluded that tombstones are created for expired rows due to default_time_to_live of the table.
Those tombstones will be cleaned only after gc_grace_periods have passed. See this stack overflow question.
Regarding my first questions this datastax page describes it pretty well.

Using default TTL columns but high number of tombstones in Cassandra

I use Cassandra 3.0.12.
And I have a cassandra Column Family, or CQL table with the following schema:
CREATE TABLE win30 (
cust_id text,
tid timeuuid,
info text,
PRIMARY KEY (cust_id , tid )
) WITH CLUSTERING ORDER BY (tid DESC)
and compaction = {'class': 'DateTieredCompactionStrategy', 'max_sstable_age_days': 31 };
alter table win30 with default_time_to_live = '2592000';
I have set the default_time_to_live property for the entire table, but when I query the table,
select * from win30 order by tid desc limit 9999
Cassandra WARN that
Read xx live rows and xxxx tombstone for query xxxxxx (see tombstone_warn_threshold).
According to this doc How is data deleted,
Cassandra allows you to set a default_time_to_live property for an
entire table. Columns and rows marked with regular TTLs are processed
as described above; but when a record exceeds the table-level TTL,
Cassandra deletes it immediately, without tombstoning or compaction.
"but when a record exceeds the table-level TTL,Cassandra deletes it immediately, without tombstoning or compaction."
Why Cassandra still WARN for tombstone since I have set a default_time_to_live?
I insert data using some CQL like, without using TTL.
insert into win30 (cust_id, tid, info ) values ('123', now(), 'sometext');
a similar question but it does not use default_time_to_live
And it seems that I could set the unchecked_tombstone_compaction to true?
Another question, I select data with ordering the same as the CLUSTERING ORDER,
why Cassandra hit so many tombstones?
Why Cassandra still WARN for tombstone since I have set a default_time_to_live?
The way TTL works in Cassandra is that once the record is expired, its marked as tombstone (the same process of deletion of a record). So instead of manually having a purge job in RDBMS world, Cassandra enables you to cleanup old records based on their TTL. But it still follows through the same process as DELETE and hence the tombstone. Since your TTL value is '2592000' (30days), anything older than 30 days in the table gets expired (marked as tombstone - deleted).
Now the reason for the warning is that your SELECT statement is looking for records that are alive (non-deleted) and the warning message is for how many tombstoned (expired / deleted) records were encountered in the process. So while trying to serve 9999 alive records, the table hit X number of tombstones along the way.
Since the TTL is set at table level, any inserted record to this table will have a default TTL of 30days.
Here is the documentation reference, in case you want to read more.
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is included in results. Expired data is marked with a tombstone after on the next read on the read path, but it remains for a maximum of gc_grace_seconds.
Above reference is from this link
And it seems that I could set the unchecked_tombstone_compaction to true?
Its nothing related to the warning that you are getting. You could think about reducing gc_grace_seconds value (default 10 days) to get rid of tombstones quicker. But there is a reason for this value to be 10days.
Note that DateTieriedCompactionStrategy is depcreated and once you upgrade to 3.11 Apache Cassandra or DSE 5.1.2 there is TimeWindowCompactionStrategy which does a better job with handling tombstones.

Cassandra Query Timeout with small set of data

I am having a problem with Cassandra 2.1.17. I have a table with about 40k "rows" in it. One partition I am having a problem with has maybe about 5k entries in it.
Table is:
create table billing (
accountid uuid,
date timeuuid,
credit double,
debit double,
type text,
primary key (accountid,date)
) with clustering order by (date desc)
So there is a lot of inserting and deleting from this table.
My problem is that somehow it seems to get corrupt I think because I am no longer able to select data past a certain point from a partition.
From cqlsh I can run soemthing like this.
SELECT accoutid,date,credit,debit,type FROM billing WHERE accountid=XXXXX-xxxx-xxxx-xxxxx... AND date < 3d466d80-189c-11e7-8a57-f33cbced2fc5 limit 2;
First I did a select limit of 10000 it works up to around 5000 rows pageing through them then towards the end it will give a timeout error.
I then use the second from last timeuuid and select limit 2 it will fail limit 1 will work.
If I use the last timeuuid as a < and limit to 1 it will also fail.
So just looking for what I can do here I am not sure what is wrong and not sure how I can fix/diagnose what happened.
I have tired a repair and force a compaction. but it still seems to have the issue.
Thank you for any help.
Try to start with running manual compaction on your table.
You can increase read_request_timeout_in_ms parameter in cassandra config.
Consider moving to leveled compaction strategy if you are having a lot of deletes and updates.
I think you got too many tombstones in this partition.
What is a tombstone ?
To remember that a record has been deleted Cassandra creates a special value called a "tombstone". A tombstone has a TTL as any other value has but it is not compacted as easily as any other value is. Cassandra keeps it longer to avoid such inconsistency as data reappearence.
How to watch tombstones ?
nodetool cfstats gives you an idea of how many tombstones you have on average per slice
How to fix the issue ?
The duration a tombstone is preserved is gc_grace_seconds. You have to reduce it and then run a major compaction to fix the issue.
It looks to me like you are hitting a lot of tombstones when you do selects. The thing is while they are there cassandra still has to go over them. There might be multiple factors like ttl with insert statements, a lot of deletes, inserting of nulls etc.
My bet would be that you would need to adjust gc_grace_seconds on table and run repairs more often. But be careful and don't set it to to low (one round of repair has to finish before this time).
It's all nicely explained here:
https://opencredo.com/cassandra-tombstones-common-issues/

Resources