Using default TTL columns but high number of tombstones in Cassandra

I use Cassandra 3.0.12, and I have a Cassandra column family (CQL table) with the following schema:
CREATE TABLE win30 (
    cust_id text,
    tid timeuuid,
    info text,
    PRIMARY KEY (cust_id, tid)
) WITH CLUSTERING ORDER BY (tid DESC)
  AND compaction = {'class': 'DateTieredCompactionStrategy', 'max_sstable_age_days': 31};
ALTER TABLE win30 WITH default_time_to_live = 2592000;
I have set the default_time_to_live property for the entire table, but when I query the table,
select * from win30 order by tid desc limit 9999;
Cassandra warns:
Read xx live rows and xxxx tombstone cells for query xxxxxx (see tombstone_warn_threshold).
According to this doc, "How is data deleted":
Cassandra allows you to set a default_time_to_live property for an
entire table. Columns and rows marked with regular TTLs are processed
as described above; but when a record exceeds the table-level TTL,
Cassandra deletes it immediately, without tombstoning or compaction.
"but when a record exceeds the table-level TTL,Cassandra deletes it immediately, without tombstoning or compaction."
Why does Cassandra still warn about tombstones when I have set a default_time_to_live?
I insert data using CQL like the following, without specifying a TTL:
insert into win30 (cust_id, tid, info) values ('123', now(), 'sometext');
There is a similar question, but it does not use default_time_to_live.
And it seems that I could set unchecked_tombstone_compaction to true?
Another question: I select data ordered the same as the CLUSTERING ORDER,
so why does Cassandra hit so many tombstones?

Why does Cassandra still warn about tombstones when I have set a default_time_to_live?
The way TTL works in Cassandra is that once a record expires, it is marked as a tombstone (the same process as deleting a record). So instead of running a manual purge job as you would in the RDBMS world, Cassandra lets you clean up old records based on their TTL. But it still goes through the same process as a DELETE, hence the tombstones. Since your TTL value is 2592000 (30 days), anything older than 30 days in the table gets expired (marked as a tombstone, i.e. deleted).
Now, the reason for the warning is that your SELECT statement is looking for records that are live (non-deleted), and the warning message reports how many tombstoned (expired/deleted) records were encountered in the process. So while trying to serve 9999 live records, the query hit X tombstones along the way.
Since the TTL is set at the table level, any record inserted into this table gets a default TTL of 30 days.
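If you want to see the default in action, CQL's TTL() function returns the remaining lifetime of a regular column. A quick check against the example row inserted in the question:

SELECT TTL(info) FROM win30 WHERE cust_id = '123';
-- shows the seconds remaining before each row expires, counting down from 2592000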
Here is the documentation reference, in case you want to read more.
After the number of seconds since the column's creation exceeds the TTL value, the data is considered expired and is no longer included in results. Expired data is marked with a tombstone on the next read on the read path, but it remains on disk for a maximum of gc_grace_seconds.
The above reference is from this link.
And it seems that I could set unchecked_tombstone_compaction to true?
It's not related to the warning that you are getting. You could think about reducing the gc_grace_seconds value (default 10 days) to get rid of tombstones quicker, but there is a reason the default is 10 days.
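If you do decide to lower it, it is a single table property; here is a sketch with an illustrative one-day value (the exact number is your call, but repairs must complete within whatever window you choose):

-- 86400 s = 1 day; tombstones become collectable sooner, but every node
-- must then be repaired more frequently than this window
ALTER TABLE win30 WITH gc_grace_seconds = 86400;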
Note that DateTieredCompactionStrategy is deprecated, and once you upgrade to Apache Cassandra 3.11 or DSE 5.1.2 there is TimeWindowCompactionStrategy, which does a better job of handling tombstones.
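For reference, once on 3.11+ the switch could look like the sketch below; the one-day window is an assumption you would tune to your expiry and query pattern:

ALTER TABLE win30 WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1};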

Related

Cassandra: Does deleting a whole partition create tombstones?

I'm new to Cassandra. I had a situation where a per-partition delete is performed. Does deleting the entire partition create tombstones? Right now space is not being released after the deletion.
Yes, deletion of a whole partition creates a special type of tombstone that "shadows" all the data in the partition. But like the other tombstones, it is kept for gc_grace_seconds and only collected after that.
There is a great blog post from The Last Pickle that explains tombstones in great detail.
As mentioned, you can set gc_grace_seconds to 0, but I wouldn't recommend that unless you have only one node in your cluster or RF=1. You could instead reduce GC grace to a time that is acceptable for you; I'd set it to the maximum time I think a Cassandra node could stay down.
Another option for immediately releasing space is to change your data model to use truncate/drop. For instance, if you only need your data for 24 hours, you could create one table per day and at some point drop the tables you no longer need, as in the sketch below.
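A minimal sketch of that pattern, with hypothetical day-suffixed table names and columns:

-- writes for June 1st go to this table; the schema mirrors the regular table
CREATE TABLE events_20230601 (
    cust_id text,
    tid timeuuid,
    info text,
    PRIMARY KEY (cust_id, tid)
);
-- once the day falls out of the retention window, dropping the table
-- releases its SSTables (subject to auto_snapshot) without any tombstones
DROP TABLE events_20230601;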
I made a test: inserting new data after a delete with the same partition key.
create table message_routes (
    user_id bigint,
    route_id bigint,
    primary key ((user_id), route_id)
);
insert into message_routes (user_id, route_id) values (1, 2);
delete from message_routes where user_id = 1;
insert into message_routes (user_id, route_id) values (1, 3);
After each stage, nodetool flush & nodetool compact were executed, but the tombstone from stage 2 wasn't evicted, as shown by sstablemetadata, even though a new insert for the same partition followed the delete. I was hoping Cassandra had optimizations for such cases.
It's interesting how these tombstones affect select queries by partition key if deletes are frequent:
select * from message_routes where user_id = 1;

How to delete data from Cassandra table with TWCS and counter column?

I have a table that uses TWCS including a counter column:
create table sensors_by_time (
    group text,     // sensor group
    date date,      // bucketing
    id text,        // sensor id
    count counter,  // detected count
    primary key ((group, date), id))
WITH CLUSTERING ORDER BY (id DESC)
AND compaction = {
    'compaction_window_size': '24',
    'compaction_window_unit': 'HOURS',
    'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'};
After a week I have 7 SSTables (1 for each day). I only need the data for 7 days, so I thought of using TTL and gc_grace_seconds, but Cassandra does not support TTL on a table with a counter column.
My other option is to use some job to delete data older than 7 days, but I understand that's not good for performance because of TWCS: http://www.redshots.com/cassandra-twcs-must-have-ttls/
How should I delete old data from such a table?
I know I'm resurrecting an old question, but I ran into a similar problem, and wrote a tool to help solve it. On each node, you'll have to:
stop the Cassandra process
delete the SSTables that contain the old records
start the process again
The difficult part is knowing which SSTables contain date ranges you're no longer interested in. Cassandra comes with a tool, sstablemetadata, that displays SSTable metadata, including the min/max timestamps.
sstablemetadata is slow, and the output is difficult to process. Instead try ls-sstm, which outputs nicely formatted tabular data about each SSTable within a Cassandra table directory: https://github.com/lokkju/cassandra-tools/blob/main/ls-sstm.sh

Cassandra simple primary key queries

We would like to create a Cassandra table with a simple primary key consisting of a UUID column.
The table will look like:
CREATE TABLE simple_table (
    id UUID PRIMARY KEY,
    col1 text,
    col2 text,
    col3 UUID
);
This table will potentially store a few billion rows, and the rows should expire after some time (a few months) using the TTL feature.
I have a few questions regarding the efficiency of this table:
What is the efficiency of a query against this table using the primary key? That is, how does Cassandra find a specific row after resolving which partition it resides in?
Considering that the rows will expire and create many tombstones, how will this affect reads and writes to this table? Let's say we expire the data after 180 days; if I am not mistaken, the ratio of tombstones would be 10/180 ≈ 0.056 (where 10 is the gc_grace_seconds period in days).
In your case, the primary key is equal to the partition key, so you have so-called "skinny" partitions consisting of one row. If you remove data, then instead of the data the partition will contain only a tombstone, and that's not a problem. If the data is expired, it will simply be removed during compaction; gc_grace_seconds isn't applied here. That grace period is required only when you explicitly remove data: we need to keep the tombstone because other nodes may need to "catch up" with changes if they weren't able to receive the delete operation. You can find more details about data deletion in the following document.
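To make the question's scenario concrete, an insert carrying the 180-day expiry could look like the sketch below (the explicit USING TTL is one option; a table-level default_time_to_live would have the same effect):

-- 180 days = 15552000 seconds; uuid() generates a random type-4 UUID
INSERT INTO simple_table (id, col1, col2, col3)
VALUES (uuid(), 'some text', 'more text', uuid())
USING TTL 15552000;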
The problem with tombstones arises when you have many (thousands of) rows inside the same partition, for example if you use several clustering keys. When such data is deleted, a tombstone is generated, and it has to be skipped when reading data inside the partition.
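A hypothetical wide-partition table illustrates the difference; every row-level delete leaves a tombstone that later partition scans must step over:

CREATE TABLE user_events (
    user_id text,
    event_time timeuuid,
    payload text,
    PRIMARY KEY (user_id, event_time)
);
-- each row-level delete writes a tombstone inside the 'u1' partition
DELETE FROM user_events WHERE user_id = 'u1' AND event_time = 50554d6e-29bb-11e5-b345-feff819cdc9f;
-- this read has to skip every such tombstone while scanning the partition
SELECT * FROM user_events WHERE user_id = 'u1';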
P.S. Have you seen this blog post that explains how deletions happen?
After reading the blog (and the comments) that @Alex referred me to, I concluded that tombstones are created for expired rows due to the default_time_to_live of the table.
Those tombstones will be cleaned up only after gc_grace_seconds has passed. See this Stack Overflow question.
Regarding my first question, this DataStax page describes it pretty well.

How does the data overhead of multiple columns with TTL in Cassandra work?

In the documentation for expiring data in Cassandra (here) it is mentioned that:
Expiring data has an additional overhead of 8 bytes in memory and on disk (to record the TTL and expiration time) compared to standard data.
If one sets a TTL (time-to-live) at the table level, does that mean that for each data entry there is an overhead of 8 bytes more in memory and on disk multiplied by the number of columns, or is it independent of the number of columns?
For example, the documentation also has an example (here) of determining the TTL for a column, even though data is inserted into more than one column and the TTL is defined for the actual data entry being inserted, not on a per-column basis.
No, not anymore at least. That documentation is outdated and only relevant pre-3.0.
Currently, if all the columns in a partition or in a row have the same TTL set at insertion, it is set just once for them. Where TTLs are stored, they are written delta-encoded against the SSTable's minTimestamp as an unsigned variable-length int, not as 8 bytes.
According to the Cassandra documentation, the create table section says:
default_time_to_live
TTL (Time To Live) in seconds, where zero is disabled. When specified, the value is set for the Time To Live (TTL) marker on each column in the table; default value: 0. When the table TTL is exceeded, the table is tombstoned.
Meaning that when you define a TTL for the table, it applies to each column (except the primary key columns).
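You can observe the per-column behaviour with CQL's TTL() function. A minimal sketch with a hypothetical table:

-- hypothetical table with a table-level default TTL of one hour
CREATE TABLE t_ttl_demo (id int PRIMARY KEY, col1 text, col2 text)
    WITH default_time_to_live = 3600;
INSERT INTO t_ttl_demo (id, col1, col2) VALUES (1, 'a', 'b');
-- both columns report the same remaining TTL, counting down from 3600
SELECT TTL(col1), TTL(col2) FROM t_ttl_demo WHERE id = 1;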

Cassandra TTL for table behaviour

Suppose I inserted data-1 at second 1 and data-2 at second 2, and the default TTL for the table is set to 10 seconds, for example:
Question 1: Are data-1 and data-2 both going to be deleted after 10 seconds, or will data-1 be deleted after 10 seconds and data-2 after 11 seconds (as it was inserted at second 2)?
Question 2: Is it possible to set a TTL at the table level in such a way that each entry in the table expires based on the TTL in a FIFO fashion (data-1 expires at second 10 and data-2 at second 11), without specifying a TTL for each data point at insert time? (It should be possible to specify this at the table level?)
Thanks for the help :)
EDIT:
The page at https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html says:
Setting a TTL for a table
The CQL table definition supports the default_time_to_live property, which applies a specific TTL to each column in the table. After the default_time_to_live TTL value has been exceeded, Cassandra tombstones the entire table. Apply this default TTL to a table in CQL using CREATE TABLE or ALTER TABLE.
they say "entire table" which confused me.
TTL at the table level is in no way different from TTL at the value level: it specifies the default TTL for each row.
The TTL specifies after how many seconds the values must be considered outdated and thus deleted. The reference point is the INSERT/UPDATE timestamp, so if you insert/update a row at 09:53:01:
with a TTL of 10 seconds, it will expire at 09:53:11
with a TTL of 15 seconds, it will expire at 09:53:16
with a TTL of 0 seconds, it will never expire
You can override the default TTL by specifying a USING TTL X clause in your queries, where X is your new TTL value, as in the sketch below.
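A minimal sketch with a hypothetical table:

-- hypothetical table
CREATE TABLE events (id int PRIMARY KEY, value text);
-- both statements override any table-level default for the cells they write
INSERT INTO events (id, value) VALUES (1, 'expires in 10 s') USING TTL 10;
UPDATE events USING TTL 15 SET value = 'expires in 15 s' WHERE id = 2;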
Please note that using TTL unwisely can cause tombstone problems. Note also that TTL usage has some quirks; have a look at this recent answer for further details.
Question 1 answer: data-1 will be deleted after 10 seconds and data-2 will be deleted after 11 seconds.
Question 2 answer: Cassandra inserts every column with the table's TTL, so every column will expire at its insertion time + TTL.
I read this topic and a lot of others, but I'm still confused, because at https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useExpire.html
they say exactly this:
If any column exceeds TTL, the entire table is tombstoned.
What do they mean? I understand that there is no sense in tombstoning all columns in the table when only one exceeded the default_time_to_live, but that's exactly what they wrote!
UPD: I did several tests. default_time_to_live just means a default TTL at the column level. When this TTL expires, only the concrete columns with the expired TTL are tombstoned.
They used a very strange sentence in that article.
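A minimal way to reproduce that observation (table name and values are illustrative):

CREATE TABLE ttl_test (id int PRIMARY KEY, a text, b text)
    WITH default_time_to_live = 60;
INSERT INTO ttl_test (id, a, b) VALUES (1, 'x', 'y');
-- ~30 seconds later, refresh only column b; its TTL restarts at 60
UPDATE ttl_test SET b = 'z' WHERE id = 1;
-- once the first 60 seconds elapse, only column a is tombstoned
SELECT id, a, b FROM ttl_test WHERE id = 1;  -- a is null, b is still 'z'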
