TTL in Cassandra creating tombstones

I am only doing inserts into Cassandra. While inserting, null values are never written (to avoid tombstones), but a few records are inserted with a TTL. Running select count(*) on the table then produces the following warning:
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM
xx.yy WHERE token(y) >=
token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see
tombstone_warn_threshold)
Do TTL inserts lead to tombstones in Cassandra 3.7? How can the warning be mitigated?
There are no updates, only inserts; some records are written without a TTL, others with one.
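For reference, inserts like these (table and column names are hypothetical, modeled loosely on the xx.yy table from the warning) describe the scenario; the second form is the one that eventually surfaces as tombstone cells:

```sql
-- Hypothetical table resembling xx.yy from the warning above
CREATE TABLE xx.yy (
    y uuid PRIMARY KEY,
    payload text
);

-- No TTL: the cell never expires and never becomes a tombstone
INSERT INTO xx.yy (y, payload) VALUES (uuid(), 'kept forever');

-- With TTL: after 60 seconds the cell is treated as expired, and it
-- behaves like a tombstone until gc_grace_seconds has also passed
INSERT INTO xx.yy (y, payload) VALUES (uuid(), 'short lived') USING TTL 60;
```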

From datastax documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is not included in results. Expired data is marked with a tombstone on the next read, but it remains for a maximum of gc_grace_seconds. After this amount of time, the tombstoned data is automatically removed during the normal compaction and repair processes.
These entries will be treated as tombstones until compaction or repair.

To add one more point about TTL and compaction: even after gc_grace_seconds has passed, a tombstone-clearing compaction with default settings only kicks off depending on tombstone_compaction_interval and tombstone_threshold.
Previously, we had read-timeout issues due to a high number of tombstones on tables with a high number of records. Eventually, we needed to reduce tombstone_threshold as well as enable unchecked_tombstone_compaction so that compaction was triggered more frequently.
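As an illustrative sketch (table name hypothetical, values illustrative rather than recommendations), those subproperties are set per table through the compaction map:

```sql
ALTER TABLE xx.yy WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    -- default 0.2: estimated tombstone ratio above which a
    -- single-SSTable tombstone compaction is considered
    'tombstone_threshold': '0.1',
    -- default 86400 (1 day): minimum time between tombstone
    -- compactions of the same SSTable
    'tombstone_compaction_interval': '43200',
    -- run the tombstone compaction even when overlap checks suggest
    -- the tombstones may not be droppable
    'unchecked_tombstone_compaction': 'true'
};
```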
You can refer to the docs below for more details:
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateTable.html?hl=unchecked_tombstone_compaction#tabProp__cqlTableGc_grace_seconds


Cassandra tombstones with TTL

I have worked with Cassandra (DSE) for quite some time and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single-node cluster (if you have a multi-node cluster, ensure RF=nodeCount to make things easier).
It's a very simple example:
Create the following simple table:
CREATE TABLE mytable (
    status text,
    process_on_date_time int,
    PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60;
I have a piece of code that inserts 5k records at a time, up to 200k total records, with a TTL of 300 seconds. The status is ALWAYS "pending" and process_on_date_time is a counter that increments by 1, starting at 1 (all unique records, basically 1 through 200k).
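Each iteration of that code issues inserts of roughly this shape (a sketch; the actual loop lives in the asker's client code):

```sql
-- One of the 200k unique rows; every row carries a 300-second TTL
INSERT INTO mytable (status, process_on_date_time)
VALUES ('pending', 1)
USING TTL 300;
```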
I run the code and then once it completes, I flush the memtable to disk. There's only a single sstable created. After this, no compaction, no repair, nothing else runs that would create or change the sstable configuration.
After the sstable dump, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE and paging off. I then run this repetitively:
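In cqlsh, that session setup corresponds to these commands (cqlsh meta-commands rather than CQL proper):

```sql
TRACING ON;
CONSISTENCY LOCAL_ONE;
PAGING OFF;
```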
SELECT * from mytable where status = 'pending' and process_on_date_time <= 300000;
What is interesting is I see things like this (cutting out some text for readability):
Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen)
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point)
Run X+2) Read 0 live rows and 60429 tombstone cells
Run X+3) Read 0 live rows and 55894 tombstone cells
...
Run X+X) Read 0 live rows and 0 tombstone cells
What is going on? The sstable isn't changing (obviously as it's immutable), nothing else inserted, flushed, etc. Why is the tombstone count decreasing until it's at 0? What causes this behavior?
I would expect to see, on every run, 100k tombstones read and the query aborting, as all TTLs have expired in the single sstable.
For anyone else who may be curious about this, I opened a ticket with DataStax, and here is what they mentioned:
After the tombstones pass the gc_grace_seconds they will be ignored in
result sets because they are filtered out after they have passed that
point. So you are correct in the assumption that the only way for the
tombstone warning to post would be for the data to be past their TTL
but still within gc_grace.
And since they are ignored/filtered out they won't have any harmful
effect on the system since, like you said, they are skipped.
So what this means is that if TTLs expire, but are within gc_grace_seconds, they will be counted as tombstones when queried against. If the TTLs expire AND gc_grace_seconds has also elapsed, they will NOT be counted as tombstones (they are skipped). The system still has to "weed" through the expired TTL records, but other than processing time, they are not "harmful" to the query. I found this very interesting, as I don't see it documented anywhere.
Thought others may be interested in this information and could add to it if their experiences differ.

Which is better to use in Cassandra: TTL or DELETE?

I want to remove records from Cassandra cluster after a particular time.
So what should I use: TTL or manual delete?
The answer is "it depends". Deleting data in Cassandra is never free.
If you use DELETE, you always have to issue those queries yourself; with a TTL, the deletion is arranged from the moment you write the data. On the other hand, DELETE gives you more control over data deletion.
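To make the contrast concrete (hypothetical table and key):

```sql
-- Explicit delete: you must issue this query yourself, at the right
-- time, and it writes a tombstone when executed
DELETE FROM events WHERE id = 42;

-- TTL: expiry is decided once, at write time; no follow-up query needed
INSERT INTO events (id, body)
VALUES (42, 'expires in a week')
USING TTL 604800;
```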
On the operations side, you should try to get your tombstones into the same SSTable, so that once gc_grace has expired the whole SSTable can be dropped. Data is only actually deleted when SSTables are compacted; even if gc_grace has passed, if no compaction has touched the SSTable holding the tombstone, the tombstone will not be deleted from disk. This also makes the choice of compaction strategy for your table relevant.
If you are also generating a lot of tombstones, you should enable unchecked_tombstone_compaction at the table level. You can read more about that here: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html
It depends on your data model. The fortunate answer is that, due to their predictable nature, you can build your data model to accommodate TTLs.
Let's say I build the following table to track user requests to a REST service. Suppose I really only care about the last week's worth of data, so I'll set a TTL of 604800 seconds (7 days). The query I need to support is basically this (querying transactions for user 'Bob' over the prior 7 days):
SELECT * FROM rest_transactions_by_user
WHERE username='Bob' AND transaction_time > '2018-05-28 13:41';
To support that query, I'll build this table:
CREATE TABLE rest_transactions_by_user (
    username TEXT,
    transaction_time TIMESTAMP,
    service_name TEXT,
    HTTP_result BIGINT,
    PRIMARY KEY (username, transaction_time))
WITH CLUSTERING ORDER BY (transaction_time DESC)
AND gc_grace_seconds = 864000
AND default_time_to_live = 604800;
A few things to note:
I am leaving gc_grace_seconds at the default of 864000 (ten days). This will ensure that the TTL tombstones will have adequate time to be propagated throughout the cluster.
Rows will TTL at 7 days (as mentioned above). After that, they become tombstones for an additional 10 days.
I am clustering by transaction_time in DESCending order. This puts the rows I care about (the ones that haven't TTL'd) at the "top" of my partition (sequentially).
By querying for a transaction_time of the prior 7 days, I am ignoring anything older than that. As my TTL tombstones will exist for 10 days afterward, they will be at the "bottom" of my partition.
In this way, limiting my query to the last 7 days ensures that Cassandra will never have to deal with the tombstones, as my query will never find them. So in this case, I have built a data model where a TTL is "better" than a random delete.
Letting records expire based on TTL is better. With TTL-based deletion, you can set gc_grace_seconds to a much lower value (a day or two) and you do not have to worry about tombstones lingering for a long duration.
With manual deletes, you need to make sure the tombstones do not grow beyond the warning and error thresholds, as that impacts queries.
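Assuming a table whose rows only ever expire by TTL (name hypothetical), that lower value might be applied like so; per the DataStax note quoted later in this thread, hints on tombstoned records are not executed within gc_grace, so don't set it shorter than the time a node might plausibly stay down:

```sql
-- With no explicit DELETEs, only TTL expirations, a shorter grace
-- period (here 1 day) limits how long expired cells act as tombstones
ALTER TABLE sensor_readings WITH gc_grace_seconds = 86400;
```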

Tombstone in Cassandra

I have a Cassandra table with a TTL of 60 seconds, and I have a few questions about it.
1) I am getting the following warning:
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM xx.yy WHERE token(y) >= token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see tombstone_warn_threshold)
What does this mean?
2) As per my study, a tombstone is a flag set when TTL'd data expires (deleted after gc_grace_seconds):
i) So does this mean the data won't be deleted for up to 10 days?
ii) What are the consequences of it waiting 10 days?
iii) Why is it such a long time, 10 days?
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
gc_grace_seconds 864000 [10 days] The number of seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage-collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_period. The default value allows a great deal of time for Cassandra to maximize consistency prior to deletion. For details about decreasing this value, see garbage collection below.
3) I read that performing compaction and repair with nodetool will delete tombstones. How frequently do we need to run these in the background, and what are the consequences?
This means that your query returned 76 "live" or non-deleted/non-obsoleted rows of data, and that it had to sift through 1324 tombstones (deletion markers) to accomplish that.
In the world of distributed databases, deletes are hard. After all, if you delete a piece of data from one node, and you expect that deletion to happen on all of your nodes, how would you know if it worked? Quite literally, how do you replicate nothing? Tombstones (delete markers) are the answer to that question.
i. The data is gone (obsoleted, rather). The tombstone(s) will remain for gc_grace_seconds.
ii. The "consequence" is that you'll have to put up with those tombstone warning messages for that duration, or find a way to run your query without having to scan over the tombstones.
iii. The idea behind the 10 days is that if tombstones are collected too early, your deleted data can "ghost" its way back onto some nodes. 10 days gives you enough time to run a weekly repair, which ensures your tombstones are properly replicated before removal.
Compaction removes tombstones. Repair replicates them. You should run repair once per week. While you can run compaction on-demand, don't. Cassandra has its own thresholds (based on number and size of SSTable files) to figure out when to run compaction, and it's best not to get in its way. If you do, you'll be manually running compaction from there on out, as you'll probably never reach the compaction conditions organically.
The consequences are that both repair and compaction take compute resources and can reduce a node's ability to serve requests. But they need to happen. You want them to happen. If compaction doesn't run, your SSTable files will grow in number and size, eventually causing rows to be spread over multiple files, and queries for them will get slow. If repair doesn't run, your data is at risk of falling out of sync.

Cassandra SizeTieredCompaction tombstone

I've got an append only Cassandra table where every entry has a ttl. Using SizeTieredCompaction the table seems to grow unbounded. Is there a way to ensure that sstables are checked for tombstoned columns more often?
Tombstones are effectively deleted once the grace period (see gc_grace_seconds) expires and a compaction occurs. Note that gc_grace_seconds is a per-table property (default 864000 seconds, 10 days), set via CREATE TABLE or ALTER TABLE rather than in the YAML configuration; change it to something suitable for your needs. Beware that lowering this value has its own drawbacks, e.g. rows "resurrecting" if a node is down for longer than the grace period.
Check the datastax documentation about deletes in Cassandra.
If you have a high volume of input data and do not modify it, you may be better off with DateTieredCompactionStrategy, as it is more efficient at processing and removing tombstones.
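Both suggestions combined might look like this sketch (keyspace and table names hypothetical; DateTieredCompactionStrategy fits the append-only, time-ordered workload described):

```sql
ALTER TABLE metrics.samples
WITH gc_grace_seconds = 259200   -- 3 days instead of the 10-day default
AND compaction = {'class': 'DateTieredCompactionStrategy'};
```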

When does Cassandra remove data after it has been deleted?

Recently I have been trying to familiarize myself with Cassandra but don't quite understand when data is removed from disk after it has been deleted. The use case I'm particularly interested is expiring time series data with DTCS. As an example, consider the following table:
CREATE TABLE metrics (
    metric_id text,
    time timestamp,
    value double,
    PRIMARY KEY (metric_id, time)
) WITH CLUSTERING ORDER BY (time DESC) AND
    default_time_to_live = 86400 AND
    gc_grace_seconds = 3600 AND
    compaction = {
        'class': 'DateTieredCompactionStrategy',
        'timestamp_resolution': 'MICROSECONDS',
        'base_time_seconds': '3600',
        'max_sstable_age_days': '365',
        'min_threshold': '4'
    };
I understand that Cassandra will create a tombstone for all rows inserted into this table after 24 hours (86400 seconds). These tombstones will first be written to an in-memory Memtable and then flushed to disk as an SSTable when the Memtable reaches a certain size. My question is: when will the now-expired data be removed from disk? Is it the next time the SSTable which contains the data gets compacted? So, with DTCS and min_threshold set to four, we would wait until at least three other SSTables are in the same time window as the expired data, and then those SSTables would be compacted into a single SSTable. Is it during this compaction that the data will be removed? It seems to me that this would require Cassandra to maintain some metadata on which rows have been deleted, since the newer tombstones would likely not be in the older SSTables that are being compacted.
Alternatively, do the SSTables which contain the tombstones have to be compacted with the SSTables which contain the expired data for the data to be removed? It seems to me that this could result in Cassandra holding the expired data long after it has expired since it's waiting for the new tombstones to be compacted with the older expired data.
Finally, I was also unsure when the tombstones themselves are removed. I know Cassandra does not delete them until after gc_grace_seconds but it can't delete the tombstones until it's sure the expired data has been deleted right? Otherwise it would see the expired data as being valid. Consequently, it seems to me that the question of when tombstones are deleted is intimately tied to the questions above. Thanks!
If it helps I've been experimenting with version 2.0.15 myself.
There are two ways to definitively remove data in Cassandra.
1: When gc_grace_seconds expires. In your table, gc_grace_seconds is set to 3600, which means that when you execute a delete statement on a row, you have to wait 3600 seconds before the data is definitively removed from the whole cluster.
2: When a compaction runs. During a compaction, Cassandra looks for all the data marked with a tombstone and simply skips it when writing the new SSTable, to ensure the new SSTable doesn't contain already-deleted data.
However, it can happen that a node stays down for longer than gc_grace_seconds or misses a compaction; you'll find more information in the Cassandra documentation.
After some further research and help from others I've realized that I had some misconceptions in my original questions. Specifically: "Data deleted by TTL isn’t the same as issuing a delete – each expiring cell internally has a ttl/timestamp at which it will be converted into a tombstone. There is no tombstone added to the memtable, or flushed to disk – it just treats the expired cells as tombstones once they’re past that timestamp."
Furthermore, Cassandra will check whether it can drop SSTables containing only expired data when a memtable is flushed to disk and a minor compaction runs, though no more than once every ten minutes (see this issue). Hope that helps if you had the same questions as me!
