Cassandra tombstones with TTL - cassandra

I have worked with cassandra for quite some time (DSE) and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single node cluster (If you have a multi-node cluster, ensure RF=nodeCount to make things easier).
It's very simple example:
Create the following simple table:
CREATE TABLE mytable (
status text,
process_on_date_time int,
PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60
I have a piece of code that inserts 5k records at a time up to 200k total records with TTL of 300 seconds. The status is ALWAYS "pending" and the process_on_date_time is a counter that increments by 1, starting at 1 (all unique records - 1 - 200k basically).
I run the code and then once it completes, I flush the memtable to disk. There's only a single sstable created. After this, no compaction, no repair, nothing else runs that would create or change the sstable configuration.
After the sstable dump, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE and paging off. I then run this repetitively:
SELECT * from mytable where status = 'pending' and process_on_date_time <= 300000;
What is interesting is I see things like this (cutting out some text for readability):
Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen)
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point)
Run X+2) Read 0 live rows and 60429 tombstone cells
Run X+3) Read 0 live rows and 55894 tombstone cells
...
Run X+X) Read 0 live rows and 0 tombstone cells
What is going on? The sstable isn't changing (obviously as it's immutable), nothing else inserted, flushed, etc. Why is the tombstone count decreasing until it's at 0? What causes this behavior?
I would expect to see every run: 100k tombstones read and the query aborting as all TTL have expired in the single sstable.

For anyone else who may be curious to this answer, I opened a ticket with Datastax, and here is what they mentioned:
After the tombstones pass the gc_grace_seconds they will be ignored in
result sets because they are filtered out after they have past that
point. So you are correct in the assumption that the only way for the
tombstone warning to post would be for the data to be past their ttl
but still within gc_grace.
And since they are ignored/filtered out they wont have any harmful
effect on the system since like you said they are skipped.
So what this means is that if TTLs expire, but are within the GC Grace Seconds, they will be counted as tombstones when queried against. If TTLs expire AND GC Grace Seconds also expires, they will NOT be counted as tombstones (skipped). The system still has to "weed" through the expired TTL records, but other than processing time, are not "harmful" for the query. I found this very interesting as I don't see this documented anywhere.
Thought others may be interested in this information and could add to it if their experiences differ.

Related

Which one is better to use TTL or Delete in Cassandra?

I want to remove records from Cassandra cluster after a particular time.
So what Should I use TTL or manually delete?
The answer is "it depends". Deleting data in cassandra is never free.
If you have to "DELETE" you need always to issue those queries, with TTL it's done from the moment you write the data. But by using DELETE you have more control over data deletion.
On the operation side, you should try to get your tombstones in the same sstable so once gc_grace is expired the full sstable can be dropped. Because data is only actually deleted when the sstables are compacted, even if gc_grace has passed, and a compaction didn't happen with the sstable holding the tombstone, the tombstone will not be deleted from the harddrive. This also make relevant the choice of compaction strategy for your table.
If you're also using a lot of tombstones, you should always enable: "unchecked_tombstone_compaction" at table level. You can read more about that here: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html
It depends on your data model. The fortunate answer, is that due to their predictable nature, you can build your data model to accommodate TTLs.
Let's say I build the following table to track user requests to a REST service, for example. Suppose that I really only care about the last week's worth of data, so I'll set a TTL of 604800 seconds (7 days). So the query I need to support is basically this (querying transactions for user 'Bob' for the prior 7 days):
SELECT * FROM rest_transactions_by_user
WHERE username='Bob' AND transaction_time > '2018-05-28 13:41';
To support that query, I'll build this table:
CREATE TABLE rest_transactions_by_user (
username TEXT,
transaction_time TIMESTAMP,
service_name TEXT,
HTTP_result BIGINT,
PRIMARY KEY (username,transaction_time))
WITH CLUSTERING ORDER BY (transaction_time DESC)
AND gc_grace_seconds = 864000
AND default_time_to_live = 604800;
A few things to note:
I am leaving gc_grace_seconds at the default of 864000 (ten days). This will ensure that the TTL tombstones will have adequate time to be propagated throughout the cluster.
Rows will TTL at 7 days (as mentioned above). After that, they become tombstones for an additional 10 days.
I am clustering by transaction_time in DESCending order. This puts the rows I care about (the ones that haven't TTL'd) at the "top" of my partition (sequentially).
By querying for a transaction_time of the prior 7 days, I am ignoring anything older than that. As my TTL tombstones will exist for 10 days afterward, they will be at the "bottom" of my partition.
In this way, limiting my query to the last 7 days ensures that Cassandra will never have to deal with the tombstones, as my query will never find them. So in this case, I have built a data model where a TTL is "better" than a random delete.
Letting the record expire based on TTL is better. With TTL based delete, you can set the gc_grace_seconds to a much lower value (1 day or two) and you do not have to worry about tombstones lingering for a longer duration.
With manual delete, you need to make sure the tombstones do not increase beyond the warning and error threshold, as it impacts the query.

Tombstone in Cassandra

I have a Cassandra table with TTL of 60 seconds, I have few questions in this,
1) I am getting the following warning
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM xx.yy WHERE token(y) >= token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see tombstone_warn_threshold)
What does this mean?
2) As per my study, Tombstone is a flag in case of TTL (will be deleted after gc_grace_seconds)
i) so till 10 days does it mean that it won't be deleted ?
ii) What will be the consequence of it waiting for 10 days?
iii) Why it is a long time 10 days?
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
gc_grace_seconds 864000 [10 days] The number of seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage-collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_period. The default value allows a great deal of time for Cassandra to maximize consistency prior to deletion. For details about decreasing this value, see garbage collection below.
3) I read that performing compaction and repair using nodetool will delete the tombstone, How frequently we need to run this in background, What will be the consequence of it?
This means that your query returned 76 "live" or non-deleted/non-obsoleted rows of data, and that it had to sift through 1324 tombstones (deletion markers) to accomplish that.
In the world of distributed databases, deletes are hard. After all, if you delete a piece of data from one node, and you expect that deletion to happen on all of your nodes, how would you know if it worked? Quite literally, how do you replicate nothing? Tombstones (delete markers) are the answer to that question.
i. The data is gone (obsoleted, rather). The tombstone(s) will remain for gc_grace_seconds.
ii. The "consequence" is that you'll have to put up with those tombstone warning messages for that duration, or find a way to run your query without having to scan over the tombstones.
iii. The idea behind the 10 days, is that if the tombstones are collected too early, that your deleted data will "ghost" its way back up to some nodes. 10 days gives you enough time to run a weekly repair, which ensures your tombstones are properly replicated before removal.
Compaction removes tombstones. Repair replicates them. You should run repair once per week. While you can run compaction on-demand, don't. Cassandra has its own thresholds (based on number and size of SSTable files) to figure out when to run compaction, and it's best not to get in its way. If you do, you'll be manually running compaction from there on out, as you'll probably never reach the compaction conditions organically.
The consequences, are that both repair and compaction take compute resources, and can reduce a node's ability to serve requests. But they need to happen. You want them to happen. If compaction doesn't run, your SSTable files will grow in number and size; eventually causing rows to exist over multiple files, and queries for them will get slow. If repair doesn't run, your data is at risk of not being in-sync.

ttl in cassandra creating tombstones

I am only doing inserts to cassandra. While inserting , not nulls are only inserted to avoid tombstones. But few records are inserted with TTL. But then doing select count(*) from table gives following errors -
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM
xx.yy WHERE token(y) >=
token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see
tombstone_warn_threshold)
Do TTL inserts lead to tombstones in cassandra 3.7 ? How can the warning be mitigated ?
There are no updates done only inserts , some records without TTL , others with TTL
From datastax documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is included in results. Expired data is marked with a tombstone after on the next read on the read path, but it remains for a maximum of gc_grace_seconds. After this amount of time, the tombstoned data is automatically removed during the normal compaction and repair processes.
These entries will be treated as tombstones until compaction or repair.
To add one more point for TTL and compaction. Even though, after gc_grace_seconds, the default setting for compaction only kicks off depending on tombstone_compaction_interval and tombstone_threshold
Previously, we were having read timeout issue due to high number of tombstones for tables having high number of records. Eventually, we need to reduce tombstone_threshold as well as enable unchecked_tombstone_compaction to make compaction process triggered more frequently.
You can refer to the below docs for more details
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateTable.html?hl=unchecked_tombstone_compaction#tabProp__cqlTableGc_grace_seconds

Cassandra: how to reduce the number of tombstones in a table? (tombstone_compaction_interval, gc_grace_seconds and LevelledCompactionStrategy)

I've a table where I insert data with a TTL of 1 minute and I have a warning in DSE OpsCenter about the high number of tombstones in that table. Which does make sense since in average 80 records per minute are inserted in this table.
So for example for a full day 80 * 60 * 24 = 115200 records inserted and TTL'ed in one day.
My question is what should I do in order to decrease the number of tombstones in this table?
I've been been looking into tombstone_compaction_interval and gc_grace_seconds and this is where it gets a bit confusing as I'm having problems to understand the exact impact of these properties on the tombstones (even after reading the documentation provided by DataStax - http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html and http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html).
I've also been looking into LevelledCompactionStrategy (https://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra) since it also does seem to impact the tombstones compaction although I don't fully understand why.
So I'm hoping someone will be able to help me better understand how this all works, or even just let me know if I'm going in the right direction.
Please read this http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html. Very good read.
Overall: gc_grace_seconds parameter is the minimal time that tombstones will be kept on disk after data has been deleted. We need to make sure that all the replicas received the delete and have all tombstones stored to avoid having zombie data issues. By default its 10 days.
tombstone_compaction_interval: As part of this JIRA (https://issues.apache.org/jira/browse/CASSANDRA-4781), this property got introduced.
When the compaction ratio was high enough to trigger a single-SSTable compaction, but that tombstones were not evicted due to overlapping SSTables.
I am not sure about your current datamodel but here are my suggestions.
Probably you have to change your DM. Please read https://academy.datastax.com/resources/getting-started-time-series-data-modeling and Time series modelling( with start & end date) in cassandra
Change write pattern.
Change read pattern. Try to read only active data. (As per your current DM, when you are reading it, its going through tombstone cells in-order to reach active cells)
Try to use TimeWindowCompactionStrategy and tune it as per your workload. (http://thelastpickle.com/blog/2017/01/10/twcs-part2.html)
If you are use TTL while inserting (like with INSERT or UPDATE stmnt), see if you can change it to the Table level.
If you are using STCS and want to change compaction sub-properties, probably you could change
unchecked_tombstone_compaction=true and min_threshold=3 (little bit aggressive)

Delete-Upsert-Read Access Pattern in Cassandra

I use Cassandra to store trading information. Based on the queries available, I design my CF as below:
CREATE trades (trading_book text,
trading_date timestamp,
OTHER TRADING INFO ...,
PRIMARY KEY (trading_book, trading_date));
I want to delete all the data on a given date in the following way:
collect all the trading books (which are stored somewhere else);
evenly distribute all the trading books in 20 threads;
in each thread, loop through the books, and
DELETE FROM trades WHERE trading_book='A_BOOK' AND
trading_date='2015-01-01'
There are about 1 million trades and the deletion takes 2 min to complete. Then insert the trading data on 2015-01-01 again (about 1 million trades) immediate after the deletion done.
When the insertion done and I re-read the data, I got the error even with query timeout set to 600 seconds:
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} info={'received_responses': None, 'required_responses': None, 'consistency': 'Not Set'}
It looks like some data inconsistency in the CF now, i.e. the coordinator could identify the partition, but there is no data on the partition?
Is there anything wrong with my access pattern? How to solve this problem?
Any hints will be highly appreciated! Thank you.
You are creating tombstones for every column on that date (by doing the deletes), then writing new records over the top. So now each read must first read the original column, then the tombstone, then the new record. If you do a trace you will see that tombstone reads are killing you. This kind of pattern is problematic with Cassandra, so you should try to find a different (immutable) way to do this. An alternative could be to simply overwrite the data, in which case there are no tombstones to reconcile. But you'll still have to deal with two versions.
In addition to rs_atl's response (which hits the nail on the had with tombstones) here's a bit of info for you to understand / address the problem:
What are tombstones anyway?
Because sstables are immutable, rather than deleting records in Cassandra, we insert a new cell that essentially holds a null value. That's a tombstone. Tombstones become available for deletion or garbage collection after gc_grace seconds (configurable by table).
Tombstones and repairs:
The reason we wait is to ensure that c* has time to propagate a tombstone to all replicas. If a tombstone does not get replicated to all replicas (in some edge cases with low CL writes and flopping nodes for example) and then gets removed / gc'ed, the original data that was deleted will come back to life. This is why we run repairs at least every GC_Grace, ensuring tombstone consistency and preventing zombie data.
How many tombstones am I hitting?
If you turn on tracing in cqlsh tracing on or turn on probabilistic tracing in the yaml or via nodetool you'll be able to see how many tombstones you are hitting for a particular request. As this number gets bigger, your read performance will decrease until you see the timeouts you mentioned.
nodetool cfstats also gives you more macro details (average tombstones per slice) of how many tombstones are in your table.
the sstablemetadata utility shows you the total # of tombstones in a table.
What can I do to get rid of tombstones?
1) If you're deleting everything in the table, truncate table is a way of deleting data for free in c* since you can expire entire sstables.
2) Tombstones are removed by compaction. You can more aggressively delete tombstones by decreasing gc_grace_seconds and/or increasing the tombstone ratio for compaction, but make sure you're running your repairs or you may see zombie data.

Resources