Which one is better to use in Cassandra: TTL or DELETE? - cassandra

I want to remove records from a Cassandra cluster after a particular time.
So what should I use: TTL or a manual DELETE?

The answer is "it depends". Deleting data in Cassandra is never free.
If you use DELETE, you always have to issue those queries; with TTL, the expiry is set the moment you write the data. On the other hand, DELETE gives you more control over when data is removed.
On the operations side, you should try to get your tombstones into the same sstable so that, once gc_grace_seconds has expired, the full sstable can be dropped. Data is only actually deleted when sstables are compacted: even if gc_grace_seconds has passed, if no compaction has touched the sstable holding the tombstone, the tombstone will not be removed from disk. This also makes the choice of compaction strategy for your table relevant.
If you generate a lot of tombstones, you should also enable unchecked_tombstone_compaction at the table level. You can read more about that here: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html
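For example, a minimal sketch of enabling that subproperty (the keyspace/table name is a placeholder, and the compaction class shown is just an example; keep whatever strategy your table already uses):
ALTER TABLE my_keyspace.my_table
WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'unchecked_tombstone_compaction': 'true'
};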

It depends on your data model. Fortunately, because TTLs are predictable by nature, you can build your data model to accommodate them.
Let's say I build the following table to track user requests to a REST service, for example. Suppose that I really only care about the last week's worth of data, so I'll set a TTL of 604800 seconds (7 days). The query I need to support is basically this (querying transactions for user 'Bob' over the prior 7 days):
SELECT * FROM rest_transactions_by_user
WHERE username='Bob' AND transaction_time > '2018-05-28 13:41';
To support that query, I'll build this table:
CREATE TABLE rest_transactions_by_user (
    username TEXT,
    transaction_time TIMESTAMP,
    service_name TEXT,
    HTTP_result BIGINT,
    PRIMARY KEY (username, transaction_time))
WITH CLUSTERING ORDER BY (transaction_time DESC)
AND gc_grace_seconds = 864000
AND default_time_to_live = 604800;
A few things to note:
I am leaving gc_grace_seconds at the default of 864000 (ten days). This ensures the TTL tombstones have adequate time to propagate throughout the cluster.
Rows will TTL at 7 days (as mentioned above). After that, they persist as tombstones for an additional 10 days.
I am clustering by transaction_time in DESCending order. This puts the rows I care about (the ones that haven't TTL'd yet) at the "top" of my partition, sequentially.
By querying for a transaction_time within the prior 7 days, I ignore anything older than that. As the TTL tombstones exist for 10 days afterward, they sit at the "bottom" of my partition.
In this way, limiting my query to the last 7 days ensures that Cassandra never has to deal with the tombstones, because the query never reads them. So in this case, I have built a data model where a TTL is "better" than a random delete.

Letting records expire via TTL is better. With TTL-based expiry, you can set gc_grace_seconds to a much lower value (a day or two) and you do not have to worry about tombstones lingering for a long time.
With manual deletes, you need to make sure the tombstone count does not grow beyond the warning and error thresholds, as it impacts queries.
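As a rough sketch, that tuning might look like this (the table name and values are illustrative only):
ALTER TABLE my_keyspace.events
WITH default_time_to_live = 604800  -- rows expire after 7 days
AND gc_grace_seconds = 172800;      -- tombstones kept for ~2 days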

Related

Cassandra tombstones with TTL

I have worked with Cassandra (DSE) for quite some time and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single-node cluster (if you have a multi-node cluster, ensure RF = node count to make things easier).
It's a very simple example:
Create the following simple table:
CREATE TABLE mytable (
    status text,
    process_on_date_time int,
    PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60;
I have a piece of code that inserts 5k records at a time, up to 200k total records, with a TTL of 300 seconds. The status is ALWAYS "pending" and process_on_date_time is a counter that increments by 1, starting at 1 (all unique records, 1 through 200k basically).
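For reference, a minimal sketch of what each insert might have looked like (the values are illustrative):
INSERT INTO mytable (status, process_on_date_time)
VALUES ('pending', 1)
USING TTL 300;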
I run the code and, once it completes, I flush the memtable to disk. Only a single sstable is created. After this, no compaction, no repair, nothing else runs that would create or change the sstables.
After the sstable is written, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE, and turn paging off. I then run this repeatedly:
SELECT * from mytable where status = 'pending' and process_on_date_time <= 300000;
What is interesting is that I see things like this (cutting out some text for readability):
Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen)
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point)
Run X+2) Read 0 live rows and 60429 tombstone cells
Run X+3) Read 0 live rows and 55894 tombstone cells
...
Run X+X) Read 0 live rows and 0 tombstone cells
What is going on? The sstable isn't changing (obviously, as it's immutable), and nothing else is inserted or flushed. Why does the tombstone count decrease until it reaches 0? What causes this behavior?
I would expect every run to read ~100k tombstones and the query to abort, since all the TTLs have expired in the single sstable.
For anyone else who may be curious about this, I opened a ticket with DataStax, and here is what they said:
After the tombstones pass gc_grace_seconds they will be ignored in
result sets because they are filtered out after they have passed that
point. So you are correct in the assumption that the only way for the
tombstone warning to post would be for the data to be past its TTL
but still within gc_grace.
And since they are ignored/filtered out they won't have any harmful
effect on the system since, like you said, they are skipped.
So what this means is: if TTLs have expired but are still within gc_grace_seconds, the cells are counted as tombstones when queried. If the TTLs have expired AND gc_grace_seconds has also passed, they are NOT counted as tombstones (they are skipped). The system still has to "weed" through the expired TTL records, but other than processing time they are not "harmful" to the query. I found this very interesting, as I don't see it documented anywhere.
Thought others may be interested in this information and could add to it if their experiences differ.

TTL in Cassandra creating tombstones

I am only doing inserts into Cassandra. While inserting, only non-null values are inserted, to avoid tombstones. A few records are inserted with a TTL. But then doing SELECT count(*) FROM the table gives the following error:
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM
xx.yy WHERE token(y) >=
token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see
tombstone_warn_threshold)
Do TTL inserts lead to tombstones in Cassandra 3.7? How can the warning be mitigated?
There are no updates done, only inserts; some records without TTL, others with TTL.
From the DataStax documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is no longer included in results. Expired data is marked with a tombstone on the next read on the read path, but it remains for a maximum of gc_grace_seconds. After this amount of time, the tombstoned data is automatically removed during the normal compaction and repair processes.
So these entries will be treated as tombstones until compaction or repair.
One more point on TTL and compaction: even after gc_grace_seconds has passed, the default compaction settings only remove tombstones depending on tombstone_compaction_interval and tombstone_threshold.
Previously, we were having read-timeout issues due to a high number of tombstones on tables with a high number of records. Eventually we had to reduce tombstone_threshold as well as enable unchecked_tombstone_compaction to make tombstone compactions trigger more frequently.
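A rough sketch of that tuning (the table name and values are illustrative; keep the compaction class your table already uses):
ALTER TABLE my_keyspace.my_table
WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.1',
    'tombstone_compaction_interval': '3600',
    'unchecked_tombstone_compaction': 'true'
};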
You can refer to the docs below for more details:
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateTable.html?hl=unchecked_tombstone_compaction#tabProp__cqlTableGc_grace_seconds

How do we track the impact expired entries have on a time series table?

We process the oldest data as it arrives into the time-series table, and I am taking care to make sure the oldest entries expire as soon as they are processed. The expectation is to have all the deletes at the bottom of the partition, in clustering order of the TimeUUID column, so a query always reads a time slice without any deleted entries.
Will this scheme work? Are there any impacts of the expired columns that I should be aware of?
Keeping the timeuuid as part of the clustering key guarantees the sort order, so the most recent data comes first.
If you are on Cassandra 3.1 (DSE 5.x) or above:
Regarding the deletes: avoid manual deletes and use TWCS. Here is how.
Let's say your job processes the data every X minutes, say X = 5 min (hopefully less than 24 hours). Set the compaction strategy to TWCS (Time Window Compaction Strategy), and let's assume a TTL of 24 hours:
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': '1'
};
Now there are 24 buckets created in a day, each holding one hour of data. These 24 buckets simply correspond to 24 sstables (after compaction) in your Cassandra data directory. During the 25th hour, the entire first bucket/sstable is automatically dropped by TTL. So instead of coding deletes, let Cassandra take care of the cleanup. The beauty of TWCS is that the entire sstable TTLs as a unit.
The READs from your application always go to the most recent bucket (the 24th sstable in this case), so reads never have to scan through the tombstones caused by TTL.
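Putting it together, a minimal sketch of such a table (the names and TTL value are illustrative):
CREATE TABLE sensor_readings (
    sensor_id text,
    reading_time timeuuid,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
AND default_time_to_live = 86400    -- 24-hour TTL, matching the window count
AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': '1'
};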
If you are on Cassandra 2.x or DSE 4.x, where TWCS isn't available yet:
A way out, until you upgrade to Cassandra 3.1 or above, is to use artificial buckets. Introduce a time-bucket variable as part of the partition key, with the bucket value being the date and hour. This way each partition is different, and you can adjust the bucket size to match the job's processing interval.
When you delete, only the processed partition is deleted and does not get in the way of reading the unprocessed ones, so scanning tombstones can be avoided.
It's an additional effort on the application side to write to the correct partition based on the current date/time bucket, but it's worth it in a production scenario to avoid tombstone scans; a sketch of such a table follows below.
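For illustration, a hypothetical bucketed table might look like this (all names here are made up):
CREATE TABLE events_by_bucket (
    time_bucket text,          -- e.g. '2018-05-28-13' (date + hour)
    event_time timeuuid,
    payload text,
    PRIMARY KEY (time_bucket, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);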
You can use TWCS to easily manage expired data, and filter by some timestamp column at query time to ensure your query always gets the latest results.
How are you "taking care" of the oldest entries' expiry? Cassandra will not show records whose TTL has expired, but they persist in the sstables until the next compaction of those sstables. If you are deleting the rows yourself, you can't be sure your query will always read the latest records, since Cassandra is eventually consistent; theoretically there can be moments when you read stale data (or many such moments, depending on your consistency settings).

gc_grace_seconds to remove tombstone rows in Cassandra

I am using the awesome Cassandra DB (3.7.0) and I have a question about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and gets new data every second. Another process then reads a row and removes it.
The raw_data table seems to become slow at reading and writing after several days of running.
Is this because the deleted rows are staying around as tombstones? The table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than the 10-day default to remove tombstones more quickly? (By the way, it's a single-node DB.)
Thank you in advance.
Deleting your data is one way to get tombstone problems; TTL is the other.
It is pretty normal for a Cassandra cluster to become slower and slower after each delete, and your cluster will eventually refuse to read data from this table once the tombstone thresholds are exceeded.
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use: a compaction is needed in order for tombstones to actually be removed.
I'd change my mind about the single-node cluster and go with the standard minimum of 3 nodes and RF=3. Then I'd design my project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure to have plenty of IOPS, because compaction is very IO-intensive.
In short, tombstones are used by Cassandra to mark data as deleted and to replicate the deletion to other nodes so that the deleted data doesn't reappear. These tombstones are stored until gc_grace_seconds has passed, and creating more tombstones may slow down your table. As you are using a single-node Cassandra, you don't have to replicate anything to other nodes, so you can lower gc_grace_seconds to 1 day without ill effect. If in the future you plan to add new nodes or data centers, change gc_grace_seconds back.
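A sketch of that change for the raw_data table discussed above:
ALTER TABLE raw_data WITH gc_grace_seconds = 86400;  -- 1 day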

Cassandra SizeTieredCompaction tombstone

I've got an append-only Cassandra table where every entry has a TTL. Using SizeTieredCompactionStrategy, the table seems to grow unbounded. Is there a way to ensure that sstables are checked for tombstoned columns more often?
Tombstones are effectively deleted after the grace period expires (search for gc_grace_seconds) and a compaction occurs. Check the parameter value on your table (the default is 864000 seconds, i.e. 10 days) and change it to something suitable for your needs. Beware that lowering this value has its own drawbacks, e.g. rows "resurrecting" if a node is down for longer than the grace period.
Check the DataStax documentation about deletes in Cassandra.
If you have high loads of incoming data that you never modify, you may be better off with DateTieredCompactionStrategy, as it is more efficient at processing and removing tombstones.
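A sketch of switching an existing table's strategy (the table name is a placeholder; note that on newer versions, TimeWindowCompactionStrategy supersedes DTCS):
ALTER TABLE my_keyspace.append_only_events
WITH compaction = { 'class': 'DateTieredCompactionStrategy' };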
