Delete-Upsert-Read Access Pattern in Cassandra

I use Cassandra to store trading information. Based on the queries we need to run, I designed my CF as below:
CREATE TABLE trades (
    trading_book text,
    trading_date timestamp,
    OTHER TRADING INFO ...,
    PRIMARY KEY (trading_book, trading_date));
I want to delete all the data on a given date in the following way:
collect all the trading books (which are stored somewhere else);
evenly distribute all the trading books across 20 threads;
in each thread, loop through its books and run
DELETE FROM trades WHERE trading_book='A_BOOK' AND
trading_date='2015-01-01'
There are about 1 million trades and the deletion takes about 2 minutes to complete. Then I insert the trading data for 2015-01-01 again (about 1 million trades) immediately after the deletion is done.
When the insertion is done and I re-read the data, I get the following error, even with the query timeout set to 600 seconds:
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} info={'received_responses': None, 'required_responses': None, 'consistency': 'Not Set'}
It looks like there is some data inconsistency in the CF now, i.e. the coordinator can identify the partition, but there is no data in the partition?
Is there anything wrong with my access pattern? How to solve this problem?
Any hints will be highly appreciated! Thank you.

You are creating tombstones for every column on that date (by doing the deletes), then writing new records over the top. So now each read must first read the original column, then the tombstone, then the new record. If you do a trace you will see that tombstone reads are killing you. This kind of pattern is problematic with Cassandra, so you should try to find a different (immutable) way to do this. An alternative could be to simply overwrite the data, in which case there are no tombstones to reconcile. But you'll still have to deal with two versions.
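For illustration, a minimal sketch of the overwrite-only approach, assuming the trades table from the question plus a hypothetical trade_payload column standing in for "OTHER TRADING INFO": because writes in Cassandra are upserts, re-inserting a row with the same primary key replaces the old values without writing a tombstone.
INSERT INTO trades (trading_book, trading_date, trade_payload)
VALUES ('A_BOOK', '2015-01-01', 'refreshed trade data');  -- trade_payload is a placeholder column
Rows from the previous load that are not re-inserted stay visible, which is the "two versions" caveat above.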

In addition to rs_atl's response (which hits the nail on the head with tombstones), here's a bit of info to help you understand and address the problem:
What are tombstones anyway?
Because SSTables are immutable, rather than deleting records in Cassandra we insert a new cell that essentially holds a null value. That's a tombstone. Tombstones become eligible for deletion or garbage collection after gc_grace_seconds (configurable per table).
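For reference, gc_grace_seconds is a per-table option; a hedged example (the keyspace name and value are placeholders, and lowering it is only safe if repairs complete more often than the new value):
ALTER TABLE my_ks.trades WITH gc_grace_seconds = 3600;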
Tombstones and repairs:
The reason we wait is to ensure that C* has time to propagate a tombstone to all replicas. If a tombstone does not get replicated to all replicas (in some edge cases with low-CL writes and flapping nodes, for example) and then gets removed/GC'ed, the original data that was deleted will come back to life. This is why we run repairs at least once every gc_grace_seconds, ensuring tombstone consistency and preventing zombie data.
How many tombstones am I hitting?
If you turn on tracing in cqlsh (TRACING ON), or enable probabilistic tracing in cassandra.yaml or via nodetool, you'll be able to see how many tombstones you are hitting for a particular request. As this number gets bigger, your read performance will decrease until you see the timeouts you mentioned.
nodetool cfstats also gives you more macro details (average tombstones per slice) of how many tombstones are in your table.
the sstablemetadata utility shows you tombstone information (an estimate of droppable tombstones) per SSTable.
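Roughly, the commands look like this (keyspace/table names and the SSTable path are placeholders, and cfstats is called tablestats on newer versions); the first runs inside cqlsh, the other two in a shell on a node:
TRACING ON;
nodetool cfstats my_ks.trades
sstablemetadata /path/to/my_ks/trades/sstable-Data.db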
What can I do to get rid of tombstones?
1) If you're deleting everything in the table, TRUNCATE is a way of deleting data for free in C*, since it lets you drop entire SSTables.
2) Tombstones are removed by compaction. You can delete tombstones more aggressively by decreasing gc_grace_seconds and/or tuning the tombstone compaction subproperties (e.g. lowering tombstone_threshold), but make sure you're running your repairs or you may see zombie data.
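In CQL, the two options look roughly like this (keyspace/table names and the values are placeholders, and the compaction class should stay whatever the table already uses; more aggressive settings are only safe if repairs keep completing within gc_grace_seconds):
TRUNCATE my_ks.trades;
ALTER TABLE my_ks.trades
    WITH gc_grace_seconds = 3600
    AND compaction = {'class': 'SizeTieredCompactionStrategy',
                      'tombstone_threshold': '0.05',
                      'unchecked_tombstone_compaction': 'true'};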

Related

Cassandra tombstones with TTL

I have worked with Cassandra (DSE) for quite some time and am trying to understand something that isn't quite clear. We're running DSE 5.1.9 for this illustration. It's a single-node cluster (if you have a multi-node cluster, ensure RF = node count to make things easier).
It's a very simple example:
Create the following simple table:
CREATE TABLE mytable (
    status text,
    process_on_date_time int,
    PRIMARY KEY (status, process_on_date_time)
) WITH CLUSTERING ORDER BY (process_on_date_time ASC)
AND gc_grace_seconds = 60;
I have a piece of code that inserts 5k records at a time, up to 200k total records, with a TTL of 300 seconds. The status is ALWAYS "pending" and process_on_date_time is a counter that increments by 1, starting at 1 (all unique records, 1 through 200k basically).
I run the code and then, once it completes, I flush the memtable to disk. There's only a single SSTable created. After this, no compaction, no repair, and nothing else runs that would create or change the SSTable configuration.
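For context, the inserts described above would look roughly like the following (a sketch, not the author's actual code; the loop over the counter values is implied), followed by a nodetool flush from a shell on the node to write the memtable out as the single SSTable:
INSERT INTO mytable (status, process_on_date_time) VALUES ('pending', 1) USING TTL 300;
-- repeated for process_on_date_time = 2, 3, ..., 200000, in batches of 5k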
After the SSTable dump, I go into cqlsh, turn on tracing, set consistency to LOCAL_ONE, and turn paging off. I then run this repetitively:
SELECT * from mytable where status = 'pending' and process_on_date_time <= 300000;
What is interesting is that I see things like this (cutting out some text for readability):
Run X) Read 31433 live rows and 85384 tombstone cells (31k rows returned to my screen)
Run X+1) Read 0 live rows and 76376 tombstone cells (0 rows returned to my screen - all rows expired at this point)
Run X+2) Read 0 live rows and 60429 tombstone cells
Run X+3) Read 0 live rows and 55894 tombstone cells
...
Run X+X) Read 0 live rows and 0 tombstone cells
What is going on? The SSTable isn't changing (obviously, as it's immutable), and nothing else is inserted, flushed, etc. Why is the tombstone count decreasing until it reaches 0? What causes this behavior?
I would expect to see on every run: 100k tombstones read and the query aborting, as all TTLs have expired in the single SSTable.
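For reference, the cqlsh session described above is roughly the following (with the SELECT run repeatedly):
TRACING ON;
CONSISTENCY LOCAL_ONE;
PAGING OFF;
SELECT * FROM mytable WHERE status = 'pending' AND process_on_date_time <= 300000;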
For anyone else who may be curious about this, I opened a ticket with DataStax, and here is what they mentioned:
After the tombstones pass the gc_grace_seconds they will be ignored in result sets because they are filtered out after they have past that point. So you are correct in the assumption that the only way for the tombstone warning to post would be for the data to be past their ttl but still within gc_grace.
And since they are ignored/filtered out they wont have any harmful effect on the system since like you said they are skipped.
So what this means is that if TTLs expire but the cells are still within gc_grace_seconds, they will be counted as tombstones when queried. If the TTLs expire AND gc_grace_seconds has also passed, they will NOT be counted as tombstones (they are skipped). The system still has to "weed" through the expired TTL records, but other than the processing time, they are not "harmful" to the query. I found this very interesting, as I don't see it documented anywhere.
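To make the window concrete, here is a sketched timeline for the table above (TTL 300, gc_grace_seconds 60) for a single cell written at time t = 0:
t < 300: live, returned by queries
300 <= t < 360: expired but within gc_grace_seconds, so counted as a tombstone cell in traces and warnings
t >= 360: past TTL + gc_grace_seconds, filtered out and no longer counted; physically purged at the next compaction
Since the 200k cells were written over a span of time, each successive run finds fewer cells still inside that 60-second window, which is presumably why the tombstone count in the traces decays to 0.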
Thought others may be interested in this information and could add to it if their experiences differ.

Are TTL tombstones in Cassandra using LCS created in the same level as the TTLed data?

I'm using LCS and a relatively large TTL of 2 years for all inserted rows, and I'm concerned about the moment at which C* will drop the corresponding tombstones (neither explicit deletes nor updates are being performed).
From Missing Manual for Leveled Compaction Strategy, Tombstone Compactions in Cassandra and Deletes Without Tombstones or TTLs I understand that
All levels except L0 contain non-overlapping SSTables, but a partition key may be present in one SSTable in each level (i.e. distributed across all levels).
For a compaction to be able to drop a tombstone, it must be sure that it is compacting all SSTables that contain the data, to prevent zombie data (this is done by checking bloom filters). It also considers gc_grace_seconds.
So, for my particular use case (2-year TTL and a write-heavy load), I can conclude that TTLed data will sit in the highest levels, so I'm wondering when those SSTables with TTLed data will be compacted with the SSTables that contain the corresponding tombstones.
The main question is: where are the tombstones (from TTLs) created? Are they created at Level 0, so that it will take a long time until they end up in the highest levels (and hence disk space will take a long time to be freed)?
In a comment on About deletes and tombstones, Alain says that:
Yet using TTLs helps, it reduces the chances of having data being fragmented between SSTables that will not be compacted together any time soon. Using any compaction strategy, if the delete comes relatively late in the row history, as it use to happen, the 'upsert'/'insert' of the tombstone will go to a new SSTable. It might take time for this tombstone to get to the right compaction "bucket" (with the rest of the row) and for Cassandra to be able to finally free space.
My understanding is that with TTLs the tombstone is created in-place, thus it is often and for many reasons easier and safer to get rid of a TTL than of a delete.
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
I'm not sure what it means with "in-place" since SSTables are immutable.
(I also have some doubts about what it says of using default_time_to_live that I've asked in How default_time_to_live would delete rows without tombstones in Cassandra?).
My guess is that it refers to tombstones being created in the same level (but in different SSTables) as the TTLed data, during a compaction triggered by one of the following reasons:
"Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread" The Missing Manual for Leveled Compaction Strategy
"If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions" The Missing Manual for Leveled Compaction Strategy
"When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable." CASSANDRA-7019
Since tombstones are created during compaction, I think it may be using SSTable metadata to estimate droppable tombstones.
So, compactions (2) and (3) should be creating/dropping tombstones in the highest levels, hence using LCS with a large TTL should not be an issue per se.
By creating/dropping I mean that the same kinds of compactions will be creating tombstones for expired data and/or dropping tombstones if the gc grace period has already passed.
A link to source code that clarifies this situation will be great, thanks.
Alain Rodriguez's answer from the mailing list:
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
As explained on a parallel thread, this is wrong, mea culpa. I believe the rest of my comment still stands (hopefully :)).
I'm not sure what it means with "in-place" since SSTables are immutable.
My guess is that is referring to tombstones being created in the same
Yes, I believe that during the next compaction following the expiration date, the entry is 'transformed' into a tombstone and lives in the SSTable that is the result of the compaction, on the level/bucket this SSTable is put into. That's why I said 'in-place', which is indeed a bit weird for immutable data.
As a side idea for your problem: on 'modern' versions of Cassandra (I don't remember the version, that's what 'modern' means ;-)), you can run 'nodetool garbagecollect' regularly (not necessarily frequently) during the off-peak period. That might use the cluster resources when you don't need them to reclaim some disk space. Also, making sure that a 2-year-old record is not being updated regularly by design would definitely help. In the extreme case of writing data once (never updated) and with a TTL, for example, I see no reason for 2-year-old data not to be evicted correctly. As long as the disk can grow, it should be fine.
I would not be too scared about it, as there is 'always' a way to remove tombstones. Yet it's good to think about the design beforehand; generally, it's good if you can rotate the partitions over time and not reuse old partitions, for example.
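As a rough sketch of that side idea (the keyspace and table names are placeholders; nodetool garbagecollect is available on recent versions, roughly 3.10+, and works per node, so schedule it off-peak on each node):
nodetool garbagecollect my_ks my_table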

Tombstone in Cassandra

I have a Cassandra table with a TTL of 60 seconds, and I have a few questions about it:
1) I am getting the following warning:
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM xx.yy WHERE token(y) >= token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see tombstone_warn_threshold)
What does this mean?
2) As per my study, a tombstone is a flag in the case of TTL (it will be deleted after gc_grace_seconds):
i) So, until 10 days pass, does that mean it won't be deleted?
ii) What are the consequences of it waiting for 10 days?
iii) Why is it such a long time, 10 days?
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
gc_grace_seconds 864000 [10 days] The number of seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage-collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_period. The default value allows a great deal of time for Cassandra to maximize consistency prior to deletion. For details about decreasing this value, see garbage collection below.
3) I read that performing compaction and repair using nodetool will delete the tombstones. How frequently do we need to run these in the background, and what are the consequences?
This means that your query returned 76 "live" or non-deleted/non-obsoleted rows of data, and that it had to sift through 1324 tombstones (deletion markers) to accomplish that.
In the world of distributed databases, deletes are hard. After all, if you delete a piece of data from one node, and you expect that deletion to happen on all of your nodes, how would you know if it worked? Quite literally, how do you replicate nothing? Tombstones (delete markers) are the answer to that question.
i. The data is gone (obsoleted, rather). The tombstone(s) will remain for gc_grace_seconds.
ii. The "consequence" is that you'll have to put up with those tombstone warning messages for that duration, or find a way to run your query without having to scan over the tombstones.
iii. The idea behind the 10 days is that if the tombstones are collected too early, your deleted data will "ghost" its way back onto some nodes. 10 days gives you enough time to run a weekly repair, which ensures your tombstones are properly replicated before removal.
Compaction removes tombstones. Repair replicates them. You should run repair once per week. While you can run compaction on-demand, don't. Cassandra has its own thresholds (based on number and size of SSTable files) to figure out when to run compaction, and it's best not to get in its way. If you do, you'll be manually running compaction from there on out, as you'll probably never reach the compaction conditions organically.
The consequences are that both repair and compaction take compute resources and can reduce a node's ability to serve requests. But they need to happen; you want them to happen. If compaction doesn't run, your SSTable files will grow in number and size, eventually causing rows to exist across multiple files, and queries for them will get slow. If repair doesn't run, your data is at risk of not being in sync.
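As a rough operational sketch of the weekly repair (the keyspace name is a placeholder; -pr restricts each node to its primary ranges, so run it on every node, staggered across the week):
nodetool repair -pr my_ks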

Impact of not running Cassandra repair on time series data

Data we store in Cassandra is pure time series with no manual deletes. Data gets deleted only by TTL.
For such use cases, is repair really needed? What is the impact of not running repair?
Tombstoned data is really deleted only after gc_grace_seconds plus a compaction. If a table with tombstoned data is not compacted, you will be stuck with this data, and it will cause performance degradation.
If you don't run repair within the gc_grace period, dead data can come back to life. Here's the DataStax article on this (and on why you need to run repairs regularly):
https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
EDIT:
TTLed data isn't tombstoned at the time it expires, but only when there's a compaction process (at least in 3.9). You will not see expired data, even when there are no tombstones yet.
So, if there is a problem with a node and TTLed data doesn't get its tombstone on one compaction, it will get one on the next compaction, or will simply be deleted. Given this, and the fact that the data is NEVER deleted and only expires, and that you don't have any overwrites to the same key, you don't have to run repairs for data consistency.
That said, regarding all of the above, I would still recommend running repairs once in a while (with a much longer interval between them), in case something was accidentally written outside your usual write path.
If you set a TTL, Cassandra will mark the data with a tombstone after the time is exceeded. If you don't run repair regularly, a huge number of tombstones will be generated and it will affect Cassandra's performance.
After the number of seconds since the column's creation exceeds the TTL value, TTL data is considered expired and is no longer included in results. Expired data is marked with a tombstone on the next read on the read path, but it remains for a maximum of gc_grace_seconds. After this amount of time, the tombstoned data is automatically removed during the normal compaction and repair processes.
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html

Retrieving "tombstoned" records in Cassandra

My question is very simple: is it in any way possible to retrieve columns that have been marked with a tombstone before the GCGraceSeconds period expires (default 10 days)? If yes, what would be the exact CQL query for that?
If I understand the deletion process correctly, the tombstones are written to the memtables, and the SSTables, being immutable, still hold the deleted data while waiting for compaction. So before compaction occurs, is there any way to read the tombstoned data from either the memtable or the SSTables?
Using CQL 3.0 on the cqlsh command prompt and Cassandra 2.0.
You are right: when a tombstone is inserted, it usually doesn't immediately delete the underlying data (unless all your data is in a memtable). However, you can't control when it does. If you don't have much data and compaction happens quickly, the underlying data may be deleted very quickly, much sooner than 10 days.
There is no query to read deleted data, but you can inspect all your SSTables with sstable2json to see if they contain the deleted data.
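A hedged example of that inspection (Cassandra 2.0-era tooling, matching the question; the data file path is a placeholder). It dumps the SSTable contents, including deletion markers, as JSON:
sstable2json /var/lib/cassandra/data/my_ks/mytable/my_ks-mytable-jb-1-Data.db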
Just to add to the previous comment: use a low value of gc_grace_seconds for the column families that have frequent deletes. It will take some time for GC, but the tombstones are expected to get cleared.
