What delays a tombstone purge when using LCS in Cassandra

In a C* 1.2.x cluster we have 7 keyspaces and each keyspace contains a column family that uses wide rows. The CF uses LCS. I am periodically doing deletes in the rows. Initially each row may contain at most 1 entry per day. Entries older than 3 months are deleted and at most 1 entry per week is kept. I have been running this for a few months but disk space isn't really being reclaimed, and I need to investigate why. To me it looks like the tombstones are not being purged.

Each keyspace has around 1300 sstable files (*-Data.db) and each file is around 130 MB in size (sstable_size_in_mb is 128). GC grace seconds is 864000 in each CF. tombstone_threshold is not specified, so it should default to 0.2. What should I look at to find out why disk space isn't reclaimed?

I've answered a similar question before on the cassandra mailing list here.
To elaborate a bit further, it's crucial that you understand the Leveled Compaction Strategy and leveldb in general (given normal write behavior).
To summarize the above:
The data store is organized as "levels". Each level is 10 times larger than the level under it. Files in level 0 have overlapping ranges. Files in higher levels do not have overlapping ranges within each level.
New writes are stored as new sstables entering level 0. Every once in a while all sstables in level0 are "compacted" upwards to level 1 sstables, and these are then compacted upwards to level 2 sstables etc..
Reads for a given key will perform ~N reads, N being the number of levels in your tree (which is a function of the total data set size). Level 0 sstables are all scanned (since there is no constraint that each has a non-overlapping range with its siblings). Level 1 and higher sstables however have non-overlapping ranges, so the DB knows which single sstable in level 1 covers the range of the key you're asking for, and the same for level 2 etc.
The layout of your LCS tree in cassandra is stored in a json file that you can easily check - you can find it in the same directory as the sstables for the keyspace+ColumnFamily. Here's an example of one of my nodes (coupled with the jq tool + awk to summarize):
$ cat users.json | jq ".generations[].members|length" | awk '{print "Level", NR-1, ":", $0, "sstables"}'
Level 0 : 1 sstables
Level 1 : 10 sstables
Level 2 : 109 sstables
Level 3 : 1065 sstables
Level 4 : 2717 sstables
Level 5 : 0 sstables
Level 6 : 0 sstables
Level 7 : 0 sstables
As you've noted, the sstables are usually of equal size, so you can see that each level is roughly 10x the size of the previous one. I would expect the node above to satisfy the majority of read operations in ~5 sstable reads. Once I add enough data for Level 4 to reach 10000 sstables and Level 5 starts getting populated, my read latency will increase slightly as each read will incur 1 more sstable read to satisfy. (On a tangent, cassandra provides bucketed histograms for you to check all these stats.)
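For reference, those per-read histograms can be pulled per column family with nodetool (a sketch only; the keyspace/column family names are placeholders and the output layout varies between versions):
$ nodetool cfhistograms my_keyspace users
# the "SSTables" column of the output shows how many sstables were touched per read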
With the above out of the way, let's walk through some operations:
We issue a write ["bob"]["age"] = 30. This will enter level0. Usually soon after, it'll be compacted to level1. It'll then slowly spend time in each level, but as more writes enter the system it'll migrate upwards to the highest level N.
We issue a read for ["bob"]["age"]. The DB can then check each level from lowest to highest - as soon as it finds the data it can return it. If it reaches the highest level and it hasn't found it, the data doesn't exist on this node. If at any level it finds a tombstone, it can return "not found" as the data has been deleted
We issue a delete ["bob"]["age"]. This will enter level0 as a normal write with a special value "column tombstone". Usually soon after, it'll be compacted to level1. It'll then slowly spend time in each level, but as more writes enter the system it'll migrate upwards to the highest level N. During each compaction, if the sstables being compacted together have a tombstone (such as in L1) and an actual value (such as "30" in an L2), the tombstone "swallows up" the value and effects the logical deletion at that level. The tombstone however cannot be discarded yet, and must persist until it has had the chance to compact against every level until the highest one is reached - this is the only way to ensure that if L2 has age=30, L3 has an older age=29, and L4 has an even older age=28, all of them will have the chance to be destroyed by the tombstone. Only when the tombstone reaches the highest level can it actually be discarded entirely.
We issue a read for ["bob"]["age"]. The DB can then check each level from lowest to highest - as soon as it finds the data it can return it. If it reaches the highest level and it hasn't found it, the data doesn't exist on this node. If at any level it finds a tombstone, it can return "not found" as the data has been deleted
We issue a delete ["bob"]. This will enter level0 as a normal write with a special value "row tombstone". It will follow the same logic as the above column-level tombstone, except that it discards any existing data of any column under row "bob" that it collides with.
We issue a read for ["bob"]["age"]. The DB can then check each level from lowest to highest - as soon as it finds the data it can return it. If it reaches the highest level and it hasn't found it, the data doesn't exist on this node. If at any level it finds a tombstone, it can return "not found" as the data has been deleted
I hope this answers your questions regarding why deletes in cassandra, especially with LCS, actually consume space instead of freeing up space (at least initially). The rows+columns the tombstones are attached to have a size themselves (which might actually be larger than the size of the value you're trying to delete if you have simple values).
The key point here is that they must traverse all the levels up to the highest level L before cassandra will actually discard them, and the primary driver of that bubbling up is the total write volume.
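Since you already mentioned gc_grace_seconds and tombstone_threshold in the question, those are the two knobs worth experimenting with first. A hedged sketch of changing them (keyspace/table names and values are illustrative, and this assumes the CF is manageable through cqlsh in your 1.2.x version):
$ cqlsh
cqlsh> ALTER TABLE my_keyspace.my_cf
   ... WITH gc_grace_seconds = 259200
   ... AND compaction = {'class': 'LeveledCompactionStrategy',
   ...                   'sstable_size_in_mb': 128,
   ...                   'tombstone_threshold': 0.05};
Lowering gc_grace_seconds shortens how long a tombstone must be retained at all (make sure repairs complete within that window), and a lower tombstone_threshold makes single-sstable tombstone compactions trigger earlier.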

I was hoping for magic sauce here.
We are going to do a JMX-triggered LCS -> STCS -> LCS switch in a rolling fashion through the cluster. Switching the compaction strategy forces the LCS-structured sstables to be restructured and the tombstones to be applied (in our version of cassandra we can't force an LCS compaction).
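For the curious, the per-node switch can be scripted against the column family's MBean with a JMX CLI such as jmxterm. This is a rough sketch only - the MBean path and the CompactionStrategyClass attribute name are from memory and should be verified against the MBeans your version actually exposes:
$ java -jar jmxterm-uber.jar -l localhost:7199 -n <<'EOF'
bean org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_keyspace,columnfamily=my_cf
set CompactionStrategyClass org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
EOF
# ...wait for the resulting STCS recompaction to finish, then run the same
# command again setting the attribute back to org.apache.cassandra.db.compaction.LeveledCompactionStrategy
Because the change is made over JMX it only affects that one node and does not touch the schema, which is what makes the rolling LCS -> STCS -> LCS dance possible.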
There are nodetool commands to force compactions on tables, but that might screw up LCS. There are also nodetool commands to reassign the level of sstables, but again, that might foobar LCS if you muck with its structure.
What really should probably happen is that row tombstones should be placed in a separate sstable type that can be independently processed against "data" sstables to get the purge to occur. The tombstone sstable <-> data sstable processing doesn't remove the tombstone sstable, just removes tombstones from the tombstone sstable that are no longer needed after the data sstable was processed/pared/pruned. Perhaps these can be classified as "PURGE" tombstones for large scale data removals as opposed to more ad-hoc "DELETE" tombstones that would be intermingled with data. But who knows when that would be added to Cassandra.

Thanks for the great explanation of LCS, @minaguib. I think the statement from Datastax is misleading, at least to me:
at most 10% of space will be wasted by obsolete rows.
It depends on how we define "obsolete rows". If "obsolete rows" means ALL the rows which are supposed to be compacted away, then in your example these "obsolete rows" will be age=30, age=29, age=28. We can end up wasting (N-1)/N of the space, as these "age" values can be in different levels.
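A quick back-of-the-envelope check of that worst case, using the three "age" versions from the example above (just arithmetic):
$ awk 'BEGIN { n=3; printf "obsolete copies: %d of %d versions = %.0f%% wasted\n", n-1, n, 100*(n-1)/n }'
obsolete copies: 2 of 3 versions = 67% wasted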

Related

What is the effect of number of levels in levelled compaction?

I know how levelled compaction works in DBs like Cassandra, RocksDB etc. Some have a max number of levels of 4 and some have 7. How does this number affect the compaction process? Why can't I have just 2 levels, the 1st one containing flushed memtable data (overlap possible between files) and the 2nd one containing non-overlapping SSTables?
If there is any doc or duplicate question, please redirect.
Edit-1: Duplicate data increases when the number of levels goes up.
LCS comes to solve STCS's space-amplification problem. It also reduces read amplification (the average number of disk reads needed per read request).
Leveled compaction divides the small sstables (“fragments”) into levels:
Level 0 (L0) is the new sstables, recently flushed from memtables. As their number grows (and reads slow down), our goal is to move sstables out of this level to the next levels.
Each of the other levels, L1, L2, L3, etc., is a single run of an exponentially increasing size: L1 is a run of 10 sstables, L2 is a run of 100 sstables, L3 is a run of 1000 sstables, and so on. (Factor 10 is the default setting in both Scylla and Apache Cassandra).
While solving, or at least significantly improving, the space amplification problem, LCS makes another problem, write amplification, worse.
"Write amplification” is the amount of bytes we had to write to the disk for each one byte of newly flushed sstable data. Write amplification is always higher than 1.0 because we write each piece of data to the commit-log, and then write it again to an sstable, and then each time compaction involves this piece of data and copies it to a new sstable, that’s another write.
Read more about it here:
https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/
https://docs.scylladb.com/kb/compaction/
https://docs.scylladb.com/architecture/compaction/compaction-strategies/
Leveled compaction works in Scylla very similarly to how it works in Cassandra and RocksDB (with some small differences). If you want a short overview of how leveled compaction works in Scylla and why, I suggest that you read my blog post https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/.
Your specific question on why two levels (L0 of recently flushed sstables, Ln of disjoint-range sstables) are not enough - is a very good question:
The main problem is that a single flushed memtable (sstable in L0), containing a random collection of writes, will often intersect all of the sstables in Ln. This means rewriting the entire database every time there's a new memtable flushed, and the result is a super-huge amount of write amplification, which is completely unacceptable.
One way to reduce this write amplification significantly (but perhaps not enough) is to introduce a cascade of intermediate levels, L0, L1, ..., Ln. The end result is that we have L(n-1) which is 1/10th (say) the size of Ln, and we merge L(n-1) - not a single sstable - into Ln. This is the approach that leveled compaction strategy (LCS) uses in all systems you mentioned.
A completely different approach could be not to merge a single sstable into Ln, but rather to try to collect a large amount of data first, and only then merge it into Ln. We can't just collect 1,000 sstables in L0 because this would make reads very slow. Rather, to collect this large amount of data, one could use size-tiered compaction (STCS) inside L0. In other words, this approach is a "mix" of STCS and LCS with two "levels": L0 uses STCS on new sstables, Ln contains a run of sstables (sstables with disjoint ranges). When L0 reaches 1/10th (say) the size of Ln, L0 is compacted into Ln. Such a mixed approach could have lower write amplification than LCS, but because most of the data is in a run in Ln, it would have the same low space and read amplification as LCS. None of the mentioned databases (Scylla, Cassandra, or RocksDB) has such a "mixed" compaction supported, as far as I know.

Cassandra compacting wide rows / large partitions

I have been searching some docs online to get good understanding of how to tackle large partitions in cassandra.
I followed a document on the below link:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", below is metioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The Definition in the cassandra.yaml goes as below:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have been searching a lot for what exactly the setting in_memory_compaction_limit_in_mb does.
It mentions some compaction is done in memory and some compaction is done on disk.
As per my understanding, when the compaction process runs:
SSTables are read from disk ----> (compared, tombstones removed, stale data removed; all happens in memory) ---> new sstable written to disk --> old sstables removed
These operations account for high disk space requirements and disk I/O (bandwidth).
Do correct me if my understanding of compaction is wrong. Is there anything in compaction that happens in memory?
In my environment the in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a 2-pass compaction, so it can be ignored. You don't have to do the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition, which occurs in memory. You can increase column_index_size_in_kb to reduce the size of that index (at the cost of more IO during reads, but likely insignificant compared to the deserialization). Also, if you use a newer version (3.11+), the index is lazily loaded after exceeding a certain size, which improves things quite a bit.
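A minimal sketch of checking and raising that setting (the config path and the new value are illustrative; the node needs a restart to pick up the change, and a larger value means a coarser index, i.e. slightly more data read per index hit):
$ grep '^column_index_size_in_kb' /etc/cassandra/cassandra.yaml
column_index_size_in_kb: 64
$ sudo sed -i 's/^column_index_size_in_kb:.*/column_index_size_in_kb: 256/' /etc/cassandra/cassandra.yaml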

TTL tombstones in Cassandra using LCS: are tombstones created in the same level as the TTLed data?

I'm using LCS and a relatively large TTL of 2 years for all inserted rows and I'm concerned about the moment at which C* would drop the corresponding tombstones (neither explicit deletes nor updates are being performed).
From Missing Manual for Leveled Compaction Strategy, Tombstone Compactions in Cassandra and Deletes Without Tombstones or TTLs I understand that
All levels except L0 contain non-overlapping SSTables, but a partition key may be present in one SSTable in each level (aka distributed in all levels).
For a compaction to be able to drop a tombstone it must be sure that it is compacting all SSTables that contain the data, to prevent zombie data (this is checked using bloom filters). It also considers gc_grace_seconds.
So, for my particular use case (2 years TTL and a write-heavy load) I can conclude that TTLed data will be in the highest levels, so I'm wondering when those SSTables with TTLed data will be compacted with the SSTables that contain the corresponding tombstones.
The main question is: where are tombstones (from TTLs) being created? Are they created at Level 0, so that it will take a long time until they end up in the highest levels (hence disk space will take a long time to be freed)?
In a comment from About deletes and tombstones Alain says that
Yet using TTLs helps, it reduces the chances of having data being fragmented between SSTables that will not be compacted together any time soon. Using any compaction strategy, if the delete comes relatively late in the row history, as it use to happen, the 'upsert'/'insert' of the tombstone will go to a new SSTable. It might take time for this tombstone to get to the right compaction "bucket" (with the rest of the row) and for Cassandra to be able to finally free space.
My understanding is that with TTLs the tombstones is created in-place, thus it is often and for many reasons easier and safer to get rid of a TTLs than from a delete.
Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this.
I'm not sure what is meant by "in-place", since SSTables are immutable.
(I also have some doubts about what it says of using default_time_to_live that I've asked in How default_time_to_live would delete rows without tombstones in Cassandra?).
My guess is that it refers to tombstones being created in the same level (but in different SSTables) as the TTLed data, during a compaction triggered by one of the following reasons:
"Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread" The Missing Manual for Leveled Compaction Strategy
"If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions" The Missing Manual for Leveled Compaction Strategy
"When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable." CASSANDRA-7019
Since tombstones are created during compaction, I think it may be using SSTable metadata to estimate droppable tombstones.
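For reference, that per-sstable estimate can be inspected offline with the sstablemetadata tool shipped with Cassandra (a sketch; the data file path is illustrative and the exact output format differs between versions):
$ sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/mc-42-big-Data.db | grep -i tombstone
Estimated droppable tombstones: 0.37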
So, compactions (2) and (3) should be creating/dropping tombstones in the highest levels, hence using LCS with a large TTL should not be an issue per se.
By creating/dropping I mean that the same kind of compactions will be creating tombstones for expired data and/or dropping tombstones if the gc grace period has already passed.
A link to source code that clarifies this situation will be great, thanks.
Alain Rodriguez's answer from the mailing list:

"Another clue to explore would be to use the TTL as a default value if that's a good fit. TTLs set at the table level with 'default_time_to_live' should not generate any tombstone at all in C*3.0+. Not tested on my hand, but I read about this."

As explained on a parallel thread, this is wrong, mea culpa. I believe the rest of my comment still stands (hopefully :)).

"I'm not sure what it means with "in-place" since SSTables are immutable. My guess is that is referring to tombstones being created in the same..."

Yes, I believe during the next compaction following the expiration date, the entry is 'transformed' into a tombstone, and lives in the SSTable that is the result of the compaction, on the level/bucket this SSTable is put into. That's why I said 'in-place', which is indeed a bit weird for immutable data.

As a side idea for your problem, on 'modern' versions of Cassandra (I don't remember the version, that's what 'modern' means ;-)), you can run 'nodetool garbagecollect' regularly (not necessarily frequently) during the off-peak period. That might use the cluster resources when you don't need them to claim some disk space. Also making sure that a 2 years old record is not being updated regularly by design would definitely help. In the extreme case of writing a data once (never updated) and with a TTL for example, I see no reason for a 2 years old data not to be evicted correctly. As long as the disk can grow, it should be fine.

I would not be too much scared about it, as there is 'always' a way to remove tombstones. Yet it's good to think about the design beforehand indeed; generally, it's good if you can rotate the partitions over time, not to reuse old partitions for example.
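For completeness, the command referred to above (available from Cassandra 3.10 onwards, if I remember correctly) is run per node and can be limited to a keyspace/table; the names here are placeholders:
$ nodetool garbagecollect my_keyspace my_table
# rewrites each sstable individually, dropping deleted/expired data that no longer
# needs to be kept; run it off-peak since it is compaction-heavy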

How does the Leveled Compaction Strategy ensure 90% of reads are from one sstable

I am trying to understand how the Leveled Compaction Strategy in Cassandra works that guarantees 90% of all reads will be satisfied from a single sstable.
From DataStax Doc:
new sstables are added to the first level, L0, and immediately compacted with the sstables in L1. When L1 fills up, extra sstables are promoted to L2. Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap.
LeveledCompactionStrategy (LCS) in Cassandra implements the internals of LevelDB. You can check the exact implementation details in LevelDB implementation doc.
In order to give you a simple explanation take into account the following points:
Every SSTable is created when a fixed (relatively small) size limit is reached. By default L0 gets 5MB files, and each subsequent level is 10x the size (in L1 you'll have 50MB of data, in L2 500MB, and so on).
SSTables are created with the guarantee that they don't overlap (within a level, for L1 and above)
When a level fills up, a compaction is triggered and sstables from level L are promoted to level L+1. So, in L1 you'll have 50MB in ~10 files, in L2 500MB in ~100 files, etc.
In bold are the relevant details that justify the 90% reads from the same file (SSTable). Let's do the math together and everything will become clearer.
Imagine you have keys A, B, C, D, E in L0, and each key takes 1MB of data.
Next we insert key F. Because level 0 is full, a compaction will create a file with [A,B,C,D,E] in level 1, and F will remain in level 0.
That's ~83% of data in 1 file in L1.
Next we insert G, H, I, J and K. So L0 fills up again, and L1 gets a new sstable with [F,G,H,I,J].
By now we have K in L0, [A,B,C,D,E] and [F,G,H,I,J] in L1
And that's ~90% of data in L1.
If we continue inserting keys we will get around the same behavior so, that's why you get 90% of reads served from roughly the same file/SSTable.
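The same arithmetic generalizes once several levels are populated: with the default 10x level sizing, the largest level always holds roughly 90% of the data. A toy calculation, assuming four full levels (the level count is an assumption for illustration):
$ awk 'BEGIN { total=0; for (l=1; l<=4; l++) total += 10^l; printf "share of data in L4: %.1f%%\n", 100*10^4/total }'
share of data in L4: 90.0%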
A more in-depth and detailed explanation (what happens with updates and tombstones) is given in this paragraph of the link I mentioned (the sizes for compaction election are different because they are LevelDB defaults, not C*'s):
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).

Cassandra cfstats: differences between Live and Total used space values

For about 1 month I've been seeing the following values of used space for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster in the nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254Gb) between Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the Total value has stayed like this for a month on all three of these nodes. I relied on Cassandra to normalize the data automatically.
What I tried in order to reclaim space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free space?
Ok, I have a solution. It looks like a Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra performs some re-analysis of SSTables during node start. It removes the SSTables marked as compacted, recalculates used space, and does some other stuff.
So, what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, and then the compaction process started and used space is now decreasing.
Leveled compaction creates sstables of a fixed, relatively small size, in your case it is 100Mb, that are grouped into "levels". Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So basically, from this statement provided in the Cassandra doc, we can conclude that maybe in your case the ten-times-larger level has not formed yet, resulting in no compaction.
Coming to the second question: since you have kept the replication factor as 3, the data has 3 duplicate copies, which is why you have this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to the deletion operations.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100MB is going to cause you a lot of needless disk IO and it will cause it to take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
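A hedged sketch of applying that recommendation with the same cassandra-cli syntax used for the create statement above (1.1-era syntax; the keyspace name is a placeholder, and existing sstables are only re-chunked to the new size as future compactions rewrite them):
$ cassandra-cli -h localhost
[default@unknown] use MyKeyspace;
[default@MyKeyspace] update column family BinaryData
    with compaction_strategy=LeveledCompactionStrategy
    and compaction_strategy_options={sstable_size_in_mb: 15};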
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2, especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even run off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.
