Dealing with uncompactable/overlapping sstables in Cassandra

We have a new cluster running Cassandra 2.2.14, and have left compactions to "sort themselves out". This is in our UAT environment, so load is low. We run STCS.
We are seeing forever growing tombstones. I understand that compactions will take care of the data eventually once the sstable is eligible for compaction.
This is not occurring often enough for us, so I enabled some settings as a test (I am aware they are aggressive; this is purely for testing):
'tombstone_compaction_interval': '120',
'unchecked_tombstone_compaction': 'true',
'tombstone_threshold': '0.2',
'min_threshold': '2'
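For reference, these are STCS compaction subproperties, so they would typically be applied with an ALTER TABLE along these lines (a sketch; the keyspace and table names are placeholders):
ALTER TABLE my_keyspace.my_table
WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_compaction_interval': '120',
    'unchecked_tombstone_compaction': 'true',
    'tombstone_threshold': '0.2',
    'min_threshold': '2'
};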
Applying these settings did result in some compactions occurring; however, the number of dropped tombstones was low, and the ratio did not drop below the threshold (0.2).
After these settings were applied, this is what I can see from sstablemetadata:
Estimated droppable tombstones: 0.3514636277302944
Estimated droppable tombstones: 0.0
Estimated droppable tombstones: 6.007563159628437E-5
Note that this is only one CF, and there are much worse CFs out there (90% tombstones, etc.). I am using this one as an example, but all CFs are suffering the same symptoms.
tablestats:
SSTable count: 3
Space used (live): 3170892738
Space used (total): 3170892738
Space used by snapshots (total): 3170892750
Off heap memory used (total): 1298648
SSTable Compression Ratio: 0.8020960426857765
Number of keys (estimate): 506775
Memtable cell count: 4
Memtable data size: 104
Memtable off heap memory used: 0
Memtable switch count: 2
Local read count: 2161
Local read latency: 14.531 ms
Local write count: 212
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 645872
Bloom filter off heap memory used: 645848
Index summary off heap memory used: 192512
Compression metadata off heap memory used: 460288
Compacted partition minimum bytes: 61
Compacted partition maximum bytes: 5839588
Compacted partition mean bytes: 8075
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 124.0
Maximum tombstones per slice (last five minutes): 124
The obvious answer here is that the tombstones were not eligible for removal.
gc_grace_seconds is set to 10 days, and has not been moved.
I dumped one of the sstables to json, and I can see tombstones dating back to April 2019:
{"key": "353633393435353430313436373737353036315f657370a6215211e68263740a8cc4fdec",
"cells": [["d62cf4f420fb11e6a92baabbb43c0a93",1566793260,1566793260977489,"d"],
["d727faf220fb11e6a67702e5d23e41ec",1566793260,1566793260977489,"d"],
["d7f082ba20fb11e6ac99efca1d29dc3f",1566793260,1566793260977489,"d"],
["d928644a20fb11e696696e95ac5b1fdd",1566793260,1566793260977489,"d"],
["d9ff10bc20fb11e69d2e7d79077d0b5f",1566793260,1566793260977489,"d"],
["da935d4420fb11e6a960171790617986",1566793260,1566793260977489,"d"],
["db6617c020fb11e6925271580ce42b57",1566793260,1566793260977489,"d"],
["dc6c40ae20fb11e6b1163ce2bad9d115",1566793260,1566793260977489,"d"],
["dd32495c20fb11e68f7979c545ad06e0",1566793260,1566793260977489,"d"],
["ddd7d9d020fb11e6837dd479bf59486e",1566793260,1566793260977489,"d"]]},
So I do not believe gc_grace_seconds is the issue here.
I have run a manual user defined compaction over every Data.db file within the column family folder (a single Data.db file at a time). Compactions ran, but there was very little change to the tombstone values. The old data still remains.
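For reference, a user defined compaction is triggered through the CompactionManager MBean; a sketch of a JMXterm session (the jar version and sstable file name are placeholders, and the exact operation signature varies between Cassandra versions):
$ java -jar jmxterm-1.0.2-uber.jar -l localhost:7199
$> run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction lb-1234-big-Data.db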
I can confirm repairs have occurred, yesterday actually. I can also confirm repairs have been running regularly, with no issues showing in the logs.
So repairs are fine. Compactions are fine.
All I can think of is overlapping SSTables.
The final test was to run a full compaction on the column family. I performed a user defined compaction (not nodetool compact) on the 3 SSTables using JMXterm.
This resulted in a singular SSTable file, with the following:
Estimated droppable tombstones: 9.89886650537452E-6
If I look for the example epoch as above (1566793260), it is not visible. Nor is the key. So it was compacted out, or Cassandra did something with it.
The total number of lines containing a tombstone ("d") flag is 1317, out of the 120 million line dump. And the epoch values are all within 10 days. Good.
So I assume the E-6 value simply represents a very small ratio, expressed in scientific notation.
So, success right?
But it took a full compaction to remove the old tombstones. As far as I am aware, a full compaction is only a last-ditch maneuver.
My questions are -
How can I determine whether overlapping sstables are my issue here? I can't see any other reason why the data would not compact out unless it is overlap related.
How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to reoccur in a few weeks' time. I don't want to get stuck having to perform full compactions regularly to keep tombstones at bay.
What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?
Cheers.

To answer your questions:
How can I determine whether overlapping sstables are my issue here? I can't see any other reason why the data would not compact out unless it is overlap related.
If the tombstones weren't generated by TTL, more often than not the tombstones and the data they shadow end up in different sstables. When using STCS and there is a low volume of writes into the cluster, few compactions will be triggered, which causes the tombstones to stay around for an extended time. If you have the partition key of a tombstone, running nodetool getsstables -- <keyspace> <table> <key> on a node will return all the sstables that contain that key on the local node. You can dump the sstable contents to confirm.
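A hypothetical run, with placeholder keyspace, table, and key:
nodetool getsstables -- my_keyspace my_table some_partition_key
# returns the absolute paths of the local sstables containing that key, which you can then dump and inspect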
How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to reoccur in a few weeks' time. I don't want to get stuck having to perform full compactions regularly to keep tombstones at bay.
There is a new option, "nodetool compact -s", which performs a major compaction but splits the output into 4 sstables of different sizes. This solves the previous problem with major compactions, which produced a single large sstable. If the droppable tombstone ratio is as high as 80-90%, the resulting sstables will be even smaller, since the majority of the tombstones have been purged.
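A sketch, with placeholder names (check nodetool compact --help on your version to confirm the option is available):
nodetool compact -s my_keyspace my_table
# -s / --split-output writes several smaller sstables instead of one large one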
In newer versions of Cassandra (3.10+), there is a new tool, nodetool garbagecollect, to clean up tombstones. However, this tool has limitations; not all kinds of tombstones can be removed by it.
All that being said, for your situation, with overlapping sstables and a low volume of activity (and therefore infrequent compactions), you either have to find all the related sstables and use a user defined compaction, or do a major compaction with "-s". https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsCompact.html
What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?
Fast-growing tombstone counts usually indicate a data modeling problem: the application may be inserting nulls, periodically deleting data, or using collections and updating them instead of appending. If your data is time series, check whether it makes sense to use TTL and TWCS.
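If TTL plus TWCS fits the data, a table along these lines is one way to model it (a sketch; all names are hypothetical, and TWCS ships with Cassandra 3.0.8+/3.8+):
CREATE TABLE my_keyspace.sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH default_time_to_live = 864000   -- expire rows after 10 days
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': '1'};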

Related

Disk space not decreasing after gc_grace_seconds (10 days) elapsed

I deleted a lot of data (10 billion rows) from my table (I made a small app that queries from Long.MIN_VALUE up to Long.MAX_VALUE in token ranges and deletes some data).
Disk space did not decrease in the 20 days since then (I also ran nodetool repair on 1 node out of a total of 6), but the number of keys (estimate) has decreased accordingly.
Will the space decrease in the future in a natural way, or is there some utility from Cassandra I need to run to reclaim the space?
In general, yes, the space will decrease accordingly (once compaction runs). Depending on the compaction strategy chosen for that table, it could take some time. Size Tiered Compaction Strategy, for example, requires by default that 4 sstables be the same size before being compacted. If you have very large sstables, they may not get compacted for quite some time, or indefinitely if there are not 4 of the same size. A manual compaction would fix that situation, but it would put everything in a single sstable, which is not recommended either. If the resulting sstable of a manual compaction is very small, then it won't hurt you. If it ends up compacting to a "large" sstable, then you have sacrificed "now" for "later" (again, because you now have only a single large sstable, it may take a very long time for it to participate in compaction). You can split the sstable after a manual compaction to remedy the situation you've created, but you'll have to take your node offline to do it. Anyway, the short answer is that over time the table should shrink accordingly; when depends on the compaction strategy chosen.
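For the split step mentioned above, Cassandra ships an offline sstablesplit tool; a sketch, with a placeholder path (the node must be stopped first):
sstablesplit -s 50 /var/lib/cassandra/data/my_keyspace/my_table-<table_id>/<generation>-Data.db
# -s is the target size of the output sstables in MB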
Try running "nodetool garbagecollect", as this will trigger a compaction and remove deleted data. You can verify its progress with "nodetool compactionstats".
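A sketch, with placeholder names (nodetool garbagecollect is available from 3.10 onwards):
nodetool garbagecollect my_keyspace my_table
nodetool compactionstats   # watch the progress of the resulting compactions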

How are Cassandra Tombstones deleted in old SSTables?

If I have compaction enabled, like SizeTieredCompaction, my SSTables get compacted until a certain size level is reached. When I "delete" an old entry which is in an SSTable partition that is quite old and won't be compacted again in the near future, when does the deletion actually take place?
Imagine you delete 100 entries and all are part of a really old SSTable that was compacted several times, has no hot data and is already quite big. It will take ages until it's compacted again and tombstones are removed, right?
When the tombstone is merged with the data in a compaction the data will be deleted from disk. When that happens depends on the rate new data is being added and your compaction strategy. The tombstones are not purged until after gc_grace_seconds to prevent data resurrection (make sure repairs complete within this period of time).
If you overwrite or delete data a lot and are not OK with a lot of obsolete data on disk, you should probably use LeveledCompactionStrategy instead (I would recommend always defaulting to LCS if using SSDs). It can take a long time for the largest sstables to get compacted when using STCS. STCS is more for constantly appended data (like logs or events). If the entries expire over time and you rely heavily on TTLs, you will probably want to use the time window compaction strategy.
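A hedged example of switching an existing table to LCS (the keyspace and table names are placeholders, not from this answer):
ALTER TABLE my_keyspace.my_table
WITH compaction = {'class': 'LeveledCompactionStrategy'};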

cassandra table looks empty in cqlsh, but nodetool cfstats thinks otherwise

Using nodetool cfstats I can see that a particular table (table1) is using 59mb and has 545597 keys. Another related table (table2) is using 568mb and has 2,506,141 keys.
Using cqlsh, when I do select count(*) from table1 it pauses for about 7 seconds then returns a count of 0. However, if I do select count(*) from table2 it pauses for much longer and then returns a count of 2,481,669.
I also tried select * from table1 and select * from table2. The first takes 7 seconds then returns nothing. The second instantly starts paging through results.
I'm well aware these are expensive operations, however this is on a single dev server which has only this one Cassandra instance. It's a cluster of 1 and not meant for production. I just want to figure out why the values in table1 are invisible.
Is it possible that table1 actually has no values in it? That shouldn't be possible given that I just ran a job to add a bunch of values to it. I also ran "nodetool compact", so that should have eliminated all the tombstones and the cfstats should show what's actually there, right? Here are the cfstats for table1 after I ran nodetool compact:
SSTable count: 1
Space used (live): 59424392
Space used (total): 59424392
Space used by snapshots (total): 73951087
Off heap memory used (total): 806762
SSTable Compression Ratio: 0.28514022725059224
Number of keys (estimate): 545597
Memtable cell count: 393204
Memtable data size: 17877650
Memtable off heap memory used: 0
Memtable switch count: 3
Local read count: 5
Local read latency: 0.252 ms
Local write count: 545804
Local write latency: 0.013 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 611792
Bloom filter off heap memory used: 611784
Index summary off heap memory used: 180202
Compression metadata off heap memory used: 14776
Compacted partition minimum bytes: 216
Compacted partition maximum bytes: 310
Compacted partition mean bytes: 264
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 6.0
Maximum tombstones per slice (last five minutes): 7
If it helps, I'm using Apache Cassandra 2.2.0 on a Linux server.
Cassandra saves all the data in files (sstables). For speed, writes append data at the end of the files (the indexes certainly work differently, but the docs do not describe how those function...).
The deletion of data (or expiration, in your case) does not remove the data from the files, because that would mean a lot of large moves and tons of I/O. So instead, they just mark the entry as "dead" (hence the name tombstone).
Once in a while, the compaction system comes in (assuming you did not turn it off for that table) and compacts tables. That means it reads from the start of the file and moves live entries over dead ones. More or less, something like this, assuming B gets deleted at some point (columns left to right represent different points in time):
Creation    Deletion        Compaction
A           A               A
B           B-tombstone     C
C           C
If your table has too many tombstones, the compaction may fail (I do not understand why it could fail, but that's what I read). A table that fails compaction is marked as "do not ever compact", which is a big problem, if you ask me. And a table with half a million keys could very well be failing.
While the table is in the "Deletion" state (includes tombstones), a SELECT that goes over a tombstone still creates a TombStone memory object (do not ask me why, I have no idea, it looks like Cassandra would not work right otherwise...) Hence, the 7 seconds to read all the tombstones and create Java objects for each one of them.
The cqlsh interface includes a tracing feature that can be used to see the number of tombstones a query scans. It prints out a bunch of things that you'd like to know about.
TRACING ON;
SELECT COUNT(*) FROM table1;

When does Cassandra remove data from an SSTable

In Cassandra 2.x when I delete one or multiple columns, they receive a tombstone in the Memtable but the data is not removed. At some point, the Memtable is flushed to an SSTable including the deleted data and the tombstone. When compaction is running, it will retain the tombstone with the specified grace period. What happens to the data? I have deleted a bunch of columns last week - less than gc_grace_seconds ago. I am not sure compaction has run yet. I haven't seen any change on disk size used yet, so I was wondering at which point is the data physically removed from disk?
In Cassandra 2.x when I delete one or multiple columns, they receive a tombstone in the Memtable but the data is not removed. At some point, the Memtable is flushed to an SSTable including the deleted data and the tombstone. When compaction is running, it will retain the tombstone with the specified grace period.
True.
What happens to the data?
The data will remain on disk at least for gc_grace_seconds. The next minor compaction after gc_grace_seconds may remove it, but the real timing depends mostly on your dataset and workload type.
I have deleted a bunch of columns last week - less than gc_grace_seconds ago. I am not sure compaction has run yet. I haven't seen any change on disk size used yet, so I was wondering at which point is the data physically removed from disk?
If you want to free some disk space, you can:
wait for gc_grace_seconds and let a normal minor compaction handle it, or
run nodetool compact, which will trigger a major compaction on the current node, freeing disk space right away.

Cassandra cfstats: differences between Live and Total used space values

For about a month I have been seeing the following values of used space for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster, in nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254 GB) between Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the Total value has stayed like this for a month on all three of the nodes. I was relying on Cassandra to normalize the data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free space?
OK, I have a solution. It looks like a Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra performs some re-analysis of SSTables during node startup. It removes the SSTables marked as compacted, recalculates the used space, and does some other stuff.
So, what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started, and the used space is now decreasing.
Leveled compaction creates sstables of a fixed, relatively small size (in your case 100 MB) that are grouped into "levels". Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So, basically, from this statement in the Cassandra docs, we can conclude that in your case the ten-times-larger level may not have formed yet, resulting in no compaction.
Coming to the second point: since you have set the replication factor to 3, the data has 3 duplicate copies, which contributes to this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to the delete operations.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100 MB is going to cause you a lot of needless disk I/O, and it will take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
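In the 1.1-era cassandra-cli syntax used in the question, that change would look something like this (a sketch mirroring the create statement above):
update column family BinaryData
  with compaction_strategy_options={sstable_size_in_mb: 15};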
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2, especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with new versions, or even run off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.
