Cassandra cfstats: differences between Live and Total used space values - cassandra

For about 1 month I'm seeing the following values of used space for the 3 nodes ( I have replication factor = 3) in my Cassandra cluster in nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254Gb) between Live and Total space. It seems I have a lot garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note, that total value staying for month on all of the three nodes. I relied Cassandra normalize data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to cleanup a garbage and free space?

Ok, I have a solution. It looks like Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra perform some re-analysing of SStables during node starting. It removes the SStables marked as compacted, performs recalculation of used space, and do some other staff.
So, what I did is restarted the 3 problem nodes. The Total and Live values have become equals immediately after restart was completed and then Compaction process has been started and used space is reducing now.

Leveled compaction creates sstables of a fixed, relatively small size,
in your case it is 100Mb that are grouped into “levels”. Within each
level, sstables are guaranteed to be non-overlapping. Each level is
ten times as large as the previous.
So basically from this statement provided in cassandra doc, we can conclude that may be in your case ten time large level background is not formed yet, resulting to no compaction.
Coming to second question, since you have kept the replication factor as 3, so data has 3 duplicate copies, for which you have this anomaly.
And finally 25% difference between Live and Total space, as you know its due over deletion operation.

For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100MB is going to cause you a lot of needless disk IO and it will cause it to take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2. Especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even running off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.

Related

Disk space requirement for compaction on a token range in scylla/cassandra

I am using SizeTieredCompaction strategy in Scylla db. I deleted half of my data in a specific token range (let's say x to y). My gc_grace_seconds is set to 6 hours. I want to get rid of all the tombstones that are created in this token range. If I run nodetool compact --start-token x --end-token y keyspace table on all the nodes in cluster after gc_grace_seconds has passed, what would happen? will it delete the tombstones and how much disk space will it consume? Will it be same as nodetool compact major compaction that needs 50% more space?
Scylla's documentation of nodetool compact (see https://docs.scylladb.com/operating-scylla/nodetool-commands/compact/) doesn't even the token range option, unfortunately. But the Cassandra documentation (https://cassandra.apache.org/doc/latest/operating/compaction/index.html) explains what the so-called sub-range compaction does:
It is possible to only compact a given sub range - this could be useful if you know a token that has been misbehaving - either gathering many updates or many deletes. (nodetool compact -st x -et y) will pick all sstables containing the range between x and y and issue a compaction for those sstables. For STCS this will most likely include all sstables but with LCS it can issue the compaction for a subset of the sstables.
With STCS the common case is that all sstables have tokens from all over the token ring, so your nodetool compact call will usually invoke a full major compaction of all sstables. The token range option will likely not exempt any of the sstables from being compacted. So the temporary disk space overhead will be as usual with STCS: At the end of the compaction, you have both the old sstables, and the new one. You assumed the new ones have only half of the original data, so the new sstable will be around half the total size of the old sstable, so this is probably the "50%" you asked about.
To delete the tombstones you also need to run nodetool repair. See here for details on the repair procedure. Basically repair compares data between node so that tombstones can be safely expired.
The space required for compaction is dependent on the specific workload, it is impossible to provide an answer without data about your workload. But 2x is a safe bet which takes into account safety margins. After full compaction the space used will be minimal as only 1 copy of the data is save on each node.

Disk space not decreasing after gc_grace_seconds (10 days) elapsed

I deleted a lot of data(10 billions rows) from my table (made a small app that query from LONG.MIN_VALUE up to LONG.MAX_VALUE in token range and DELETE some data).
Disk space did not decrease after 20 days from then (also I run nodetool repair on 1 node from total of 6), but number of keys(estimate) have decrease accordingly.
Will the space decrease in the future in a natural way, or there is some utility from cassandra I need to run to reclaim the space?
In general, yes, the space will decrease accordingly (once compaction runs). Depending on the compaction strategy chosen for that table, it could take some time. Size Tiered Compaction Strategy for example requires, by default, that 4 sstables be the same size before being compacted. If you have very large SSTABLES then they may not get compacted for quite some time, or indefinitely if there are not 4 of the same size. A manual compaction would fix that situation, but it would put everything in a single sstable, which is not recommended either. If the resulting sstable of a manual compaction is very small, then it won't hurt you. If it ends up compacting to a "large" SSTABLE, then you have sacrificed "now" for "later" (again, because you now have only a single large sstable, it may take a very long time for it to participate in compaction). You can split the sstable after a manual compaction to remidy the situation you've created, but you'll have to take your node off-line to do it. Anyway, short answer is that over time the table should shrink accordingly - when depends on the compaction strategy chosen.
Try running "nodetool garbagecollect" as this will trigger compaction and removes deleted data. which you can verify running status by "nodetool compacationstats"

Cleanup space in almost full Cassandra Node

I have a Cassandra Cluster (2 DC) with 6 nodes each and RF 2. 4 of the nodes (in each DC) getting full so I need to cleanup space very soon.
I tried to run a full repair but ended up as a bad idea since the space start increased even more and the repair eventually hanged. As a last solution I am thinking to start repairing and then cleanup specific columns starting from the smallest to the biggest.
i.e
nodetool repair -full foo_keyspace bar_columnfamily
nodetool cleanup foo_keyspace bar_columnfamily
Do you think that this procedure will be safe for the data?
Thank you
The commands that you presented in your question make several incorrect assumptions. First, "repair" is not supposed to, and will not, save any space. All repair does is to find inconsistencies between different replicas and repair them. It will either do nothing (if there's no inconsistencies), or add data, not remove data.
Second, "cleanup" is something you need to do after adding new nodes to the cluster - after each node sent some of its data to the new node, a "cleanup" removes the data from the old nodes. But cleanup is not relevant when not adding node.
The command you may be looking for is "compact". This can save space, but only when you know you had a lot of overwrites (rewriting existing rows), deletions or data expirations (TTL). What compaction strategy are you using? If it's the default, size-tiered compaction strategy (STCS) you can start major compaction (nodetool compact) but should be aware of a big risk involved:
Major compaction merges all the data into one sstable (Cassandra's on-disk file format), dropping deleted, expired or overwritten data. However, during this compaction process, you have both input and output files, and at worst case this may double your disk usage, and may fail if the disk is more than 50% full. This is why a lot of Cassandra best-practice guides suggest never to fill more than 50% of the disk. But this is just the worst case. You can get along with less free space if you know that the output file will be much smaller than the input (because most of the data has been deleted). Perhaps more usefully, if you have many separate tables (column family), you can compact each one separately (as you suggested, from smallest to biggest) and the maximum amount of disk space needed temporarily during the compaction can be much less than 50% of the disk.
Scylla, a C++ reimplementation of Cassandra, is developing something known as "hybrid compaction" (see https://www.slideshare.net/ScyllaDB/scylla-summit-2017-how-to-ruin-your-performance-by-choosing-the-wrong-compaction-strategy) which is like Cassandra's size-tiered compaction but does compaction in small pieces instead of generating one huge file, to avoid the huge temporary disk usage during compaction. Unfortunately, Cassandra doesn't have this feature yet.
Good idea is first start repair on smallest table on smallest keyspace one by one and complete repair. It will take time but safer way and no chance to hang and traffic loss.
Once repair completed start cleanup in the same way as repair. This way no impact on node and cluster as well.
You shouldn't fill more than about 50-60 % of your disks to make room for compaction. If you're above that amount of disk usage you need to consider getting bigger disks or add more nodes.
Datastax recommendations are usually good to follow: https://docs.datastax.com/en/dse-planning/doc/planning/planPlanningDiskCapacity.html

Cassandra Compacting wide rows large partitions

I have been searching some docs online to get good understanding of how to tackle large partitions in cassandra.
I followed a document on the below link:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", below is metioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The Definition in the cassandra.yaml goes as below:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have searching lot on what exactly the setting in_memory_compaction_limit_in_mb does.
It mentions some compaction is done in memory and some compaction is done on disk.
As per my understanding goes, When Compaction process runs:
SSTABLE is being read from disk---->(compared,tombstones removed,stale data removed) all happens in memory--->new sstable written to disk-->old table being removed
This operations accounts to high Disc space requirements and Disk I/O(Bandwidth).
Do help me with,if my understanding of compaction is wrong. Is there anything in compaction that happens in memory.
In my environment the
in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a 2 pass compaction so can be ignored. You don't have to do the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition that occurs in memory. You can increase the column_index_size_in_kb to reduce the size of that index (at cost of more IO during reads, but likely insignificant compared to the deserialization). Also if you use a newer version (3.11+) the index is lazy loaded after exceeding a certain size which improves things quite a bit.

Want to spread current cluster data over more and smallere sstables

Our cassandra 2.1.15 application' KS (using STCS) are leveling in less than 100 sstables/node of which some data sstables are now getting into the +1TB size. This means heavy/longer compactions plus longer time before tombstones and their evicted data gets in the same compaction view (application do both create/read/delete of data), thus longer before real disk space gets reclaimed, this sucks :(
Our Application Vendor later revealed to us, that they normally recommend hashing the data over 10-20 CFs in the application KS rather than our currently created 3 CFs, guessing as an way to keep ratio of sstables vs sizes in a 'workable' range. Only the application can't have this changed now we have begun hashing data out in our 3 CFs.
Currently we got 14x linux node cluster, nodes of same HW and size (running w/equal amount of vnodes), originally constructed with two data_file_directories in two xfs FS on each their logical volumes - LVs backed each by a PV (6+1 raid5). Then as some nodes began to compact data skewed in these data dirs/LVs when growning sstable sizes, we merged both data dirs onto one LV and expanded this LV with the thus released PV. So we now got 7x nodes with two data dirs in one LV backed by two PVs and 7x nodes with two data dirs in two LVs on each their PV.
1) Now as sstable sizes keeps growning due to more data and using STCS (as recommend by App Vendor) we're thinking we might be able spread data over more and smallere sstables by simply adding more data dirs in our LVs as compensation for having less CFs rather than adding more HW nodes :) Wouldn't this work to spread data over more and smallere sstables or is the a catch in using multiple data dir compared with fewer?
1) Follow-up: must have had a brain fa.. that day, off course it won't :) The Compaction Strategy doesn't bother with over how many data dirs a CF' sstables are scattered only bothers with the sstables them selves according to the strategy. So only way to spread over more and smallere sstables is to hash data over more CFs. Too bad Vendor did the time-space trade off not to record in which CF a partition key is hashed a long with the key it self, then hashing might have been reseeded to a larger number of CFs. Now only way is to built a new cluster w/more CFs and migrate data there.
2) We could then possibly use either sstablesplit on the largest sstables or removing/rejoining with more than two data dirs node by node to get rit of the currently real big sstables. Would either approach work to get sstable sizes scaled down and which way is most recommendable?
2) Follow-up: well if one node is decommissioned is token range will be scatter to other nodes, specially when using multiple vnodes/node and thus one big sstables would be scatter over more nodes and left to the mercy of the compaction strategy at other nodes. But generally if 1 out of 14 nodes, each with 256 vnodes, would be scattered to the 13 other nodes for sure, right?
Thus only increasing other nodes' amount of data by roughly 1/13 of decommissioned node' content. But rejoining such a node again would properly only send roughly same amount of data back eventually getting compacted into similar sized sstables, meaning we've done a lot IO+streaming for nothing... Unless tombstones were among the original data but just to far apart to be lucky enough to enter same compaction views (small sstable vs large sstable), such an exercise may possible get data shuffled around giving better/other chance to get some tombstone+their data evicted through the scatter+rejoining faster than waiting to strategy to get TS+data in same compaction view, dunno... any thoughs on the value of possible doing this?
Huh that was a huge thought dump.
I'll try to get straight to the point. Using ANY type of raid (except stripe) is a deathtrap. If your nodes don't have sufficient space then you either add disks as JBODs to your nodes or scale out. Second thing is your application creating, deleting, updating and reading data and you are using STCS? And with all that you have 1TB+ per node? I don't even want to get into questioning the performance of that setup.
My suggestion would be to rethink the setup having data size, access patterns, read/write/delete/update ratios and data retention plans in mind. 14 nodes with 1TB+ of data each is not catastrophic (even thou the docu states that going past 600-800GB is bad, its not) but you need to change the approach. LCS works wonders for scenarios like yours and with proper planning you can have that cluster running a long time before having to scale out (or TTL your data) with decent performance.

Resources