I have been running code that reads from and writes to Cassandra column families. I have observed that my table size is around 10 GB, but the space consumed on disk by the db files for the same table is around 400 GB, spread across different versions of the files:
la-2749-big-Statistics.db la-2750-big-Index.db la-2750-big-Filter.db
la-2750-big-Summary.db la-2750-big-Data.db la-2750-big-Digest.adler32
la-2750-big-CRC.db la-2750-big-TOC.txt la-2750-big-Statistics.db
la-2751-big-Filter.db la-2751-big-Index.db la-2751-big-Summary.db
la-2751-big-Data.db la-2751-big-Digest.adler32 la-2751-big-CRC.db
la-2751-big-Statistics.db la-2751-big-TOC.txt
la-2752-big-Index.db la-2752-big-Filter.db la-2752-big-Summary.db
la-2752-big-Data.db la-2752-big-Digest.adler32 la-2752-big-CRC.db
la-2752-big-TOC.txt la-2752-big-Statistics.db
I would like to understand whether the latest version of the file set contains all the required data, so that I can remove the older versions. Does Cassandra provide a facility for rolling deletion of such files?
The number you refer to is the number of the SSTable (I think it is technically called the generation). Specifically, the format of the filename is:
CFName-Generation-SSTableFormat-ComponentFile
In your case:
CFName = la
Generation = 275x
SSTableFormat = BIG
ComponentFile = Data.db, TOC.txt, etc...
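As a quick illustration, that format can be parsed mechanically. This is just a sketch: the field names mirror the format string above, and the regex assumes the `la-NNNN-big-Component` naming seen in the question.

```python
import re

# Sketch: split an SSTable component filename into the fields described above
# (CFName-Generation-SSTableFormat-ComponentFile). Assumes the la-NNNN-big-*
# naming from the question.
SSTABLE_RE = re.compile(
    r"^(?P<cfname>[a-z]+)-(?P<generation>\d+)-(?P<fmt>\w+)-(?P<component>.+)$"
)

def parse_sstable_name(filename):
    m = SSTABLE_RE.match(filename)
    if m is None:
        raise ValueError("not an SSTable component file: %s" % filename)
    return m.groupdict()

print(parse_sstable_name("la-2750-big-Data.db"))
# {'cfname': 'la', 'generation': '2750', 'fmt': 'big', 'component': 'Data.db'}
```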
You can't really tell whether the last SSTable contains all the data you need. The space on disk consumed by old generations is released only once the data is no longer referenced (snapshots come to mind) and the age of their tombstones is greater than gc_grace_seconds.
You should first check whether you have any snapshots and, if so, use nodetool to remove them. Then you should investigate how your tombstones are distributed among these SSTables; if the tombstones cannot get compacted away, you probably have a bigger problem to solve (e.g. schema redesign, or data migration to a new cluster).
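To see whether snapshots are what is pinning the old generations, you can sum the bytes sitting under any `snapshots` directories. A minimal sketch, assuming the usual `<data_dir>/<keyspace>/<table>/snapshots/<tag>/` layout (verify it against your installation); `nodetool clearsnapshot` is the supported way to actually remove them:

```python
import os

# Sketch: total disk space held by snapshot directories under a Cassandra
# data directory. Assumes snapshots live in directories literally named
# "snapshots" somewhere under data_dir.
def snapshot_bytes(data_dir):
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        if "snapshots" in root.split(os.sep):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
    return total
```

If this reports a large number, `nodetool listsnapshots` (on recent versions) will identify the snapshots and `nodetool clearsnapshot` will free the space.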
Related
Our Cassandra 2.1.15 application's keyspace (using STCS) is leveling off at fewer than 100 SSTables per node, and some of the data SSTables are now getting into the 1 TB+ size range. This means heavier, longer compactions, plus a longer time before tombstones and the data they shadow end up in the same compaction view (the application creates, reads, and deletes data), and thus longer before real disk space gets reclaimed, which sucks :(
Our application vendor later revealed to us that they normally recommend hashing the data over 10-20 CFs in the application keyspace rather than the 3 CFs we currently have, presumably as a way to keep the ratio of SSTable count to size in a 'workable' range. But the application can't have this changed now that we have begun hashing data out over our 3 CFs.
Currently we have a 14-node Linux cluster, all nodes of the same hardware and size (running with an equal number of vnodes), originally constructed with two data_file_directories in two XFS filesystems, each on its own logical volume, each LV backed by a PV (6+1 RAID 5). Then, as growing SSTable sizes caused data to skew between these data dirs/LVs on some nodes, we merged both data dirs onto one LV and expanded that LV with the PV this released. So we now have 7 nodes with two data dirs in one LV backed by two PVs, and 7 nodes with two data dirs in two LVs, each on its own PV.
1) Now, as SSTable sizes keep growing due to more data and the use of STCS (as recommended by the app vendor), we're thinking we might be able to spread data over more and smaller SSTables by simply adding more data dirs in our LVs, as compensation for having fewer CFs, rather than adding more hardware nodes :) Wouldn't this work to spread data over more and smaller SSTables, or is there a catch in using multiple data dirs compared with fewer?
1) Follow-up: must have had a brain fa.. that day, of course it won't :) The compaction strategy doesn't care over how many data dirs a CF's SSTables are scattered; it only considers the SSTables themselves, according to the strategy. So the only way to spread over more and smaller SSTables is to hash data over more CFs. Too bad the vendor made the time-space trade-off of not recording which CF a partition key is hashed into along with the key itself; otherwise the hashing might have been reseeded to a larger number of CFs. Now the only way is to build a new cluster with more CFs and migrate the data there.
2) We could then possibly use either sstablesplit on the largest SSTables, or remove/rejoin nodes one by one with more than two data dirs, to get rid of the currently really big SSTables. Would either approach work to scale SSTable sizes down, and which way is more advisable?
2) Follow-up: well, if a node is decommissioned, its token range will be scattered to other nodes, especially when using multiple vnodes per node, and thus one big SSTable would be scattered over more nodes and left to the mercy of the compaction strategy at those other nodes. But in general, 1 node out of 14, each with 256 vnodes, would be scattered to the 13 other nodes for sure, right?
Thus this only increases the other nodes' amount of data by roughly 1/13 of the decommissioned node's content. But rejoining such a node would probably just send roughly the same amount of data back, eventually getting compacted into similarly sized SSTables, meaning we'd have done a lot of IO and streaming for nothing... Unless tombstones were among the original data but just too far apart to be lucky enough to enter the same compaction views (small SSTable vs large SSTable). In that case such an exercise may possibly shuffle data around, giving a better chance for some tombstones and their data to get evicted through the scatter-and-rejoin faster than waiting for the strategy to get tombstones and data into the same compaction view, dunno... any thoughts on the value of possibly doing this?
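For what it's worth, the back-of-the-envelope numbers here check out. A toy sketch, assuming the 256 vnodes per node give a perfectly even token distribution (a simplification):

```python
# Toy arithmetic for the scatter estimate above: decommissioning 1 node of a
# 14-node cluster streams its data roughly evenly to the remaining 13 nodes,
# assuming vnodes give an even token distribution.
nodes = 14
node_data_tb = 1.0  # assumed size of the decommissioned node's data

receivers = nodes - 1
extra_per_node_tb = node_data_tb / receivers
print(round(extra_per_node_tb, 3))  # ~0.077 TB per receiving node, i.e. 1/13
```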
Huh, that was a huge thought dump.
I'll try to get straight to the point. Using ANY type of RAID (except striping) is a deathtrap. If your nodes don't have sufficient space, then you either add disks as JBODs to your nodes or scale out. Second thing: your application is creating, deleting, updating, and reading data, and you are using STCS? And with all that you have 1 TB+ per node? I don't even want to get into questioning the performance of that setup.
My suggestion would be to rethink the setup with data size, access patterns, read/write/delete/update ratios, and data retention plans in mind. 14 nodes with 1 TB+ of data each is not catastrophic (even though the documentation states that going past 600-800 GB is bad, it's not), but you need to change the approach. LCS works wonders for scenarios like yours, and with proper planning you can have that cluster running a long time with decent performance before having to scale out (or TTL your data).
We've got a DSC 2.1.15 14-node cluster using STCS, and it seems to be hovering around what looks like a stable number of SSTables even as we insert more and more data, so we're currently starting to see SSTable data files in excess of 1 TB. See graphs:
Reading this, we fear that having such large file sizes might postpone compacting tombstones, and thus finally releasing space, as we'll have to wait for at least 4 similarly sized SSTables to be created.
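That "at least 4 similarly sized SSTables" behaviour comes from STCS's size bucketing. A rough sketch of the idea (the real strategy is more involved; `bucket_low=0.5`, `bucket_high=1.5`, and `min_threshold=4` are the stock defaults):

```python
# Rough sketch of SizeTieredCompactionStrategy bucketing: sstables whose size
# falls within [0.5, 1.5] of a bucket's average share that bucket, and a
# bucket only becomes a compaction candidate once it has min_threshold (4)
# members.
def buckets(sizes, low=0.5, high=1.5):
    groups = []
    for size in sorted(sizes):
        for g in groups:
            avg = sum(g) / len(g)
            if low * avg <= size <= high * avg:
                g.append(size)
                break
        else:
            groups.append([size])
    return groups

sizes_gb = [180, 200, 210, 950, 1100]  # hypothetical per-sstable sizes
for g in buckets(sizes_gb):
    print(g, "candidate" if len(g) >= 4 else "waits")
```

A lone 1 TB sstable therefore sits untouched until roughly three more of similar size accumulate, which is exactly the delayed-reclamation effect feared above.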
Every node currently has two data directories. We were hoping Cassandra would spread data across those dirs using space roughly equally, but as SSTables grow through compaction, we fear ending up with larger and larger SSTables, perhaps concentrated in one data dir.
How can we possibly control this better, maybe with LCS, or...?
How do we determine a sweet spot for the number of SSTables vs their sizes?
What affects the number of SSTables, their sizes, and which data dir they get placed in?
Currently a few nodes are beginning to look skewed:
/dev/mapper/vg--blob1-lv--blob1 6.4T 3.3T 3.1T 52% /blob/1
/dev/mapper/vg--blob2-lv--blob2 6.6T 545G 6.1T 9% /blob/2
Could we stop a node, merge all of a keyspace's SSTables into one data dir (they seem uniquely named with an id/sequence number even though they are spread across two data dirs), expand the underlying volume, and restart the node, thus avoiding running out of space when only one data dir's filesystem fills up?
I have a Cassandra 2.1.4 cluster with 14 nodes. I am using it primarily for storing time series data collected via KairosDB.
The default TTL for data inserted into the column family named data_points (which is the biggest column family) is 12 hours. I have also set gc_grace_seconds to 12 hours.
In spite of this, my disk space keeps increasing and it looks like tombstones are never dropped.
Compactions seem to be happening on a regular basis, and the SSTable count does not seem outrageous either: it stays between roughly 10 and 22. The compaction strategy I am using is DTCS.
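For reference, the earliest moment a TTL'd cell can be purged works out as write time + TTL + gc_grace_seconds, and even then the sstable holding it must actually take part in a compaction. A quick sketch with the 12-hour values above (the write time is an arbitrary example):

```python
from datetime import datetime, timedelta

# Timeline for a TTL'd cell: it expires at write_time + TTL, and the resulting
# tombstone only becomes purgeable once it is also older than gc_grace_seconds.
ttl = timedelta(hours=12)
gc_grace = timedelta(hours=12)

written = datetime(2015, 6, 1, 0, 0)   # example write time
expires = written + ttl
purgeable = expires + gc_grace

print(expires)    # 2015-06-01 12:00:00
print(purgeable)  # 2015-06-02 00:00:00, earliest a compaction may drop it
```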
DESC keyspace -> http://pastebin.com/RW4rU76m
Am I doing anything wrong? Is there a way to mitigate this?
UPDATE: When I trigger a compaction manually, I see a drastic reduction in disk usage. It went from ~40GB to ~16GB. I also posted on the Cassandra user list and was suggested to move to a more recent version of Cassandra. Apparently in 2.1.4 this might be causing older data not to be dropped: https://issues.apache.org/jira/browse/CASSANDRA-8359
For about a month I have been seeing the following values of used space for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster's nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can see a 25% difference (~254 GB) between live and total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
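The gap can be computed directly from the cfstats numbers quoted above; the difference is space held by obsolete (already-compacted) sstables that has not yet been freed:

```python
# Difference between total and live space from the cfstats output above.
live = 787858513883
total = 1060488819870

stale = total - live
print(stale)                          # 272630305987 bytes (~254 GiB)
print(round(stale / total * 100, 1))  # ~25.7% of the total
```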
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the total value has stayed like this for a month on all three nodes. I was relying on Cassandra to normalize the data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try in order to clean up the garbage and free space?
OK, I have a solution. It looks like a Cassandra issue.
First, I dug into the Cassandra 1.1.9 sources and noted that Cassandra performs some re-analysis of SSTables during node startup: it removes the SSTables marked as compacted, recalculates the used space, and does some other work.
So what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started, and used space is now decreasing.
Leveled compaction creates sstables of a fixed, relatively small size, in your case 100 MB, that are grouped into “levels”. Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So basically, from this statement in the Cassandra docs, we can conclude that perhaps in your case the ten-times-larger next level has not formed yet, resulting in no compaction.
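The tenfold growth per level is easy to make concrete with the 100 MB sstable size from the schema above (a sketch; actual level sizing also depends on the strategy's internals):

```python
# Approximate level capacities under LeveledCompactionStrategy with 100 MB
# sstables: each level holds ten times the data of the previous one.
sstable_mb = 100
for level in range(1, 5):
    capacity_mb = sstable_mb * 10 ** level
    print("L%d: ~%d MB (%d sstables)" % (level, capacity_mb, 10 ** level))
```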
Coming to the second question: since you have set the replication factor to 3, the data has 3 copies, which is why you see this anomaly.
And finally, the 25% difference between live and total space is, as you know, due to delete operations.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100 MB is going to cause you a lot of needless disk IO, and it will take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job of cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2, especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your dev/QA environment. 1.2 still has some kinks being ironed out, so you will want to make sure to keep up to date with new versions, or even run off the 1.2 branch in git, but for your data size and usage pattern I think it will give you some definite improvements.
In a major compaction, all SSTables from a region server (HBase) or from a node (Cassandra) are merged into one big one.
When that time comes, are many SSTables (total space above 1 TB) merged into one?
Maybe there are some range bounds for an SSTable or HFile that split it into several parts, to ensure that merge operations don't "rewrite the whole server"?
My question is related to "Compaction" section of this link http://wiki.apache.org/cassandra/MemtableSSTable
From what I found, the SSTable produced by a major compaction is actually not split in Cassandra. Other LSM-tree databases rely in this case on a distributed file system which splits the SSTable (or HFile, or CellStore in Hypertable) into several files (for example 64 MB each), but a major compaction must still compact all of these files into one new SSTable (which I think is inefficient).
There are tickets in JIRA to improve and redesign compaction in Cassandra, for example:
https://issues.apache.org/jira/browse/CASSANDRA-1608
You may also want to read my second, similar question:
How much data per node in Cassandra cluster?