removing a node from the cluster and tables in twcs - cassandra

I have a cluster (tested with both 2.1.14 and 3.0.17) in which one table uses TWCS (TimeWindowCompactionStrategy). All SSTables are kept in the correct windows just fine until I remove a node from the cluster (in the same DC). At that moment it seems all SSTables are treated as one pool for normal size-tiered compaction, causing SSTables from different time periods to be merged. Since my cluster is 400 nodes spread over 6 datacenters, node removal is quite common.
I did not find any bug report describing this. Is this expected behavior? Having all the SSTables handled together causes a major problem space-wise, since it means new and old data end up in the same SSTable, causing the old data to remain on disk much longer.
(On 2.1, TWCS is provided by a jar from jeffjirsa's GitHub.)

Have you disabled read repair on the TWCS tables? It could inject out-of-sequence timestamps. TWCS itself will do size-tiered compaction, but only on the current window, iff it falls behind in compaction.
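The window behavior discussed in this thread can be illustrated with a small Python model. This is only a sketch, not Cassandra's code: it assumes each SSTable is assigned to a window by its newest write timestamp, and the names window_start and bucket_by_window are made up for the illustration.

```python
from collections import defaultdict


def window_start(ts_seconds, window_seconds):
    """Return the start of the time window that contains ts_seconds."""
    return ts_seconds - (ts_seconds % window_seconds)


def bucket_by_window(sstables, window_seconds):
    """Group (name, max_timestamp) pairs by the window of their newest write.

    Assumption: bucketing uses the maximum timestamp in each SSTable.
    """
    buckets = defaultdict(list)
    for name, max_ts in sstables:
        buckets[window_start(max_ts, window_seconds)].append(name)
    return dict(buckets)


# Hypothetical SSTables with one-hour windows.
sstables = [("a", 10), ("b", 3599), ("c", 3600), ("d", 7300)]
print(bucket_by_window(sstables, 3600))
# {0: ['a', 'b'], 3600: ['c'], 7200: ['d']}
```

In this model, an SSTable produced by merging old and new data carries a recent maximum timestamp, so it lands in the newest window and the old data inside it cannot be dropped until that window expires.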


Why is forcing major compaction on a table not ideal?

Consider a scenario where a table has partitions with thousands of deleted rows. When reading from the table, Cassandra has to scan over thousands of deleted rows before it gets to the live rows.
A common workaround is to manually run a compaction on a node to forcibly get rid of tombstones.
What are the downsides of forcing major compaction on a table (with nodetool compact) and what is the best practice recommendation?
When forcing a major compaction on a table configured with the SizeTieredCompactionStrategy (STCS), all the SSTables on the node get compacted together into a single large SSTable. Due to its size, the resulting SSTable will likely never get compacted out since similar-sized SSTables are not available as compaction candidates. This creates additional issues for the nodes since tombstones do not get evicted and keep accumulating, affecting the cluster's performance.
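To see why the large SSTable is left without compaction partners, here is a simplified Python model of size-tiered bucketing. It is a sketch under assumptions, not Cassandra's implementation: it uses the STCS options bucket_low, bucket_high and min_threshold with their default values (0.5, 1.5 and 4), ignores min_sstable_size, and the function names are hypothetical.

```python
def size_tiered_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size.

    An SSTable joins a bucket when its size lies between bucket_low and
    bucket_high times the bucket's current average size.
    """
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low < size < avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mb, min_threshold=4):
    """Only buckets holding at least min_threshold SSTables get compacted."""
    return [b for b in size_tiered_buckets(sizes_mb) if len(b) >= min_threshold]


# One 100 GB SSTable left by a major compaction, plus four small new ones.
sizes = [100000, 10, 12, 11, 9]
print(size_tiered_buckets(sizes))    # [[9, 10, 11, 12], [100000]]
print(compaction_candidates(sizes))  # [[9, 10, 11, 12]]
```

The big SSTable sits alone in its bucket, so in this model it is never selected, and the tombstones inside it are never rewritten.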
We understand that cluster administrators use major compaction as a way of evicting tombstones which have accumulated as a result of high-delete workloads which in most cases is due to an incorrect data model.
The recommendation in this post does NOT constitute a solution to the underlying issue users face. It should not be considered a long-term fix to the data model problem.
In Apache Cassandra 2.2, CASSANDRA-7272 introduced a huge improvement which splits the output of nodetool compact into multiple files which are 50% then 25% then 12.5% of the original table size until the smallest chunk is 50MB for tables using STCS.
When using major compaction as a last resort for evicting tombstones, use the --split-output flag (or its shorthand, -s) to take advantage of this new feature:
$ nodetool compact --split-output -- <keyspace> <table>
NOTE - This feature is only available from Cassandra 2.2 and newer versions.
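As a rough illustration of the 50% / 25% / 12.5% splitting described above, the following Python sketch computes the output sizes for a given amount of data. It is a model of the description only, not Cassandra's code; the function name split_output_sizes is hypothetical, and putting whatever is left over into one final SSTable is an assumption.

```python
MIN_CHUNK_MB = 50  # smallest chunk size mentioned in the text


def split_output_sizes(total_mb, min_chunk_mb=MIN_CHUNK_MB):
    """Return output SSTable sizes: 50%, 25%, 12.5%, ... of total_mb.

    Halving stops once the next chunk would be smaller than min_chunk_mb.
    Assumption: the remaining data goes into one last SSTable.
    """
    sizes = []
    remaining = float(total_mb)
    chunk = total_mb / 2
    while chunk >= min_chunk_mb:
        sizes.append(chunk)
        remaining -= chunk
        chunk /= 2
    if remaining > 0:
        sizes.append(remaining)
    return sizes


print(split_output_sizes(800))  # [400.0, 200.0, 100.0, 50.0, 50.0]
print(split_output_sizes(60))   # [60.0]
```

Because the outputs come in several different sizes, they are more likely to find similar-sized peers in later size-tiered compactions than a single huge SSTable would be.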
Also see How to split large SSTables on another server. Cheers!

Does nodetool compact move everything into one SSTable

The Cassandra compaction process reduces the number of SSTables (data files on disk) used to store data. Minor compactions occur automatically. You can tell Cassandra to perform a major compaction using the nodetool compact command.
Does running nodetool compact merely perform one round of compaction, reducing the number of SSTables, but perhaps still resulting in there being several SSTables? Or does it always compact all the SSTables (of a column family) into one SSTable?
It would depend on the compaction strategy you set for the table.
For DateTieredCompactionStrategy and LeveledCompactionStrategy, by definition I don't think even a major compaction would combine all the SSTables since that would go against the structure of SSTables they aim to create.
For the default SizeTieredCompactionStrategy, anecdotally it appears a major compaction will combine the SSTables into a single SSTable. I ran cassandra-stress -write and watched the SSTables for a while. I could see the minor compactions combining SSTables of similar sizes, but not collapsing dissimilar sizes into one.
Then when I ran nodetool compact on the table, it combined SSTables of dissimilar sizes into a single SSTable. I'm not sure if that would be true in all cases.
Taking a quick look at the source, it calls cfStore.getCompactionStrategy().getMaximalTask(gcBefore), which returns a list of tasks that it executes. That implies it will compact everything, but I didn't drill down any deeper than that.

What does Cassandra nodetool repair exactly do?

From what I have read, I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
but how does it fix the inconsistencies? It says it uses Merkle trees, but that's for comparison, not for fixing 'broken' data.
How can the data become 'broken'? Are there any common cases besides hard drive failure?
A side question: it's compaction that evicts tombstones, right? So the requirement to run nodetool repair more frequently than gc_grace_seconds is only to ensure that all data is spread to the appropriate replicas? Shouldn't that be the usual scenario?
The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, disk failure, etc.
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
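The comparison step can be sketched in Python as follows. This is a toy model, not Cassandra's repair code: the use of SHA-256, the representation of a replica as a list of per-range dictionaries, and the function names are all assumptions made for the illustration. A real repair walks down the tree rather than comparing every leaf.

```python
import hashlib


def leaf_hash(rows):
    """Hash all rows of one token range in a deterministic order."""
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(repr((key, rows[key])).encode("utf-8"))
    return h.digest()


def root_hash(leaves):
    """Combine leaf hashes pairwise until a single root hash remains."""
    level = list(leaves)
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            right = level[i + 1] if i + 1 < len(level) else b""
            paired.append(hashlib.sha256(level[i] + right).digest())
        level = paired
    return level[0]


def ranges_to_stream(replica_a, replica_b):
    """Return indexes of token ranges whose contents differ.

    Assumes both replicas are split into the same number of ranges.
    """
    leaves_a = [leaf_hash(r) for r in replica_a]
    leaves_b = [leaf_hash(r) for r in replica_b]
    if root_hash(leaves_a) == root_hash(leaves_b):
        return []  # roots match, nothing to repair
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b)) if a != b]


# Each row is key -> (value, write timestamp); node_b missed the write of k2.
node_a = [{"k1": ("v1", 10)}, {"k2": ("v2", 20)}]
node_b = [{"k1": ("v1", 10)}, {}]
print(ranges_to_stream(node_a, node_b))  # [1]
```

Only the ranges reported as different need to be streamed between the replicas, which is why the trees are used for comparison while streaming does the actual fixing.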
The reason you need to run repair within gc_grace_seconds is so that tombstones will be synced to all nodes. If a node is missing a tombstone, then it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.
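The resurrection scenario can be reproduced with a tiny Python model. It is a sketch under assumptions, not Cassandra's logic: a replica is a dictionary of key -> (value, timestamp), a tombstone is represented by the value None, repair is modeled as a last-write-wins merge, and compact and repair are hypothetical helper names.

```python
TOMBSTONE = None  # assumption: a deleted cell is stored with value None


def compact(replica, now, gc_grace_seconds):
    """Drop tombstones that are older than gc_grace_seconds."""
    return {
        key: (value, ts)
        for key, (value, ts) in replica.items()
        if not (value is TOMBSTONE and now - ts > gc_grace_seconds)
    }


def repair(a, b):
    """Merge two replicas, keeping the newest cell for every key."""
    merged = {}
    for key in set(a) | set(b):
        cells = [r[key] for r in (a, b) if key in r]
        merged[key] = max(cells, key=lambda cell: cell[1])
    return merged


GC_GRACE = 864000  # 10 days, in seconds
has_tombstone = {"row": (TOMBSTONE, 100)}   # received the delete at t=100
missed_delete = {"row": ("old value", 50)}  # never saw the delete

# Repair within gc_grace_seconds: the tombstone wins on both nodes.
print(repair(has_tombstone, missed_delete))  # {'row': (None, 100)}

# Repair after the tombstone was compacted away: the old value comes back.
after_gc = compact(has_tombstone, now=100 + GC_GRACE + 1, gc_grace_seconds=GC_GRACE)
print(repair(after_gc, missed_delete))  # {'row': ('old value', 50)}
```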

Leveled Compaction Strategy with low disk space

We have Cassandra 1.1.1 servers with Leveled Compaction Strategy.
The system performs read and delete operations. Every half a year we delete approximately half of the data while new data comes in. Sometimes disk usage goes up to 75% while we know that the real data takes about 40-50%; the rest of the space is occupied by tombstones. To avoid filling the disk, we force compaction of our tables by dropping all SSTables to level 0: we remove the .json manifest file and restart the Cassandra node. (The gc_grace option does not help, since compaction starts only after a level is filled.)
Starting from Cassandra 2.0, the manifest file was moved into the SSTable file itself.
We are considering migrating to Cassandra 2.x, but we are afraid we will no longer be able to force leveled compaction this way.
My question is: how can we give our table a disk space limit, e.g. 150GB, so that exceeding the limit triggers compaction automatically? The question is mostly about Cassandra 2.x, but alternative solutions for Cassandra 1.1.1 are also welcome.
It seems like I've found the answers myself.
There is a tool, sstablelevelreset, available starting from version 2.x, which performs a level reset similar to deleting the manifest file. The tool is located in the tools directory of the Cassandra distribution, e.g. apache-cassandra-2.1.2/tools/bin/sstablelevelreset.
Starting from Cassandra 1.2, there is tombstone removal support for Leveled Compaction Strategy, which supports the tombstone_threshold option. It makes it possible to set the maximum ratio of tombstones in an SSTable.
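The effect of tombstone_threshold can be sketched as a simple check in Python. This is an approximation for illustration only: it uses the option names tombstone_threshold and tombstone_compaction_interval with their default values (0.2 and one day), leaves out the overlap checks Cassandra also performs, and the function name is hypothetical.

```python
def worth_tombstone_compaction(
    droppable_tombstone_ratio,
    sstable_age_seconds,
    tombstone_threshold=0.2,
    tombstone_compaction_interval=86400,
):
    """Decide whether a single SSTable should be compacted on its own.

    The SSTable qualifies when its estimated ratio of droppable tombstones
    exceeds tombstone_threshold and it is older than the configured interval.
    """
    old_enough = sstable_age_seconds > tombstone_compaction_interval
    return old_enough and droppable_tombstone_ratio > tombstone_threshold


print(worth_tombstone_compaction(0.5, 7 * 86400))  # True
print(worth_tombstone_compaction(0.1, 7 * 86400))  # False
```

This does not give the table a hard disk space limit, but it lets compaction reclaim tombstone-heavy SSTables without waiting for a level to fill.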

Datastax Cassandra Remove and cleanup one column family

After some IT cleanup, we are noticing that we should probably do a full cleanup / restore for one column family. We believe that Cassandra has duplicate data that it is not cleaning up. Is it possible to clear out and just have Cassandra rebuild a single column family from scratch or a snapshot?
During an upgrade some of the nodes decided to rejoin the cluster, rather than just restarting. During that process nodetool netstats showed that nodes were transferring new data files to the original nodes. The cluster is stable, but the disk usage grew substantially. I am thinking that we will migrate to a new ring, but in the meantime I would like to see if I can reduce some disk usage. The ring is stable, and repairs are looking fine.
If we are able to cleanup one cf it would relieve disk space usage a ton.
nodetool cleanup is not reducing the size of the sstables.
If we have a new node join the cluster, it uses approximately 50% of the disk space that the other nodes use.
We could do the dance of nodetool decommission && nodetool join, but that is not going to be fun :)
We have validated that the data in the ring is consistent, and repairs show that the data is consistent across the ring.
Adding a new node and successfully running repair means the data for the partition range(s) that has(have) been assigned to that node has been streamed to the new node.
If, after this has happened, you run nodetool cleanup, any data from the other nodes that is no longer needed is cleaned up.
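What cleanup does can be pictured with a short Python sketch. It is a simplified model, not Cassandra's implementation: tokens are plain integers, each owned range is treated as (start, end] without wraparound, and the names owns and cleanup are hypothetical.

```python
def owns(token, owned_ranges):
    """True if token falls into any (start, end] range the node owns."""
    return any(start < token <= end for start, end in owned_ranges)


def cleanup(rows_by_token, owned_ranges):
    """Keep only the rows whose token the node is still responsible for."""
    return {t: row for t, row in rows_by_token.items() if owns(t, owned_ranges)}


# After a new node took over (50, 100], this node only owns (0, 50].
rows = {10: "a", 40: "b", 90: "c"}
print(cleanup(rows, [(0, 50)]))  # {10: 'a', 40: 'b'}
```

In this model cleanup only removes rows outside the node's ranges, so it frees no space on a node whose ownership has not changed.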
If you still see that some of your nodes have more data than others, this may be because you have some wider rows in some of your partitions, or because your nodes are unbalanced. There should not be any data duplication scenario (if you can prove this then it would be jira worthy).
You can run rebalance in OpsCenter or manually re-assign your tokens if you are looking to spread out the data more evenly across your nodes (or design your data model to avoid the aforementioned wide rows).
Use nodetool compact to clean up the tombstones and compact all the updated records into a single record.
$ nodetool compact
