I have a 3-node Cassandra cluster (version 3.11.11) with a replication factor of 3. Only 2 of the nodes are receiving requests, and Node3 only syncs with the other 2 nodes.
In theory, each node should have the same data size. But in practice, I end up with nodes with different data sizes as shown in the picture.
We run nodetool repair daily; operations like compaction are done automatically with default settings.
What can be the reason for the size difference?
It ultimately comes down to how data gets compacted in the long run. Compaction is a local process, and how SSTables stack up on each node cannot be guaranteed, so I don't see anything abnormal here. In theory all nodes hold the same data logically, but physically it may vary. For example, Node3 may have old SSTables that are not getting compacted because of their size (if using STCS), while on the other nodes those SSTables have already been compacted, reducing the size on those nodes.
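One way to check this per node (a sketch; the keyspace and table names are placeholders) is to compare SSTable counts and on-disk sizes, and to confirm which compaction strategy the table uses:

nodetool tablestats my_keyspace.my_table | grep -E 'SSTable count|Space used'

SELECT compaction FROM system_schema.tables WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table';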
We have OpenNMS sending graph data to our Cassandra/Newts cluster, which consists of 2 Cassandra nodes. I've set the replication factor to 2 for the keyspace "newts".
I started the nodes at the same time and left them up for some time. I then ran "nodetool cfstats newts" on each node, and both nodes have the exact same write count.
If, however, I go into the data directory "/db/newts" on each node and run "du -h", I see the following:
Node1 storage used: 36K
Node2 storage used: 12M
How can they differ in size if I set the replication factor to 2? I know that they're connected to the same cluster: "nodetool status" shows both nodes as "UN" (Up/Normal).
In Cassandra, data is not written directly to the hard drive; it lives in:
Commit log >> Memtable >> SSTables
Here you can find good documentation on how data is written.
You can run:
nodetool flush
which will flush the memtables to SSTables. After that you should be able to see roughly the same SSTable size on both of your nodes.
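For example (the host names below are placeholders), you could flush the keyspace on both nodes and then compare the directory sizes again:

nodetool -h node1 flush newts
nodetool -h node2 flush newts
du -sh /db/newts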
Installing Cassandra on a single node to run some tests, we noticed that we were using an RF of 3 and everything was working correctly.
This is of course because that node has 256 vnodes (by default), so the same data can be replicated on the same node in different vnodes.
This is worrying because if one node were to fail, you'd lose all your data even though you thought the data was replicated on different nodes.
How can I be sure that in a standard installation (with a ring of several nodes) the same data will not be replicated on the same "physical" node? Is there a setting to prevent Cassandra from using the same node for replicating data?
Replication strategy is schema dependent. You probably used the SimpleStrategy with RF=3 in your schema. That means that each piece of data will be placed on the node determined by the partition key, and successive replicas will be placed on the successive nodes. In your case, the successive node is the same physical node, hence you get 3 copies of your data there.
Increasing the number of nodes solves your problem. In general, your data will be placed on different physical nodes when your replication factor RF is less than or equal to your number of nodes N.
The other solution is to switch the replication strategy and use the NetworkTopologyStrategy, usually used in multi-datacenter clusters, where you can specify how many replicas you want in each data center. This strategy
places replicas in the same data center by walking the ring clockwise
until reaching the first node in another rack. NetworkTopologyStrategy
attempts to place replicas on distinct racks because nodes in the same
rack (or similar physical grouping) often fail at the same time due to
power, cooling, or network issues.
Look at DataStax documentation for more information.
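For reference, a minimal CQL sketch of both options (the keyspace and data center names are placeholders, and the data center name must match what your snitch reports):

CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};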
Without vnodes, each physical node owns a single contiguous token range. With vnodes, each physical node owns multiple, non-consecutive token ranges (each such range is a vnode), and vnodes are randomly assigned to physical nodes.
This means that even when data gets replicated to the vnodes right next to the primary replica's node (i.e. when using SimpleStrategy), the replicas will, with high probability but not guaranteed, be on different physical nodes.
This random assignment can be seen in the output of nodetool ring.
More info can be found here.
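For context, vnodes are controlled by num_tokens in cassandra.yaml; 256 has long been the default, while single-token setups set num_tokens to 1 and assign an initial_token explicitly:

num_tokens: 256
# initial_token: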
Cassandra stores replicas of the same data on different nodes; it would be nonsensical to keep multiple replicas of a partition on the same node. If the replication factor exceeds the number of nodes, then the number of nodes effectively becomes your replication factor.
But why is this not an error? Because it allows for provisioning more nodes later.
As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
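A minimal sketch of that workflow (the keyspace name is a placeholder): raise the replication factor first, then run repair so the new replicas actually receive the existing data:

ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

and then, on each node:

nodetool repair my_keyspace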
I have a keyspace with the replication factor set to 3, but I have only a single node. Will the disk space used then be 3 times the data size? Since the replicas are not yet assigned to any other nodes, will Cassandra hold off on creating replicas until new nodes join the cluster?
No, the disk space used would not be three times the size. The single node would own the entire token range and all writes would be written to that single node once.
What happens with the writes for the other two replicas depends on whether those nodes were previously present in the cluster and are currently down, or whether they have never been added to the cluster. If they have never been added, then C* will just skip trying to write to them.
If they had been added but are currently down, and if you have hinted handoffs enabled and are still within the hinted handoff window, then C* will store hints for the down nodes on the single up node.
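For reference, this behaviour is governed by the hint settings in cassandra.yaml; the values below are the 3.x defaults:

hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours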
It depends on the replication strategy you have used. Assuming your queries are working, you have probably used SimpleStrategy. If you write to such a configuration at a consistency level that requires more than one replica, the write should fail, as it needs to write to 2 additional replica nodes before acknowledging the client; in the case of SimpleStrategy those are the next two clockwise nodes in the ring.
After some IT cleanup, we are noticing that we should probably do a full cleanup / restore for one column family. We believe that Cassandra has duplicate data that it is not cleaning up. Is it possible to clear out and just have Cassandra rebuild a single column family from scratch or a snapshot?
During an upgrade, some of the nodes decided to rejoin the cluster rather than just restarting. During that process, nodetool netstats showed that nodes were transferring new data files to the original nodes. The cluster is stable, but disk usage grew substantially. I am thinking that we will migrate to a new ring, but in the meantime I would like to see if I can reduce some disk usage. The ring is stable, and repairs are looking fine.
If we are able to clean up one CF, it would relieve a ton of disk space usage.
nodetool cleanup is not reducing the size of the sstables.
When a new node joins the cluster, it uses approximately 50% of the disk space of the other nodes.
We could do the dance of nodetool decommission && nodetool join, but that is not going to be fun :)
We have validated that the data in the ring is consistent, and repairs show that the data is consistent across the ring.
Adding a new node and successfully running repair means the data for the partition range(s) assigned to that node has been streamed to the new node.
If, after this has happened, you run nodetool cleanup, any data from the other nodes that is no longer needed is cleaned up.
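If you only want to target the single column family from the question, cleanup can be scoped to it (the keyspace and table names below are placeholders):

nodetool cleanup my_keyspace my_table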
If you still see that some of your nodes have more data than others, it may be because you have some wider rows in some of your partitions, or because your nodes are unbalanced. There should not be any data-duplication scenario (if you can prove one, it would be JIRA-worthy).
You can run rebalance in OpsCenter or manually re-assign your tokens if you are looking to spread out the data more evenly across your nodes (or design your data model to avoid the aforementioned wide rows).
Use nodetool compact to purge tombstones and compact all updated versions of each record into a single record:

nodetool compact
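If you would rather limit the compaction to the one column family instead of everything on the node, the command can be scoped the same way (keyspace and table names are placeholders):

nodetool compact my_keyspace my_table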
I have a 4-node Brisk cluster with 2 Cassandra nodes in the Cassandra DC and 2 Brisk nodes in the Brisk DC. I stress tested this setup for 10 million writes using the stress tool that ships with Cassandra.
On executing
$ ./nodetool -h x.x.x.x compactionstats
pending tasks: 17
compaction type keyspace column family bytes compacted bytes total progress
Major Keyspace1 Standard1 45172473 60278166 74.94%
AFAIK major compaction is manually triggered from nodetool, but I can see that it has been triggered automatically.
Is this a desired behavior? If so what are all the situations this may occur?
Regards,
Tamil
From the doc:
Compactions are triggered when at least N SStables have been flushed
to disk, where N is tunable and defaults to 4.
"Minor" compactions merge sstables of similar size; "major" compactions merge all sstables in a given ColumnFamily.
Again from the doc:
A major compaction is triggered either via nodeprobe, or automatically:
Nodeprobe sends TreeRequest messages to all neighbors of the target
node: when a node receives a TreeRequest, it will perform a readonly
compaction to immediately validate the column family.
Automatic compactions will also validate a column family and broadcast
TreeResponses, but since TreeRequest messages are not sent to
neighboring nodes, repairs will only occur if two nodes happen to
perform automatic compactions within TREE_STORE_TIMEOUT of one
another.
You may find more info here and here