Cassandra compaction: does replication factor have any influence?

Let’s assume that the total disk usage of all keyspaces is 100GB before replication, and the replication factor is 3, making the total physical disk usage 100GB x 3 = 300GB.
We use the default compaction strategy (size-tiered), and let’s assume the worst case where Cassandra needs as much free space as the data itself to complete the compaction. Does Cassandra need 100GB (before replication) or 300GB (100GB x 3, with replication)?
In other words, when Cassandra needs free disk space for performing compaction, does the replication factor have any influence?

Compaction in Cassandra is local to a Node.
Now let's say you have a 3-node cluster, the replication factor is also 3, and the original data size is 100GB. This means that each node holds 100GB worth of data.
Hence, on each node you will need 100GB of free space to compact the data present on that node.
TL;DR: the free space required for compaction is equal to the total data present on the node.

Because data is replicated between the nodes, every node will need up to 100GB of free space - so it's 300GB in total across the cluster, but not on any one node...
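To make the arithmetic above concrete, here is a minimal shell sketch of the check you would run on each node (the data directory /var/lib/cassandra/data is the usual default and an assumption here):

    # "Load" in nodetool status is reported per node: with 3 nodes and RF=3,
    # each node shows roughly the full 100GB, not 300GB.
    nodetool status

    # Worst case for size-tiered compaction: this node needs free space roughly
    # equal to the data it already holds, i.e. ~100GB free locally.
    df -h /var/lib/cassandra/data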

Related

How to purge massive amounts of data from Cassandra when node size is limited

I have a Cassandra cluster (Cassandra v3.11.11) with 3 data centers and replication factor 3. Each node has an 800GB NVMe drive, but one of the data tables is taking up 600GB. This results in the following from nodetool status:
DataCenter 1:
node 1: 620GB (used) / 800GB (total)
DataCenter 2:
node 1: 610GB (used) / 800GB (total)
DataCenter 3:
node 1: 680GB (used) / 800GB (total)
I cannot install another disk in the server, and the server has no internet connection at all for security reasons. The only thing I can do is put some small files like scripts onto it.
All Cassandra tables are set up with SizeTieredCompactionStrategy, so I end up with very big files: roughly one single 200GB SSTable (full table size ~600GB). Since the cluster is running out of space, I need to delete data, which will introduce tombstones. But I have only 120GB of free space left for compaction or garbage collection. That means I can only retain at most 120GB of data, and to be safe for the Cassandra process, maybe even less.
I have already executed nodetool cleanup, so I am not sure if there is more room to free.
Is there any way to free 200GB of data from Cassandra, or can I only retain less than 120GB of data?
(If we remove 200GB of data, we have 400GB of data left, but compaction/GC would need that 400GB of scratch space on top of the ~680GB currently used on the node, which is bigger than the disk...)
Thank you!
I personally would start with checking whether the whole space is occupied by actual data and not by snapshots - use nodetool listsnapshots to list them, and nodetool clearsnapshot to remove them. If you took a snapshot for some reason, then after compaction the snapshots keep occupying space because they hold on to the original files that were removed.
The next step would be to try to clean up tombstones & deleted data from the small tables using nodetool garbagecollect, or nodetool compact with the -s option to split the table into files of different sizes. For the big table I would try to use nodetool compact with the --user-defined option on individual files (assuming there will be enough space for them). As soon as you free more than 200GB, you can use sstablesplit (the node should be down!) to split the big SSTable into small files (~1-2GB), so that when the node starts again, the data will be compacted.
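A rough command sequence for the steps above, as a sketch only: the keyspace/table names (my_ks, small_table, big_table) and the SSTable paths are placeholders, and each command should be checked against the documentation for your exact version before running it on a space-constrained node.

    # 1. Check whether snapshots are holding on to already-compacted SSTables
    nodetool listsnapshots
    nodetool clearsnapshot            # removes all snapshots on this node

    # 2. Purge tombstones / deleted data from the smaller tables
    nodetool garbagecollect my_ks small_table
    # or recompact with split output so no single giant file is produced
    nodetool compact -s my_ks small_table

    # 3. For the big table, compact individual SSTables that still fit in free space
    nodetool compact --user-defined /path/to/my_ks/big_table-*/md-NNNN-big-Data.db

    # 4. Once >200GB is free: stop Cassandra on this node, then split the giant
    #    SSTable offline into ~2GB chunks so it can be compacted normally later
    sstablesplit -s 2048 /path/to/my_ks/big_table-*/md-NNNN-big-Data.db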

Why does a fully-replicated Cassandra cluster have node data size differences?

I have a 3-node Cassandra cluster (version 3.11.11) with replication factor 3. Only 2 of the nodes receive requests, and Node3 only syncs with the other 2 nodes.
In theory, each node should have the same data size. But in practice, I end up with nodes with different data sizes as shown in the picture.
We run nodetool repair daily; operations like compaction are done automatically with default settings.
What can be the reason for the size difference?
It ultimately comes down to how the data gets compacted on each node in the long run. Compaction is a local process, and how SSTables stack up on a given node cannot be guaranteed, so I don't see any aberration here. In theory all nodes hold the same data logically, but physically the size may vary. For example, node3 may have old SSTables that are not getting compacted because of their size (if using STCS), while on the other nodes they have been compacted, reducing the size on those nodes.
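If you want to confirm that the difference is only SSTable layout and not missing data, a quick per-node check could look like the sketch below (my_ks.my_table is a placeholder):

    # Compare per-node SSTable count and space used; a higher count with a similar
    # number of partitions usually just reflects a different compaction history
    nodetool tablestats my_ks.my_table

    # See whether compactions are still pending on the larger node
    nodetool compactionstats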

Cassandra read latency increases while writing

I have a Cassandra cluster whose read latency increases during writes. The writes mostly happen via Spark jobs during the night, in huge bursts. Writes use LOCAL_QUORUM and reads use LOCAL_ONE. Is there a way to reduce read latency while the writes are happening?
Cassandra Cluster Configs
10-node Cassandra cluster (5 in DC1, 5 in DC2)
CPU: 8 cores
Memory: 32GB
Grafana Metrics
I can give some advice:
Use the LCS compaction strategy (a minimal ALTER TABLE sketch follows this list).
Prefer a round-robin load-balancing policy for reads.
Choose the partition key wisely so that requests are not concentrated on a single partition.
Partition size also plays an important role. Cassandra recommends keeping partitions small. However, I have tested with partitions of 10,000 rows, each row about 800 bytes in size, and they worked better than partitions of 3,000 rows (or even 1 row). Very tiny partitions tend to increase CPU usage when the stored data is large in terms of row count. However, very large partitions should be avoided as well.
The replication factor should be chosen strategically. The write consistency level should be decided considering the replication of all keyspaces.
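For the first point, a minimal sketch of switching an existing table to LCS via cqlsh (keyspace and table names are placeholders; the switch itself triggers a round of recompaction, so schedule it outside the nightly write bursts):

    cqlsh -e "ALTER TABLE my_ks.my_table
              WITH compaction = {'class': 'LeveledCompactionStrategy'};"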

How Cassandra In-memory feature works

I have a few questions with regard to the In-Memory feature in Cassandra:
1.) I have a 4-node datacenter, and in OpsCenter, under memory usage, it shows there is 100GB of In-Memory available. Does it mean that each of the 4 nodes has 100GB of memory available, or is the 100GB the total In-Memory capacity for my datacenter?
2.) If 100GB really is available for In-Memory in a datacenter, is it advisable to use the full capacity? Do I need to factor in the replication factor as well? Say I have 15GB of data that I want to store In-Memory; if the replication factor is 2, will we effectively have 30GB of data In-Memory in the datacenter?
3.) In the dse.yaml file there is a property, "max_memory_to_lock_fraction", whose value is a percentage of system memory and defaults to 20%. As per the DataStax guidelines, we need to ensure that In-Memory usage does not exceed 45% of the total available system memory on each node. Is "max_memory_to_lock_fraction" the parameter that needs to be set to 45%?
4.) The DataStax documentation says compression needs to be removed for In-Memory tables. If compression is indeed set, will it affect read/write performance?
5.) The output of dsetool inmemorystatus has a parameter called "Current Total memory not able to lock". Does the value of this parameter denote the available memory? For example, if the value is 1024MB, does it mean that 1GB of In-Memory capacity is still available for use?
I am using DSE 4.8.11 version. Please help me as I am trying to understand this feature so as to leverage it best.
Thanks in advance.
1) It depends on how you configure it: it can be per cluster (all of the available memory), or you can view graphs of individual nodes.
2) Yes, the replication factor multiplies the total data by that factor. You will have to factor that in at the cluster level. A very nice tool to help you get started: https://www.ecyrd.com/cassandracalculator/
3) Yes, max_memory_to_lock_fraction is what you are looking for.
4) It will increase processing time, and since writes in Cassandra are actually CPU-bound, this might not be the best idea performance-wise.
5) Yes, this means there is still memory (of the specified amount), but due to the settings Cassandra is unable to lock it.
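As a small sketch for point 3, assuming the default dse.yaml location (/etc/dse/dse.yaml; package installs may differ) and that the property is expressed as a fraction, as its 20% default above suggests:

    # Check the currently configured fraction of system memory DSE may lock
    grep max_memory_to_lock /etc/dse/dse.yaml

    # To follow the 45% guideline, set the following in dse.yaml and restart DSE:
    #   max_memory_to_lock_fraction: 0.45

    # Then verify how much memory is actually locked vs. not lockable
    dsetool inmemorystatus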

What to consider when increasing disk size on Cassandra nodes?

I run a 10-node Cassandra cluster in production: 99% writes, 1% reads, 0% deletes. The nodes have 32 GB RAM; C* runs with an 8 GB heap. Each node has an SSD for the commitlog and 2x4 TB spinning disks for data (SSTables). The schema uses key caching only. The C* version is 2.1.2.
It can be predicted that the cluster will run out of free disk space before too long, so its storage capacity needs to be increased. The client prefers increasing disk size over adding more nodes, so the plan is to take the 2x4 TB spinning disks in each node and replace them with 3x6 TB spinning disks.
Are there any obvious pitfalls/caveats to be aware of here? Like:
Can C* handle up to 18 TB data size per node with this amount of RAM?
Is it feasible to increase the disk size by mounting a new (larger) disk, copying all SSTables to it, and then mounting it on the same mount point as the original (smaller) disk (to replace it)?
I would recommend adding nodes instead of increasing the data size of your current nodes. Adding nodes would take advantage of Cassandra's distribution feature by keeping nodes small and easily replaceable.
Furthermore, the recommended data size for a single node with spinning disks is around 1 TB. Once you go higher than that, I can only imagine that performance will decrease significantly.
Not to mention that if a node loses its data, it will take a long time to recover, as it has to stream a huge amount of data from the other nodes.
Can C* handle up to 18 TB data size per node with this amount of RAM?
This depends heavily on your workload.
Is it feasible to increase the disk size by mounting a new (larger) disk, copying all SSTables to it, and then mounting it on the same mount point as the original (smaller) disk (to replace it)?
I don't see a reason why it would not work.
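If you do go the disk-swap route, a rough per-node outline might look like the sketch below; the mount points and device name are placeholders, and this is an untested sketch rather than a vetted procedure:

    # Flush memtables and stop accepting traffic, then stop the Cassandra service
    nodetool drain
    sudo service cassandra stop

    # Copy the contents of the data directory onto the new, larger disk
    # (temporarily mounted at /mnt/newdisk)
    sudo rsync -a /var/lib/cassandra/data/ /mnt/newdisk/

    # Swap the mounts: detach both, then mount the new disk at the original path
    sudo umount /mnt/newdisk
    sudo umount /var/lib/cassandra/data
    sudo mount /dev/sdX1 /var/lib/cassandra/data   # /dev/sdX1 = new disk (placeholder)

    # Start Cassandra again; hinted handoff / repair covers writes missed meanwhile
    sudo service cassandra start
    nodetool status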
It's an anti-pattern in Cassandra; being a distributed database is a key feature of Cassandra.
