How does hard disk space allocation work in Cassandra?

Let's say I have 4 identical servers with 300GB hard drive space and a replication factor of 2 (so basically 2 300GB nodes, each replicated on another physical machine with 300GB space), how does the space allocation work across these nodes?
For instance, imagine the 300GB on nodes 1 and 2 (node 2 being the replica of node 1) is completely used by Cassandra plus another application that also consumes disk space, while the second pair (nodes 3 and 4) still has free disk space since they run only Cassandra and nothing else. Would Cassandra store new entries on those nodes instead, given that the first two nodes are out of disk space, or would it blow up?
Broadening the situation across multiple servers in a rack, would Cassandra intelligently manage disk space requirements and put the data on nodes with more free storage space? Similarly, would it be able to work with servers with varying storage spaces? (some 600GB, some 300GB, etc.).
Many thanks,

Cassandra does not allocate data by available space. It places data on nodes based on the hash of the partition key, so there can be no intelligent live balancing of where data should go; a node that fills its disk will start failing writes rather than shunting data to emptier nodes.
To do approximate balancing you can change the size of the token range a particular node is responsible for (non-vnodes) or adjust its number of vnodes. This all has to be done manually.
These settings go in cassandra.yaml.
Example vnodes:
Node 1: num_tokens: 128
Node 2: num_tokens: 128
Node 3: num_tokens: 256
Node 4: num_tokens: 256
Example non-vnodes (using an illustrative full range of 100):
Node 1: initial_token: 15
Node 2: initial_token: 30
Node 3: initial_token: 65
Node 4: initial_token: 100
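On a real Murmur3Partitioner ring the full range is -2^63 to 2^63-1 rather than 100. A minimal sketch of computing initial_token values, including unequal weights for the approximate balancing described above (the weighting helper is my own illustration, not a Cassandra tool):

```python
# Sketch: compute initial_token values for a Murmur3Partitioner ring.
# Equal weights give each node an equal share of the ring; a larger weight
# gives that node a proportionally larger token range (more data).

RING_MIN = -2**63          # Murmur3Partitioner tokens span [-2^63, 2^63 - 1]
RING_SIZE = 2**64

def initial_tokens(weights):
    """Return one token per node, each range sized proportionally to its weight."""
    total = sum(weights)
    tokens, cursor = [], 0
    for w in weights:
        tokens.append(RING_MIN + RING_SIZE * cursor // total)
        cursor += w
    return tokens

# Four equally sized nodes:
print(initial_tokens([1, 1, 1, 1]))
# → [-9223372036854775808, -4611686018427387904, 0, 4611686018427387904]

# Nodes 3 and 4 take twice the range of nodes 1 and 2:
print(initial_tokens([1, 1, 2, 2]))
```

The equal-weight case reproduces the standard evenly spaced tokens for a 4-node Murmur3 cluster.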

Related

What happens if one DC in Cassandra runs out of physical memory?

I'm new to Cassandra and I'm asking myself what will happen if I have multiple datacenters and at some point one datacenter doesn't have enough storage left to hold all the data.
Assume we have 2 DCs. The first DC can store 1 TB and the second DC can only hold 500 GB. Furthermore, let's say we have a replication factor of 1 in both DCs. As I understand it, both DCs cover the complete token ring, so each DC will hold the full data set. What happens now if I push data to DC1 and the total amount of storage needed exceeds 500 GB?
For simplicity, assume you write your data through DC1, so that is the local DC in every scenario; DC2, which is effectively down, is remote the whole time. What really matters here is the consistency level you use for your writes:
a LOCAL-type consistency level (LOCAL_QUORUM, ONE, LOCAL_ONE): you can still write your data.
a REMOTE-type consistency level (ALL, EACH_QUORUM, QUORUM, TWO, THREE): you cannot write your data.
I suggest reading up on consistency levels.
Also, a quick test using ccm and the cassandra-stress tool can help reproduce the different scenarios.
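A rough way to reason about the two groups above: a write succeeds when enough replicas are reachable to satisfy the consistency level. A toy model (not driver code) assuming RF = 1 in each of two DCs, with DC2 unreachable:

```python
# Toy model of write availability per consistency level.
# Assumes RF = 1 in each of two DCs (rf_total = 2) and DC2 unreachable.

def write_succeeds(cl, live_local=1, live_remote=0, rf_local=1, rf_total=2):
    """Return True if enough replicas are up to satisfy the consistency level."""
    live_total = live_local + live_remote
    required = {
        "ONE": 1,
        "LOCAL_ONE": 1,            # must be satisfied from the local DC
        "LOCAL_QUORUM": rf_local // 2 + 1,
        "QUORUM": rf_total // 2 + 1,
        "EACH_QUORUM": None,       # needs a quorum in *every* DC, handled below
        "ALL": rf_total,
        "TWO": 2,
        "THREE": 3,
    }[cl]
    if cl in ("LOCAL_ONE", "LOCAL_QUORUM"):
        return live_local >= required
    if cl == "EACH_QUORUM":
        return (live_local >= rf_local // 2 + 1
                and live_remote >= (rf_total - rf_local) // 2 + 1)
    return live_total >= required

for cl in ("LOCAL_ONE", "LOCAL_QUORUM", "ONE", "QUORUM", "EACH_QUORUM", "ALL"):
    print(cl, write_succeeds(cl))
```

With DC2 down, the LOCAL-type levels (and ONE, which any live replica can satisfy) succeed, while QUORUM, EACH_QUORUM and ALL fail, matching the two groups listed above.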
Another comment regarding your free space: once a node hits the 250 GB mark (half of 500 GB) you will run into compaction issues; the usual recommendation is to keep about half the disk free so compactions have room to run.
Let's say, however, that data keeps arriving on that node and it hits the 500 GB mark. Cassandra will simply stop on that node.

Need to "balance" a cassandra cluster

I have a test cluster I built, and upon looking at it when running a nodetool status I have the data distributed amongst the four nodes as such:
--  Address  Load     Tokens  Owns
UN  NODE3    1.61 GB  1       14.6%
UN  NODE2    3.14 GB  1       19.4%
UN  NODE1    7.68 GB  1       63.9%
UN  NODE4    5.85 GB  1       2.0%
Now all nodes were added before I ingested data into the database, but I think I probably screwed up by not manually setting the token info prior to bringing data into the cluster.
My question is how best to readjust this to more evenly distribute the data?
If you are not using Vnodes (which you are not since you have 1 token per node), you can move tokens on each node to evenly distribute your ring. To do this, do the following:
Calculate the tokens to assign to your nodes based on the number of nodes in your cluster. You can use a tool like the Cassandra token calculator. With Murmur3Partitioner and 4 nodes you could use: -9223372036854775808, -4611686018427387904, 0, 4611686018427387904.
One node at a time, use nodetool move to move the node to its new token (e.g. nodetool move -- 0) and wait for it to complete. This may take a while. It is also wise to choose which node moves to which token based on its current proximity to the target token.
Usually it's a good idea to run nodetool cleanup on each node afterwards, to clean up data no longer belonging to that node.
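The "move by proximity" point can be sketched like this (a toy illustration, not nodetool logic; the current token values are made up):

```python
# Sketch: pick which node should move to which target token so that total
# token movement around the ring is minimized ("move by proximity").
from itertools import permutations

RING = 2**64  # size of the Murmur3 token ring

def ring_distance(a, b):
    """Shortest distance between two tokens on the wrapped ring."""
    d = abs(a - b) % RING
    return min(d, RING - d)

def best_assignment(current, targets):
    """Try every node->target pairing, keep the one with least total movement."""
    best = min(permutations(targets),
               key=lambda p: sum(ring_distance(c, t) for c, t in zip(current, p)))
    return dict(zip(current, best))

# Hypothetical current tokens of an unbalanced 4-node ring:
current = [-9000000000000000000, -5000000000000000000, 100, 4000000000000000000]
# The evenly spaced Murmur3 targets from the step above:
targets = [-9223372036854775808, -4611686018427387904, 0, 4611686018427387904]
print(best_assignment(current, targets))
```

Each node ends up assigned to the evenly spaced token nearest to where it already sits, so each nodetool move streams as little data as possible.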
+1 on Andy's points. I'd like to add one more thing:
to get an accurate Owns %, you must specify a keyspace, i.e. nodetool status <ks>

Cassandra Partial Replication

This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens), each DC is storing roughly 25% of the key space, and that 25% is replicated on each of the other DCs, meaning every single row has 4 copies overall.
For instance, if my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each DC's share is replicated only once or twice, instead of in the total number of DCs (x3)?
For example: the 25% of the key space stored on DC1 I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if DC1's 25% ends up split over multiple DCs; I just want to use NetworkTopologyStrategy while reducing storage costs.
Is it possible ?
Thank you
Best Regards
Your keyspace command shows that each DC holds 1 copy of the data. This means that if you have 1 node in each DC, each node will have 100% of your data. So I am not sure how you concluded that each DC stores only 25% of the keys, when in fact each stores 100%. Chances are that when you run nodetool you are not specifying the keyspace, so the command shows a load based on the token range assigned to each node, which is misleading for a NetworkTopology setup. Try running it with your keyspace name and see if you notice the difference.
I don't think any of the existing snitches lets you shift data around DCs the way you want. If you really wanted an even distribution, and you had an equal number of nodes in each DC with evenly spaced initial tokens, you could use SimpleSnitch to achieve it: change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopologyStrategy works is that if you specify DC1:1 and you have, for example, 2 nodes in DC1, it will distribute the keys evenly across those 2 nodes, for an effective load of 50% on each. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each of 3 DCs, so you can discard one DC and save money. I am saying this because these DCs of yours are presumably virtual, in the NetworkTopology sense, rather than real physical DCs; no one would want only 25% of the data in a single DC, as that would not be an available setup. So, if your nodes are grouped into virtual DCs, I recommend grouping them into 3 racks instead and maintaining 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd4-ra_4 rack-a
nd4-rb_4 rack-b
nd4-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a, b and c will hold 100% of your data (each node within a rack holding 25%).
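Why does DC1:3 across three racks land one copy in each rack? A much-simplified sketch of NetworkTopologyStrategy's rack-aware replica walk (illustrative only, not Cassandra's actual code; node names and tokens are made up):

```python
# Simplified NetworkTopologyStrategy placement: walk the ring clockwise from
# the key's token, taking replicas while preferring racks not yet used.
def place_replicas(ring, key_token, rf):
    """ring: list of (token, node, rack) tuples; returns rf replica nodes."""
    ordered = sorted(ring)
    # start at the first node whose token is at/after the key's token (wrap to 0)
    start = next((i for i, (t, _, _) in enumerate(ordered) if t >= key_token), 0)
    walk = ordered[start:] + ordered[:start]
    replicas, racks_used, skipped = [], set(), []
    for token, node, rack in walk:
        if len(replicas) == rf:
            break
        if rack in racks_used:
            skipped.append(node)   # only used if we run out of distinct racks
            continue
        replicas.append(node)
        racks_used.add(rack)
    for node in skipped:           # fill up when there are fewer racks than rf
        if len(replicas) == rf:
            break
        replicas.append(node)
    return replicas

# 12 hypothetical nodes cycling through racks a/b/c, like the layout above:
ring = [(i * 100, f"nd{i}", "rack-" + "abc"[i % 3]) for i in range(12)]
print(place_replicas(ring, key_token=250, rf=3))
```

With RF 3 and three racks, the three replicas chosen for any key always sit in three distinct racks, which is what makes the rack grouping equivalent to the 4-virtual-DC layout the asker started from.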

Heap Size of a Node in Cassandra Cluster

I have a Cassandra setup with a 6-node ring in a single DC, with RF: 6 and read CL: ONE. At times a particular node gets a lot of requests, which (because of the RF) are passed on to all the nodes; it then gets into compaction, CMS GC, and finally ParNew, at which point the whole ring hangs and becomes pretty much unusable. I found that we could only solve this by increasing the heap size, or by tweaking the Cassandra code to read only from the local node (since RF: 6 guarantees every node has the same data, although repair etc. have to be dealt with separately).
How do I calculate the heap size for a Cassandra node (I have two keyspaces with 14 CFs in total, apart from the system CFs)? Per the Cassandra wiki the heap should be at least: memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches, where memtable_throughput_in_mb = 128 MB for my setup. The max row size of a particular CF should matter here. I am not using any row or key cache. Can someone advise?
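Plugging the question's numbers into that wiki formula (a rough sizing heuristic only, and it assumes all 14 CFs count as "hot"):

```python
# Rough heap estimate from the old Cassandra wiki formula:
#   memtable_throughput_in_mb * 3 * number_of_hot_CFs + 1G + internal caches
memtable_throughput_mb = 128
hot_cfs = 14               # assumes all 14 CFs are hot; use fewer if some are idle
base_mb = 1024             # the fixed 1G term from the formula
cache_mb = 0               # no row or key caches in this setup

heap_mb = memtable_throughput_mb * 3 * hot_cfs + base_mb + cache_mb
print(f"~{heap_mb} MB (~{heap_mb / 1024:.2f} GB) minimum heap")
# → ~6400 MB (~6.25 GB) minimum heap
```

So by the wiki's own floor, this setup would want at least roughly 6.25 GB of heap before accounting for per-row overheads.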

How to load balance Cassandra cluster nodes?

I am using Cassandra 0.7.8 on a cluster of 4 machines. I have uploaded some files using Map/Reduce.
It looks like the files got distributed among only 2 of the nodes. When I used RF=3 the data got distributed equally across all 4 nodes with the configuration below.
Here are some config info's:
ByteOrderedPartitioner
Replication factor = 1 (since I have a storage problem; it will be increased later)
initial_token: not set
create keyspace ipinfo with replication_factor = 1 and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
[cassandra#cassandra01 apache-cassandra-0.7.8]$ bin/nodetool -h 172.27.10.131 ring
Address       Status State  Load     Owns   Token
                                            Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.132 Up     Normal 11.92 GB 25.00% Token(bytes[3ddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.80  Up     Normal 10.21 GB 25.00% Token(bytes[7ddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.131 Up     Normal 54.34 KB 25.00% Token(bytes[bddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.78  Up     Normal 58.79 KB 25.00% Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
Can you suggest how I can balance the load on my cluster?
Regards,
Thamizhannal
The keys in the data you loaded did not get high enough to reach the upper 2 nodes in the ring. You could change to the RandomPartitioner as suggested by frail. Another option is to rebalance your ring as described in the Cassandra wiki; this is the route you will want to take if you want your keys to stay ordered. Of course, as more data is loaded you'll need to rebalance again to keep the distribution relatively even. If you plan on doing only random reads and no range slices, switch to the RandomPartitioner and be done with it.
If you want better load balancing you need to change your partitioner to RandomPartitioner, but that would cause problems if your application uses range queries. You'd better check this article:
Cassandra: RandomPartitioner vs OrderPreservingPartitioner
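The difference between the two partitioners can be sketched as follows: with byte-ordered tokens, lexically similar keys all land on the same node, while hashing scatters them around the ring (a toy model of a 4-node ring, not Cassandra code; key names are made up):

```python
# Sketch: why ByteOrderedPartitioner skews load while RandomPartitioner spreads it.
# Each key is placed on the node owning the quarter of the ring its token falls in.
import hashlib

def ordered_token(key):
    # ByteOrdered: the token IS the key bytes, so lexically close keys co-locate.
    return int.from_bytes(key.encode().ljust(16, b"\x00")[:16], "big")

def random_token(key):
    # RandomPartitioner hashes the key (MD5), scattering nearby keys.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def node_for(token, num_nodes=4, ring=2**128):
    return token * num_nodes // ring

keys = [f"user{i:04d}" for i in range(1000)]
for name, token_fn in (("ordered", ordered_token), ("random", random_token)):
    counts = [0] * 4
    for k in keys:
        counts[node_for(token_fn(k))] += 1
    print(name, counts)
```

Under the ordered scheme every "user…" key hashes into the same quarter of the ring (one node takes all 1000 keys), while the MD5-based scheme spreads them close to 250 per node, which mirrors the 2-of-4-nodes skew seen in the nodetool ring output above.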
