Need to "balance" a cassandra cluster - cassandra

I have a test cluster I built, and upon looking at it when running a nodetool status I have the data distributed amongst the four nodes as such:
-- Address Load Tokens Owns
UN NODE3 1.61 GB 1 14.6%
UN NODE2 3.14 GB 1 19.4%
UN NODE1 7.68 GB 1 63.9%
UN NODE4 5.85 GB 1 2.0%
Now all nodes were added before I ingested data into the database, but I think I probably screwed up by not manually setting the token info prior to bringing data into the cluster.
My question is how best to readjust this to more evenly distribute the data?

If you are not using Vnodes (which you are not since you have 1 token per node), you can move tokens on each node to evenly distribute your ring. To do this, do the following:
Calculate the tokens to assign your nodes based on number of nodes in your cluster. You can use a tool like cassandra token calculator. With Murmur3Partitioner and 4 nodes you could use: (-9223372036854775808
-4611686018427387904 0 4611686018427387904)
One node at a time use nodetool move to move the node to the new token (i.e. nodetool move -- 0) and wait for it to complete. This may take a while. It may also be wise to choose which nodes to move to which token based on their current proximity to the token they are moving to.
Usually its a good idea to do a nodetool cleanup on that node afterwords to cleanup data no longer belonging to that node.

+1 on Andy's points. I'd like to add one more thing.
To ensure you get an accurate Owns %, you must specify a Cassandra Keyspace for nodetool status <ks>

Related

Add nodes to existing Cassandra cluster

We currently have a 2 node Cassandra cluster. We want to add 4 more nodes to the cluster, using the rack feature. The future topology will be:
node-01 (Rack1)
node-02 (Rack1)
node-03 (Rack2)
node-04 (Rack2)
node-05 (Rack3)
node-06 (Rack3)
We want to use different racks, but the same DC.
But for now we use SimpleStrategy and replication factor is 1 for all keyspaces. My plan to switch from a 2 to a 6 node cluster is shown below:
Change Endpoint snitch to GossipingPropetyFileSnitch.
Alter keyspace to NetworkTopologyStrategy...with replication_factor 'datacenter1': '3'.
According to the docs, when we add a new DC to an existing cluster, we must alter system keyspaces, too. But in our case, we change only the snitch and keyspace strategy, not the Datacenter. Or should I change the system keyspaces strategy and replication factor too, in the case of adding more nodes and changing the snitch?
First, I would change the endpoint_snitch to GossipingPropertyFileSnitch on one node and restart it. You need to make sure that approach works, first. Typically, you cannot (easily) change the logical datacenter or rack names on a running cluster. Which technically you're not doing that, but SimpleStrategy may do some things under the hood to abstract datacenter/rack awareness, so it's a good idea to test it.
If it works, make the change and restart the other node, as well. If it doesn't work, you may need to add 6 new nodes (instead of 4) and decommission the existing 2 nodes.
Or should I change the system keyspaces strategy and replication factor too?
Yes, you should set the same keyspace replication definition on the following keyspaces: system_auth, system_traces, and system_distributed.
Consider this situation: If one of your 2 nodes crashes, you won't be able to log in as the users assigned to that node via the system_auth table. So it is very important to ensure that system_auth is replicated appropriately.
I wrote a post on this some time ago (updated in 2018): Replication Factor to use for system_auth
Also, I recommend the same approach on system_traces and system_distributed, as future node adds/replacements/repairs may fail if valid token ranges for those keyspaces cannot be located. Basically, using the same approach on them prevents potential problems in the future.
Edit 20200527:
Do I need to launch the nodetool cleanup on old cluster nodes after the snitch and keyspace topology changes? According to docs "yes," but only on old nodes?
You will need to run it on every node, except for the very last one added. The last node is the only node guaranteed to only have data which match its token range assignments.
"Why?" you may ask. Consider the total percentage ownership as the cluster incrementally grows from 2 nodes to 6. If you bump the RF from 1 to 2 (run a repair), and then from 2 to 3 and add the first node, you have a cluster with 3 nodes and 3 replicas. Each node then has 100% data ownership.
That ownership percentage gradually decreases as each node is added, down to 50% when the 6th and final node is added. But even though all nodes will have ownership of 50% of the token ranges:
The first 3 nodes will still actually have 100% of the data set, accounting for an extra 50% of the data that they should.
The fourth node will still have an extra 25% (3/4 minus 1/2 or 50%).
The fifth node will still have an extra 10% (3/5 minus 1/2).
Therefore, the sixth and final node is the only one which will not contain any more data than it is responsible for.

Cassandra vnodes replicas

Setting up the context:
Cassandra currently implements vnodes. 256 by default which is tweakable in the cassandra.yaml file
Vnodes as I understand are token-ranges/hash-ranges. Eg. (x...y], where y is the token number of the vnode. Each physical node in Cassandra is assigned random 256 tokens, and each of those tokens are the boundary value of a hash/token range. The tokens assigned are within the range of 2^-63 to 2^63-1 (the range of hash numbers which murmur3 has partitioner may generate). So far so good.
Question:
1. Is it that a token range(vnode) is a fixed range. Once set, this token range will be copied to other Cassandra nodes to satisfy the replication factor like a token range(vnode) being a fundamental chunk of data(tokens) which goes around together. Only in case of bootstrap of a new node in the cluster, this token range(vnode) might break apart to be assigned to other node.
Riding on the last proposition, (say the last proposition is true).
Then a vnode must only contain tokens which belong a given keyspace.
Because each keyspace(container of column family/tables) has a defined replication strategy and replication factor. And it is highly likely that replication factor of keyspaces in a Cassandra cluster will vary.
Consider an example. "system_schema" keyspace has a RF of 1 whereas I created a keyspace "test_ks" with RF 3. If a row of system_schema keyspace has a token number 2(say) and a row of my test_ks has token number 5(say).
these 2 tokens can't be placed in the same token range(vnode). If a vnode is consistent chunk of token ranges, say token 2 and 5 belong to vnode with token number 10. so vnode 10 has to be placed on 3 different physical nodes to satisfy the RF =3 for test_ks, but we are unnecessary placing token 2 on 3 different nodes whose RF is supposed to be 1.
Is this proposition correct that, a vnode is only dedicated to a given keyspace?
which boils down to out of 256 tokens on a physical node... 20(say) vnodes currently belong to "system" keyspace, 80 vnodes(say) belong to test_ks.
Again riding on the above proposition, this means that each node should have the info of keyspace-wise vnodes currently available in the cluster.
That way when a new write comes in for a Keyspace the co-ordinator node would locate all vnodes in the cluster for that keyspace and assign the new row a token number which falls within the token range of those keyspaces. That being the case can I know how many vnodes currently belong to a keyspace in the entire cluster/ or on a given node.
Please do correct me if I'm wrong.
I have been following the below blogs and videos to get an understanding of this concept:
https://www.scribd.com/document/253239514/Virtual-Nodes-Strategies-for-Apache-Cassandra
https://www.youtube.com/watch?v=GddZ3pXiDys&t=11s
Thanks in advance
There is no fixed token-range, the tokens are just generated randomly. This is one of the reasons that vnodes were implemented - the idea being that if there are more tokens it is more likely that the resulting token-ranges will be more evenly distributed across nodes.
Token generation was recently improved in 3.0, allowing Cassandra to place new tokens a little more intelligently (see CASSANDRA-7032). You can also manually configure tokens (see initial_token), although it can become tricky to keep things balanced when it comes time to expand the cluster unless you plan on doubling the number of nodes.
The total number of tokens in a cluster is the number of nodes in the cluster multiplied by the number of vnodes per node.
In regards to placement of replicas, the first copy of a partition is placed in the node that owns that partition's token. The additional n copies are placed sequentially on the next n nodes in the ring that are in the same data centre. There is no relationship between tokens and keyspaces.
When a new write comes into a coordinator node, the coordinator node determines which node owns the partition by hashing the partition key. Note that for better performance this can actually be done by the driver instead if you use TokenAwarePolicy. The coordinator sends the write to the node that owns the partition, and if the data needs to be replicated the coordinator node also writes the replicas to the next two nodes sequentially in the token-space.
For example, suppose that we have 3 nodes which each have one token: node1: 10, node2: 20 & node3: 30. If we write a record whose partition key hashes to 22, to a keyspace with RF3, then the first copy goes to node2, the second goes to node3 and the third goes to node1. Note that each replica is equally valid - there is nothing special about the "first" replica other than that it happens to be stored on the "first" replica node.
Vnodes do not change this process, they just split up each node's token ranges by allowing each node to have more than one token. For example, if our cluster now has 2 vnodes for each node, it might instead look like this: node1: 10, 25, node2: 20, 3 & node3: 30, 21. Now our write that hashed to 22 goes to node3 (because it owns the range from 21-24), and the copies go to node1 and node2.

How does cassandra find the node that contains the data?

I've read quite a few articles and a lot of question/answers on SO about Cassandra but I still can't figure out how Cassandra decides which node(s) to go to when it's reading the data.
First, some assumptions about an imaginary cluster:
Replication Strategy = simple
Using Random Partitioner
Cluster of 10 nodes
Replication Factor of 5
Here's my understanding of how writes work based on various Datastax articles and other blog posts I've read:
Client sends the data to a random node
The "random" node is decided based on the MD5 hash of the primary key.
Data is written to the commit_log and memtable and then propagated 4 times (with RF = 5).
The 4 next nodes in the ring are then selected and data is persisted in them.
So far, so good.
Now the question is, when the client sends a read request (say with CL = 3) to the cluster, how does Cassandra know which nodes (5 out of 10 as the worst case scenario) it needs to contact to get this data? Surely it's not going to all 10 nodes as that would be inefficient.
Am I correct in assuming that Cassandra will again, do an MD5 hash of the primary key (of the request) and choose the node according to that and then walks the ring?
Also, how does the network topology case work? if I have multiple data centers, how does Cassandra know which nodes in each DC/Rack contain the data? From what I understand, only the first node is obvious (since the hash of the primary key has resulted in that node explicitly).
Sorry if the question is not very clear and please add a comment if you need more details about my question.
Many thanks,
Client sends the data to a random node
It might seem that way, but there is actually a non-random way that your driver picks a node to talk to. This node is called a "coordinator node" and is typically chosen based-on having the least (closest) "network distance." Client requests can really be sent to any node, and at first they will be sent to the nodes which your driver knows about. But once it connects and understands the topology of your cluster, it may change to a "closer" coordinator.
The nodes in your cluster exchange topology information with each other using the Gossip Protocol. The gossiper runs every second, and ensures that all nodes are kept current with data from whichever Snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to.
In this way, the coordinator node also has data about which nodes are responsible for each token range. You can see this information by running a nodetool ring from the command line. Although if you are using vnodes, that will be trickier to ascertain, as data on all 256 (default) virtual nodes will quickly flash by on the screen.
So let's say that I have a table that I'm using to keep track of ship crew members by their first name, and let's assume that I want to look-up Malcolm Reynolds. Running this query:
SELECT token(firstname),firstname, id, lastname
FROM usersbyfirstname WHERE firstname='Mal';
...returns this row:
token(firstname) | firstname | id | lastname
----------------------+-----------+----+-----------
4016264465811926804 | Mal | 2 | Reynolds
By running a nodetool ring I can see which node is responsible for this token:
192.168.1.22 rack1 Up Normal 348.31 KB 3976595151390728557
192.168.1.22 rack1 Up Normal 348.31 KB 4142666302960897745
Or even easier, I can use nodetool getendpoints to see this data:
$ nodetool getendpoints stackoverflow usersbyfirstname Mal
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
192.168.1.22
For more information, check out some of the items linked above, or try running nodetool gossipinfo.
Cassandra uses consistent hashing to map each partition key to a token value. Each node owns ranges of token values as its primary range, so that every possible hash value will map to one node. Extra replicas are then kept in a systematic way (such as the next node in the ring) and stored in the nodes as their secondary range.
Every node in the cluster knows the topology of the entire cluster, such as which nodes are in which data center, where they are in the ring, and which token ranges each nodes owns. The nodes get and maintain this information using the gossip protocol.
When a read request comes in, the node contacted becomes the coordinator for the read. It will calculate which nodes have replicas for the requested partition, and then pick the required number of nodes to meet the consistency level. It will then send requests to those nodes and wait for their responses and merge the results based on the column timestamps before sending the result back to the client.
Cassandra will locate any data based on a partition key that is mapped to a token value by the partitioner. Tokens are part of a finite token ring value range where each part of the ring is owned by a node in the cluster. The node owning the range of a certain token is said to be the primary for that token. Replicas will be selected by the data replication strategy. Basically this works by going clockwise in the token ring, starting from the primary, and stopping depending on the number of required replicas.
What's important to realize is that each node in the cluster is able to identify the nodes responsible for a certain key based on the logic described above. Whenever a value is written to the cluster, the node accepting the request (the coordinator node) will know right away the nodes that need to execute the write.
In case of multiple data-centers, all keys will be mapped across all DCs to the exact same token determined by the partitioner. Cassandra will try to write to each DC and each DC's replicas.

Cassandra Partial Replication

This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens) , each DC is storing roughly 25% of the key space. And this 25% are replicated 3 times on each other DC. Meaning that every single row has 4 copies over all.
For instance if my data base is to big to keep 4 complete copies of it, how can I configure cassandra so that each DC is replicated only once or twice (instead of total number of DCs (x3)).
For example: 25% of the key space that is stored on DC1 I want to replicate once on DC2 only. I am not looking for selecting any particular DC for replication neither I care if 25% of DC1 will be split over multiple DC1,2,3 I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible ?
Thank you
Best Regards
Your keyspace command shows that each of the DCs hold 1 copy of the data. This means that if you have 1 node in each DC, then each node will have 100% of your data. So, I am not sure how you concluded that each of your DCs store only 25% of keys as it is obvious they are storing 100%. Chances are when you run nodetool command you are not specifying the keyspace so the command shows you load which is based on the token range assigned to each node which would be misleading for NetworkTopology setup. Try running it with your keyspace name and see if you notice the difference.
I don't think there is a way to shift data around DCs using any of existing Snitches the way you want it. If you really wanted to have even distribution and you had equal number of nodes in each DC with initial tokens spaced evenly, you could have used SimpleSnitch to achieve what you want. You can change the Snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bare in mind that during this process you will have some outage because after the SnitchChange, previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopology works is that if you say you have DC1:1 and you have for example 2 nodes in DC1, it will evenly distribute keys across 2 nodes leading to 50% effective load on each node. With that in mind, I think what you really want to have done is to keep 3 copies of your data, 1 in each DC. So, you can really discard one DC and save money. I am saying this because I think these DCs you have are virtual in the notion of your NetworkTopology and not real physical DC because no one would want to have only 25% of data in one DC as it will not be an available setup. So, I recommend if your nodes are grouped into virtual DCs, you group them into 4 racks instead and maintain 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd3-ra_4 rack-a
nd3-rb_4 rack-b
nd3-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a,b,and c will have 100% of your data (each node in each rack 25%).

How to load balance Cassandra cluster nodes?

I am using Cassandra-0.7.8 on cluster of 4 machines. I have uploaded some files using Map/Reduce.
It looks files got distributed only among 2 nodes. When I used RF=3 it had got distributed to equally 4 nodes on below configurations.
Here are some config info's:
ByteOrderedPartitioner
Replication Factor = 1 (since, I have storage problem. It will be increased later )
initial token - value has not been set.
create keyspace ipinfo with replication_factor = 1 and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
[cassandra#cassandra01 apache-cassandra-0.7.8]$ bin/nodetool -h
172.27.10.131 ring Address Status State Load Owns Token
Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.132 Up Normal 11.92 GB 25.00% Token(bytes[3ddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.80 Up Normal 10.21 GB 25.00% Token(bytes[7ddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.131 Up Normal 54.34 KB 25.00% Token(bytes[bddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.78 Up Normal 58.79 KB 25.00% Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
Can you suggest me how can I balance the load on my cluster.
Regards,
Thamizhannal
The keys in the data you loaded did not get high enough to reach the higher 2 nodes in the ring. You could change to the RandomPartitioner as suggested by frail. Another option would be to rebalance your ring as described in the Cassandra wiki. This is the route you will want to take if you want to continue having your keys ordered. Of course as more data is loaded, you'll want to rebalance again to keep the distribution of data relatively even. If you plan on doing just random reads and no range slices then switch to the RandomPartitioner and be done with it.
If you want better loadbalance you need to change your partitioner to RandomPartitioner. But it would cause problems if you are using range queries in your application. You would better check this article :
Cassandra: RandomPartitioner vs OrderPreservingPartitioner

Resources