How to rebalance a Cassandra cluster after adding a new node

I had a 3-node Cassandra cluster with a replication factor of 2. The nodes were running either dsc1.2.3 or dsc1.2.4. Each node had a num_tokens value of 256 and initial_token was commented out. This 3-node cluster was perfectly balanced, i.e. each node owned around 30% of the data.
One of the nodes crashed, so I started a new node and used nodetool to remove the node that had crashed. The new node got added to the cluster, but the two older nodes hold most of the data now (47.0% and 52.3%) and the new node has just 0.7% of the data.
The output of nodetool status is
Datacenter: xx-xxxx
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.xxx.xxx.xxx 649.78 MB 256 47.0% ba3534b3-3d9f-4db7-844d-39a8f98618f1 1c
UN 10.xxx.xxx.xxx 643.11 MB 256 52.3% 562f7c3f-986a-4ba6-bfda-22a10e384960 1a
UN 10.xxx.xxx.xxx 6.84 MB 256 0.7% 5ba6aff7-79d2-4d62-b5b0-c5c67f1e1791 1c
How do I balance this cluster?

You didn't mention running a repair on the new node; if you haven't done that yet, it's likely the cause of the lack of data on the new node.
Until you run a nodetool repair, the new node will only hold the new data that gets written to it or the data that read repair pulls in. With vnodes you generally shouldn't need to rebalance, if I'm understanding vnodes correctly, but I haven't personally moved to vnodes yet so I may be wrong about that.

It looks like your new node hasn't bootstrapped. Did you add auto_bootstrap: true to your cassandra.yaml?
If you don't want to bootstrap, you can run nodetool repair on the new node and then nodetool cleanup on the two others until the distribution is fair.
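For example, something like this (the addresses are placeholders for your actual node IPs):
# pull the new node's share of the existing data onto it
nodetool -h <new-node-ip> repair
# then drop the data the two older nodes no longer own
nodetool -h <older-node-1-ip> cleanup
nodetool -h <older-node-2-ip> cleanup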

Related

Cassandra multi node balancing

I have added a new node into the cluster and was expecting the data on Cassandra to balance itself across nodes.
nodetool status yields:
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.7 270.75 GiB 256 48.6% 1a3f6faa-4376-45a8-9c20-11480ae5664c rack1
UN 10.128.0.14 414.36 KiB 256 51.4% 66a89fbf-08ba-4b5d-9f10-55d52a199b41 rack1
The load on node 2 is just 400 KB; we have time-series data and query on it. How can I rebalance the load between these nodes?
The configuration for both nodes is:
cluster_name: 'cluster1'
- seeds: "node1_ip, node2_ip"
num_tokens: 256
endpoint_snitch: GossipingPropertyFileSnitch
auto_bootstrap: false
thank you for your time :)
I have added a new node into the cluster and was expecting the data on Cassandra to balance itself across nodes.
Explicitly setting auto_bootstrap: false tells it not to do that.
how can I rebalance the load?
Set your keyspace to an RF of 2.
Run nodetool -h 10.128.0.14 repair.
-Or-
Take 10.128.0.14 out of the cluster.
Set auto_bootstrap: true (or just remove it; it defaults to true).
And start the node up. It should join and stream data. A rough sketch of both options follows below.
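The keyspace name below is a placeholder; dc1 is the datacenter name from your nodetool status output:
# Option A: raise the replication factor, then repair the under-loaded node
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};"
nodetool -h 10.128.0.14 repair
# Option B: stop Cassandra on 10.128.0.14, remove it from the cluster,
# set auto_bootstrap: true (or delete the line) in cassandra.yaml, and start it again;
# it should then bootstrap and stream its share of the data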
Pro-tip: With a data footprint of 270GB, you should have been running with more than one node to begin with. It would have been much easier to start with 3 nodes (which is probably the minimum you should be running on).

Cassandra node is taking hours to join

My cluster of size 2 had entered a somewhat inconsistent state. On one node (call it A) nodetool status was correctly showing 2 nodes, while on another node (call it B) it was showing only one, i.e. itself. After several attempts I could not fix the issue, so I decommissioned node B. But nodetool status on node A was still showing node B, and in UN state at that. I had to restart Cassandra on node A so that it would forget node B.
But this has led to another problem. I am adding a new node (call it C) to the cluster of node A, but that node is taking hours to join. It has already been six hours and I am wondering whether it will eventually join successfully.
Looking at the debug logs of node C suggests that node B (the decommissioned one) is causing trouble. The logs on node C constantly show:
DEBUG [GossipTasks:1] 2017-04-29 12:38:40,004 Gossiper.java:337 - Convicting /10.120.8.53 with status removed - alive false
nodetool status on node A shows node C in Joining state, as expected.
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UJ 10.120.8.113 1006.97 MiB 256 ? f357d8d0-2379-43d8-8ae5-62224191fb6c rack1
UN 10.120.8.23 5.29 GiB 256 ? 596260a0-785a-435c-a3f3-632f56c5c882 rack1
The load for node C increases by only a fraction after a couple of hours.
I checked whether system.peers contains node B, but the table contains zero rows.
I am using Cassandra 3.7.
What's going wrong? What can I do to avoid losing data on node A and still scale the cluster?
Run nodetool netstats on node C and see if there is progress going on.
Also review nodetool compactionstats, check the number of pending compactions, and see if it goes down over time.
If the bootstrapping failed, try restarting the node.
As an alternative, you can remove node C and add it once again with the auto_bootstrap setting set to false. After the node is up, run nodetool rebuild, and then nodetool repair once that process finishes; this should be a faster alternative to a regular bootstrap.
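A rough sketch of those checks and of the alternative path (datacenter1 is the name from your nodetool status output; adjust if yours differs):
# on node C: is streaming making progress, and is the number of pending compactions shrinking?
nodetool netstats
nodetool compactionstats
# alternative: re-add node C with auto_bootstrap: false in cassandra.yaml, then once it is up:
nodetool rebuild datacenter1
nodetool repair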

Two nodes in cluster showing DN to each other, UN to everyone else

I have a nine node Cassandra cluster and everything seems to be working fine, except for two of my servers show each other as DN. All other nodes in the cluster show all nodes as UN. These two show all nodes UN except for each other, where they show each other as DN. There are no errors in the system.log on either server that indicates a problem. All nodes are listed as seed nodes across the cluster. I am able to telnet between the servers on port 7001, so I don't think it is a network issue. We are using Internode Communication Encryption so I wonder if it might be an issue with that?
Related Nodetool Status Snippet on 64.6.220.249:
DN 64.6.220.251 106.19 GB 256 ? e008bc26-5d12-48b5-a381-6a175b085496 Rack1
Related Nodetool Status Snippet on 64.6.220.251:
DN 64.6.220.249 105.31 GB 256 ? 59709c2a-6270-40be-a444-042bdf18873e Rack1
Related Nodetool Status Snippet from another node in the cluster (all nodes show this, except for the two above):
UN 64.6.220.251 106.19 GB 256 ? e008bc26-5d12-48b5-a381-6a175b085496 Rack1
UN 64.6.220.249 105.31 GB 256 ? 59709c2a-6270-40be-a444-042bdf18873e Rack1
GossipInfo run from 64.6.220.249:
/64.6.220.251
generation:1473238188
heartbeat:12693992
SCHEMA:a7b7f6f4-24ba-3153-90cc-dc8ad2754251
RACK:Rack1
SEVERITY:0.0
RPC_ADDRESS:64.6.220.251
HOST_ID:e008bc26-5d12-48b5-a381-6a175b085496
INTERNAL_IP:64.6.220.251
X_11_PADDING:{"workload":"Cassandra","active":"true"}
LOAD:1.14019618013E11
NET_VERSION:8
DC:Cassandra-ALPHA
RELEASE_VERSION:2.1.5.469
STATUS:NORMAL,-1122920019547920198
GossipInfo run from 64.6.220.251:
/64.6.220.249
generation:1473237564
heartbeat:12696040
RACK:Rack1
DC:Cassandra-ALPHA
RPC_ADDRESS:64.6.220.249
SCHEMA:a7b7f6f4-24ba-3153-90cc-dc8ad2754251
INTERNAL_IP:64.6.220.249
SEVERITY:0.0
X_11_PADDING:{"workload":"Cassandra","active":"true"}
RELEASE_VERSION:2.1.5.469
NET_VERSION:8
LOAD:1.13072884091E11
HOST_ID:59709c2a-6270-40be-a444-042bdf18873e
STATUS:NORMAL,-1027844444513030305
Nodetool describecluster from 64.6.220.249:
Cluster Information:
Name: Fusion Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
a7b7f6f4-24ba-3153-90cc-dc8ad2754251: [64.6.220.254, 170.75.212.226, 170.75.212.225, 64.6.220.252, 170.75.212.224, 64.6.220.253, 64.6.220.250, 64.6.220.249]
UNREACHABLE: [64.6.220.251]
Nodetool describecluster from 64.6.220.251:
Cluster Information:
Name: Fusion Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
a7b7f6f4-24ba-3153-90cc-dc8ad2754251: [64.6.220.254, 170.75.212.226, 170.75.212.225, 64.6.220.252, 170.75.212.224, 64.6.220.253, 64.6.220.250, 64.6.220.251]
UNREACHABLE: [64.6.220.249]
Can anyone point me in the right direction as to why these two nodes show each other as "DN", even though all other nodes see them as "UN"?
I have seen this "mixed" gossip state before. When this happens, bouncing the Cassandra process on the nodes being reported as "DN" typically fixes it.
When you see this, it's also a good idea to run nodetool describecluster. Check the results to ensure that you only have one schema version. If you have multiple schema versions (known as "schema disagreement"), it's best to bounce the affected nodes as well.
I'm not entirely sure why this happens, but a contributing factor is having too many nodes designated as seed nodes. When you have too many seed nodes, their gossip states can take longer to get in sync, and that may lead to this condition.
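A minimal sketch of those steps (the service command assumes a package-managed install; adjust for your environment):
# confirm the cluster agrees on a single schema version
nodetool describecluster
# then bounce the Cassandra process on a node being reported as DN
nodetool drain
sudo service cassandra restart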
I have seen this issue too with Cassandra version 2.2.13 following a few nodes being restarted soon after each other (10-15 minutes apart).
What's interesting is that, looking at all the nodes in the cluster with nodetool status, it was clear that the bulk of the nodes agreed on which nodes were up or down, and only a few nodes didn't share that consensus.
My solution was to run nodetool status on all nodes, identify the ones with inconsistent views, restart those nodes, and tail the Cassandra logs on the consistent nodes until the restarted (rogue) node shows up as online with an entry like INFO 02:15:31 InetAddress /18.136.19.11 is now UP, then move on to the next rogue node until all nodes are consistent.
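A rough sketch of that loop, assuming shell access to the nodes and the default packaged log location (host names are placeholders):
# spot nodes whose view disagrees: any DN rows are suspects
for h in node1 node2 node3; do
  echo "== $h =="
  nodetool -h "$h" status | grep '^DN'
done
# after restarting a rogue node, watch a healthy node's log until the peer comes back
tail -f /var/log/cassandra/system.log | grep 'is now UP'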
In my case I had 2 nodes showing status DN, but they were actually up and functioning.
I refreshed gossip on both nodes, which solved the problem, by running:
nodetool disablegossip && nodetool enablegossip
hope this helps.

Reshuffle data evenly across Cassandra ring

I have a three-node ring of Apache Cassandra 2.1.12. I inserted some data when it was a 2-node ring and then added one more node, 172.16.5.54, to the ring. I am using vnodes in my ring. The problem is that the data is not distributed evenly, whereas ownership seems distributed evenly. So, how do I redistribute the data across the ring? I have tried nodetool repair and nodetool cleanup, but still no luck.
Moreover, what do the Load and Owns columns signify in the nodetool status output?
Also, if I import data from a file on one of these three nodes, CPU utilization goes up to 100% and eventually the data gets distributed evenly on the other two nodes, but not on the node running the import. Why is that?
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.16.5.54 1.47 MB 256 67.4% 40d07f44-eef8-46bf-9813-4155ba753370 rack1
UN 172.16.4.196 165.65 MB 256 68.3% 6315bbad-e306-4332-803c-6f2d5b658586 rack1
UN 172.16.3.172 64.69 MB 256 64.4% 26e773ea-f478-49f6-92a5-1d07ae6c0f69 rack1
The columns in the output are explained for Cassandra 2.1.x in this doc. The Load is the amount of file system data in the Cassandra data directories. It seems unbalanced across your 3 nodes, which might imply that your partition keys are clustering on a single node (172.16.4.196), sometimes called a hot spot.
The Owns column is "the percentage of the data owned by the node per datacenter times the replication factor." So I can deduce your RF=2 because each node Owns roughly 2/3 of the data.
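If you want to confirm the replication factor rather than deduce it, you can check the keyspace definition (the keyspace name is a placeholder):
cqlsh -e "DESCRIBE KEYSPACE my_keyspace;"
# the replication settings in the output show the strategy and the replication factor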
You need to fix the partition keys of your tables.
Cassandra distributes data to nodes based on the partition key (using hash partitioning ranges).
So, for some reason you have a lot of data for a few partition key values and almost none for the rest of the partition key values.
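For example, with time-series data a common pattern is to add a time bucket to the partition key so that writes spread across nodes. The table and column names below are hypothetical, just to illustrate the idea:
CREATE TABLE sensor_readings (
    sensor_id text,
    day text,                          -- e.g. '2016-02-14'; part of the partition key
    reading_time timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), reading_time)
);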

Cassandra - Pillar applied migrations sync issue

I am experiencing sync issues between different nodes in the same datacenter in Cassandra. The keyspace is set to a replication factor of 3 with NetworkTopologyStrategy and has 3 nodes in the DC, effectively making sure each node has a copy of the data. When nodetool status is run, it shows all three nodes in the DC owning 100% of the data.
Yet the applied_migrations column family in that keyspace is not in sync. This is strange because only a single column family is impacted within the keyspace; all the other column families are replicated fully among the three nodes. The test was done by doing a count of rows on each of the column families in the keyspace.
keyspace_name | durable_writes | strategy_class | strategy_options
--------------+----------------+------------------------------------------------------+----------------------------
core_service | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC_DATA_1":"3"}
keyspace: core_service
Datacenter: DC_DATA_1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN host_ip_address_1_DC_DATA_1 3.75 MB 256 100.0% 3851106b RAC1
UN host_ip_address_2_DC_DATA_1 3.72 MB 256 100.0% d1201142 RAC1
UN host_ip_address_3_DC_DATA_1 3.72 MB 256 100.0% 81625495 RAC1
Datacenter: DC_OPSCENTER_1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN host_ip_address_4_DC_OPSCENTER_1 631.31 MB 256 0.0% 39e4f8af RAC1
Query: select count(*) from core_service.applied_migrations;
host_ip_address_1_DC_DATA_1 core_service applied_migrations
count
-------
1
(1 rows)
host_ip_address_2_DC_DATA_1 core_service applied_migrations
count
-------
2
(1 rows)
host_ip_address_3_DC_DATA_1 core_service applied_migrations
count
-------
2
(1 rows)
host_ip_address_4_DC_OPSCENTER_1 core_service applied_migrations
count
-------
2
(1 rows)
The error received is similar to the one described in the issue below. Because not all rows of data are available, the migration script fails when it tries to create a table that already exists:
https://github.com/comeara/pillar/issues/25
I require strong consistency
If you want to ensure that your reads are consistent you need to use the right consistency levels.
For RF3 the following are your options:
Write CL ALL and read with CL One or greater
Write CL Quorum and read CL Quorum. This is what's recommended by Magro, who opened the issue you linked to. It's also the most common because you can lose one node and still read and write (see the cqlsh sketch after this list).
Write CL one but read CL ALL.
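For example, option 2 from cqlsh (Pillar itself would set the consistency level through the driver; this is just a quick way to exercise a QUORUM read against the table from your output):
CONSISTENCY QUORUM;
SELECT count(*) FROM core_service.applied_migrations;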
What does Cassandra do to improve consistency?
Cassandra's anti-entropy mechanisms are:
Repair will ensure that your nodes are consistent. It gives you a consistency baseline, and for this reason it should be run as part of your regular maintenance operations. Run repair more often than your gc_grace_seconds in order to avoid deleted data coming back as zombies. DataStax OpsCenter has a Repair Service that automates this task.
Manually you can run:
nodetool repair
in one node or
nodetool repair -pr
in each of your nodes. The -pr option will ensure you only repair a node's primary ranges.
Read repair happens probabilistically (configurable in the table definition). When you read a row, C* will notice if some of the replicas don't have the latest data and fix it.
Hints are collected by other nodes when a node is unavailable to take a write.
Manipulating c* Schemas
I noticed that the whole point of Pillar is "to automatically manage Cassandra schema as code". This is a dangerous notion, especially if Pillar is a distributed application (I don't know if it is), because it may cause schema collisions that can leave a cluster in a wacky state.
Assuming that Pillar is not a distributed / multi-threaded system, you can ensure that you do not break the schema by calling checkSchemaAgreement() in the Java driver before and after schema modifications.
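Independently of the driver call, a quick operational check before and after a migration is to verify that the cluster reports a single schema version:
nodetool describecluster
# the Schema versions section should list exactly one UUID covering all nodes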
Long term
Cassandra schemas will be more robust and handle distributed updates. Watch (and vote for) CASSANDRA-9424
