Reshuffle data evenly across Cassandra ring - cassandra

I have three-node ring of Apache Cassandra 2.1.12. I inserted some data when it was 2-node ring and then added one more 172.16.5.54 node in the ring. I am using the vnode in my ring. The problem is data is not distributed evenly where as ownership seems distributed evenly. So, how to redistribute the data aross the ring. I have tried with nodetool repair and nodetool cleanup but still no luck.
Moreover, what does this load and own column signify in the nodetool status output.
Also, If out of these three-node if i import the data from one of the node from the file. So, CPU utilization goes upto 100% and finally data on the rest of the two nodes get distributed evenly but not on import running node. Why is it so?
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.16.5.54 1.47 MB 256 67.4% 40d07f44-eef8-46bf-9813-4155ba753370 rack1
UN 172.16.4.196 165.65 MB 256 68.3% 6315bbad-e306-4332-803c-6f2d5b658586 rack1
UN 172.16.3.172 64.69 MB 256 64.4% 26e773ea-f478-49f6-92a5-1d07ae6c0f69 rack1

The columns in the output are explained for cassandra 2.1.x in this doc. The load is the amount of file system data in the cassandra data directories. It seems unbalanced across your 3 nodes, which might imply that your partition keys are clustering on a single node (172.16.4.196), sometimes called a hot spot.
The Owns column is "the percentage of the data owned by the node per datacenter times the replication factor." So I can deduce your RF=2 because each node Owns roughly 2/3 of the data.

You need to fix your partition keys of tables.
Cassandra distributes the data based on partition keys to nodes (using hash partitioning range).
So, for some reason you have alot of data for few partition key value, and almost non for rest partition key values.

Related

Data not distributed across cluster in Cassandra

We are using a 3 node cluster with REPLICATION = {'class':'SimpleStrategy' , 'replication_factor':1 }
But when we are inserting data , the same row is present in all three nodes (I see it when I run it on each node individually)
When I run nodetool status (I see the below) :
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.46.89 6.43 MiB 256 32.8% 2db6dc5c-9d05-4dc7-9bf5-ea9e3c406267 rack1
UN 172.31.47.150 13.17 MiB 256 32.1% eb10cc48-6117-427c-9151-48cb6761a5e6 rack1
DN 172.31.45.131 12.73 MiB 256 35.1% cc33fc04-a02f-41e2-a00b-3835a0d98cb5 rack1
Can anyone help me to understand why data is present in all nodes???
Cassandra is masterless and when you make a query to any node in the cluster it will request the appropriate replica to answer your query. The data will not be stored on all nodes with RF=1. If really want to verify it look at your data/keyspace/table directory and use the sstabledump on the Data file.
Data will not be stored on all nodes when RF=1. Instead when you connect with any node it act as a coordinator node and fetch data from node responsible for the data and provides the response.
The coordinator only stores data locally (on a write) if it ends up being one of the nodes responsible for the data's token range.

Cassandra multi node balancing

I have added a new node into the cluster and was expecting the data on Cassandra to balance itself across nodes.
node status yields
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.7 270.75 GiB 256 48.6% 1a3f6faa-4376-45a8-9c20-11480ae5664c rack1
UN 10.128.0.14 414.36 KiB 256 51.4% 66a89fbf-08ba-4b5d-9f10-55d52a199b41 rack1
Load of node 2 is just 400KB, we have time series data and query on that. how can I rebalance the load between these clusters?
configuration for both nodes are
cluster_name: 'cluster1'
- seeds: "node1_ip, node2_ip"
num_tokens: 256
endpoint_snitch: GossipingPropertyFileSnitch
auto_bootstrap: false
thank you for your time :)
I have added a new node into the cluster and was expecting the data on Cassandra to balance itself across nodes.
Explicitly setting `auto_bootstrap: false' tells it not to do that.
how can I rebalance the load?
Set your keyspace to a RF of 2.
Run nodetool -h 10.128.0.14 repair.
-Or-
Take the 10.128.0.14 out of the cluster.
Set auto_bootstrap: true (or just remove it).
And start the node up. It should join and stream data.
Pro-tip: With a data footprint of 270GB, you should have been running with more than one node to begin with. It would have been much easier to start with 3 nodes (which is probably the minimum you should be running on).

Why is the load different on a 3 node cluster with RF 3?

I have a 3 node Cassandra cluster with a replication factor of 3.
This means that all data should be replication on to all 3 nodes.
The following is the output of nodetool status:
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.1 27.66 GB 256 100.0% 2e89198f-bc7d-4efd-bf62-9759fd1d4acc RAC1
UN 192.168.0.2 28.77 GB 256 100.0% db5fd62d-3381-42fa-84b5-7cb12f3f946b RAC1
UN 192.168.0.3 27.08 GB 256 100.0% 1ffb4798-44d4-458b-a4a8-a8898e0152a2 RAC1
This is a graph of disk usage over time on all 3 of the nodes:
My question is why do these sizes vary so much? Is it that compaction hasn't run at the same time?
I would say several factors could play a role here.
As you note, compaction will not run at the same time, so the number and contents of the SSTables will be somewhat different on each node.
The memtables will also not have been flushed to SSTables at the same time either, so right from the start, each node will have somewhat different SSTables.
If you're using compression for the SSTables, given that their contents are somewhat different, the amount of space saved by compressing the data will vary somewhat.
And even though you are using a replication factor of three, I would imagine the storage space for non-primary range data is slightly different than the storage space for primary range data, and it's likely that more primary range data is being mapped to one node or the other.
So basically unless each node saw the exact same sequence of messages at exactly the same time, then they wouldn't have exactly the same size of data.

Cassandra - Pillar applied migrations sync issue

Experiencing sync issues between different nodes in the same datacenter in Cassandra. The keyspace is set to a replication factor of 3 with NetworkTopology and has 3 nodes in the DC. Effectively making sure each node has a copy of the data. When node tool status is run, it shows all three nodes in the DC own 100% of the data.
Yet the applied_migrations column family in that keyspace is not in sync. This is strange because only a single column family is impacted within the keyspace. All the other column families are replicated fully among the three nodes. The test was done by doing a count of rows on each of the column families in the keyspaces.
keyspace_name | durable_writes | strategy_class | strategy_options
--------------+----------------+------------------------------------------------------+----------------------------
core_service | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC_DATA_1":"3"}
keyspace: core_service
Datacenter: DC_DATA_1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN host_ip_address_1_DC_DATA_1 3.75 MB 256 100.0% 3851106b RAC1
UN host_ip_address_2_DC_DATA_1 3.72 MB 256 100.0% d1201142 RAC1
UN host_ip_address_3_DC_DATA_1 3.72 MB 256 100.0% 81625495 RAC1
Datacenter: DC_OPSCENTER_1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN host_ip_address_4_DC_OPSCENTER_1 631.31 MB 256 0.0% 39e4f8af RAC1
Query: select count(*) from core_service.applied_migrations;
host_ip_address_1_DC_DATA_1 core_service applied_migrations
count
-------
1
(1 rows)
host_ip_address_2_DC_DATA_1 core_service applied_migrations
count
-------
2
(1 rows)
host_ip_address_3_DC_DATA_1 core_service applied_migrations
count
-------
2
(1 rows)
host_ip_address_4_DC_OPSCENTER_1 core_service applied_migrations
count
-------
2
(1 rows)
Similar error is received as described in the issue below. Because all the rows of data are not available, the migration script fails because it is trying to create an existing table:
https://github.com/comeara/pillar/issues/25
I require strong consistency
If you want to ensure that your reads are consistent you need to use the right consistency levels.
For RF3 the following are your options:
Write CL ALL and read with CL One or greater
Write CL Quorum and read CL Quorum. This is what's recommended by Magro who opened the issue you linked to. It's also the most common because you can loose one node and still read and write.
Write CL one but read CL ALL.
What does Cassandra do improve consistency
Cassandra's anti entropy mechanisms are:
Repair will ensure that your nodes are consistent. It gives you a consistency base line and for this reason it should be run as part of your maintenance operations. Run repair at least more often than your gc_grace_seconds in order to avoid zombie tombstones from coming back. DataStax OpsCenter has a Repair Service that automates this task.
Manually you can run:
nodetool repair
in one node or
nodetool repair -pr
in each of your nodes. The -pr option will ensure you only repair a node's primary ranges.
Read repair happens probabilistically (configurable at the table def). When you read a row, c* will notice if some of the replicas don't have the latest data and fix it.
Hints are collected by other nodes when a node is unavailable to take a write.
Manipulating c* Schemas
I noticed that the whole point of Pillar is "to automatically manage Cassandra schema as code". This is a dangerous notion--especially if Pillar is a distributed application (I don't know if it is). Because it may cause schema collisions that can leave a cluster in a wacky state.
Assuming that Pillar is not a distributed / multi-threaded system, you can ensure that you do not break schema by utilizing checkSchemaAgreement() before and after schema changes in the Java driver after schema modifications.
Long term
Cassandra schemas will be more robust and handle distributed updates. Watch (and vote for) CASSANDRA-9424

how to rebalance cassandra cluster after adding new node

I had a 3 node cassandra cluster with replication factor of 2. The nodes were running either dsc1.2.3 or dsc1.2.4. Each node had num_token value of 256 and initial_token was commented. This 3 node cluster was perfectly balanced i.e. each owned around 30% of the data.
One of the nodes crashed so I started a new node and nodetool removed the node that had crashed. The new node got added to the cluster but the two older nodes have most of the data now (47.0% and 52.3%) and the new node has just 0.7% of the data.
The output of nodetool status is
Datacenter: xx-xxxx
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.xxx.xxx.xxx 649.78 MB 256 47.0% ba3534b3-3d9f-4db7-844d-39a8f98618f1 1c
UN 10.xxx.xxx.xxx 643.11 MB 256 52.3% 562f7c3f-986a-4ba6-bfda-22a10e384960 1a
UN 10.xxx.xxx.xxx 6.84 MB 256 0.7% 5ba6aff7-79d2-4d62-b5b0-c5c67f1e1791 1c
How do i balance this cluster?
You didn't mention running a repair on the new node, if indeed you haven't yet done that it's likely the cause of your lack of data on the new node.
Until you run a nodetool repair the new node will only hold the new data that gets written to it or the data that read-repair pulls in. With vnodes you generally shouldn't need to re-balance, if I'm understanding vnodes correctly, but I haven't personally yet moved to using vnodes so I may be wrong about that.
It looks like your new node hasn't bootstrapped. Did you add auto_bootstrap=true to your cassandra.yaml?
If you don't want to bootstrap, you can run nodetool repair on the new node and then nodetool cleanup on the two others until the distribution is fair.

Resources