How to load balance Cassandra cluster nodes? - cassandra

I am using Cassandra 0.7.8 on a cluster of 4 machines. I have uploaded some files using Map/Reduce.
It looks like the files got distributed among only 2 nodes. When I used RF=3, the data was distributed equally across all 4 nodes with the configuration below.
Here is some config info:
ByteOrderedPartitioner
Replication Factor = 1 (since I have a storage constraint; it will be increased later)
initial_token: not set
create keyspace ipinfo with replication_factor = 1 and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
[cassandra#cassandra01 apache-cassandra-0.7.8]$ bin/nodetool -h 172.27.10.131 ring
Address        Status  State   Load      Owns    Token
                                                 Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.132  Up      Normal  11.92 GB  25.00%  Token(bytes[3ddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.80   Up      Normal  10.21 GB  25.00%  Token(bytes[7ddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.131  Up      Normal  54.34 KB  25.00%  Token(bytes[bddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.78   Up      Normal  58.79 KB  25.00%  Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
Can you suggest how I can balance the load on my cluster?
Regards,
Thamizhannal

With ByteOrderedPartitioner, the keys in the data you loaded did not reach high enough in the token range to land on the upper 2 nodes of the ring. You could switch to the RandomPartitioner, as suggested by frail. Another option is to rebalance your ring as described in the Cassandra wiki; this is the route to take if you want to keep your keys ordered. Of course, as more data is loaded you'll need to rebalance again to keep the distribution relatively even (a rough sketch follows below). If you only plan on doing random reads and no range slices, switch to the RandomPartitioner and be done with it.
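A rough sketch of one rebalance step with nodetool move; the token value below is purely illustrative and would need to be chosen from your actual key distribution (repeat for each node, one at a time):
bin/nodetool -h 172.27.10.132 move 40dfd9bae90f0836cd9bff20b27e3c04
bin/nodetool -h 172.27.10.132 cleanup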

If you want better load balancing you need to change your partitioner to RandomPartitioner. But that will cause problems if you are using range queries in your application. You would be better off checking this article:
Cassandra: RandomPartitioner vs OrderPreservingPartitioner
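For reference, the partitioner is a one-line setting in cassandra.yaml, but note that switching partitioners on a cluster that already holds data is not a simple config change; the data has to be reloaded:
# cassandra.yaml (a sketch; only valid for a cluster whose data will be re-ingested)
partitioner: org.apache.cassandra.dht.RandomPartitioner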

Related

What happens if one DC in Cassandra runs out of physical memory?

I'm new to Cassandra and I'm wondering what will happen if I have multiple datacenters and at some point one datacenter doesn't have enough physical storage to hold all the data.
Assume we have 2 DCs. The first DC can store 1 TB and the second DC can only hold 500 GB. Furthermore, let's say we have a replication factor of 1 in both DCs. As I understand it, both DCs will hold the complete token ring, so every DC will have the full data. What happens now if I push data to DC1 and the total amount of storage needed exceeds 500 GB?
To keep things simple, I will assume that you write the data through DC1, so it will be the local DC in each scenario; DC2, the one that has run out of space and is effectively down, is the remote DC throughout. What really matters here is the consistency level you are using for your writes:
consistency level of type LOCAL (LOCAL_QUORUM, ONE, LOCAL_ONE) - you can write your data.
consistency level of type REMOTE (ALL, EACH_QUORUM, QUORUM, TWO, THREE) - you cannot write your data.
I suggest reading about consistency levels.
Also, a very quick test using ccm and the cassandra-stress tool might be helpful to reproduce different scenarios, for example:
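A minimal sketch, assuming ccm and cassandra-stress are installed locally; the Cassandra version and node counts are just examples, and exact flag syntax may vary between ccm versions:
ccm create disk-test -v 2.1.15 -n 1:1 -s     # throwaway local cluster: two DCs, one node each
ccm stress write n=100000 -rate threads=10   # generate write load with cassandra-stress
ccm node2 stop                               # take the second DC's node down, then re-run the
                                             # writes at different consistency levels to compare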
Another comment regarding your free space: when a node hits the 250 GB mark (half of 500 GB) you will start having compaction issues. The recommendation is to keep half of the disk empty so compactions can run.
Let's say, however, that you keep writing data to that node and hit the 500 GB mark: Cassandra will stop on that node.

Datastax Cassandra repair service weird estimation and heavy load

I have a 5 node cluster with around 1 TB of data, vnodes enabled, OpsCenter 5.12 and DSE 4.6.7. I would like to do a full repair within 10 days and use the repair service in OpsCenter so that I don't put unnecessary load on the cluster.
The problem I'm facing is that the repair service puts too much load on the cluster and is working too fast. Its progress is around 30% (according to OpsCenter) in 24 h. I even tried changing the time to completion to 40 days, without any difference.
Questions:
Can I trust the percent-complete number in OpsCenter?
The suggested number is something like 0.000006 days. Could that guess be related to the problem?
Are there any settings/tweaks that could be useful to lower the load?
You can use OpsCenter as a guideline about where data is stored and what's going on in the cluster, but it's really more of a dashboard. The real 'tale of the tape' comes from 'nodetool' via command line on server nodes such as
#shell> nodetool status
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns   Host ID                               Rack
UN  10.xxx.xxx.xx  43.95 GB  256     33.3%  b1e56789-8a5f-48b0-9b76-e0ed451754d4  RAC1
What type of compaction are you using?
You've asked a sort of 'magic bullet' question, as there could be several factors in play. Here are some examples, though the list is not exhaustive:
A. Size of the data and of whole rows in Cassandra (you can see these with nodetool cfstats). Rows with a binary size larger than 16 MB will be seen as "ultra" wide rows, which might be an indicator that your data model needs a 'compound' or 'composite' row key.
B. The type of setup you have with respect to replication and network strategy.
C. The data entry point: how Cassandra gets its data. Are you using Python? PHP? What inputs the data? You can get funky behavior from a cluster with a bad PHP driver (for example).
D. Vnodes are good, but they can also be bad. What version of Cassandra are you running? You can find out via cqlsh: run cqlsh -3, then type 'show version'.
E. The type of compaction is a big killer. Are you using SizeTieredCompactionStrategy or LeveledCompactionStrategy?
Start by running 'nodetool cfstats' from the command line on the server any given node is running on. The particular areas of interest would be (at this point):
Compacted row minimum size:
Compacted row maximum size:
More than X bytes in size here on a system with Y amount of RAM can be a significant problem. Be sure Cassandra has enough RAM and that the stack is tuned.
Cassandra's default performance configuration should normally be enough, so the next step would be to open a cqlsh session to the node with 'cqlsh -3 hostname' and issue the command 'describe keyspaces'. Take the keyspace name you are running, issue 'describe keyspace FOO', and look at your schema. Of particular interest are your primary keys. Are you using "composite row keys" or a "composite primary key" (as described here: http://www.datastax.com/dev/blog/whats-new-in-cql-3-0)? If not, you probably need to, depending on the expected read/write load.
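For illustration only, a minimal sketch of a composite primary key in CQL 3; the keyspace, table, and column names are hypothetical:
CREATE TABLE foo.sensor_readings (
    sensor_id  text,         -- partition key: spreads rows across the cluster
    reading_ts timestamp,    -- clustering column: orders rows within a partition
    value      double,
    PRIMARY KEY ((sensor_id), reading_ts)
);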
Also check how your application layer is inserting data into Cassandra. Using PHP? Python? Which drivers are being used? There are significant bugs in Cassandra versions < 1.2.10 with certain Thrift connectors, such as the Java driver or the phpcassa driver, so you might need to upgrade Cassandra and make some driver changes.
In addition to these steps, also consider how your nodes were created.
Note that migration from static tokens to virtual nodes (vnodes) has to be handled carefully; you can't simply switch configs on a node that's already been populated. You will want to check your initial_token setting in /etc/cassandra/cassandra.yaml. The questions I ask myself here are: what initial tokens are set (there are no initial tokens for vnodes)? Were the tokens changed after the data was populated? For the static tokens I typically run, I calculate them using a tool like http://www.geroba.com/cassandra/cassandra-token-calculator/, as I've run into complications with vnodes (though they are much more reliable now than before).
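As a quick reference, a sketch of the two mutually exclusive token settings in cassandra.yaml; the values are illustrative only:
# vnodes: set num_tokens and leave initial_token unset
num_tokens: 256
# static tokens: comment out num_tokens and assign one token per node
# initial_token: -9223372036854775808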

Need to "balance" a cassandra cluster

I have a test cluster I built, and looking at the output of nodetool status, the data is distributed amongst the four nodes as such:
--  Address  Load     Tokens  Owns
UN  NODE3    1.61 GB  1       14.6%
UN  NODE2    3.14 GB  1       19.4%
UN  NODE1    7.68 GB  1       63.9%
UN  NODE4    5.85 GB  1       2.0%
Now all nodes were added before I ingested data into the database, but I think I probably screwed up by not manually setting the token info prior to bringing data into the cluster.
My question is how best to readjust this to more evenly distribute the data?
If you are not using vnodes (which you are not, since you have 1 token per node), you can move the token on each node to evenly distribute your ring. To do this, do the following:
Calculate the tokens to assign to your nodes based on the number of nodes in your cluster. You can use a tool like the Cassandra token calculator. With Murmur3Partitioner and 4 nodes you could use: -9223372036854775808, -4611686018427387904, 0, 4611686018427387904.
One node at a time, use nodetool move to move the node to its new token (e.g. nodetool move -- 0) and wait for it to complete; this may take a while. It may also be wise to choose which node to move to which token based on its current proximity to that token (see the sketch after these steps).
Usually it's a good idea to do a nodetool cleanup on that node afterwards, to clean up data no longer belonging to it.
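A minimal sketch of the full sequence, assuming the four tokens above and the node names from the question; run each move and wait for it to finish before starting the next:
nodetool -h NODE1 move -- -9223372036854775808
nodetool -h NODE1 cleanup
nodetool -h NODE2 move -- -4611686018427387904
nodetool -h NODE2 cleanup
nodetool -h NODE3 move -- 0
nodetool -h NODE3 cleanup
nodetool -h NODE4 move -- 4611686018427387904
nodetool -h NODE4 cleanup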
+1 on Andy's points. I'd like to add one more thing.
To ensure you get an accurate Owns %, you must specify a keyspace: nodetool status <ks>
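For example, with a hypothetical keyspace name:
nodetool status my_keyspace    # Owns % is now computed from this keyspace's replication settings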

How does hard disk space allocation work in Cassandra?

Let's say I have 4 identical servers with 300 GB of hard drive space each and a replication factor of 2 (so basically 2 x 300 GB nodes, each replicated on another physical machine with 300 GB of space). How does the space allocation work across these nodes?
For instance, imagine the 300 GB on nodes 1 and 2 (node 2 being the replica of 1) is completely used by Cassandra and another application that also uses disk space, but the second set (nodes 3 and 4) has some free disk space since they're only running Cassandra and nothing else. Would Cassandra store new entries on these nodes instead, given that the first 2 nodes are out of disk space, or would it blow up?
Broadening the situation across multiple servers in a rack, would Cassandra intelligently manage disk space requirements and put the data on nodes with more free storage space? Similarly, would it be able to work with servers of varying storage sizes (some 600 GB, some 300 GB, etc.)?
Many thanks,
Cassandra does not allocate data by available space. It places data on nodes based on the hash of the partition key, so there can be no intelligent live balancing of where data should go.
To do approximate balancing you can change the size of the token ranges a particular node is responsible for (non-vnodes) or adjust the number of vnodes. This all needs to be done manually (a query sketch showing the hashing follows the examples below).
These changes go in cassandra.yaml.
Example Vnodes:
Node 1: num_tokens: 128
Node 2: num_tokens: 128
Node 3: num_tokens: 256
Node 4: num_tokens: 256
Example Non-Vnodes (given a full range = 100):
Node1: initial_token: 15
Node2: initial_token: 30
Node3: initial_token: 65
Node4: initial_token: 100
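To see the hash-based placement described above in action, you can ask Cassandra for the token of each partition key; a minimal sketch against a hypothetical table:
-- token() returns the partition-key hash that decides which node(s) own the row
SELECT token(user_id), user_id FROM my_keyspace.users LIMIT 5;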

Cassandra Partial Replication

This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens), each DC is storing roughly 25% of the key space, and this 25% is replicated 3 times across the other DCs, meaning that every single row has 4 copies overall.
If my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each DC's data is replicated only once or twice, instead of on the total number of other DCs (x3)?
For example: the 25% of the key space that is stored on DC1 I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if DC1's 25% ends up split over DC2, DC3, and DC4; I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible?
Thank you
Best Regards
Your keyspace command shows that each of the DCs holds 1 copy of the data. This means that if you have 1 node in each DC, then each node will have 100% of your data. So I am not sure how you concluded that each of your DCs stores only 25% of the keys, as they are clearly storing 100%. Chances are that when you run the nodetool command you are not specifying the keyspace, so the command shows you ownership based on the token ranges assigned to each node, which is misleading for a NetworkTopologyStrategy setup. Try running it with your keyspace name and see if you notice the difference.
I don't think there is a way to shift data around DCs the way you want using any of the existing snitches. If you really wanted an even distribution and you had an equal number of nodes in each DC with evenly spaced initial tokens, you could use SimpleSnitch to achieve what you want: change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopologyStrategy works is that if you specify DC1:1 and you have, for example, 2 nodes in DC1, it will evenly distribute keys across those 2 nodes, leading to 50% effective load on each node. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each DC, so you can discard one DC and save money. I am saying this because I think your DCs are virtual in the NetworkTopologyStrategy sense and not real physical DCs, because no one would want only 25% of the data in one DC, as that would not be an available setup. So, if your nodes are grouped into virtual DCs, I recommend you group them into 4 racks instead and maintain 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd3-ra_4 rack-a
nd3-rb_4 rack-b
nd3-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a, b, and c will have 100% of your data (each node in each rack 25%).
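A minimal sketch of that single-DC keyspace definition, reusing the keyspace name from the question:
ALTER KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};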
