Cassandra Partial Replication - cassandra

This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens) , each DC is storing roughly 25% of the key space. And this 25% are replicated 3 times on each other DC. Meaning that every single row has 4 copies over all.
For instance if my data base is to big to keep 4 complete copies of it, how can I configure cassandra so that each DC is replicated only once or twice (instead of total number of DCs (x3)).
For example: 25% of the key space that is stored on DC1 I want to replicate once on DC2 only. I am not looking for selecting any particular DC for replication neither I care if 25% of DC1 will be split over multiple DC1,2,3 I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible ?
Thank you
Best Regards

Your keyspace command shows that each of the DCs hold 1 copy of the data. This means that if you have 1 node in each DC, then each node will have 100% of your data. So, I am not sure how you concluded that each of your DCs store only 25% of keys as it is obvious they are storing 100%. Chances are when you run nodetool command you are not specifying the keyspace so the command shows you load which is based on the token range assigned to each node which would be misleading for NetworkTopology setup. Try running it with your keyspace name and see if you notice the difference.
I don't think there is a way to shift data around DCs using any of existing Snitches the way you want it. If you really wanted to have even distribution and you had equal number of nodes in each DC with initial tokens spaced evenly, you could have used SimpleSnitch to achieve what you want. You can change the Snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bare in mind that during this process you will have some outage because after the SnitchChange, previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopology works is that if you say you have DC1:1 and you have for example 2 nodes in DC1, it will evenly distribute keys across 2 nodes leading to 50% effective load on each node. With that in mind, I think what you really want to have done is to keep 3 copies of your data, 1 in each DC. So, you can really discard one DC and save money. I am saying this because I think these DCs you have are virtual in the notion of your NetworkTopology and not real physical DC because no one would want to have only 25% of data in one DC as it will not be an available setup. So, I recommend if your nodes are grouped into virtual DCs, you group them into 4 racks instead and maintain 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd3-ra_4 rack-a
nd3-rb_4 rack-b
nd3-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a,b,and c will have 100% of your data (each node in each rack 25%).

Related

Add nodes to existing Cassandra cluster

We currently have a 2 node Cassandra cluster. We want to add 4 more nodes to the cluster, using the rack feature. The future topology will be:
node-01 (Rack1)
node-02 (Rack1)
node-03 (Rack2)
node-04 (Rack2)
node-05 (Rack3)
node-06 (Rack3)
We want to use different racks, but the same DC.
But for now we use SimpleStrategy and replication factor is 1 for all keyspaces. My plan to switch from a 2 to a 6 node cluster is shown below:
Change Endpoint snitch to GossipingPropetyFileSnitch.
Alter keyspace to NetworkTopologyStrategy...with replication_factor 'datacenter1': '3'.
According to the docs, when we add a new DC to an existing cluster, we must alter system keyspaces, too. But in our case, we change only the snitch and keyspace strategy, not the Datacenter. Or should I change the system keyspaces strategy and replication factor too, in the case of adding more nodes and changing the snitch?
First, I would change the endpoint_snitch to GossipingPropertyFileSnitch on one node and restart it. You need to make sure that approach works, first. Typically, you cannot (easily) change the logical datacenter or rack names on a running cluster. Which technically you're not doing that, but SimpleStrategy may do some things under the hood to abstract datacenter/rack awareness, so it's a good idea to test it.
If it works, make the change and restart the other node, as well. If it doesn't work, you may need to add 6 new nodes (instead of 4) and decommission the existing 2 nodes.
Or should I change the system keyspaces strategy and replication factor too?
Yes, you should set the same keyspace replication definition on the following keyspaces: system_auth, system_traces, and system_distributed.
Consider this situation: If one of your 2 nodes crashes, you won't be able to log in as the users assigned to that node via the system_auth table. So it is very important to ensure that system_auth is replicated appropriately.
I wrote a post on this some time ago (updated in 2018): Replication Factor to use for system_auth
Also, I recommend the same approach on system_traces and system_distributed, as future node adds/replacements/repairs may fail if valid token ranges for those keyspaces cannot be located. Basically, using the same approach on them prevents potential problems in the future.
Edit 20200527:
Do I need to launch the nodetool cleanup on old cluster nodes after the snitch and keyspace topology changes? According to docs "yes," but only on old nodes?
You will need to run it on every node, except for the very last one added. The last node is the only node guaranteed to only have data which match its token range assignments.
"Why?" you may ask. Consider the total percentage ownership as the cluster incrementally grows from 2 nodes to 6. If you bump the RF from 1 to 2 (run a repair), and then from 2 to 3 and add the first node, you have a cluster with 3 nodes and 3 replicas. Each node then has 100% data ownership.
That ownership percentage gradually decreases as each node is added, down to 50% when the 6th and final node is added. But even though all nodes will have ownership of 50% of the token ranges:
The first 3 nodes will still actually have 100% of the data set, accounting for an extra 50% of the data that they should.
The fourth node will still have an extra 25% (3/4 minus 1/2 or 50%).
The fifth node will still have an extra 10% (3/5 minus 1/2).
Therefore, the sixth and final node is the only one which will not contain any more data than it is responsible for.

Avoid Cassandra full table scan cross DC

I have Cassandra cluster nodes distributed across 2 data centers. 6 nodes in each data center, a total of 12 nodes
My keyspace definition:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'} AND durable_writes = true;
In each node, I have 64 tokens.
I am using Cassandra driver to connect to the cluster and using default load balancing policy DCAwareRoundRobinPolicy and using only dc1 nodes as contact points. So, I assume dc2 nodes will have HostDistance IGNORED and my application won't connect to them.
Note: For all my reads and writes I use the same configuration
My use case is to do a full table scan. But, I can not use Spark. So, instead, I an achieving this by getting all token range using metadata.getTokenRanges() and querying these token ranges in multiple threads.
Everything works fine. But, metadata.getTokenRanges() return 768 tokens(64*12). Which means it's giving me token range across all 12 nodes.
Since I have to run through all token ranges. Even with multiple threads, the process is very slow.
Is there any way I can get token ranges of only one data center. I even tried to get token ranges using metadata.getTokenRanges("my_keyspace", host from dc1).
I do get less number of tokens(517), but when I use this list, I get fewer data.
How can I get token ranges of only 1 DC?
Edit: I checked read/write latency in both the clusters. I do not see any operations being performed on dc2, whereas I can see a clear spike on my dc1 data center.
This is even more puzzling for me now. If dc2 is never queried how I am getting 64*12 +1 token ranges? and why not 64*6+1?
Your replication_factor is 3+3=6. So you may have 6 times the actual data. 3 copies in dc1 and 3 copies in dc2.
You have 64 vnodes per node so 64*12 =768 vnodes.
So, if you want to do a complete table scan then you might have to query all the token ranges i.e 768. What you are missing is that, because of the replication all that token ranges' data will reside within dc1. So you can get all the data from dc1 itself.
If you are using DCAwareRoundRobinPolicy and set .withLocalDc() with dc1 and consistency level LOCAL_* then you are reading only from dc1. dc1 will have all the data because the replication_factor of dc1 is 3.

Alter Keyspace on cassandra 3.11 production cluster to switch to NetworkTopologyStrategy

I have a cassandra 3.11 production cluster with 15 nodes. Each node has ~500GB total with replication factor 3. Unfortunately the cluster is setup with Replication 'SimpleStrategy'. I am switching it to 'NetworkTopologyStrategy'. I am looking to understand the caveats of doing so on a production cluster. What should I expect?
Switching from mSimpleStrategy to NetworkTopologyStrategy in a single data center configuration is very simple. The only caveat of which I would warn, is to make sure you spell the data center name correctly. Failure to do so will cause operations to fail.
One way to ensure that you use the right data center, is to query it from system.local.
cassdba#cqlsh> SELECT data_center FROM system.local;
data_center
-------------
west_dc
(1 rows)
Then adjust your keyspace to replicate to that DC:
ALTER KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy',
'west_dc': '3'};
Now for multiple data centers, you'll want to make sure that you specify your new data center names correctly, AND that you run a repair (on all nodes) when you're done. This is because SimpleStrategy treats all nodes as a single data center, regardless of their actual DC definition. So you could have 2 replicas in one DC, and only 1 in another.
I have changed RFs for keyspaces on-the-fly several times. Usually, there are no issues. But it's a good idea to run nodetool describecluster when you're done, just to make sure all nodes have schema agreement.
Pro-tip: For future googlers, there is NO BENEFIT to creating keyspaces using SimpleStrategy. All it does, is put you in a position where you have to fix it later. In fact, I would argue that SimpleStrategy should NEVER BE USED.
so when will the data movement commence? In my case since I have specific rack ids now, so I expect my replicas to switch nodes upon this alter keyspace action.
This alone will not cause any adjustments of token range responsibility. If you already have a RF of 3 and so does your new DC definition, you won't need to run a repair, so nothing will stream.
I have a 15 nodes cluster which is divided into 5 racks. So each rack has 3 nodes belonging to it. Since I previously have replication factor 3 and SimpleStrategy, more than 1 replica could have belonged to the same rack. Whereas NetworkStrategy guarantees that no two replicas will belong to the same rack. So shouldn't this cause data to move?
In that case, if you run a repair your secondary or ternary replicas may find a new home. But your primaries will stay the same.
So are you saying that nothing changes until I run a repair?
Correct.

Titan using Cassandra - Multiple Datacenter Oddities

Say I have 2 datacenters - DC1 and DC2. DC1 has 3 nodes with replication 3 (fully replicated) and DC2 has 1 node with replication 1 (fully replicated).
Say the lone node in DC2 is up, all nodes in DC1 are down, and my read/write consistency is at LOCAL_QUORUM everywhere.
I try to do a transaction on DC2 but it fails due to UnavailableException, which of course means not enough nodes are online. But why? Does the LOCAL part of LOCAL_QUORUM get ignored because I only have one node in that data center?
The lone node in DC2 has 100% of the data so why can't I do anything unless 2 nodes are also up in DC1, regardless of read/write consistency settings?
With your settings, 2 replicas need to be written to disk for a write to succeed. Here the failed write(row) partition might belong to the down nodes. Because the hash of that partition decides where it needs to go. Once you decommision those nodes, ring gets re-adjusted and work fine.
But as long as they are simply down, some writes will succeed and some writes will fail. You can check which write succeeds and which one fails by simply checking the hash and ring tokens
eg: Now imagine we got a request for that node with token range 41-50. And according to replication strategy, the next replica should go to 1-20 and 11-20, then LOCAL_QUORAM is not satisfied because they are down. So your write fails.
From https://groups.google.com/forum/#!topic/aureliusgraphs/fJYH1de5wBw
"titan uses an internal consistency for locking and id allocation, the level it uses is quorum.
as a result no matter what I do titan will always access both DC."

How to load balance Cassandra cluster nodes?

I am using Cassandra-0.7.8 on cluster of 4 machines. I have uploaded some files using Map/Reduce.
It looks files got distributed only among 2 nodes. When I used RF=3 it had got distributed to equally 4 nodes on below configurations.
Here are some config info's:
ByteOrderedPartitioner
Replication Factor = 1 (since, I have storage problem. It will be increased later )
initial token - value has not been set.
create keyspace ipinfo with replication_factor = 1 and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
[cassandra#cassandra01 apache-cassandra-0.7.8]$ bin/nodetool -h
172.27.10.131 ring Address Status State Load Owns Token
Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.132 Up Normal 11.92 GB 25.00% Token(bytes[3ddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.80 Up Normal 10.21 GB 25.00% Token(bytes[7ddfd9bae90f0836cd9bff20b27e3c04])
172.27.10.131 Up Normal 54.34 KB 25.00% Token(bytes[bddfd9bae90f0836cd9bff20b27e3c04])
172.27.15.78 Up Normal 58.79 KB 25.00% Token(bytes[fddfd9bae90f0836cd9bff20b27e3c04])
Can you suggest me how can I balance the load on my cluster.
Regards,
Thamizhannal
The keys in the data you loaded did not get high enough to reach the higher 2 nodes in the ring. You could change to the RandomPartitioner as suggested by frail. Another option would be to rebalance your ring as described in the Cassandra wiki. This is the route you will want to take if you want to continue having your keys ordered. Of course as more data is loaded, you'll want to rebalance again to keep the distribution of data relatively even. If you plan on doing just random reads and no range slices then switch to the RandomPartitioner and be done with it.
If you want better loadbalance you need to change your partitioner to RandomPartitioner. But it would cause problems if you are using range queries in your application. You would better check this article :
Cassandra: RandomPartitioner vs OrderPreservingPartitioner

Resources