Avoid Cassandra full table scan across data centers

I have a Cassandra cluster distributed across 2 data centers, with 6 nodes in each data center, 12 nodes in total.
My keyspace definition:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'} AND durable_writes = true;
Each node has 64 tokens (vnodes).
I am using the Cassandra Java driver to connect to the cluster, with the default load balancing policy DCAwareRoundRobinPolicy and only dc1 nodes as contact points. So I assume the dc2 nodes get HostDistance IGNORED and my application won't connect to them.
Note: I use the same configuration for all my reads and writes.
My use case is to do a full table scan, but I cannot use Spark. Instead, I am achieving this by getting all the token ranges using metadata.getTokenRanges() and querying these token ranges in multiple threads.
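In rough outline the scan looks like this (a minimal sketch, assuming Java driver 3.x; my_table and pk are placeholder names, and cluster/session are the already connected driver objects):
import com.datastax.driver.core.*;   // Cluster, Session, Metadata, TokenRange, ...
import java.util.Set;

Metadata metadata = cluster.getMetadata();
Set<TokenRange> ranges = metadata.getTokenRanges();
PreparedStatement ps = session.prepare(
    "SELECT * FROM my_keyspace.my_table WHERE token(pk) > ? AND token(pk) <= ?");
for (TokenRange range : ranges) {
    for (TokenRange sub : range.unwrap()) {        // unwrap() splits the range that wraps around the ring
        ResultSet rs = session.execute(ps.bind()
            .setToken(0, sub.getStart())
            .setToken(1, sub.getEnd()));
        // in my code each range is handed to a worker thread and the rows are consumed here
    }
}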
Everything works fine, but metadata.getTokenRanges() returns 768 token ranges (64*12), which means it is giving me token ranges across all 12 nodes.
Since I have to run through all of these token ranges, the process is very slow even with multiple threads.
Is there any way I can get the token ranges of only one data center? I even tried getting token ranges using metadata.getTokenRanges("my_keyspace", host) with a host from dc1.
I do get fewer ranges (517), but when I use that list I get back less data (incomplete results).
How can I get the token ranges of only one DC?
Edit: I checked the read/write latency in both data centers. I do not see any operations being performed on dc2, whereas I can see a clear spike on dc1.
This is even more puzzling now: if dc2 is never queried, how am I getting 64*12 token ranges, and why not 64*6?

Your total replication factor is 3+3=6, so you have 6 copies of the data: 3 copies in dc1 and 3 copies in dc2.
You have 64 vnodes per node, so 64*12 = 768 vnodes, and therefore 768 token ranges on the ring.
So if you want to do a complete table scan, you do have to query all 768 token ranges. What you are missing is that, because of the replication, the data for every one of those token ranges also resides within dc1, so you can get all the data from dc1 itself.
If you are using DCAwareRoundRobinPolicy with .withLocalDc() set to dc1 and a LOCAL_* consistency level, then you are reading only from dc1, and dc1 has all the data because its replication factor is 3.
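For example (a minimal sketch, assuming the Java driver 3.x; the contact point address is a placeholder):
import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

Cluster cluster = Cluster.builder()
    .addContactPoint("10.0.0.1")                                  // a dc1 node
    .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
        .withLocalDc("dc1")                                       // only dc1 hosts are LOCAL, dc2 is ignored
        .build())
    .withQueryOptions(new QueryOptions()
        .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))         // LOCAL_* keeps reads inside dc1
    .build();
Session session = cluster.connect();
Note that metadata.getTokenRanges() will still report all 768 ranges, because the ring metadata describes the whole cluster; but with this policy and a LOCAL_* consistency level, every one of those ranges is read from a dc1 replica.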

Related

Alter Keyspace on cassandra 3.11 production cluster to switch to NetworkTopologyStrategy

I have a Cassandra 3.11 production cluster with 15 nodes. Each node has ~500 GB of data, with replication factor 3. Unfortunately the cluster is set up with the 'SimpleStrategy' replication strategy. I am switching it to 'NetworkTopologyStrategy'. I am looking to understand the caveats of doing so on a production cluster. What should I expect?
Switching from SimpleStrategy to NetworkTopologyStrategy in a single data center configuration is very simple. The only caveat of which I would warn is to make sure you spell the data center name correctly. Failure to do so will cause operations to fail.
One way to ensure that you use the right data center, is to query it from system.local.
cassdba#cqlsh> SELECT data_center FROM system.local;
data_center
-------------
west_dc
(1 rows)
Then adjust your keyspace to replicate to that DC:
ALTER KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy',
'west_dc': '3'};
Now for multiple data centers, you'll want to make sure that you specify your new data center names correctly, AND that you run a repair (on all nodes) when you're done. This is because SimpleStrategy treats all nodes as a single data center, regardless of their actual DC definition. So you could have 2 replicas in one DC, and only 1 in another.
I have changed RFs for keyspaces on-the-fly several times. Usually, there are no issues. But it's a good idea to run nodetool describecluster when you're done, just to make sure all nodes have schema agreement.
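(The same check can also be done from application code; a sketch, assuming the Java driver 3.x and an already built cluster object:)
if (!cluster.getMetadata().checkSchemaAgreement()) {
    System.out.println("Schema is not yet in agreement across all nodes");
}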
Pro-tip: For future googlers, there is NO BENEFIT to creating keyspaces using SimpleStrategy. All it does is put you in a position where you have to fix it later. In fact, I would argue that SimpleStrategy should NEVER BE USED.
So when will the data movement commence? In my case, since I have specific rack IDs now, I expect my replicas to switch nodes upon this ALTER KEYSPACE action.
This alone will not cause any adjustment of token range responsibility. If you already have an RF of 3 and so does your new DC definition, you won't need to run a repair, so nothing will stream.
I have a 15-node cluster which is divided into 5 racks, so each rack has 3 nodes belonging to it. Since I previously had replication factor 3 and SimpleStrategy, more than one replica could have belonged to the same rack, whereas NetworkTopologyStrategy guarantees that no two replicas will belong to the same rack. So shouldn't this cause data to move?
In that case, if you run a repair, your secondary or tertiary replicas may find a new home. But your primary replicas will stay the same.
So are you saying that nothing changes until I run a repair?
Correct.

Titan using Cassandra - Multiple Datacenter Oddities

Say I have 2 datacenters - DC1 and DC2. DC1 has 3 nodes with replication 3 (fully replicated) and DC2 has 1 node with replication 1 (fully replicated).
Say the lone node in DC2 is up, all nodes in DC1 are down, and my read/write consistency is at LOCAL_QUORUM everywhere.
I try to do a transaction on DC2 but it fails due to UnavailableException, which of course means not enough nodes are online. But why? Does the LOCAL part of LOCAL_QUORUM get ignored because I only have one node in that data center?
The lone node in DC2 has 100% of the data so why can't I do anything unless 2 nodes are also up in DC1, regardless of read/write consistency settings?
With your settings, 2 replicas need to acknowledge a write for it to succeed. The partition of the failed write may belong to the down nodes, because the hash of that partition decides where it needs to go. Once you decommission those nodes, the ring gets re-adjusted and everything works fine.
But as long as they are simply down, some writes will succeed and some will fail. You can check which writes succeed and which fail by comparing the partition hashes with the ring tokens.
E.g. imagine a write arrives for the node owning token range 41-50, and according to the replication strategy the next replicas should go to the nodes owning 1-10 and 11-20. If those nodes are down, LOCAL_QUORUM is not satisfied, so your write fails.
From https://groups.google.com/forum/#!topic/aureliusgraphs/fJYH1de5wBw
"titan uses an internal consistency for locking and id allocation, the level it uses is quorum.
as a result no matter what I do titan will always access both DC."
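In other words, with RF 3 in DC1 and RF 1 in DC2, that internal QUORUM needs 3 of the 4 replicas, which the lone DC2 node can never provide on its own. Putting numbers on it (quorum = replication factor / 2 + 1, using integer division):
int quorumAllDcs   = (3 + 1) / 2 + 1;   // = 3 -> what Titan's internal QUORUM needs; impossible with all of DC1 down
int localQuorumDc1 = 3 / 2 + 1;         // = 2 -> LOCAL_QUORUM in DC1
int localQuorumDc2 = 1 / 2 + 1;         // = 1 -> LOCAL_QUORUM in DC2, which the lone node could satisfy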

Cassandra not working when one of the nodes is down

I have a development Cassandra cluster of two nodes [let's call them NodeA and NodeB]. I also have a script that is continuously sending data to NodeA. I have created the keyspace with the following parameters:
CREATE KEYSPACE test_database WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Now, for some reason, NodeB is stopping after some time. The issue is that as soon as NodeB stops, the script that is sending data to NodeA starts getting data insertion errors.
Can anyone point out a probable reason for this?
Update: Both the nodes are seed nodes.
How Cassandra handles data distribution
Each key in Cassandra is converted to a token. When you set up your cluster, the nodes calculate what range of tokens they will accept.
Let's take a simple example:
You have two nodes, and tokens that go from 0 to 9. A simple distribution would be: node A stores every token between 0-4 and node B stores every token between 5-9.
How Cassandra handles writes
You choose a coordinator (in your case node A) that receives the data. This node then calculates a token from the key. As in the first example, every node has a range of tokens assigned to it. So imagine the key is converted to token 4: the data goes to node A (here the coordinator). If the token is 8, the data is sent to node B.
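As a toy illustration of that routing in code (just the 0-9 example above, not the real Murmur3 hashing Cassandra uses):
int token = Math.floorMod(key.hashCode(), 10);        // pretend the key hashes to a token between 0 and 9
String owner = (token <= 4) ? "node A" : "node B";    // node A owns 0-4, node B owns 5-9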
What is Cassandra's replication factor
The replication factor is how many times your data is stored in your cluster. For a single data center with no racks (your case), the data is first sent to the node that owns the token associated with the key, and the replicas are sent to the next nodes in the ring.
If one node fails, the replicas allow that node's data to be restored.
In your case there are no extra replicas, so if a node owning a token is down, Cassandra can't store the data and throws an error. If you had replication factor 2, Cassandra would be able to store a replica on node A and not fail.
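If you can afford a second copy, one way out is to raise the replication factor and then repair (a sketch; the keyspace name is taken from the question, and the repair still has to be run with nodetool on each node):
session.execute("ALTER KEYSPACE test_database WITH replication = "
    + "{'class': 'SimpleStrategy', 'replication_factor': '2'}");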
Cassandra's Replication Factor:
Let's say we have 'n' as the replication factor, which means the given input data will be stored on (and can be retrieved from) 'n' nodes.
If you set the replication factor to '1', only one node will hold the data.
Partitioning:
Let's say we have 2 nodes. Whenever you insert data, both of these nodes will hold some of it, based on the partitioning algorithm used.
For example:
You insert 10 records; based on the hashing and partitioning algorithm, Cassandra chooses which node each record is written to. Of course, the identification of the node is done by the coordinator :)
Durable Writes:
By default, Cassandra always writes to the commit log before updating the memtable (which is later flushed to disk as an SSTable). If you set durable_writes to false, writes bypass the commit log, so unflushed data can be lost if a node goes down.
As for the problem you mentioned: let's say you are inserting 10 rows.
For simplicity, assume the partitioning/hashing splits them evenly, i.e. 10/2.
So the coordinator sends the 1st half to the 1st node, which succeeds, and tries to send the 2nd half to the 2nd node (writing to its commit log); since that node is unavailable, those writes throw an error.
So how do we fix this issue? Let's say I want to batch multiple insert queries when one node in the cluster is down. It returns:
Connection to Cassandra cluster associated with connection cs1 not available due to Host not available. Host Address: cassandra1
If your table is not a counter table, you can use a consistency level of ANY, which gives high availability for writes.
Refer this to learn more about it => https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08
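With the Java driver that is a per-statement setting (a sketch; the table and values are placeholders):
import com.datastax.driver.core.*;

Statement insert = new SimpleStatement(
        "INSERT INTO test_database.my_table (id, value) VALUES (?, ?)", 1, "hello")
    .setConsistencyLevel(ConsistencyLevel.ANY);   // the coordinator may store a hint even if the replica is down
session.execute(insert);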

Cassandra Partial Replication

This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens), each DC is storing roughly 25% of the key space, and this 25% is replicated 3 times, once on each of the other DCs, meaning that every single row has 4 copies overall.
If my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each DC's data is replicated only once or twice, instead of on the total number of DCs (x3)?
For example: the 25% of the key space that is stored on DC1 I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if DC1's 25% ends up split over DC2, DC3 and DC4; I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible?
Your keyspace command shows that each of the DCs holds 1 copy of the data. This means that if you have 1 node in each DC, then each node will have 100% of your data. So I am not sure how you concluded that each of your DCs stores only 25% of the keys, as they are in fact storing 100%. Chances are that when you run the nodetool command you are not specifying the keyspace, so the command shows you a load based on the token ranges assigned to each node, which is misleading for a NetworkTopologyStrategy setup. Try running it with your keyspace name and see if you notice the difference.
I don't think there is a way to shift data around DCs the way you want using any of the existing snitches. If you really wanted an even distribution and you had an equal number of nodes in each DC with initial tokens spaced evenly, you could have used SimpleSnitch to achieve what you want. You can change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopologyStrategy works is that if you say you have DC1:1 and you have, for example, 2 nodes in DC1, it will evenly distribute keys across the 2 nodes, leading to an effective load of 50% on each node. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each DC, so you can discard one DC and save money. I am saying this because I think these DCs of yours are virtual in the sense of your NetworkTopology definition and not real physical DCs, because no one would want only 25% of the data in one DC, as that would not be an available setup. So I recommend that, if your nodes are grouped into virtual DCs, you group them into 3 racks instead and maintain 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd4-ra_4 rack-a
nd4-rb_4 rack-b
nd4-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a,b,and c will have 100% of your data (each node in each rack 25%).
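The keyspace definition for that layout would be along these lines (a sketch, executed here through the Java driver; the keyspace name is taken from the question):
session.execute("ALTER KEYSPACE mySpace WITH replication = "
    + "{'class': 'NetworkTopologyStrategy', 'DC1': '3'}");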

Write data across cassandra multiple data centers

I would like to understand the following,
Suppose we have two data centers DC1 and DC2, each with two nodes.
Now I have formed a token ring with the order DC1:1 - DC2:1 - DC1:2 - DC2:2.
Let us assume, I have not configured my replicas across DCs.
Now my question is: if I write data to, say, DC2, will the key be mapped only to the nodes in DC2, or can it get mapped to any of the nodes in the token ring?
If your keyspace replication options are set to
{DC1:2}
(I assume this is what you mean by replicas not being configured across DCs), then data will only be stored in DC1, because the replication factor for DC2 is implicitly zero. You can write data to any node (in DC1 or DC2) and it will be forwarded, because in Cassandra the destination of a write does not depend on which node the write was sent to.
If, however, you use
{DC1:2, DC2:2}
then all data will be written to all nodes, again regardless of where the write is made.
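Spelled out as keyspace definitions (a sketch, executed here through the Java driver; the keyspace name is a placeholder):
// replicas only in DC1 (the "not configured across DCs" case):
session.execute("CREATE KEYSPACE my_ks WITH replication = "
    + "{'class': 'NetworkTopologyStrategy', 'DC1': '2'}");
// replicas in both DCs:
session.execute("ALTER KEYSPACE my_ks WITH replication = "
    + "{'class': 'NetworkTopologyStrategy', 'DC1': '2', 'DC2': '2'}");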
