Write data across cassandra multiple data centers - cassandra

I would like to understand the following,
Suppose we have two data centers DC1 and DC2, each with two nodes.
Now I have formed a token ring with the order DC1:1 - DC2:1 - DC1:2 - DC2:2.
Let us assume, I have not configured my replicas across DCs.
Now my question is, if I write a data into say DC2, will the key be mapped only to the nodes in DC2 or will it get mapped to any of the nodes in the token ring?

If your keyspace replication options are set to
{DC1:2}
(I assume this is what you mean by replicas not being configured across DCs.) Then data will only be stored on DC1 because implicitly the replication factor is zero for DC2. You can write data to any node (DC1 or DC2) and it will be forwarded. This is because, in Cassandra the destination of writes does not depend on which node the write was made to.
If, however, you use
{DC1:2, DC2:2}
then all data will be written to all nodes, again regardless of where the write is made.

Related

Cassandra: what node will data be written if the needed node is down?

Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, Replication Factor (RF) determines how many copies of data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". They could receive the data several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one/some of the nodes are unavailable for a configured amount of time (< 3 hours), cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair" or if you're using DSE, ops center can repair/reconcile data for a table, keyspace, or entire cluster (nodesync is also a tool that is new to DSE and similar to repair)
During a read repair - Read operations, depending on the configurable client consistency level (described next) can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not.
The configurable client consistency level (CL) will determine how many nodes must acknowledge they have successfully received the data in order for the client to be satisfied to move on (for writes) - or how many nodes to compare with when data is read to ensure accuracy (for reads). The number of nodes available must be equal to or greater than the client CL number specified or the application will error (for example it won't be able to compare a QUORUM level of nodes if a QUORUM number of nodes are not available). This setting does not dictate how many nodes will receive the data. Again, that's the RF keyspace setting. That will always hold true. What we're specifying here is how many must acknowledge each write or compare for each read in order the client to be happy at that moment. Hopefully that makes sense.
Now...
In your scenario with a RF=1, the application will receive an error upon the write as the single node that should receive the data (based off of a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to be the unavailable node). Does that make sense?
If you had a RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm will chose where the cop(ies) go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose a RF=3 (3 copies) then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.

Avoid Cassandra full table scan cross DC

I have Cassandra cluster nodes distributed across 2 data centers. 6 nodes in each data center, a total of 12 nodes
My keyspace definition:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'} AND durable_writes = true;
In each node, I have 64 tokens.
I am using Cassandra driver to connect to the cluster and using default load balancing policy DCAwareRoundRobinPolicy and using only dc1 nodes as contact points. So, I assume dc2 nodes will have HostDistance IGNORED and my application won't connect to them.
Note: For all my reads and writes I use the same configuration
My use case is to do a full table scan. But, I can not use Spark. So, instead, I an achieving this by getting all token range using metadata.getTokenRanges() and querying these token ranges in multiple threads.
Everything works fine. But, metadata.getTokenRanges() return 768 tokens(64*12). Which means it's giving me token range across all 12 nodes.
Since I have to run through all token ranges. Even with multiple threads, the process is very slow.
Is there any way I can get token ranges of only one data center. I even tried to get token ranges using metadata.getTokenRanges("my_keyspace", host from dc1).
I do get less number of tokens(517), but when I use this list, I get fewer data.
How can I get token ranges of only 1 DC?
Edit: I checked read/write latency in both the clusters. I do not see any operations being performed on dc2, whereas I can see a clear spike on my dc1 data center.
This is even more puzzling for me now. If dc2 is never queried how I am getting 64*12 +1 token ranges? and why not 64*6+1?
Your replication_factor is 3+3=6. So you may have 6 times the actual data. 3 copies in dc1 and 3 copies in dc2.
You have 64 vnodes per node so 64*12 =768 vnodes.
So, if you want to do a complete table scan then you might have to query all the token ranges i.e 768. What you are missing is that, because of the replication all that token ranges' data will reside within dc1. So you can get all the data from dc1 itself.
If you are using DCAwareRoundRobinPolicy and set .withLocalDc() with dc1 and consistency level LOCAL_* then you are reading only from dc1. dc1 will have all the data because the replication_factor of dc1 is 3.

Alter Keyspace on cassandra 3.11 production cluster to switch to NetworkTopologyStrategy

I have a cassandra 3.11 production cluster with 15 nodes. Each node has ~500GB total with replication factor 3. Unfortunately the cluster is setup with Replication 'SimpleStrategy'. I am switching it to 'NetworkTopologyStrategy'. I am looking to understand the caveats of doing so on a production cluster. What should I expect?
Switching from mSimpleStrategy to NetworkTopologyStrategy in a single data center configuration is very simple. The only caveat of which I would warn, is to make sure you spell the data center name correctly. Failure to do so will cause operations to fail.
One way to ensure that you use the right data center, is to query it from system.local.
cassdba#cqlsh> SELECT data_center FROM system.local;
data_center
-------------
west_dc
(1 rows)
Then adjust your keyspace to replicate to that DC:
ALTER KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy',
'west_dc': '3'};
Now for multiple data centers, you'll want to make sure that you specify your new data center names correctly, AND that you run a repair (on all nodes) when you're done. This is because SimpleStrategy treats all nodes as a single data center, regardless of their actual DC definition. So you could have 2 replicas in one DC, and only 1 in another.
I have changed RFs for keyspaces on-the-fly several times. Usually, there are no issues. But it's a good idea to run nodetool describecluster when you're done, just to make sure all nodes have schema agreement.
Pro-tip: For future googlers, there is NO BENEFIT to creating keyspaces using SimpleStrategy. All it does, is put you in a position where you have to fix it later. In fact, I would argue that SimpleStrategy should NEVER BE USED.
so when will the data movement commence? In my case since I have specific rack ids now, so I expect my replicas to switch nodes upon this alter keyspace action.
This alone will not cause any adjustments of token range responsibility. If you already have a RF of 3 and so does your new DC definition, you won't need to run a repair, so nothing will stream.
I have a 15 nodes cluster which is divided into 5 racks. So each rack has 3 nodes belonging to it. Since I previously have replication factor 3 and SimpleStrategy, more than 1 replica could have belonged to the same rack. Whereas NetworkStrategy guarantees that no two replicas will belong to the same rack. So shouldn't this cause data to move?
In that case, if you run a repair your secondary or ternary replicas may find a new home. But your primaries will stay the same.
So are you saying that nothing changes until I run a repair?
Correct.

Titan using Cassandra - Multiple Datacenter Oddities

Say I have 2 datacenters - DC1 and DC2. DC1 has 3 nodes with replication 3 (fully replicated) and DC2 has 1 node with replication 1 (fully replicated).
Say the lone node in DC2 is up, all nodes in DC1 are down, and my read/write consistency is at LOCAL_QUORUM everywhere.
I try to do a transaction on DC2 but it fails due to UnavailableException, which of course means not enough nodes are online. But why? Does the LOCAL part of LOCAL_QUORUM get ignored because I only have one node in that data center?
The lone node in DC2 has 100% of the data so why can't I do anything unless 2 nodes are also up in DC1, regardless of read/write consistency settings?
With your settings, 2 replicas need to be written to disk for a write to succeed. Here the failed write(row) partition might belong to the down nodes. Because the hash of that partition decides where it needs to go. Once you decommision those nodes, ring gets re-adjusted and work fine.
But as long as they are simply down, some writes will succeed and some writes will fail. You can check which write succeeds and which one fails by simply checking the hash and ring tokens
eg: Now imagine we got a request for that node with token range 41-50. And according to replication strategy, the next replica should go to 1-20 and 11-20, then LOCAL_QUORAM is not satisfied because they are down. So your write fails.
From https://groups.google.com/forum/#!topic/aureliusgraphs/fJYH1de5wBw
"titan uses an internal consistency for locking and id allocation, the level it uses is quorum.
as a result no matter what I do titan will always access both DC."

Cassandra rack concept and database structure

I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS2 instances.
How many Data Centres will be used?
How many nodes will each rack have?
Is it possible to divide a column family in multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise
until reaching the first node in another rack. NetworkTopologyStrategy
attempts to place replicas on distinct racks because nodes in the same
rack (or similar physical grouping) often fail at the same time due to
power, cooling, or network issues.
In this way, you can also query your data by LOCAL_QUORUM, in which QUORUM ((replication_factor / 2) + 1) is only computed from the nodes present in the same data center as the coordinator node. This reduces the effects of inter-data center latency.
As for your questions:
How many data centers are used are entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but only makes sense if you are planning to use consistency level ONE. As-in, if one instance goes down, your application only needs to worry about finding one other replica. But even then, the snitch can only find data on one instance, or the other.
Again, you can define the number of nodes that you wish to have for each rack. But as I indicated with #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
Really, the logical datacenter/rack structure becomes more-useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches

Resources