How to handle QUORUM consistency in 4 node, 2 DC Cassandra cluster - cassandra

I have a four node, two Data Center cassandra 1.1.1 cluster.
My keyspace is RF 2 per Data center, giving me complete copy of data on each node.
The cluster is for a vendor product, which uses r/w consistency of QUORUM. With this config I can only handle the loss of one node.... How can I tweak it to handle the loss of a data center?

Unless your data centers are in the same physical location, your network overhead is going to be terrible with this configuration. The reason is because quorum consistency will not pay attention to DC when it's comparing replicas. So you will frequently have to cross data center lines before acking a read or write. Switching to local quorum would solve the latency issue but would effectively cause a data center to go down if one node goes down. However, as long as both nodes in the second DC are up (and your app can handle this properly), you will still be up and running.
Having said that, the general rule of thumb is that 3 nodes is the bare minimum per data center. If you add a node to each data center and switch to local quorum R/W, you can lose one node in each DC and still have that DC operational, or you can lose an entire DC with the other remaining operational.

Related

Cassandra Spark Connector : requirement failed, contact points contain multiple data centers

I have two Cassandra datacenters, with all servers in the same building, connected with 10 gbps network. The RF is 2 in each datacenter.
I need to ensure strong consistency inside my app, so I first planed to use QUORUM consistency (3 replicas of 4 must respond) on both reads and writes. With that configuration, I can also be fault tolerant if a node crash on a particular datacenter.
So I set multiples contact point from multiples datacenter to my spark connector, but the following error is immediately returned : requirement failed, contact points contain multiple data centers
So I look at the documentation. It say :
Connections are never made to data centers other than the data center of spark.cassandra.connection.host [...]. This technique guarantees proper workload isolation so that a huge analytics job won't disturb the realtime part of the system.
Okay. So after reading that, I plan to switch to LOCAL_QUORUM (2 replicas of 2 must respond) on write, and LOCAL_ONE on read, to still get strong consistency, and connect by default on datacenter1.
The problem, is still consistency, because Spark apps working on the second datacenter datacenter2 don't have strong consistency on write, because data are just asynchronously synchronized from datacenter1.
To avoid that, I can set write consistency to EACH_QUORUM (= ALL). But the problem in that case, is if a single node is unresponsive or down, the entire writes are unable to process.
So my only option, to have both some fault tolerance, AND strong consistency, is to switch my replication factor from 2 to 3 on each datacenter. Then use EACH_QUORUM on write, and LOCAL_QUORUM on read ? Is that correct ?
Thank you
This comment indicates there is some misunderstanding on your part:
... because data are just asynchronously synchronized from datacenter1.
so allow me to clarify.
The coordinator of a write request sends each mutation (INSERT, UPDATE, DELETE) to ALL replicas in ALL data centres in real time. It doesn't happen at some later point in time (i.e. 2 seconds later, 10s later or 1 minute later) -- it gets sent to all DCs at the same time without delay regardless of whether you have a 1Mbps or 10Gbps link between DCs.
We also recommend a minimum of 3 replicas in each DC in production as well as use LOCAL_QUORUM for both reads and writes. There are very limited edge cases where these recommendations do not apply.
The spark-cassandra-connector requires all contacts points to belong to the same DC so:
analytics workloads do not impact the performance of OLTP DCs (as you already pointed out), and
it can achieve data-locality for optimal performance where possible.

How to determine the sync status is up to date for particular node in a Cassandra cluster?

Suppose I have two node cassandra cluster and they are reside on physically different data-centers. Suppose the database inside that cluster has replication factor is 2 which means every data in that database should be sync with each other. suppose this database is a massive database which have millions of records of its tables. I named those nodes centers as node1 and node2. Suppose node2 is not reliable and there was a crash on that server and take few days to fix and get the server back to up and running state. After that according to my understating there should be a gap between node1 and node2 and it may take significant time to sync node2 with node1. So need a way to measure the gap between node2 and node1 for the mean time of sync happen? After some times how should I assure that node2 is equal to node1? Please correct me if im wrong with this question according to the cassandra architechure.
So let's start with your description. 2 node cluster, which sounds fine, but 2 nodes in 2 different data centers (DCs) - bad design, but doable. Each data center should have multiple nodes to ensure your data is highly available. Anyway, that aside, let's assume you have a 2 node cluster with 1 node in each DC. The replication factor (RF) is defined at the keyspace level (not at the cluster level - each DC will have a RF setting for a particular keyspace (or 0 if not specified for a particular DC)). That being said, you can't have RF=2 for a keyspace for either of your DCs if you only have a single node in each one (RF, which is how many copies of the data that exist, can't be more than the number of nodes in the DC). So let's put that aside for now as well.
You have the possibility for DCs to become out of sync as well as nodes within a DC to become out of sync. There are multiple protections against this problem.
Consistency Level (CL)
This is a lever that you (the client) have to be able to help control how far out of sync things get. There's a trade off between availability v.s. consistency (with performance implications as well). The CL setting is configured at connection time and/or each statement level. For writes, the CL determines how many nodes must IMMEDIATELY ACKNOWLEDGE the write before giving your application the "green light" to move on (a number of nodes that you're comfortable with - knowing the more nodes you immediately require the more consistent your nodes and/or DC(s) will be, but the longer it will take and the less flexibility you have in nodes becoming unavailable without client failure). If you specify less than RF it doesn't mean that RF won't be met, it just means that they don't need to immediately acknowledge the write to move on. For reads, this setting determines how many nodes' data are compared before the result is returned (if cassandra finds a particular row doesn't match from the nodes it's comparing, it will "fix" them during the read before you get your results - this is called read repair). There are a handful of CL options by the client (e.g. ONE, QUORUM, LOCAL_ONE, LOCAL_QUOURM, etc.). Again, there is a trade-off between availability and consistency with the selected choice.
If you want to be sure your data is consistent when your queries run (when you read the data), ensure the write CL + the read CL > RF. You can ensure that's done on a LOCAL level (e.g. the DC that the read/write is occurring on, say, LOCAL_QUORUM) or globally (all DCs with QUORUM). By doing this, you'll be sure that while your cluster may be inconsistent, your results during reads will not be (i.e. the results will be consistent/accurate - which is all that anyone really cares about). With this setting you also allow some flexibility in unavailable nodes (e.g. for a 3 node DC you could have a single node be unavailable without client failure for either reads or writes).
If nodes do become out of sync, you have a few options at this point:
Repair
Repair (run by "nodetool repair") - this is a facility that you can schedule or manually run to reconcile your tables, keyspaces and/or the entire node with other nodes (either in the DC the node resides or the entire cluster). This is a "node level" command and must be run on each node to "fix" things. If you have DSE, Ops Center can run repairs in the background fixing "chunks" of data - cycling the process repetitively.
NodeSync
Similar to repair, this is a DSE specific tool similar to repair that helps keep data in sync (the newer version of repair).
Unavailable nodes:
Hinted Handoff
Cassandra has the ability to "hold onto" changes if nodes become unavailable during writes. It will hang onto changes for a specified period of time. If the unavailable nodes become available before time runs out, the changes are sent over for application. If time runs out, hint collection stops and one of the other options, above, need to be performed to catch things up.
Finally, there is no way to know how inconsistent things are (e.g. 30% inconsistent). You simply try to utilize the tools mentioned above to control consistency without completely sacrificing availability.
Hopefully that makes sense and helps.
-Jim

What happens if one DC in Cassandra runs out of physical memory?

I'm new to cassandra and I'm asking myself what will happen if I have multiple datacenters and at one point one datacenter won't have enough physical memory to store all the data.
Assume we have 2 DCs. The first DC can store 1 TB and the second DC can only hold 500 GB. Furthermore lets say we have a replication factor = 1 for both DCs. As I understand the both DCs will have the complete token ring, so every DC will have the full data. What happens now, if I push data to DC 1 and the total amount of storage needed exceeds 500 GB?
For keeping things simple, I will consider that you write the data using DC1, so this one will be the local DC in each scenario. DC2 that is down will be remote all the time. So what really matters here is what consistency level you are using for yours writes:
consistency level of type LOCAL (LOCAL_QUORUM, ONE, LOCAL_ONE) - you can write your data.
consistency level of type REMOTE (ALL, EACH_QUORUM, QUORUM, TWO, THREE) - you cannot write your data.
I suggest to read about consistency levels.
Also, a very quick test using ccm and cassandra-stress tool might be helpful to reproduce different scenarios.
Another comment is regarding your free space: when a node will hit the 250GB mark (half of 500GB) you will have compaction issues. The recommendation is to have half of the disk empty for compactions to run.
Let's say, however, that you will continue getting data to that node and you will hit the 500GB mark. Cassandra will stop on that node.

Alter Keyspace on cassandra 3.11 production cluster to switch to NetworkTopologyStrategy

I have a cassandra 3.11 production cluster with 15 nodes. Each node has ~500GB total with replication factor 3. Unfortunately the cluster is setup with Replication 'SimpleStrategy'. I am switching it to 'NetworkTopologyStrategy'. I am looking to understand the caveats of doing so on a production cluster. What should I expect?
Switching from mSimpleStrategy to NetworkTopologyStrategy in a single data center configuration is very simple. The only caveat of which I would warn, is to make sure you spell the data center name correctly. Failure to do so will cause operations to fail.
One way to ensure that you use the right data center, is to query it from system.local.
cassdba#cqlsh> SELECT data_center FROM system.local;
data_center
-------------
west_dc
(1 rows)
Then adjust your keyspace to replicate to that DC:
ALTER KEYSPACE stackoverflow WITH replication = {'class': 'NetworkTopologyStrategy',
'west_dc': '3'};
Now for multiple data centers, you'll want to make sure that you specify your new data center names correctly, AND that you run a repair (on all nodes) when you're done. This is because SimpleStrategy treats all nodes as a single data center, regardless of their actual DC definition. So you could have 2 replicas in one DC, and only 1 in another.
I have changed RFs for keyspaces on-the-fly several times. Usually, there are no issues. But it's a good idea to run nodetool describecluster when you're done, just to make sure all nodes have schema agreement.
Pro-tip: For future googlers, there is NO BENEFIT to creating keyspaces using SimpleStrategy. All it does, is put you in a position where you have to fix it later. In fact, I would argue that SimpleStrategy should NEVER BE USED.
so when will the data movement commence? In my case since I have specific rack ids now, so I expect my replicas to switch nodes upon this alter keyspace action.
This alone will not cause any adjustments of token range responsibility. If you already have a RF of 3 and so does your new DC definition, you won't need to run a repair, so nothing will stream.
I have a 15 nodes cluster which is divided into 5 racks. So each rack has 3 nodes belonging to it. Since I previously have replication factor 3 and SimpleStrategy, more than 1 replica could have belonged to the same rack. Whereas NetworkStrategy guarantees that no two replicas will belong to the same rack. So shouldn't this cause data to move?
In that case, if you run a repair your secondary or ternary replicas may find a new home. But your primaries will stay the same.
So are you saying that nothing changes until I run a repair?
Correct.

Cassandra - Replication Factor vs Nodes Count Relation

One of my C* cluster design expects nodes to hold between 1 and 2 TBs of data each, and I expect a huge amount of data in a few months. Pretending I can get 1PB of data and that each node will hold 1TB of data, that means I should plan for a 1000x growth over time, and starting from a "misere" N=3 nodes with RF=3 for 1TB of data, I would keep adding nodes up to N=3000 over time.
The high number of nodes involved put some pressures on how to deal with disks/servers failures, keep the cluster healthy and how to perform backups.
Healthy Cluster
Assuming you don't want any data loss and perform reads/writes with LOCAL_QUOROM Consistency Level, using RF=3 when you have N<10 nodes is very reasonable, however when you go up with N the MTBF of your nodes goes down accordingly, so keeping RF=3 is going to call for troubles and you may want to "upgrade" to RF=5 or more.
Q1: What's a good RF that would fight against the decreased MTBF and keep the cluster healthy (and you sleeping peacefully) with say 100 nodes? and 500? and 1000?
BACKUP
Making backups of all the nodes seems to be a bit not viable due to the following reasons:
Doubles the costs of the solution instantly.
I would backup the redundant data due to the RF of the cluster.
I see no way to remove the redundancy introduced by the RF and backup only the data expect adding another DC to C* with RF=2 (I could go for RF=1 but if I lose one node all the backup cluster is down). That
would mean adding 2/RF of the cost of the cluster for backup
purposes which seems to me a good alternative.
Q2: Are there any other methods to perform this task without increasing too much the cost of the solution?

Resources