YugabyteDB deployment in 2 datacenters - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
I have two different datacenters, and I want the app to write simultaneously to the database in both of them. The instances in both the primary and secondary datacenters should be active, accept writes, and replicate to each other either synchronously or asynchronously. However, ACID properties should be maintained so that data is read consistently at both sites. The database in the primary should have all the data that the secondary has, and vice versa. The latency between the datacenters is 40ms.

Option 1: Use a single, multi-region YugabyteDB cluster stretched across the datacenters. YugabyteDB uses synchronous replication within a single cluster, based on a quorum (consensus) protocol.
For this deployment, because of the use of the Raft protocol, an odd number of datacenters (typically 3) is recommended so that you can tolerate a datacenter failure and still remain active. Disadvantage: this deployment will generally have higher latency for write operations, and the further apart the DCs are, the higher the latency will be.
At the tablet level, leaders have to coordinate the writes, so no matter which node or DC a request comes from, it first has to be routed to the leader of that tablet. This can add 40ms if the leader for the shard is in DC1 but the write request comes from an app running in DC2 or DC3. You will not have this penalty if you are primarily writing from one DC and you pick that DC as the preferred zone, which keeps all the leaders there by default.
On top of that, the number of network round-trips depends on whether the operation is a fast-path (single-shard) transaction, e.g. a simple single-row INSERT, in which case it adds roughly another 40ms, or a distributed transaction, e.g. a multi-row INSERT or an INSERT into a table with one or more secondary indexes, which involves about two network round-trips, so closer to 80-100ms.
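To make the two cases concrete, here is a minimal sketch using plain JDBC against YSQL (YugabyteDB's PostgreSQL-compatible API). The connection string, credentials, and the orders table are placeholder assumptions; the actual latency of each statement depends on where the tablet leaders live relative to the app.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class WritePathDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical YSQL endpoint (5433 is the default YSQL port).
        String url = "jdbc:postgresql://127.0.0.1:5433/yugabyte";
        try (Connection conn = DriverManager.getConnection(url, "yugabyte", "yugabyte")) {

            // Fast path: a single-row INSERT on a table with no secondary indexes is a
            // single-shard transaction, roughly one round-trip to the tablet leader.
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO orders (id, total) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                ps.setInt(2, 100);
                ps.executeUpdate();
            }

            // Distributed transaction: multiple rows (potentially on different tablets) in one
            // transaction take the distributed-transaction path and need extra round-trips.
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO orders (id, total) VALUES (?, ?)")) {
                ps.setInt(1, 2); ps.setInt(2, 200); ps.executeUpdate();
                ps.setInt(1, 3); ps.setInt(2, 300); ps.executeUpdate();
            }
            conn.commit();
        }
    }
}
```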
Option 2: Use two YugabyteDB clusters, one in DC1 and one in DC2, each with RF=3 and asynchronously replicating to the other. Both clusters can take writes, and write latency stays at intra-DC levels (so it will be much faster than Option 1). However, with async replication you will not have immediate consistency at both sites, so no full ACID guarantees across the two clusters. Furthermore, if you are taking writes on both sites and replicating asynchronously in both directions, care has to be exercised. If the two clusters touch unrelated sets of keys/records, it is less of an issue; if they update the same records, the semantics are simply "last writer wins", which is not ACID. In short, bidirectional async replication should be used carefully and comes with these caveats, as you can imagine, due to the very nature of async replication.

Related

Why is Cassandra considered partition tolerant under the CAP theorem even though we can isolate the coordinator?

Here is the definition of partition tolerance by Gilbert and Lynch:
When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.
Let's divide the cluster into two partitions: the first one contains only the coordinator, and the second one contains all the other nodes. This way the coordinator will not be able to contact any replicas and will respond with an error. Is that allowed for partition-tolerant systems?
More specifically, I think the question is which of the other two CAP attributes Cassandra retains in the face of such a partition.
The answer depends on the configured consistency level. For writes there is the ANY consistency level. At this consistency level, so long as hinted handoff is enabled, the coordinator will record the write and maintain Availability. Clients connected to other coordinators will not be able to see the updated value until the partition is resolved, so reads will not be Consistent. If a stronger consistency level is chosen, then the client is explicitly configuring Consistency over Availability.
So can Cassandra (given that it does not necessarily replicate all data to all nodes) be considered AP when a read coordinator is alone in a partition? If it responds with an error that sounds like Consistency to me, if it responds with an empty result set because the data is not in its partition, then that would be Availability. Since the weakest read consistency level is ONE - requiring at least one replica to respond, Cassandra opts for the former: If the coordinator is not itself one of the replicas owning the requested data then the read will time out and not be Available. As with writes, any stronger read consistency level explicitly configures Cassandra to behave more Consistently at the expense of Availability.
So the "coordinator" node isn't a long-lasting or "leader"-like definition. It changes with practically every query. If there was a non-token-aware operation which needed a coordinator node, and that coordinator was suddenly partitioned-off from the rest, then that one query would fail.
The next query (or a retry) would pick a new node as a coordinator. The only issue, would be that some data rows will be short by one replica (data stored on the partitioned node). But as long as you're querying by ONE and have a RF >= 2, the cluster will continue on like nothing happened.
So "yes," Cassandra is definitely partition-tolerant.
Note: This is why it's important to use a token-aware load balancing policy. That way the driver picks one of the nodes containing the required data as the "coordinator." At consistency ONE, the operation is completed locally, and a network hop is taken out of the equation.
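As a sketch of that note, here is roughly how a token-aware load balancing policy and a read at consistency ONE are wired up with the DataStax Java driver 3.x; the contact point, local DC name, keyspace, and users table are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareRead {
    public static void main(String[] args) {
        // Token-aware policy wrapping a DC-aware policy: the driver prefers a replica that
        // owns the requested partition, so a CL=ONE read can complete without an extra hop.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")                       // placeholder contact point
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().withLocalDc("dc1").build()))
                .build();
        try {
            Session session = cluster.connect("my_keyspace");       // hypothetical keyspace
            SimpleStatement read =
                    new SimpleStatement("SELECT * FROM users WHERE id = ?", 42);
            read.setConsistencyLevel(ConsistencyLevel.ONE);
            ResultSet rs = session.execute(read);
            System.out.println(rs.one());
        } finally {
            cluster.close();                                        // also closes the session
        }
    }
}
```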

Cassandra Spark Connector: requirement failed, contact points contain multiple data centers

I have two Cassandra datacenters, with all servers in the same building, connected by a 10 Gbps network. The RF is 2 in each datacenter.
I need to ensure strong consistency inside my app, so I first planned to use QUORUM consistency (3 replicas of 4 must respond) on both reads and writes. With that configuration, I can also be fault tolerant if a node crashes in a particular datacenter.
So I set multiple contact points from multiple datacenters in my Spark connector, but the following error is immediately returned: requirement failed, contact points contain multiple data centers
So I looked at the documentation. It says:
Connections are never made to data centers other than the data center of spark.cassandra.connection.host [...]. This technique guarantees proper workload isolation so that a huge analytics job won't disturb the realtime part of the system.
Okay. So after reading that, I plan to switch to LOCAL_QUORUM (2 replicas of 2 must respond) on writes and LOCAL_ONE on reads, to still get strong consistency, and to connect by default to datacenter1.
The problem is still consistency, because Spark apps working in the second datacenter (datacenter2) don't have strong consistency on writes, because data is just asynchronously synchronized from datacenter1.
To avoid that, I can set the write consistency to EACH_QUORUM (equivalent to ALL here). But the problem in that case is that if a single node is unresponsive or down, writes cannot be processed at all.
So my only option, to have both some fault tolerance and strong consistency, is to switch my replication factor from 2 to 3 in each datacenter, and then use EACH_QUORUM on writes and LOCAL_QUORUM on reads? Is that correct?
Thank you
This comment indicates there is some misunderstanding on your part:
... because data are just asynchronously synchronized from datacenter1.
so allow me to clarify.
The coordinator of a write request sends each mutation (INSERT, UPDATE, DELETE) to ALL replicas in ALL data centers in real time. It doesn't happen at some later point in time (e.g. 2 seconds, 10 seconds, or 1 minute later); it gets sent to all DCs at the same time, without delay, regardless of whether you have a 1 Mbps or 10 Gbps link between the DCs.
We also recommend a minimum of 3 replicas in each DC in production as well as use LOCAL_QUORUM for both reads and writes. There are very limited edge cases where these recommendations do not apply.
The spark-cassandra-connector requires all contact points to belong to the same DC so that:
analytics workloads do not impact the performance of OLTP DCs (as you already pointed out), and
it can achieve data-locality for optimal performance where possible.
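As a rough sketch of how the single-DC contact-point requirement and the LOCAL_QUORUM recommendation translate into spark-cassandra-connector settings; the address, app name, and local master are placeholders, and in a real deployment the master would come from spark-submit.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ConnectorConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("analytics-job")
                .setMaster("local[2]")                               // placeholder for a local test run
                // All contact points must belong to one DC; here a node in datacenter1.
                .set("spark.cassandra.connection.host", "10.0.0.1")
                // Per the recommendation above: LOCAL_QUORUM for both reads and writes.
                .set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
                .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs / DataFrames with the connector here ...
        sc.stop();
    }
}
```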

The relationship between replication factor and resource usage

I am a Cassandra user in China. Recently we have wanted to use Cassandra in our production environment, but I don't know the impact of the replication factor on resource consumption.
My stress test shows that a replication factor of 3 uses three times more resources than a replication factor of 1, but I'm not sure that's right.
So I would like to ask: is there a formula relating replication factor to resource consumption? Or has anyone ever tested it?
I would be very grateful if anyone could reply.
First of all, RF=3 means you need at least three servers (obviously). But really, it depends on what you mean by "resources." If that's mainly referring to disk space, then yes, setting RF=3 will use 3x the disk space that a single copy (RF=1) would.
So why would you want that? Because supporting data loads in highly-available (HA) scenarios is what Cassandra does really well. This means that Cassandra needs to be able to continue to serve requests if a node should fail. Achieving that means setting RF>1.
As for the remaining resources, if you're referring to network, CPU & RAM as well, then the answer is "it depends." An application can choose to query at different consistency levels, such as ONE, QUORUM, or ALL (and others). For ONE, it does just what it says: an operation (read or write) waits for acknowledgement from a single node.
So if an app is querying at a consistency of ONE, the answer is "no," it won't use three times the resources if RF=3.
Cassandra is a distributed database, so it stores data based on partitioning and a hash algorithm. We can configure the number of replicas of our data based on the requirements and the nature of the application. A Cassandra cluster with a minimum of 3 nodes is recommended for production, but you can configure the replication factor (the number of copies of the data) however you wish.
If you use a 3-node cluster with RF=3, then every node holds a full copy of the data (each node owns roughly 1/3 of the token range as the primary replica). We need to provision resources (disk, CPU, memory, I/O, etc.) equally across all 3 nodes for good performance. However, we can tune multiple things (such as consistency, compaction, network, and OS settings) in Cassandra to improve performance and resource efficiency. Three copies of the data will use more memory and disk than one copy, but if you care about availability and performance you should use at least 2 copies. You can refer to the link below for more details on RF calculation:
https://www.ecyrd.com/cassandracalculator/

Maintaining a dynamic consistency level in DataStax

I have a 5-node cluster and a keyspace with a replication factor of 3. The nature of the operations is such that writes are much more important than reads, but the frequency of read operations is about 10 times higher than that of writes. To achieve consistency while improving overall performance, I chose to set the consistency level for writes to ALL and for reads to ONE. But this causes operations to fail if even one node is down.
Is there a method by which I can simultaneously change the consistency level for (write, read) from (ALL, ONE) to (QUORUM, QUORUM) if one node is detected as down, or if there is a query execution exception, and do this in a manner where no operation passes through a temporary phase that sees a (QUORUM, ONE) setting?
We also plan to expand to twice the capacity: 3 datacenters with 4 nodes each. Is it possible to define custom consistency levels, like a level of ALL in any one datacenter and ONE in the others? I'm thinking that a level of EACH_ONE for reads, coupled with the above level for writes, would ensure consistency but allow the cluster to remain available even if a node goes down.
The flexibility is there since you can set your consistency level on a per-request basis. Depending on the client you are using, there are some nice capabilities. For example, the Java driver has something called a DowngradingConsistencyRetryPolicy such that if a request fails, it will be retried with the next lowest consistency level until the request succeeds. This pushes the complexity of retrying into the client so you don't have to write a bunch of code for it; it's really nice!
The Java driver also allows you to configure the consistency level per request with Statement#setConsistencyLevel().
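Here is a minimal sketch of both ideas with the Java driver 3.x (note that DowngradingConsistencyRetryPolicy was deprecated in later 3.x releases and removed in 4.0); the contact point, keyspace, and events table are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

public class TunableConsistency {
    public static void main(String[] args) {
        // The retry policy transparently retries a failed request at a lower consistency
        // level that the reported number of live replicas can still satisfy.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")                        // placeholder contact point
                .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                .build();
        try {
            Session session = cluster.connect("my_keyspace");        // hypothetical keyspace

            // Consistency is chosen per statement, so writes and reads can use different levels.
            SimpleStatement write = new SimpleStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)", 1, "hello");
            write.setConsistencyLevel(ConsistencyLevel.ALL);         // first attempt at ALL
            session.execute(write);

            SimpleStatement read =
                    new SimpleStatement("SELECT payload FROM events WHERE id = ?", 1);
            read.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(read);
        } finally {
            cluster.close();
        }
    }
}
```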
As far as custom consistency levels go, this is not an option available to you (without changing the Cassandra source code); however, I think what is made available should be sufficient.
For reads, I don't find much value in ensuring consistency between data centers on read. I think LOCAL_QUORUM is more than sufficient, but if you really care, you can use something like EACH_QUORUM to ensure all datacenters agree; that will, however, severely impact your response time and availability. For example, if one of your datacenters goes down completely, you won't be able to do reads at all (unless downgrading).
For writes, I'd strongly recommend not using ALL in a multi datacenter set up if you care about response time and availability. Depending on your requirements, LOCAL_QUORUM should likely be more than sufficient.
While one of the benefits of Cassandra is that consistency is tunable, you can have as much strong consistency as you like, but keep in mind that Cassandra is at its best as a Highly Available, Partition Tolerant system.
A really good presentation on consistency that I think really nails a lot of these points is Christos Kalantzis' talk 'Eventual Consistency != Hopeful Consistency', which suggests that a consistency level of ONE is sufficient for a lot of use cases.

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. you won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read/write at lower consistency levels, which grants you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees that the latest "written" data is retrieved. However, if a coordinator doesn't know about required partitions being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas [assuming quorum consistency on a replication factor of 3]. The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure; it's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). The coordinator gets a digest from the third node, and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data, as the write was "in progress" but not "complete". This is a very rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually OK to get around this problem; however, if nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem. From the theorem's point of view, the C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation the C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. Looking through all of the available consistency configurations, for QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
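As a quick sanity check of that formula, here is a small self-contained sketch (Cassandra computes a quorum as floor(RF/2) + 1; the numbers mirror the RF=3 example above):

```java
public class QuorumMath {
    // Quorum for a given replication factor: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int rf = 3;                   // replication factor from the example above
        int written = quorum(rf);     // QUORUM writes -> 2 replicas must acknowledge
        int read = quorum(rf);        // QUORUM reads  -> 2 replicas must respond
        // Strong read consistency holds when the read and write replica sets must overlap.
        System.out.printf("written=%d, read=%d, overlap guaranteed: %b%n",
                written, read, written + read > rf);
    }
}
```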
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.
