Data consistency across nodes in NoSQL - Cassandra

This is a design-level question.
I have a three-node setup, N1, N2, and N3, where my application and database (Cassandra, for now) run on all 3 nodes.
I need to guarantee data consistency for the following scenario. Could someone provide answers?
Thread (T1) tries to edit the data in Node N1
Thread (T2) tries to edit the same data from Node N2
Only one write should succeed
In this case, what will happen in Cassandra?
Is there a way to handle this concurrency via the application or the Cassandra database? Or any algorithms?
Apart from lightweight transactions (LWT) in Cassandra.

Cassandra offers tunable consistency. In your case this means that a write with CL=QUORUM is acknowledged by 2 out of 3 nodes. A read with CL=QUORUM is consistent because it also gets results from 2 out of 3 nodes, so there is always an overlap.
For writes Cassandra uses a last-write-wins mechanism. This means that, regardless of the consistency level, a reader will see either T1's or T2's write, depending on when the read happens. Once both writes have been applied, readers will only see the one with the later timestamp.
If you want a locking mechanism, you can use offline concurrency patterns in your application layer, such as an optimistic or pessimistic offline lock.
Some persistence frameworks offer implementations of these patterns out of the box.
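For illustration, here is a minimal sketch of the optimistic offline lock pattern at the application layer. The in-memory store below is a hypothetical stand-in for whatever persistence you use; the point is only that each record carries a version number and an update is rejected if its base version is stale, so of two concurrent editors exactly one succeeds.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an optimistic offline lock in the application layer. The map is a
// hypothetical stand-in for the database; each record carries a version, and an
// update is applied only if the caller still holds the current version.
public final class OptimisticLockStore {

    public static final class Versioned {
        public final String value;
        public final long version;
        public Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();

    public void create(String key, String value) {
        store.putIfAbsent(key, new Versioned(value, 0L));
    }

    public Versioned read(String key) {
        return store.get(key);
    }

    /** Returns true if the write was applied, false if another writer got there first. */
    public boolean update(String key, long expectedVersion, String newValue) {
        final boolean[] applied = {false};
        store.computeIfPresent(key, (k, current) -> {
            if (current.version != expectedVersion) {
                return current;                       // conflict: keep the other writer's data
            }
            applied[0] = true;
            return new Versioned(newValue, expectedVersion + 1);
        });
        return applied[0];
    }
}
```

In Cassandra itself a lightweight transaction (e.g. UPDATE ... IF version = ?) performs the same compare-and-set on the server side, but since the question excludes LWT, the version check here lives entirely in the application, which detects the conflict rather than preventing it.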

Related

YugabyteDB deployment in 2 datacenters

[Question posted by a user on YugabyteDB Community Slack]
I have two different data centers, and I want the app to write simultaneously to the database in both. Both instances, in the primary and secondary data centers, should be active, accept writes, and replicate synchronously or asynchronously. However, ACID properties should be maintained so that data is read consistently at both sites. The database in the primary should have all the data that the secondary has, and vice versa. The latency between the data centers is 40ms.
Option 1: Use a single, multi-region YugabyteDB cluster stretched across the data centers. YugabyteDB uses synchronous replication within a single cluster; this relies on a quorum (consensus) protocol.
For this deployment, because of the use of the Raft protocol, an odd number of data centers, typically 3, is recommended so that you can tolerate a data center failure and still be active. Disadvantage: this deployment will generally have higher latency for write operations. The further apart the DCs are, the higher the latency will be.
At the tablet level, leaders have to coordinate the writes, no matter which node or DC the request comes from: the request first has to be routed to the leader of the tablet. This could add 40ms if the leader for the shard is in DC1 but your write request originates in DC2 or DC3 (i.e., the app is running there). You will not have this penalty if you are primarily writing from one DC and you pick that DC as the preferred zone so that it holds all the leaders by default.
On top of that, the number of network round-trips depends on whether the operation is a fast-path (single-shard) transaction, e.g., a simple single-row INSERT, which adds roughly another 40ms, or a distributed transaction, e.g., a multi-row INSERT or an INSERT into a table with one or more secondary indexes, which involves about two network round-trips, so closer to 80-100ms.
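As a rough sanity check, the latency arithmetic above can be put into a small sketch. The 40ms round trip, the extra hop to a remote tablet leader, and the roughly two round trips for a distributed transaction come from the description above; everything else is a simplification, not a model of YugabyteDB internals.

```java
// Back-of-the-envelope write-latency estimate for the stretched (option 1) cluster,
// using only the round-trip numbers quoted above.
public final class StretchedClusterLatency {

    static final double INTER_DC_RTT_MS = 40.0;       // stated inter-DC latency

    /**
     * @param leaderIsRemote true if the tablet leader lives in a different DC than the app
     * @param distributedTxn true for multi-row writes or writes touching secondary indexes
     */
    static double estimateWriteMs(boolean leaderIsRemote, boolean distributedTxn) {
        double total = 0.0;
        if (leaderIsRemote) {
            total += INTER_DC_RTT_MS;                  // routing the request to the leader
        }
        // Consensus replication: a fast-path write costs about one round trip,
        // a distributed transaction roughly two.
        total += (distributedTxn ? 2 : 1) * INTER_DC_RTT_MS;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(estimateWriteMs(false, false)); // ~40ms: local leader, single-shard write
        System.out.println(estimateWriteMs(true, true));   // ~120ms: remote leader, distributed txn
    }
}
```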
Option 2: Use two YugabyteDB clusters, one in DC1 and one in DC2, each with RF=3 and asynchronously replicating to each other. Both clusters can take writes, and write latency stays at intra-DC levels (so it will be much faster than option 1). However, with async replication you will not have immediate consistency at both sites, so no full ACID guarantees. Furthermore, if you are taking writes on both sites and replicating asynchronously in both directions, care has to be exercised. If the two clusters touch unrelated sets of keys/records, it is less of an issue. If they update the same records, the semantics are simply "latest writer wins", which is not ACID. In short, bidirectional async replication should be used carefully and has these caveats, as you can imagine, due to the very nature of async replication.

What is the main difference between the new PNCounter and IAtomicLong in Hazelcast?

I have trouble understanding the new features of Hazelcast 5.0. What is the main difference between these data structures? I understand that a PNCounter is a counter that replicates data and that, once there are no more updates, the replicas converge. I want to understand how Hazelcast's PNCounter controls concurrency.
Is the maximum replica count related to the maximum number of nodes running in the Hazelcast cluster?
How does it work internally? I'm working with a counter that tracks the activity of several clients; we create 1000 or more PNCounters for different activities because I don't know whether a single PNCounter would cope.
Does the client know which counter it needs to connect to, or does the counter follow a certain logic? I don't understand this feature, and I really want to know the difference between PNCounter and IAtomicLong.
To me it looks like an IAtomicLong with the added feature that it can replicate.
It all boils down to the CAP Theorem.
In summary, out of Consistency, Availability, and Partition Tolerance, you can pick 2 out of 3. And since Hazelcast is distributed by nature, your choice is between Consistency and Availability.
IAtomicLong is a member of the CP Subsystem API.
-- https://docs.hazelcast.com/imdg/4.2/data-structures/iatomiclong
A Conflict-free Replicated Data Type (CRDT) is a distributed data structure that achieves high availability by relaxing consistency constraints. There may be several replicas for the same data and these replicas can be modified concurrently without coordination. This means that you may achieve high throughput and low latency when updating a CRDT data structure. On the other hand, all of the updates are replicated asynchronously.
-- https://docs.hazelcast.com/imdg/4.2/data-structures/pn-counter
In summary, IAtomicLong sacrifices Availability for Consistency. The result will always be correct, but it might not always be available.
PNCounter makes the opposite trade-off. It's always available (depending on the number of nodes of the cluster, of course) but it's eventually consistent as it's asynchronously replicated.
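To make the difference concrete, here is a small member-side sketch using the Hazelcast Java API. The counter names are arbitrary, and the CP Subsystem has to be enabled (at least 3 CP members) for IAtomicLong to give its linearizable guarantees.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.IAtomicLong;
import com.hazelcast.crdt.pncounter.PNCounter;

public class CounterComparison {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // CP data structure: goes through the CP Subsystem (Raft-based).
        // Linearizable, but unavailable if the CP group loses its majority.
        IAtomicLong strongCounter = hz.getCPSubsystem().getAtomicLong("page-views-strong");
        strongCounter.incrementAndGet();

        // CRDT: updates are applied locally and replicated asynchronously.
        // Writable as long as a replica member is reachable, but a read may
        // briefly lag behind increments made on other members.
        PNCounter eventualCounter = hz.getPNCounter("page-views-eventual");
        eventualCounter.addAndGet(5);
        System.out.println(eventualCounter.get());

        hz.shutdown();
    }
}
```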

Maintaining a dynamic consistency level in DataStax

I have a 5-node cluster and a keyspace with a replication factor of 3. The nature of the operations is such that writes are much more important than reads, but the frequency of read operations is about 10 times higher than that of writes. To achieve consistency while improving overall performance, I chose to set the consistency level for writes to ALL and for reads to ONE. But this causes operations to fail if even one node is down.
Is there a method by which I can simultaneously change the consistency levels for (write, read) from (ALL, ONE) to (QUORUM, QUORUM) if one node is detected down, or if there is a query-execution exception? And can this be done in a manner such that no operation passes through a temporary phase where it sees a (QUORUM, ONE) setting?
We also plan to expand to twice the capacity: 3 datacenters with 4 nodes each. Is it possible to define custom consistency levels, like a level of ALL in any one datacenter and ONE in the others? I'm thinking that a level of EACH_ONE for reads, coupled with the above level for writes, would ensure consistency but allow the cluster to remain available even if a node goes down.
The flexibility is there, since you can set your consistency level on a per-request basis. Depending on the client you are using, there are some nice capabilities. For example, the Java driver has something called a DowngradingConsistencyRetryPolicy: if a request fails, it is retried at the next lower consistency level until the request succeeds. This pushes the complexity of retrying into the client so you don't have to write a bunch of code for it; it's really nice!
The java driver also allows you to configure consistency level per request with Statement#setConsistencyLevel()
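For example, with the DataStax Java driver 3.x (the API referred to above), both pieces look roughly like this; the contact point, keyspace, and table are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

public class TunableConsistencyExample {
    public static void main(String[] args) {
        // Retry failed requests at progressively lower consistency levels.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")                            // placeholder contact point
                .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                .build();
        Session session = cluster.connect("my_keyspace");                // placeholder keyspace

        // Consistency level chosen per request.
        Statement write = new SimpleStatement(
                "INSERT INTO events (id, payload) VALUES (uuid(), 'x')") // placeholder table
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(write);

        Statement read = new SimpleStatement("SELECT * FROM events LIMIT 10")
                .setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(read);

        cluster.close();
    }
}
```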
As far as custom consistency levels go, this is not an option available to you (without changing the Cassandra source code); however, I think what is available should be sufficient.
For reads, I don't find much value in ensuring consistency between data centers on read. I think LOCAL_QUORUM is more than sufficient, but if you really care, you can use something like EACH_QUORUM to ensure all data centers agree. That will severely impact your response time and availability, though. For example, if one of your data centers goes down completely, you won't be able to do reads at all (unless you downgrade).
For writes, I'd strongly recommend not using ALL in a multi-datacenter setup if you care about response time and availability. Depending on your requirements, LOCAL_QUORUM will likely be more than sufficient.
While one of the benefits of Cassandra is that consistency is tunable, and you can have consistency as strong as you like, keep in mind that Cassandra is at its best as a highly available, partition-tolerant system.
A really good presentation on consistency that I think nails a lot of these points is Christos Kalantzis' talk 'Eventual Consistency != Hopeful Consistency', which suggests that a consistency level of ONE is sufficient for a lot of use cases.

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both reads and writes are set to QUORUM, I am guaranteed that the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partition tolerance.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read / write on lower consistency levels granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees that the latest "written" data is retrieved. However, if a coordinator doesn't know that the required replicas are down (i.e., gossip hasn't propagated after 2 of 3 nodes fail), it will still issue a write to 3 replicas (assuming quorum consistency and a replication factor of 3). The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure; it's a write "in progress". Now let's say the down nodes come back up. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). The coordinator gets a digest from the third node, and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data, because the write was "in progress" but not "complete". This is a very small, rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually enough to get around this problem; however, in case nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem... From the theorem's point of view, the C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation, the C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. Looking through all of the available consistency configurations, for QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
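As a quick worked check of that rule (a quorum of RF replicas is floor(RF/2) + 1):

```java
// Checks the "strong read consistency" rule quoted above: a read quorum and a
// write quorum must overlap, i.e. nodes_written + nodes_read > replication_factor.
public class QuorumCheck {

    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;                     // RF=3 -> 2
    }

    static boolean stronglyConsistent(int nodesWritten, int nodesRead, int replicationFactor) {
        return nodesWritten + nodesRead > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);                                   // 2
        System.out.println(stronglyConsistent(q, q, rf));     // true:  2 + 2 > 3
        System.out.println(stronglyConsistent(1, 1, rf));     // false: ONE/ONE may miss the latest write
    }
}
```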
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.

Understanding Cassandra replication factor versus consistency level

I want to clarify some very basic concepts of replication factor and consistency level in Cassandra. I would highly appreciate it if someone could provide answers to the questions below.
RF- Replication Factor
RC- Read Consistency
WC- Write Consistency
2 Cassandra nodes (e.g., A, B), RF=1, RC=ONE, WC=ONE or ANY
Can I write data to node A and read from node B?
What will happen if A goes down?
3 Cassandra nodes (e.g., A, B, C), RF=2, RC=QUORUM, WC=QUORUM
Can I write data to node A and read from node C?
What will happen if node A goes down?
3 Cassandra nodes (e.g., A, B, C), RF=3, RC=QUORUM, WC=QUORUM
Can I write data to node A and read from node C?
What will happen if node A goes down?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client will receive a success acknowledgement after the write is done on one node, without waiting for the second write. If you write with a CL of ALL, the acknowledgement to the client waits until both copies are written. There are many other consistency level options, too many to cover all the variants here. Read the DataStax docs, though; they do a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes, with WC=ONE, if the node holding the single replica is up and is the one you're connected to, the write will succeed. If the replica belongs to the other node, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up. (I think you also have to have hinted handoff enabled for that.) The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, the replication factor will affect how many copies are eventually written, but it doesn't affect client behavior beyond what I've described above. QUORUM will affect client behavior in that you need a sufficient number of replica nodes up and responding for writes and reads. If at least (replication_factor / 2) + 1 of the replicas you need are up, then writes and reads will succeed. If you don't have enough replica nodes up, reads and writes will fail. Overall, some QUORUM reads and writes can still succeed when a node is down, as long as that node either holds no replica of your data or its outage still leaves enough replica nodes available.
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.
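A minimal sketch of the arithmetic behind such a calculator, for a single data center; only the quorum size and the R + W > RF overlap rule come from the discussion above, the rest is a simplifying assumption:

```java
// Rough re-creation of what the linked calculator reports for one datacenter.
public class ConsistencyCalculator {

    static int level(String cl, int rf) {
        switch (cl) {
            case "ONE":    return 1;
            case "QUORUM": return rf / 2 + 1;
            case "ALL":    return rf;
            default: throw new IllegalArgumentException("unsupported level: " + cl);
        }
    }

    public static void main(String[] args) {
        int nodes = 2, rf = 1;
        int read = level("ONE", rf), write = level("ONE", rf);

        System.out.println("Reads consistent:        " + (read + write > rf));          // true
        System.out.println("Node losses survivable:  " + (rf - Math.max(read, write))); // 0
        System.out.println("Nodes read each time:    " + read);                         // 1
        System.out.println("Nodes written each time: " + write);                        // 1
        System.out.println("Data held per node:      " + (100 * rf / nodes) + "%");     // 50%
    }
}
```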
