Read/Write Strategy For Consistency Level - Cassandra

Based on Read Operation in Cassandra at Consistency level of Quorum?, there are three ways to achieve read consistency:
a. WRITE ALL + READ ONE
b. WRITE ONE + READ ALL
c. WRITE QUORUM + READ QUORUM
For a given piece of data, the write usually happens once, but reads happen many times.
Keeping read consistency in mind, is it possible to merge cases a and b?
That is: WRITE ONE -> READ ONE -> if not found -> READ ALL.
Would this approach usually let both the write and the read touch only a single replica?
A READ ALL would only be needed the first time, on a node that does not yet have the data.
Is my understanding correct?
Wilian, thanks for the thorough explanation. I think I need to describe my use case, as below.
I implemented a timeline that users can post to, and users can follow posts that interest them, so notifications are sent to the followers. To save bandwidth, users write/read posts at CL ONE; eventually, users can always read a post after a while thanks to read repair. Followers receive a notification whenever a comment is added to a post they listen to.
Here is my question. I must make sure followers can read a comment if the notification for it was delivered to them. So I intend to use CL ONE to check whether the comment has been synced to the queried node. If there is no result, I retry at CL ALL to sync the comment. Other followers querying that node then don't need to sync from other nodes, since the CL ALL was already done, which saves bandwidth and lowers server overhead.
So as for your final scenario: I don't care whether the value is old or the latest, because the data is synced according to notifications. I need to ensure users can get the comment if the notification was delivered to followers.

From the answer to the linked question, Carlo Bertuccini wrote:
What guarantees consistency is the following disequation
(WRITE CL + READ CL) > REPLICATION FACTOR
The cases A, B, and C in this question appear to refer to the three minimal ways of satisfying that disequation, as given in the same answer.
Case A
WRITE ALL will send the data to all replicas. If your replication factor (RF) is three (3), then WRITE ALL writes three copies before reporting a successful write to the client. But you can't see that the write occurred until the next read of the same data key. Minimally, READ ONE will read from a single one of the aforementioned replicas, and satisfies the necessary condition: WRITE(3) + READ(1) > RF(3)
Case B
WRITE ONE will send the data to only a single replica. In this case, the only way to get a consistent read is to read from all of them. The coordinator node will gather all of the answers, figure out which one is the most recent, and then send a "hint" to the out-of-date replicas, informing them that there's a newer value. The hint is applied asynchronously, but the READ ALL itself satisfies the necessary condition: WRITE(1) + READ(3) > RF(3)
Case C
QUORUM operations must involve FLOOR(RF / 2) + 1 replicas. In our RF=3 example, that is FLOOR(3 / 2) + 1 == 1 + 1 == 2. Again, consistency depends on both the reads and the writes. In the simplest case, the read operation talks to exactly the same replicas that the write operation used, but that's never guaranteed. In the general case, the coordinator node doing the read will talk to at least one of the replicas used by the write, so it will see the newer value. In that case, much like the READ ALL case, the coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas. Of course, this also satisfies the necessary condition: WRITE(2) + READ(2) > RF(3)
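As a concrete illustration, here is a minimal sketch using the DataStax Python driver showing how each case maps onto per-statement consistency levels (the contact point, keyspace, and posts table are assumptions made for the example):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # assumed contact point
session = cluster.connect("demo")  # hypothetical keyspace with RF=3

def write(cl):
    return SimpleStatement(
        "INSERT INTO posts (id, body) VALUES (%s, %s)", consistency_level=cl)

def read(cl):
    return SimpleStatement(
        "SELECT body FROM posts WHERE id = %s", consistency_level=cl)

# Case A: WRITE(3) + READ(1) > RF(3)
session.execute(write(ConsistencyLevel.ALL), (1, "hello"))
session.execute(read(ConsistencyLevel.ONE), (1,))

# Case B: WRITE(1) + READ(3) > RF(3)
session.execute(write(ConsistencyLevel.ONE), (2, "world"))
session.execute(read(ConsistencyLevel.ALL), (2,))

# Case C: WRITE(2) + READ(2) > RF(3)
session.execute(write(ConsistencyLevel.QUORUM), (3, "!"))
session.execute(read(ConsistencyLevel.QUORUM), (3,))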
So to the OP's question...
Is it possible to "merge" cases A and B?
To ensure consistency it is only possible to "merge" if you mean WRITE ALL + READ ALL because you can always increase the number of readers or writers in the above cases.
However, WRITE ONE + READ ONE is not a good idea if you need to read consistent data, so my answer is: no. Again, using that disequation and our example RF=3: WRITE(1) + READ(1) > RF(3) does not hold. If you were to use this configuration, receiving an answer that there is no value cannot be trusted -- it simply means that the one replica contacted to do the read did not have a value. But values might exist on one or more of the other replicas.
So from that logic, it might seem that doing a READ ALL on receiving a no value answer would solve the problem. And it would for that use case, but there's another to consider: what if you get some value back from the READ ALL... how do you know that the value returned is "the latest" one? That's what's meant when we want consistency. If you care about reading the most recent write, then you need to satisfy the disequation.
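A minimal sketch of that READ ONE with READ ALL fallback (Python driver; the comments table and column names are hypothetical). As noted above, a value returned this way is only guaranteed to exist somewhere, not to be the latest:

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

def read_with_fallback(session, comment_id):
    """Try a cheap CL ONE read first; escalate to CL ALL only on a miss."""
    fast = SimpleStatement(
        "SELECT body FROM comments WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE)
    row = session.execute(fast, (comment_id,)).one()
    if row is not None:
        return row  # the contacted replica had *a* value (not necessarily the latest)

    # Miss: the one replica we asked may simply not have the data yet.
    slow = SimpleStatement(
        "SELECT body FROM comments WHERE id = %s",
        consistency_level=ConsistencyLevel.ALL)
    return session.execute(slow, (comment_id,)).one()  # None only if no replica has it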
Regarding the use case of "timeline" notifications in the edited question
If my understanding of your described scenario is correct, these are the main points to your use case:
Most (but not all?) timeline entries will be write-once (not modified later)
Any such entry can be followed (there is a list of followers)
Any such entry can be commented upon (there is a list of comments)
Any comment on a timeline entry should trigger a notification to the list of followers for that timeline entry
Trying to minimize cost (in this case, measured as bandwidth) for the "normal" case
Willing to rely on the anti-entropy features built into Cassandra (e.g. read repair)
I need to ensure users can get the comment if the notification was delivered to followers.
Since most of your entries are write-once, and you care more about the existence of an entry and not necessarily the latest content for the entry, you might be able to get away with WRITE ONE + READ ONE with a fallback to READ ALL if you get no record for something that had some other indication it should exist (e.g. from a notification). For the timeline entry content, it does not sound like your case depends on consistency of the user content of the timeline entries.
If you don't care about consistency, then this discussion is moot; read/write with whatever Consistency Level and let Cassandra's asynchronous replication and anti-entropy features do their work. That said, though your goal is minimizing network traffic/cost, if your workload is mostly reads then the added cost of doing writes at CL QUORUM or ALL may not actually be that much.
You also said:
Followers receive a notification whenever a comment is added to a post they listen to.
This statement implies that you care not only about whether the set of followers exists, but also about its contents (which users are following). You have not detailed how you are storing/tracking the followers, but unless you ensure the consistency of this data, it is possible that one or more followers are not notified of a new comment because you retrieved an out-of-date version of the follower list. Or someone who "unfollowed" a post could still receive notifications, for the same reason.
Cassandra is very flexible and allows each discrete read and write operation to use different consistency levels. Take advantage of this and ensure strong consistency where it is needed and relax it where you are sure that "reading the latest write" is not important to your application's logic and function.
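As one illustration of that flexibility (Python driver; the comments and followers tables are hypothetical), you might keep the follower list strongly consistent while leaving comment writes cheap:

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Cheap write for the comment itself (fallback read path shown earlier).
write_comment = SimpleStatement(
    "INSERT INTO comments (post_id, comment_id, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)

# Strongly consistent read of the follower list before notifying, so an
# out-of-date list doesn't skip (or wrongly include) followers. For this
# to hold, follower-list writes must also be done at QUORUM.
read_followers = SimpleStatement(
    "SELECT follower FROM followers WHERE post_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)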

Related

Strong Consistency in Cassandra

According to the DataStax article, strong consistency can be guaranteed
if R + W > N
where
R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas
What does strong consistency mean here? Does it mean that 'every time' a query's response is given from the database, the response will 'always' be the last updated value? If the conditions for strong consistency are maintained in Cassandra, are there no scenarios where the data returned might be inconsistent? In short, does strong consistency mean 100% consistency?
Edit 1
Adding some additional material regarding some scenarios where Cassandra might not be consistent even when R+W>RF
Write fails with Quorum CL
Cassandra's eventual consistency
Cassandra has tunable consistency with some tradeoffs you can choose.
R + W > N - this simply means that the read set and the write set must overlap in at least one node that has the actual, newest data, for the round trip to be consistent.
For example, if you write at CL.ONE you will need to read at CL.ALL to be sure of getting a consistent result: 1 + N > N. But you might not want CL.ALL, as you then cannot tolerate a single node failure in your cluster.
Often you can choose CL.QUORUM at read and write time to ensure consistency and tolerate node failures. For example at RF=3 a QUORUM needs (3/2)+1=2 nodes available, so R+W>N will be 4>3 - your requests are consistent AND you can tolerate a single node failure.
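As a quick sanity check, a few lines of Python (illustrative only) capture that arithmetic:

def quorum(rf: int) -> int:
    """Replicas a QUORUM operation must involve: floor(RF / 2) + 1."""
    return rf // 2 + 1

def is_strongly_consistent(w: int, r: int, rf: int) -> bool:
    """R + W > RF guarantees the read set overlaps the write set."""
    return r + w > rf

rf = 3
q = quorum(rf)                             # 2
print(is_strongly_consistent(q, q, rf))    # True:  2 + 2 = 4 > 3
print(is_strongly_consistent(1, 1, rf))    # False: 1 + 1 = 2 <= 3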
One thing to keep in mind: it is really important to have tightly synchronized clocks on all your nodes (Cassandra and application), so you will want to have NTP up and running.
While this is an old question, I thought I would chip in to set the record straight.
R+W>RF does not imply strong consistency
A system with R+W > RF will only be eventually consistent. The claimed strong consistency guarantee breaks during node failures or in between writes. For example, consider the following scenario:
Assume that there are 3 nodes A, B, C with RF=3, W=3, R=2 (hence R+W = 5 > 3 = RF).
Further assume key k is associated with value v, i.e. (k,v) is stored in the database. Suppose the following series of actions occurs:
t=1: a user sends a (k,v1) write request to A, B, C
t=2: (k,v1) reaches A and is written to the store at A
t=3: Reader 1 sends a read request for key k, which is answered by A and B
t=4: Reader 1 receives the response (k,v1) - by the latest-write-wins rule
t=5: Reader 1 sends another read request, which gets served by nodes B and C
t=6: Reader 1 receives the response (k,v), an older value: an INCONSISTENCY
t=7: (k,v1) reaches C and is written to the store at C
t=8: (k,v1) reaches B and is written to the store at B
This demonstrates that W+R > RF cannot guarantee strong consistency. To ensure strong consistency you might want to use another algorithm, such as Paxos or Raft, that can help ensure the writes are atomic. You can read an interesting article on the topic here (do check out the FAQ section).
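The interleaving can be reproduced with a toy in-memory model (pure Python, no Cassandra involved; timestamps stand in for write times, and the model deliberately ignores read repair, which the Edit below addresses):

# Toy model: each replica is a dict mapping key -> (value, timestamp).
replicas = {
    "A": {"k": ("v", 0)},
    "B": {"k": ("v", 0)},
    "C": {"k": ("v", 0)},
}

def write(nodes, key, value, ts):
    for n in nodes:
        replicas[n][key] = (value, ts)

def read(nodes, key):
    # Coordinator picks the candidate with the highest timestamp (last write wins).
    return max((replicas[n][key] for n in nodes), key=lambda vt: vt[1])[0]

write(["A"], "k", "v1", ts=1)       # t=2: the write has reached only A so far
print(read(["A", "B"], "k"))        # t=4: returns "v1" (A's copy wins)
print(read(["B", "C"], "k"))        # t=6: returns "v" -- older value, inconsistency
write(["B", "C"], "k", "v1", ts=1)  # t=7/8: the write finally lands on B and C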
Edit:
Cassandra does have an internal mechanism (called blocking read repair) that triggers synchronous writes before the response from the db is sent back to the client. This kind of synchronous read repair occurs in case of inconsistencies amongst the nodes queried to achieve the read consistency level, and it ensures something known as Monotonic Read Consistency [see below for definitions]. It causes the (k,v1) in the above example to be written to node B before the response to the first read request is returned, so the second read request would also see an updated value. (Thanks to @Nadav Har'El for pointing this out.)
However, this still does not guarantee strong consistency. Below are some definitions to clear things up:
Sequential/Strong Consistency: the result of any execution is the same as if the reads and writes occur in some order, and the operations of each individual processor appear in this sequence in the order specified by its program [as defined by Leslie Lamport]
Monotonic Read Consistency: once you read a value, all subsequent reads will return this value or a newer version
Sequential consistency would require the client program/reader to see the latest value that was written since the write statement is executed before the read statement in the sequence of program instructions.
For both reads and writes, the consistency levels of ANY , ONE , TWO , and THREE are considered weak, whereas QUORUM and ALL are considered strong.
Yes. If R + W is greater than the number of replicas, then you will always get consistent data: 100% consistency. But you will have to trade away availability to achieve that higher consistency.
Cassandra has the concept of tunable consistency (set consistency on a per-query basis).
I would actually regard this "strong consistency" as strong read consistency. It is also session-scoped, a.k.a. Monotonic Read Consistency (refer to @NadavHar'El's answer).
But it is not sequential consistency, as Cassandra doesn't fully support locks or transactions, nor does it serialize write operations. There are only lightweight transactions, which support local serialization of writes and serialization of reads.
To make things easy to understand, let's say we have three nodes - A, B, C - and we set the read consistency level to 3 (ALL) and the write level to 1 (ONE).
If there is only one client, it writes to any one node - A.
B and C might not be synchronized yet. (Eventually they will be - eventual consistency.)
But when the client reads again, it needs responses from all three nodes; by comparing timestamps, A's record is the latest and wins. This is Monotonic Read Consistency.
However, if two clients try to update the record at the same time, or if they both read the value first and then rewrite it (e.g. increase a column by 100) at the same time:
Client C1 and client C2 both read the current column value as 10, and they both decide to increase it by 100.
C1 only needs to write 110 to one node; client C2 does the same, so the final result on any node can be at most 110.
We have lost 100 in these operations (a lost update). This is caused by a race condition, and it has to be fixed by serializing the operations with some form of locking, just as SQL databases implement transactions.
I know Cassandra now has counter columns, which might solve this particular case, but counters are still limited compared to full transactions. And Cassandra is also not supposed to be transactional, as it is a NoSQL database that sacrifices consistency for availability.
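Here is a minimal sketch of fixing the lost update with the lightweight transactions mentioned above (Python driver; the counters_demo table and column names are hypothetical), at the cost of a Paxos round per update:

def add_100(session, row_id):
    """Read-modify-write guarded by an IF clause (compare-and-set)."""
    while True:
        current = session.execute(
            "SELECT value FROM counters_demo WHERE id = %s", (row_id,)).one().value
        result = session.execute(
            "UPDATE counters_demo SET value = %s WHERE id = %s IF value = %s",
            (current + 100, row_id, current))
        if result.was_applied:   # another client raced us if this is False
            return current + 100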

Dealing with eventual consistency in Cassandra

I have a 3 node cassandra cluster with RF=2. The read consistency level, call it CL, is set to 1.
I understand that whenever CL=1, a read repair would happen when a read is performed against Cassandra, if it returns inconsistent data. I like the idea of having CL=1 instead of setting it to 2, because then even if a node goes down, my system would run fine. Thinking in terms of the CAP theorem, I like my system to be AP instead of CP.
The read requests are infrequent (more like 2-3 per second), but are very important to the business. They are performed against log-like data (which is immutable, and hence never updated). My temporary fix for this is to run the query more than once, say 3 times, instead of running it once. This way, I can be sure that even if I don't get my data in the first read request, the system would trigger read repairs, and I would eventually get my data during the 2nd or 3rd read request. Of course, these 3 queries happen one after the other, without any blocking.
Is there any way that I can direct Cassandra to perform read repairs in the background without having the need to actually perform a read request in order to trigger a repair?
Basically, I am looking for ways to tune my system in a way as to circumvent the 'eventual consistency' model, by which my reads would have a high probability of succeeding.
Help would be greatly appreciated.
reads would have a high probability of succeeding
Look at DowngradingConsistencyRetryPolicy. This policy retries queries with a lower CL than the initial one. With this policy your queries will have strong consistency when all nodes are available, and you will not lose availability if some node fails.
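A minimal sketch of wiring that policy into the DataStax Python driver (the policy also exists in the Java driver; note it has been deprecated in recent driver versions, so treat this as illustrative):

from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Queries start at QUORUM; on unavailability the policy retries at a lower CL.
cluster = Cluster(
    ["127.0.0.1"],  # contact point is an assumption
    default_retry_policy=DowngradingConsistencyRetryPolicy())
session = cluster.connect("demo")  # hypothetical keyspace

query = SimpleStatement(
    "SELECT * FROM events WHERE id = %s",  # hypothetical table
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(query, (42,)).one()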

Does CQL3 "IF" make my update not idempotent?

It seems to me that using IF would make the statement possibly fail if re-tried. Therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
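As a sketch of that idempotent-ledger alternative (CQL via the Python driver; the schema is hypothetical, and the SUM aggregate requires Cassandra 2.2+), each transaction is its own clustered row and the balance is derived:

import uuid

# Hypothetical schema:
#   CREATE TABLE ledger (
#       account text, tx_id uuid, amount double,
#       PRIMARY KEY (account, tx_id));

def record_withdrawal(session, account, amount, tx_id=None):
    """Idempotent: retrying with the same tx_id rewrites the same row."""
    tx_id = tx_id or uuid.uuid4()   # generate once, reuse on retry
    session.execute(
        "INSERT INTO ledger (account, tx_id, amount) VALUES (%s, %s, %s)",
        (account, tx_id, -amount))
    return tx_id

def balance(session, account):
    # Aggregating in CQL; for large ledgers you'd roll up periodically instead.
    return session.execute(
        "SELECT SUM(amount) AS bal FROM ledger WHERE account = %s",
        (account,)).one().bal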
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
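In driver terms, that result surfaces like this (Python driver sketch against the question's users table):

result = session.execute(
    "UPDATE users SET name = 'foo', version = 4 WHERE userid = 1 IF version = 3")
if result.was_applied:
    print("applied=True: we performed the update")
else:
    # On failure the row echoes the current value of the IF column(s).
    print("applied=False, current version:", result.one().version)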
This question is more about linearizability (ordering) than idempotency, I think. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical, then the query can be retried many times without a change in the results. This provides a weak form of ordering (and is expensive), unlike most Cassandra writes. Generally you should only use CAS operations if you are attempting to record the state of a system (rather than a history or log).
Do not use many of these queries if you can help it; the guidelines suggest having only a small percentage of your queries rely on this behavior.

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read / write on lower consistency levels granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees that the latest "written" data is retrieved. However, if a coordinator doesn't know that required partitions are down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas (assuming quorum consistency on a replication factor of 3). The one live node will write; the other two won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure; it's a write "in progress". Let's say the downed nodes come back up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). The coordinator gets a digest from the third node and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data, as the write was "in progress" but not "complete". This is a very small, rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually enough to get around this problem; however, in case nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem ... from the theorem's point of view, the C says that
all nodes see the same data at the same time
Which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation the C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. In looking through all of the available consistency configurations, For QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.

What ConsistencyLevel to use with Cassandra counter tables?

I have a table counting around 1000 page views per second. What read and write ConsistencyLevel should I use with it? I am using the Cassandra Thrift client.
Carlo has more or less the right idea. But you have to balance it with your use case.
I work in the game industry, and we use Cassandra for player data. It is quite heavily bound by the read-modify-write pattern, which is not the strong suit of Cassandra. But we also have some functionality that is write-heavy (thousands of writes for a few reads a day).
This is my opinion, based upon experience, of how you should use the consistency levels.
Write + Read at QUORUM means that before returning for both operations it will wait for a majority of nodes in the cluster to confirm the operation. It is the solution I use when Read and Writes are roughly at the same frequency. (Player data blob)
Write One + Read All is useful for something very write-heavy. We use this for high scores, for example (written often, read every 5 minutes to regenerate the high-score table of the whole game).
You could use Write Any if you do not care about the data that much (non-critical logs come to mind).
The only use case I could come up with for Write All + Read One would be messaging or feeds with periodic checks for updates. Chats and messaging seem a good fit for that, since Cassandra does not have subscription/push functionality.
Write & Read ALL is a bad setup. It is a waste of resources, as you will get the same consistency as with one of the three setups I mentioned above.
A final note about Write ANY vs. Write ONE: ANY only confirms that something in the cluster has received the mutation, while ONE confirms that it has been applied by at least one node. ANY is not safe, as it could return without error even if all the nodes responsible for that mutation are down, or under any other condition that could make the mutation fail after reception. It is also slightly quicker, which is its only advantage (I only use it as an async dump for logs that are not critical), but do not trust the response 100%.
A good reference to study this subject about cassandra is http://www.datastax.com/docs/1.2/dml/data_consistency
If you want every read to always be consistent, the rule is
(write consistency level + read consistency level) > replication factor.
So you could
Write All + Read All (worst solution)
Write One + Read All (second-worst solution)
Write All + Read One (probably faster solution)
Write Quorum + Read Quorum (imho, best solution)
Keep in mind that if one of the RF replicas is down during the r/w operation, the operation will fail, so I'd avoid CL ALL.
Regards, Carlo
Based on their documentation (https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_counters_c.html), consistency level ONE is recommended. I guess some sort of merging is used to resolve conflicts for counter columns, instead of the usual last-write-wins. That's likely why directly setting a value is not allowed.
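For reference, a minimal counter-table sketch (Python driver; the page_views table is hypothetical) showing the increment-only interface at CL ONE:

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Hypothetical schema:
#   CREATE TABLE page_views (page_id text PRIMARY KEY, views counter);

# Counters can only be incremented/decremented, never set directly.
incr = SimpleStatement(
    "UPDATE page_views SET views = views + 1 WHERE page_id = %s",
    consistency_level=ConsistencyLevel.ONE)

def record_view(session, page_id):
    session.execute(incr, (page_id,))

def get_views(session, page_id):
    row = session.execute(
        "SELECT views FROM page_views WHERE page_id = %s", (page_id,)).one()
    return row.views if row else 0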
