Why does VoltDB allow dirty reads when a network partition happens? - voltdb

I am a PhD student at Seoul National University. My name is Seokwon Choi. I was impressed by your research paper (an analysis of network partition faults), and I hope to present this paper with my lab members at our lab seminar.
However, after reading your research paper and your presentation slides, I have one question.
Why does the read operation return the value Y in VoltDB? The replication fails, so the write fails. Why does it update the value Y in local storage,
and why does the read operation return the value Y that was updated locally?
I think the read operation should return the committed value (the value written successfully: in this case, value X).
I tried to find this in the VoltDB documentation. It seems dirty reads can be allowed in VoltDB. Why are dirty reads allowed when a network partition happens in VoltDB?
Is there any reason it works like this?
I attached a picture of the dirty read during the network partition.
Thank you
Best Regards
From Seokwon Choi

VoltDB does not allow dirty reads. In your picture, you show a 3-node cluster where 1 node gets partitioned from the other 2 and the single node is a partition master.
Event1: Network partition
Event2: Write to minority (and you show that the write fails, which is correct)
Event3: Read from minority (and you show a dirty read, which is incorrect).
Event 3 is not possible. The single node that gets partitioned from the other two will shut down its client interface and then crash, never allowing event 3 to happen.
We ran Jepsen tests several years ago and fixed a defect in V6.4 that in some circumstances would allow that dirty read from event#3. See https://aphyr.com/posts/331-jepsen-voltdb-6-3 and https://www.voltdb.com/blog/2016/07/12/voltdb-6-4-passes-official-jepsen-testing/ for the full details on the guarantees from VoltDB, the Jepsen testing we did, and the defects that were fixed in order to pass the test.
Disclosure: I work for VoltDB.

Related

Synch delay when writing to Cassandra with two data centers and LOCAL_QUORUM

The motivation for this question is to follow up on this great post. I finally got some clarification on what is meant by eventual consistency in Cassandra. Andrew Weaver explained what happens with the commit log very clearly. Now I would like to find out what happens when you have two data centers and you use a consistency level of LOCAL_QUORUM. This is the scenario: I just finished writing to data center one, the commit log was flushed/synced to disk, and two other replicas were also synchronized. When would the write to the second data center take place?
I imagine that one of the instances would initiate the syncing process, but there will be a delay. How long is that delay? Also, if data center one goes down before syncing, the data will not be available in data center two. But what happens if the same row is written in data center two? How does the row get reconciled?
The same question applies if data center one goes down in the middle of the syncing process: when would the data be consistent, and are there any gotchas?
Thanks
You have to keep in mind that in your case the replication to the second datacenter is done asynchronously. The delay is generally in the order of milliseconds.
If datacenter one goes down and you then write the same row to the second datacenter, these are treated as two different writes, with different timestamps, and the last write wins (in case of different data, of course). Reconciliation will happen during repair, during read repair, or if you read with a consistency level of ALL.
This is a very good article by Ryan Svihla, explaining the subject with different cases:
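To make this concrete, here is a minimal sketch with the DataStax Java driver (3.x); the contact point, keyspace, and table are hypothetical. The write is acknowledged by a quorum of replicas in the local data center only, while a later read at ALL touches every replica in both data centers and lets read repair reconcile any divergence by timestamp.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalQuorumExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")            // a node in the local data center
                .build();
             Session session = cluster.connect("demo_ks")) {   // hypothetical keyspace

            // Write acknowledged by a quorum of replicas in the LOCAL data center only;
            // replication to the remote data center continues asynchronously.
            SimpleStatement write = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (1, 'alice')");   // hypothetical table
            write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(write);

            // A later read at ALL touches every replica (both data centers) and
            // triggers read repair, reconciling divergent copies by timestamp.
            SimpleStatement read = new SimpleStatement("SELECT name FROM users WHERE id = 1");
            read.setConsistencyLevel(ConsistencyLevel.ALL);
            session.execute(read);
        }
    }
}
```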
https://medium.com/@foundev/cassandra-how-many-nodes-are-talked-to-with-quorum-also-should-i-use-it-98074e75d7d5
I hope this helps.

Dealing with eventual consistency in Cassandra

I have a 3-node Cassandra cluster with RF=2. The read consistency level, call it CL, is set to 1.
I understand that with CL=1, a read repair happens when a read is performed against Cassandra and it returns inconsistent data. I like the idea of having CL=1 instead of setting it to 2, because then even if a node goes down, my system still runs fine. Thinking in terms of the CAP theorem, I would like my system to be AP rather than CP.
The read requests are infrequent (more like 2-3 per second), but they are very important to the business. They are performed against log-like data (which is immutable, and hence never updated). My temporary fix is to run the query more than once, say 3 times, instead of running it once. This way, I can be sure that even if I don't get my data on the first read request, the system will trigger read repairs and I will eventually get my data on the 2nd or 3rd read request. Of course, these 3 queries happen one after the other, without any blocking.
Is there any way that I can direct Cassandra to perform read repairs in the background without having the need to actually perform a read request in order to trigger a repair?
Basically, I am looking for ways to tune my system so as to work around the 'eventual consistency' model, so that my reads have a high probability of succeeding.
Help would be greatly appreciated.
reads would have a high probability of succeeding
Look at DowngradingConsistencyRetryPolicy. This policy allows retrying queries with a lower CL than the initial one. With this policy your queries will have strong consistency when all nodes are available, and you will not lose availability if some node fails.
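A minimal sketch of wiring this up with the DataStax Java driver 3.x (where DowngradingConsistencyRetryPolicy lives in com.datastax.driver.core.policies); the contact point, keyspace, and table are hypothetical. LoggingRetryPolicy is optional and just logs each downgrade.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;
import com.datastax.driver.core.policies.LoggingRetryPolicy;

public class DowngradingReadExample {
    public static void main(String[] args) {
        // The retry policy is set once on the Cluster and applies to every statement
        // that does not override it.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
                .build();
             Session session = cluster.connect("demo_ks")) {           // hypothetical keyspace

            // Ask for QUORUM first; if not enough replicas respond, the policy
            // retries the read at a lower consistency level (e.g. ONE) instead of failing.
            SimpleStatement read = new SimpleStatement(
                    "SELECT * FROM events WHERE id = ?", 42);          // hypothetical table
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);
        }
    }
}
```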

Read/Write Strategy For Consistency Level

Based on Read Operation in Cassandra at Consistency level of Quorum?
there are 3 ways to achieve read consistency:
a. WRITE ALL + READ ONE
b. WRITE ONE + READ ALL
c. WRITE QUORUM + READ QUORUM
For a given piece of data, the write operation usually happens once, but read operations happen often.
With read consistency in mind, is it possible to merge a and b?
That is, WRITE ONE -> READ ONE -> if not found -> READ ALL.
Does this approach mean each read/write operation usually happens only once?
A READ ALL would only happen the first time, on a node that does not yet have the data.
So is my understanding correct?
Wilian, thanks for elaborating. I think I need to describe my use case, as below. I implemented a timeline that users can post to, and users can follow the posts that interest them, so notifications are sent to the followers. To save bandwidth, users write/read posts at CL ONE. Eventually, users can always read the post after a while thanks to read repair. Followers receive a notification of comments added to a post if they follow that post. Here is my question: I must make sure followers can read the comments if the notification was delivered to them. So I intend to use CL ONE to check whether the comment was synced to the queried node; if there is no result, I try CL ALL to sync the comment. Then other followers querying that node don't need to sync from the other nodes, since the CL ALL read was already done, which saves bandwidth and lowers server overhead. As for your final scenario, I don't care whether the value is old or the latest, because the data is synced according to the notifications. I need to ensure users can get the comment if the notification was delivered to followers.
From the answer to the linked question, Carlo Bertuccini wrote:
What guarantees consistency is the following disequation
(WRITE CL + READ CL) > REPLICATION FACTOR
The cases A, B, and C in this question appear to be referring to the three minimum ways of satisfying that disequation, as given in the same answer.
Case A
WRITE ALL will send the data to all replicas. If your replication factor (RF) is three(3), then WRITE ALL writes three copies before reporting a successful write to the client. But you can't possibly see that the write occurred until the next read of the same data key. Minimally, READ ONE will read from a single one of the aforementioned replicas, and satisfies the necessary condition: WRITE(3) + READ(1) > RF(3)
Case B
WRITE ONE will send the data to only a single replica. In this case, the only way to get a consistent read is to read from all of them. The coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas, informing them that there's a newer value. The hint occurs asynchronously but only after the READ ALL occurs does it satisfy the necessary condition: WRITE(1) + READ(3) > RF(3)
Case C
QUORUM operations must involve FLOOR(RF / 2) + 1 replicas. In our RF=3 example, that is FLOOR(3 / 2) + 1 == 1 + 1 == 2. Again, consistency depends on both the reads and the writes. In the simplest case, the read operation talks to exactly the same replicas that the write operation used, but that's never guaranteed. In the general case, the coordinator node doing the read will talk to at least one of the replicas used by the write, so it will see the newer value. In that case, much like the READ ALL case, the coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas. Of course, this also satisfies the necessary condition: WRITE(2) + READ(2) > RF(3)
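To make the arithmetic behind the three cases concrete, here is a small, purely illustrative sketch that computes the quorum size and checks the disequation for the RF=3 example above.

```java
public class ConsistencyCheck {
    // Number of replicas involved in a QUORUM operation: FLOOR(RF / 2) + 1.
    static int quorum(int rf) {
        return rf / 2 + 1;   // integer division is the floor for positive values
    }

    // The rule from the answer: a read is guaranteed to see the latest write
    // when (write replicas + read replicas) > replication factor.
    static boolean isStronglyConsistent(int writeReplicas, int readReplicas, int rf) {
        return writeReplicas + readReplicas > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("QUORUM for RF=3: " + quorum(rf));                                   // 2
        System.out.println("Case A (ALL + ONE):      " + isStronglyConsistent(3, 1, rf));       // true
        System.out.println("Case B (ONE + ALL):      " + isStronglyConsistent(1, 3, rf));       // true
        System.out.println("Case C (QUORUM+QUORUM):  " + isStronglyConsistent(2, 2, rf));       // true
        System.out.println("ONE + ONE:               " + isStronglyConsistent(1, 1, rf));       // false
    }
}
```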
So to the OP's question...
Is it possible to "merge" cases A and B?
To ensure consistency, it is only possible to "merge" if you mean WRITE ALL + READ ALL, because you can always increase the number of readers or writers in the above cases.
However, WRITE ONE + READ ONE is not a good idea if you need to read consistent data, so my answer is: no. Again, using that disequation and our example RF=3: WRITE(1) + READ(1) > RF(3) does not hold. If you were to use this configuration, receiving an answer that there is no value cannot be trusted -- it simply means that the one replica contacted to do the read did not have a value. But values might exist on one or more of the other replicas.
So from that logic, it might seem that doing a READ ALL on receiving a no value answer would solve the problem. And it would for that use case, but there's another to consider: what if you get some value back from the READ ALL... how do you know that the value returned is "the latest" one? That's what's meant when we want consistency. If you care about reading the most recent write, then you need to satisfy the disequation.
Regarding the use case of "timeline" notifications in the edited question
If my understanding of your described scenario is correct, these are the main points to your use case:
Most (but not all?) timeline entries will be write-once (not modified later)
Any such entry can be followed (there is a list of followers)
Any such entry can be commented upon (there is a list of comments)
Any comment on a timeline entry should trigger a notification to the list of followers for that timeline entry
Trying to minimize cost (in this case, measured as bandwidth) for the "normal" case
Willing to rely on the anti-entropy features built into Cassandra (e.g. read repair)
I need to ensure users can get the comment if notification was delivered to followers.
Since most of your entries are write-once, and you care more about the existence of an entry and not necessarily the latest content for the entry, you might be able to get away with WRITE ONE + READ ONE with a fallback to READ ALL if you get no record for something that had some other indication it should exist (e.g. from a notification). For the timeline entry content, it does not sound like your case depends on consistency of the user content of the timeline entries.
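A rough sketch of that fallback, assuming the DataStax Java driver and a hypothetical comments table: read at ONE first, and only retry at ALL when nothing comes back.

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class FallbackRead {
    // Read at ONE first (cheap); only if the row is missing, retry at ALL,
    // which also triggers read repair so later ONE reads on any replica will find it.
    static Row readComment(Session session, long commentId) {
        SimpleStatement cheap = new SimpleStatement(
                "SELECT * FROM comments WHERE id = ?", commentId);   // hypothetical table
        cheap.setConsistencyLevel(ConsistencyLevel.ONE);
        Row row = session.execute(cheap).one();
        if (row != null) {
            return row;
        }
        SimpleStatement full = new SimpleStatement(
                "SELECT * FROM comments WHERE id = ?", commentId);
        full.setConsistencyLevel(ConsistencyLevel.ALL);
        return session.execute(full).one();   // may still be null if the row truly does not exist
    }
}
```

Keep in mind that a read at ALL fails if any replica for that row is unavailable, so the fallback trades some availability for the stronger check.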
If you don't care about consistency, then this discussion is moot; read/write with whatever Consistency Level and let Cassandra's asynchronous replication and anti-entropy features do their work. That said, though your goal is minimizing network traffic/cost, if your workload is mostly reads then the added cost of doing writes at CL QUORUM or ALL may not actually be that much.
You also said:
Followers will receive the notification of comments added to post if they listen the post.
This statement implies that you care not only about whether the set of followers exists but also about its contents (which users are following). You have not detailed how you are storing/tracking the followers, but unless you ensure the consistency of this data it is possible that one or more followers are not notified of a new comment because you retrieved an out-of-date version of the follower list. Or, someone who "unfollowed" a post could still receive notifications for the same reason.
Cassandra is very flexible and allows each discrete read and write operation to use different consistency levels. Take advantage of this and ensure strong consistency where it is needed and relax it where you are sure that "reading the latest write" is not important to your application's logic and function.

What ConsistencyLevel to use with Cassandra counter tables?

I have a table counting around 1000 page views per second. What read and write ConsistencyLevel should I use with it? I am using the Cassandra Thrift client.
Carlo has more or less the right idea. But you have to balance it with your use case.
I work in the game industry and we use Cassandra for player data. It is quite heavily bound by the read-modify-write pattern, which is not the strong suit of Cassandra. But we also have some functionality that is write-heavy (thousands of writes for a few reads a day).
This is my opinion, based upon experience, of how you should use the consistency levels.
Write + Read at QUORUM means that, before returning, both operations wait for a majority of the replicas to confirm the operation. It is the solution I use when reads and writes happen at roughly the same frequency. (Player data blob)
Write ONE + Read ALL is useful for something very write-heavy. We use this for high scores, for example (written often, read every 5 minutes to regenerate the high-score table of the whole game).
You could use Write Any if you do not care about the data that much (non critical logs comes to mind).
The only use case I could come up with for Write ALL + Read ONE would be messaging or feeds with periodic checks for updates. Chats and messaging seem a good fit for that, since Cassandra does not have subscription/push functionality.
Write & Read ALL is a bad setup. It is a waste of resources, as you will get the same consistency as with one of the three setups I mentioned above.
A final note about Write ANY vs. Write ONE: ANY only confirms that something in the cluster has received the mutation, but ONE confirms that it has been applied by at least one node. ANY is not safe, as it could return without error even if all the nodes responsible for that mutation are down, or under any other condition that could make the mutation fail after reception. It is also slightly quicker (I only use it as an async dump for logs that are not critical); that is its only advantage, but do not trust the response 100%.
A good reference to study this subject about cassandra is http://www.datastax.com/docs/1.2/dml/data_consistency
If you want to always be consistent on any read, the rule is
(write consistency level + read consistency level) > replication factor.
So you could
Write All + Read All (worst solution)
Write One + Read All (second-worst solution)
Write All + Read One (probably faster solution)
Write Quorum + Read Quorum (imho, best solution)
Keep in mind that if one of the replica nodes is down during the r/w operation, the operation will fail, so I'd avoid CL ALL.
Regards, Carlo
Based on their documentation (https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_counters_c.html), consistency level ONE is recommended. I guess some sort of merging is used to resolve conflicts for counter columns, instead of the usual last-write-wins. That's likely why setting a counter directly to a value is not allowed.
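For illustration, a minimal counter increment issued at ConsistencyLevel.ONE with the DataStax Java driver; the contact point, keyspace, and page_views counter table are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CounterExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo_ks")) {            // hypothetical keyspace

            // Counter columns can only be incremented/decremented, never set directly.
            SimpleStatement bump = new SimpleStatement(
                    "UPDATE page_views SET views = views + 1 WHERE page_id = ?", "home"); // hypothetical table
            bump.setConsistencyLevel(ConsistencyLevel.ONE);   // CL ONE, as the docs recommend
            session.execute(bump);
        }
    }
}
```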

Read-your-own-writes consistency in Cassandra

Read-your-own-writes consistency is a great improvement over so-called eventual consistency: if I change my profile picture I don't care if others see the change a minute later, but it looks weird if after a page reload I still see the old one.
Can this be achieved in Cassandra without having to do a full read-check on more than one node?
Using ConsistencyLevel.QUORUM is fine when reading unspecified data and n>1 nodes are actually being read. However, when a client reads from the same node it writes to (and actually uses the same connection) it can be wasteful: some databases will in this case always ensure that the previously written (my) data is returned, and not some older version. Using ConsistencyLevel.ONE does not ensure this, and assuming it does leads to race conditions. Some tests showed this: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/per-connection-quot-read-after-my-write-quot-consistency-td6018377.html
My hypothetical setup for this scenario is 2 nodes, replication factor 2, read level 1, write level 1. This leads to eventual consistency, but I want read-your-own-writes consistency on reads.
Using 3 nodes, RF=3, RL=quorum and WL=quorum in my opinion leads to wasteful read requests if being consistent only on "my" own data is enough.
// seo: also known as: session consistency, read-after-my-write consistency
Good question.
We've had http://issues.apache.org/jira/browse/CASSANDRA-876 open for a while to add this, but nobody's bothered finishing it because
CL.ONE is just fine for a LOT of workloads without any extra gymnastics
Reads are so fast anyway that doing the extra one is not a big deal (and in fact Read Repair, which is on by default, means all the nodes get checked anyway, so the difference between CL.ONE and higher is really more about availability than performance)
That said, if you're motivated to help, ask on the ticket and I'll be happy to point you in the right direction.
I've been following Cassandra development for a little while and I haven't seen a feature like this mentioned.
That said, if you only have 2 nodes with a replication factor of 2, I would question whether Cassandra is the best solution. You are going to end up with the entire data set on each node, so a more traditional replicated SQL setup might be simpler and more widely tested. Cassandra is very promising but it is still only version 0.8.2 and problems are regularly reported on the mailing list.
The other way to solve the 'see my own updates' problem would be to cache the results somewhere closer to the client, whether in the web server, the application layer, or using something like memcached.
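As a sketch of that last idea, one could keep a small write-through cache of the client's own writes in front of Cassandra; the users table and column names here are hypothetical, and in practice the cache would live in memcached or similar, scoped to the user's session.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ReadYourWritesCache {
    private final Session session;
    // Values this client has written itself; consulting it first gives
    // read-your-own-writes without raising the read consistency level.
    private final Map<String, String> ownWrites = new ConcurrentHashMap<>();

    ReadYourWritesCache(Session session) {
        this.session = session;
    }

    void setProfilePicture(String userId, String pictureUrl) {
        session.execute(new SimpleStatement(
                "UPDATE users SET picture = ? WHERE id = ?", pictureUrl, userId)); // hypothetical table
        ownWrites.put(userId, pictureUrl);   // remember our own write
    }

    String getProfilePicture(String userId) {
        String cached = ownWrites.get(userId);
        if (cached != null) {
            return cached;                   // our own latest write, no extra replicas consulted
        }
        Row row = session.execute(new SimpleStatement(
                "SELECT picture FROM users WHERE id = ?", userId)).one();
        return row == null ? null : row.getString("picture");
    }
}
```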
