http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
A SERIAL consistency level allows reading the current (and possibly
uncommitted) state of data without proposing a new addition or update.
If a SERIAL read finds an uncommitted transaction in progress, it will
commit it as part of the read.
What I did not understand is: how can a read operation commit an in-progress transaction? Does it mean to say that it will read it as part of the commit?
Thanks for spotting the problem in the docs. The sentence should say, "If a SERIAL read finds an uncommitted transaction in progress, Cassandra will perform a read repair as part of the commit. A read repair updates replicas with the most recent version of frequently-read data."
Related
I am reading Cassandra: The Definitive Guide, 3rd edition. It has the following text:
The serial consistency level can apply on reads as well. If Cassandra detects that a query is reading data that is part of an uncommitted transaction, it commits the transaction as part of the read, according to the specified serial consistency level.
Why is a read committing an uncommitted transaction, and doesn't it interfere with the writer's ability to roll back?
https://community.datastax.com/questions/5769/why-a-read-is-committing-an-uncommitted-transactio.html
Committed means that a mutation (INSERT, UPDATE or DELETE) has been added to the commitlog.
Uncommitted is when a mutation is still in the process of being saved to the commitlog.
In order for an LWT to provide guarantees such as IF EXISTS or IF NOT EXISTS, it has to add to the commitlog any data that has not yet been written there by another in-flight operation.
Here, uncommitted data doesn't mean that it was a failed write. Uncommitted data is data that was successfully written to some node in the cluster but has not yet been updated on the current node.
Here,
it commits the transaction as part of the read
means that Cassandra will initiate a read repair and update the data on the node before sending the data back to the client.
Rollback is not in the picture here because the write was successful; this concerns only the replication of data across nodes.
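To make this concrete, here is a minimal sketch (DataStax Java driver 3.x; the contact point, keyspace and table names are all hypothetical) of an LWT write followed by a SERIAL read. If the read encounters an in-flight Paxos round for the partition, Cassandra completes it, via the read repair described above, before returning the row:

    import com.datastax.driver.core.*;

    public class SerialReadExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks")) {
                // Conditional (LWT) write: goes through the Paxos prepare and commit phases.
                ResultSet rs = session.execute(
                        "INSERT INTO users (id, name) VALUES (1, 'alice') IF NOT EXISTS");
                System.out.println("applied: " + rs.wasApplied()); // the [applied] column

                // SERIAL read: any uncommitted Paxos state found for this partition is
                // committed before the row is returned.
                Statement read = new SimpleStatement("SELECT name FROM users WHERE id = 1")
                        .setConsistencyLevel(ConsistencyLevel.SERIAL);
                Row row = session.execute(read).one();
                System.out.println(row == null ? "no row" : row.getString("name"));
            }
        }
    }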
I am a PhD student at Seoul National University. My name is Seokwon Choi. I was impressed by your research paper (the analysis of network partition faults). I hope to present this paper with my lab members at our lab seminar.
However, after reading your research paper and your presentation slides, I have one question.
Why does the read operation return value Y in VoltDB? Replication failed, so the write failed. Why does it update value Y in local storage,
and why does the read operation return value Y that was only updated locally?
I think the read operation should return the committed value (the one written successfully: in this case, value X).
I tried to find an answer in the VoltDB documentation; it seems VoltDB can allow a dirty read. Why allow a dirty read when a network partition happens in VoltDB?
Is there any reason it works like this?
I have attached a picture of the dirty read during the network partition.
Thank you
Best Regards
From Seokwon Choi
VoltDB does not allow dirty reads. In your picture, you show a 3-node cluster where 1 node gets partitioned from the other 2 and the single node is a partition master.
Event1: Network partition
Event2: Write to minority (and you show that the write fails, which is correct)
Event3: Read from minority (and you show a dirty read, which is incorrect).
Event 3 is not possible. The single node that gets partitioned from the other two will shut down its client interface and then crash, never allowing event 3 to happen.
We ran Jepsen tests several years ago and fixed a defect in V6.4 that in some circumstances would allow that dirty read from event#3. See https://aphyr.com/posts/331-jepsen-voltdb-6-3 and https://www.voltdb.com/blog/2016/07/12/voltdb-6-4-passes-official-jepsen-testing/ for the full details on the guarantees from VoltDB, the Jepsen testing we did, and the defects that were fixed in order to pass the test.
Disclosure: I work for VoltDB.
The questions are regarding the “CAS operations” paragraph in the article: http://www.datastax.com/dev/blog/cassandra-error-handling-done-right
a)
If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType(). In this situation you can’t know if the CAS operation has been applied.
How do you understand this?
I thought that if the Paxos (prepare) phase fails, then the coordinator will not initiate the commit phase at all?
I guess that it does not matter how the Paxos phase fails (not enough replicas, replica timeouts, etc.).
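For reference, a minimal sketch (DataStax Java driver 3.x; WriteType lives in com.datastax.driver.core, WriteTimeoutException in com.datastax.driver.core.exceptions; the table and session are the same hypothetical ones as in the earlier sketch) of telling the two failure modes apart:

    try {
        session.execute(
                "UPDATE users SET name = 'bob' WHERE id = 1 IF name = 'alice'");
    } catch (WriteTimeoutException e) {
        if (e.getWriteType() == WriteType.CAS) {
            // Paxos phase timed out: it is unknown whether the CAS was applied.
        } else if (e.getWriteType() == WriteType.SIMPLE) {
            // Commit phase timed out: the Paxos round itself succeeded, and the
            // write will eventually be committed on the replicas.
        }
    }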
b)
The commit phase is then similar to regular Cassandra writes… you can simply ignore this error if you make sure to use setConsistencyLevel(ConsistencyLevel.SERIAL) on the subsequent read statements on the column that was touched by this transaction, as it will force Cassandra to commit any remaining uncommitted Paxos state before proceeding with the read
Wondering about the above in relation to writes with ConsistencyLevel.QUORUM:
If the commit phase failed because there is no quorum (unavailable nodes or timeouts), then we get back a WriteTimeoutException with a WriteType of SIMPLE, right?
In this case it is not clear whether the write actually succeeded or not, right?
So I'm not sure what all the possibilities are from this point on (recover/rollback/nothing)?
Is it saying that if I use ConsistencyLevel.QUORUM for the read operation, I may see the old data version (as if the above write was not successful) for some time, and after that, again with a QUORUM read, I will see that the write succeeded?
(Actually, I've seen exactly this in a 3-node cluster with replication factor 3: after a WriteTimeoutException (2 replicas were required but only 1 acknowledged the write), a quorum read just after that returned the old data, and then when I checked with cqlsh I saw the new data.)
How is this possible?
My guess:
Probably, after the timeout, the coordinator has not yet reached quorum for the commit phase (so subsequent QUORUM reads get the older data version) and returns the WriteTimeoutException with type SIMPLE to the client. When the nodes that timed out eventually respond and apply the commit, quorum is reached, and from that moment on all quorum reads will see the newer data version.
But I'm not sure how to explain what happens when you read with SERIAL.
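A sketch of the recovery pattern the article suggests (same hypothetical table and session as above): after a commit-phase timeout, re-reading at SERIAL forces any remaining uncommitted Paxos state to be committed before the row is returned, while a plain QUORUM read may keep returning the old value until the timed-out replicas catch up:

    // QUORUM read: may return the pre-CAS value until the timed-out
    // replicas apply the commit.
    Statement quorumRead = new SimpleStatement("SELECT name FROM users WHERE id = 1")
            .setConsistencyLevel(ConsistencyLevel.QUORUM);
    Row before = session.execute(quorumRead).one();

    // SERIAL read: commits any remaining uncommitted Paxos state first, so it
    // reflects the CAS write as soon as the Paxos round has succeeded.
    Statement serialRead = new SimpleStatement("SELECT name FROM users WHERE id = 1")
            .setConsistencyLevel(ConsistencyLevel.SERIAL);
    Row after = session.execute(serialRead).one();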
I have an application that pulls documents from CouchDB, from the first doc to the latest one, batch by batch.
I compacted my database from 1.7 GB to 1.0 GB, and /db/_changes seems the same.
Can anyone please clarify if CouchDB compaction affects /db/_changes ?
All compaction does is remove old references to documents in a given database. The changes feed deals exclusively with write operations, which are unaffected by compaction (since those writes have already happened).
Now, it should be noted that the changes feed will give you the rev numbers as well. Upon compaction, all but the most recent rev are deleted, so those entries in the changes feed will have "dead" links, so to speak.
See the docs for more information about compaction.
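As an illustration, a minimal sketch (Java 11+ java.net.http; the host and database name are hypothetical) of pulling one batch from the feed; the sequence of change entries looks the same before and after compaction, but older revs referenced in the entries may no longer be fetchable:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ChangesFeedExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Pull one batch of changes; pass ?since=<last_seq> to get the next batch.
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://127.0.0.1:5984/mydb/_changes?limit=100")).build();
            HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.body());
        }
    }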
If I understand it correctly, upon a write request the write is sent to all N replicas, and the operation succeeds when the first W responses are received. Is this correct?
If it is, then combined with Hinted Handoff, it seems that all replicas will already get all writes as soon as possible; do we really have to do read repair in this case?
Thanks.
Short answer: you still need read repair.
Longer answer: there wasn't a good discussion of Hinted Handoff anywhere, so I wrote one.
For Cassandra 1.0+, read the updated article. The crucial part being:
At first glance, it may appear that Hinted Handoff lets you safely get away without needing repair. This is only true if you never have hardware failure.
It is possible for hinted handoff to fail for various reasons; for example, the node the hint was written to can itself fail. With read repair enabled, if hinted handoff doesn't work for some reason, read repair will fix it. You should also run "nodetool repair" on your nodes to catch any cases where read repair and hinted handoff both fail to fix all the data.
Check the wiki for more info.
http://wiki.apache.org/cassandra/AntiEntropy
http://wiki.apache.org/cassandra/HintedHandoff
The consistency level can be varied for each write (and read).
For example, let's say we have 10 nodes, with a replication factor of 3.
If we write with a consistency level of ANY, none of the eventual 3 replicas may initially have the data when the write call returns. If we use consistency level ONE, then only one of the eventual 3 replicas has to have the data before the write returns, so a read straight after the write may see outdated data if the read has a low consistency level.
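As an illustration, a sketch (DataStax Java driver 3.x; the table name and session are hypothetical) of varying the consistency level per statement:

    // Write acknowledged by a single replica before the call returns.
    Statement write = new SimpleStatement(
            "INSERT INTO events (id, payload) VALUES (?, ?)", 42, "hello")
            .setConsistencyLevel(ConsistencyLevel.ONE);
    session.execute(write);

    // Read that must be answered by a majority of the replicas.
    Statement read = new SimpleStatement(
            "SELECT payload FROM events WHERE id = ?", 42)
            .setConsistencyLevel(ConsistencyLevel.QUORUM);
    Row row = session.execute(read).one();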
See http://wiki.apache.org/cassandra/API for the definitions of the consistency levels, particularly the following:
Read level ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair.)
See also http://wiki.apache.org/cassandra/ReadRepair :
Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas. If a low ConsistencyLevel was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.