The doc says "This means that a write to a row within a single partition on a single node is only visible to the client performing the operation".
If another client (client2) performs operations on the same partition on the same single node, is the write being performed by "THE CLIENT", as mentioned in the doc, also visible to client2?
And, what does "visible to the client performing the operation" actually mean?
More concrete examples would be much appreciated!
This means that when you make a change, the data is written atomically within the partition: client2 won't see part of the row, only the full row or nothing.
Scenario:
Client sends a write request to a coordinator node
Replication factor is 3 and Read/Write Consistency level is QUORUM.
Coordinator sends the request to nodes A, B and C. Data is committed to node A, but nodes B and C go down immediately after receiving the request from the coordinator.
Coordinator will send a timeout exception to the client since it has not received an ack from nodes B and C within the allotted time. The data on node A is now inconsistent with the data on nodes B and C.
Based on my understanding nodes B and C will be updated with the value on node A during read repair. So we had a timeout exception here, but the new value has been eventually written to all the nodes.
There could be other timeout exceptions where the new data has not been written to any of the nodes.
So it appears that the developer is expected to handle the timeout exception in the code, which may not be straightforward in all cases (because the new value may be written in some cases and not in others, and the developer has to check for that during a retry after the timeout).
I'm just learning Cassandra. So if my understanding is not correct, please correct me.
Some of you may say that this happens in a relational DB too, but it's a rare occurrence there since it's not a distributed system.
Here are some articles that I found, but they do not address my question specifically.
What happens if a coordinator node goes down during a write in Apache Cassandra?
https://www.datastax.com/blog/2012/08/when-timeout-not-failure-how-cassandra-delivers-high-availability-part-1
If the data is written, it is consistent, even if nodes B and C did not send the ACK:
When the data is received by a node, it first goes to the commit log, and if the node crashes, it will replay the mutation as soon as it starts up again.
As the second article said, it is more like an InProgressException than a TimedOutException.
On the client side, if you get a TimedOutException you are not 100% sure that the data was written, but it could have been.
In your case, if the write was received by nodes B and C, the data is consistent even if they did not send an ACK. Even if just one of the two nodes received it, the data is consistent too, because of the use of QUORUM.
On the cluster side, there are several mechanisms that help Cassandra be more consistent: hinted handoff, read repair, and repair.
For better understanding, maybe worth taking a look at :
write path :
https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/dml/dml_write_path_c.html
hinted handoff:
https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/dml/dml_about_hh_c.html
read repair :
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesReadRepair.html
Thanks for the response. It still does not help answer the question from an end user/developer perspective since I need to write the code to handle the exception.
For whatever it's worth, I found the below article on DataStax.
https://www.datastax.com/blog/2014/10/cassandra-error-handling-done-right
If you refer to the sections on 'WriteTimeOutException' and 'Non-idempotent operations', you can see that the end user is expected to do a retry after receiving the exception.
If it's an idempotent operation, then no additional code is required on the application side. Things are not so straightforward for non-idempotent operations. Cassandra assumes that most write operations are idempotent, and I don't necessarily agree with this. The business rules depend on the application.
Example of non-idempotent operations:
update table set counter = counter + 1 where key = 'xyz'
or update table set commission = commission * 1.02 where key = 'abc'
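To make the retry problem concrete, here is a minimal Python sketch. It does not talk to Cassandra at all: `FlakyStore`, `WriteTimeout` and `write_with_retry` are hypothetical names invented for this illustration, and the store is assumed to apply the first write and then time out anyway, which is the case described in the question.

```python
class WriteTimeout(Exception):
    """Raised when the simulated coordinator gives up waiting for acks."""


class FlakyStore:
    """Hypothetical in-memory stand-in for a cluster: the first call to
    apply() performs the mutation and then times out anyway."""

    def __init__(self):
        self.rows = {"xyz": 10}
        self._timed_out_once = False

    def apply(self, mutation):
        mutation(self.rows)  # the write lands on at least one replica
        if not self._timed_out_once:
            self._timed_out_once = True
            raise WriteTimeout("no ack within the allotted time")


def write_with_retry(store, mutation, attempts=2):
    for _ in range(attempts):
        try:
            store.apply(mutation)
            return
        except WriteTimeout:
            continue  # blind retry: we do not know whether the write landed
    raise WriteTimeout("gave up")


def set_to_eleven(rows):
    rows["xyz"] = 11   # idempotent: SET counter = 11


def increment(rows):
    rows["xyz"] += 1   # non-idempotent: SET counter = counter + 1


idempotent_store = FlakyStore()
write_with_retry(idempotent_store, set_to_eleven)

counter_store = FlakyStore()
write_with_retry(counter_store, increment)

print(idempotent_store.rows["xyz"])  # 11
print(counter_store.rows["xyz"])     # 12, one more than intended
```

The blind retry is harmless for the absolute assignment but applies the increment twice, which is why the non-idempotent case needs extra care on the application side.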
The article gives some recommendations on how to handle non-idempotent operations using CAS/lightweight transactions in the 'Non-idempotent operations' section. This makes things complicated/ugly in the client code, especially when you have a lot of DML in the code.
Even though it's NOT the answer that I was looking for, it appears that there's no better way at least for now.
It seems to me that using IF would make the statement possibly fail if re-tried. Therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
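A rough Python sketch of that ledger idea follows. The `ledger` dict is only an in-memory stand-in for a table whose primary key is assumed to be (account id, transaction id); the names `record` and `balance` are made up for the example.

```python
from decimal import Decimal

# Hypothetical in-memory stand-in for a table whose primary key is
# (account_id, transaction_id); writing the same key twice overwrites the row.
ledger = {}


def record(account_id, transaction_id, amount):
    """Upsert one transaction row; repeating the call changes nothing."""
    ledger[(account_id, transaction_id)] = Decimal(amount)


def balance(account_id):
    """Derive the balance by summing every transaction row of the account."""
    return sum(
        (amount for (acct, _), amount in ledger.items() if acct == account_id),
        Decimal("0"),
    )


record("acct-1", "tx-001", "100.00")   # deposit
record("acct-1", "tx-002", "-30.00")   # withdrawal
record("acct-1", "tx-002", "-30.00")   # retry after a timeout: same row again

print(balance("acct-1"))  # 70.00
```

Because each write targets its own row, a retried withdrawal overwrites itself instead of being subtracted twice.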
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
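The applied/not-applied behaviour can be mimicked with a few lines of Python. This is a single-process sketch only: `conditional_update` is a hypothetical helper, there is no Paxos round, and the result dict merely imitates the shape of the response described above.

```python
def conditional_update(row, expected_version, new_values):
    """Apply new_values only if row['version'] equals expected_version.

    Returns a dict shaped like the result described above: applied=True on
    success, or applied=False plus the current version when the check fails.
    """
    if row["version"] == expected_version:
        row.update(new_values)
        return {"applied": True}
    return {"applied": False, "version": row["version"]}


user = {"userid": 1, "name": "bar", "version": 3}
change = {"name": "foo", "version": 4}

first = conditional_update(user, 3, change)
second = conditional_update(user, 3, change)  # a retry, or a second client

print(first)   # {'applied': True}
print(second)  # {'applied': False, 'version': 4}
```

A client that retries after a timeout and gets back applied=False with version 4 cannot tell from this alone whether its own earlier attempt or another client made the change, so it would still have to compare the stored values with what it tried to write.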
This question is more about linearizability (ordering) than idempotency, I think. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical, then the query can be retried many times without a change in the results. This provides a weak form of ordering (and is expensive), unlike most Cassandra writes. Generally you should only use CAS operations if you are attempting to record the state of a system (rather than a history or log).
Do not use many of these queries if you can help it; the guidelines suggest having only a small percentage of your queries rely on this behavior.
Based on "Read Operation in Cassandra at Consistency level of Quorum?", there are 3 ways to read data consistently:
a. WRITE ALL + READ ONE
b. WRITE ONE + READ ALL
c. WRITE QUORUM + READ QUORUM
For a given piece of data, the write usually happens once, but reads happen often.
With read consistency in mind, is it possible to merge a and b?
That is: WRITE ONE -> READ ONE -> if not found -> READ ALL.
Would this approach mean that the expensive read/write operation usually happens only once?
READ ALL would be needed only the first time, on a node that does not yet have the data.
So is my understanding correct?
Wilian, thanks for the detailed elaboration. I think I need to describe my use case, as below. I implemented a timeline that users can post to, and users can follow posts that interest them, so notifications are sent to the followers. To save bandwidth, users write/read posts at CL ONE. Eventually, users can always read the post after a while, thanks to read repair. Followers will receive the notification of comments added to post if they listen the post. Here is my question. I must make sure followers can read the comments if notifications were delivered to them. So I intend to use CL ONE to check whether the comment has been synced to the node queried. If there is no result, I try CL ALL to sync the comment. Other followers querying that node then don't need to sync from other nodes, since the CL ALL was already done, which saves bandwidth and lowers server overhead. So, as for your final scenario, I don't care whether the value is old or the latest, because the data was synced in response to notifications. I need to ensure users can get the comment if notification was delivered to followers.
From the answer to the linked question, Carlo Bertuccini wrote:
What guarantees consistency is the following disequation
(WRITE CL + READ CL) > REPLICATION FACTOR
The cases A, B, and C in this question appear to be referring to the three minimum ways of satisfying that disequation, as given in the same answer.
Case A
WRITE ALL will send the data to all replicas. If your replication factor (RF) is three (3), then WRITE ALL writes three copies before reporting a successful write to the client. But you can't possibly see that the write occurred until the next read of the same data key. Minimally, READ ONE will read from a single one of the aforementioned replicas, and satisfies the necessary condition: WRITE(3) + READ(1) > RF(3)
Case B
WRITE ONE will send the data to only a single replica. In this case, the only way to get a consistent read is to read from all of them. The coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas, informing them that there's a newer value. The hint occurs asynchronously but only after the READ ALL occurs does it satisfy the necessary condition: WRITE(1) + READ(3) > RF(3)
Case C
QUORUM operations must involve FLOOR(RF / 2) + 1 replicas. In our RF=3 example, that is FLOOR(3 / 2) + 1 == 1 + 1 == 2. Again, consistency depends on both the reads and the writes. In the simplest case, the read operation talks to exactly the same replicas that the write operation used, but that's never guaranteed. In the general case, the coordinator node doing the read will talk to at least one of the replicas used by the write, so it will see the newer value. In that case, much like the READ ALL case, the coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas. Of course, this also satisfies the necessary condition: WRITE(2) + READ(2) > RF(3)
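The arithmetic for the three cases can be checked with a short Python sketch. The function names are made up for the example; the formulas are the ones given above (FLOOR(RF / 2) + 1 for QUORUM, and WRITE CL + READ CL > REPLICATION FACTOR).

```python
def quorum(replication_factor):
    """Number of replicas a QUORUM operation must involve."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas, read_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return write_replicas + read_replicas > replication_factor


RF = 3
cases = {
    "A: WRITE ALL + READ ONE": (RF, 1),
    "B: WRITE ONE + READ ALL": (1, RF),
    "C: WRITE QUORUM + READ QUORUM": (quorum(RF), quorum(RF)),
    "WRITE ONE + READ ONE": (1, 1),
}
for name, (w, r) in cases.items():
    print(name, reads_see_latest_write(w, r, RF))
```

With RF=3 the first three combinations satisfy the condition and WRITE ONE + READ ONE does not.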
So to the OP's question...
Is it possible to "merge" cases A and B?
To ensure consistency it is only possible to "merge" if you mean WRITE ALL + READ ALL because you can always increase the number of readers or writers in the above cases.
However, WRITE ONE + READ ONE is not a good idea if you need to read consistent data, so my answer is: no. Again, using that disequation and our example RF=3: WRITE(1) + READ(1) > RF(3) does not hold. If you were to use this configuration, receiving an answer that there is no value cannot be trusted -- it simply means that the one replica contacted to do the read did not have a value. But values might exist on one or more of the other replicas.
So from that logic, it might seem that doing a READ ALL on receiving a no value answer would solve the problem. And it would for that use case, but there's another to consider: what if you get some value back from the READ ALL... how do you know that the value returned is "the latest" one? That's what's meant when we want consistency. If you care about reading the most recent write, then you need to satisfy the disequation.
Regarding the use case of "timeline" notifications in the edited question
If my understanding of your described scenario is correct, these are the main points to your use case:
Most (but not all?) timeline entries will be write-once (not modified later)
Any such entry can be followed (there is a list of followers)
Any such entry can be commented upon (there is a list of comments)
Any comment on a timeline entry should trigger a notification to the list of followers for that timeline entry
Trying to minimize cost (in this case, measured as bandwidth) for the "normal" case
Willing to rely on the anti-entropy features built into Cassandra (e.g. read repair)
I need to ensure users can get the comment if notification was delivered to followers.
Since most of your entries are write-once, and you care more about the existence of an entry and not necessarily the latest content for the entry, you might be able to get away with WRITE ONE + READ ONE with a fallback to READ ALL if you get no record for something that had some other indication it should exist (e.g. from a notification). For the timeline entry content, it does not sound like your case depends on consistency of the user content of the timeline entries.
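As a sketch of that fallback, assuming three in-memory "replicas" and a much simplified read repair (none of these names come from a Cassandra driver; `read_one`, `read_all` and `read_notified` are invented for the illustration):

```python
# Three hypothetical replicas, each a dict of key -> (write_timestamp, value).
# The comment was written at consistency level ONE and reached only replica 0.
replicas = [
    {"comment-42": (1700000000, "Nice post!")},
    {},
    {},
]


def read_one(key, replica_index):
    """Ask a single replica; returns None when that replica has no row."""
    return replicas[replica_index].get(key)


def read_all(key):
    """Ask every replica, keep the newest cell, and repair stale replicas."""
    answers = [r[key] for r in replicas if key in r]
    if not answers:
        return None
    newest = max(answers)  # tuples compare by timestamp first
    for r in replicas:
        r[key] = newest    # simplified read repair
    return newest


def read_notified(key, replica_index):
    """READ ONE, falling back to READ ALL when the row is unexpectedly absent."""
    return read_one(key, replica_index) or read_all(key)


first = read_notified("comment-42", replica_index=2)  # miss, then READ ALL
second = read_one("comment-42", replica_index=2)      # now a plain hit
print(first[1], second[1])
```

The fallback only triggers on a missing row. If the single replica returns an older version of the row, this pattern returns it as-is, which is acceptable only when existence matters more than freshness.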
If you don't care about consistency, then this discussion is moot; read/write with whatever Consistency Level and let Cassandra's asynchronous replication and anti-entropy features do their work. That said, though your goal is minimizing network traffic/cost, if your workload is mostly reads then the added cost of doing writes at CL QUORUM or ALL may not actually be that much.
You also said:
Followers will receive the notification of comments added to post if they listen the post.
This statement implies that you care not only about whether the set of followers exists but also about its contents (which users are following). You have not detailed how you are storing/tracking the followers, but unless you ensure the consistency of this data, it is possible that one or more followers are not notified of a new comment because you retrieved an out-of-date version of the follower list. Or, someone who "unfollowed" a post could still receive notifications for the same reason.
Cassandra is very flexible and allows each discrete read and write operation to use different consistency levels. Take advantage of this and ensure strong consistency where it is needed and relax it where you are sure that "reading the latest write" is not important to your application's logic and function.
I know that in Cassandra, there's no strong consistency unless you explicitly request it (and even then, there're no transactions).
However, I'm interested in the "order" of consistency. Take the following example:
In a database cluster, there are 3 nodes (A, B and C). Two insert queries are sent through the same CQL connection (or Thrift for that matter; I don't think that's relevant to this question anyway). Both operate on different tables (this might be relevant).
INSERT INTO table_a (id) VALUES (0)
INSERT INTO table_b (id) VALUES (1)
Directly after the queries have been successfully executed on the node that they're sent to, it goes down. The node may or may not have succeeded in propagating these two queries to B and C.
Now, I'd think that there is an order of consistency. Either both are successfully propagated and executed on B and C, or only the first query is, or neither is. I'd think that under no circumstances is only the second query propagated and executed, and not the first (because of the order of TCP packets, and the fact that obviously, all nodes share the same consistency strategy).
Am I right?
You're right, at least on the node you connect to. What happens on the server is, for a consistency level ONE write:
1. Receive insert to table_a
2. Write into commitlog
3. Acknowledge write to client
4. Receive insert to table_b
5. Write into commitlog
6. Acknowledge write to client
The key is that there is a global commitlog. So you can't flush it for one table and not another. Also, because the writes are sequential, you know the write was made to the commitlog before returning.
The commitlog gets flushed periodically (by default), so could flush after 2 but before 5, in which case only the insert to table_a is kept in the event of a crash immediately after 4 or 5.
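A toy model of that crash scenario, in Python. The `Node` class is hypothetical and models only what is described above: one commit log shared by all tables, appended to in arrival order and flushed periodically.

```python
class Node:
    """Hypothetical single node with one commit log shared by all tables."""

    def __init__(self):
        self.pending = []   # appended but not yet flushed to disk
        self.durable = []   # survives a crash

    def write(self, table, row_id):
        self.pending.append((table, row_id))  # strictly in arrival order

    def flush(self):
        self.durable.extend(self.pending)
        self.pending.clear()

    def crash_and_replay(self):
        self.pending.clear()          # unflushed entries are lost
        return list(self.durable)     # what replay restores


node = Node()
node.write("table_a", 0)
node.flush()                # periodic flush happens to land here
node.write("table_b", 1)
recovered = node.crash_and_replay()
print(recovered)  # [('table_a', 0)]
```

Since the log is shared and append-only, whatever survives is always a prefix of the writes in arrival order, so in this model the insert to table_b cannot survive without the insert to table_a.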
On other nodes, the ordering isn't guaranteed, because the write is done asynchronously and writes are multithreaded. But it's not possible to totally lose the first write and not the second if the original node doesn't fail permanently.
If you want stronger guarantees, you can use Cassandra's batching.
Cassandra can guarantee that neither or both of the writes succeed if you write them as a batch. Even in old Cassandra versions, if updates within a batch have the same row key (partition key in CQL speak), even if they are in different column families (tables), they will get committed to the commitlog atomically.
New in 1.2 is a batchlog across multiple rows that offers the same guarantees - either all the batch gets applied or none.
Is it possible to make a conditional insert with the Windows Azure Table Storage Service?
Basically, what I'd like to do is to insert a new row/entity into a partition of the Table Storage Service if and only if nothing changed in that partition since I last looked.
In case you are wondering, I have Event Sourcing in mind, but I think that the question is more general than that.
Basically I'd like to read part of, or an entire, partition and make a decision based on the content of the data. In order to ensure that nothing changed in the partition since the data was loaded, an insert should behave like normal optimistic concurrency: the insert should only succeed if nothing changed in the partition - no rows were added, updated or deleted.
Normally in a REST service, I'd expect to use ETags to control concurrency, but as far as I can tell, there's no ETag for a partition.
The best solution I can come up with is to maintain a single row/entity for each partition in the table which contains a timestamp/ETag and then make all inserts part of a batch consisting of the insert as well as a conditional update of this 'timestamp entity'. However, this sounds a little cumbersome and brittle.
Is this possible with the Azure Table Storage Service?
The view from a thousand feet
Might I share a small tale with you...
Once upon a time someone wanted to persist events for an aggregate (from Domain Driven Design fame) in response to a given command. This person wanted to ensure that an aggregate would only be created once and that any form of optimistic concurrency could be detected.
To tackle the first problem - that an aggregate should only be created once - he did an insert into a transactional medium that threw when a duplicate aggregate (or more accurately the primary key thereof) was detected. The thing he inserted was the aggregate identifier as primary key and a unique identifier for a changeset. A collection of events produced by the aggregate while processing the command, is what is meant by changeset here. If someone or something else beat him to it, he would consider the aggregate already created and leave it at that. The changeset would be stored beforehand in a medium of his choice. The only promise this medium must make is to return what has been stored as-is when asked. Any failure to store the changeset would be considered a failure of the whole operation.
To tackle the second problem - detection of optimistic concurrency in the further life-cycle of the aggregate - he would, after having written yet another changeset, update the aggregate record in the transactional medium if and only if nobody had updated it behind his back (i.e. compared to what he last read just before executing the command). The transactional medium would notify him if such a thing happened. This would cause him to restart the whole operation, rereading the aggregate (or changesets thereof) to make the command succeed this time.
Of course, now that he had solved the writing problems, along came the reading problems. How would one be able to read all the changesets of an aggregate that made up its history? After all, he only had the last committed changeset associated with the aggregate identifier in that transactional medium. And so he decided to embed some metadata as part of each changeset. Among the metadata - which is not so uncommon to have as part of a changeset - would be the identifier of the previous last committed changeset. This way he could "walk the line" of changesets of his aggregate, like a linked list so to speak.
As an additional perk, he would also store the command message identifier as part of the metadata of a changeset. This way, when reading changesets, he could know in advance if the command he was about to execute on the aggregate was already part of its history.
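For readers who prefer code to tales, here is one possible in-memory rendering in Python. Two dicts stand in for the changeset store and the transactional medium, and every name (`commit`, `history`, `already_handled`, `ConcurrencyError`) is invented for this sketch; it is not the DynamoDB/S3 implementation mentioned in the P.S. below.

```python
import uuid


class ConcurrencyError(Exception):
    pass


changesets = {}   # changeset_id -> {"previous", "command_id", "events"}
heads = {}        # aggregate_id -> id of the last committed changeset


def commit(aggregate_id, expected_head, command_id, events):
    """Store a changeset, then move the head only if nobody else moved it."""
    changeset_id = str(uuid.uuid4())
    changesets[changeset_id] = {
        "previous": expected_head,
        "command_id": command_id,
        "events": events,
    }
    if heads.get(aggregate_id) != expected_head:
        raise ConcurrencyError(aggregate_id)  # caller re-reads and retries
    heads[aggregate_id] = changeset_id
    return changeset_id


def history(aggregate_id):
    """Walk the chain from the head back to the first changeset."""
    chain, cursor = [], heads.get(aggregate_id)
    while cursor is not None:
        chain.append(changesets[cursor])
        cursor = changesets[cursor]["previous"]
    return list(reversed(chain))


def already_handled(aggregate_id, command_id):
    return any(c["command_id"] == command_id for c in history(aggregate_id))


first = commit("order-1", None, "cmd-1", ["OrderCreated"])
commit("order-1", first, "cmd-2", ["ItemAdded"])

print([c["events"] for c in history("order-1")])
print(already_handled("order-1", "cmd-2"))
```

In a real store the compare-and-set on `heads` would have to be a single conditional write; here it is only safe because the sketch is single-threaded. A changeset written just before a failed head update is left orphaned, which the tale tolerates since only the head chain is ever read.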
All's well that ends well ...
P.S.
1. The transactional medium and changeset storage medium can be the same,
2. The changeset identifier MUST not be the command identifier,
3. Feel free to punch holes in the tale :-),
4. Although not directly related to Azure Table Storage, I've implemented the above tale successfully using AWS DynamoDB and AWS S3.
How about storing each event at a "PartitionKey/RowKey" created from AggregateId/AggregateVersion, where AggregateVersion is a sequential number based on how many events the aggregate already has?
This is very deterministic, so when adding a new event to the aggregate, you can be sure that you were using the latest version of it, because otherwise you'll get an error saying that the row for that partition already exists. At that point you can drop the current operation and retry, or try to figure out whether you could merge the operation anyway, if the new updates to the aggregate do not conflict with the operation you just did.
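A minimal Python sketch of that keying scheme, with a dict standing in for the table. `append_event`, `current_version` and `VersionConflict` are hypothetical names, and the "row already exists" error is simulated rather than taken from the Azure SDK.

```python
class VersionConflict(Exception):
    pass


events = {}  # (aggregate_id, aggregate_version) -> event payload


def append_event(aggregate_id, expected_version, payload):
    """Insert at version expected_version + 1; fail if that row exists."""
    key = (aggregate_id, expected_version + 1)
    if key in events:
        raise VersionConflict(key)  # someone else appended first
    events[key] = payload
    return key[1]


def current_version(aggregate_id):
    return max((v for (a, v) in events if a == aggregate_id), default=0)


append_event("cart-7", 0, "CartCreated")
seen = current_version("cart-7")           # two writers both read version 1
append_event("cart-7", seen, "ItemAdded")  # first writer wins
try:
    append_event("cart-7", seen, "ItemRemoved")
    outcome = "stored"
except VersionConflict:
    outcome = "conflict"
print(outcome)  # conflict
```

The second writer gets the conflict because its insert targets a key that already exists, and can then reload the aggregate and decide whether to retry or merge.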