I know that in Cassandra, there's no strong consistency unless you explicitly request it (and even then, there are no transactions).
However, I'm interested in the "order" of consistency. Take the following example:
In a cluster, there are 3 nodes (A, B and C). Two insert queries are sent through the same CQL connection (or Thrift for that matter; I don't think that's relevant to this question anyway). They operate on different tables (this might be relevant).
INSERT INTO table_a (id) VALUES (0)
INSERT INTO table_b (id) VALUES (1)
Directly after the queries have been successfully executed on the node they were sent to, that node goes down. It may or may not have succeeded in propagating these two queries to B and C.
Now, I'd think that there is an order of consistency: either both queries are successfully propagated and executed on B and C, only the first one is, or neither is. I'd think that under no circumstances is only the second query propagated and executed and not the first (because of the order of TCP packets, and the fact that all nodes obviously share the same consistency strategy).
Am I right?
You're right, at least on the node you connect to. What happens on the server is, for a consistency level ONE write:
1. Receive insert to table_a
2. Write into commitlog
3. Acknowledge write to client
4. Receive insert to table_b
5. Write into commitlog
6. Acknowledge write to client
The key is that there is a global commitlog. So you can't flush it for one table and not another. Also, because the writes are sequential, you know the write was made to the commitlog before returning.
The commitlog gets flushed periodically (by default), so it could flush after step 2 but before step 5, in which case only the insert to table_a is kept in the event of a crash immediately after step 4 or 5.
On other nodes, the ordering isn't guaranteed, because the write is done asynchronously and writes are multithreaded. But it's not possible to totally lose the first write and not the second if the original node doesn't fail permanently.
If you want stronger guarantees, you can use Cassandra's batching.
Cassandra can guarantee that either neither or both of the writes succeed if you write them as a batch. Even in older Cassandra versions, if updates within a batch have the same row key (partition key in CQL speak), they will get committed to the commitlog atomically, even if they are in different column families (tables).
New in Cassandra 1.2 is a batchlog for batches spanning multiple rows that offers the same guarantee: either the whole batch gets applied or none of it.
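For example, the two inserts from the question could be written as a single logged batch; a minimal sketch (BEGIN BATCH is logged by default):

BEGIN BATCH
  INSERT INTO table_a (id) VALUES (0);
  INSERT INTO table_b (id) VALUES (1);
APPLY BATCH;

With a logged batch, Cassandra records the batch in the batchlog before applying it, so either both inserts eventually take effect or neither does.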
Related
In Cassandra, tombstones are used for deletion since the writes go to immutable files. I read that tombstones also solve the tough problem of deleting in distributed systems. This is where I am confused: what problems exist in deleting from distributed databases?

For example, take a 3-node cluster with nodes A, B and C. Say node C is down and a delete comes in. It is marked as a tombstone on A and B and success is returned to the client. After some time, compaction kicks in on A and B and clears out this tombstone. Now when a read comes for the previously deleted value, A and B return nothing while C returns the old value. But here I read that the value given by C takes precedence over the empty responses.
If the tombstoned record has already been deleted from the rest of the cluster before that node recovers, Cassandra treats the record on the recovered node as new data, and propagates it to the rest of the cluster.
Why does it do this? Since a quorum of nodes says the value is not present, why don't we return that back to the client? This could potentially simplify the problem of deletes in distributed systems, as we wouldn't need to wait for gc_grace_seconds before clearing out the tombstones.
A quorum returning nothing could also mean that the rest of the nodes simply didn't receive the value because they were down, in which case the single node that has the data is correct, and its value will be propagated to the nodes that don't have it. Cassandra simply doesn't know whether the data is missing because it was deleted via a tombstone, or missing because the nodes weren't available at the time of the write.
That's why it's important to run repairs regularly and make sure that they complete within gc_grace_seconds, and that you don't put a machine back into the cluster after it has been offline for longer than this period.
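As a rough sketch (the keyspace and table names here are hypothetical), the grace period is a per-table option you can inspect and tune, and repair should complete within it:

-- gc_grace_seconds defaults to 864000 (10 days); tombstones are only purged after it expires
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 864000;
-- and run `nodetool repair my_ks` on every node at least once within that window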
According to the DataStax article, strong consistency can be guaranteed
if R + W > N
where
R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas
What does strong consistency mean here? Does it mean that 'every time' a query's response is given from the database, the response will 'always' be the last updated value? If the conditions for strong consistency are maintained in Cassandra, are there no scenarios where the data returned might be inconsistent? In short, does strong consistency mean 100% consistency?
Edit 1
Adding some additional material regarding some scenarios where Cassandra might not be consistent even when R+W>RF
Write fails with Quorum CL
Cassandra's eventual consistency
Cassandra has tunable consistency with some tradeoffs you can choose.
R + W > N simply means there must be at least one overlapping node in your read/write round trip that has the newest data, so the result is consistent.
For example, if you write at CL.ONE you will need to read at CL.ALL to be sure of getting a consistent result: 1 + N > N. But you might not want CL.ALL, as then you cannot tolerate a single node failure in your cluster.
Often you can choose CL.QUORUM at read and write time to ensure consistency and tolerate node failures. For example, at RF=3 a QUORUM needs floor(3/2)+1 = 2 nodes, so R + W > N becomes 2 + 2 = 4 > 3: your requests are consistent AND you can tolerate a single node failure.
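In cqlsh, for example, you can set the consistency level for the session before issuing queries; a minimal sketch (the keyspace and table names are just illustrative):

-- cqlsh: set the consistency level for subsequent requests in this session
CONSISTENCY QUORUM;
-- with RF=3, both the write and the read now touch 2 of 3 replicas, so they overlap
INSERT INTO my_ks.my_table (id, value) VALUES (0, 'x');
SELECT value FROM my_ks.my_table WHERE id = 0;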
One thing to keep in mind: it is really important to have tightly synchronized clocks on all your nodes (Cassandra and application); you will want to have NTP up and running.
While this is an old question, I thought I would chip in to set the record straight.
R+W>RF does not imply strong consistency
A system with R + W > RF will only be eventually consistent. The claim of a strong consistency guarantee breaks down during node failures or while writes are in flight. For example, consider the following scenario:
Assume that there are 3 nodes A,B,C with RF=3, W=3, R=2 (hence, R+W = 5 > 3 = RF)
Further assume key k is associated with value v, i.e. (k, v) is stored in the database. Suppose the following series of actions occurs:
t=1: (k,v1) write request is sent to A,B,C from a user
t=2: (k,v1) reaches A and is written to store at A
t=3: Reader 1 sends a read request for key k, which is replied to by A and B
t=4: Reader 1 receives response (k,v1), by the last-write-wins rule
t=5: Reader 1 sends another read request which gets served by nodes B and C
t=6: Reader 1 receives response (k,v), which is an older value: INCONSISTENCY
t=7: (k,v1) reaches C and is written to store at C
t=8: (k,v1) reaches B and is written to store at B
This demonstrates that W + R > RF cannot guarantee strong consistency. To ensure strong consistency you might want to use another algorithm, such as Paxos or Raft, that can help ensure the writes are atomic. You can read an interesting article on this here (do check out the FAQ section).
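Cassandra itself exposes Paxos through lightweight transactions, which give linearizable behaviour for operations on a single partition. A rough sketch (the table and column names are hypothetical; k is assumed to be the primary key):

INSERT INTO users (k, v) VALUES ('k', 'v1') IF NOT EXISTS;
UPDATE users SET v = 'v2' WHERE k = 'k' IF v = 'v1';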
Edit:
Cassandra does have an internal mechanism (called blocking read repair) that triggers synchronous writes before the response from the database is sent back to the client. This kind of synchronous read repair occurs when there are inconsistencies among the nodes queried to achieve the read consistency level, and it ensures something known as Monotonic Read Consistency [see below for definitions]. This causes the (k, v1) in the above example to be written to node B before the response to the first read request is returned, so the second read request would also see the updated value. (Thanks to @Nadav Har'El for pointing this out.)
However, this still does not guarantee strong consistency. Below are some definitions to clear things up:
Sequential/Strong Consistency: the result of any execution is the same as if the reads and writes occur in some order, and the operations of each individual processor appear in this sequence in the order specified by its program [as defined by Leslie Lamport]
Monotonic Read Consistency: once you read a value, all subsequent reads will return this value or a newer version
Sequential consistency would require the client program/reader to see the latest value that was written, since the write statement is executed before the read statement in the sequence of program instructions.
For both reads and writes, the consistency levels of ANY, ONE, TWO, and THREE are considered weak, whereas QUORUM and ALL are considered strong.
Yes. If R + W is greater than the number of replicas, then you will always get consistent data: 100% consistency. But you will have to trade availability to achieve that higher consistency.
Cassandra has the concept of tunable consistency (you set the consistency on a per-query basis).
I would actually regard this strong consistency as strong read consistency. And it is sessional, a.k.a. Monotonic Read Consistency (refer to Nadav Har'El's answer).
But it is not sequential consistency, as Cassandra doesn't fully support locks, transactions, or serializing write operations. There are only lightweight transactions, which support local serialization of write operations and serialization of read operations.
To make things easy to understand, let's say we have three nodes A, B, C, and set the read consistency level to 3 and the write consistency level to 1.
If there is only one client, it writes to one node, say A.
B and C might not be synchronized yet (eventually they will be: eventual consistency).
But when the client reads again, it needs responses from all three nodes, and by comparing timestamps, the latest record (A's) wins. This is Monotonic Read Consistency.
However, if there are two clients trying to update the record at the same time, or if they both read the value first and then rewrite it (e.g. increase a column by 100) at the same time:
Client C1 and Client C2 both read the current column value as 10, and they both decide to increase it by 100:
C1 just needs to write 110 to one node; client C2 will do the same, and the final result on any node can only be 110 at most.
So we lose 100 in these operations (a lost update). This is an issue caused by the race condition, i.e. concurrency; it has to be fixed by serializing the operations or using some form of locking, just as SQL databases do with transactions.
I know Cassandra now has a counter column type which might solve this, but it is still limited compared to a full transaction. And Cassandra is also not supposed to be transactional, as it is a NoSQL database that sacrifices consistency for availability.
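As a rough sketch of the counter approach (the table and column names are hypothetical), the increment is applied server-side, so two concurrent +100 updates from C1 and C2 both take effect instead of one overwriting the other:

-- a counter table: all non-key columns must be counters
CREATE TABLE counts (
    id text PRIMARY KEY,
    total counter
);

-- both C1 and C2 issue this; the result is +200, not a lost update
UPDATE counts SET total = total + 100 WHERE id = 'some-row';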
I have a table in Cassandra which stores versions of csv-files. It uses a primary key with a unique id for the version (the partition key) and a row number (the clustering key). When I insert a new version I first execute a delete statement on the partition key I am about to insert, to clean up any incomplete data. Then the data is inserted.
Now here is the issue. Even though the delete and subsequent insert are executed synchronously after one another in the application it seems that some level of concurrency still exist in Cassandra, because when I read afterwards, rows from my insert will be missing occasionally - something like 1 in 3 times. Here are some facts:
Cassandra 3.0
Consistency ALL (for both reads and writes)
Delete using the Java Driver
Insert using the Spark-Cassandra connector
Number of nodes: 2
Replication factor: 2
The delete statement I execute looks like this:
"DELETE FROM myTable WHERE version = 'id'"
If I omit it, the problem goes away. If I insert a delay between the delete and the insert, the problem is reduced (fewer rows missing). Initially I used a less restrictive consistency level, and I was sure this was the issue, but it didn't affect the problem. My hypothesis is that for some reason the delete statement is being sent to the replica asynchronously despite the consistency level of ALL, but I can't see why this would be the case or how to avoid it.
By default, all mutations get a write timestamp assigned by the coordinator of that write. From the docs:
TIMESTAMP: sets the timestamp for the operation. If not specified, the coordinator will use the current time (in microseconds) at the start of statement execution as the timestamp. This is usually a suitable default.
http://cassandra.apache.org/doc/cql3/CQL.html
Since the coordinator can differ from mutation to mutation, clock skew between coordinators can cause the timestamps of mutations sent to one machine to be skewed relative to another.
Since write time controls C* history, this means you can have a driver which synchronously inserts and then deletes, but depending on the coordinator the delete can happen "before" the insert.
Example
Imagine two nodes A and B, where B is operating with a 5-second clock skew behind A.
At time 0: You insert data to the cluster and A is chosen as the coordinator. The mutation arrives at A and A assigns a timestamp (0)
There is now a record in the cluster
INSERT VALUE AT TIME 0
Both nodes contain this message and the request returns confirming the write was successful.
At time 2: You issue a delete for the data previously inserted and B is chosen as the coordinator. B assigns a timestamp of (-3) because it is clock skewed 5 seconds behind the time in A. This means that we end up with a statement like
DELETE VALUE AT TIME -3
We acknowledge that all nodes have received this record.
Now the global consistent timeline is
DELETE VALUE AT TIME -3
INSERT VALUE AT TIME 0
Since the insertion's timestamp comes after the delete's, the value still exists.
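One way to take the coordinator clocks out of the picture is to supply the write timestamp from the client, so both the delete and the insert are ordered by the same clock. A rough sketch against the table from the question above (the timestamp values and non-key columns are illustrative; many drivers can also generate client-side timestamps for you):

DELETE FROM myTable USING TIMESTAMP 1000 WHERE version = 'id';
INSERT INTO myTable (version, row_number, data) VALUES ('id', 1, '...') USING TIMESTAMP 1001;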
I had a similar problem, and I fixed it by enabling lightweight transactions (LWT) for both INSERT and DELETE requests (for all queries actually, including UPDATE). This makes sure all queries to the partition are serialized through one "thread", so the DELETE won't overwrite the INSERT. For example (assuming instance_id is the primary key):
INSERT INTO myTable (instance_id, instance_version, data) VALUES ('myinstance', 0, 'some-data') IF NOT EXISTS;
UPDATE myTable SET instance_version=1, data='some-updated-data' WHERE instance_id='myinstance' IF instance_version=0;
UPDATE myTable SET instance_version=2, data='again-some-updated-data' WHERE instance_id='myinstance' IF instance_version=1;
DELETE FROM myTable WHERE instance_id='myinstance' IF instance_version=2;
// or:
DELETE FROM myTable WHERE instance_id='myinstance' IF EXISTS;
IF clauses enable lightweight transactions for each row, so all of them are serialized. Warning: LWTs are more expensive than normal calls, but sometimes they are needed, as in the case of this concurrency problem.
This question is about NoSQL (take Cassandra, for instance).
Is it true that when you use a NoSQL database without data replication you have no consistency concerns? Not even in the case of concurrent access?
What happens in the case of a network partition where the same row has been written on both sides of the partition, possibly multiple times? When the partition heals, which written value is used?
Let's say you use N=5, W=3, R=3. This means you have guaranteed consistency, right? How good is it to use this quorum? Isn't having 3 nodes return the data a big overhead?
Can you specify on a per-query basis in Cassandra whether you want the query to have guaranteed consistency? For instance, you do an insert query and want to enforce that all replicas complete the insert before the value is returned by a read operation?
If you have: employees{PK:employeeID, departmentId, employeeName, birthday} and department{PK:departmentID, departmentName} and you want to get the birthday of all employees with a specific department name. Two problems:
you can't ask for all the employees with a given birthday (because you can only query on the primary key)
You can't join the employee and the department column families because joins are impossible.
So what you can do is create a column family:
departmentBirthdays{PK:(departmentName, birthday), [employees-whos-birthday-it-is]}
In that case whenever an employee is fired/hired it has to be removed/added in the departmentBirthdays column family. Is this process something you have to do manually? So you have to manually create queries to update all redundant/denormalized data?
I'll answer this from the perspective of Cassandra, because that's what you seem to be looking at (hardly any two NoSQL stores are the same!).
For a single node, all operations are in sequence. Concurrency issues can be orthogonal though: your web client may have made a request, and then another, but due to network load, Cassandra got the second one first. That may or may not be an issue. There are approaches around such problems, like immutable data. You can also leverage "lightweight transactions".
Cassandra uses last write wins to resolve conflicts. Based on your replication factor and the consistency level of your query, this can work well.
Quorum for reads AND writes will give you consistency. There is an edge case: if the coordinator doesn't know a quorum node is down, it sends the write requests, and the write then completes when quorum is re-established. The client in this case would get a timeout rather than a failure. The subsequent query may get stale data, but any query after that will get the latest data. This is an extreme edge case, and typically N=5, R=3, W=3 will give you full consistency. Reading from three nodes isn't actually that much of an overhead. For a query with R=3, the client makes the request to the node it's connected to (the coordinator). The coordinator queries the replicas in parallel (not sequentially) and merges the results with LWW to get the answer (issuing read repairs etc. if needed). As the queries happen in parallel, the overhead is greatly reduced.
Yes.
This is a matter of data modelling. You describe one approach (though partitioning on birthday rather than department might be better and result in a more even distribution of partitions). Do you need the employee and department tables, i.e. are they needed for other queries? If not, maybe you just need the one. If you denormalize, you'll need to maintain the data manually. In Cassandra 3.0, global indexes will allow you to query on an index without it being inefficient (which is the case when using a secondary index without specifying the partition key today). Yet another option is to partition employees by birthday, do two queries, and do the join in memory in the client. Cassandra queries hitting a single partition are very fast, so doing two won't really be that expensive.
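As a sketch of that denormalization (the types and exact column names are assumptions, mirroring the tables in the question), the application maintains both tables on hire, e.g. with a single batch, and issues the corresponding deletes on firing:

CREATE TABLE departmentBirthdays (
    departmentName text,
    birthday date,
    employeeID int,
    employeeName text,
    PRIMARY KEY ((departmentName, birthday), employeeID)
);

-- on hire: write to both tables (application-maintained denormalization)
BEGIN BATCH
  INSERT INTO employees (employeeID, departmentId, employeeName, birthday)
    VALUES (42, 7, 'Ada', '1990-01-01');
  INSERT INTO departmentBirthdays (departmentName, birthday, employeeID, employeeName)
    VALUES ('Engineering', '1990-01-01', 42, 'Ada');
APPLY BATCH;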
The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before the timeout). It's clear that if we read with consistency ONE, we may read the value, OK.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but the node's ACK got lost? (Note: this could result in a read of the value even at read consistency QUORUM!)
It may occur that the write will be persisted later on a second node (e.g. due to hinted handoff)? (Note: this could result in a read of the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node, and the written value will eventually be removed from the node via read repair?
It's impossible that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct: it will store a hint so that the failed node (replica) will eventually have this row replicated to it.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure