Why does a Cassandra cluster need synchronized clocks between nodes?

In DataStax's introductory Cassandra course they say that the clocks of all nodes in a Cassandra cluster have to be synchronized, in order to prevent READ queries from returning 'old' data.
If one or more nodes are down they cannot get updates, but as soon as they are back up again they will catch up and there is no problem...
So, why does a Cassandra cluster need synchronized clocks between nodes?

In general it is always a good idea to keep your server clocks in sync, but a primary reason why clock sync is needed between nodes is that Cassandra uses a concept called 'Last Write Wins' to resolve conflicts and determine which mutation represents the most correct, up-to-date state of the data. This is explained in Why cassandra doesn't need vector clocks.
Whenever you 'mutate' (write or delete) column(s) in Cassandra, a timestamp is assigned by the coordinator handling your request. That timestamp is written with the column value in a cell.
When a read request occurs, Cassandra builds your results by finding the mutations that match your query criteria, and when it sees multiple cells representing the same column it will pick the one with the most recent timestamp (the read path is more involved than this, but that is all you need to know in this context).
Things start to become problematic when your nodes' clocks fall out of sync. As I mentioned, the coordinator node handling your request assigns the timestamp. If you make multiple mutations to the same column and different coordinators are assigned, you can create situations where writes that happened in the past are returned instead of the most recent one.
Here is a basic scenario that describes that:
Assume we have a 2-node cluster with nodes A and B. Let's assume an initial state where A is at time t10 and B is at time t5.
Step 1: A user executes DELETE C FROM tbl WHERE key=5. Node A coordinates the request and it is assigned timestamp t10.
Step 2: A second passes and a user executes UPDATE tbl SET C='data' WHERE key=5. Node B coordinates the request and it is assigned timestamp t6.
Step 3: The user executes the query SELECT C FROM tbl WHERE key=5. Because the DELETE from Step 1 has a more recent timestamp (t10 > t6), no results are returned.
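If you want to see the timestamps Cassandra is comparing, you can read them back with the WRITETIME() function. A minimal sketch against the tbl table from the scenario above (WRITETIME works on regular columns only, not on primary key columns):
-- Returns the value of C together with the microsecond timestamp of its last write
SELECT C, WRITETIME(C) FROM tbl WHERE key = 5;
If the returned timestamp is newer or older than you expect for a write you know you made, a skewed coordinator clock is a likely suspect.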
Note that newer versions of the DataStax drivers default to Client Timestamps, having your client application generate and assign timestamps to requests instead of relying on the C* nodes to assign them. The DataStax java-driver as of 3.0 defaults to client timestamps (read more about that in 'Client-side generation'). This is very nice if all requests come from the same client; however, if you have multiple applications writing to Cassandra you now have to worry about keeping your client clocks in sync.
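If you cannot rely on the driver defaults, CQL also lets you assign the timestamp explicitly per statement with USING TIMESTAMP (microseconds since the epoch). A small sketch, reusing the tbl table from the scenario above with made-up timestamp values:
-- The client supplies the timestamp explicitly, so coordinator clocks no longer decide the winner
UPDATE tbl USING TIMESTAMP 1543328307953000 SET C = 'data' WHERE key = 5;
-- A later delete must carry a higher timestamp to win over the update above
DELETE C FROM tbl USING TIMESTAMP 1543328307954000 WHERE key = 5;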

Related

cassandra sync with recovered node

I am trying to build a Cassandra backup and recovery process.
Let's say I have 2 nodes A and B and table C with replication factor 2.
In table C we have a row with ID=5 and Name="Alex".
Now, something bad happened to node B and we need to take it down for a few minutes to make a restore.
At the same time, while node B is down, someone changes the row with ID=5 from Name="Alex" to Name="Alehandro".
Node B comes up again with the restored data, so on this node the row with ID=5 still contains Name="Alex".
What will happen when I try to find the row with ID=5?
Will node A synchronize with node B?
Thanks.
Cassandra has several ways to synchronize data to nodes that missed writes because they were down, there was a garbage collection pause, etc. These include:
Hints - the coordinator node will, for some time (3 hours by default, configurable), collect all write operations that the other node has missed, and when it's back these operations will be replayed against it
Repair - explicit synchronization of data that is triggered manually via nodetool repair, or tools like Reaper can be used to automate it
Read repair - if you're using a consistency level that requires reading from several nodes (TWO, LOCAL_QUORUM, QUORUM, etc.), then the coordinator node will detect discrepancies and return the data with the newest timestamp, fixing, if necessary, the data on the node that has the old data
Answering your last question - when the 2nd node is back, you can get old data if hints haven't been replayed yet, you're reading directly from that node, and you're reading with consistency level ONE or LOCAL_ONE.
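As a rough cqlsh sketch of that last point (keyspace, table and column names are illustrative for the question above):
-- Reading at ONE may hit only the restored node B and return the stale "Alex"
CONSISTENCY ONE;
SELECT name FROM ks.c WHERE id = 5;
-- Reading at QUORUM (2 of 2 replicas here) compares both copies and
-- returns the row with the newest timestamp, i.e. "Alehandro"
CONSISTENCY QUORUM;
SELECT name FROM ks.c WHERE id = 5;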
P.S. I recommend looking through the DSE Architecture Guide - it covers how Cassandra works.

Cassandra: what node will data be written to if the needed node is down?

Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, the Replication Factor (RF) determines how many copies of the data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". Those replica nodes could receive the data in several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one/some of the nodes are unavailable for a configured amount of time (< 3 hours), cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair" or if you're using DSE, ops center can repair/reconcile data for a table, keyspace, or entire cluster (nodesync is also a tool that is new to DSE and similar to repair)
During a read repair - read operations, depending on the configurable client consistency level (described next), can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not consistent.
The configurable client consistency level (CL) will determine how many nodes must acknowledge they have successfully received the data in order for the client to be satisfied to move on (for writes) - or how many nodes to compare with when data is read to ensure accuracy (for reads). The number of nodes available must be equal to or greater than the client CL number specified or the application will error (for example it won't be able to compare a QUORUM level of nodes if a QUORUM number of nodes are not available). This setting does not dictate how many nodes will receive the data. Again, that's the RF keyspace setting. That will always hold true. What we're specifying here is how many must acknowledge each write or compare for each read in order for the client to be happy at that moment. Hopefully that makes sense.
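For reference, RF is defined when the keyspace is created (or altered), while the consistency level is chosen per request by the client. A minimal sketch with illustrative names:
-- RF lives on the keyspace: here every row gets 2 copies
CREATE KEYSPACE my_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
-- CL is a per-request client setting, e.g. in cqlsh:
CONSISTENCY QUORUM;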
Now...
In your scenario with a RF=1, the application will receive an error upon the write as the single node that should receive the data (based off of a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to be the unavailable node). Does that make sense?
If you had a RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm will choose where the cop(ies) go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose a RF=3 (3 copies), then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.

Cassandra doesn't guarantee a particular order in which the statements are executed

Cassandra doesn't guarantee a particular order in which the statements are executed.
Statements like the code below don't execute in order.
INSERT INTO channel
JSON '{"cuid":"NQAA0WAL6drA"
,"owner":"123"
,"status":"open"
,"post_count":0
,"mem_count":1
,"link":"FWsA609l2Og1AADRYODkzNjE2MTIyOTE="
,"create_at":"1543328307953"}';
BEGIN BATCH
UPDATE channel
SET title = ? , description = ? WHERE cuid = ? ;
INSERT INTO channel_subscriber
JSON '{"cuid":"NQAA0WAL6drA"
,"user_id":"123"
,"status":"subscribed"
,"priority":"owner"
,"mute":false
,"setting":{"create_at":"1543328307956"}}';
APPLY BATCH ;
According to system_traces.sessions each of them is received by a different node.
Sometimes the started_at time of both queries is equal (in milliseconds), and sometimes the started_at time of the second query is less than that of the first one.
So this ruins the order of the statements and the data.
We use Erlang with the marina driver, the consistency level is QUORUM, and the clocks of all Cassandra nodes and the application server are in sync.
How can I force Cassandra to execute queries in order?
Because of the distributed nature of Cassandra, queries can be received by different nodes, and depending on the load on a particular node, some queries that were sent later may be executed earlier. In your case you can put the first insert into the batch itself (as sketched below). Or, as implemented in some drivers (for example, the Java driver), use a whitelist policy to send queries to only one node - but that node will become a bottleneck. (And I'm really not sure that your driver has such functionality.)
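A sketch of the first suggestion: fold the initial INSERT into the batch, so all three statements go through one coordinator and, by default, share a single batch timestamp (payloads copied from the question):
BEGIN BATCH
INSERT INTO channel JSON '{"cuid":"NQAA0WAL6drA","owner":"123","status":"open","post_count":0,"mem_count":1,"link":"FWsA609l2Og1AADRYODkzNjE2MTIyOTE=","create_at":"1543328307953"}';
UPDATE channel SET title = ? , description = ? WHERE cuid = ? ;
INSERT INTO channel_subscriber JSON '{"cuid":"NQAA0WAL6drA","user_id":"123","status":"subscribed","priority":"owner","mute":false,"setting":{"create_at":"1543328307956"}}';
APPLY BATCH ;
Note that a batch guarantees the mutations carry one common timestamp (and, for logged batches, that they are eventually all applied), not that they execute in a particular order.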

Can I detect conflicts when writing to Cassandra?

Is there some timestamp/counter that can be used to validate that in a read-modify-write cycle, the data in the row did not change between reading and modifying?
In other words, can I read some kind of ID while reading the row, and when I write it back, tell Cassandra what that ID was, and have the write fail if the ID has changed since then? (Which amounts to saying that some other write took place after I read the data.)
Each column in Cassandra is a tuple (or a triplet) that contains a name, a value and a timestamp. The timestamp of the column represents the last time it was modified. If you have 100's of nodes, whichever node has an update with the most recent timestamp will win. This is how Eventual Consistency is achieved.
zznate has a good presentation: Introduction to Apache Cassandra for Java Developers where this topic is referenced (slide 37)
Accessing timestamp of a Cassandra column
In summary, you don't need "some kind of ID" when you have the ability to retrieve the timestamp for a given column representing the last time it was modified. However, at scale, with 100's of nodes, how can you be sure that the node you are connecting to has the most up-to-date column? (refer back to the zznate presentation)
Point is, you can't, without enabling transactions:
Cassandra - transaction support
Cassandra Transaction with ZooKeeper - Does this work?
how to integrate cassandra with zookeeper to support transactions
And many more: cassandra & transactions
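For what it's worth, later Cassandra versions (2.0+) added lightweight transactions, which provide exactly this compare-and-set behaviour through an IF condition. A sketch with illustrative table and column names:
-- Keep an explicit version column and only apply the write if it is unchanged
UPDATE docs
SET body = 'new content', version = 2
WHERE id = 5
IF version = 1;
-- The result contains [applied] = false (plus the current values)
-- when another write changed the row in the meantime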

How hector/cassandra handles sequential operations?

Using a Hector Mutator I update some row over N sequential operations. Is there a guarantee that the changes happen in the order they were added to the Mutator?
The simplest example: if I delete some row and then immediately recreate it, could it not happen that the deletion takes effect after the insert?
How does the Cassandra cluster manage this if two sequential requests are sent to different nodes? It is always possible that there is a few milliseconds' difference between nodes...
Cassandra resolves conflicts using timestamps supplied by the client. In your example the 'recreate' of the row will have a higher timestamp than the row delete so it doesn't matter if somehow they got to the server in the wrong order.
One consequence of client supplied timestamps is that you either need to sync the clocks on your client machines or design your data model so that different clients don't conflict with each other.
