How to determine the sync status is up to date for particular node in a Cassandra cluster? - cassandra

Suppose I have two node cassandra cluster and they are reside on physically different data-centers. Suppose the database inside that cluster has replication factor is 2 which means every data in that database should be sync with each other. suppose this database is a massive database which have millions of records of its tables. I named those nodes centers as node1 and node2. Suppose node2 is not reliable and there was a crash on that server and take few days to fix and get the server back to up and running state. After that according to my understating there should be a gap between node1 and node2 and it may take significant time to sync node2 with node1. So need a way to measure the gap between node2 and node1 for the mean time of sync happen? After some times how should I assure that node2 is equal to node1? Please correct me if im wrong with this question according to the cassandra architechure.

So let's start with your description. 2 node cluster, which sounds fine, but 2 nodes in 2 different data centers (DCs) - bad design, but doable. Each data center should have multiple nodes to ensure your data is highly available. Anyway, that aside, let's assume you have a 2 node cluster with 1 node in each DC. The replication factor (RF) is defined at the keyspace level (not at the cluster level - each DC will have a RF setting for a particular keyspace (or 0 if not specified for a particular DC)). That being said, you can't have RF=2 for a keyspace for either of your DCs if you only have a single node in each one (RF, which is how many copies of the data that exist, can't be more than the number of nodes in the DC). So let's put that aside for now as well.
You have the possibility for DCs to become out of sync as well as nodes within a DC to become out of sync. There are multiple protections against this problem.
Consistency Level (CL)
This is a lever that you (the client) have to be able to help control how far out of sync things get. There's a trade off between availability v.s. consistency (with performance implications as well). The CL setting is configured at connection time and/or each statement level. For writes, the CL determines how many nodes must IMMEDIATELY ACKNOWLEDGE the write before giving your application the "green light" to move on (a number of nodes that you're comfortable with - knowing the more nodes you immediately require the more consistent your nodes and/or DC(s) will be, but the longer it will take and the less flexibility you have in nodes becoming unavailable without client failure). If you specify less than RF it doesn't mean that RF won't be met, it just means that they don't need to immediately acknowledge the write to move on. For reads, this setting determines how many nodes' data are compared before the result is returned (if cassandra finds a particular row doesn't match from the nodes it's comparing, it will "fix" them during the read before you get your results - this is called read repair). There are a handful of CL options by the client (e.g. ONE, QUORUM, LOCAL_ONE, LOCAL_QUOURM, etc.). Again, there is a trade-off between availability and consistency with the selected choice.
If you want to be sure your data is consistent when your queries run (when you read the data), ensure the write CL + the read CL > RF. You can ensure that's done on a LOCAL level (e.g. the DC that the read/write is occurring on, say, LOCAL_QUORUM) or globally (all DCs with QUORUM). By doing this, you'll be sure that while your cluster may be inconsistent, your results during reads will not be (i.e. the results will be consistent/accurate - which is all that anyone really cares about). With this setting you also allow some flexibility in unavailable nodes (e.g. for a 3 node DC you could have a single node be unavailable without client failure for either reads or writes).
If nodes do become out of sync, you have a few options at this point:
Repair
Repair (run by "nodetool repair") - this is a facility that you can schedule or manually run to reconcile your tables, keyspaces and/or the entire node with other nodes (either in the DC the node resides or the entire cluster). This is a "node level" command and must be run on each node to "fix" things. If you have DSE, Ops Center can run repairs in the background fixing "chunks" of data - cycling the process repetitively.
NodeSync
Similar to repair, this is a DSE specific tool similar to repair that helps keep data in sync (the newer version of repair).
Unavailable nodes:
Hinted Handoff
Cassandra has the ability to "hold onto" changes if nodes become unavailable during writes. It will hang onto changes for a specified period of time. If the unavailable nodes become available before time runs out, the changes are sent over for application. If time runs out, hint collection stops and one of the other options, above, need to be performed to catch things up.
Finally, there is no way to know how inconsistent things are (e.g. 30% inconsistent). You simply try to utilize the tools mentioned above to control consistency without completely sacrificing availability.
Hopefully that makes sense and helps.
-Jim

Related

Cassandra: what node will data be written if the needed node is down?

Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, Replication Factor (RF) determines how many copies of data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". They could receive the data several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one/some of the nodes are unavailable for a configured amount of time (< 3 hours), cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair" or if you're using DSE, ops center can repair/reconcile data for a table, keyspace, or entire cluster (nodesync is also a tool that is new to DSE and similar to repair)
During a read repair - Read operations, depending on the configurable client consistency level (described next) can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not.
The configurable client consistency level (CL) will determine how many nodes must acknowledge they have successfully received the data in order for the client to be satisfied to move on (for writes) - or how many nodes to compare with when data is read to ensure accuracy (for reads). The number of nodes available must be equal to or greater than the client CL number specified or the application will error (for example it won't be able to compare a QUORUM level of nodes if a QUORUM number of nodes are not available). This setting does not dictate how many nodes will receive the data. Again, that's the RF keyspace setting. That will always hold true. What we're specifying here is how many must acknowledge each write or compare for each read in order the client to be happy at that moment. Hopefully that makes sense.
Now...
In your scenario with a RF=1, the application will receive an error upon the write as the single node that should receive the data (based off of a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to be the unavailable node). Does that make sense?
If you had a RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm will chose where the cop(ies) go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose a RF=3 (3 copies) then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.

Why do tables get out of sync over time when Write Consistency ALL is used?

Iam running a cassandra 3.11.4 cluster with 1 data center, 2 racks and 11 nodes. My keyspaces and the tables are set to replication 2. I use the Prometheus-Grafana-Combo to monitor the cluster.
Observation: During (massive) inserts using Write-Consistency Level ALL (i.e. 2 nodes) the affected tables/nodes get slowly out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I use ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: If I dare to use write consistency ANY I get exactly that- and even though all nodes are online Cassandra does not even seem attempt to write to all nodes. In any case (ANY or ALL) if have to perform incremental repairs.
First of all, your expectation is correct: Writes, regardless of what the consistency-level is (ALL or ONE or ANY or whatever), do make every attempt to write to all replicas. The different write-consistency levels only differ on when "success" is reported to the client: ALL waits until all writes were done, while ONE waits for just one (and does the other ones in the background). So unless one of your nodes goes down, or severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save for it the writes it missed, and replay them later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that already went through a repair process) and "unrepaired" data - new data that still did not yes pass through repair. This does not mean that the new data is inconsistent or differs between nodes - it just that nobody checked that yet! To mark this new data "repaired" you'd need to run an (incremental) repair - it will realize the data does not differ between nodes, and mark it as "repaired".

Read fail with consistency one

I've got 3 nodes; 2 in datacenter 1 (node 1 and node 2) and 1 in datacenter 2 (node 3). Replication strategy: Network Topology, dc1:2, dc2: 1.
Initially I keep one of the nodes in dc1 off (node 2) and write 100 000 entries with consistency 2 (via c++ program). After writing, I shut down the node in datacenter 2 (node 3) and turn on node 2.
Now, if I try to read those 100 000 entries I had written (again via c++ program) with consistency set as ONE, I'm not able to read all those 100 000 entries i.e. I'm able to read only some of the entries. As I run the program again and again, my program fetches more and more entries.
I was expecting that since one of the 2 nodes which are up contains all the 100 000 entries, therefore, the read program should fetch all the entries in the first execution when the set consistency is ONE.
Is this related to read repair? I'm thinking that because the read repair is happening in the background, that is why, the node is not able to respond to all the queries? But nowhere could I find anything regarding this behavior.
Let's run through the scenario.
During the write of 100K rows (DC1) Node1 and (DC2) Node3 took all the writes. As it was happening Node1 also might have taken hints for Node2 (DC1) for default 3 hours and then stop doing that.
Once Node2 comes back up online, unless a repair was run - it takes a bit to catch up through replay of hints. If the node was down for more than 3 hours, repair becomes mandatory.
During the reads, it can technically reach to any node in the cluster based on the loadbalancy policy used by driver. Unless specified to do "DCAwareRoundRobinPolicy", the read request might even reach any of the DC (DC1 or DC2 in this case). Since the consistency requested is "ONE", practically any ALIVE node can respond - NODE1 & NODE2 (DC1) in this case. So NODE2 may not even have all data and it can still respond with NULL value and thats why you received empty data sometimes and correct data some other time.
With consistency "ONE" read repair doesn't even happen, as there no other node to compare it with. Here is the documentation on it . Even in case of consistency "local_quorum" or "quorum" there is a read_repair_chance set at the table level which is default to 0.1. Which means only 10% of reads will trigger read_repair. This is to save performance by not triggering every time. Think about it, if read repair can bring the table entirely consistent across nodes, then why does "nodetool repair" even exist?
To avoid this situation, whenever the node comes back up online its best practice to do a "nodetool repair" or run queries with consistency "local_quorum" to get consistent data back.
Also remember, consistency "ONE" is comparable to uncommitted read (dirty read) in the world of RDBMS (WITH UR). So expect to see unexpected data.
Per documentation, consistency level ONE when reads:
Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent. Provides the highest availability of all the levels if you can tolerate a comparatively high probability of stale data being read. The replicas contacted for reads may not always have the most recent write.
Did you check that your code contacted the node that always was online & accepted writes?
The DSE Architecture guide, and especially Database Internals section provides good overview how Cassandra works.

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In some video on the Internet about Consistency level speaker says that CL = ONE is better then CL = ANY because when we use CL = ANY coordinator will be happy to store only hint(and data)(we are assuming here that all the other nodes with corresponding partition key ranges are down) and we can potentially lose our data due to coordinator's failure. But wait a minute.... as I understand it, if we used CL = ONE and for example we had only one(of three) available node for this partition key, we would have only one node with inserted data. Risk of loss is the same.
But I think we should assume equal situations - all nodes for particular token is gone. Then it's better to discard write operation then write with such a big risk of coordinator's loss.
CL=ANY should probably never be used on a production server. Writes will be unavailable until the hint is written to a node owning that partition because you can't read data when its in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have data stored in both a) the commit log and memtable on a node and b) the hints log. These are likely different nodes, but they could be the same 1/3 of the time. So, yes, with CL=ONE and CL=ANY you risk complete loss of data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is the hints will just be stored for 3 hours by default and for longer times than that you have to run repairs. You can repair if you have at least one copy of this data on one node somewhere in the cluster (hints that are stored on coordinator don't count).
Consistency One guarantees that at least one node in the cluster has it in commit log no matter what. Any is in worst case stored in hints of coordinator (other nodes can't access it) and this is stored by default in a time frame of 3 hours. After 3 hours pass by with ANY you are loosing data if other two instances are down.
If you are worried about the risk, then use quorum and 2 nodes will have to guarantee to save the data. It's up to application developer / designer to decide. Quorum will usually have slightly bigger latencies on write than One. But You can always add more nodes etc. should the load dramatically increase.
Also have a look at this nice tool to see what impacts do various consistencies and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, 3 nodes in the cluster will actually get the write. Consistency is just about how long you want to wait for response from them ... If you use One, you will wait until one node has it in commit log. But the coordinator will actually send the writes to all 3. If they don't respond coordinator will save the writes into hints.
Most of the time any in production is a bad idea.

would Cassandra write be successful in the following situation?

I have two datacentres with 3 machines each. the replication factor is DC1:3, DC2:3 and all the inserts are made with write consistency = all.
So all data exists on all nodes (this is done to get reads to be the fastest).
But are there other problems with this set up that I might be missing? (except for writes being slow which im fine with)
For example, if a single node is down would all my writes fail? (Or can cassandra note down the writes for the failed node somewhere and bring it up to speed once its up?)
If a single node were down, then all your writes would fail. The consistency level specifies how many replicas you require for the write to be successful. So if you say ALL, and every node is a replica, then all the nodes would need to be up for it to succeed.
Usually you would do your writes with a lower consistency, like ONE. Cassandra will still write the data to all the nodes if they are up. If some of them are down, then the data may still get written to them (once they are back up) via hinted handoffs, read repair chance, and scheduled repairs.

Resources