How does Cassandra ensure consistency when adding a new node?

I am a little confused about how Cassandra ensures consistency when adding a new node to the cluster. I know Cassandra will do the range movements and stream the data to the newly added node. My question is: does Cassandra also stream the secondary replicas' data to the newly added node?
For example, we have 4 nodes (A, B, C, D) in the cluster with RF=3:
A(x=1, y=2), B(x=1, y=3), C(x=1), D(y=2). Partition key "x" is held by A, B, C, while partition key "y" is held by D, A, B. If I add a new node A' between A and B, I think it will stream partition "x" from A. But does it also stream partition "y" from B or D?
If it does stream partition "y", which node will Cassandra choose to stream it from? According to the official documentation, it streams from the primary replica, which is D. If that's the case, then when D has stale data (which was fine before adding the new node, since A and B both had the latest data, satisfying quorum), it is possible after streaming to read stale data from D and A'. Am I right?

Cassandra will stream information from the node that is giving up ownership of the token range.
That is, in your example with RF=3 and A(x=1, y=2), B(x=1, y=3), C(x=1), D(y=2): if E is added between A and B, A will give up ownership of x to E and B will give up ownership of y. Then A will send its value of x to E and B will send its value of y to E, so the end result will be A(y=2), E(x=1, y=3), B(x=1), C(x=1), D(y=2).
Please note that after adding the node, A still has a stale copy of x and B still has a stale copy of y, and they should run 'nodetool cleanup' to get rid of them.
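A minimal sketch of that cleanup step, assuming you can log in to the nodes that gave up ranges (the keyspace name is illustrative):

    # Run on each node that handed token ranges to the new node (A and B in the example).
    # cleanup rewrites SSTables and drops partitions the node no longer owns.
    nodetool cleanup
    # Optionally limit it to one keyspace to reduce the amount of work:
    nodetool cleanup my_keyspace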

You are probably right. Running nodetool repair is recommended before adding a new node so that there is no inconsistency in the cluster.
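A hedged sketch of that pre-bootstrap repair, run on each existing node before the new node joins (the keyspace name is a placeholder):

    # -pr repairs only the node's primary token ranges, so running it on every node
    # covers the whole ring exactly once.
    nodetool repair -pr my_keyspace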

Related

removenode on the coordinator, and its hint data will be lost

There are four nodes in the cluster; assume they are nodes A, B, C, D, with hinted handoff enabled.
1) Create a keyspace with RF=2, and create a table.
2) Take nodes B and C down (nodetool stopdaemon).
3) Log in to node A with cqlsh, set CONSISTENCY ANY, and insert a row (assume the row belongs on nodes B and C). The row is inserted successfully even though B and C are down, because the consistency level is ANY; the coordinator (node A) wrote hints.
4) Take node A down (nodetool stopdaemon), then remove node A (nodetool removenode ${nodeA_hostId}).
5) Bring nodes B and C back up (restart the Cassandra service).
6) Log in to any of nodes B, C, D and run a SELECT with the partition key of the inserted row. The row inserted in step 3 is not returned.
These steps lead to the loss of the row inserted in step 3.
Is there any problem with the steps I performed above?
If yes, how do I deal with this situation?
Looking forward to your reply, thanks.
CONSISTENCY ANY will result in data loss in many scenarios. It can be as simple as a polar bear ripping a server off the wall as soon as the write is ACKed to the client (before it has even been applied to a single commit log). It is meant for writes that are equivalent to accepting durable_writes=false, where client latency matters more than actually storing the data.
If you want to ensure no data loss, use an RF of at least 3 and write at QUORUM; then any write you get an ack for can be expected to survive a single node failure. RF=2 can work with QUORUM, but that is the equivalent of CL ALL, which means any node failure, GC pause, or hiccup will cause a loss of availability.
It is important to recognize that hints are not about guaranteed delivery, just about possibly reducing the time to convergence when data becomes inconsistent. Repairs within gc_grace_seconds are still necessary to prevent data loss. If you use weak consistency, weak durability and low replication, you open yourself up to data loss.
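For contrast, a minimal cqlsh sketch of the RF=3 plus QUORUM approach described above (the keyspace, table and values are made up for illustration):

    cqlsh> CREATE KEYSPACE IF NOT EXISTS demo
       ...     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};
    cqlsh> CREATE TABLE IF NOT EXISTS demo.kv (k text PRIMARY KEY, v text);
    cqlsh> CONSISTENCY QUORUM
    cqlsh> INSERT INTO demo.kv (k, v) VALUES ('x', '1');

With RF=3, a QUORUM write is only acknowledged once two replicas have it, so losing any single node afterwards cannot lose an acknowledged row.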
This is because removenode does not stream data from the node that is being removed; it just tells the cluster the node is leaving and rebalances the existing cluster, so the hints stored on node A are lost with it.
Please refer to https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRemoveNode.html

How to bring up the new node

This is a follow-up question to High Availability in Cassandra.
1) Let's say we have three nodes N1, N2 and N3, with RF = 3, WC = 3 and RC = 1. That means I cannot handle any node failure in the case of a write.
2) Let's say N3 (imagine it holds the data) went down; now we will not be able to write data with a consistency of '3'.
Question 1: Now if I bring a new node N4 up and attach it to the cluster, I will still not be able to write to the cluster with consistency 3. So how can I make node N4 act as the third node?
Question 2: Let's say we have a 7-node cluster with RF = 3. If any node holding a replica goes down, is there a way to make one of the other existing nodes in the cluster act as a node holding that partition?
Look at the docs:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html
You want to replace a dead node in your scenario. N3 should be removed from the ring and replaced by N4.
It should be easy to follow those instructions step by step. If you installed the node via package management, it is critical to stop it before reconfiguring it and to wipe out all existing data, caches and commit logs (often found under /var/lib/cassandra/*).
It is also possible to remove a dead node from the ring with nodetool removenode, as described here http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRemoveNode.html and here https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRemoveNode.html. This removes the node from your cluster (and you should ensure that it can't come back after that before wiping out its data).
Remember this only removes a dead node from the ring and assigns its token ranges to the remaining nodes; no streaming will happen automatically. You will need to run nodetool repair after removing a dead node.
If you want to remove a live node you can use nodetool decommission instead, but as above, ensure the node does not re-enter the cluster by wiping out its data.
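A rough outline of the replace procedure from the linked docs; the dead node's address and the data paths are placeholders, and the exact way to pass the JVM option depends on your install:

    # On the replacement node N4, before the first start:
    # 1. Use the same Cassandra version, cluster_name, seeds and snitch as the rest of the cluster.
    # 2. Make sure its data, commit log and saved caches directories are empty.
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
    # 3. Start Cassandra with the replace flag pointing at the dead node,
    #    e.g. by adding this JVM option in cassandra-env.sh:
    #    -Dcassandra.replace_address=<ip_of_dead_N3>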
Update:
Nodes in Cassandra are not "named" N1, N2, etc. internally. Each node has a UUID and owns so-called token ranges which it is responsible for.
If a node is down, simply fix it if possible at all and bring it online again so it rejoins your cluster; if it was down for less than the default 3 hours (the hint window), you are fine. Otherwise run nodetool repair.
But if the node is 'lost' completely and will never come back, run nodetool removenode for that dead node. This asks Cassandra to assign the token ranges the dead node was responsible for to the remaining nodes. After that, run nodetool repair so the nodes stream the data that is missing. Your cluster will then have one node less, so it will be six nodes.
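A condensed sketch of that removenode-then-repair sequence (the host ID is a placeholder you would read from the status output):

    # Find the Host ID of the dead node (it shows up as DN in the output).
    nodetool status
    # Ask the cluster to take over the dead node's token ranges.
    nodetool removenode <host-id-of-dead-node>
    # Then repair so the remaining nodes stream the data that is now missing a replica.
    nodetool repair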
Suppose you have a 7-node cluster N1, N2, N3, ..., N7 and some data with RF = 3, write consistency = 2, read consistency = 2. Let's say nodes N1, N2, N3 hold the data. If any one of these nodes goes down, the cluster will be completely fine and read/write operations will not be affected, as long as the consistency level for the operation can still be satisfied.
Now suppose the data has RF = 3, write consistency = 3, read consistency = 3, and nodes N1, N2, N3 hold it. If any of these nodes goes down, operations will fail because the consistency level cannot be satisfied.
You can do two things if any of N1, N2, N3 goes down:
1) You can replace the node. In this case the newly replaced node will act like the old dead node.
2) You can manually add a new node N8 and remove the old dead node N3. In this case Cassandra will redistribute the token ranges around the ring and resize the partitions accordingly.

Will `nodetool repair` also repair against machines holding data they don't own in the ring?

Let's say I have a cluster of three nodes with (for simplicity) a replication factor of 1. Let's call the nodes A, B and C.
According to the ring, the partition key X should be stored on A. However, due to a database recovery, the data for partition key X has ended up on node B (and A doesn't store X at all).
Question: If I issue nodetool repair, will it make sure that partition key X ends up on A?
I understand that the real way of doing the database recovery would be to use something like sstableloader; however, due to unforeseen circumstances, doing the above might be an easier solution for me (if it works!).
You can't use repairs for clusters with replication factor 1. It just doesn't make sense for Cassandra to repair data across nodes if each node exclusively owns its own token range. Using sstableloader would be the cleaner solution in this case.
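If you do go the sstableloader route, a minimal hedged invocation looks roughly like this (the address and paths are placeholders):

    # Run from a machine that has the recovered SSTables on disk; the directory
    # must end in <keyspace>/<table> so sstableloader knows where the data belongs.
    sstableloader -d <any_live_node_address> /path/to/recovered/my_keyspace/my_table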

Cassandra not working when one of the nodes is down

I have a development Cassandra cluster of two nodes (let's call them NodeA and NodeB). I also have a script that is continuously sending data to NodeA. I created the keyspace with the following parameters:
CREATE KEYSPACE test_database WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Now, for some reason NodeB stops after some time. The issue is that as soon as NodeB stops, the script that is sending data to NodeA starts getting data insertion errors.
Can anyone point out a probable reason for this?
Update: Both the nodes are seed nodes.
How Cassandra handles data partitioning
Each key in Cassandra can be converted to a token. When you install your cluster, the nodes calculate which range of tokens they will accept.
Let's take a simple example:
You have two nodes and tokens that go from 0 to 9. A simple partitioning would be: node A stores every token between 0-4 and node B stores every token between 5-9.
How Cassandra handles a write
You choose a coordinator (in your case node A) that receives the data. This node then calculates a token for the key. As seen in the first example, every node has a range of tokens assigned to it. So if the key is converted to token 4, the data goes to node A (here the coordinator); if the token is 8, the data is sent to node B.
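You can inspect this token-to-node mapping directly with nodetool; a small sketch (keyspace, table and key are illustrative):

    # Lists the replica nodes that own the partition for the given key.
    nodetool getendpoints my_keyspace my_table some_key_value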
What is Cassandra's replication factor
The replication factor is how many times your data will be stored in your cluster. For a single datacenter with no racks (your case), the data is first sent to the node that owns the token associated with the key, and the replicas are sent to the next nodes in the ring.
In case of failure of one node, the replicas help that node restore its data.
In your case there are no replicas, so if the node that owns the data is down, Cassandra can't store the data and throws an error. If you had replication factor 2, Cassandra would be able to store a replica on node A and not fail.
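For the keyspace from the question, one hedged way to raise the replication factor so a single node failure no longer rejects writes:

    cqlsh> ALTER KEYSPACE test_database
       ...     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'};

After changing the replication factor, run nodetool repair test_database on each node so the new replicas are actually populated.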
Cassandra's Replication Factor:
Let's say we have 'n' as the replication factor, which means a given piece of data will be stored on and retrieved from 'n' nodes.
If you set the replication factor to '1', only one node will have the data.
Partitioning:
Let's say we have 2 nodes. Whenever you insert data, both of these nodes will hold some of it, based on the partitioning algorithm in use.
For example:
You insert 10 records; based on the hashing and partitioning algorithm, Cassandra chooses which node each record is written to. Of course, the identification of the node is done by the coordinator :)
Durable Writes:
By default, Cassandra always writes to the commit log before applying the write to the memtable, which is later flushed to SSTables on disk. If you set durable_writes to false, writes bypass the commit log, so anything not yet flushed to an SSTable is lost if the node crashes.
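durable_writes is set per keyspace; a minimal sketch (the keyspace name is made up, and disabling it trades durability for write latency):

    cqlsh> CREATE KEYSPACE scratch_data
       ...     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
       ...     AND durable_writes = false;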
For the problem you have mentioned: let's say you are inserting 10 rows.
For simplicity, assume the partitioning/hashing splits them evenly in half.
So Cassandra's coordinator node splits your data into two pieces (10/2 for this simple calculation), puts the first half on the first node and succeeds, then tries to put the second half onto the second node (writing to its commit log); since that node is unavailable, it throws an error.
So how do we fix this issue? Let's say I want to batch-insert multiple insert queries while one node in the cluster is down. It returns:
Connection to Cassandra cluster associated with connection cs1 not available due to Host not available. Host Address: cassandra1
If your table is not a counter table, you can use a consistency level of ANY, which gives high availability for writes.
Refer to this to learn more about it: https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08
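A minimal cqlsh sketch of that suggestion (the table name is a stand-in; note that, as discussed earlier on this page, ANY trades durability for availability):

    cqlsh> CONSISTENCY ANY
    cqlsh> INSERT INTO test_database.my_table (id, value) VALUES (1, 'accepted-via-hint');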

Cassandra - client side load balancing

Consider the following Cassandra setup:
ring of 6 nodes: A, B, D, E, F, G
replication factor: 3
partitioner: RandomPartitioner
placement strategy: SimpleStrategy
My Test-Column is stored on node B and replicated to nodes D and E.
Now I have multiple Java processes reading my Test-Column through the Hector API (Thrift) with read CL.ONE.
There are two possibilities:
1. Hector will forward all calls to node B, because B is the data master.
2. Hector will load-balance read calls across nodes B, D and E (master and replicas). In this case my test column would be loaded into the cache on each Cassandra instance.
Which one is it, 1) or 2)?
Thanks and regards,
Maciej
I believe it is: 3) Cassandra forwards all calls to the closest node that is alive, where "closeness" is determined by the Snitch currently being used (set in cassandra.yaml).
SimpleSnitch chooses the closest node on the token ring.
AbstractNetworkTopologySnitch and derived snitches first try to choose nodes in the same rack, then nodes in the same datacenter.
If DynamicSnitch is enabled, it dynamically adjusts the node closeness returned by the underlying snitch, according to the nodes' recent performance.
See Cassandra ArchitectureInternals under "Read Path" for more information.
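The snitch in question is configured in cassandra.yaml; a hedged excerpt (the values shown are only examples):

    # cassandra.yaml (excerpt)
    endpoint_snitch: SimpleSnitch          # SimpleSnitch judges proximity by position on the token ring
    dynamic_snitch_badness_threshold: 0.1  # how much worse a "closer" replica may score before being bypassed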
(Upvoted Theodore's answer because it is correct.)
Some additional details:
We do nothing on the Hector side to route traffic to a given node based on the key (yet). This was referred to as "client mediated selects" in section 6.2 of the Amazon Dynamo paper. The research seems to indicate that it is really only useful for very large clusters, by cutting out a network hop.
The downside would be the duplication of hashing computation and partitioner lookup on the client.
