How to bring up the new node - cassandra

This is a follow-up question to High Availability in Cassandra.
1) Let's say we have three nodes N1, N2 and N3, with RF = 3, write consistency (WC) = 3 and read consistency (RC) = 1. That means I cannot handle any node failure in the case of a write.
2) Let's say N3 (imagine it holds the data) goes down; from that point on we will not be able to write the data with a consistency of 3.
Question 1: Now, if I bring a new node N4 up and attach it to the cluster, I still will not be able to write to the cluster with consistency 3. So how can I make node N4 act as the third node?
Question 2: Say we have a 7-node cluster with RF = 3. If a node holding a replica goes down, is there a way to make one of the other existing nodes in the cluster act as a node holding that partition?

Look at the docs:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html
You want to replace a dead node in your scenario. N3 should be removed from the ring and replaced by N4.
It should be easy to follow those instructions step by step. If you installed the node via package management, it is critical to stop it before reconfiguring it and to wipe out all existing data, caches and commitlogs from it (often found under /var/lib/cassandra/*).
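A rough sketch of that replace procedure, based on the linked DataStax page (the dead node's IP 10.0.0.3, the service name and the data paths are placeholders you would adapt):

    # on the replacement node N4, before its first start
    sudo service cassandra stop                       # package installs often auto-start the service
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/*     # wipe data, commitlogs and caches

    # tell Cassandra to take over the token ranges of dead N3 (add to cassandra-env.sh)
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.3"

    sudo service cassandra start
    nodetool status                                   # N4 should show as joining (UJ), then UN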
Also, it is possible to remove a dead node from the ring with nodetool removenode as described here http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRemoveNode.html and here https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRemoveNode.html - this removes the node from your cluster (and you should ensure that it can't come back afterwards by wiping out its data).
Remember this only removes a dead node from the ring and assigns its token ranges to the remaining nodes, but no streaming will happen automatically. You will need to run nodetool repair after removing a dead node.
If you want to remove a live node you can use nodetool decommission - but as above, ensure the node does not re-enter the cluster by wiping out its data.
Update:
Nodes in Cassandra are not "named" N1, N2, etc. internally. The nodes have a UUID and they own so-called token ranges which they are responsible for.
If a node is down - simply fix it if at all possible and bring it online again so it rejoins your cluster. If that took less than the default 3 hours (the hint window), you are fine. Otherwise, run nodetool repair.
But if the node is 'lost' completely and will never come back, run nodetool removenode for that dead node. This asks Cassandra to assign the token ranges the dead node was responsible for to the remaining nodes. After that, run nodetool repair so the nodes stream the data that is missing. Your cluster will then have one node less, so it will be six nodes.
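For the removenode path, a minimal sketch (the host ID is a placeholder you would copy from nodetool status):

    nodetool status                                   # note the Host ID of the dead (DN) node
    nodetool removenode <host-id-of-dead-node>        # placeholder; use the UUID from the line above
    nodetool removenode status                        # optionally watch the removal progress

    # removenode only reassigns token ranges; repair afterwards so the missing data gets streamed
    nodetool repair -pr                               # run on every remaining node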

Suppose you have a 7 node cluster: N1, N2, N3, ..., N7. Suppose you have data with RF = 3, write consistency = 2, read consistency = 2. Let's say nodes N1, N2, N3 are holding the data. If any of these nodes goes down, the cluster will be completely fine and data read/write operations will not be affected, as long as the consistency level for read and write operations can still be satisfied.
Now suppose the same data has RF = 3, write consistency = 3, read consistency = 3, and nodes N1, N2, N3 are holding the data. If any of these nodes goes down, the operations will fail because the consistency level cannot be satisfied.
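To make that concrete, here is a rough sketch (the 'demo' keyspace and table are made up, and depending on your cqlsh version you may prefer to run the statements in an interactive cqlsh session): with RF = 3 and consistency THREE, a single dead replica fails the write, while QUORUM still succeeds.

    cqlsh -e "CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
    cqlsh -e "CREATE TABLE demo.users (id int PRIMARY KEY, name text);"

    # with one of the three replicas down:
    cqlsh -e "CONSISTENCY THREE;  INSERT INTO demo.users (id, name) VALUES (5, 'Alex');"   # fails: only 2 replicas alive
    cqlsh -e "CONSISTENCY QUORUM; INSERT INTO demo.users (id, name) VALUES (5, 'Alex');"   # succeeds: 2 of 3 is enough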
Now you can do two things if any of N1, N2, N3 goes down:
1) You can replace the node. In this case the newly replaced node will act like the old dead node.
2) You can manually add a new node N8 and remove the old dead node N3. In this case Cassandra will redistribute the dead node's token ranges around the ring and resize the partitions accordingly.

Related

cassandra sync with recovered node

I am trying to build a Cassandra backup and recovery process.
Let's say I have 2 nodes, A and B, and a table C with replication factor 2.
In table C we have a row with ID=5 and Name="Alex".
Now, something bad happens to node B and we need to take it down for a few minutes to make a restore.
In the meantime, while node B is down, someone changes the row with ID=5 from Name="Alex" to Name="Alehandro".
Node B comes up again with the restored data, so on this node the row with ID=5 still contains Name="Alex".
What will happen when I try to find the row with ID=5?
Will node A synchronize with node B?
Thanks.
Cassandra has several ways to synchronize data to nodes that missed writes because they were down, there was a garbage collection pause, etc. These include:
Hints - the coordinator node will, for some time (3 hours by default, configurable), collect all write operations that the other node has missed, and when that node is back these operations will be replayed against it
Repair - explicit synchronization of data, triggered manually via nodetool repair, or tools like Reaper can be used to automate it
Read repair - if you're using a consistency level that requires reading from several nodes (TWO, LOCAL_QUORUM, QUORUM, etc.), the coordinator node will detect discrepancies and return the data with the newest timestamp, fixing the data on the node that has the old data if necessary
Answering your last question - when the 2nd node is back, you can get old data if the hints haven't been replayed yet, you're reading directly from that node, and you're reading with consistency level ONE or LOCAL_ONE.
P.S. I recommend looking through the DSE Architecture Guide - it covers how Cassandra works.
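A small sketch of how you might check and drive these mechanisms by hand (the config path and the 'demo' keyspace/table names are assumptions):

    # how long hints are kept for a down node (3 hours by default)
    grep max_hint_window_in_ms /etc/cassandra/cassandra.yaml

    # if node B was down longer than the hint window, synchronize it explicitly
    nodetool repair demo                              # 'demo' is a placeholder keyspace name

    # reading at QUORUM (2 of 2 replicas with RF=2) lets read repair fix a stale replica
    cqlsh -e "CONSISTENCY QUORUM; SELECT name FROM demo.users WHERE id = 5;"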

Best way to add multiple nodes to existing cassandra cluster

We have a 12 node cluster with 2 datacenters (each DC has 6 nodes) and RF=3 in each DC.
We are planning to increase cluster capacity by adding 3 nodes in each DC (6 nodes total).
What is the best way to add multiple nodes at once (OK, maybe with a 2 minute gap between them)?
auto_bootstrap:false - Use auto_bootstrap:false (as this is the quicker way to start nodes) on all new nodes, start all of them, and then run 'nodetool rebuild' to get data streamed to the new nodes from the existing nodes.
If I go this way, where do read requests go right after starting the new nodes? At that point they only have token ranges assigned to them but NO data has been streamed to them yet - will this cause read request failures/CL issues/any other issues?
OR
auto_bootstrap:true - Use auto_bootstrap:true and start one node at a time, waiting until the streaming process finishes (this might take a while, I guess, as we have a lot of data, approx 600 GB+ on each node) before starting the next node.
If I go this way, I have to wait until the whole streaming process is done on a node before proceeding to add the next new node.
Kindly suggest the best way to add multiple nodes all at once.
PS: We are using C* 2.0.3.
Thanks in advance.
As each option depends on streaming data over the network, it largely depends on how distributed your cluster is and where your data currently is.
If you have a single-DC cluster and latency is minimal between all nodes, then bringing up a new node with auto_bootstrap: true should be fine for you. Also, if at least one copy of your data has been replicated to your local datacenter (the one you are joining the new node to), then this is also the preferred method.
On the other hand, for multiple DCs I have found more success with setting auto_bootstrap: false and using nodetool rebuild. The reason is that nodetool rebuild allows you to specify a datacenter as the source of the data. This gives you the control to contain streaming within a specific DC (and, more importantly, keep it away from the other DCs). Similarly, if you are building a new datacenter and your data has not yet been fully replicated to it, you will need to use nodetool rebuild to stream the data from a different DC.
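Roughly, that rebuild path looks like this (the datacenter name 'DC1' is a placeholder for your source DC):

    # in cassandra.yaml on each new node, before its first start:
    #   auto_bootstrap: false

    # start the new nodes, then stream data into each of them from an existing DC
    nodetool rebuild -- DC1                           # run on each new node; DC1 is the source datacenter

    # watch the streaming progress
    nodetool netstats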
How would read requests be handled?
In both of these scenarios, the token ranges will be computed for your new node when it joins the cluster, regardless of whether or not the data is actually there. So if a read request is sent to the new node at CL ONE, it should be routed to a node containing a secondary replica (assuming RF>1). If you query at CL QUORUM (with RF=3) it should find the other two. That is, of course, assuming that the nodes picking up the slack are not so overwhelmed by their streaming activity that they cannot also serve requests. This is a big reason why the "2 minute rule" exists.
The bottom line is that your queries do have a higher chance of failing before the new node is fully streamed. Your chances of query success increase with the size of your cluster (more nodes = more scalability, and each bears that much less of the streaming responsibility). Basically, if you are going from 3 nodes to 4 nodes, you might get failures. If you are going from 30 nodes to 31, your app probably won't notice a thing.
Will the new node try to pull data from nodes in the other data centers too?
Only if your query isn't using a LOCAL consistency level.
I'm not sure this was ever answered:
If I go this way, where do read requests go right after starting the new nodes? At that point they only have token ranges assigned to them but NO data has been streamed to them yet - will this cause read request failures/CL issues/any other issues?
And the answer is yes. The new node will join the cluster and receive the token assignments, but since auto_bootstrap: false, the node will not receive any streamed data. Thus, it will be a member of the cluster but will not have any old data. New writes will be received and processed, but existing data from before the node joined will not be available on this node.
With that said, with the correct CL levels, your new node will still do background and foreground read repair, so it shouldn't respond any differently to requests. However, I would not go this route. With 2 DCs, I would divert traffic to DCA, add all of the new nodes with auto_bootstrap: false to DCB, and then rebuild the nodes from DCA. The rebuild needs to use DCA as the source, because the tokens have changed in DCB and, with auto_bootstrap: false, the data may no longer exist where it should. You could also run repair, and that should resolve any discrepancies as well. Lastly, after all of the nodes have been bootstrapped, run cleanup on all of the nodes in DCB, as sketched below.
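A sketch of that sequence for the two-DC case (DCA/DCB follow the naming above; how you divert client traffic depends on your driver or load balancer):

    # 1. on every new DCB node: set auto_bootstrap: false in cassandra.yaml, then start it
    # 2. once all new nodes are up, stream the data in from DCA
    nodetool rebuild -- DCA                           # run on each new DCB node

    # 3. after all rebuilds finish, drop stale ranges on the pre-existing DCB nodes
    nodetool cleanup                                  # run on each old DCB node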

Can cassandra guarantee the replication factor when a node is resyncing its data?

Let's say I have a 3 node cluster.
I am writing to node #1.
If node #2 in that cluster goes down, then comes back up and is resyncing the data from the other nodes, and I continue writing to node #1, will the data be synchronously replicated to node #2? That is, is the replication factor of that write honored synchronously, or is it queued up behind the resync?
Thanks
Steve
Yes, granted that you are reading and writing at a consistency level that can handle 1 node becoming unavailable.
Consider the following scenario:
You have a 3 node cluster with a keyspace 'ks' with a replication factor of 3.
You are writing at a Consistency Level of 'QUORUM'
You are reading at a Consistency level of 'QUORUM'.
Node 2 goes down for 10 minutes.
Reads and writes can successfully continue while the node is down, since 'QUORUM' only requires 2 (3/2+1=2) nodes to be available. While Node 2 is down, both Node 1 and Node 3 store 'hints' for Node 2.
Node 2 comes online. Node 1 and Node 3 send the hints they recorded while Node 2 was down to Node 2.
If a read happens and the coordinating Cassandra node detects that nodes are missing data or are not consistent, it may execute a 'read repair'.
If Node 2 was down for a long time, Node 1 and Node 3 may not retain all hints destined for it. In this case, an operator should consider running repairs on a scheduled basis.
Also note that when doing reads, if Cassandra finds that there is a data mismatch during a digest request, it will always consider the data with the newest timestamp as the right one (see 'Why cassandra doesn't need vector clocks').
Node 2 will immediately start taking new writes and also receive any hints stored for it by the other nodes. It is a good idea to run a repair on the node after it is back up, which will ensure its data is consistent with the other nodes.
Note that each column has a timestamp stored against it, which helps Cassandra determine which data is the most recent when running a repair.
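If you want to see the timestamps Cassandra compares, cqlsh can show them via the WRITETIME function (the keyspace, table and column names here are placeholders):

    # WRITETIME() returns the write timestamp (microseconds) stored with the column value
    cqlsh -e "SELECT name, WRITETIME(name) FROM demo.users WHERE id = 5;"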

Does cassandra delete data that has been duplicated at new node with replication_factor 1

I set the replication_factor to 1 and I have a one-node cluster, N1, hosting all the data (100%, 1G). When I add a new node N2 to the cluster to take half of the data, what I see is N1 (50%, 1G), N2 (50%, 0.5G).
It looks like node N1 is still hosting all the data, even though half of the data has been duplicated to N2. Why would this happen when there is supposed to be only one copy in the cluster (replication_factor=1)?
Did you run nodetool cleanup on your N1 node? Read through the documentation on Nodetool's cleanup command:
Use this command to remove unwanted data after adding a new node to the cluster. Cassandra does not automatically remove data from nodes that lose part of their partition range to a newly added node. Run nodetool cleanup on the source node and on neighboring nodes that shared the same subrange after the new node is up and running. Failure to run this command after adding a node causes Cassandra to include the old data to rebalance the load on that node.
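So in this example, once N2 has finished joining, a quick sketch of what you would run on N1:

    # on N1, the node that gave up half of its token range
    nodetool cleanup                                  # rewrites SSTables, dropping keys N1 no longer owns
    nodetool status                                   # the load reported for N1 should drop to roughly 0.5G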

cassandra rack replication fixed to (dead) nodes [RF and CL confusion]

1 cluster, 3 nodes, in 2 physical locations, grouped in 2 racks
RF 2
PropertyFileSnitch
CL QUORUM
The problem is:
The first node's (in RAC1) replication points to the third node in RAC2 and does not change if that node is down; reads and writes fail.
If I start the third node back up and shut down the second node, reads and writes work.
Both the second and third node replicate to the first node, and if the first node is down, reads and writes also fail.
The question is:
Is it possible to make it automatically detect dead nodes and point replication at the nodes it detects as active?
If the first node is down, the second and third node should replicate data between each other
If the second or third node is down, the first node should detect which one is active and replicate to it
Update1:
I ran some tests:
Shut down the first node - reads from the second and third node fail (Unable to complete request: one or more nodes were unavailable.)
Shut down the second node - reads from the first and third node work
Shut down the third node - reads from the first and second node fail
Very strange ...
Update2:
I think I found the answer. How it is now: 3 nodes, RF 2, writes and reads have CL 2. If one replica is down, reads and writes fail (I tested selecting different keys; some succeeded when one node was down and failed when another was down).
Now I am thinking of doing this: move all nodes to one rack, change RF to 3, and use CL 2 for reads and writes (two replicas will be required for a write to succeed, and the third will be written in the background). There will then be 3 replicas, so if one fails, CL 2 will still succeed.
Am I right?
Will writes succeed if there are 2 nodes active, the replication factor is 3, and the consistency level for the current write operation is 2?
Update3:
Yes, I think I'm on the right track. Same question here
From the screenshot it can be assumed that this is OpsCenter.
In OpsCenter there is a special feature called alerts. It will help you detect a dead node.
Now, coming to the question of a node going down and reads/writes failing: basically these things depend on the read/write consistency level. Go through the consistency levels and you will be able to find the solution on your own.
UPDATE:
You might find this blog interesting.
The only time Cassandra will fail is when too few replicas are alive when the coordinator receives the request. This might be the reason behind your strange situation.
You want to have all three nodes in RAC1 with a replication factor of 3 and use QUORUM for your reads/writes. This will ensure that data is always persisted to two nodes, reads will be consistent, and one node can fail without downtime or data loss. If you don't care about reads always being consistent, i.e. stale data is allowed sometimes, you can make reads more performant by using ONE for reads.
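If you go that route, a minimal sketch (the keyspace name 'ks' and SimpleStrategy are assumptions for the single-rack layout):

    # raise the replication factor to 3
    cqlsh -e "ALTER KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"

    # stream the extra replicas into place
    nodetool repair ks                                # run on every node

    # then read and write at QUORUM (2 of 3), which tolerates one node being down
    cqlsh -e "CONSISTENCY QUORUM; SELECT * FROM ks.some_table LIMIT 1;"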

Resources