How to Manage Node Failure with Cassandra Replication Factor 1?

I have a three node Cassandra (DSE) cluster where I don't care about data loss so I've set my RF to 1. I was wondering how Cassandra would respond to read/write requests if a node goes down (I have CL=ALL in my requests right now).
Ideally, I'd like these requests to succeed if the data exists - just on the remaining available nodes till I replace the dead node. This keyspace is essentially a really huge cache; I can replace any of the data in the event of a loss.

(Disclaimer: I'm a ScyllaDB employee)
Assuming your partition key was unique enough, when using RF=1 each of your 3 nodes contains 1/3 of your data. BTW, in this case CL=ONE and CL=ALL are effectively the same, as there's only 1 replica for your data and no High Availability (HA).
Requests for "existing" data from the 2 up nodes will succeed. Still, when one of the 3 nodes is down, roughly 1/3 of your client requests (for the existing data) will not succeed, as basically 1/3 of your data is not available until the down node comes up (note that nodetool repair is irrelevant when using RF=1), so I guess restoring from a snapshot (if you have one available) is the only option.
While the node is down, once you remove it from the ring (nodetool removenode for a dead node; nodetool decommission only works on a live node), its token ranges will be redistributed between the 2 up nodes, but that will apply only to new writes and reads.
You can read more about the ring architecture here:
http://docs.scylladb.com/architecture/ringarchitecture/
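To make the behaviour above concrete, here is a minimal sketch using the Python cassandra-driver (host addresses and the keyspace/table names are placeholders I've invented): with RF=1 there is exactly one replica per partition, so a read fails for any key whose owning node is down, whether you ask for ONE or ALL.

# Sketch only: hosts, keyspace and table names are made-up placeholders.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
session = cluster.connect()
session.execute("""
CREATE KEYSPACE IF NOT EXISTS big_cache
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# With one replica, ONE and ALL behave identically.
read = SimpleStatement(
    "SELECT value FROM big_cache.entries WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE)
try:
    row = session.execute(read, ["some-key"]).one()
except Unavailable as exc:
    # Roughly 1/3 of keys end up here while one of the 3 nodes is down
    # (the exact exception surfaced can vary by driver version/retry policy).
    print("the single replica for this key is down:", exc)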

Related

Cassandra: what node will data be written to if the needed node is down?

Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, the Replication Factor (RF) determines how many copies of the data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". Those nodes could receive the data in several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one or some of the nodes are unavailable for less than a configured amount of time (< 3 hours by default), Cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair", or if you're using DSE, OpsCenter can repair/reconcile data for a table, keyspace, or entire cluster (NodeSync is also a newer DSE tool similar to repair)
During a read repair - read operations, depending on the configurable client consistency level (described next), can compare data from multiple nodes to ensure accuracy/consistency, and fix things if the replicas don't match.
The configurable client consistency level (CL) determines how many nodes must acknowledge they have successfully received the data in order for the client to be satisfied and move on (for writes) - or how many nodes to compare when data is read to ensure accuracy (for reads). The number of nodes available must be equal to or greater than the client CL number specified, or the application will error (for example, it won't be able to compare a QUORUM level of nodes if a QUORUM number of nodes is not available). This setting does not dictate how many nodes will receive the data. Again, that's the RF keyspace setting, and that will always hold true. What we're specifying here is how many nodes must acknowledge each write, or be compared for each read, in order for the client to be happy at that moment. Hopefully that makes sense.
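To illustrate (a sketch with the Python driver, not the poster's code; the table, column and host names are assumptions): the RF lives on the keyspace, while the CL is chosen by the client per request or per session.

# Sketch: RF is a keyspace property; CL is set by the client for each operation.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['10.0.0.1']).connect('my_keyspace')  # placeholder names

write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)  # how many replicas must acknowledge
session.execute(write, (42, 'alice'))

read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)     # how many replicas to consult
print(session.execute(read, (42,)).one())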
Now...
In your scenario with RF=1, the application will receive an error upon the write, as the single node that should receive the data (based on a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to live on the unavailable node). Does that make sense?
If you had RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm chooses where the copy or copies go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose RF=3 (3 copies), then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.
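For example (the keyspace, table and key value here are placeholders): nodetool getendpoints my_keyspace my_table 42 prints the IP addresses of the nodes that hold, or would hold, the row whose partition key is 42.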

Possible to take half of Cassandra nodes down without affecting the application?

If there is a 4 node Cassandra cluster, is it possible to configure Cassandra in a way to have half of the nodes down (two in this case) without affecting the applications?
Also how long can nodes be down without Cassandra cancelling the write queue?
This depends on the client CL and DC replication factor.
Let's assume the RF is 4 (all nodes). If the client uses CL=ONE or LOCAL_ONE, the application would not notice any issues. Any other client CL would have problems (e.g. LOCAL_QUORUM of 4 is 3, allowing only a single node to be down).
Let's assume the RF=1 or 2. If CL=ONE or LOCAL_ONE, the application would be unaffected by queries that only manipulate data on the available nodes. However, any access to rows that only exist on the unavailable nodes would be impacted. In other words, CL=ONE or LOCAL_ONE only works if you're manipulating data that has at least one node available to return the response (You only need ONE to respond in this scenario). If the rows you're querying are on both of the unavailable nodes, you'll get an error stating something like: Expected response of 1, received 0.
Many applications configure CL to be some sort of quorum (local or not) - so in that case, the application would certainly fail unless you had RF=5 (so at least 5 nodes). Quorum of 5 is 3, allowing for 2 nodes to fail.
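The quorum arithmetic used above is simply floor(RF/2) + 1; a quick sketch:

# Quorum size for a given replication factor.
def quorum(rf: int) -> int:
    return rf // 2 + 1

print(quorum(4))  # 3 -> with RF=4, only one replica may be down for a QUORUM request
print(quorum(5))  # 3 -> with RF=5, two replicas may be down and QUORUM still succeeds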
Hopefully that makes sense.
Yes, assuming you are talking about all four nodes being in one data centre: set your replication factor to 3 or greater and your read and write consistency levels to ONE.
For writes, the nodes that are up will store hints for the nodes that are down, so when they come back up the data can be replayed to them. How long the nodes store these hints is controlled by the max_hint_window_in_ms setting in cassandra.yaml (3 hours by default).

Cassandra cluster works with 2 nodes?

I have 2 nodes with replication factor = 1, which I take to mean each node will have a copy of the data.
Based on the above description, when I use Murmur3Partitioner:
Will data be shared among the nodes? E.g. 50% of the data on node 1 and 50% on node 2?
When I send a read request to node 1, will it internally connect to node 2 for consistency?
My intention is to have a replica on each node so that both nodes can serve requests independently, without inter-node communication.
First of all, please try to ask only one question per post.
I have 2 nodes with replication factor = 1, which I take to mean each node will have a copy of the data.
Incorrect. A RF=1 indicates that your entire cluster will have 1 copy of the data.
Will data be shared among the nodes? E.g. 50% of the data on node 1 and 50% on node 2?
That is what it will try to do. Do note that it probably won't be exact. It'll probably be something like 49/51-ish.
When I send a read request to node 1, will it internally connect to node 2 for consistency?
With a RF=1, no it will not. Based on the hashed token value of your partition key, it will be directed only to the node which contains the data.
As an example, with a RF=2 with 2 nodes, it would depend on the consistency level set for your operation. Reading at ONE will always read only one replica. Reading at QUORUM will always read from 2 replicas with 2 nodes (after all, QUORUM of 2 equals 2). Reading at ALL will require a response from all replicas, and initiate a read repair if they do not agree.
Important to note: you cannot force your driver to connect to a specific Cassandra node. You may provide one endpoint, but it will discover the other node automatically and use it as it needs to.
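If the goal, as in the question, is for either node to serve requests on its own, the usual approach is RF=2 plus CL=ONE. A sketch (keyspace and table names are assumptions, and note that after raising RF you still need a repair to copy existing rows):

# Sketch: RF=2 on a 2-node cluster puts a full copy on each node; reading at
# ONE lets either node answer without contacting the other.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['10.0.0.1', '10.0.0.2']).connect()
session.execute("""
ALTER KEYSPACE my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
# Run 'nodetool repair' afterwards so pre-existing rows reach their new replica.

read = SimpleStatement(
    "SELECT * FROM my_keyspace.my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(read, (1,)).one())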

Cassandra not working when one of the nodes is down

I have a development cassandra cluster of two cassandra nodes [Let's call them NodeA and NodeB]. I also have a script that is continuously sending data on NodeA. I have created the database with the following parameters:
CREATE KEYSPACE test_database WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Now, for some reason NodeB is stopping after some time. But the issue is, as soon as NodeB stops, the script that is sending data to NodeA starts giving data insertion errors.
Can anyone point out a probable reason for this?
Update: Both the nodes are seed nodes.
How Cassandra distributes data
Each key in Cassandra can be converted to a token. When you install your cluster, the nodes calculate what range of tokens they will accept.
Let's take a simple example:
You have two nodes, and a token that goes from 0 to 9. A simple distribution would be: node A stores every token between 0-4 and node B stores every token between 5-9.
How Cassandra handles writes
You choose a coordinator (in your case node A) that receives the data. This node then calculates a token for the key. As seen in the first example, every node has a range of tokens assigned to it. So imagine the key is converted to token 4: the data goes to node A (here the coordinator). If the token is 8, the data will be sent to node B.
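A toy sketch of that 0-9 example in Python (not real Cassandra tokens, which are 64-bit Murmur3 hashes, just the idea):

# Toy model of the token ranges described above.
RANGES = {'node A': range(0, 5),   # tokens 0-4
          'node B': range(5, 10)}  # tokens 5-9

def owner(token: int) -> str:
    for node, token_range in RANGES.items():
        if token in token_range:
            return node
    raise ValueError("token out of range")

print(owner(4))  # node A: the coordinator keeps this write itself
print(owner(8))  # node B: the coordinator forwards this write to node B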
What is Cassandra's replication factor
The replication factor is how many times your data will be stored on your cluster. For a single data centre with no racks (your case), the data is first sent to the node that owns the token associated with the key, and the replicas are sent to the next node(s) in the ring.
In case of failure of one node, the replicas will help the node to restore its data.
In your case, there are no replicas, and if a node is down, Cassandra can't store the data and throws an error. If you have replication factor 2, Cassandra should be able to store a replica on node A and not fail.
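Continuing the toy model, SimpleStrategy-style placement just walks the ring: the primary owner plus the next RF-1 nodes get a copy (a sketch, not Cassandra's actual implementation):

# Toy replica placement: primary owner plus the next RF-1 nodes in ring order.
RING = ['node A', 'node B']  # ring order of the two nodes

def replicas(primary_index: int, rf: int):
    return [RING[(primary_index + i) % len(RING)] for i in range(rf)]

print(replicas(0, 1))  # ['node A']           -> RF=1: one copy, no fallback
print(replicas(0, 2))  # ['node A', 'node B'] -> RF=2: node B holds a replica too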
Cassandra's Replication Factor:
Let's say we have 'n' as the replication factor, which means the given input data will be stored on / retrieved from 'n' nodes.
If you set the replication factor to '1', only one node will have the data.
Partitioning:
Let's say we have 2 nodes. Whenever you insert data, both of these nodes will hold some of it, based on the partitioning algorithm mentioned.
For example:
You insert 10 records; based on the hashing and partitioning algorithm, Cassandra chooses which node each record needs to be written to. Of course, the identification of the node is done by the coordinator :)
Durable Writes:
By default, Cassandra always writes to the commit log before flushing to disk. If you set durable_writes to false, it will bypass the commit log and write straight to the memtable (data reaches disk, the SSTable, only on flush).
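For example, using the keyspace from the question, you could turn it off with: ALTER KEYSPACE test_database WITH durable_writes = false; (not recommended unless you can afford to lose the most recent writes on a crash).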
Regarding the problem you mentioned: for example, let's say you are inserting 10 rows.
For simplicity, assume the partitioning/hashing works out to n/2 per node.
So Cassandra's coordinator node splits your data into two halves (for this simple calculation, 10/2), puts the 1st half onto the 1st node successfully, and tries to put the 2nd half onto the second node (writing to its commit log); since that node is unavailable, it throws an error.
So how do we fix this issue? Let's say I want to batch-insert multiple insert queries when 1 node in the cluster is down? It returns:
Connection to Cassandra cluster associated with connection cs1 not available due to Host not available. Host Address: cassandra1
If your table is not a counter table, you can use a consistency level of ANY, which gives high availability for writes.
Refer to this to learn more about it: https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08
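A sketch of a write at ANY with the Python driver (the table and column names are placeholders): at ANY the coordinator may simply store a hint when no replica is reachable, so the write is accepted but won't be readable until a replica actually receives the data.

# Sketch: ConsistencyLevel.ANY lets a write succeed even if no replica is up,
# as long as a coordinator can store a hint. Not allowed for counter tables.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['cassandra1']).connect('test_database')

write = SimpleStatement(
    "INSERT INTO my_table (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY)
session.execute(write, (1, 'hello'))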

Failover and Replication in 2-node Cassandra cluster

I run KairosDB on a 2-node Cassandra cluster, RF = 2, Write CL = 1, Read CL = 1. If 2 nodes are alive, client sends half of data to node 1 (e.g. metric from METRIC_1 to METRIC_5000) and the other half of data to node 2 (e.g. metric from METRIC_5001 to METRIC_10000). Ideally, each node always has a copy of all data. But if one node is dead, client sends all data to the alive node.
Client started sending data to the cluster. After 30 minutes, I turned node 2 off for 10 minutes. During this 10-minute period, client sent all data to node 1 properly. After that, I restarted node 2 and client continued sending data to 2 nodes properly. One hour later I stopped the client.
I wanted to check whether the data that was sent to node 1 while node 2 was dead had been automatically replicated to node 2 or not. To do this, I turned node 1 off and queried node 2 for the data written during the time node 2 was dead, but it returned nothing. This made me think that the data had not been replicated from node 1 to node 2. I posted a question, "Doesn't Cassandra perform 'late' replication when a node goes down and comes up again?". It seems that the data was replicated automatically, but very slowly.
What I expect is data in both 2 servers are the same (for redundancy purpose). That means the data sent to the system when node 2 is dead must be replicated from node 1 to node 2 automatically after node 2 becomes available (because RF = 2).
I have several questions here:
1) Is the replication truly slow? Or did I configure something wrong?
2) If client sends half of data to each node as in this question I think it's possible to lose data (e.g. node 1 receives data from client, while node 1 is replicating data to node 2 it suddenly goes down). Am I right?
3) If I am right in 2), I am going to do this: the client sends all data to both nodes. This would solve 2) and also take advantage of replication if one node is dead and becomes available later. But I am wondering whether this would cause duplication of data, because both nodes receive the same data. Is there any problem here?
Thank you!
Can you check the value of hinted_handoff_enabled in cassandra.yaml config file?
For your question: yes, you may lose data in some cases until the replication is fully achieved. Cassandra is not exactly doing late replication - there are three mechanisms:
Hinted handoffs http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesHintedHandoff.html
Repairs - http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html
Read Repairs - those may not help much on your use case - http://wiki.apache.org/cassandra/ReadRepair
AFAIK, if you are running a version greater than 0.8, the hinted handoffs should duplicate the data after the node restarts without the need for a repair, unless the data is too old (this should not be the case for 10 minutes). I don't know why those handoffs were not sent to your replica node when it was restarted; it deserves some investigation.
Otherwise, when you restart the node, you can force Cassandra to make sure that data is consistent by running a repair (e.g. by running nodetool repair).
By your description I have the feeling you are getting confused between the coordinator node and the node that is getting the data (even if the two nodes hold the data, the distinction is important).
BTW, what is the client behaviour you are describing, with metrics sharded between node 1 and node 2? Neither KairosDB nor Cassandra works like that; is it your own client that is sending metrics to different KairosDB instances?
The Cassandra partitioning is not done on the metric name but on the row key (the partition key, to be exact, but it's the same with KairosDB). So each 3-week block of data for each unique series is assigned a token based on its hash, and this token is used for sharding/replication on the cluster.
KairosDB is able to communicate with several nodes and will round-robin among them as coordinator nodes.
I hope this helps.
