There are four nodes in the cluster; assume they are nodes A, B, C, and D, with hinted handoff enabled.
1) Create a keyspace with RF=2 and create a table.
2) Take nodes B and C down (nodetool stopdaemon).
3) Log in to node A with cqlsh, set CONSISTENCY ANY, and insert a row (assume the row maps to nodes B and C). The row is inserted successfully even though nodes B and C are down, because the consistency level is ANY; the coordinator (node A) wrote hints.
4) Take node A down (nodetool stopdaemon), then remove node A (nodetool removenode ${nodeA_hostId}).
5) Bring nodes B and C back up (restart Cassandra on them).
6) Log in to any of nodes B, C, or D with cqlsh and run a SELECT with the partition key of the inserted row. No data comes back: the row inserted in step 3 is gone.
These steps lead to the loss of the row inserted in step 3.
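A minimal sketch of the scenario, assuming a hypothetical keyspace ks and table t:

    -- on node A, in cqlsh
    CREATE KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
    CREATE TABLE ks.t (id int PRIMARY KEY, val text);

    # on nodes B and C, from the shell
    nodetool stopdaemon

    -- back on node A, in cqlsh
    CONSISTENCY ANY;
    INSERT INTO ks.t (id, val) VALUES (1, 'x');  -- succeeds; coordinator A only stores hints

    # node A goes down; then, from a surviving node
    nodetool removenode ${nodeA_hostId}

    -- after B and C are restarted, from any node
    SELECT * FROM ks.t WHERE id = 1;             -- returns nothing: the hints died with A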
Is there any problem with the steps I performed above?
If yes, how should I deal with this situation?
Looking forward to your reply, thanks.
CONSISTENCY ANY will result in data loss in many scenarios. It can be as simple as a polar bear ripping a server off the wall as soon as the write is ACKed to the client (before it has even been applied to a single commit log). ANY is for writes that are equivalent to accepting durable_writes=false, where client latency matters more than actually storing the data.
If you want to ensure no data loss, use an RF of at least 3 and write at QUORUM; then any write you get an ack for can be trusted to survive a single node failure. RF=2 can work with QUORUM, but that is the equivalent of CL ALL, which means any node failure, GC pause, or hiccup will cause loss of availability.
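For example, a safer setup might look like this (keyspace and table names are placeholders, and SimpleStrategy is just for illustration; production clusters typically use NetworkTopologyStrategy):

    CREATE KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
    -- in cqlsh:
    CONSISTENCY QUORUM;
    INSERT INTO ks.t (id, val) VALUES (1, 'x');
    -- an ack now means the row is durable on at least 2 of 3 replicas,
    -- so it survives the loss of any single node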
It is important to recognize that hints are not about guaranteed delivery, just about possibly reducing the time to convergence when data becomes inconsistent. Repairs within gc_grace_seconds are still necessary to prevent data loss. If you're using weak consistency, weak durability, and low replication, you open yourself up to data loss.
This happens because removenode does not stream data from the node being removed; it just tells the cluster that the node is leaving and rebalances the existing replicas across the remaining nodes.
Please refer to https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRemoveNode.html
Related
In Cassandra, tombstones are used for deletion, since writes go to immutable files. I read that tombstones also solve the tough problem of deleting in distributed systems. This is where I am confused: what problems exist in deleting from distributed databases?
For example, take a 3-node cluster with nodes A, B, and C. Say node C is down when a delete comes in. It is marked with a tombstone on A and B, and success is returned to the client. After some time, compaction kicks in on A and B and clears out this tombstone. Now when a read comes for the previously deleted value, A and B return nothing while C returns the old value. But here I read that the value given by C takes precedence over the empty responses:
If the tombstoned record has already been deleted from the rest of the cluster before that node recovers, Cassandra treats the record on the recovered node as new data, and propagates it to the rest of the cluster.
Why does it do this? Since a quorum of nodes says the value is not present, why don't we return that to the client? This could potentially simplify the problem of deletes in distributed systems, since we wouldn't need to wait gc_grace_seconds before clearing out the tombstones.
A quorum returning nothing could also mean that the other nodes simply never received the value because they were down; in that case the single node that has the data is correct, and the value will be propagated to the nodes that don't have it. Cassandra simply can't tell whether the data is missing because it was deleted via a tombstone, or missing because the nodes weren't available at the time of the write.
That's why it's important to run repairs regularly, to make sure they happen within gc_grace_seconds, and never to put a machine back into the cluster after it has been offline for longer than that period.
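A sketch of the relevant knobs (keyspace and table names are hypothetical):

    -- gc_grace_seconds defaults to 864000 (10 days); tombstones may be purged after this
    ALTER TABLE ks.t WITH gc_grace_seconds = 864000;

    # run on each node, completing well within every gc_grace_seconds window
    nodetool repair ks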
I am trying to build a Cassandra backup and recovery process.
Let's say I have 2 nodes, A and B, and table C with replication factor 2.
In table C we have a row with ID=5 and Name="Alex".
Now something bad happens to node B and we need to take it down for a few minutes to perform a restore.
In the meantime, while node B is down, someone changes the row with ID=5 from Name="Alex" to Name="Alehandro".
Node B comes up again with the restored data, so on that node the row with ID=5 still contains Name="Alex".
What will happen when I try to find the row with ID=5?
Will node A synchronize with node B?
Thanks.
Cassandra has several ways to synchronize data to nodes that missed writes because they were down, because of a garbage collection pause, etc. These include:
Hints - the coordinator node will, for some time (3 hours by default, configurable), collect all write operations that another node has missed, and when that node is back these operations will be replayed against it
Repair - explicit synchronization of data, triggered manually via nodetool repair; tools like Reaper can be used to automate it
Read repair - if you're using a consistency level that requires reading from several nodes (TWO, LOCAL_QUORUM, QUORUM, etc.), the coordinator node will detect discrepancies and return the data with the newest timestamp, fixing the data on the node that had the old version if necessary
Answering your last question: when the 2nd node is back, you can get old data if hints haven't been replayed yet, you're reading directly from that node, and you're reading with consistency level ONE or LOCAL_ONE.
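For instance, a QUORUM read in your 2-node, RF=2 setup would contact both replicas and resolve the conflict by timestamp (a sketch, assuming a hypothetical table ks.c with columns id and name):

    CONSISTENCY QUORUM;
    SELECT name FROM ks.c WHERE id = 5;
    -- the coordinator compares the responses from A and B, returns the newer
    -- 'Alehandro', and read repair writes that value back to node B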
P.S. I recommend looking through the DSE Architecture Guide - it covers how Cassandra works.
Hi, I was trying out different configurations using the site
https://www.ecyrd.com/cassandracalculator/
But I could not understand the following result shown for this configuration:
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference, I have seen the question Cassandra loss of a node,
but it still does not help me understand why write level 1 with replication factor 2 would make my Cassandra cluster unable to survive the loss of any node without data loss.
A write request goes to all replica nodes, and if even 1 responds it is a success; so assuming 1 node is down, all write requests will go to the other replica node and return success, and the cluster will be eventually consistent.
Can someone help me understand with an example?
I guess what the calculator is working with is the worst-case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is that there is no guarantee the data is actually present on two nodes right after your write is acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as it is committed to one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to reduce the latency perceived by the client). If in that moment, right after acknowledging the write but before contacting the second node, the coordinator goes down and cannot be brought back, then you have lost that write and the data with it.
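In cqlsh terms (keyspace and table names are hypothetical):

    CONSISTENCY ONE;
    INSERT INTO ks.t (id, val) VALUES (1, 'x');
    -- the ack can come back as soon as ONE replica (possibly the coordinator
    -- itself) has committed the write; if that node dies before the write
    -- reaches the second replica, the write is lost for good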
When you read or write data, Cassandra computes the hash token for the partition key and routes the request to the responsible nodes. When you have a 3-node cluster with replication factor 2, your data is stored on 2 nodes. So when the 2 nodes responsible for a token A are both down, and that token is not hosted on node 3, then even though you still have one node up you will get a TokenRangeOfflineException.
The point is that we need the replicas (tokens), not just the nodes. Also see the similar question answered here.
This is the case because the write level is 1: if your application writes to only 1 node (and waits for the data to become consistent/synced eventually, which takes non-zero time), then the data can be lost if that one server is itself lost before the sync happens.
Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL=ONE is better than CL=ANY, because with CL=ANY the coordinator will be happy to store only a hint (and the data) (we assume here that all the other nodes owning the corresponding partition key ranges are down), and we can potentially lose our data due to the coordinator's failure. But wait a minute... as I understand it, if we used CL=ONE and, for example, had only one (of three) available nodes for this partition key, we would also have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations - all nodes for a particular token being gone. In that case it's better to reject the write than to accept it with such a big risk of losing the coordinator.
CL=ANY should probably never be used on a production server. Writes will be unreadable until the hint is delivered to a node owning that partition, because you can't read data while it exists only in a hints log.
Using CL=ONE with RF=3 and two nodes down, the data would be stored in both a) the commit log and memtable on one node and b) the hints log. These are likely different nodes, but they could be the same about 1/3 of the time. So yes, with CL=ONE and CL=ANY you risk complete loss of the data on a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is, hints are only stored for 3 hours by default; for outages longer than that you have to run repairs. You can repair as long as at least one copy of the data exists on some node in the cluster (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one replica in the cluster has the write in its commit log no matter what. With ANY, the worst case is that the write is stored only in the coordinator's hints (other nodes can't access it), and hints are kept by default for a window of 3 hours. After 3 hours pass, with ANY you are losing the data if the other two instances are still down.
If you are worried about the risk, use QUORUM, and 2 nodes will have to guarantee to store the data. It's up to the application developer/designer to decide. QUORUM will usually have slightly higher write latency than ONE, but you can always add more nodes, etc., should the load increase dramatically.
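A quick sketch of what QUORUM means at RF=3 (table name is hypothetical):

    CONSISTENCY QUORUM;  -- quorum = floor(RF/2) + 1 = floor(3/2) + 1 = 2
    INSERT INTO ks.t (id, val) VALUES (1, 'x');
    -- the ack is returned only after 2 of the 3 replicas have durably
    -- applied the write, so it survives any single node failure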
Also have a look at this nice tool to see what impact various consistency levels and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF=3, all 3 replica nodes in the cluster will actually receive the write. Consistency is just about how many of them you wait for a response from... If you use ONE, you wait until one node has it in its commit log, but the coordinator still sends the write to all 3. If some of them don't respond, the coordinator saves those writes as hints.
Most of the time, ANY in production is a bad idea.
I want to clarify some very basic concepts of replication factor and consistency level in Cassandra. I would highly appreciate it if someone could answer the questions below.
RF - Replication Factor
RC - Read Consistency
WC - Write Consistency
2 Cassandra nodes (e.g. A, B), RF=1, RC=ONE, WC=ONE or ANY
Can I write data to node A and read it from node B?
What will happen if A goes down?
3 Cassandra nodes (e.g. A, B, C), RF=2, RC=QUORUM, WC=QUORUM
Can I write data to node A and read it from node C?
What will happen if node A goes down?
3 Cassandra nodes (e.g. A, B, C), RF=3, RC=QUORUM, WC=QUORUM
Can I write data to node A and read it from node C?
What will happen if node A goes down?
Short summary: the replication factor describes how many copies of your data exist, while the consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for it are stashed away and delivered when it comes back up, unless it is down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client receives a success acknowledgement after the write is done on one node, without waiting for the second write. With a CL of ALL, the acknowledgement to the client waits until both copies are written. There are very many other consistency level options, too many to cover all the variants here; read the DataStax docs, though, they do a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response is sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up to date. In many contexts that is quite sufficient; in others the client needs the most up-to-date information, and you'll use a different consistency level on the read - perhaps ALL. In this way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
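A sketch of how the read side is tuned in cqlsh (hypothetical table ks.t, RF=2 as in the example above):

    CONSISTENCY ONE;
    SELECT val FROM ks.t WHERE id = 1;  -- fast, but may return a stale copy
    CONSISTENCY ALL;
    SELECT val FROM ks.t WHERE id = 1;  -- waits for both replicas, returns the newest value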
Now getting back to your examples.
Example one: yes, you can write to A and read from B, even if B doesn't hold its own replica; B will ask A for it on your client's behalf. This is also true for your other cases where all the nodes are up: you can write to one and read from another.
For writes with WC=ONE: if the node holding the single replica is up and is the one you're connected to, the write will succeed. If the replica belongs to the other node, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up (I think you also need hinted handoff enabled for that). The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, the replication factor affects how many copies are eventually written, but doesn't affect client behavior beyond what I've described above. QUORUM affects client behavior in that a sufficient number of nodes must be up and responding for writes and reads: at least (replicas/2) + 1 of the nodes you need. If you don't have enough nodes with replicas up, reads and writes will fail. Overall, some QUORUM reads and writes can still succeed when a node is down, as long as that node either is not needed to store your replica or its outage still leaves enough replica nodes available.
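Applying that arithmetic to your scenarios (a sketch; quorum = floor(RF/2) + 1):

    -- RF=2: quorum = 2, so both replicas must respond; if node A holds a
    --       replica and is down, QUORUM reads/writes to that partition fail
    -- RF=3: quorum = 2, so QUORUM reads/writes still succeed with any
    --       single replica node (e.g. node A) down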
Check out this simple calculator, which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example, with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.