I've got 3 nodes; 2 in datacenter 1 (node 1 and node 2) and 1 in datacenter 2 (node 3). Replication strategy: Network Topology, dc1:2, dc2: 1.
Initially I keep one of the nodes in dc1 off (node 2) and write 100 000 entries with consistency 2 (via c++ program). After writing, I shut down the node in datacenter 2 (node 3) and turn on node 2.
Now, if I try to read those 100 000 entries I had written (again via c++ program) with consistency set as ONE, I'm not able to read all those 100 000 entries i.e. I'm able to read only some of the entries. As I run the program again and again, my program fetches more and more entries.
I was expecting that since one of the 2 nodes which are up contains all the 100 000 entries, therefore, the read program should fetch all the entries in the first execution when the set consistency is ONE.
Is this related to read repair? I'm thinking that because the read repair is happening in the background, that is why, the node is not able to respond to all the queries? But nowhere could I find anything regarding this behavior.
Let's run through the scenario.
During the write of 100K rows, Node1 (DC1) and Node3 (DC2) took all the writes. While that was happening, Node1 may also have stored hints for Node2 (DC1), but only for the default window of 3 hours, after which it stops collecting them.
Once Node2 comes back up online, unless a repair was run - it takes a bit to catch up through replay of hints. If the node was down for more than 3 hours, repair becomes mandatory.
During the reads, the request can technically reach any node in the cluster, depending on the load balancing policy used by the driver. Unless you specify "DCAwareRoundRobinPolicy", the read request might reach either DC (DC1 or DC2 in this case). Since the consistency requested is "ONE", practically any ALIVE node can respond - NODE1 & NODE2 (DC1) in this case. So NODE2 may not have all the data yet and can still respond with a NULL value, and that's why you received empty data sometimes and correct data at other times.
With consistency "ONE" read repair doesn't even happen, as there is no other node to compare it with. Here is the documentation on it. Even in the case of consistency "local_quorum" or "quorum" there is a read_repair_chance set at the table level, which defaults to 0.1. That means only 10% of reads will trigger read repair. This is to save performance by not triggering it every time. Think about it: if read repair could bring the table entirely consistent across nodes, then why would "nodetool repair" even exist?
To avoid this situation, whenever a node comes back online it's best practice to run "nodetool repair", or to run queries with consistency "local_quorum" to get consistent data back.
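The effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the two dictionaries are hypothetical stand-ins for the live replicas in DC1, the round-robin choice stands in for the driver's load balancing, and real Cassandra reconciles replica responses by timestamp rather than by simple presence.

```python
from itertools import cycle

# Hypothetical in-memory stand-ins for the two live replicas in dc1.
# node1 took every write; node2 was down and has only caught up on part of them.
node1 = {key: f"value-{key}" for key in range(10)}
node2 = {key: f"value-{key}" for key in range(4)}
live_replicas = [node1, node2]


def read_one(keys, replicas):
    """CL ONE: each read is answered by a single replica (round-robin here)."""
    chooser = cycle(replicas)
    found = {}
    for key in keys:
        replica = next(chooser)
        if key in replica:
            found[key] = replica[key]
    return found


def read_both(keys, replicas):
    """LOCAL_QUORUM with RF=2 in the DC: both replicas must answer,
    so a row present on either replica is returned."""
    found = {}
    for key in keys:
        for replica in replicas:
            if key in replica:
                found[key] = replica[key]
                break
    return found


keys = range(10)
rows_at_one = read_one(keys, live_replicas)
rows_at_local_quorum = read_both(keys, live_replicas)
print(len(rows_at_one), len(rows_at_local_quorum))  # prints: 7 10
```

With CL ONE, every read that happens to land on the lagging replica comes back empty, so the count is short; requiring both replicas to answer returns all rows.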
Also remember, consistency "ONE" is comparable to uncommitted read (dirty read) in the world of RDBMS (WITH UR). So expect to see unexpected data.
Per the documentation, consistency level ONE for reads:
Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent. Provides the highest availability of all the levels if you can tolerate a comparatively high probability of stale data being read. The replicas contacted for reads may not always have the most recent write.
Did you check that your code contacted the node that always was online & accepted writes?
The DSE Architecture guide, and especially its Database Internals section, provides a good overview of how Cassandra works.
Related
Suppose I have a two-node Cassandra cluster whose nodes reside in physically different data centers. Suppose the keyspace in that cluster has a replication factor of 2, which means every piece of data should be kept in sync on both nodes. Suppose this is a massive database with millions of records in its tables. I'll call the nodes node1 and node2. Suppose node2 is not reliable: the server crashes and takes a few days to fix and bring back up. As I understand it, there will then be a gap between node1 and node2, and it may take significant time to sync node2 with node1. So is there a way to measure the gap between node2 and node1 while the sync is in progress? And after some time, how can I be sure that node2 is equal to node1? Please correct me if my question misunderstands the Cassandra architecture.
So let's start with your description. 2 node cluster, which sounds fine, but 2 nodes in 2 different data centers (DCs) - bad design, but doable. Each data center should have multiple nodes to ensure your data is highly available. Anyway, that aside, let's assume you have a 2 node cluster with 1 node in each DC. The replication factor (RF) is defined at the keyspace level (not at the cluster level - each DC will have a RF setting for a particular keyspace (or 0 if not specified for a particular DC)). That being said, you can't have RF=2 for a keyspace for either of your DCs if you only have a single node in each one (RF, which is how many copies of the data that exist, can't be more than the number of nodes in the DC). So let's put that aside for now as well.
You have the possibility for DCs to become out of sync as well as nodes within a DC to become out of sync. There are multiple protections against this problem.
Consistency Level (CL)
This is a lever that you (the client) have to help control how far out of sync things get. There's a trade-off between availability and consistency (with performance implications as well). The CL setting is configured at connection time and/or at the statement level. For writes, the CL determines how many nodes must IMMEDIATELY ACKNOWLEDGE the write before giving your application the "green light" to move on (a number of nodes that you're comfortable with - knowing that the more nodes you immediately require, the more consistent your nodes and/or DC(s) will be, but the longer it will take and the less flexibility you have in nodes becoming unavailable without client failure). If you specify less than RF it doesn't mean that RF won't be met, it just means that the remaining replicas don't need to immediately acknowledge the write for you to move on. For reads, this setting determines how many nodes' data are compared before the result is returned (if Cassandra finds that a particular row doesn't match across the nodes it's comparing, it will "fix" them during the read before you get your results - this is called read repair). There are a handful of CL options available to the client (e.g. ONE, QUORUM, LOCAL_ONE, LOCAL_QUORUM, etc.). Again, there is a trade-off between availability and consistency with the selected choice.
If you want to be sure your data is consistent when your queries run (when you read the data), ensure the write CL + the read CL > RF. You can ensure that's done on a LOCAL level (e.g. the DC that the read/write is occurring on, say, LOCAL_QUORUM) or globally (all DCs with QUORUM). By doing this, you'll be sure that while your cluster may be inconsistent, your results during reads will not be (i.e. the results will be consistent/accurate - which is all that anyone really cares about). With this setting you also allow some flexibility in unavailable nodes (e.g. for a 3 node DC you could have a single node be unavailable without client failure for either reads or writes).
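The "write CL + read CL > RF" rule can be checked mechanically. The sketch below is illustrative only: the function names are made up for this example, and only a few consistency levels are mapped to replica counts.

```python
def replicas_required(level, rf):
    """Number of replica responses a consistency level waits for, given RF."""
    counts = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }
    return counts[level]


def reads_see_latest_write(write_cl, read_cl, rf):
    """True when every read set must overlap every write set (W + R > RF)."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf


for write_cl, read_cl in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")]:
    print(write_cl, read_cl, reads_see_latest_write(write_cl, read_cl, rf=3))
# QUORUM QUORUM True
# ONE ONE False
# ALL ONE True
```

With RF=3, QUORUM writes plus QUORUM reads touch 2 + 2 = 4 replicas in total, so at least one replica in every read has seen the latest write, while ONE plus ONE leaves room for a read to miss it entirely.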
If nodes do become out of sync, you have a few options at this point:
Repair
Repair (run by "nodetool repair") - this is a facility that you can schedule or manually run to reconcile your tables, keyspaces and/or the entire node with other nodes (either in the DC the node resides or the entire cluster). This is a "node level" command and must be run on each node to "fix" things. If you have DSE, Ops Center can run repairs in the background fixing "chunks" of data - cycling the process repetitively.
NodeSync
This is a DSE-specific tool, similar to repair, that helps keep data in sync (the newer version of repair).
Unavailable nodes:
Hinted Handoff
Cassandra has the ability to "hold onto" changes if nodes become unavailable during writes. It will hang onto changes for a specified period of time. If the unavailable nodes become available before time runs out, the changes are sent over for application. If time runs out, hint collection stops and one of the other options, above, needs to be performed to catch things up.
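As a rough illustration of that decision, here is a minimal sketch. It assumes a 3-hour hint window (the default figure quoted by other answers in this thread) and a hypothetical helper name; it also simplifies, since hints can be lost even inside the window, in which case a repair is still needed.

```python
from datetime import timedelta

# Assumption: a 3-hour hint window, the default quoted in this thread.
HINT_WINDOW = timedelta(hours=3)


def recovery_action(outage, hint_window=HINT_WINDOW):
    """Suggest how a returning node catches up, based on how long it was down."""
    if outage <= hint_window:
        # Other nodes were still collecting hints for the whole outage.
        return "hint replay"
    # Hint collection stopped part-way through, so some writes were never hinted.
    return "repair"


print(recovery_action(timedelta(minutes=45)))  # hint replay
print(recovery_action(timedelta(days=2)))      # repair
```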
Finally, there is no way to know how inconsistent things are (e.g. 30% inconsistent). You simply try to utilize the tools mentioned above to control consistency without completely sacrificing availability.
Hopefully that makes sense and helps.
-Jim
I am running a Cassandra 3.11.4 cluster with 1 data center, 2 racks and 11 nodes. My keyspaces and the tables are set to replication 2. I use the Prometheus-Grafana combo to monitor the cluster.
Observation: During (massive) inserts using Write-Consistency Level ALL (i.e. 2 nodes) the affected tables/nodes get slowly out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I use ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: If I dare to use write consistency ANY I get exactly that - and even though all nodes are online, Cassandra does not even seem to attempt to write to all nodes. In any case (ANY or ALL) I have to perform incremental repairs.
First of all, your expectation is correct: Writes, regardless of what the consistency-level is (ALL or ONE or ANY or whatever), do make every attempt to write to all replicas. The different write-consistency levels only differ on when "success" is reported to the client: ALL waits until all writes were done, while ONE waits for just one (and does the other ones in the background). So unless one of your nodes goes down, or severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save for it the writes it missed, and replay them later).
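A toy model makes the point concrete. Everything here is a simplification with made-up names: it assumes hinted handoff is enabled and that the coordinator rejects the write up front when too few replicas are alive to satisfy the consistency level.

```python
def coordinate_write(replica_up, acks_needed):
    """Model one write; returns (client_sees_success, replicas_written, hints_stored).

    replica_up: list of booleans, one per replica of the partition.
    acks_needed: acknowledgements the consistency level requires.
    """
    live = sum(replica_up)
    if live < acks_needed:
        # Not enough live replicas for this consistency level: rejected up front.
        return False, 0, 0
    # The write goes to every live replica regardless of the consistency level;
    # each down replica gets a hint stored for later replay.
    return True, live, len(replica_up) - live


# RF=2, both replicas up: ONE and ALL end up storing the same two copies.
print(coordinate_write([True, True], acks_needed=1))   # (True, 2, 0)
print(coordinate_write([True, True], acks_needed=2))   # (True, 2, 0)
# One replica down: ALL fails for the client, ONE succeeds and leaves a hint.
print(coordinate_write([True, False], acks_needed=2))  # (False, 0, 0)
print(coordinate_write([True, False], acks_needed=1))  # (True, 1, 1)
```

The consistency level only changes the first element of the result, that is, what the client is told; the number of copies written depends on which replicas are up.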
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that already went through a repair process) and "unrepaired" data - new data that has not yet passed through repair. This does not mean that the new data is inconsistent or differs between nodes - it just means that nobody has checked it yet! To mark this new data "repaired" you'd need to run an (incremental) repair - it will realize the data does not differ between nodes, and mark it as "repaired".
Assumptions: RF = 3
In a video on the Internet about consistency levels, the speaker says that CL = ONE is better than CL = ANY because with CL = ANY the coordinator is happy to store only a hint (and the data), assuming all the other nodes owning the corresponding partition key range are down, and we can potentially lose our data if the coordinator fails. But wait a minute... as I understand it, if we used CL = ONE and, for example, only one (of three) replica nodes for this partition key were available, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations - all replica nodes for a particular token are gone. Then it's better to reject the write operation than to write with such a big risk of losing the coordinator.
CL=ANY should probably never be used on a production server. Data written this way cannot be read until the hint has been delivered to a node owning that partition, because you can't read data while it's in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have data stored in both a) the commit log and memtable on a node and b) the hints log. These are likely different nodes, but they could be the same 1/3 of the time. So, yes, with CL=ONE and CL=ANY you risk complete loss of data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is the hints will just be stored for 3 hours by default and for longer times than that you have to run repairs. You can repair if you have at least one copy of this data on one node somewhere in the cluster (hints that are stored on coordinator don't count).
Consistency ONE guarantees that at least one node in the cluster has the write in its commit log no matter what. With ANY, in the worst case the write is stored only in the hints of the coordinator (other nodes can't access it), and hints are kept for 3 hours by default. After those 3 hours pass, with ANY you are losing data if the other two instances are down.
If you are worried about the risk, then use QUORUM and 2 nodes will have to guarantee that they saved the data. It's up to the application developer / designer to decide. QUORUM will usually have slightly higher write latencies than ONE. But you can always add more nodes etc. should the load dramatically increase.
Also have a look at this nice tool to see what impacts do various consistencies and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, 3 nodes in the cluster will actually get the write. Consistency is just about how long you want to wait for response from them ... If you use One, you will wait until one node has it in commit log. But the coordinator will actually send the writes to all 3. If they don't respond coordinator will save the writes into hints.
Most of the time, ANY in production is a bad idea.
I have 2 questions related to DataStax queries:
I have installed DataStax Enterprise 4.6 on 3 nodes of exactly the same configuration with regards to CPU, RAM, storage etc. I then created a keyspace with RF=3, created a CF within the keyspace and inserted about 10 million rows in it. Now when I log in to Node1 and execute a count query, it returns about 1.5 million in about 1 min 15 secs. But when I log in to Node2 and execute the exact same query, it takes about 1 min 35 secs. Similarly, when I log in to Node3 and execute it, it takes about 1 min 20 secs. Why is there a difference in the query execution times on the 3 nodes?
I shut down DSE (service dse stop) on Node2 & Node3 and ran the query on Node1. Since all required data is available on Node1, it ran successfully and took 1 min 15 secs. I then brought DSE up on Node2 and ran the query again. With tracing on, I see that data is being fetched from Node2 as well, but the time taken to execute the query is more than 1 min 15 secs. Should it not be less, since 2 nodes are being used? Similarly, when Node3 is also brought up and the query is executed, it takes more time compared to when 2 nodes are up. My understanding is that Cassandra/DataStax is linearly scalable.
Any help/pointers is much appreciated ..
Sounds like normal behavior to me. There is always some overhead when multiple nodes are coordinating and interacting with each other, and things are not necessarily going to behave in a perfectly symmetric way.
Even if all the data is local, there's still some interaction with the other nodes going on, and some of that will be non-deterministic in time. You have network latencies that vary, different queueing orders of things, variable seek times on disks, etc.
When you take two of the nodes down, the remaining node knows that they are down and so it doesn't bother trying to do any reads or interactions with them. That's why that scenario is the fastest. As you bring the other nodes back online, the extra coordination with them will slow things down a little. That's the price you pay for redundancy.
The performance scales by not keeping a copy of the data on every node. You are using RF=3 and only have three nodes. If you added a fourth node, then not all the data would be on every node. Now you have added capacity since not every write goes to all nodes and different writes will hit a different set of machines.
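The arithmetic behind that argument is simple enough to sketch. The helper below is hypothetical and assumes tokens are distributed evenly, so that each node stores roughly RF divided by the node count of the total data.

```python
def fraction_per_node(replication_factor, node_count):
    """Approximate share of the data each node stores, assuming even token distribution."""
    return min(1.0, replication_factor / node_count)


for nodes in (3, 4, 6):
    print(nodes, fraction_per_node(3, nodes))
# 3 1.0
# 4 0.75
# 6 0.5
```

With RF=3 on three nodes every node holds everything, so adding reads or writes adds work to every machine; only from the fourth node onward does each node start to carry a smaller share.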
Your question is simple to answer. It is a matter of Consistency: You can tune your select queries with a Consistency of One, then C* does not need to check if your data (RF=3) across all the nodes matches up.
In most use cases a Consistency of One for reads should be sufficient.
As for the time differences: The machines are involved in many different things besides serving queries, so it is normal behaviour to have different response times per node. There is a similar question/answer here: How do I set the consistency level of an individual CQL query in CQL3?
Basically go and play with consistency and see how response times change.
I want to clarify the very basic concepts of replication factor and consistency level in Cassandra. I would highly appreciate it if someone could answer the questions below.
RF- Replication Factor
RC- Read Consistency
WC- Write Consistency
2 cassandra nodes (Ex: A, B) RF=1, RC=ONE, WC=ONE or ANY
can I write data to node A and read from node B ?
what will happen if A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=2, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=3, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client will receive a success acknowledgement after a write is done to one node, without waiting for the second write. If you did a write with a CL of ALL, the acknowledgement to the client will wait until both copies are written. There are very many other consistency level options, too many to cover all the variants here. Read the Datastax doc, though, it does a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes, with WC=ONE, if the node for the single replica is up and is the one you're connected to, the write will succeed. If it's for the other node, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up. I think you also have to have hinted handoff enabled for that. The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, replication factor will affect how many copies are eventually written, but doesn't affect client behavior beyond what I've described above. The QUORUM will affect client behavior in that you will have to have a sufficient number of nodes up and responding for writes and reads. If you get lucky and at least (RF/2) + 1 of the replica nodes for your data are up (with RF/2 rounded down), then writes and reads will succeed. If you don't have enough nodes with replicas up, reads and writes will fail. Overall some QUORUM reads and writes can succeed if a node is down, assuming that that node is either not needed to store your replica, or if its outage still leaves enough replica nodes available.
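To see what that means for the RF=2 and RF=3 examples in the question, here is a small sketch with hypothetical helper names. It computes the quorum from the replication factor (the number of replicas of a row), not from the total node count.

```python
def quorum(replication_factor):
    """Replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def replicas_that_can_be_down(replication_factor):
    """How many replicas of a row may be unavailable while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(rf, quorum(rf), replicas_that_can_be_down(rf))
# 1 1 0
# 2 2 0
# 3 2 1
# 5 3 2
```

So with RF=2, QUORUM needs both replicas and tolerates no replica outage for the affected rows, while with RF=3 it needs two and tolerates one.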
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.