I have 2 questions related to DataStax queries:
I have installed DataStax Enterprise 4.6 on 3 nodes with exactly the same configuration (CPU, RAM, storage, etc.). I then created a keyspace with RF=3, created a CF within the keyspace, and inserted about 10 million rows into it. When I log in to Node1 and execute a count query, it returns about 1.5 million in about 1 min 15 s. But when I log in to Node2 and execute the exact same query, it takes about 1 min 35 s. Similarly, when I log in to Node3 and execute it, it takes about 1 min 20 s. Why is there a difference in the query execution times on the 3 nodes?
I shut down DSE (service dse stop) on Node2 and Node3 and ran the query on Node1. Since all required data is available on Node1, it ran successfully and took 1 min 15 s. I then brought DSE up on Node2 and ran the query again. With tracing on, I see that data is being fetched from Node2 as well, but the query takes longer than 1 min 15 s. Should it not take less time, since 2 nodes are being used? Similarly, when Node3 is also brought up and the query is executed, it takes longer than when 2 nodes are up. My understanding is that Cassandra/DataStax is linearly scalable.
Any help or pointers would be much appreciated.
Sounds like normal behavior to me. There is always some overhead when multiple nodes are coordinating and interacting with each other, and things are not necessarily going to behave in a perfectly symmetric way.
Even if all the data is local, there's still some interaction with the other nodes going on, and some of that will be nondeterministic in time. You have network latencies that vary, different queueing orders, variable seek times on disks, etc.
When you take two of the nodes down, the remaining node knows that they are down and so it doesn't bother trying to do any reads or interactions with them. That's why that scenario is the fastest. As you bring the other nodes back online, the extra coordination with them will slow things down a little. That's the price you pay for redundancy.
The performance scales by not keeping a copy of the data on every node. You are using RF=3 and only have three nodes. If you added a fourth node, then not all the data would be on every node. Now you have added capacity since not every write goes to all nodes and different writes will hit a different set of machines.
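The arithmetic behind that point can be sketched in a few lines of Python. This is a simplified model that assumes token ranges are perfectly balanced across nodes; the function name replica_share is made up for illustration and is not part of any Cassandra tooling.

```python
def replica_share(replication_factor: int, node_count: int) -> float:
    """Fraction of the full dataset that each node stores (idealized, balanced ring)."""
    if replication_factor < 1 or node_count < 1:
        raise ValueError("replication_factor and node_count must be positive")
    # A node can hold at most one copy of each row, so RF is capped at the node count.
    effective_rf = min(replication_factor, node_count)
    return effective_rf / node_count


# With RF=3, three nodes each hold everything; extra nodes shrink each node's share.
for nodes in (3, 4, 6, 12):
    print(f"{nodes} nodes: each stores {replica_share(3, nodes):.0%} of the data")
```

Under this model, going from 3 to 4 nodes at RF=3 drops each node from holding 100% of the data to 75%, which is where the added capacity comes from.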
Your question is simple to answer. It is a matter of Consistency: You can tune your select queries with a Consistency of One, then C* does not need to check if your data (RF=3) across all the nodes matches up.
In most use cases a Consistency of One for reads should be sufficient.
As for the time differences: the machines are involved in many different things besides serving queries, so it is normal to see different response times per node. There is a similar question/answer here: How do I set the consistency level of an individual CQL query in CQL3?
Basically go and play with consistency and see how response times change.
As stated in the title, we are having a problem with our Cassandra cluster. There are 9 nodes with a replication factor of 3 using NetworkTopologyStrategy, all in the same DC and rack. The Cassandra version is 3.11.4 (we are planning to move to 3.11.10). Instances have 4 CPUs and 32 GB RAM (we are planning to move to 8 CPUs).
Whenever we try to run repair on our cluster (using Cassandra Reaper on one of our nodes), we lose one node somewhere in the process. We quickly stop the repair, restart Cassandra service on the node and wait for it to join the ring. Therefore we are never able to run repair these days.
I observed the problem and realized that it is caused by high CPU usage on some of our nodes (exactly 3). See the 1-week interval graph below. The ups and downs are caused by usage of the app. In the mornings, it's very low.
I compared the running processes on each node and there is nothing extra on the high CPU nodes. I compared the configurations. They are identical. Couldn't find any difference.
I also realized that these nodes are the ones that take most of the traffic. See the 1-week interval graph below, showing both sent and received bytes.
I did some research and found this thread; at the end, it recommends setting dynamic_snitch: false in the Cassandra configuration. I looked at our snitch strategy, which is GossipingPropertyFileSnitch. In practice, this strategy should work properly, but I guess it doesn't.
The job of a snitch is to provide information about your network topology so that Cassandra can efficiently route requests.
The only thing I have noticed that could be causing this issue is a file called cassandra-topology.properties, which we are specifically told to remove when using GossipingPropertyFileSnitch:
The rack and datacenter for the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip. If cassandra-topology.properties exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.
I did not remove this file, as I couldn't find any hard proof that it is causing the issue. If you have any knowledge of this or see any other reason for my problem, I would appreciate your help.
These two sentences tell me some important things about your cluster:
high CPU usage on some of our nodes (exactly 3).
I also realized that these nodes are the ones that take most of the traffic.
The obvious point is that your replication factor (RF) is 3 (the most common value). The not-so-obvious point is that your data model is likely keyed on date or some other natural key, which results in the same (3?) nodes serving all of the traffic for long periods of time. Running repair during those high-traffic periods will likely lead to issues.
Some things to try:
Have a look at the data model, and see if there's a better way to partition the data to distribute traffic over the rest of the cluster. This is often done with a modeling technique known as "bucketing" (adding another component...usually time based...to the partition key).
Are the partitions large? (Check with nodetool tablehistograms) And by "large," like > 10MB? It could also be that the large partitions are causing the repair operations to fail. If so, hopefully lowering resource consumption (below) will help.
Does your cluster sustain high amounts of write throughput? If so, it may also be dealing with compactions (nodetool compactionstats). You could try lowering compaction throughput (nodetool setcompactionthroughput) to free up some resources. Repair operations can also invoke compactions.
Likewise, you can also lower streaming throughput (nodetool setstreamthroughput) during repairs. Repairs will take longer to stream data, but if that's what is really tipping over the node(s), it might be necessary.
In case you're not already, set up another instance and use Cassandra Reaper for repairs. It is so much better than triggering from cron. Plus, the UI allows for some finely tuned config, which might be necessary here. It also lets you pause and resume repairs, to pick up where it left off.
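The "bucketing" idea from the first suggestion can be sketched in Python as follows. This is only an illustration: the function name, the "sensor-42" key, and the choice of a one-day bucket are all assumptions, and the right bucket size depends on how much data each natural key receives.

```python
from datetime import datetime, timezone
from typing import Tuple


def bucketed_partition_key(natural_key: str, event_time: datetime) -> Tuple[str, str]:
    """Build a composite partition key of (natural key, UTC day bucket).

    Adding the day bucket means one natural key no longer maps to a single,
    ever-growing partition owned by the same replicas.
    """
    day_bucket = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (natural_key, day_bucket)


# The same (hypothetical) sensor lands in a different partition each day.
monday = datetime(2021, 3, 1, 9, 0, tzinfo=timezone.utc)
tuesday = datetime(2021, 3, 2, 9, 0, tzinfo=timezone.utc)
print(bucketed_partition_key("sensor-42", monday))
print(bucketed_partition_key("sensor-42", tuesday))
```

Because the partitioner hashes the whole composite key, the two partitions above will usually belong to different replica sets, which is what spreads the traffic over more of the cluster. The cost is that a query spanning several days has to read several partitions.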
Consider a growing volume of data. Let's choose between two extreme options:
Evenly distribute all data across all nodes in the cluster
Pack the data onto as few nodes as possible
I prefer option 1 because as the volume of data grows, we can scatter it with all nodes, so that when each node is queried, it has the lowest load.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't that just a normal scatter and gather? They even claim this hurts linear scalability as adding more nodes will further drag down the query.
(Maybe I am missing on how Cassandra performs the query, some background reference is appreciated).
On the contrary, some resources state that we should go with option 2 because it queries the least number of nodes.
Of course there are no black-and-white choices here; everything has a tradeoff.
I want to know what the real difference between option 1 and option 2 is. Also, regarding the network side of querying, why would option 1 be slow?
I prefer option 1 because as the volume of data grows, we can scatter it with all nodes, so that when each node is queried, it has the lowest load.
You definitely want to go with option #1. This is also preferable, in that new or replacement nodes will stream much faster than a cluster made of fewer, dense nodes.
However, some resources state that we shouldn't query all the nodes because that will slow down the query.
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above you'll discover how to build your tables so that your queries can be served by a single node. Running queries which only hit a single node is the best way around that problem.
Why would that slow the query?
Because in a distributed database environment, query time becomes network time. There are many people out there who like to run multi-key or unbound queries against Cassandra. When that happens, and the query is unable to find a single node with the data, Cassandra picks one node to designate as a "coordinator."
That node builds the result set with data from the other nodes. Which means in a 30 node cluster, that one node is now pulling data from the other 29. Assuming that these requests don't time-out, the likelihood that the coordinator will crash due to trying to manage too much data is very high.
The bottom line is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store data together that is queried together, and Cassandra will perform just fine.
Suppose I have a two-node Cassandra cluster whose nodes reside in physically different data centers. Suppose the database in that cluster has a replication factor of 2, which means all data should be kept in sync between the two nodes. Suppose this is a massive database with millions of records in its tables. I will call the nodes node1 and node2. Suppose node2 is unreliable: the server crashes and takes a few days to fix and bring back up. After that, according to my understanding, there will be a gap between node1 and node2, and it may take significant time to sync node2 with node1. Is there a way to measure the gap between node2 and node1 while the sync is in progress? And after some time, how can I verify that node2 is equal to node1? Please correct me if this question misunderstands the Cassandra architecture.
So let's start with your description. 2 node cluster, which sounds fine, but 2 nodes in 2 different data centers (DCs) - bad design, but doable. Each data center should have multiple nodes to ensure your data is highly available. Anyway, that aside, let's assume you have a 2 node cluster with 1 node in each DC. The replication factor (RF) is defined at the keyspace level (not at the cluster level - each DC will have a RF setting for a particular keyspace (or 0 if not specified for a particular DC)). That being said, you can't have RF=2 for a keyspace for either of your DCs if you only have a single node in each one (RF, which is how many copies of the data that exist, can't be more than the number of nodes in the DC). So let's put that aside for now as well.
You have the possibility for DCs to become out of sync as well as nodes within a DC to become out of sync. There are multiple protections against this problem.
Consistency Level (CL)
This is a lever that you (the client) have to help control how far out of sync things get. There's a trade-off between availability and consistency (with performance implications as well). The CL setting is configured at connection time and/or at the level of each statement. For writes, the CL determines how many nodes must IMMEDIATELY ACKNOWLEDGE the write before giving your application the "green light" to move on. Pick a number of nodes that you're comfortable with, knowing that the more nodes you require immediately, the more consistent your nodes and/or DC(s) will be, but the longer it will take and the less flexibility you have for nodes becoming unavailable without client failure. If you specify less than RF, it doesn't mean that RF won't be met; it just means that the remaining replicas don't need to acknowledge the write immediately for the client to move on. For reads, this setting determines how many nodes' data are compared before the result is returned (if Cassandra finds that a particular row doesn't match across the nodes it's comparing, it will "fix" them during the read before you get your results; this is called read repair). There are a handful of CL options available to the client (e.g. ONE, QUORUM, LOCAL_ONE, LOCAL_QUORUM, etc.). Again, there is a trade-off between availability and consistency with the selected choice.
If you want to be sure your data is consistent when your queries run (when you read the data), ensure the write CL + the read CL > RF. You can ensure that's done on a LOCAL level (e.g. the DC that the read/write is occurring on, say, LOCAL_QUORUM) or globally (all DCs with QUORUM). By doing this, you'll be sure that while your cluster may be inconsistent, your results during reads will not be (i.e. the results will be consistent/accurate - which is all that anyone really cares about). With this setting you also allow some flexibility in unavailable nodes (e.g. for a 3 node DC you could have a single node be unavailable without client failure for either reads or writes).
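The "write CL + read CL > RF" rule is easy to check by hand, but a small Python sketch may make it concrete. The function names here are made up for illustration; the quorum formula (half the replicas, rounded down, plus one) is the standard definition.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a quorum for the given RF."""
    return replication_factor // 2 + 1


def reads_overlap_writes(write_acks: int, read_replicas: int, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_acks + read_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                 # 2
print(reads_overlap_writes(q, q, rf))    # True:  QUORUM writes + QUORUM reads
print(reads_overlap_writes(1, 1, rf))    # False: ONE writes + ONE reads
print(reads_overlap_writes(rf, 1, rf))   # True:  ALL writes + ONE reads
print(rf - q)                            # 1 node may be down with QUORUM still reachable
```

The last line matches the 3-node example above: with QUORUM on both sides, one replica can be unavailable without client failures.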
If nodes do become out of sync, you have a few options at this point:
Repair
Repair (run by "nodetool repair") - this is a facility that you can schedule or run manually to reconcile your tables, keyspaces and/or the entire node with other nodes (either in the DC where the node resides or in the entire cluster). This is a "node level" command and must be run on each node to "fix" things. If you have DSE, OpsCenter can run repairs in the background, fixing "chunks" of data and cycling through the process repeatedly.
NodeSync
This is a DSE-specific tool, similar to repair, that helps keep data in sync (the newer version of repair).
Unavailable nodes:
Hinted Handoff
Cassandra has the ability to "hold onto" changes if nodes become unavailable during writes. It will hang onto changes for a specified period of time. If the unavailable nodes become available before time runs out, the changes are sent over to be applied. If time runs out, hint collection stops and one of the other options above needs to be performed to catch things up.
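As a rough sketch of that decision, the Python below compares an outage with the hint window. It assumes the 3-hour default that cassandra.yaml ships with for max_hint_window_in_ms; the function name and return strings are hypothetical, and real clusters have more conditions (for example, hints can be lost if the node holding them also fails).

```python
from datetime import timedelta

# Assumption: the default hint window of 3 hours; check your own cassandra.yaml.
DEFAULT_HINT_WINDOW = timedelta(hours=3)


def catch_up_method(outage: timedelta, hint_window: timedelta = DEFAULT_HINT_WINDOW) -> str:
    """Suggest how a returning node catches up, based only on outage length."""
    if outage <= hint_window:
        # Hints were collected for the whole outage and can be replayed.
        return "hint replay"
    # Hint collection stopped partway through, so some writes exist only on other replicas.
    return "repair"


print(catch_up_method(timedelta(minutes=40)))  # hint replay
print(catch_up_method(timedelta(days=2)))      # repair
```

For the scenario in the question, where node2 is down for a few days, the outage is far longer than the hint window, so a repair (or NodeSync on DSE) is what brings it back in sync.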
Finally, there is no way to know how inconsistent things are (e.g. 30% inconsistent). You simply try to utilize the tools mentioned above to control consistency without completely sacrificing availability.
Hopefully that makes sense and helps.
-Jim
I am running a Cassandra 3.11.4 cluster with 1 data center, 2 racks and 11 nodes. My keyspaces and tables are set to replication factor 2. I use the Prometheus/Grafana combo to monitor the cluster.
Observation: During (massive) inserts using write consistency level ALL (i.e. 2 nodes), the affected tables/nodes slowly get out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I use ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: if I dare to use write consistency ANY, I get exactly that, and even though all nodes are online, Cassandra does not even seem to attempt to write to all nodes. In either case (ANY or ALL), I have to perform incremental repairs.
First of all, your expectation is correct: writes, regardless of the consistency level (ALL or ONE or ANY or whatever), make every attempt to write to all replicas. The different write consistency levels differ only in when "success" is reported to the client: ALL waits until all writes are done, while ONE waits for just one (and does the others in the background). So unless one of your nodes goes down or is severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save the writes it missed and replay them later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that has already gone through a repair process) and "unrepaired" data (new data that has not yet passed through repair). This does not mean that the new data is inconsistent or differs between nodes; it just means that nobody has checked it yet! To mark this new data "repaired", you'd need to run an (incremental) repair; it will find that the data does not differ between nodes and mark it as "repaired".
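A small worked example shows how heavy inserts alone can push that number down. It assumes the metric is roughly "repaired bytes divided by total bytes on disk", and the sizes are invented purely for illustration.

```python
def percent_repaired(repaired_gb: float, total_gb: float) -> float:
    """Share of on-disk data that has already passed through repair, in percent."""
    if total_gb <= 0:
        # An empty table has nothing left to repair.
        return 100.0
    return 100.0 * repaired_gb / total_gb


# Right after an incremental repair: everything on disk is marked repaired.
print(round(percent_repaired(100.0, 100.0), 1))  # 100.0

# Six hours of inserts add 20 GB of new, not-yet-checked data.
print(round(percent_repaired(100.0, 120.0), 1))  # 83.3
```

Nothing became inconsistent between the two lines; the denominator simply grew with data that no repair has looked at yet.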
I've got 3 nodes; 2 in datacenter 1 (node 1 and node 2) and 1 in datacenter 2 (node 3). Replication strategy: Network Topology, dc1:2, dc2: 1.
Initially I keep one of the nodes in dc1 off (node 2) and write 100 000 entries with consistency 2 (via a C++ program). After writing, I shut down the node in datacenter 2 (node 3) and turn on node 2.
Now, if I try to read those 100 000 entries I had written (again via c++ program) with consistency set as ONE, I'm not able to read all those 100 000 entries i.e. I'm able to read only some of the entries. As I run the program again and again, my program fetches more and more entries.
I was expecting that since one of the 2 nodes which are up contains all the 100 000 entries, therefore, the read program should fetch all the entries in the first execution when the set consistency is ONE.
Is this related to read repair? I'm thinking that because the read repair is happening in the background, that is why, the node is not able to respond to all the queries? But nowhere could I find anything regarding this behavior.
Let's run through the scenario.
During the write of 100K rows, Node1 (DC1) and Node3 (DC2) took all the writes. While that was happening, Node1 might also have stored hints for Node2 (DC1) for the default window of 3 hours and then stopped doing so.
Once Node2 comes back online, unless a repair is run, it takes a while to catch up through the replay of hints. If the node was down for more than 3 hours, repair becomes mandatory.
During the reads, a request can technically reach any node in the cluster, depending on the load balancing policy used by the driver. Unless the driver is told to use "DCAwareRoundRobinPolicy", the read request might reach either DC (DC1 or DC2 in this case). Since the requested consistency is "ONE", practically any ALIVE replica can respond: Node1 and Node2 (DC1) in this case. Node2 may not have all the data, yet it can still respond with a NULL value, and that's why you received empty data some of the time and correct data at other times.
With consistency "ONE", read repair doesn't even happen, as there is no other node to compare with. Here is the documentation on it. Even with consistency "local_quorum" or "quorum", there is a read_repair_chance set at the table level, which defaults to 0.1. That means only 10% of reads will trigger read repair. This is to save performance by not triggering it every time. Think about it: if read repair could make the table entirely consistent across nodes, then why would "nodetool repair" even exist?
To avoid this situation, whenever a node comes back online, it's best practice to run "nodetool repair" or to run queries with consistency "local_quorum" to get consistent data back.
Also remember, consistency "ONE" is comparable to uncommitted read (dirty read) in the world of RDBMS (WITH UR). So expect to see unexpected data.
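The scenario can be mimicked with a toy Python model. This is not how the driver or Cassandra is implemented; it only illustrates why a single replica's answer can be empty while a read that consults both DC1 replicas is not. All names, keys and timestamps are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each replica maps a row key to (write timestamp, value).
Replica = Dict[str, Tuple[int, str]]


def read_one(replica: Replica, key: str) -> Optional[str]:
    """Consistency ONE: return whatever the single contacted replica has."""
    cell = replica.get(key)
    return cell[1] if cell is not None else None


def read_merged(replicas: List[Replica], key: str) -> Optional[str]:
    """Consult several replicas and keep the value with the newest timestamp."""
    cells = [replica[key] for replica in replicas if key in replica]
    if not cells:
        return None
    return max(cells, key=lambda cell: cell[0])[1]


node1: Replica = {"row-1": (100, "a"), "row-2": (101, "b")}  # was up during the writes
node2: Replica = {"row-1": (100, "a")}                       # still catching up

print(read_one(node1, "row-2"))              # b
print(read_one(node2, "row-2"))              # None
print(read_merged([node1, node2], "row-2"))  # b
```

With two replicas in dc1, a local quorum is both of them, which corresponds to the read_merged call here.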
Per the documentation, consistency level ONE for reads:
Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent. Provides the highest availability of all the levels if you can tolerate a comparatively high probability of stale data being read. The replicas contacted for reads may not always have the most recent write.
Did you check that your code contacted the node that always was online & accepted writes?
The DSE Architecture guide, and especially its Database Internals section, provides a good overview of how Cassandra works.