I am experimenting with Cassandra 3.0.2 on a 6-node cluster and have found some unintuitive read-scaling/workload patterns.
Query:
select count(*) from dvds
where dvds has 280k records.
With default vnode settings (num_tokens: 256), I've found that increasing node count from 1 to 2 improves read performance by about 35%, but each additional node beyond 2 nodes decreases performance by about 30%.
With vnodes disabled (num_tokens: 1 and initial_token set manually on each node), a 6-node cluster performs about 35% better than with num_tokens: 256, but the following pattern is clearly observable: the coordinator node's CPU consumption is either about 50% (of the capacity of a single CPU core) or about 110-120%, whereas the other nodes consume either about 0% or about 60-70% of a single core. The unintuitive part is this: when one node is busy, the other nodes are idle. (When the coordinator's CPU consumption is at 110-120%, all the other nodes are pretty much idle. When the coordinator's CPU is at 50%, one of the other nodes is busy.)
The strongest hypothesis I could come up with was that the cluster is unable to handle the network traffic, but the coordinator's network traffic (where, I assume, a network scalability issue would hit hardest) didn't seem to exceed 1 Mb/s at any point in time. (The network interfaces on the nodes are 10/100 Mbps.) Also, with a network scalability issue, I would expect the num_tokens: 1 setup to initially show high CPU load on all nodes except the coordinator -- or at least some evenly distributed simultaneous load.
Please, can anybody shed some light on this?
count(*) has its place, but it is very expensive. The coordinator essentially has to pull everything down from all the nodes, merge the results, and count them. The only thing it provides over reading everything and counting locally is that it reduces some network load between the coordinator and your application.
If you need this metric regularly, I would recommend using a counter or LWT to keep the count so that retrieving it is a single read operation (create your data model around your queries, not around abstractions of the data). If you need it once, or infrequently, Hadoop/Spark is a great option. Also, you can get a decent estimate from the EstimatedPartitionSize metric (per node, though), depending on your data model.
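If you go the counter route, a minimal sketch could look like the following (the dvd_stats table and its columns are hypothetical, not part of your schema): the application increments the counter alongside every insert into dvds and decrements it on delete, so reading the total becomes a single-partition read.

    CREATE TABLE dvd_stats (
        stat_name text PRIMARY KEY,
        dvd_count counter
    );

    -- bump the counter whenever the application inserts/deletes a dvds row
    UPDATE dvd_stats SET dvd_count = dvd_count + 1 WHERE stat_name = 'dvds';

    -- the count is then a cheap single-partition read instead of a full scan
    SELECT dvd_count FROM dvd_stats WHERE stat_name = 'dvds';

Keep in mind that counters can drift slightly when increments are retried after timeouts, which is the usual trade-off versus an LWT-maintained count.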
Related
I am using a Cassandra cluster with 4 nodes. 2 nodes have much more resources than the other two in terms of CPU cores and RAM.
Right now I am using the DCAwareRoundRobin load balancing policy. I believe that under this policy all nodes receive the same number of requests. Because of this, the two smaller nodes are not performing well, which results in high I/O and CPU usage on them.
I want to distribute the traffic from the Java application to the Cassandra cluster in the ratio of available resources.
For example: small node 1 - 20%, small node 2 - 20%, large node 1 - 30%, large node 2 - 30% of queries.
I need your suggestions on any method or approach I can use to distribute the traffic in this manner.
I understand that I can use LatencyAwarePolicy [1]. I am worried that when a node is taken out of the query plan due to a threshold breach, the remaining nodes might see a ripple effect.
[1] https://docs.datastax.com/en/drivers/java/2.2/com/datastax/driver/core/policies/LatencyAwarePolicy.html
In short, the approach you have is an anti-pattern and should be avoided at all costs.
TL;DR: Make the nodes the same.
During request processing, the query will reach not a single node but all of the nodes responsible for the partition you are accessing, based on the replication factor, no matter what consistency level you use. If you have RF=3, 3 of your 4 nodes will be reached for each write or read request. You do not have, and should not have, control over request distribution; "the medicine will be worse than the sickness".
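To make that concrete, the replication factor is a property of the keyspace definition, something like the following (keyspace and DC names are placeholders):

    CREATE KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

With RF=3 on a 4-node cluster, each node is a replica for roughly three quarters of the data, so the small nodes still store and serve most of the dataset no matter how the driver distributes coordinator duty.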
As stated in the title, we are having a problem with our Cassandra cluster. There are 9 nodes with a replication factor of 3 using NetworkTopologyStrategy, all in the same DC and rack. The Cassandra version is 3.11.4 (planning to move to 3.11.10). Instances have 4 CPUs and 32 GB RAM (planning to move to 8 CPUs).
Whenever we try to run repair on our cluster (using Cassandra Reaper on one of our nodes), we lose one node somewhere in the process. We quickly stop the repair, restart the Cassandra service on the node, and wait for it to join the ring. As a result, we are never able to complete a repair these days.
I observed the problem and realized that it is caused by high CPU usage on some of our nodes (exactly 3). You can see the one-week interval graph below. The ups and downs are caused by application usage; in the mornings it's very low.
I compared the running processes on each node and there is nothing extra on the high CPU nodes. I compared the configurations. They are identical. Couldn't find any difference.
I also realized that these nodes are the ones that take most of the traffic. See the one-week interval graph below (both sent and received bytes).
I did some research. I found this thread, and at the end it is recommended to set dynamic_snitch: false in the Cassandra configuration. I looked at our snitch strategy, which is GossipingPropertyFileSnitch. In theory this strategy should work properly, but I guess it doesn't.
The job of a snitch is to provide information about your network topology so that Cassandra can efficiently route requests.
My only observation that could be the cause of this issue is that there is a file called cassandra-topology.properties, which we are specifically told to remove when using GossipingPropertyFileSnitch:
The rack and datacenter for the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip. If cassandra-topology.properties exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.
I did not remove this file, as I couldn't find any hard proof that it is causing the issue. If you have any knowledge of this, or see any other explanation for my problem, I would appreciate your help.
These two sentences tell me some important things about your cluster:
high CPU usage on some of our nodes (exactly 3).
I also realized that these nodes are the ones that take most of the traffic.
The obvious point is that your replication factor (RF) is 3 (the most common). The not-so-obvious point is that your data model is likely keyed on date or some other natural key, which results in the same (3?) nodes serving all of the traffic for long periods of time. Running repair during those high-traffic periods will likely lead to issues.
Some things to try:
Have a look at the data model, and see if there's a better way to partition the data to distribute traffic over the rest of the cluster. This is often done with a modeling technique known as "bucketing" (adding another component, usually time-based, to the partition key); see the sketch after this list.
Are the partitions large? (Check with nodetool tablehistograms) And by "large," like > 10MB? It could also be that the large partitions are causing the repair operations to fail. If so, hopefully lowering resource consumption (below) will help.
Does your cluster sustain high amounts of write throughput? If so, it may also be dealing with compactions (nodetool compactionstats). You could try lowering compaction throughput (nodetool setcompactionthroughput) to free up some resources. Repair operations can also invoke compactions.
Likewise, you can also lower streaming throughput (nodetool setstreamthroughput) during repairs. Repairs will take longer to stream data, but if that's what is really tipping-over the node(s), it might be necessary.
In case you're not already, set up another instance and use Cassandra Reaper for repairs. It is so much better than triggering from cron. Plus, the UI allows for some finely-tuned config which might be necessary here. It also lets you pause and resume repairs, picking up where they left off.
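As an illustration of the bucketing idea, here is a hedged sketch (table and column names are invented, since I don't know your actual schema). A table keyed only on a natural key concentrates all of that key's growth and traffic on the same RF replicas; adding a day bucket to the partition key spreads it out over time and across the cluster:

    -- before: one partition per sensor, so all of its data and traffic
    -- always land on the same 3 replicas
    CREATE TABLE readings (
        sensor_id text,
        reading_time timestamp,
        value double,
        PRIMARY KEY (sensor_id, reading_time)
    );

    -- after: each (sensor, day) pair is its own partition, so data growth
    -- and read/write traffic spread across many partitions and many nodes
    CREATE TABLE readings_by_day (
        sensor_id text,
        day date,
        reading_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), reading_time)
    );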
We have a 30+ node Cassandra cluster (3.11.2) in 4 data centers. One of the centers consists of 8 nodes in Azure running on Standard DS12 v2 (4cpu, 28gb) nodes with a 500GB premium SSD drive. All in the same data center (central US).
We are seeing a dramatic CPU imbalance in the node activity when pushed to the max. We have a keyspace with about 200 million records, and we're running a process to check and refresh the records if necessary from another data stream.
What's happening is that we have 4 nodes running at 70-90% CPU compared to 15-25% on the other 4. The measurement of the CPU is being done on the nodes themselves, because Azure's own metrics are broken and never represent what is actually happening.
Digging into a pair of nodes (one low-CPU and one high), the difference is the iowait% of the two. The data in the keyspace is balanced (within reason - they are all within 5% of one another in record count and size). The number of reads looks balanced, and even the read latency as reported by Cassandra is similar.
When I do an iostat comparison of the nodes, the high-CPU node reports much higher (by 50 to 100%) rKB/s numbers... which is likely leading to the difference in iowait% time.
These nodes are configured 100% the same, running the same version of everything (OS, libraries, everything) that I can think to check. I cannot figure out why some nodes are deciding to do more disk reads than the others, resulting in the cluster as a whole slowing down.
Anybody have any suggestions on where I can look for differences?
The only pattern is that the slower nodes are the 4 nodes that were added later in our expansion. We started with 4 nodes for a while and added 4 more when we needed space. All the appropriate repairs and other tasks required for node additions were done - the fact that the record counts and the physical size of the data files on disk are equal should attest to that.
When we shut down our refresh process, all the nodes settle down to an even 5% or less CPU across the board. No compaction or other maintenance is happening that would indicate anything different.
Please help... :)
Our final solution for this - to fix ONLY the imbalance problem - was to run cleanup, a full repair, and compaction. At that point the nodes were relatively equally used. We suspect that expanding the cluster (adding nodes) may have left data on the older nodes that was never compacted out by regular compaction events.
We are still working to try to solve the load issue; but now at least all the nodes are feeling the same CPU crunch.
One of my C* cluster designs expects nodes to hold between 1 and 2 TB of data each, and I expect a huge amount of data within a few months. Assuming I eventually get to 1 PB of data and that each node will hold 1 TB, I should plan for a 1000x growth over time: starting from a measly N=3 nodes with RF=3 for 1 TB of data, I would keep adding nodes up to N=3000.
The high number of nodes involved puts pressure on how to deal with disk/server failures, how to keep the cluster healthy, and how to perform backups.
Healthy Cluster
Assuming you don't want any data loss and you perform reads/writes at LOCAL_QUORUM consistency level, using RF=3 when you have N<10 nodes is very reasonable. However, as N goes up, the mean time between node failures somewhere in the cluster goes down accordingly, so keeping RF=3 is going to invite trouble and you may want to "upgrade" to RF=5 or more.
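A rough back-of-envelope sketch of that concern (my own assumption: each node is independently unavailable with some small probability p at any instant). With num_tokens: 256, two simultaneously-down nodes are very likely to share at least one token range, so for RF=3 and LOCAL_QUORUM the chance that some range momentarily loses quorum is on the order of

    P(at least 2 of N nodes down) = 1 - (1-p)^N - N*p*(1-p)^(N-1)

which for p = 0.001 is about 0.5% at N = 100 but about 26% at N = 1000.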
Q1: What's a good RF that would fight against the decreased MTBF and keep the cluster healthy (and you sleeping peacefully) with say 100 nodes? and 500? and 1000?
Backup
Making backups of all the nodes doesn't seem very viable, for the following reasons:
It instantly doubles the cost of the solution.
I would be backing up redundant data because of the cluster's RF.
I see no way to remove the redundancy introduced by the RF and back up only the data, except by adding another DC to C* with RF=2 (I could go for RF=1, but if I lose one node the whole backup cluster is down). That would mean adding 2/RF of the cost of the cluster for backup purposes, which seems to me a good alternative.
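For what it's worth, the extra-DC variant is just a replication setting on the keyspace (keyspace and DC names below are placeholders), followed by a nodetool rebuild on the new DC's nodes to stream the existing data over:

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'main_dc': 3,
                          'backup_dc': 2};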
Q2: Are there any other methods to perform this task without increasing the cost of the solution too much?
I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it will reach a maximum size of several hundred terabytes - over one petabyte including the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating whether it would be workable to have high-density servers on spinning disks (e.g. at least 10 TB or 20 TB per node using spinning disks in RAID10 or JBOD; the servers would have good CPU and RAM, so the system would be I/O bound).
The number of reads/writes per second in Cassandra will be manageable by a small cluster without any stress. I should also mention that this is not a high-performance transactional system but a datastore for storage, retrieval, and some analysis, and the data will be almost immutable - so even if a compaction or a repair/reconstruction takes several days and ties up several servers at the same time, it's probably not going to be an issue at all.
I am wondering whether some people have experience and feedback on high server density, and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.
The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the pre-vnodes days, if you had a 20T node that died and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and take hours or days to restore the down node. During that time you're running with reduced redundancy, which is a real risk if you value your data.
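To put rough numbers on "hours/days": at Cassandra's default outbound stream throughput of 200 Mbit/s (stream_throughput_outbound_megabits_per_sec, unless you raise it), restoring 20T takes on the order of

    20 TB ≈ 1.6 x 10^14 bits; 1.6 x 10^14 / (200 x 10^6 bit/s) = 8 x 10^5 s ≈ 9 days

and even a fully saturated 10 Gbit/s link would need about 4.5 hours, assuming the sending nodes could keep up.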
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.
I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction, since it would spend less time compacting old data; see the sketch after this list.
There are other repair options that might help; for example, I've found that the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance-killing full repairs on everything.
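A sketch of the compaction change mentioned above: switching a time-series table to DateTieredCompactionStrategy is just a table property change (the keyspace/table name here is illustrative; KairosDB creates and owns its own tables, so check its schema before altering anything):

    ALTER TABLE metrics.data_points
      WITH compaction = {'class': 'DateTieredCompactionStrategy'};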
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.
I would recommend thinking about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1], which consists of a partition key plus one or more clustering columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra partitioner that you use; see cassandra.yaml).
For example, you could partition your data by the device that generates it (Pattern 1 in [2]) or by a period of time (e.g., per day), as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
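A hedged CQL sketch of the two patterns (table and column names are mine for illustration; KairosDB manages its own schema):

    -- Pattern 1: one partition per data source
    CREATE TABLE samples_by_device (
        device_id text,
        sample_time timestamp,
        value double,
        PRIMARY KEY (device_id, sample_time)
    );

    -- Pattern 2: the partition is additionally bounded by a time period (one
    -- day here), which also keeps each partition well under the 2-billion limit
    CREATE TABLE samples_by_device_day (
        device_id text,
        day text,  -- e.g. '2016-05-17'
        sample_time timestamp,
        value double,
        PRIMARY KEY ((device_id, day), sample_time)
    );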
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations