Understanding Cassandra message latency metric

I'm trying to understand how to use the org.apache.cassandra.metrics:type=Messaging metric. I set up 3 datacenters with 1 node each. When I measure the metric, each node reports 2 cross-datacenter metrics and 1 cross-node latency metric, as follows (for the node in DC-2):
org.apache.cassandra.metrics:type=Messaging,name=dc3-Latency
5.3387457013878636E7
org.apache.cassandra.metrics:type=Messaging,name=CrossNodeLatency
1.1471964354991291E8
org.apache.cassandra.metrics:type=Messaging,name=dc1-Latency
1.6108579786605054E8
However, no processes are currently using the cluster. Is Cassandra doing a dummy write to measure this metric? Also, what does the cross-node latency metric mean here, given that each DC contains only one node?

The metric records incoming latency for everything that uses the message service. The message service is used for reads and writes, but it's also used for streaming and gossip. Gossip fires every second between all the nodes, so it is probably dominating the metric in your situation. Also, some tables (system_distributed, system_traces, and some DSE tables if you are using DSE) may be written to even on a fairly idle system.
Whenever a message is sent from one node to another, the sender attaches a timestamp to it along with some versioning information. More or less the first thing the receiving node does (ignoring the obvious OS/socket layers) is compare that timestamp to "now". This is what drives the metric. It then looks at the datacenter the source is in to determine which metrics to update, and by how much.
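The mechanism described above can be sketched roughly as follows. This is a toy model and not Cassandra's actual code: the class and method names are hypothetical, and it assumes that CrossNodeLatency aggregates messages from every source while each "<dc>-Latency" metric only sees messages from that datacenter, which is consistent with the values in the question.

```python
import time
from collections import defaultdict


class MessagingLatency:
    """Toy model of incoming message latency tracking (hypothetical names)."""

    def __init__(self):
        # metric name -> list of observed latencies in milliseconds
        self.samples = defaultdict(list)

    def on_receive(self, sent_at_ms, source_dc, now_ms=None):
        # The receiver compares the sender's timestamp with its own "now".
        if now_ms is None:
            now_ms = time.time() * 1000
        # Clamp at zero so clock skew cannot produce a negative latency.
        latency = max(0.0, now_ms - sent_at_ms)
        # Assumption: one per-datacenter metric plus one aggregate metric.
        self.samples[f"{source_dc}-Latency"].append(latency)
        self.samples["CrossNodeLatency"].append(latency)
        return latency

    def mean(self, name):
        values = self.samples[name]
        return sum(values) / len(values) if values else 0.0


metrics = MessagingLatency()
metrics.on_receive(sent_at_ms=1000, source_dc="dc1", now_ms=1160)
metrics.on_receive(sent_at_ms=1000, source_dc="dc3", now_ms=1050)
print(metrics.mean("dc1-Latency"), metrics.mean("CrossNodeLatency"))
```

Because the comparison uses the sender's clock against the receiver's clock, any clock skew between nodes shows up directly in these numbers, so they are only as trustworthy as the time synchronization between the machines.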

Related

High CPU usage and traffic on some Cassandra nodes

As stated in the title, we are having a problem with our Cassandra cluster. There are 9 nodes with a replication factor of 3 using NetworkTopologyStrategy, all in the same DC and rack. The Cassandra version is 3.11.4 (we are planning to move to 3.11.10). Instances have 4 CPUs and 32 GB RAM (we are planning to move to 8 CPUs).
Whenever we try to run repair on our cluster (using Cassandra Reaper on one of our nodes), we lose one node somewhere in the process. We quickly stop the repair, restart Cassandra service on the node and wait for it to join the ring. Therefore we are never able to run repair these days.
I observed the problem and realized that it is caused by high CPU usage on some of our nodes (exactly 3). You can see the 1-week interval graph below. The ups and downs are caused by usage of the app. In the mornings, it's very low.
I compared the running processes on each node and there is nothing extra on the high CPU nodes. I compared the configurations. They are identical. Couldn't find any difference.
I also realized that these nodes are the ones that take most of the traffic. See the 1-week interval graph below, for both sent and received bytes.
I did some research. I found this thread, and at the end it recommends setting dynamic_snitch: false in the Cassandra configuration. I looked at our snitch strategy, which is GossipingPropertyFileSnitch. In practice this strategy should work properly, but I guess it doesn't.
The job of a snitch is to provide information about your network topology so that Cassandra can efficiently route requests.
My only observation that could be the cause of this issue is that there is a file called cassandra-topology.properties, which the documentation specifically says to remove when using GossipingPropertyFileSnitch:
The rack and datacenter for the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip. If cassandra-topology.properties exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.
I did not remove this file as I couldn't find any hard proof that it is causing the issue. If you have any knowledge about this or see any other reason for my problem, I would appreciate your help.

These two sentences tell me some important things about your cluster:
high CPU usage on some of our nodes (exactly 3).
I also realized that these nodes are the ones that take most of the traffic.
The obvious point is that your replication factor (RF) is 3 (the most common value). The not-so-obvious point is that your data model is likely keyed on date or some other natural key, which results in the same (3?) nodes serving all of the traffic for long periods of time. Running repair during those high-traffic periods will likely lead to issues.
Some things to try:
Have a look at the data model, and see if there's a better way to partition the data to distribute traffic over the rest of the cluster. This is often done with a modeling technique known as "bucketing" (adding another component...usually time based...to the partition key).
Are the partitions large (check with nodetool tablehistograms)? By "large," I mean > 10MB. It could also be that the large partitions are causing the repair operations to fail. If so, hopefully lowering resource consumption (below) will help.
Does your cluster sustain high amounts of write throughput? If so, it may also be dealing with compactions (nodetool compactionstats). You could try lowering compaction throughput (nodetool setcompactionthroughput) to free up some resources. Repair operations can also invoke compactions.
Likewise, you can also lower streaming throughput (nodetool setstreamthroughput) during repairs. Repairs will take longer to stream data, but if that's what is really tipping-over the node(s), it might be necessary.
In case you're not already, set up another instance and use Cassandra Reaper for repairs. It is so much better than triggering from cron. Plus, the UI allows for some finely-tuned config which might be necessary here. It also lets you pause and resume repairs, to pick-up where it leaves off.
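As a rough illustration of the "bucketing" idea mentioned in the list above, the sketch below derives a composite partition key from a natural key plus a time bucket. The function name, the example key and the one-day bucket size are assumptions made for illustration, not something taken from the cluster in the question.

```python
def bucketed_key(natural_key, epoch_seconds, bucket_seconds=86400):
    """Return (natural_key, bucket_start) to use as a composite partition key.

    bucket_start is the beginning of the time window containing epoch_seconds,
    so all rows for one natural key within one window share a partition.
    """
    bucket_start = epoch_seconds - (epoch_seconds % bucket_seconds)
    return (natural_key, bucket_start)


# Two writes on the same day share a partition; the next day starts a new one.
day = 86400
same_a = bucketed_key("sensor-1", 10 * day + 5)
same_b = bucketed_key("sensor-1", 10 * day + 80000)
next_day = bucketed_key("sensor-1", 11 * day + 5)
print(same_a == same_b, same_a == next_day)
```

Each distinct (natural_key, bucket_start) pair hashes to its own token, so over time the writes for one natural key move around the ring instead of staying on the same replicas. If the natural key is itself the date, a time bucket alone does not help and some other component would have to be added to the key.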

Cassandra Read or Write latency JMX metric

We are attempting to use Prometheus to manage a Cassandra cluster (v2.1.x) in my organization. I have a question related to the read and write latency metrics.
For a node level read or write latency, the metric that I am using is -
org.apache.cassandra.metrics<type=(ClientRequest), scope=(Read|Write), name=(Latency)><>(OneMinuteRate)
For a table level, I am using - org.apache.cassandra.metrics<type=(Keyspace), keyspace=(\S*), name=(ReadLatency|WriteLatency)><>(OneMinuteRate)
I tested the metrics by running cassandra-stress utility on a sample table. At the table level, I am seeing that the writes are roughly 5000 per sec. This is as expected.
But the node level writes are around 100 per sec. My requirement is to graph node level writes per second and table level writes per sec.

Yes, the ClientRequest Latency metric is at the node level. Seeing only 100 writes per second there means that this node is receiving very few client requests.
The table level is for all nodes in the cluster, meaning all your nodes together are doing 5000 writes to this table.
You should find out whether you have any hotspots (meaning some nodes handle more traffic than others). Perhaps you are using a limited number of partition keys, which would target a limited number of nodes. This is most likely your issue (poor data model). You can also check which Load Balancing Policy your driver is using and change that.

cassandra write throughput and scalability

This may sound like a dumb question but still I wanted someone/expert to answer/confirm this.
Let's say I have a 3-node Cassandra cluster with one database and just one table. For this single table, let's say I get a throughput of 1K writes/second with the 3 nodes. If tomorrow the write load on this table increases to 10K or 20K writes/second, will I be able to handle it by increasing the size of the cluster by, say, 10x or 20x?
My understanding of cassandra says it is possible (as cassandra is both read and write scalable) but would want an expert to confirm.

Yes, Cassandra has Linear Scalability.
The scalability is linear as shown in the chart below. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
Source : https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e

Yes - but only if your data is properly modeled - your data especially needs to be distributed evenly among your partition keys (since they map to specific replica nodes) to avoid hot spots. Given that, yes cassandra will scale horizontally well.
A "table" in cassandra is distributed among all nodes in your cluster. Each node is responsible for a range of tokens which are hashes of the partition key portion of your primary key.
Now, if you double your node count, for example, the existing token ranges are split in half and redistributed while bootstrapping the new nodes. So each node will only handle half of your initial requests. If you double your requests afterwards, each node will have roughly the same load as before.
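The range-splitting argument can be checked with a small simulation. This is a simplified sketch, not how Cassandra is implemented: it uses an MD5-derived token as a stand-in for the real partitioner, gives each node one equal contiguous range, and ignores replication and vnodes.

```python
import hashlib
from collections import Counter

TOKEN_SPACE = 2 ** 64  # size of the toy token space


def token(partition_key):
    """Hash a partition key to a token in [0, TOKEN_SPACE)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(partition_key, node_count):
    """Index of the node whose equal-sized token range contains the key."""
    return token(partition_key) * node_count // TOKEN_SPACE


def load_per_node(keys, node_count):
    """Number of keys owned by each node, in node order."""
    counts = Counter(owner(key, node_count) for key in keys)
    return [counts[node] for node in range(node_count)]


keys = [f"user-{i}" for i in range(60000)]
three_nodes = load_per_node(keys, 3)
six_nodes = load_per_node(keys, 6)
print(three_nodes)
print(six_nodes)
```

With evenly spread partition keys, each of the six nodes ends up with about half of what a node held in the three-node layout. If most requests targeted a handful of keys instead, adding nodes would not relieve the nodes that own those keys, which is the hot-spot caveat above.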
For read intensive requests - choosing a higher replication factor helps when you can live with stale data for a while (e.g. read and write at a low consistency level).
There are good tutorials from DataStax available here https://academy.datastax.com/

Datastax states that:
What are the benefits of Apache Cassandra?
Massively scalable ring architecture: Based on the best of Amazon Dynamo and Google BigTable, Cassandra’s peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability.
Linear scale performance: Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
So the answer is YES, it is possible. It may take some time to add a new node and redistribute tokens, but it will scale as you change the number of nodes.
If you need more info to understand how it will scale, check the links below:
Benchmarking Cassandra Scalability on AWS
Adding nodes to Cassandra
Adding, replacing, moving and removing nodes

Yes, that is so, but with one caveat: you should consider replication factor (RF) and consistency level (CL), as they also affect the scaling behaviour.
For example, if you initially have 10 nodes with RF=3 and you increase the node count to 20 with the same RF=3, you'll get a linear increase in write throughput.
But if you want to increase the read throughput, you need to increase RF. And with the increased RF you have to decrease the write consistency level to improve write throughput.
To summarize, you cannot increase both read and write throughput linearly with the same RF and CL parameters.

Will Elasticsearch survive this much load or simply die?

We have an Elasticsearch server with 1 cluster of 3 nodes. We expect 800-1000 queries per second, so we want to know: under a load of around 1000 queries per second, will Elasticsearch respond with delays, or will it simply stop working?
Queries are all query_string, fuzzy (prefix & wildcard queries are not used).

There are a few factors to consider, assuming that your network has the necessary throughput:
What's the CPU speed and number of cores for each node?
You should have 2 GHz quad cores at the very least. The nodes should also be dedicated to ELK, so they aren't busy with other tasks.
How much ram do your nodes have?
You probably want to be north of 10GB at least.
Are your logs filtered and indexed?
Having your logs filtered will greatly reduce the work load generated by the queries. Additionally, filtered logs can make it so that you don't have to query as much with wild cards (which are very expensive).
Hope that helps point in a better direction :)

One immediate suggestion: if you are expecting sustained query rates of 800 - 1K/sec you do not want the nodes storing the data (which will be handling indexing of new records, merging and shard rebalancing) to also be having to deal with query scatter/gather operations. Consider a client + data node topology where you keep your 3 nodes and add n client nodes (data and master set to false in their configs.) The actual value for n will vary based on your actual performance; this will be something you'll want to determine via experimentation.
Other factors equal or unknown, abundant memory is a good resource to have. Review the Elastic team's guidance on hardware and be sure to link through to the discussion on heap.

Cassandra cluster - data density (data size per node) - looking for feedback and advice

I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it will reach a maximum size of several hundred terabytes, or over one petabyte considering the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating whether it would be workable to have high-density servers on spinning disks (e.g. at least 10TB or 20TB per node using spinning disks in RAID10 or JBOD; servers would have good CPU and RAM, so the system will be I/O bound).
The number of reads/writes per second in Cassandra will be manageable by a small cluster without any stress. I can also mention that this is not a high-performance transactional system but a datastore for storage, retrieval and some analysis, and data will be almost immutable, so even if a compaction or a repair/reconstruction takes several days on several servers at the same time, it's probably not going to be an issue at all.
I am wondering whether anyone has experience with high server density on spinning disks, and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.

The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the days pre-vnodes, if you had a 20T node that died, and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and it would take (hours/days) to restore the down node. In that time, you're running with reduced redundancy, which is a likely risk if you value your data.
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
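To put rough numbers on the "(hours/days)" above, here is a back-of-the-envelope calculation. The 200 Mbit/s figure is purely an illustrative assumption for the aggregate rate at which a replacement node can receive streamed data; real throughput depends on throttling settings, network and disks.

```python
def stream_hours(data_tb, throughput_mbit_s):
    """Hours needed to stream data_tb terabytes at throughput_mbit_s megabits/s."""
    bits = data_tb * 1e12 * 8                    # terabytes -> bits
    seconds = bits / (throughput_mbit_s * 1e6)   # megabits/s -> bits/s
    return seconds / 3600


ASSUMED_RATE_MBIT_S = 200  # illustrative assumption, not a measured value

for size_tb in (4, 20):
    hours = stream_hours(size_tb, ASSUMED_RATE_MBIT_S)
    print(f"{size_tb} TB: {hours:.0f} hours ({hours / 24:.1f} days)")
```

At that assumed rate, a 20T node needs more than nine days of uninterrupted streaming, whereas a 4T node needs under two. The window of reduced redundancy, and the chance that a stream fails partway through, grows with the amount of data per node.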
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.

I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data.
There are other repair options that might help. For example, I've found that the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance-killing full repairs on everything.
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.

I would recommend thinking about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1], which consists of a partition key plus one or more columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra partitioner that you use, see cassandra.yaml).
For example, you could partition your data by the device that generates it (Pattern 1 in [2]) or by a period of time (e.g., per day), as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
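A quick way to sanity-check a partitioning scheme against the 2 billion limit cited above is to estimate how many values land in one partition per time bucket. The sample rates below are made-up inputs for illustration, and the function names are hypothetical.

```python
MAX_VALUES_PER_PARTITION = 2_000_000_000  # limit cited in the answer above [3]


def values_per_partition(samples_per_second, bucket_seconds, columns_per_sample=1):
    """Values written to one partition during one time bucket."""
    return int(samples_per_second * bucket_seconds * columns_per_sample)


def days_until_limit(samples_per_second, columns_per_sample=1):
    """Days of continuous writes before an unbucketed partition hits the limit."""
    per_day = values_per_partition(samples_per_second, 86400, columns_per_sample)
    return MAX_VALUES_PER_PARTITION / per_day


for rate in (1, 1000):  # hypothetical samples per second for one device
    print(rate, values_per_partition(rate, 86400), round(days_until_limit(rate), 1))
```

A device sampled once per second stays far below the limit even without time buckets, but at 1000 samples per second a single unbucketed partition would reach it in a few weeks, while daily buckets keep each partition at tens of millions of values. In practice, partition size in bytes is likely to become a concern well before the hard limit is reached.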
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations
