cassandra inter cluster latency metrics

Is there an MBean in Cassandra for getting cross-datacenter latency metrics?
I have 6 nodes spread across 2 DCs, 3 nodes each. I want to monitor the replication between the DCs.

Sure, you can monitor both overall and DC-specific latencies using the org.apache.cassandra.metrics MBeans.
Overall internode latency:
org.apache.cassandra.metrics:type=Messaging,name=CrossNodeLatency
Internode latency for the datacenter named <DC-Name>:
org.apache.cassandra.metrics:type=Messaging,name=<DC-Name>-Latency
You can find these and other useful metrics on this page in the DataStax documentation: https://docs.datastax.com/en/dseplanning/docs/metricsandalerts.html
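For example, here is a minimal sketch of reading those two metrics over plain JMX from Java. It assumes JMX is exposed on the default port 7199 without authentication and that the remote datacenter is named DC2; both are placeholders rather than anything from the original answer.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
public class CrossDcLatencyCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Overall cross-node latency (a Timer metric).
            ObjectName crossNode = new ObjectName(
                    "org.apache.cassandra.metrics:type=Messaging,name=CrossNodeLatency");
            System.out.println("CrossNodeLatency mean: "
                    + mbs.getAttribute(crossNode, "Mean"));
            // Latency to one specific datacenter; replace DC2 with your remote DC name.
            ObjectName dcLatency = new ObjectName(
                    "org.apache.cassandra.metrics:type=Messaging,name=DC2-Latency");
            System.out.println("DC2-Latency 99th percentile: "
                    + mbs.getAttribute(dcLatency, "99thPercentile"));
        }
    }
}
Both metrics are Timers, so they also expose Count and the usual percentile attributes; the DurationUnit attribute tells you which time unit the values are reported in.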

Related

What is the expected ingestion pace for a Cassandra cluster?

I am running a project that requires loading millions of records into Cassandra.
I am using Kafka Connect with partitioning and 24 workers, and I only get around 4,000 rows per second.
I did a test with Pentaho PDI inserting straight into Cassandra with the JDBC driver and got slightly fewer rows per second: 3,860 on average.
The Cassandra cluster has 24 nodes. What is the expected insertion rate by default? How can I fine-tune the ingestion of large loads of data?
There is no magical "default" rate at which a Cassandra cluster can ingest data. One cluster can take 100K ops/sec, another can do 10M ops/sec. In theory, it can be limitless.
A cluster's throughput is determined by a lot of moving parts, which include (but are not limited to):
hardware configuration
number of cores, type of CPU
amount of memory, type of RAM
disk bandwidth, disk configuration
network capacity/bandwidth
data model
client/driver configuration
access patterns
cluster topology
cluster size
The only way you can determine the throughput of your cluster is by doing your own test on as close to production loads as you can simulate. Cheers!
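As a starting point for such a test, here is a rough sketch of an asynchronous write-throughput probe written against the DataStax Java driver 4.x. The contact point, local DC name, the test_ks.test_table schema (uuid id, text payload) and the in-flight cap of 256 are all illustrative assumptions; a realistic test should use production-like data, schema and concurrency.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.net.InetSocketAddress;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
public class WriteThroughputProbe {
    public static void main(String[] args) throws Exception {
        int totalWrites = 100_000;
        Semaphore inFlight = new Semaphore(256); // cap the number of concurrent async requests
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("10.0.0.1", 9042)) // placeholder contact point
                .withLocalDatacenter("DC1")                               // placeholder DC name
                .build()) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO test_ks.test_table (id, payload) VALUES (?, ?)");
            long start = System.nanoTime();
            CompletableFuture<?>[] futures = new CompletableFuture<?>[totalWrites];
            for (int i = 0; i < totalWrites; i++) {
                inFlight.acquire();
                futures[i] = session.executeAsync(insert.bind(UUID.randomUUID(), "payload-" + i))
                        .toCompletableFuture()
                        .whenComplete((rs, err) -> inFlight.release());
            }
            CompletableFuture.allOf(futures).join();
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d writes in %.1f s -> %.0f rows/s%n",
                    totalWrites, seconds, totalWrites / seconds);
        }
    }
}
Running the same probe with different concurrency caps and row sizes quickly shows where your particular cluster and data model start to saturate.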

Frequent Compaction of OpsCenter.rollup_state on all the nodes consuming CPU cycles

I am using DataStax Cassandra 4.8.16, with a cluster of 8 DCs and 5 nodes in each DC, running on VMs. For the last couple of weeks we have observed the performance issues below:
1) Increased drop count on the VMs.
2) LOCAL_QUORUM not achieved for some write operations.
3) Frequent compaction of OpsCenter.rollup_state and system.hints visible in OpsCenter.
Appreciate any help finding the root cause for this.
The presence of dropped mutations means that the cluster is heavily overloaded. It could be an increase in the main load which, combined with the load from OpsCenter, overloads the system. You need to look at statistics on the number of requests, latencies, etc. per node and per table to see where the increase happened. Please also check the I/O statistics on the machines (for example, with iostat): queue sizes, read/write latencies, etc.
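If JMX access is handy, one quick way to confirm dropped mutations from code is to read Cassandra's DroppedMessage meter directly; a minimal sketch, assuming the default JMX port 7199 with no authentication:
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
public class DroppedMutationsCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Meter counting MUTATION messages dropped since the node started.
            ObjectName dropped = new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
            System.out.println("Dropped mutations: " + mbs.getAttribute(dropped, "Count"));
            System.out.println("One-minute rate:   " + mbs.getAttribute(dropped, "OneMinuteRate"));
        }
    }
}
nodetool tpstats shows the same dropped-message counters if you prefer the command line.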
It's also recommended to use a dedicated cluster to store the OpsCenter metrics - it can be smaller, and it doesn't require an additional DSE license. As the OpsCenter documentation says:
Important: In production environments, DataStax strongly recommends storing data in a separate DataStax Enterprise cluster.
Regarding VMs - it's usually not a recommended setup, but it heavily depends on the underlying hardware: number of CPUs, RAM, and the disk subsystem.

How to monitor Cassandra replication lag?

I have a Cassandra cluster set up with two datacenters; each datacenter has 3 nodes. While doing load testing on it, data is replicated from datacenter1 to datacenter2.
But is there any way I can monitor the replication lag/latency while data is being replicated from dc1 to dc2?
Have a look at this Jira ticket. It describes new metrics for cross-DC latency in Cassandra 3.8 and above.

Measuring throughput in cassandra cluster

This may sound like an elementary question - but I am confused by all the literature.
I have a 3-node Cassandra cluster on 3.11.x, with 1 seed node.
We are testing brute-force write throughput in this setup with a single-threaded client sitting outside the cluster.
With nodetool and cqlsh at my disposal - how do I go about realistically assessing the following:
How much volume was processed by each node.
How much of the total time was consumed
a) by the actual flush + compaction at each node
b) time taken by the cluster to resolve the node/partition (hashing)
c) network latency in chaperoning the data to the node
The best way would be to monitor Cassandra's JMX port with a tool like OpenNMS or Zabbix and compare metrics such as mutations, compactions, etc. (there are hundreds of metrics there):
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsMonitoring.html
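If you just want to poke at those metrics without setting up a full monitoring stack, a minimal JMX sketch like the following works too; the 127.0.0.1:7199 endpoint, the no-authentication assumption and the test_ks.test_table names are placeholders.
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
public class ListTableMetrics {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // List every per-table metric for one table (WriteLatency, MemtableOnHeapSize, ...).
            Set<ObjectName> names = mbs.queryNames(new ObjectName(
                    "org.apache.cassandra.metrics:type=Table,keyspace=test_ks,scope=test_table,name=*"),
                    null);
            names.forEach(System.out::println);
            // Pending compactions on this node (a Gauge exposing a Value attribute).
            ObjectName pending = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
            System.out.println("Pending compactions: " + mbs.getAttribute(pending, "Value"));
        }
    }
}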

Will CONSISTENCY TWO be affected by remote DC latency if all local replicas are up?

Scenario: we have a DSE 5.0 cluster running in AWS with 2 DCs, and a keyspace with 3 replicas in Australia and 3 replicas on the US West Coast. The app talks to DSE via the DSE Java driver.
For our users in Sydney, if we use LOCAL_QUORUM, response times as measured at the client are under 90 ms. This is good, but if 2 local replicas are too slow (which happened during a nasty repair caused by the analytics cluster), we are down.
If we use QUORUM, we can lose 2 nodes locally without going down, but our response times are over 450 ms at all times because every read needs at least one answer from the remote DC.
My question is: will using CL TWO (which is enough for our case) suffer the same latency cost as QUORUM if all 3 of our local replicas are healthy and behaving?
Our end goal is low latency while still being able to fail over automatically and eat the latency cost if the local DC fails.
If it makes any difference, we are using DCAwareRoundRobin in the driver.
DCAwareRoundRobin policy provides round-robin queries over the nodes of the local data center. It also includes in the query plans returned a configurable number of hosts in the remote data centers, but those are always tried after the local nodes. In other words, this policy guarantees that no host in a remote data center will be queried unless no host in the local data center can be reached.
CONSISTENCY TWO returns the most recent data from two of the closest replicas (see CONSISTENCY in the Cassandra documentation).
To obtain minimal latency in Scylla/Cassandra in a multi-DC deployment, you need to use the locality-aware aspect of the driver.
The challenge with CL=TWO is that it takes the closest responses from the nearest replicas based on your snitch configuration.
To my understanding, this means the coordinator's requests are sent to replicas without the locality aspect, so you'd be charged for egress traffic on both sides of the pond: once for the request and once for the actual data traffic coming from the replicas.
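For completeness, this is roughly what the locality-aware setup looks like with the driver generation used around DSE 5.0 (the OSS Java driver 3.x API underneath the DSE driver). The contact point, the Sydney DC name, the keyspace/table, and the choice of LOCAL_QUORUM as the session default with a per-request override to TWO are illustrative assumptions, not settings from the question.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;
public class LocalFirstReads {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")                 // placeholder contact point
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc("Sydney")       // placeholder local DC name
                                .withUsedHostsPerRemoteDc(3) // keep remote hosts as a fallback
                                .build()))
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
        try {
            Session session = cluster.connect();
            // Per-request override to CL TWO: the coordinator is still a local node, and
            // while all three local replicas are healthy the two fastest responses are
            // expected to come from the local DC, so reads should not pay the cross-DC cost.
            Statement read = new SimpleStatement(
                    "SELECT * FROM test_ks.test_table WHERE id = ?", 42)
                    .setConsistencyLevel(ConsistencyLevel.TWO);
            session.execute(read);
        } finally {
            cluster.close(); // also closes any sessions created from this cluster
        }
    }
}
Whether the remote DC actually stays out of the read path also depends on the snitch and on the table's speculative retry settings, so it is worth verifying with tracing (TRACING ON in cqlsh) under load.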
