Using vnodes with 256 tokens per node, my cluster shows the output below when I run nodetool status. The load of my cluster seems extremely unbalanced and I don't know what is causing it. Could the partition keys of my tables be related? Any comments would be welcome, thanks!
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.190 9.78 GiB 256 ? f3e56d8d-caf2-450a-b4f1-e5ac5163a17a rack1
UN 192.168.1.191 77.53 MiB 256 ? e464cda9-ca8b-400b-82eb-59256d23c94a rack1
UN 192.168.1.192 89.31 MiB 256 ? 6feaa65f-2523-4b65-9717-f4f3cbf28ef0 rack1
Something about the load is not right: even with a significant imbalance in the primary token ranges, if you were using an RF of 3 all three nodes would hold a replica of all the data, so a primary-range imbalance would not be visible.
An imbalance like the one you have posted points to the use of RF 1, and potentially a poor data model / partition key that is hotspotting the data onto a single node.
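A quick way to confirm the replication factor is to describe the keyspace and to re-run nodetool status scoped to it, so the Owns column reflects that keyspace's replication. A minimal sketch, assuming the keyspace is kairosdb as in the histograms further down (adjust the host and keyspace name to your cluster):
# Show the CREATE KEYSPACE statement, including its replication settings
cqlsh 192.168.1.190 -e "DESCRIBE KEYSPACE kairosdb;"
# Ownership per node for that keyspace (the Owns column is no longer '?')
nodetool status kairosdb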
Yes, most probably there is a skew in the distribution of partition keys: some partitions have many more rows than others. Check this document for recommendations, especially the sections "Number of cells per partition" and "Big partitions". You can use a number of tools to check this hypothesis:
nodetool tablehistograms (may need to be executed for every table separately) on each host will show you the number of cells and the partition size in bytes at the 50th, 75th, ..., and 100th percentiles. You may see very large differences between the 95th and 100th percentiles (see the example commands after this list).
nodetool tablestats will show the maximum and average partition size per table on each host.
DSBulk has an option to show the largest partitions based on the number of rows per partition. It needs to be executed for every table in the cluster, but only once, not on each host as with nodetool:
dsbulk count -k keyspace -t table --log.verbosity 0 --stats.mode partitions
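For reference, a sketch of how these might be invoked against the keyspace and table from this thread (the nodetool commands must be run on every node; dsbulk only once):
# Latency / partition size / cell count percentiles for one table on the local node
nodetool tablehistograms -- kairosdb data_points
# Maximum and mean compacted partition size for one table on the local node
nodetool tablestats kairosdb.data_points
# Rows per partition across the whole cluster (run once from any machine with cluster access)
dsbulk count -k kairosdb -t data_points --log.verbosity 0 --stats.mode partitions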
Most tables are uniform, but not this one:
kairosdb/data_points histograms —— NODE1 load 9.73GB
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 2.00 17.08 152.32 1597 86
75% 2.00 35.43 182.79 9887 446
95% 2.00 88.15 263.21 73457 3973
98% 2.00 105.78 263.21 315852 20501
99% 2.00 105.78 263.21 545791 29521
Min 2.00 6.87 126.94 104 0
Max 2.00 105.78 263.21 785939 35425
kairosdb/data_points histograms —— NODE2 load 36.95MB
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 1.00 20.50 454.83 1109 42
75% 2.00 42.51 943.13 9887 446
95% 2.00 73.46 14530.76 73457 3973
98% 2.00 219.34 14530.76 263210 17084
99% 2.00 219.34 14530.76 545791 29521
Min 1.00 8.24 88.15 104 0
Max 2.00 219.34 14530.76 785939 35425
kairosdb/data_points histograms —— NODE3 load 61.56MB
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 1.00 14.24 943.13 1331 50
75% 1.00 29.52 1131.75 9887 446
95% 1.00 61.21 1131.75 73457 3973
98% 1.00 152.32 1131.75 315852 17084
99% 1.00 654.95 1131.75 545791 29521
Min 1.00 4.77 785.94 73 0
Max 1.00 654.95 1131.75 785939 35425
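If RF is indeed 1, one way to confirm that a few hot partitions are pinning the data to NODE1 is to ask which node owns a suspect partition key. A sketch; the key value is a placeholder and must be supplied in the format of the table's partition key:
# Which replica(s) own this partition of kairosdb.data_points?
nodetool getendpoints kairosdb data_points <partition-key-value>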
Related
One of our applications is occasionally getting the error:
Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
In the course of an hour we might get 20 or 30 of these over 10,000 queries or more. And a retry of the query generally works.
It does appear to be a timeout of some sort. The error appears in the application logs, but I don't see any corresponding error or warning, or anything really, in the cassandra system.log nor debug.log.
All of the searches I'm doing online lead to queries where people see this consistently, but for me it's not consistent. The cluster itself is healthy, and other queries return just fine. The table being queried isn't large (a few tens of MB on each server). Looking at tablehistograms, I'm not seeing anything overly large for reads or writes on any server for the table in question. CPU, memory, etc. are all fine.
A typical histogram for that table is currently:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 2.00 29.52 152.32 1916 72
75% 3.00 35.43 379.02 24601 770
95% 3.00 51.01 379.02 454826 14237
98% 3.00 61.21 379.02 654949 20501
99% 3.00 73.46 379.02 785939 24601
Min 0.00 14.24 105.78 180 6
Max 3.00 88.15 379.02 1629722 51012
Although I don't have one from immediately after this error appeared.
Running Apache Cassandra 3.11.3. 16-node cluster (8 nodes in each DC). Replication is DC1:3, DC2:3 (for all tables in all user keyspaces). The driver is configured to use DCAwareRoundRobin, and all reads and writes are LOCAL_QUORUM. The application (like all of our applications) is write heavy. STDC configured, if that helps.
We see far fewer timeouts on writes, but they are not zero:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency LOCAL_QUORUM (2 replica were required but only 1 acknowledged the write)
If it matters, this is occurring with Akka persistence tables for this particular application.
I'm looking for possible suggestions as to cause, please, as I haven't been able to find anything (and I don't have much hair to pull out...).
Thanks.
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:91)
at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:66)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:297)
at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:268)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
... 34 common frames omitted
We have several applications using this cluster. This isn't the only application with errors, but I figure by fixing this app it will fix the others.
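One thing that can help narrow down intermittent LOCAL_QUORUM timeouts is to re-run a query that failed with tracing enabled at the same consistency level, which shows how long each replica took to respond. A sketch from cqlsh; the table and predicate are placeholders:
cqlsh> CONSISTENCY LOCAL_QUORUM;
cqlsh> TRACING ON;
cqlsh> SELECT * FROM my_keyspace.my_table WHERE pk = 'some-key';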
How can we find large partitions in our Cassandra cluster before they show up in system.log? We are facing some performance issues due to this. Can anyone help? We have Cassandra versions 2.0.11 and 2.1.16.
You can look at the output of nodetool tablestats (or nodetool cfstats in older versions of Cassandra). For every table it includes a Compacted partition maximum bytes line together with other information, as in this example where the maximum partition size is about 268 MB:
Table: table_name
SSTable count: 2
Space used (live): 147638509
Space used (total): 147638509
.....
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 268650950
Compacted partition mean bytes: 430941
Average live cells per slice (last five minutes): 8256.0
Maximum live cells per slice (last five minutes): 10239
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
.....
But nodetool tablestats gives you information for the current node only, so you'll need to execute it on every node of the cluster.
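To avoid scanning the full output by eye, the relevant lines can be pulled out per table, for example (a sketch; my_keyspace is a placeholder, and on older versions the command is cfstats):
# Print each table name together with its largest compacted partition, on this node
nodetool cfstats my_keyspace | grep -E 'Table:|Column Family:|Compacted partition maximum bytes'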
Update: you can also find the largest partitions using other tools:
https://github.com/tolbertam/sstable-tools has a describe command that shows the largest/widest partitions. This command will also be available in Cassandra 4.0.
for DataStax products, the DSBulk tool supports counting partitions.
Try the nodetool tablehistograms -- <keyspace> <table> command; it provides statistics about a table, including read/write latency, partition size, cell count, and number of SSTables.
Below is the example output:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 73.46 0.00 223875792 61214
75% 0.00 88.15 0.00 668489532 182785
95% 0.00 152.32 0.00 1996099046 654949
98% 0.00 785.94 0.00 3449259151 1358102
99% 0.00 943.13 0.00 3449259151 1358102
Min 0.00 24.60 0.00 5723 4
Max 0.00 5839.59 0.00 5960319812 1955666
This gives the relevant stats for the table; for example, in the output above the 95th percentile partition size of the raw_data table is about 1.9 GB and the maximum is about 5.9 GB.
Hope this helps to figure out the performance issue.
There are two other related posts:
NoSpamLogger.java Maximum memory usage reached Cassandra
in cassandra Maximum memory usage reached (536870912 bytes), cannot allocate chunk of 1048576 bytes
But they aren't asking exactly the same thing. I am asking for a thorough understanding of what this message means. It doesn't seem to impact my latency at the moment.
I ran nodetool cfstats:
SSTable count: 5
Space used (live): 1182782029
Space used (total): 1182782029
Space used by snapshots (total): 0
Off heap memory used (total): 802011
SSTable Compression Ratio: 0.17875764458149868
Number of keys (estimate): 34
Memtable cell count: 33607
Memtable data size: 5590408
Memtable off heap memory used: 0
Memtable switch count: 902
Local read count: 4689
Local read latency: NaN ms
Local write count: 51592342
Local write latency: 0.035 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 120
Bloom filter off heap memory used: 80
Index summary off heap memory used: 291
Compression metadata off heap memory used: 801640
Compacted partition minimum bytes: 447
Compacted partition maximum bytes: 2874382626
Compacted partition mean bytes: 164195240
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
The latency looks fine to me.
I also did a histogram
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 35.43 0.00 1629722 35425
75% 0.00 42.51 0.00 129557750 2346799
95% 0.00 61.21 0.00 668489532 14530764
98% 0.00 73.46 0.00 2874382626 52066354
99% 0.00 88.15 0.00 2874382626 52066354
Min 0.00 11.87 0.00 447 11
Max 0.00 785.94 0.00 2874382626 52066354
The stats look fine to me! So what is Cassandra complaining about?
The comment in this jira has an explanation: https://issues.apache.org/jira/browse/CASSANDRA-12221
Quote:
Wei Deng added a comment - 18/Jul/16 05:01
See CASSANDRA-5661. It's a cap to limit the amount of off-heap memory used by RandomAccessReader, and if there is a need, you can change the limit by file_cache_size_in_mb in cassandra.yaml.
The log message is relatively harmless. It indicates that the node's off-heap cache is full because the node is busy servicing reads.
If the message reports 134217728 bytes, it means file_cache_size_in_mb has been set to 128 MB; you should consider setting it back to the default of 512 MB (536870912 bytes).
It is fine to see occasional occurrences of this message in the logs, which is why it is logged at INFO level. But if it gets logged repeatedly, it is an indicator that the node is getting overloaded, and you should consider increasing the capacity of your cluster by adding more nodes.
For more info, see my post on DBA Stack Exchange -- What does "Maximum memory usage reached" mean in the Cassandra logs?. Cheers!
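For reference, the cap being hit is the file_cache_size_in_mb setting mentioned in the quote above. A quick way to check what a node is using, assuming a package install with the config under /etc/cassandra:
# A commented-out or absent line means the default is in effect (512 MB, i.e. 536870912 bytes)
grep -n 'file_cache_size_in_mb' /etc/cassandra/cassandra.yaml
# To change it, set e.g. file_cache_size_in_mb: 512 in cassandra.yaml and restart the node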
I'm trying to figure out the throughput of my Cassandra cluster, and can't figure out how to use nodetool to accomplish that. Below is a sample output:
Starting NodeTool
Keyspace: realtimetrader
Read Count: 0
Read Latency: NaN ms.
Write Count: 402
Write Latency: 0.09648756218905473 ms.
Pending Flushes: 0
Table: currencies
SSTable count: 1
Space used (live): 5254
Space used (total): 5254
Space used by snapshots (total): 0
Off heap memory used (total): 40
SSTable Compression Ratio: 0.0
Number of keys (estimate): 14
Memtable cell count: 1608
Memtable data size: 567
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 402
Local write latency: 0.106 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0,00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 16
Compression metadata off heap memory used: 8
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 149
Compacted partition mean bytes: 149
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
I run the command:
nodetool cfstats
to get this, and then take the difference between the later "Local read count:" value and the earlier one.
But I'm not sure what "Local" means here.
Does it mean it's local to that node, so in a ring of 5 nodes I should multiply the value by 5? Or will the simple subtraction give me the correct result?
Also, which JMX bean should I be looking at to get these numbers?
Have a look at nodetool cfstats.
I think what you are looking for is 'Read Latency' and 'Write Latency'.
These fields indicate how fast your reads/writes are in your cluster.
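The Local read/write counts are per node and cumulative since the node started, so a per-node throughput figure can be approximated by sampling the counter twice and dividing the delta by the interval. A sketch using the keyspace/table from the output above:
# Writes per second on this node for realtimetrader.currencies, averaged over one minute
before=$(nodetool cfstats realtimetrader.currencies | awk '/Local write count:/ {print $4}')
sleep 60
after=$(nodetool cfstats realtimetrader.currencies | awk '/Local write count:/ {print $4}')
echo "local writes/sec: $(( (after - before) / 60 ))"
Summing (not multiplying) the per-node deltas gives the cluster-wide replica write rate. Over JMX, the same per-table counts back the org.apache.cassandra.metrics:type=Table (type=ColumnFamily on older versions) ReadLatency/WriteLatency beans, while coordinator-level request rates are under type=ClientRequest with scope=Read or Write.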
I ran the TRUNCATE command from cqlsh on node .20 for my table.
Twenty minutes have passed since I issued the command, and the output of nodetool status *myKeyspace* still shows a lot of data on 4 out of 6 nodes.
I am using Cassandra 3.0.8
192.168.178.20:/usr/share/cassandra$ nodetool status *myKeyspace*
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.178.24 324,57 MB 256 32,7% 4d852aea-65c7-42e1-b2bd-f38a320ec827 rack1
UN 192.168.178.28 650,86 KB 256 35,7% 82b67dc5-9f4f-47e9-81d7-a93f28a3e9da rack1
UN 192.168.178.30 155,68 MB 256 31,9% 28cf5138-7b61-42ca-8b0c-e4be1b5418ba rack1
UN 192.168.178.32 321,62 MB 256 33,3% 64e106ed-770f-4654-936d-db5b80aa37dc rack1
UN 192.168.178.36 640,91 KB 256 33,0% 76152b07-caa6-4214-8239-e8a51bbc4b62 rack1
UN 192.168.178.20 103,07 MB 256 33,3% 539a6333-c4ef-487a-b1e4-aac40949af4c rack1
The following command was run on node .24. It looks like there are still snapshots/backups being saved somewhere? But the amount of data, 658 MB for node .24, does not match the 324 MB reported by nodetool status. What's going on there?
192.168.178.24:/usr/share/cassandra$ nodetool cfstats *myKeyspace*
Keyspace: *myKeyspace*
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: data
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 658570012
Off heap memory used (total): 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 0
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0,00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 3.790273556231003
Maximum live cells per slice (last five minutes): 103
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Note that there are no tables in the keyspace other than the one I truncated. There might be some index data from cassandra-lucene-index, though, if it does not get cleared when using TRUNCATE.
The keyspace option of nodetool status is really only used to determine the replication factor and datacenters to include when computing ownership. The Load column actually covers all SSTables on the node, not just that one keyspace, just as the IP address, host ID, and number of tokens are not affected by the keyspace option. status is more of a global check.
Space used by snapshots is expected to still contain old data. When you truncate, Cassandra snapshots the data first (you can disable this by setting auto_snapshot to false in cassandra.yaml). To clear the snapshots you can use nodetool clearsnapshot <keyspace>.
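For example (a sketch; run on each node that still reports snapshot space):
# List the snapshots that are holding the space, per keyspace/table
nodetool listsnapshots
# Remove the snapshots for the truncated keyspace
nodetool clearsnapshot <keyspace>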