How can we find large partitions in our Cassandra cluster before they show up in system.log? We are facing performance issues because of this. Can anyone help? We run Cassandra versions 2.0.11 and 2.1.16.
You can look at the output of nodetool tablestats (or nodetool cfstats in older versions of Cassandra) - for every table it includes the line Compacted partition maximum bytes, together with other information, as in this example where the max partition size is about 268 MB:
Table: table_name
SSTable count: 2
Space used (live): 147638509
Space used (total): 147638509
.....
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 268650950
Compacted partition mean bytes: 430941
Average live cells per slice (last five minutes): 8256.0
Maximum live cells per slice (last five minutes): 10239
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
.....
But nodetool tablestats gives you information for the current node only, so you'll need to execute it on every node of the cluster.
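To avoid eyeballing that output on every node, you can sift it with a small script. A minimal sketch: the parsing relies only on the line labels shown above, and how you fan it out to each node (e.g. over ssh) is up to you.

```python
def max_partition_sizes(tablestats_output):
    """Parse `nodetool tablestats` output and return a dict mapping
    table name -> 'Compacted partition maximum bytes' value."""
    sizes = {}
    table = None
    for line in tablestats_output.splitlines():
        line = line.strip()
        if line.startswith("Table: "):
            table = line[len("Table: "):]
        elif line.startswith("Compacted partition maximum bytes:") and table:
            sizes[table] = int(line.rsplit(":", 1)[1])
    return sizes

# Example with the output shown above:
sample = """
Table: table_name
    SSTable count: 2
    Compacted partition maximum bytes: 268650950
"""
print(max_partition_sizes(sample))  # {'table_name': 268650950}
```

Sorting the resulting dict by value quickly surfaces the worst offenders per node.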
Update: You can find the largest partitions using different tools:
https://github.com/tolbertam/sstable-tools has a describe command that shows the largest/widest partitions. This command is also available in Cassandra 4.0.
For DataStax products, the DSBulk tool supports counting partitions.
Try the nodetool tablehistograms <keyspace> <table> command, which provides statistics about a table, including read/write latency, partition size, cell count, and number of SSTables.
Below is the example output:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 73.46 0.00 223875792 61214
75% 0.00 88.15 0.00 668489532 182785
95% 0.00 152.32 0.00 1996099046 654949
98% 0.00 785.94 0.00 3449259151 1358102
99% 0.00 943.13 0.00 3449259151 1358102
Min 0.00 24.60 0.00 5723 4
Max 0.00 5839.59 0.00 5960319812 1955666
This gives you proper stats for the table: for example, the 95th percentile partition size of this raw_data table is about 1.86 GB, and the maximum is about 5.55 GB.
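The partition sizes in these histograms are raw byte counts, which are hard to read at a glance. A small helper to convert them (plain arithmetic, nothing Cassandra-specific):

```python
def human_bytes(n):
    """Convert a byte count to a human-readable string (1024-based)."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.2f} {unit}"
        n /= 1024
    return f"{n:.2f} PB"

# Partition sizes from the example histogram above:
print(human_bytes(1996099046))  # 95th percentile -> 1.86 GB
print(human_bytes(5960319812))  # max             -> 5.55 GB
```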
Hope this helps to figure out the performance issue.
Related
Using the vnodes strategy with 256 tokens per node, my cluster shows the info below when executing nodetool status. It seems the load of my cluster is extremely unbalanced. I don't know what causes this. Is the partition key of the tables related? Any comments would be welcome, thanks!
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.190 9.78 GiB 256 ? f3e56d8d-caf2-450a-b4f1-e5ac5163a17a rack1
UN 192.168.1.191 77.53 MiB 256 ? e464cda9-ca8b-400b-82eb-59256d23c94a rack1
UN 192.168.1.192 89.31 MiB 256 ? 6feaa65f-2523-4b65-9717-f4f3cbf28ef0 rack1
Even with a significant imbalance of the primary token range, something about the load is not right: if you are using an RF of 3, all 3 nodes would have a replica of all the data, and any primary-range imbalance would not be visible.
The imbalance you have posted points to the use of RF=1, and potentially a poor data model / partition key which is hotspotting the data onto a single node.
Yes, most probably there is a skew in the distribution of partition keys - some partitions have many more rows than others. Check this document for recommendations, especially the sections "Number of cells per partition" and "Big partitions". You can use a number of tools to check this hypothesis:
nodetool tablehistograms (may need to be executed for every table separately) on each host will show you the number of cells and the partition size in bytes at the 50%, 75%, ..., and 100% percentiles. You may see very big differences between the 95% and 100% percentiles.
nodetool tablestats will show the max & average partition size per table per host.
DSBulk has an option to show the largest partitions based on the number of rows per partition. It needs to be executed for every table in the cluster, but only once, not from each host, in contrast to nodetool:
dsbulk count -k keyspace -t table --log.verbosity 0 --stats.mode partitions
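Since the command has to be repeated per table, a tiny wrapper can generate the invocations. A sketch with hypothetical keyspace/table names (in practice you could list tables from system_schema.tables via cqlsh or a driver):

```python
def dsbulk_count_cmd(keyspace, table):
    """Build the DSBulk command that counts rows per partition for one table."""
    return (f"dsbulk count -k {keyspace} -t {table} "
            "--log.verbosity 0 --stats.mode partitions")

# Hypothetical keyspace/table pairs:
tables = [("ks1", "events"), ("ks1", "users")]
for ks, tbl in tables:
    print(dsbulk_count_cmd(ks, tbl))
```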
Most tables are uniform, but this one:
kairosdb/data_points histograms —— NODE1 load 9.73GB
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 2.00 17.08 152.32 1597 86
75% 2.00 35.43 182.79 9887 446
95% 2.00 88.15 263.21 73457 3973
98% 2.00 105.78 263.21 315852 20501
99% 2.00 105.78 263.21 545791 29521
Min 2.00 6.87 126.94 104 0
Max 2.00 105.78 263.21 785939 35425
kairosdb/data_points histograms —— NODE2 load 36.95MB
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 1.00 20.50 454.83 1109 42
75% 2.00 42.51 943.13 9887 446
95% 2.00 73.46 14530.76 73457 3973
98% 2.00 219.34 14530.76 263210 17084
99% 2.00 219.34 14530.76 545791 29521
Min 1.00 8.24 88.15 104 0
Max 2.00 219.34 14530.76 785939 35425
kairosdb/data_points histograms —— NODE3 load 61.56MB
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 1.00 14.24 943.13 1331 50
75% 1.00 29.52 1131.75 9887 446
95% 1.00 61.21 1131.75 73457 3973
98% 1.00 152.32 1131.75 315852 17084
99% 1.00 654.95 1131.75 545791 29521
Min 1.00 4.77 785.94 73 0
Max 1.00 654.95 1131.75 785939 35425
I have noticed that some tables show less than 100% "Percent repaired" in the nodetool tablestats output. I have manually executed repairs on all nodes (3-node cluster, RF=3), but the value doesn't seem to change.
Example output:
Table: users
SSTable count: 3
Space used (live): 66636
Space used (total): 66636
Space used by snapshots (total): 0
Off heap memory used (total): 688
SSTable Compression Ratio: 0.5731829674519404
Number of partitions (estimate): 162
Memtable cell count: 11
Memtable data size: 483
Memtable off heap memory used: 0
Memtable switch count: 27
Local read count: 120833
Local read latency: NaN ms
Local write count: 12094
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 91.54
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 568
Bloom filter off heap memory used: 544
Index summary off heap memory used: 112
Compression metadata off heap memory used: 32
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 420
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
The repair was done with nodetool repair -pr.
What is going on?
Percent repaired seems to be a misleading metric, as it refers to the percentage of SSTables repaired, but there are some conditions for it to be computed:
- the tables should not be from system keyspaces
- the tables should have a replication factor greater than 1
- the repair should be incremental or full (non-subrange)
When you use nodetool repair -pr, that invokes a full repair that won't be able to update this value.
For more information regarding incremental repairs, I would recommend this article from The Last Pickle. Since they adopted the maintenance of the Reaper tool, they have become an authority regarding repairs.
Executing nodetool repair -pr will repair the primary range owned by the node that command is executed on.
What does this mean? The node this command is executed on has data that it "owns", i.e., its primary range, but the node also contains data/replicas "owned" by other nodes. You are not repairing the replicas "owned" by other nodes.
Now, if you execute that command on every single node in the cluster (not data center), it will cover all the token ranges.
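A sweep like that can be scripted. A sketch that only builds the per-node commands - the host list and keyspace name are placeholders, and the execution line is left commented out so nothing runs by accident:

```python
# Hypothetical node addresses; replace with your cluster's hosts.
nodes = ["192.168.1.190", "192.168.1.191", "192.168.1.192"]

def repair_commands(hosts, keyspace):
    """Build the per-node commands for a primary-range repair sweep.
    Running `nodetool repair -pr` on every node covers all token ranges."""
    return [f"ssh {h} nodetool repair -pr {keyspace}" for h in hosts]

for cmd in repair_commands(nodes, "my_keyspace"):
    print(cmd)
    # subprocess.run(cmd, shell=True, check=True)  # uncomment to execute
```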
EDIT / NOTE:
My answer did not properly address the question. Although what I wrote is accurate, the answer to the question is stated in the answer above mine; basically, the percentage repaired is a value that is for incremental repair usage and is not affected by a full repair. (Incremental repair marks the repaired ranges as it works so it does not spend time re-repairing later.)
There are two other related posts:
NoSpamLogger.java Maximum memory usage reached Cassandra
in cassandra Maximum memory usage reached (536870912 bytes), cannot allocate chunk of 1048576 bytes
But they aren't exactly asking the same thing. I am asking for a thorough understanding of what this message means. It doesn't seem to impact my latency at the moment.
I ran nodetool cfstats:
SSTable count: 5
Space used (live): 1182782029
Space used (total): 1182782029
Space used by snapshots (total): 0
Off heap memory used (total): 802011
SSTable Compression Ratio: 0.17875764458149868
Number of keys (estimate): 34
Memtable cell count: 33607
Memtable data size: 5590408
Memtable off heap memory used: 0
Memtable switch count: 902
Local read count: 4689
Local read latency: NaN ms
Local write count: 51592342
Local write latency: 0.035 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 120
Bloom filter off heap memory used: 80
Index summary off heap memory used: 291
Compression metadata off heap memory used: 801640
Compacted partition minimum bytes: 447
Compacted partition maximum bytes: 2874382626
Compacted partition mean bytes: 164195240
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
The latency looks fine to me.
I also ran a histogram:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 35.43 0.00 1629722 35425
75% 0.00 42.51 0.00 129557750 2346799
95% 0.00 61.21 0.00 668489532 14530764
98% 0.00 73.46 0.00 2874382626 52066354
99% 0.00 88.15 0.00 2874382626 52066354
Min 0.00 11.87 0.00 447 11
Max 0.00 785.94 0.00 2874382626 52066354
The stats look fine to me! So what is Cassandra complaining about?
The comment in this JIRA ticket has an explanation: https://issues.apache.org/jira/browse/CASSANDRA-12221
Quote:
Wei Deng added a comment - 18/Jul/16 05:01
See CASSANDRA-5661. It's a cap to limit the amount of off-heap memory used by RandomAccessReader, and if there is a need, you can change the limit by file_cache_size_in_mb in cassandra.yaml.
The log message is relatively harmless. It indicates that the node's off-heap cache is full because the node is busy servicing reads.
The 134217728 bytes in the log message means that you have set file_cache_size_in_mb to 128 MB. You should consider setting it to the default 512 MB.
It is fine to see occasional occurrences of this message in the logs, which is why it is logged at INFO level. But if it gets logged repeatedly, it is an indicator that the node is getting overloaded, and you should consider increasing the capacity of your cluster by adding more nodes.
For more info, see my post on DBA Stack Exchange -- What does "Maximum memory usage reached" mean in the Cassandra logs?. Cheers!
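For reference, the setting mentioned above lives in cassandra.yaml. A fragment showing the suggested value (512 MB is the default in recent versions - check the defaults shipped with your own version):

```yaml
# cassandra.yaml -- cap on off-heap memory used for buffer pooling.
# The 134217728 bytes in the log message corresponds to a value of 128 here.
file_cache_size_in_mb: 512
```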
I have a 19-node Cassandra cluster for our internal service. If I log into a node and run nodetool commands like tablestats, does that gather stats just for that particular node or for the entire cluster?
The nodetool utility for Cassandra gathers stats for the entire cluster, not a single node.
For example, if you run a command like:
command:
nodetool tablestats musicdb.artist
result:
Keyspace: musicdb
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Flushes: 0
Table: artist
SSTable count: 1
Space used (live): 62073
Space used (total): 62073
Space used by snapshots (total): 0
Off heap memory used (total): 1400
SSTable Compression Ratio: 0.27975344141453456
Number of keys (estimate): 1000
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 1264
Bloom filter off heap memory used: 1256
Index summary off heap memory used: 128
Compression metadata off heap memory used: 16
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 149
Compacted partition mean bytes: 149
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
The status of the table artist in the keyspace musicdb above is from the entire cluster.
Most nodetool commands operate on a single node in the cluster if -h
is not used to identify one or more other nodes. If the node from
which you issue the command is the intended target, you do not need
the -h option to identify the target; otherwise, for remote
invocation, identify the target node, or nodes, using -h.
Nodetool Utility
I'm trying to figure out the throughput of my Cassandra cluster and can't figure out how to use nodetool to accomplish that. Below is a sample output:
Starting NodeTool
Keyspace: realtimetrader
Read Count: 0
Read Latency: NaN ms.
Write Count: 402
Write Latency: 0.09648756218905473 ms.
Pending Flushes: 0
Table: currencies
SSTable count: 1
Space used (live): 5254
Space used (total): 5254
Space used by snapshots (total): 0
Off heap memory used (total): 40
SSTable Compression Ratio: 0.0
Number of keys (estimate): 14
Memtable cell count: 1608
Memtable data size: 567
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 402
Local write latency: 0.106 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 16
Compression metadata off heap memory used: 8
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 149
Compacted partition mean bytes: 149
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
I run the command:
nodetool cfstats
to get this, take another reading some time later, and then subtract the earlier "Local read count" from the later one.
But I'm not sure what "Local" here means.
Does it mean it's local to that node, so in a ring of 5 nodes I should multiply the value by 5? Or will the simple subtraction give me the correct result?
Also, which JMX bean should I be looking at to get these numbers?
Have a look at this: nodetool cfstats.
I think what you are looking for is 'Read Latency' and 'Write Latency'.
These fields indicate how fast your reads/writes are in your cluster.
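To turn the counters into throughput, the usual approach is two readings a known interval apart. A sketch of the arithmetic - note these are per-node counts, and with RF > 1 each client write is counted on every replica, so summing raw per-node rates across the ring overcounts client writes by roughly the replication factor:

```python
def throughput(count_then, count_now, interval_secs):
    """Ops/sec from two successive 'Local write count' (or read count)
    readings taken interval_secs apart on the same node."""
    return (count_now - count_then) / interval_secs

# Example: two nodetool cfstats samples taken 60 seconds apart on one node.
print(throughput(51592342, 51604342, 60))  # 200.0 writes/sec on this node
```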