Need help understanding cfhistograms output - cassandra

We are using Scylla version 4.4.1-0.20210406.00da6b5e9.
There are two things I am not able to understand:
How can nodetool cfhistograms show more sstables touched per read than the number of sstables that actually exist according to cfstats?
Why is nodetool cfhistograms showing such a high number of sstables? I even ran nodetool compact before this.
Table structure:
CREATE TABLE gauntlet_keyspace.user_match_mapping (
    user_id bigint,
    match_id bigint,
    team_id bigint,
    created_at timestamp,
    team_detail text,
    updated_at timestamp,
    user_login text,
    user_team_name text,
    PRIMARY KEY (user_id, match_id, team_id)
) WITH CLUSTERING ORDER BY (match_id ASC, team_id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
cfstats
Keyspace : gauntlet_keyspace
Read Count: 9861938
Read Latency: 0.00119595834003418 ms
Write Count: 0
Write Latency: NaN ms
Pending Flushes: 0
Table: user_match_mapping
SSTable count: 14
SSTables in each level: [14/4]
Space used (live): 51979706055
Space used (total): 51979706055
Space used by snapshots (total): 0
Off heap memory used (total): 92586912
SSTable Compression Ratio: 0.203467
Number of partitions (estimate): 4328815
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 9799345
Local read latency: 1.178 ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 6195424
Bloom filter off heap memory used: 6195368
Index summary off heap memory used: 86391544
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 771
Compacted partition maximum bytes: 785939
Compacted partition mean bytes: 63765
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
cfhistograms
Percentile      SSTables    Write Latency    Read Latency    Partition Size    Cell Count
                                 (micros)        (micros)           (bytes)
50%                 0.00             0.00         1065.50             14237            72
75%                 0.00             0.00         1286.50            105778           642
95%           1469901.75             0.00         1706.15            219342          1109
98%           9799345.00             0.00         1934.88            315852          1916
99%           9799345.00             0.00         2067.69            379022          1916
Min                 0.00             0.00          554.00               771             5
Max           9799345.00             0.00         2202.00            785939          3973
Only this read query was run:
select * from gauntlet_keyspace.user_match_mapping where user_id=? and match_id=? and team_id=?;
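One way to double-check this is to trace a single read and count how many sstables the trace mentions. Here is a minimal sketch using the Python cassandra-driver; the contact point and the bound key values are placeholders, not values from this cluster:

# Sketch: trace one read and print the trace events that mention sstables.
# Assumes the Python cassandra-driver is installed; the contact point and
# the bound key values below are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("gauntlet_keyspace")

query = ("SELECT * FROM user_match_mapping "
         "WHERE user_id=%s AND match_id=%s AND team_id=%s")
result = session.execute(query, (1, 1, 1), trace=True)

trace = result.get_query_trace()
for event in trace.events:
    if "sstable" in event.description.lower():
        print(event.source_elapsed, event.description)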

Related

Spark not giving equal tasks to all executors

I am reading from a Kafka topic which has 5 partitions. Since 5 cores are not sufficient to handle the load, I repartition the input to 30. I have given 30 cores to my Spark process, with 6 cores on each executor. With this setup I was assuming that each executor would get 6 tasks, but more often than not we see one executor getting 4 tasks while others get 7. This skews the processing time of our job.
Can someone help me understand why the executors do not get an equal number of tasks? Here are the executor metrics after the job has run for 12 hours.
Address     Status   RDD Blocks   Storage Memory      Disk Used   Cores   Active Tasks   Failed Tasks   Complete Tasks   Total Tasks   Task Time (GC Time)   Input     Shuffle Read   Shuffle Write
ip1:36759   Active   7            1.6 MB / 144.7 MB   0.0 B       6       6              0              442506           442512        35.9 h (26 min)       42.1 GB   25.9 GB        24.7 GB
ip2:36689   Active   0            0.0 B / 128 MB      0.0 B       0       0              0              0                0             0 ms (0 ms)           0.0 B     0.0 B          0.0 B
ip5:44481   Active   7            1.6 MB / 144.7 MB   0.0 B       6       6              0              399948           399954        29.0 h (20 min)       37.3 GB   22.8 GB        24.7 GB
ip1:33187   Active   7            1.5 MB / 144.7 MB   0.0 B       6       5              0              445720           445725        35.9 h (26 min)       42.4 GB   26 GB          24.7 GB
ip3:34935   Active   7            1.6 MB / 144.7 MB   0.0 B       6       6              0              427950           427956        33.8 h (23 min)       40.5 GB   24.8 GB        24.7 GB
ip4:38851   Active   7            1.7 MB / 144.7 MB   0.0 B       6       6              0              410276           410282        31.6 h (24 min)       39 GB     23.9 GB        24.7 GB
As you can see, there is a skew in the number of tasks completed by ip5:44481. I don't see abnormal GC activity either.
What metrics should I be looking at to understand this skew?
UPDATE
Upon further debugging I can see that the partitions hold unequal amounts of data, and the tasks are each given approximately the same number of records.
Executor ID   Address     Task Time   Total Tasks   Failed Tasks   Killed Tasks   Succeeded Tasks   Shuffle Read Size / Records   Blacklisted
0             ip3:37049   0.8 s       6             0              0              6                 600.9 KB / 272                FALSE
1             ip1:37875   0.6 s       6             0              0              6                 612.2 KB / 273                FALSE
2             ip3:41739   0.7 s       5             0              0              5                 529.0 KB / 226                FALSE
3             ip2:38269   0.5 s       6             0              0              6                 623.4 KB / 272                FALSE
4             ip1:40083   0.6 s       7             0              0              7                 726.7 KB / 318                FALSE
These are the stats for the stage just after repartitioning. We can see that the number of records is proportional to the number of tasks. As a next step I am trying to see how the partition function is working.
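For that, a quick check is to count records per partition right after the repartition. A rough PySpark sketch, with a placeholder DataFrame standing in for the actual streaming batch:

# Sketch: count records in each of the 30 partitions without collecting the data.
# "df" below is placeholder data, not the real Kafka input of this job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-skew-check").getOrCreate()
df = spark.range(0, 1000)  # placeholder standing in for the Kafka batch

counts = (df.repartition(30).rdd
          .mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))])
          .collect())
for idx, n in sorted(counts):
    print("partition %2d: %d records" % (idx, n))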
Update 2:
The only explanation I have come across is that Spark uses round-robin partitioning, and it is executed independently on each partition. For example, if there are 5 records on node1 and 7 records on node2, node1's round robin will distribute approximately 3 records to node1 and approximately 2 records to node2, while node2's round robin will distribute approximately 4 records to node1 and approximately 3 records to node2. So there is the possibility of ending up with 7 records on node1 and 5 records on node2, depending on the ordering of the nodes as interpreted within the framework code on each individual node. (source)
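To make that concrete, here is a tiny, purely illustrative Python simulation of independent per-source round-robin distribution; the start offsets are an assumption, not Spark's actual implementation:

# Illustration only: two source partitions of unequal size each deal their
# records round-robin across 2 target partitions. Start offsets are made up.
from collections import Counter

def round_robin(num_records, num_targets, start):
    # Deal each record to a target partition, round-robin from `start`.
    return [(start + i) % num_targets for i in range(num_records)]

# 5 records on node1 and 7 records on node2, dealt across 2 target partitions.
assignments = round_robin(5, 2, start=0) + round_robin(7, 2, start=0)
print(Counter(assignments))  # Counter({0: 7, 1: 5}) -- not an even 6 and 6

Because each source partition runs its own round robin, the per-target totals depend on how the individual deals line up, which can reproduce the kind of 5-vs-7 skew described above.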
NOTE:
If you notice, the best-performing executors are on the same IP. Is that because, after shuffling, transferring data on the same host is faster than transferring it to another IP?
Based on the above data we can see that repartition is working fine, i.e. assigning an equal number of records to the 30 partitions, but the question is why some executors are getting more partitions than others.

High disk I/O (read) on Cassandra nodes

We have a 3-node Cassandra cluster.
We have an application that uses a keyspace which creates a high read load on the disks. The problem has a cumulative effect: the more days we interact with the keyspace, the more the disk reading grows.
[screenshot: high read load]
Reading goes up to more than 700 MB/s. Then the storage (SAN) begins to degrade, and then the Cassandra cluster degrades as well.
UPD 25.10.2021: "I wrote it a little wrong; the SAN space is allocated to a virtual machine, like a normal drive."
The only thing that helps is clearing the keyspace.
Output of the "tpstats" and "cfstats" commands:
[cassandra-01 ~]$ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 1 1 1837888055 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 6789640 0 0
MutationStage 0 0 870873552 0 0
MemtableReclaimMemory 0 0 7402 0 0
PendingRangeCalculator 0 0 9 0 0
GossipStage 0 0 18939072 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 3 0 0
RequestResponseStage 0 0 1307861786 0 0
Native-Transport-Requests 0 0 2981687196 0 0
ReadRepairStage 0 0 346448 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 168 0 0
MemtablePostFlush 0 0 8193 0 0
PerDiskMemtableFlushWriter_0 0 0 7402 0 0
ValidationExecutor 0 0 21 0 0
Sampler 0 0 10988 0 0
MemtableFlushWriter 0 0 7402 0 0
InternalResponseStage 0 0 3404 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 71 0 0
CacheCleanupExecutor 0 0 0 0 0
Message type Dropped
READ 7
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 5
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
[cassandra-01 ~]$ nodetool cfstats box_messages -H
Total number of tables: 73
----------------
Keyspace : box_messages
Read Count: 48847567
Read Latency: 0.055540737801741485 ms
Write Count: 69461300
Write Latency: 0.010656743870327794 ms
Pending Flushes: 0
Table: messages
SSTable count: 6
Space used (live): 3.84 GiB
Space used (total): 3.84 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 10.3 MiB
SSTable Compression Ratio: 0.23265712113582082
Number of partitions (estimate): 4156030
Memtable cell count: 929912
Memtable data size: 245.04 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 92
Local read count: 20511450
Local read latency: 0.106 ms
Local write count: 52111294
Local write latency: 0.013 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 57318
Bloom filter false ratio: 0.00841
Bloom filter space used: 6.56 MiB
Bloom filter off heap memory used: 6.56 MiB
Index summary off heap memory used: 1.78 MiB
Compression metadata off heap memory used: 1.95 MiB
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 17084
Compacted partition mean bytes: 3287
Average live cells per slice (last five minutes): 2.0796939751354797
Maximum live cells per slice (last five minutes): 10
Average tombstones per slice (last five minutes): 1.1939751354797576
Maximum tombstones per slice (last five minutes): 2
Dropped Mutations: 5 bytes
(I'm unable to comment and hence am posting this as an answer.)
As folks mentioned, a SAN is not going to be the best fit here, and one could read through the list of anti-patterns documented here, which could also apply to OSS C*.

High OS load and poor performance on specific cassandra nodes in a cluster

I am using Cassandra v2.1.13 in a 10-node cluster with replication factor = 2 and LeveledCompactionStrategy. The nodes are c4.4xlarge instances with a 10 GB heap.
8 nodes in the cluster are performing fine, but there is an issue with 2 particular nodes. These nodes constantly give very poor read latency and have a high dropped-reads count. There is a continuously high OS load on these 2 nodes.
Below are the nodetool command results from one of these nodes:
status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.0.23.37 257.31 GB 256 19.0% 3e9ee62e-70a2-4b2e-ba10-290a62cd055b 1
UN 10.0.53.69 300.24 GB 256 20.5% 48988162-69d6-4698-9afa-799ef4be7bbc 2
UN 10.0.23.133 342.37 GB 256 21.1% 30431a62-0cf6-4c82-8af1-e9ba0025eba6 1
UN 10.0.53.7 348.52 GB 256 21.4% 5fcdeb25-e1e5-47f6-af7f-7bea825ab3c0 2
UN 10.0.53.88 292.59 GB 256 19.5% c77904bc-10a8-49e0-b6fa-8fe8126e064c 2
UN 10.0.53.250 272.76 GB 256 20.6% ecf417f2-2e96-4b9e-bb15-06eaf948cefa 2
UN 10.0.23.75 271.24 GB 256 20.8% d8b0ab1b-65ab-46cd-b7e4-3fb3861ffb23 1
UN 10.0.23.253 302.9 GB 256 21.0% 4bb6408a-9aa0-42da-96f7-dbe0dad757bc 1
UN 10.0.23.238 326.35 GB 256 18.2% 55e33a97-e5ca-4c48-a530-a0ff6fa8edde 1
UN 10.0.53.222 247.4 GB 256 18.0% c3a6e4c2-7ab6-4f3a-a444-8d6dff2beb43 2
cfstats
Keyspace: key_space_name
Read Count: 63815118
Read Latency: 88.71912845022085 ms.
Write Count: 40802728
Write Latency: 1.1299861338192878 ms.
Pending Flushes: 0
Table: table1
SSTable count: 1269
SSTables in each level: [1, 10, 103/100, 1023/1000, 131, 0, 0, 0, 0]
Space used (live): 274263401275
Space used (total): 274263401275
Space used by snapshots (total): 0
Off heap memory used (total): 1776960519
SSTable Compression Ratio: 0.3146938954387242
Number of keys (estimate): 1472406972
Memtable cell count: 3840240
Memtable data size: 96356032
Memtable off heap memory used: 169478569
Memtable switch count: 149
Local read count: 47459095
Local read latency: 0.328 ms
Local write count: 14501016
Local write latency: 0.695 ms
Pending flushes: 0
Bloom filter false positives: 186
Bloom filter false ratio: 0.00000
Bloom filter space used: 1032396536
Bloom filter off heap memory used: 1032386384
Index summary off heap memory used: 495040742
Compression metadata off heap memory used: 80054824
Compacted partition minimum bytes: 216
Compacted partition maximum bytes: 3973
Compacted partition mean bytes: 465
Average live cells per slice (last five minutes): 0.1710125211397823
Maximum live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
Table: table2
SSTable count: 93
SSTables in each level: [1, 10, 82, 0, 0, 0, 0, 0, 0]
Space used (live): 18134115541
Space used (total): 18134115541
Space used by snapshots (total): 0
Off heap memory used (total): 639297085
SSTable Compression Ratio: 0.2889927549599339
Number of keys (estimate): 102804187
Memtable cell count: 409595
Memtable data size: 492365311
Memtable off heap memory used: 529339207
Memtable switch count: 433
Local read count: 16357463
Local read latency: 345.194 ms
Local write count: 26302779
Local write latency: 1.370 ms
Pending flushes: 0
Bloom filter false positives: 4
Bloom filter false ratio: 0.00000
Bloom filter space used: 73133360
Bloom filter off heap memory used: 73132616
Index summary off heap memory used: 30985070
Compression metadata off heap memory used: 5840192
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 24601
Compacted partition mean bytes: 474
Average live cells per slice (last five minutes): 0.9915609172937249
Maximum live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 2 0 43617272 0 0
ReadStage 45 189 64039921 0 0
RequestResponseStage 0 0 37790267 0 0
ReadRepairStage 0 0 3590974 0 0
CounterMutationStage 0 0 0 0 0
MiscStage 0 0 0 0 0
HintedHandoff 0 0 18 0 0
GossipStage 0 0 457469 0 0
CacheCleanupExecutor 0 0 0 0 0
InternalResponseStage 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
CompactionExecutor 1 1 60965 0 0
ValidationExecutor 0 0 646 0 0
MigrationStage 0 0 0 0 0
AntiEntropyStage 0 0 1938 0 0
PendingRangeCalculator 0 0 17 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 1997 0 0
MemtablePostFlush 0 0 4884 0 0
MemtableReclaimMemory 0 0 1997 0 0
Native-Transport-Requests 11 0 61321377 0 0
Message type Dropped
READ 56788
RANGE_SLICE 0
_TRACE 0
MUTATION 182
COUNTER_MUTATION 0
BINARY 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 1
netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 2910144
Mismatch (Blocking): 0
Mismatch (Background): 229431
Pool Name Active Pending Completed
Commands n/a 0 37851777
Responses n/a 0 101128958
compactionstats
pending tasks: 0
I verified that there is no continuous compaction running.
Upon taking a jstack on both nodes, I found threads that were continuously executing the following stack in an infinite loop. My guess is that the high OS load is caused by these threads being stuck in this loop, which leads to the slow reads.
SharedPool-Worker-7 - priority:5 - threadId:0x00002bb0356a0150 - nativeId:0x158ef - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.cassandra.utils.memory.HeapAllocator.allocate(HeapAllocator.java:34)
at org.apache.cassandra.utils.memory.AbstractAllocator.clone(AbstractAllocator.java:34)
at org.apache.cassandra.db.NativeCell.localCopy(NativeCell.java:58)
at org.apache.cassandra.db.CollationController$2.apply(CollationController.java:223)
at org.apache.cassandra.db.CollationController$2.apply(CollationController.java:220)
at com.google.common.collect.Iterators$8.transform(Iterators.java:794)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:175)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:156)
at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:264)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:108)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:82)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:314)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:2001)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1844)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:353)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:85)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
at java.lang.Thread.run(Thread.java:748)
We tried increasing the instance size of the affected nodes, suspecting that they might be blocked on I/O operations, but that did not help.
Can someone please help me figure out what is wrong with these 2 nodes of the cluster?

How to read the cassandra nodetool histograms percentile and other columns?

How do I read the Cassandra nodetool histograms percentile and other columns?
Percentile      SSTables    Write Latency    Read Latency    Partition Size    Cell Count
                                 (micros)        (micros)           (bytes)
50%                 1.00            14.24         4055.27               149             2
75%                35.00            17.08        17436.92               149             2
95%                35.00            24.60        74975.55               642             2
98%                86.00            35.43       129557.75               770             2
99%               103.00            51.01       186563.16               770             2
Min                 0.00             2.76           51.01               104             2
Max               124.00      36904729.27     12359319.16               924             2
They show the distribution of the metrics. For example, in your data the write latency for 95% of the requests was 24.60 microseconds or less, and 95% of the partitions are 642 bytes or less with 2 cells. The SSTables column is how many sstables are touched on a read, so 95% of read requests are looking at 35 sstables or fewer (this is fairly high).
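As a rough illustration of how a percentile row is read (the samples below are invented, and nodetool itself reports from bucketed histograms rather than raw samples):

# Sketch: reading a percentile column, using made-up per-read samples of
# "sstables touched per read".
import numpy as np

sstables_per_read = [1, 1, 2, 1, 35, 3, 1, 86, 2, 1, 103, 1, 124, 2, 1]  # invented

for p in (50, 75, 95, 98, 99):
    print("%d%%: %.2f" % (p, np.percentile(sstables_per_read, p)))
# A row such as "95%  35.00" means 95% of reads touched 35 sstables or fewer.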

Issues with Scaling horizontally with Cassandra NoSQL

I am trying to configure and benchmark my AWS EC2 instances for a Cassandra distribution with the DataStax Community Edition. I'm working with 1 cluster so far, and I'm having issues with horizontal scaling.
I'm running the cassandra-stress tool to stress the nodes and I'm not seeing horizontal scaling. The command is run from an EC2 instance that is on the same network as the nodes but is not one of the nodes (i.e. I'm not launching the command from a node).
I ran the following:
cassandra-stress write n=1000000 cl=one -mode native cql3 -schema keyspace="keyspace1" -pop seq=1..1000000 -node ip1,ip2
I started with 2 nodes, then 3, then 6. But the numbers don't show what Cassandra is supposed to do: adding more nodes to a cluster should speed up reads/writes.
Results:
                            2 Nodes: 1M   3 Nodes: 1M   3 Nodes: 2M   6 Nodes: 1M   6 Nodes: 2M   6 Nodes: 6M   6 Nodes: 10M
op rate                     6858          6049          6804          7711          7257          7531          8081
partition rate              6858          6049          6804          7711          7257          7531          8081
row rate                    6858          6049          6804          7711          7257          7531          8081
latency mean                29.1          33            29.3          25.9          27.5          26.5          24.7
latency median              24.9          32.1          24            22.6          23.1          21.8          21.5
latency 95th percentile     57.9          73.3          62            50            56.2          52.1          40.2
latency 99th percentile     76            92.2          77.4          65.3          69.1          61.8          46.4
latency 99.9th percentile   87            103.4         83.5          76.2          75.7          64.9          48.1
latency max                 561.1         587.1         1075          503.1         521.7         1662.3        590.3
total gc count              0             0             0             0             0             0             0
total gc mb                 0             0             0             0             0             0             0
total gc time (s)           0             0             0             0             0             0             0
avg gc time(ms)             NaN           NaN           NaN           NaN           NaN           NaN           NaN
stdev gc time(ms)           0             0             0             0             0             0             0
Total operation time        0:02:25       0:02:45       0:04:53       0:02:09       00:04.35      0:13:16       0:20:37
Each run used the default keyspace1 that was provided.
At 3 nodes I tested 1M and 2M iterations; at 6 nodes I tried 1M, 2M, 6M, and 10M. As I increase the iterations, the op rate only increases marginally.
Am I doing something wrong, or do I have Cassandra backwards? Right now RF = 1, as I don't want to add latency for replication. I just want to see horizontal scaling over the long term, and I'm not seeing it.
Help?
