How to read the Cassandra nodetool histograms percentile and other columns?
Percentile  SSTables  Write Latency (micros)  Read Latency (micros)  Partition Size (bytes)  Cell Count
50% 1.00 14.24 4055.27 149 2
75% 35.00 17.08 17436.92 149 2
95% 35.00 24.60 74975.55 642 2
98% 86.00 35.43 129557.75 770 2
99% 103.00 51.01 186563.16 770 2
Min 0.00 2.76 51.01 104 2
Max 124.00 36904729.27 12359319.16 924 2
They show the distribution of the metrics. For example, in your data the write latency for 95% of the requests was 24.60 microseconds or less, and 95% of the partitions are 642 bytes or less with 2 cells. The SSTables column is how many SSTables are touched on a read, so 95% of read requests looked at 35 SSTables or fewer (which is fairly high).
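To make the percentile idea concrete, here is a minimal Scala sketch (not nodetool's actual algorithm; nodetool estimates these values from fixed histogram buckets rather than raw samples):

def percentile(samples: Seq[Double], p: Double): Double = {
  val sorted = samples.sorted
  val rank = math.max(1, math.ceil((p / 100.0) * sorted.size).toInt)
  sorted(rank - 1) // smallest value that at least p% of the samples fall at or below
}

// example: percentile(Seq(10.0, 12.0, 14.0, 25.0, 40.0), 95.0) == 40.0
// In the table above, 24.60 plays the same role for the write-latency samples at 95%.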
Related
I am reading from a Kafka topic which has 5 partitions. Since 5 cores are not sufficient to handle the load, I am repartitioning the input to 30 partitions. I have given 30 cores to my Spark process, with 6 cores on each executor. With this setup I assumed that each executor would get 6 tasks, but more often than not we are seeing that one executor gets 4 tasks while others get 7. It is skewing the processing time of our job.
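Roughly what the setup looks like (a simplified sketch assuming Structured Streaming; the broker address and topic name are placeholders and the downstream processing is omitted):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-consumer").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "input_topic")                // placeholder: the 5-partition topic
  .load()

// 5 Kafka partitions -> 30 Spark partitions, so all 30 cores (6 per executor) have work;
// repartition() shuffles the rows round-robin across the new partitions.
val repartitioned = input.repartition(30)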
Can someone help me understand why all the executors do not get an equal number of tasks? Here are the executor metrics after the job has run for 12 hours.
Address | Status | RDD Blocks | Storage Memory | Disk Used | Cores | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time (GC Time) | Input | Shuffle Read | Shuffle Write
ip1:36759 | Active | 7 | 1.6 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 442506 | 442512 | 35.9 h (26 min) | 42.1 GB | 25.9 GB | 24.7 GB
ip2:36689 | Active | 0 | 0.0 B / 128 MB | 0.0 B | 0 | 0 | 0 | 0 | 0 | 0 ms (0 ms) | 0.0 B | 0.0 B | 0.0 B
ip5:44481 | Active | 7 | 1.6 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 399948 | 399954 | 29.0 h (20 min) | 37.3 GB | 22.8 GB | 24.7 GB
ip1:33187 | Active | 7 | 1.5 MB / 144.7 MB | 0.0 B | 6 | 5 | 0 | 445720 | 445725 | 35.9 h (26 min) | 42.4 GB | 26 GB | 24.7 GB
ip3:34935 | Active | 7 | 1.6 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 427950 | 427956 | 33.8 h (23 min) | 40.5 GB | 24.8 GB | 24.7 GB
ip4:38851 | Active | 7 | 1.7 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 410276 | 410282 | 31.6 h (24 min) | 39 GB | 23.9 GB | 24.7 GB
As you can see, there is a skew in the number of tasks completed by ip5:44481. I don't see abnormal GC activity either.
What metrics should I be looking at to understand this skew?
UPDATE
Upon further debugging I can see that the partitions have unequal amounts of data, while the tasks are each given approximately the same number of records.
Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Killed Tasks | Succeeded Tasks | Shuffle Read Size / Records | Blacklisted
0 | ip3:37049 | 0.8 s | 6 | 0 | 0 | 6 | 600.9 KB / 272 | FALSE
1 | ip1:37875 | 0.6 s | 6 | 0 | 0 | 6 | 612.2 KB / 273 | FALSE
2 | ip3:41739 | 0.7 s | 5 | 0 | 0 | 5 | 529.0 KB / 226 | FALSE
3 | ip2:38269 | 0.5 s | 6 | 0 | 0 | 6 | 623.4 KB / 272 | FALSE
4 | ip1:40083 | 0.6 s | 7 | 0 | 0 | 7 | 726.7 KB / 318 | FALSE
These are the stats of the stage just after repartitioning. We can see that the number of tasks is proportional to the number of records. As a next step I am trying to see how the partitioning function works.
Update 2:
The only explanation I have come across is that Spark uses round-robin partitioning, and that it is executed independently on each input partition. For example, if there are 5 records on node1 and 7 records on node2, node1's round robin will distribute approximately 3 records to node1 and approximately 2 records to node2, while node2's round robin will distribute approximately 4 records to node1 and approximately 3 records to node2. So there is the possibility of having 7 records on node1 and 5 records on node2, depending on the ordering of the nodes as interpreted within the framework code for each individual node. (source)
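To illustrate the idea, here is a toy Scala simulation (an illustration only, not Spark's actual implementation; the per-partition pseudo-random starting slot is an assumption):

import scala.util.Random

// Each input partition deals its rows round-robin over the output partitions
// independently, starting from its own pseudo-random slot, so output partition
// sizes end up only approximately equal.
def roundRobinCounts(rowsPerInputPartition: Seq[Int], numOutput: Int): Map[Int, Int] =
  rowsPerInputPartition.zipWithIndex
    .flatMap { case (rows, inputIdx) =>
      val start = new Random(inputIdx).nextInt(numOutput) // assumed per-partition start
      (0 until rows).map(i => (start + i) % numOutput)
    }
    .groupBy(identity)
    .map { case (outputPartition, hits) => outputPartition -> hits.size }

// e.g. roundRobinCounts(Seq(272, 273, 226, 272, 318), 30) typically yields output
// partitions whose sizes differ by a few rows rather than being exactly equal.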
NOTE:
If you notice, the best-performing executors are on the same IP. Is that because, after shuffling, transferring data on the same host is faster than transferring it to another IP?
Based on the above data we can see that repartition is working fine, i.e. it assigns an equal number of records to the 30 partitions, but the question is why some executors get more partitions than others.
We are using Scylla version 4.4.1-0.20210406.00da6b5e9.
I am not able to understand the following:
How is nodetool cfhistograms showing more SSTables touched than the number of SSTables that actually exist according to cfstats?
Why is nodetool cfhistograms showing such a high number of SSTables? I even ran nodetool compact prior to this.
Table structure:
CREATE TABLE gauntlet_keyspace.user_match_mapping (
    user_id bigint,
    match_id bigint,
    team_id bigint,
    created_at timestamp,
    team_detail text,
    updated_at timestamp,
    user_login text,
    user_team_name text,
    PRIMARY KEY (user_id, match_id, team_id)
) WITH CLUSTERING ORDER BY (match_id ASC, team_id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
cfstats
Keyspace : gauntlet_keyspace
Read Count: 9861938
Read Latency: 0.00119595834003418 ms
Write Count: 0
Write Latency: NaN ms
Pending Flushes: 0
Table: user_match_mapping
SSTable count: 14
SSTables in each level: [14/4]
Space used (live): 51979706055
Space used (total): 51979706055
Space used by snapshots (total): 0
Off heap memory used (total): 92586912
SSTable Compression Ratio: 0.203467
Number of partitions (estimate): 4328815
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 9799345
Local read latency: 1.178 ms
Local write count: 0
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 6195424
Bloom filter off heap memory used: 6195368
Index summary off heap memory used: 86391544
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 771
Compacted partition maximum bytes: 785939
Compacted partition mean bytes: 63765
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
cfhistograms
Percentile  SSTables  Write Latency (micros)  Read Latency (micros)  Partition Size (bytes)  Cell Count
50% 0.00 0.00 1065.50 14237 72
75% 0.00 0.00 1286.50 105778 642
95% 1469901.75 0.00 1706.15 219342 1109
98% 9799345.00 0.00 1934.88 315852 1916
99% 9799345.00 0.00 2067.69 379022 1916
Min 0.00 0.00 554.00 771 5
Max 9799345.00 0.00 2202.00 785939 3973
Only this read query was run:
select * from gauntlet_keyspace.user_match_mapping where user_id=? and match_id=? and team_id=?;
I am trying to configure and benchmark my AWS EC2 instances for a Cassandra deployment with the DataStax Community Edition. I'm working with 1 cluster so far, and I'm having issues with horizontal scaling.
I'm running the cassandra-stress tool to stress the nodes and I'm not seeing horizontal scaling. My command is run from an EC2 instance that is on the same network as the nodes but is not itself a node (i.e. I'm not launching the command from one of the nodes).
I have inputted the following:
cassandra-stress write n=1000000 cl=one -mode native cql3 -schema keyspace="keyspace1" -pop seq=1..1000000 -node ip1,ip2
I started with 2 nodes, then 3, and then 6. But the numbers don't show me what Cassandra is supposed to do: adding more nodes to a cluster should speed up reads/writes.
Results:
Metric | 2 Nodes: 1M | 3 Nodes: 1M | 3 Nodes: 2M | 6 Nodes: 1M | 6 Nodes: 2M | 6 Nodes: 6M | 6 Nodes: 10M
op rate | 6858 | 6049 | 6804 | 7711 | 7257 | 7531 | 8081
partition rate | 6858 | 6049 | 6804 | 7711 | 7257 | 7531 | 8081
row rate | 6858 | 6049 | 6804 | 7711 | 7257 | 7531 | 8081
latency mean | 29.1 | 33 | 29.3 | 25.9 | 27.5 | 26.5 | 24.7
latency median | 24.9 | 32.1 | 24 | 22.6 | 23.1 | 21.8 | 21.5
latency 95th percentile | 57.9 | 73.3 | 62 | 50 | 56.2 | 52.1 | 40.2
latency 99th percentile | 76 | 92.2 | 77.4 | 65.3 | 69.1 | 61.8 | 46.4
latency 99.9th percentile | 87 | 103.4 | 83.5 | 76.2 | 75.7 | 64.9 | 48.1
latency max | 561.1 | 587.1 | 1075 | 503.1 | 521.7 | 1662.3 | 590.3
total gc count | 0 | 0 | 0 | 0 | 0 | 0 | 0
total gc mb | 0 | 0 | 0 | 0 | 0 | 0 | 0
total gc time (s) | 0 | 0 | 0 | 0 | 0 | 0 | 0
avg gc time (ms) | NaN | NaN | NaN | NaN | NaN | NaN | NaN
stdev gc time (ms) | 0 | 0 | 0 | 0 | 0 | 0 | 0
Total operation time | 0:02:25 | 0:02:45 | 0:04:53 | 0:02:09 | 0:04:35 | 0:13:16 | 0:20:37
Each test used the default keyspace1 that was provided.
I've tested 3 nodes at 1M and 2M iterations, and 6 nodes at 1M, 2M, 6M, and 10M iterations. As I increase the iteration count, the op rate only increases marginally.
Am I doing something wrong, or do I have Cassandra backwards? Right now RF = 1, as I don't want to add latency for replication. I just want to see the long-term horizontal scaling, which I'm not seeing.
Help?
We are in the process of researching a move to Cassandra (2.0.10) and we are testing the write and read performance.
While reading, we are seeing what seems to be low read throughput, 14 MB/s on average.
Our current testing environment is a single node: Xeon E5-1620 @ 3.7 GHz with 32 GB of RAM, Windows 7.
The Cassandra heap was set to 8 GB with the default concurrent reads and writes, the key cache size is set to 400 MB, and the data sits on a local RAID 10 array that sustains an average of 300 MB/s of sequential read performance at 64 KB and larger block sizes.
We are storing hourly sensor data with the current model:
CREATE TABLE IF NOT EXISTS sensor_data_by_day (
    sensor_id int,
    date text,
    event_time timestamp,
    load float,
    PRIMARY KEY ((sensor_id, date), event_time))
Reading is done by sensor, date, and a range of event times.
The current data set is 2 years' worth of data for 100K sensors, about 30 GB on disk.
Data is inserted by numerous threads (so the inserts are not sorted by event time, if that matters).
Reading back a day's worth of data takes about 2 minutes, with a throughput of 14 MB/s.
Reading is done using the java-cassandra-connector with a prepared statement:
Select event_time, load from sensor_data_by_day where sensor_id = ? and date in ('2014-02-02') and event_time >= ? and event_time < ?
We create one connection and submit tasks (100K queries, one per sensor) to an executor service with a pool of 100 threads.
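For context, a hedged sketch of that client pattern in Scala using the DataStax Java driver (the contact point, keyspace name, sensor-id range, and time bounds are placeholders):

import com.datastax.driver.core.Cluster
import java.util.Date
import java.util.concurrent.{Callable, Executors}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // placeholder host
val session = cluster.connect("sensors")                             // placeholder keyspace

val prepared = session.prepare(
  "SELECT event_time, load FROM sensor_data_by_day " +
    "WHERE sensor_id = ? AND date IN ('2014-02-02') AND event_time >= ? AND event_time < ?")

val from = new Date(1391299200000L) // placeholder: 2014-02-02 00:00 UTC
val to   = new Date(1391385600000L) // placeholder: 2014-02-03 00:00 UTC

val pool = Executors.newFixedThreadPool(100)

// One task per sensor (~100K tasks), all sharing the single session.
val futures = (1 to 100000).map { sensorId =>
  pool.submit(new Callable[Int] {
    override def call(): Int =
      session.execute(prepared.bind(Int.box(sensorId), from, to)).all().size()
  })
}
futures.foreach(_.get()) // wait for all reads to complete
pool.shutdown()
cluster.close()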
Reading when the data is in the cache takes about 7s.
It's probably not a client problem: when we tested with the data located on an SSD, the total time went down from 2 minutes to 10 seconds (~170 MB/s), which is understandably better given it's an SSD.
The read performance looks like a block-read-size issue, which could cause the low throughput if Cassandra were reading in 4 KB blocks. I read that the default was 256, but I couldn't find the setting anywhere to confirm it. Or is it perhaps a random I/O issue?
Is this the kind of read performance you should expect from Cassandra when using mechanical disks? Or perhaps a modeling problem?
Output of cfhistograms:
SSTables per Read
1 sstables: 844726
2 sstables: 90
Write Latency (microseconds)
No Data
Read Latency (microseconds)
5 us: 418
6 us: 15252
7 us: 12884
8 us: 15447
10 us: 34211
12 us: 48972
14 us: 48421
17 us: 56641
20 us: 12484
24 us: 8325
29 us: 6602
35 us: 4953
42 us: 5427
50 us: 3610
60 us: 1784
72 us: 2414
86 us: 11208
103 us: 38395
124 us: 82050
149 us: 64840
179 us: 40161
215 us: 30891
258 us: 17691
310 us: 8787
372 us: 4171
446 us: 2305
535 us: 1588
642 us: 1187
770 us: 913
924 us: 811
1109 us: 716
1331 us: 602
1597 us: 513
1916 us: 513
2299 us: 516
2759 us: 595
3311 us: 776
3973 us: 1086
4768 us: 1502
5722 us: 2212
6866 us: 3264
8239 us: 4852
9887 us: 7586
11864 us: 11429
14237 us: 17236
17084 us: 22285
20501 us: 26163
24601 us: 26799
29521 us: 24311
35425 us: 22101
42510 us: 19420
51012 us: 16497
61214 us: 13830
73457 us: 11356
88148 us: 8749
105778 us: 6243
126934 us: 4406
152321 us: 2751
182785 us: 1754
219342 us: 977
263210 us: 497
315852 us: 233
379022 us: 109
454826 us: 60
545791 us: 21
654949 us: 10
785939 us: 2
943127 us: 0
1131752 us: 1
Partition Size (bytes)
179 bytes: 151874
215 bytes: 0
258 bytes: 0
310 bytes: 0
372 bytes: 5071
446 bytes: 0
535 bytes: 4170
642 bytes: 3724
770 bytes: 3454
924 bytes: 3416
1109 bytes: 3489
1331 bytes: 9179
1597 bytes: 11616
1916 bytes: 12435
2299 bytes: 19038
2759 bytes: 20653
3311 bytes: 10245454
3973 bytes: 25121333
Cell Count per Partition
4 cells: 151874
5 cells: 0
6 cells: 0
7 cells: 0
8 cells: 5071
10 cells: 0
12 cells: 4170
14 cells: 0
17 cells: 3724
20 cells: 3454
24 cells: 3416
29 cells: 3489
35 cells: 3870
42 cells: 9982
50 cells: 13521
60 cells: 20108
72 cells: 16678
86 cells: 51646
103 cells: 35323903
What kind of compaction do you use? If you are seeing bad read latency from the disks, it is mostly because of the number of SSTables.
My suggestions:
If you are looking for better read latency, I would suggest using leveled compaction. Configure the SSTable size to avoid too many compactions (see the sketch at the end of this answer).
With leveled compaction, a read should touch at most as many SSTables as there are levels, so the performance will be much better.
This comes at the cost of an increased number of compactions (if the SSTable size is small) and higher disk I/O.
What is your current bloom filter size? Increasing it will decrease the probability of false positives, again improving reads.
You seem to have a pretty good key cache set up. If you have specific rows that are read frequently, you can turn on the row cache, though this is generally not recommended, as the advantage is minimal for most applications.
If the data is always going to be time series, maybe use date-tiered compaction?
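For the leveled compaction suggestion, a hedged sketch of the table change (Scala via the DataStax Java driver; the contact point and keyspace are placeholders, and 160 MB is simply the common default target size):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // placeholder host
val session = cluster.connect("sensors")                             // placeholder keyspace

// sstable_size_in_mb sets the target size of SSTables in each level; a larger value
// means fewer (but bigger) compactions.
session.execute(
  "ALTER TABLE sensor_data_by_day WITH compaction = " +
    "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}")
cluster.close()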
I'm trying to track down why my Node.js app all of a sudden uses 100% CPU. The app has around 50 concurrent connections and is running on an EC2 micro instance.
Below is the output of: strace -c node server.js
^C% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
87.32 0.924373 8 111657 epoll_wait
6.85 0.072558 3 22762 pread
2.55 0.026965 0 146179 write
0.92 0.009733 0 108434 1 futex
0.44 0.004661 0 82010 7 read
0.44 0.004608 0 223317 clock_gettime
0.31 0.003244 0 172467 gettimeofday
0.31 0.003241 35 93 brk
0.20 0.002075 0 75233 3 epoll_ctl
0.19 0.002052 0 23850 11925 accept4
0.19 0.001997 0 12302 close
0.19 0.001973 7 295 mmap
0.06 0.000617 4 143 munmap
And here is the output of: node-tick-processor
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
669160 97.4% 669160 97.4% /lib/x86_64-linux-gnu/libc-2.15.so
4834 0.7% 28 0.0% LazyCompile: *Readable.push _stream_readable.js:116
4750 0.7% 10 0.0% LazyCompile: *emitReadable _stream_readable.js:392
4737 0.7% 19 0.0% LazyCompile: *emitReadable_ _stream_readable.js:407
1751 0.3% 7 0.0% LazyCompile: ~EventEmitter.emit events.js:53
1081 0.2% 2 0.0% LazyCompile: ~<anonymous> _stream_readable.js:741
1045 0.2% 1 0.0% LazyCompile: ~EventEmitter.emit events.js:53
960 0.1% 1 0.0% LazyCompile: *<anonymous> /home/ubuntu/node/node_modules/redis/index.js:101
948 0.1% 11 0.0% LazyCompile: RedisClient.on_data /home/ubuntu/node/node_modules/redis/index.js:541
This is my first time debugging a node app. Are there any conclusions that can be drawn from the above debug output? Where could the error be?
Edit
My node version: v0.10.25
Edit 2
After updating node to: v0.10.33
Here is the output
^C% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
91.81 1.894522 8 225505 45 epoll_wait
3.58 0.073830 1 51193 pread
1.59 0.032874 0 235054 2 write
0.98 0.020144 0 1101789 clock_gettime
0.71 0.014658 0 192494 1 futex
0.57 0.011764 0 166704 21 read
This seems like a Node.js v0.10.25 bug with the event loop; look here.
Note, from this GitHub pull request:
If the same file description is open in two different processes, then
closing the file descriptor is not sufficient to deregister it from
the epoll instance (as described in epoll(7)), resulting in spurious
events that cause the event loop to spin repeatedly. So always
explicitly deregister it.
So as a solution, you can try updating your OS or updating Node.js.