What does LOAD in nodetool status measure?

I am observing higher load on a Cassandra node (compared to other nodes in the ring) and I am looking for help interpreting this data. I have anonymized my IPs, but the snippet below shows a comparison of the "good" node .199 (load 14G) and the "bad" node .159 (load 25G):
nodetool status|grep -E '199|159'
UN XXXXX.159 25.2 GB 256 ? ffda4798-tokentoken XXXXXX
UN XXXXX.199 13.37 GB 256 ? c3a49dca-tokentoken YYYY
Note load is almost 2x on .159. Yet neither memory nor disk usage explain/support this:
.199 (low load box) data -- memory at about 34%, disk 50-60G:
top|grep apache_cassan
28950 root 20 0 24.353g 0.010t 1.440g S 225.3 34.2 25826:35 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.448g S 212.4 34.2 25826:41 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.452g S 219.7 34.3 25826:48 apache_cassandr
28950 root 20 0 24.357g 0.011t 1.460g S 250.5 34.3 25826:55 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 559G 47G 513G 9% /cassandra/data_dir_a
/dev/sdf1 559G 63G 497G 12% /cassandra/data_dir_b
/dev/sdg1 559G 54G 506G 10% /cassandra/data_dir_c
/dev/sdh1 559G 57G 503G 11% /cassandra/data_dir_d
.159 (high load box) data -- memory at about 28%, disk 20-40G:
top|grep apache_cassan
25354 root 20 0 36.297g 0.017t 8.608g S 414.7 27.8 170:42.81 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.608g S 272.2 27.8 170:51.00 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.612g S 129.7 27.8 170:54.90 apache_cassandr
25354 root 20 0 36.354g 0.017t 8.625g S 94.1 27.8 170:57.73 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 838G 17G 822G 2% /cassandra/data_dir_a
/dev/sdf1 838G 11G 828G 2% /cassandra/data_dir_b
/dev/sdg1 838G 35G 804G 5% /cassandra/data_dir_c
/dev/sdh1 838G 26G 813G 4% /cassandra/data_dir_d
TL;DR version -- what does the nodetool status 'Load' column actually measure/report?

The nodetool status command provides the following information:
Status - U (up) or D (down)
Indicates whether the node is functioning or not.
Load - updates every 90 seconds
The amount of file system data under the Cassandra data directory, excluding all content in the snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up (such as TTL-expired cells or tombstoned data) is counted.
For more information, see the nodetool status output description.
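In practice you can cross-check the reported Load against the on-disk size of the data directories minus snapshots. A minimal shell sketch, assuming GNU du and the data_dir_* mount points shown in the df output above (adjust to your cassandra.yaml data_file_directories); the numbers won't match exactly, since Load refreshes roughly every 90 seconds, but a large gap usually points at snapshots or files Cassandra is no longer tracking:
for d in /cassandra/data_dir_a /cassandra/data_dir_b /cassandra/data_dir_c /cassandra/data_dir_d; do
    # size of live data, skipping any snapshots subdirectories
    du -sh --exclude=snapshots "$d"
done
# compare the sum against the Load column for this node
nodetool status | grep -E '199|159'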

Related

What is AWS Cloudwatch Agent disk_used_percent measuring? It does not match the usage I see with lsblk or df

I have a t4g.large EC2 instance, running Ubuntu 22.04, with a single 30GB storage volume. I have installed and configured the Cloudwatch Agent to monitor disk usage.
Right now, the metrics on Cloudwatch show that the disk is 56% full.
If I run lsblk -f, I see this (I deleted the uuid column for conciseness):
NAME FSTYPE FSVER LABEL FSAVAIL FSUSE% MOUNTPOINTS
loop0 squashfs 4.0 0 100% /snap/core20/1699
loop1 squashfs 4.0 0 100% /snap/amazon-ssm-agent/5657
loop2 squashfs 4.0
loop3 squashfs 4.0 0 100% /snap/lxd/23545
loop4 squashfs 4.0 0 100% /snap/core18/2658
loop5 squashfs 4.0 0 100% /snap/core18/2636
loop6 squashfs 4.0 0 100% /snap/snapd/17885
loop7 squashfs 4.0 0 100% /snap/amazon-ssm-agent/6313
loop8 squashfs 4.0 0 100% /snap/core20/1740
nvme0n1
├─nvme0n1p1 ext4 1.0 cloudimg-rootfs 2.9G 90% /
└─nvme0n1p15 vfat FAT32 UEFI 92.4M 5% /boot/efi
If I run df -h, I see this:
Filesystem Size Used Avail Use% Mounted on
/dev/root 29G 27G 2.9G 91% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 1.6G 1.1M 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p15 98M 5.1M 93M 6% /boot/efi
tmpfs 782M 8.0K 782M 1% /run/user/1000
I don't understand where 56% could be coming from. Even if the Cloudwatch agent is doing a sum over all of the mount points, it would come out to ~75%, not 56%.
This is my config for the agent:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "aggregation_dimensions": [
      [
        "InstanceId"
      ]
    ],
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "collectd": {
        "metrics_aggregation_interval": 60
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "statsd": {
        "metrics_aggregation_interval": 60,
        "metrics_collection_interval": 30,
        "service_address": ":8125"
      }
    }
  }
}
I tried changing "*" to "/" or "/dev/root" in the resources, and restarted the agent, but it has not made any difference in the reported value.
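For reference, this is roughly how the agent is told to reload an edited configuration (a sketch, assuming the standard install path; point file: at wherever your JSON config actually lives):
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json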
Edit: I've now deleted a bunch of files and lsblk reports 33% disk usage at the "/" mount point, while CloudWatch says 52%.
I figured it out. The culprit is this part of the config:
"aggregation_dimensions": [
[
"InstanceId"
]
],
This means that the agent sends an "aggregate" value to CloudWatch, which is what I was using by accident. To get this aggregate, I navigated through the Metrics in the CloudWatch GUI via "CWAgent" - "InstanceId" - "disk_used_percent". This reports a set of data points for each point in time - all the results for all the different paths that the agent is reporting on. From there you can select "average", "max", "min", etc. to use this data. I had selected "average".
What I should have done was navigate through "CWAgent" - "ImageId, InstanceId, InstanceType, device, fstype, path" - "disk_used_percent" for path /. Then I would be looking at only the value for that path, there would be only one sample per time step, and it would match what I see in the terminal.
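The same per-path value can also be pulled from the CLI by supplying the full dimension set the agent attaches. A sketch with placeholder instance, image, and time values (the device and fstype must match what the agent reports, here nvme0n1p1/ext4 from the lsblk output above):
aws cloudwatch get-metric-statistics \
    --namespace CWAgent \
    --metric-name disk_used_percent \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
                 Name=ImageId,Value=ami-0123456789abcdef0 \
                 Name=InstanceType,Value=t4g.large \
                 Name=device,Value=nvme0n1p1 \
                 Name=fstype,Value=ext4 \
                 Name=path,Value=/ \
    --start-time 2023-01-01T00:00:00Z \
    --end-time 2023-01-01T01:00:00Z \
    --period 300 \
    --statistics Average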
Note: If you really want to dive deep, you can check out the collectd config at /etc/collectd/collectd.conf, which has a config for "". This should point you to the path where collectd is storing the df information that the CloudWatch agent is reading.

How to persistently increase the default storage of an Azure instance type via Terraform?

My use case is to increase the default disk space of an Azure VM (Linux) persistently, not ephemerally.
These are the facts:
created the VM via Terraform with instance type "Standard_D16s_v3", which has 32 GB of disk space available by default
I intend to increase/add 300 GB of disk space
But I have not managed to attach a permanent disk to the Linux VM.
I tried with:
Creating a managed data disk:
###managed storage disk creation for jumphost#####
resource "azurerm_managed_disk" "jump_disk" {
name = "jump_data1"
resource_group_name = "${var.resource_group_name}"
location = "${azurerm_resource_group.resource_group.location}"
storage_account_type = "Standard_LRS"
create_option = "Empty"
disk_size_gb = "300"
}
And attaching it later to the Linux VM:
...
storage_os_disk {
  name              = "${var.resource_group_name}-osdisk"
  caching           = "ReadWrite"
  create_option     = "FromImage"
  managed_disk_type = "Standard_LRS"
}
storage_data_disk {
  name            = "jump_data1"
  managed_disk_id = "${azurerm_managed_disk.jump_disk.id}"
  disk_size_gb    = "300"
  create_option   = "Attach"
  lun             = 0
}
...
But at the OS level I only get the /mnt/resource folder, with a warning that its data is ephemeral and will be lost after stopping/starting the instance... and we shut down and restart often to save costs:
[root@d021970-md300 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 32G 3.0G 29G 10% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 63G 9.1M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda1 497M 101M 396M 21% /boot
/dev/sdb1 252G 2.1G 237G 1% /mnt/resource
tmpfs 13G 0 13G 0% /run/user/1000
[root@d021970-md300 ~]# cd /mnt/resource
[root@d021970-md300 resource]# ls
DATALOSS_WARNING_README.txt lost+found swapfile
[root@d021970-md300 resource]#
Does anybody know how to increase disk space non-ephemerally, so that all data is kept after stopping and starting the VM?
Thanks in advance.
Thomas
You need to provide the storage_service_name or media_link parameter when creating the disk, which will indicate that a new disk is to be created.
I'm also not sure LUN 0 should be used, as this would typically be the OS disk. Try LUN 1.
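Independently of the Terraform parameters, it is worth checking whether the attached data disk actually shows up inside the VM: /dev/sdb1 mounted at /mnt/resource is Azure's ephemeral resource disk, and an attached managed disk normally appears as an extra block device that still needs a filesystem and a persistent mount. A sketch, assuming the new disk turns up as /dev/sdc and using a hypothetical /data mount point:
lsblk                                                    # look for the extra 300 GB device, e.g. /dev/sdc
sudo parted -s /dev/sdc mklabel gpt mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/sdc1
sudo mkdir -p /data
sudo mount /dev/sdc1 /data
UUID=$(sudo blkid -s UUID -o value /dev/sdc1)
echo "UUID=$UUID /data ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab   # survives stop/start, unlike /mnt/resource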

How to debug why hints don't get processed after all nodes are up again

Did some extended maintenance on node d1r1n3 of a 14-node DSC 2.1.15 cluster today, but finished well within the cluster's max hint window.
After bringing the node back up, most other nodes' hints disappeared again within minutes, except on two nodes (d1r1n4 and d1r1n7), where only part of the hints went away.
After a few hours of still showing 1 active hinted handoff task, I restarted node d1r1n7, and then d1r1n4 quickly emptied its hints table.
How do I see which node the hints stored on d1r1n7 are destined for?
And possibly, how do I get those hints processed?
Update:
Later, around the end of the max hint window after taking node d1r1n3 offline for maintenance, I found that d1r1n7's hints had vanished, leaving us with a confused feeling of whether this was okay or not. Had the hints been processed okay, or had they somehow just expired at the end of the max hint window?
If the latter, would we need to run a repair on node d1r1n3 after its maintenance (this takes quite some time and IO... :/)? What if we now applied read [LOCAL]QUORUM instead of the current read ONE (one DC, RF=3) - could this trigger read-path repairs on an as-needed basis and maybe spare us a full repair in this case?
Answer: it turned out hinted_handoff_throttle_in_kb was at the default 1024 on these two nodes, while the rest of the cluster was at 65536 :)
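Regarding that finding: a quick way to spot such a mismatch across the cluster is to grep the setting on every node. A sketch, assuming the stock config path and SSH access; the host names are placeholders for your own:
for h in d1r1n1 d1r1n2 d1r1n3 d1r1n4 d1r1n5 d1r1n6 d1r1n7; do
    echo -n "$h: "
    ssh "$h" grep hinted_handoff_throttle_in_kb /etc/cassandra/cassandra.yaml
done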
Hints are stored in Cassandra 2.1.15 in the system.hints table:
cqlsh> describe table system.hints;
CREATE TABLE system.hints (
target_id uuid,
hint_id timeuuid,
message_version int,
mutation blob,
PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (hint_id ASC, message_version ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = 'hints awaiting delivery'
AND compaction = {'enabled': 'false', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The target_id correlates with the node's host ID.
For example, in my sample 2-node cluster with RF=2:
nodetool status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
I executed the following while node2 was down
cqlsh> insert into ks.cf (key,val) values (1,1);
cqlsh> select * from system.hints;
target_id | hint_id | message_version | mutation
--------------------------------------+--------------------------------------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa | e80a6230-ec8c-11e6-a1fd-d743d945c76e | 8 | 0x0004000000010000000101cfb4fba0ec8c11e6a1fdd743d945c76e7fffffff80000000000000000000000000000002000300000000000547df7ba68692000000000006000376616c0000000547df7ba686920000000400000001
(1 rows)
As can be seen, system.hints.target_id correlates with the Host ID shown in nodetool status (1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa).
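So, to answer the original question for a node like d1r1n7, you can list the distinct target_id values it is holding hints for and match each one against the Host ID column of nodetool status. A sketch, run on the node that holds the hints (assuming cqlsh can connect to the local node without extra auth flags):
cqlsh -e "SELECT DISTINCT target_id FROM system.hints;"
# then look up each target in the ring, e.g.:
nodetool status | grep 1ca6779d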

Add node(s), new rack

I have a 16-node Cassandra 2.1.11 cluster, divided into 2 racks (R0 and R1), 8 nodes per rack. Each node serves about 700 GB of data. The cluster looks pretty balanced. Each node has 2x3 TB HDDs.
Datacenter: DC0
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.21.72 677.37 GB 256 ? 0260665f-5f5c-4fbc-9583-8da86848713a R0
UN 192.168.21.73 658.8 GB 256 ? ed1e7814-f715-41c8-97c9-8b52164835f9 R1
UN 192.168.21.74 676.97 GB 256 ? 62833182-b339-46f2-9370-0e23f6bb7eab R1
UN 192.168.21.75 657.1 GB 256 ? ab31f28b-ffea-489f-a4da-3b5120760b8e R1
UN 192.168.21.76 690.29 GB 256 ? e636bf2e-89d6-4bf3-9263-cf1ed67fcbd9 R1
UN 192.168.21.77 679.77 GB 256 ? 959e5207-1251-4c58-afa9-3910b5a27ff5 R1
UN 192.168.21.78 648.85 GB 256 ? 6f650315-1cd1-4169-b300-391800be974f R1
UN 192.168.21.79 675.96 GB 256 ? 324bd475-b5f6-4b39-a753-0cd2c80a46c4 R1
UN 192.168.21.65 636.01 GB 256 ? 65e3faa1-e8d5-4d78-a87e-bfde1f4095a5 R0
UN 192.168.21.66 674.89 GB 256 ? 213696eb-c4a0-4803-a9b3-0efd04c567f2 R0
UN 192.168.21.67 716.77 GB 256 ? 62542a8e-8177-4f13-9077-ea2426607ace R0
UN 192.168.21.68 666.1 GB 256 ? a9864059-3de2-48a2-a926-00db3f9791ee R0
UN 192.168.21.69 691.9 GB 256 ? 02ea1b28-90f9-4837-8173-ff79fa6966d7 R0
UN 192.168.21.70 681.16 GB 256 ? a9c8adae-e54e-4e8e-a333-eb9b2b52bfed R0
UN 192.168.21.71 653.18 GB 256 ? 6aa8cf0c-069a-4049-824a-8359d1c58e59 R0
UN 192.168.21.80 694.14 GB 256 ? 7abb5609-7dca-465a-a68c-972e54469ad6 R1
Now I'm trying to expand the cluster by adding 16 more nodes, also divided into 2 racks (R2 and R3). After adding all the new nodes I expect a 32-node cluster, divided into 4 racks, with 350 GB of data on each node.
I add one node at a time according to the Cassandra documentation. I started the Cassandra process on the first node, with the same configuration as the existing nodes, but in the new R3 rack. This caused 16 streams from existing nodes to the newly added node, 250 GB of data in each; all data was successfully transferred to the new node, and at this point the process looked normal.
But after that data lands on the new node, its size as shown by nodetool status starts to increase; it already says 1.7 TB and keeps growing:
UJ 192.168.21.89 1.69 TB 256 ? 42a80db9-59d6-44b6-b79c-ac7819f69cee R3
This is the opposite of what I expected (350 GB per node, not 1.7 TB).
4 TB of disk space out of 6 TB is already used by the Cassandra data dir on the new node.
I thought this wasn't normal and stopped the process.
Now I'm wondering what I'm doing wrong and what I should do to add the 16 nodes properly, so that I end up with 32 nodes with 350 GB on each. Should I expand the existing racks instead of adding new ones? Should I calculate tokens for the new nodes? Any other options?
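While a node is joining, it is worth confirming whether the extra space is just streamed-but-not-yet-compacted SSTables rather than genuinely duplicated data. A sketch, with both commands run on the joining node:
nodetool netstats          # active incoming streams and how much each has transferred
nodetool compactionstats   # pending compactions; a large backlog temporarily inflates Load
Also note that after each new node finishes joining, the pre-existing nodes keep the ranges they no longer own until nodetool cleanup is run on them, so their Load will not shrink on its own.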

Ceph OSD always 'down' in Ubuntu 14.04.1

I am trying to install and deploy a Ceph cluster. As I don't have enough physical servers, I created 4 VMs on my OpenStack using the official Ubuntu 14.04 image. I want to deploy a cluster with 1 monitor node and 3 OSD nodes, with ceph version 0.80.7-0ubuntu0.14.04.1. I followed the steps from the manual deployment document and successfully installed the monitor node. However, after the installation of the OSD nodes, it seems that the OSD daemons are running but do not correctly report to the monitor node. The OSD tree always shows them as down when I run ceph --cluster cephcluster1 osd tree.
Following are the commands and corresponding results that may be related to my problem.
root@monitor:/home/ubuntu# ceph --cluster cephcluster1 osd tree
# id weight type name up/down reweight
-1 3 root default
-2 1 host osd1
0 1 osd.0 down 1
-3 1 host osd2
1 1 osd.1 down 1
-4 1 host osd3
2 1 osd.2 down 1
root@monitor:/home/ubuntu# ceph --cluster cephcluster1 -s
cluster fd78cbf8-8c64-4b12-9cfa-0e75bc6c8d98
health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 3/3 in osds are down
monmap e1: 1 mons at {monitor=172.26.111.4:6789/0}, election epoch 1, quorum 0 monitor
osdmap e21: 3 osds: 0 up, 3 in
pgmap v22: 192 pgs, 3 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
192 creating
The configuration file /etc/ceph/cephcluster1.conf on all nodes:
[global]
fsid = fd78cbf8-8c64-4b12-9cfa-0e75bc6c8d98
mon initial members = monitor
mon host = 172.26.111.4
public network = 10.5.0.0/16
cluster network = 172.26.111.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
filestore xattr use omap = true
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1
[osd]
osd journal size = 1024
[osd.0]
osd host = osd1
[osd.1]
osd host = osd2
[osd.2]
osd host = osd3
Logs when I start one of the OSD daemons via start ceph-osd cluster=cephcluster1 id=x, where x is the OSD ID:
/var/log/ceph/cephcluster1-osd.0.log on the OSD node #1:
2015-02-11 09:59:56.626899 7f5409d74800 0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 11230
2015-02-11 09:59:56.646218 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-02-11 09:59:56.646372 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-11 09:59:56.658227 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-02-11 09:59:56.679515 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) limited size xattrs
2015-02-11 09:59:56.699721 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-02-11 09:59:56.700107 7f5409d74800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-02-11 09:59:56.700454 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 20: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.704025 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 20: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.704884 7f5409d74800 1 journal close /var/lib/ceph/osd/cephcluster1-0/journal
2015-02-11 09:59:56.725281 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-02-11 09:59:56.725397 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-11 09:59:56.736445 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-02-11 09:59:56.756912 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) limited size xattrs
2015-02-11 09:59:56.776471 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) mount: WRITEAHEAD journal mode explicitly enabled in conf
2015-02-11 09:59:56.776748 7f5409d74800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-02-11 09:59:56.776848 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.777069 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.783019 7f5409d74800 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-02-11 09:59:56.783584 7f5409d74800 0 osd.0 11 crush map has features 1107558400, adjusting msgr requires for clients
2015-02-11 09:59:56.783645 7f5409d74800 0 osd.0 11 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2015-02-11 09:59:56.783687 7f5409d74800 0 osd.0 11 crush map has features 1107558400, adjusting msgr requires for osds
2015-02-11 09:59:56.783750 7f5409d74800 0 osd.0 11 load_pgs
2015-02-11 09:59:56.783831 7f5409d74800 0 osd.0 11 load_pgs opened 0 pgs
2015-02-11 09:59:56.792167 7f53f9b57700 0 osd.0 11 ignoring osdmap until we have initialized
2015-02-11 09:59:56.792334 7f53f9b57700 0 osd.0 11 ignoring osdmap until we have initialized
2015-02-11 09:59:56.792838 7f5409d74800 0 osd.0 11 done with init, starting boot process
/var/log/ceph/ceph-mon.monitor.log on the monitor node:
2015-02-11 09:59:56.593494 7f24cc41d700 0 mon.monitor@0(leader) e1 handle_command mon_command({"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 0, "weight": 0.05} v 0) v1
2015-02-11 09:59:56.593955 7f24cc41d700 0 mon.monitor@0(leader).osd e21 create-or-move crush item name 'osd.0' initial_weight 0.05 at location {host=osd1,root=default}
Any suggestion is appreciated. Many thanks!
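When OSDs boot but never get marked up, a first thing to check is whether each ceph-osd can actually reach the monitor and the networks declared in the config. A sketch (the first two commands run on an OSD node, the last on the monitor):
ping -c 1 172.26.111.4                                  # can the OSD node reach the monitor address?
ip addr show | grep -E '10\.5\.|172\.26\.111\.'         # does this node have addresses on the declared networks?
ceph --cluster cephcluster1 osd dump | grep '^osd\.'    # which addresses does the cluster think the OSDs have?
In the configuration above, the public network (10.5.0.0/16) and the monitor/cluster addresses (172.26.111.x) are on different subnets, which is worth double-checking, since OSDs that cannot bind to or reach the declared public network never report in as up.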
