lsyncd - OVERFLOW in event queue - Solution is to tune fs.inotify.max_queued_events - linux

lsyncd is a fantastic alternative to NFS or a NAS for replicating files among your Linux hosts. I have found that the daemon works well with large Linux filesystems (many files, small to large sizes, xfs, ext4, LUKS) but requires some sysctl tuning as your filesystem grows.
This "question" is a note to myself so I can always find the answer by searching Stack Overflow. Hope it helps you!
GitHub project: https://github.com/axkibe/lsyncd
Exception in /var/log/lsyncd.log:
Thu Jun 18 17:48:52 2020 Normal: --- OVERFLOW in event queue ---
Thu Jun 18 17:48:52 2020 Normal: --- HUP signal, resetting ---
Thu Jun 18 17:48:52 2020 Normal: waiting for 1 more child processes.

Solution when you see "OVERFLOW in event queue" in lsyncd.log
From other knowledge bases I had learned to tune fs.inotify.max_user_watches, but it was only after also raising fs.inotify.max_queued_events that the "OVERFLOW in event queue" error went away.
The temporary fix took effect without restarting my lsyncd process.
I picked 1000000 as an arbitrarily large number; the default value on Ubuntu 18.04 is 16384.
Temporary Solution
Check your current tuning values:
$ sysctl fs.inotify.max_queued_events
fs.inotify.max_queued_events = 16384
$ sysctl fs.inotify.max_user_watches
fs.inotify.max_user_watches = 8192
Update both max_user_watches and max_queued_events via shell
sudo sysctl fs.inotify.max_user_watches=1000000
sudo sysctl fs.inotify.max_queued_events=1000000
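Not part of the original fix, but before settling on a number it can help to see how heavily inotify is already being used on the host. A rough sketch (standard /proc layout, nothing lsyncd-specific; it counts inotify instances per process, not individual watches):
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head   # inotify instances per PID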
Permanent Solution (persists after reboot)
Update both max_user_watches and max_queued_events in /etc/sysctl.conf
fs.inotify.max_user_watches=1000000
fs.inotify.max_queued_events=1000000
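To load the values from /etc/sysctl.conf immediately, without waiting for a reboot, and then re-read both settings to confirm they took effect:
sudo sysctl -p
sysctl fs.inotify.max_queued_events fs.inotify.max_user_watches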
Lsyncd.conf Basic Configuration
/etc/lsyncd/lsyncd.conf
settings {
    logfile = "/var/log/lsyncd.log",
    pidfile = "/var/run/lsyncd/lsyncd.pid",
    insist = true
}

sync {
    default.rsyncssh,
    source = "/var/application/data",
    host = "node2",
    excludeFrom = "/etc/lsyncd/exclude",
    targetdir = "/var/application/data",
    rsync = {
        archive = true,
        compress = false,
        whole_file = true
    },
    ssh = {
        port = 22
    }
}
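If the overflow has already happened, lsyncd resets itself (that is the HUP/reset line in the log excerpt above), but after raising the limits you may also want to restart the daemon and watch the log to confirm the error is gone. A quick sketch, assuming a systemd-managed service actually named lsyncd (the unit name is an assumption; adjust it for your install):
sudo systemctl restart lsyncd        # unit name assumed; may differ on your distro
tail -f /var/log/lsyncd.log | grep -i overflow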
System Details
Linux service1staging 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 18.04.4 LTS
lsyncd --version
Version: 2.1.6

Related

The Linux (CentOS 7.9) kernel has reported a bug. Is it harmful?

My runtime environment is CentOS 7.9 (kernel version 5.16.11) in a VMBox virtual machine, allocated 1 GB of memory and 8 CPU cores.
[root@dev236 ~]# uname -a
Linux dev236 5.16.11-1.el7.elrepo.x86_64 #1 SMP PREEMPT Tue Feb 22 10:22:37 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
I ran a computation-intensive program that used 8 threads to continuously use the CPU.
After some time, the operating system issued a bug alert like this:
[root@dev236 src]# ./server --p:command-threads-count=8
[31274.179023] rcu: INFO: rcu_preempt self-detected stall on CPU
[31274.179039] watchdog: BUG: soft lockup - CPU#3 stuck for 210s! [server:1356]
[31274.179042] watchdog: BUG: soft lockup - CPU#1 stuck for 210s! [server:1350]
[31274.179070] watchdog: BUG: soft lockup - CPU#7 stuck for 210s! [server:1355]
[31274.179214] rcu: 0-...!: (1 GPs behind) idle=52f/1/0x4000000000000000 softirq=10073864/10073865 fqs=0
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 210s! [server:1356]
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 210s! [server:1350]
Message from syslogd@dev236 at Jan 25 18:59:49 ...
kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 210s! [server:1355]
^C
[root@dev236 src]#
Then I looked at the program log, and the log file was still being appended to, which indicated that my test program was still running.
I wonder: can I ignore this bug warning?
Or do I have to do something, for example:
    Reduce the computational intensity of the program?
    Give the CPU a break once in a while?
    Reduce the number of threads started by the program?
Thank you all
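Not from the original thread, but as a starting point for deciding whether the warning matters, you can inspect the soft-lockup watchdog that prints these messages. A minimal read-only sketch using standard kernel sysctls (nothing is changed here):
sysctl kernel.watchdog kernel.watchdog_thresh kernel.soft_watchdog   # read-only: current watchdog configuration
kernel.watchdog_thresh is in seconds and the soft-lockup limit is derived from it, so a fully CPU-bound thread that never lets the watchdog run will eventually trigger exactly the messages shown above.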

Which process sends SIGKILL and terminates all SSH connections on/to my Namecheap Server?

I've been trying to troubleshoot this problem for some days now.
A couple of minutes after starting an SSH connection to my Namecheap server (from Mac/Windows/cPanel's "Terminal"), it crashes and gives the following error message:
Error: The connection to the server ended in failure at {TIME} PM. (SIGKILL)
and:
Exit Code: 137
I've tried to create some kind of log of any SIGKILL signal, but it seems none can be made on a Namecheap server:
auditctl doesn't exist,
We can't get SystemTap because no package managers are available.
Details:
uname -a: Linux [-n] 2.6.32-954.3.5.lve1.4.78.el6.x86_64 #1 SMP Thu Mar 26 08:20:27 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I calculated the time between each crash: around 6 minutes.
I don't have very good knowledge of Linux servers and may not have included all the needed information, so please ask for any specifics!
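One low-privilege way to narrow this down, sketched under the assumption that you only have a normal shell (no root, no auditctl): write a heartbeat from inside the SSH session, so the last line in the file shows how long the session lived before the SIGKILL arrived. The log file name is just an example.
while true; do date '+%F %T'; ps -o pid,ppid,etime,cmd -p $$; sleep 15; done >> ~/ssh_heartbeat.log 2>&1 &   # run right after logging in
After the next disconnect, the tail of ~/ssh_heartbeat.log shows whether the whole session died at the ~6-minute mark, which would point toward a host-side process or time limit rather than a network problem.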

Kubernetes NFS PV: Lock reclaim failed

Configuration:
The NFS server and the k8s cluster (a single-node cluster) run on two machines and use the same OS and NFS software, as below:
[root@test-2 ~]# yum info nfs-utils
Failed to set locale, defaulting to C
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.tuna.tsinghua.edu.cn
* extras: mirrors.bfsu.edu.cn
* updates: mirrors.huaweicloud.com
Installed Packages
Name : nfs-utils
Arch : x86_64
Epoch : 1
Version : 1.3.0
Release : 0.68.el7
Size : 1.1 M
Repo : installed
From repo : base
Summary : NFS utilities and supporting clients and daemons for the kernel NFS server
URL : http://sourceforge.net/projects/nfs
License : MIT and GPLv2 and GPLv2+ and BSD
Description : The nfs-utils package provides a daemon for the kernel NFS server and
: related tools, which provides a much higher level of performance than the
: traditional Linux NFS server used by most users.
:
: This package also contains the showmount program. Showmount queries the
: mount daemon on a remote host for information about the NFS (Network File
: System) server on the remote host. For example, showmount can display the
: clients which are mounted on that host.
:
: This package also contains the mount.nfs and umount.nfs program.
[root@test-2 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@test-2 ~]# uname -a
Linux test-2 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@test-2 ~]# cat /etc/exports
/home/nfs 192.168.0.0/24(rw,sync,no_root_squash,no_subtree_check,insecure)
K8s version: v1.17.9
Problems:
The application (a StatefulSet) running on k8s uses a PV that was dynamically provisioned by the k8s-nfs-provisioner; the PV is actually backed by a directory on the remote NFS server. The application keeps going into "CrashLoopBackOff" because it constantly hits "input/output error" when writing data to the PV after only a few seconds of running.
Meanwhile, I saw a lot of errors in /var/log/messages:
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:12:05 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:12:05 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:42 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
I took a tcpdump until "Lock reclaim failed" appeared in the system log, and found many NFS errors in the capture, as below:
NFS4ERR_BADSESSION (10052)
NFS4ERR_STALE_CLIENTID (10022)
NFS4ERR_NO_GRACE (10033)
I'm not sure if they're related to the "Lock reclaim failed" or the "input/output error".
I have encountered this problem on different machines from time to time and it really annoys me.
Does anyone know the root cause or how to fix it? Big thanks in advance.
Screenshots
application pod log
NFS errors in tcpdump
nfsstat -m output on k8s
nfsstat -c output on k8s, NOTE the high open_noat value.
NFS server configuration (my k8s node is 111.1.30.16)
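Two read-only checks that may help narrow this down (a sketch, not a fix; the paths and tools are standard on CentOS 7, but verify on your systems): the NFSv4 lease time on the server, which governs how long clients have to reclaim locks, and the exact mount options the provisioned PV ended up with on the k8s node.
cat /proc/fs/nfsd/nfsv4leasetime    # on the NFS server: v4 lease time in seconds
nfsstat -m                          # on the k8s node: NFS version and mount options per mount
NFS4ERR_BADSESSION and NFS4ERR_STALE_CLIENTID in a capture usually mean the server no longer recognizes the client's state (for example after a server restart or lease expiry), so confirming those two pieces of state is a reasonable first step.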

Excessive memory reserved for caching on Linux

How can I find out what is causing this excessive amount of memory to be reserved for caching?
I can only free the space with:
free && sync && echo 3 > /proc/sys/vm/drop_caches && free
About 10 GB is reserved by caches. See the process list from before running drop_caches:
Software version:
mysqld: 5.5.44-0ubuntu0.14.04.1
Linux: Ubuntu 14.04.3 LTS -> 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
ClamAV: 0.98.7/23224/Tue Mar 21 08:29:04 2017
Apache: 2.4.7 (Ubuntu) Server built: Sep 23 2015 15:34:04
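Before dropping caches it can be worth breaking down what that ~10 GB actually is, since "cached" covers page cache, reclaimable kernel slab and tmpfs. A quick read-only sketch using standard tools:
free -h
grep -E '^(Cached|Buffers|Shmem|SReclaimable)' /proc/meminfo
sudo slabtop -o | head -20    # top kernel slab consumers, printed once
If most of it shows up under Cached, it is ordinary page cache that the kernel gives back automatically under memory pressure, and dropping it mainly costs re-reads from disk.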

How do I access a USB drive on a OSX host from inside a docker container?

I have an application that I eventually want to run on a cloud computing service (e.g., AWS or Google Cloud), packaged inside a docker image. The reason the application will need to run in the cloud is that it's designed to process large data files, but before I actually deploy, I'd like to test it first on a local laptop, using a single large data file that I've stored (for test and development purposes) on an external USB drive.
My development machine is an OSX laptop, and I'm using a recent version of docker:
stachyra> uname -a
Darwin Andrews-MacBook-Pro-76.local 14.5.0 Darwin Kernel Version 14.5.0: Tue Sep 1 21:23:09 PDT 2015; root:xnu-2782.50.1~1/RELEASE_X86_64 x86_64
stachyra> docker --version
Docker version 1.10.2, build c3959b1
OSX has mounted my external USB drive, device /dev/disk2s2, as /Volumes/MGR DATA:
stachyra> df
Filesystem 512-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk1 974770480 435721376 538537104 45% 54529170 67317138 45% /
devfs 375 375 0 100% 650 0 100% /dev
map -hosts 0 0 0 100% 0 0 100% /net
map auto_home 0 0 0 100% 0 0 100% /home
/dev/disk2s2 3906291632 3869523640 36767992 100% 483690453 4595999 99% /Volumes/MGR DATA
/dev/disk3s1 196608 193160 3448 99% 24143 431 98% /Volumes/VirtualBox
stachyra> diskutil list
/dev/disk0
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *500.3 GB disk0
1: EFI EFI 209.7 MB disk0s1
2: Apple_CoreStorage 499.4 GB disk0s2
3: Apple_Boot Recovery HD 650.0 MB disk0s3
/dev/disk1
#: TYPE NAME SIZE IDENTIFIER
0: Apple_HFS Macintosh HD *499.1 GB disk1
Logical Volume on disk0s2
DB70B91A-3B57-4C82-A758-C4BDEA4160FD
Unlocked Encrypted
/dev/disk2
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *2.0 TB disk2
1: EFI EFI 209.7 MB disk2s1
2: Apple_HFS MGR DATA 2.0 TB disk2s2
/dev/disk3
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *100.7 MB disk3
1: Apple_HFS VirtualBox 100.7 MB disk3s1
and it should also be noted, the drive has several directories and data which are visible inside it, at least when viewed directly through OSX:
stachyra> ls -l /Volumes/MGR\ DATA
total 0
drwxr-xr-x 6 stachyra staff 204 Apr 14 2015 1000genomes
drwxr-xr-x 5 stachyra staff 170 Oct 12 17:41 GIAB
drwxr-xr-x 4 stachyra staff 136 Apr 28 2015 genome_browser_tracks
drwxr-xr-x 24 stachyra staff 816 Oct 6 14:00 mitty
I have tried to follow the advice from this question, which describes how to mount a USB drive in docker when docker is running on a Linux host. But my local laptop runs OSX, not Linux, so it doesn't seem to work.
Explicitly, when attempting to follow the advice of the accepted answer, I obtain the following result:
stachyra> docker run -i -t --privileged -v /dev/disk2s2:/dev/foo ubuntu bash
root@8da7b492a707:/# uname -a
Linux 8da7b492a707 4.1.18-boot2docker #1 SMP Sat Feb 20 08:24:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@8da7b492a707:/# ls -l /dev/foo
total 0
root@8da7b492a707:/#
Based upon the response, one can see that docker does indeed launch a linux container correctly, and it also creates a volume /dev/foo inside of the container as requested, but the actual contents of the USB drive are not accessible via that location--the ls -l command claims there are no files or directories there.
I also tried the second method described in an alternate response to the same question, and that fails even worse:
stachyra> docker run -i -t --device=/dev/disk2s2 ubuntu bash
docker: Error response from daemon: error gathering device information while adding custom device "/dev/disk2s2": not a device node.
stachyra>
I have found another discussion thread on stackoverflow which suggests that raw USB access is handled quite differently in OSX than in linux, which I suspect is probably the reason why both of the above attempts at USB access are failing.
But, what should I actually do about it? That is to say, what is the correct sequence of actions or commands to allow docker to access a USB device mounted on an OSX host, rather than linux?
I was finally able to access my USB drive from /var/media inside my container by using the machine-diskutil.sh script mentioned in warmoverflow's comment, like so:
machine-diskutil.sh mount my-machine-name /Volumes/my-usb-drive
and then starting the container like so:
docker run -v /Volumes/my-usb-drive:/var/media -it my/image:latest bash
Because I had tried to add /Volumes/my-usb-drive as a shared folder manually in VirtualBox, I first got this error:
Error: The shared folder /Volumes/Seagate already exists on the
docker machine, please unmount it first.
So I removed it manually and re-ran the machine-diskutil.sh mount command without any problems. Great stuff!
As per @pgayvallet's comment on GitHub:
As the daemon runs inside a VM in Docker Desktop, it is not possible to actually share a mac host device with the container inside the VM, and this will most definitely never be possible.
