Host OS: Ubuntu 15.10
Guest OS: Ubuntu 14.10
I'm using Vagrant with NFS synced folders, VirtualBox, and a static IP on the private network.
It works perfectly, except that after the host OS has been suspended and resumed, the entire guest OS becomes unusable.
This does not happen when using the normal VirtualBox shared folders.
It's not only the NFS shared folder that is unusable; the entire guest OS hangs.
Even syslog shows very little activity.
This is the syslog on the guest, from the host waking up until vagrant halt completes:
Feb 26 07:15:33 vagrant kernel: [ 8375.252989] e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Feb 26 07:16:11 vagrant kernel: [ 8413.109832] nfs: server 192.168.33.1 not responding, still trying
Feb 26 07:16:38 vagrant kernel: [ 8440.687476] nfs: server 192.168.33.1 not responding, still trying
Feb 26 07:17:01 vagrant CRON[3776]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Feb 26 07:20:33 vagrant rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="753" x-info="http://www.rsyslog.com"] exiting on signal 15.
How can this be fixed?
How should I debug it?
Setup:
NFS server: NFSServerHOST
NFS Share: NFSServerHOST:/MYSHARE
NFS Client1: CLNT1
NFS Client2: CLNT2
NFS Client3: CLNT3
Client OS: RHEL 7 and 8
Client users: User1, User2
Local Mount on Client: /var/NFSSHARE
mount -t nfs4 NFSServerHOST:/MYSHARE /var/NFSSHARE
The mount was created successfully on all clients. Both User1 and User2 can read and write /var/NFSSHARE from all 3 clients.
Then something happens on Client2 (we have yet to find out whether it is related to a server patch or a cron job), and afterwards User1 cannot read or write /var/NFSSHARE, but only on Client1. User2 can still read and write the NFS share on Client1. Both users can still read and write on Client2 and Client3.
Error while performing a read/write on Client1 as User1: Remote I/O error
If we reboot Client1, the issue goes away and User1 can again perform I/O operations on the NFS share from Client1.
Some of the things we checked:
No version mismatch: both the NFS client and the NFS server are configured for NFSv4.
Nothing wrong with whitelisting: all 3 client IPs are whitelisted on the NFS server.
We checked inode and lsof usage, which are well within the limits.
nfs4_getfacl /var/NFSSHARE
# file: /var/NFSSHARE
A::EVERYONE@:rwaDxtTnNcy
Running getfacl /var/NFSSHARE as User1 on Client1:
# file: var/MQHA/
# owner: nobody
# group: nobody
user::rwx
group::rwx
other::rwx
Comparing the rpcdebug logs while performing an I/O operation on Client1 (FAILURE) vs. Client2 (SUCCESS):
kernel: NFS: nfs_update_inode(0:57/3963604504 fh_crc=0xbf9e74c8 ct=2 info=0x427e7f)
kernel: NFS: (0:57/3963604504) revalidation complete
kernel: NFS: permission(0:57/3963604504), mask=0x1, res=0
kernel: NFS: permission(0:57/3963604504), mask=0x3, res=0
kernel: NFS: atomic_open(0:57/3963604504), Abhi
kernel: --> nfs_put_client({2})
kernel: --> nfs4_alloc_slot used_slots=0002 highest_used=1 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=0
The logs change after this point. Before it, the SUCCESS and FAILURE logs are more or less the same; only the numeric values differ.
Client1 (FAILURE)
kernel: nfs4_free_slot: slotid 0 highest_used_slotid 1
kernel: NFS: permission(0:57/3963604504), mask=0x81, res=-10
kernel: --> nfs4_alloc_slot used_slots=0002 highest_used=1 max_slots=1024
kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=0
kernel: decode_attr_type: type=00
Client2 (SUCCESS)
kernel: decode_attr_type: type=0100000
kernel: decode_attr_change: change attribute=7148460619683717735
kernel: decode_attr_size: file size=0
We are looking for suggestions on how to diagnose this issue. What else can we do to enable more verbose logging, on either the client or the server side, to learn more about the error?
Thanks
Configuration:
The NFS server and the k8s cluster (a single-node cluster) run on two separate machines that use the same OS and NFS software, as shown below:
[root@test-2 ~]# yum info nfs-utils
Failed to set locale, defaulting to C
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.tuna.tsinghua.edu.cn
* extras: mirrors.bfsu.edu.cn
* updates: mirrors.huaweicloud.com
Installed Packages
Name : nfs-utils
Arch : x86_64
Epoch : 1
Version : 1.3.0
Release : 0.68.el7
Size : 1.1 M
Repo : installed
From repo : base
Summary : NFS utilities and supporting clients and daemons for the kernel NFS server
URL : http://sourceforge.net/projects/nfs
License : MIT and GPLv2 and GPLv2+ and BSD
Description : The nfs-utils package provides a daemon for the kernel NFS server and
: related tools, which provides a much higher level of performance than the
: traditional Linux NFS server used by most users.
:
: This package also contains the showmount program. Showmount queries the
: mount daemon on a remote host for information about the NFS (Network File
: System) server on the remote host. For example, showmount can display the
: clients which are mounted on that host.
:
: This package also contains the mount.nfs and umount.nfs program.
[root@test-2 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@test-2 ~]# uname -a
Linux test-2 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@test-2 ~]# cat /etc/exports
/home/nfs 192.168.0.0/24(rw,sync,no_root_squash,no_subtree_check,insecure)
K8S version: v1.17.9
Problems:
The application (a StatefulSet) running on k8s uses a PV that was dynamically provisioned by the k8s-nfs-provisioner; the PV is backed by a directory on the remote NFS server. The application stays in "CrashLoopBackOff" because it constantly hits "input/output error" when writing data to the PV, only a few seconds after starting.
Meanwhile, I saw a lot of errors in /var/log/messages:
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:12:05 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:12:05 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:42 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
I captured traffic with tcpdump until "Lock reclaim failed" appeared in the system log, and found many NFS errors such as:
NFS4ERR_BADSESSION (10052)
NFS4ERR_STALE_CLIENTID (10022)
NFS4ERR_NO_GRACE (10033)
I'm not sure whether they are related to the "Lock reclaim failed" messages or to the "input/output error".
I have run into this problem on different machines from time to time, and it is really annoying.
Does anyone know the root cause or how to fix it? Many thanks in advance.
Screenshots
application pod log
NFS errors in tcpdump
nfsstat -m output on k8s
nfsstat -c output on k8s (note the high open_noat value)
NFS server configuration (my k8s node is 111.1.30.16)
linux:
[root@localhost bin]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[root@localhost bin]# cat /proc/version
Linux version 4.14.0-115.5.1.el7a.06.aarch64 (mockbuild@arm-buildhost1) (gcc version 4.8.5 20150623 (NeoKylin 4.8.5-36) (GCC)) #1 SMP Tue Jun 18 10:34:55 CST 2019
[root@localhost bin]# file /bin/bash
/bin/bash: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 3.7.0, BuildID[sha1]=8a346ec01d611062313a5a4ed2b0201ecc9d9fa1, stripped
JxBrowser 7.7:
I used this demo; line 55 is: Browser browser = engine.newBrowser();
public static void main(String[] args) {
    Engine engine = Engine.newInstance(
            EngineOptions.newBuilder(OFF_SCREEN).build());
    Browser browser = engine.newBrowser();
[root@localhost bin]# java -jar test.jar
Exception in thread "main" com.teamdev.jxbrowser.navigation.TimeoutException: Failed to execute task withing 45 seconds.
at com.teamdev.jxbrowser.navigation.internal.NavigationImpl.loadAndWait(NavigationImpl.java:248)
at com.teamdev.jxbrowser.navigation.internal.NavigationImpl.loadUrlAndWait(NavigationImpl.java:105)
at com.teamdev.jxbrowser.navigation.internal.NavigationImpl.loadUrlAndWait(NavigationImpl.java:82)
at com.teamdev.jxbrowser.navigation.internal.NavigationImpl.loadUrlAndWait(NavigationImpl.java:74)
at com.teamdev.jxbrowser.engine.internal.EngineImpl.newBrowser(EngineImpl.java:458)
at com.pinnet.HelloWorld.main(HelloWorld.java:55)
Linux logs from /var/log/messages:
22 09:48:53 localhost dbus[8661]: [system] Activating via systemd: service name='org.bluez' unit='dbus-org.bluez.service'
May 22 09:48:54 localhost abrt-hook-ccpp: Process 90562 (chromium) of user 0 killed by SIGABRT - dumping core
May 22 09:48:54 localhost abrt-hook-ccpp: Process 90566 (chromium) of user 0 killed by SIGABRT - ignoring (repeated crash)
May 22 09:48:54 localhost abrt-hook-ccpp: Process 90561 (chromium) of user 0 killed by SIGABRT - ignoring (repeated crash)
May 22 09:48:54 localhost abrt-hook-ccpp: Process 90593 (chromium) of user 0 killed by SIGABRT - ignoring (repeated crash)
May 22 09:48:55 localhost abrt-hook-ccpp: Process 90624 (chromium) of user 0 killed by SIGABRT - ignoring (repeated crash)
May 22 09:48:55 localhost abrt-hook-ccpp: Process 90623 (chromium) of user 0 killed by SIGABRT - ignoring (repeated crash)
May 22 09:48:56 localhost abrt-server: Duplicate: core backtrace
May 22 09:48:56 localhost abrt-server: DUP_OF_DIR: /var/spool/abrt/ccpp-2020-05-21-16:55:06-33694
May 22 09:48:56 localhost abrt-server: Deleting problem directory ccpp-2020-05-22-09:48:54-90562 (dup of ccpp-2020-05-21-16:55:06-33694)
May 22 09:48:56 localhost abrt-server: /bin/sh: reporter-mailx: command not found
May 22 09:49:18 localhost dbus[8661]: [system] Failed to activate service 'org.bluez': timed out
Where are my iptables "Blocked" log messages? I wonder whether this is an OpenVZ issue or something caused by the scripted install. Note that I'm highly technical, but not a server admin. Could the OpenVZ host be blocking and logging outside of my VPS?
I have two newly installed machines running text-mode CentOS 7 x64, with packages up to date via yum, and with iptables/CSF.
I also ensured that machine #2 has all the packages that are on machine #1, though #2 has some extras.
1. OpenVZ VPS (installed with their image of CentOS 7 x64)
2. VMware VM (installed with the official CentOS 7 x64 in minimal mode)
I performed my extra installs and configuration in exactly the same way on both machines, and I have these lines in /etc/csf/csf.conf:
TESTING = "0"
TCP_IN = "22,80,443"
UDP_IN = ""
On the VM, I get these entries in /var/log/messages when I nmap-scan it:
Apr 12 17:25:23 mach kernel: Firewall: *UDP_IN Blocked* IN=ens192 OUT= ...
Apr 12 17:25:55 mach kernel: Firewall: *TCP_IN Blocked* IN=ens192 OUT= ...
On the VPS, I'm NOT getting any Firewall entries in /var/log/messages when I nmap-scan it, although I think it is blocking traffic properly.
How do I even begin to diagnose this?
I create a VM instance. I can connect to it as soon as the SSH daemon is started, but this is too early because kernel startup is only about 30% complete. Is there a gcloud command or other API to get the VM state once the kernel has finished startup?
Nov 18 10:58:51 image-name google: No startup script found in metadata.
Nov 18 10:58:53 image-name kernel: [ 27.491829] aufs au_opts_verify:1570:docker[2414]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.703142] aufs au_opts_verify:1570:docker[2414]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.735867] aufs au_opts_verify:1570:docker[2414]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.771732] aufs au_opts_verify:1570:docker[2260]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.797540] device vethfa3ab85 entered promiscuous mode
Nov 18 10:58:53 image-name kernel: [ 27.804420] IPv6: ADDRCONF(NETDEV_UP): vethfa3ab85: link is not ready
Nov 18 10:58:53 image-name kernel: [ 28.028306] IPv6: ADDRCONF(NETDEV_CHANGE): vethfa3ab85: link becomes ready
Nov 18 10:58:53 image-name kernel: [ 28.035505] docker0: port 1(vethfa3ab85) entered forwarding state
Nov 18 10:58:53 image-name kernel: [ 28.041963] docker0: port 1(vethfa3ab85) entered forwarding state
Nov 18 10:58:53 image-name kernel: [ 28.048532] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
Nov 18 10:58:54 image-name kernel: [ 28.980082] IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:1 detected!
->>> about here I can SSH to the server
Nov 18 10:59:08 image-name kernel: [ 43.068094] docker0: port 1(vethfa3ab85) entered forwarding state
Nov 18 10:59:53 image-name kernel: [ 87.944452] aufs au_opts_verify:1570:docker[2864]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:59:53 image-name kernel: [ 88.001012] aufs au_opts_verify:1570:docker[2864]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:59:53 image-name kernel: [ 88.049510] aufs au_opts_verify:1570:docker[2815]: dirperm1 breaks the protection by the permission bits on the lower branch
->>> I want to know about this point in the startup process
My problem is that I can connect over SSH when startup is less than 30% complete and some processes have not started yet. I want to detect when the server has completed startup. Alternatively, is there a script I can push to the server (through the GCE APIs) that notifies me when the server is completely up?
gcloud compute instances describe image-name returns the same output from the moment the instance is started until kernel startup is complete.
(In my case I use the Node.js GCE API, but this should not make any difference.)
At present I am not aware of any native Google API that reports the progress of an instance's startup.
However, here is a quick workaround; check whether it fits your requirements.
You can use either a Google startup script or the native Linux rc.local. The concept is the same, so I will explain it for rc.local, as it is generic and not tied to Google.
rc.local is conventionally run at the end of the boot sequence. Any command, script, or call placed in rc.local (which is itself a sh or bash script) is executed at the end of the boot process.
So the idea is to include, in the image's rc.local, a script or call that sends you a notification, or writes to a central system such as a key-value store or Cloud Storage, stating that bootup is complete.
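As a rough illustration of the rc.local idea, the sketch below defines a helper that writes a "boot complete" marker file containing the host name and a UTC timestamp. The function name mark_boot_done, the marker file name, and the bucket mentioned in the comment are hypothetical and only for illustration; how you publish the marker (Cloud Storage, a key-value store, a webhook) depends on your setup.

```shell
#!/bin/sh
# Hypothetical helper, intended to be called as the last step of /etc/rc.local.
# Writes a marker file that an external poller can look for.

mark_boot_done() {
    # Directory for the marker; /var/run is only an assumed default.
    marker_dir="${1:-/var/run}"
    marker="$marker_dir/boot-complete"

    # Record the host name and a UTC timestamp so separate boots can be told apart.
    printf '%s %s\n' "$(uname -n)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$marker"

    # Print the marker path so the caller can upload or inspect it.
    echo "$marker"
}

# In rc.local you might then call:
#   marker=$(mark_boot_done /var/run)
# and, assuming gsutil is installed and the (hypothetical) bucket exists:
#   gsutil cp "$marker" "gs://my-bucket/boot-state/$(uname -n)"
```

A client can then treat the presence of the uploaded marker as the "server is up" signal instead of relying on SSH being reachable.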
This is similar to Kamran's answer, but here is how I do it. It relies on a Google startup script and an image where gcloud is installed by default (though you could rework it to use curl and API calls instead).
On instance creation/configuration, I set a custom metadata flag: serverready=False
At the end of my Google startup script, I have this:
sudo gcloud compute instances add-metadata $(hostname) \
--metadata serverready=True \
--zone $(curl \
"http://metadata.google.internal/computeMetadata/v1/instance/zone" \
-H "Metadata-Flavor: Google"|cut -d/ -f4)
After creating the instance, I poll the metadata for the serverready key and have my app wait until it sees serverready=True.
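The polling side could look roughly like the sketch below. The function names wait_for_ready and read_serverready, the INSTANCE and ZONE variables, and the retry defaults are hypothetical. The --format projection used to read a single metadata key is also an assumption on my part, so check what your gcloud version actually prints before relying on it.

```shell
#!/bin/sh
# wait_for_ready PROBE_CMD [MAX_TRIES] [SLEEP_SECONDS]
# Runs PROBE_CMD repeatedly until it prints exactly "True".
# Returns 0 once ready, 1 if MAX_TRIES attempts pass without success.
wait_for_ready() {
    probe="$1"
    max_tries="${2:-60}"
    delay="${3:-5}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # $probe is left unquoted on purpose so it may carry arguments.
        if [ "$($probe 2>/dev/null)" = "True" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Example probe (assumes gcloud is installed and authenticated, and that
# INSTANCE and ZONE are set by the caller; the format string is an assumption).
read_serverready() {
    gcloud compute instances describe "$INSTANCE" --zone "$ZONE" \
        --format='value(metadata.items.serverready)'
}

# Usage, e.g. up to 60 attempts spaced 5 seconds apart:
#   INSTANCE=my-vm ZONE=us-central1-a
#   wait_for_ready read_serverready 60 5 && echo "server is ready"
```

Passing the probe in as a command keeps the retry loop independent of gcloud, so the same loop works if you switch to curl or another API client.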