nfs mount exiting with connection timed out - rpc

I am facing trouble with an NFS mount on one of our nodes.
We are using NFSv4.
# nfsstat -s
Server rpc stats:
calls      badcalls   badfmt     badauth    badclnt
30484      0          0          0          0
Server nfs v4:
null         compound
2         0% 30481   99%
Kernel version: Linux 4.19.255
gcc version: 8.4.0
Below are the error logs seen on the NFS client when trying to mount the export from the NFS server:
# mount ser_ip:/ffs/run tmp -v -t nfs
mount.nfs: timeout set for Fri Sep 24 11:22:40 2022
mount.nfs: trying text-based options 'vers=4.2,addr=ser_ip,clientaddr=cli_ip'
mount.nfs: mount(2): Invalid argument
mount.nfs: trying text-based options 'vers=4,minorversion=1,addr=ser_ip,clientaddr=cli_ip'
mount.nfs: mount(2): Invalid argument
mount.nfs: trying text-based options 'vers=4,addr=ser_ip,clientaddr=cli_ip'
mount.nfs: mount(2): Connection timed out
mount.nfs: Connection timed out
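For reference, mount.nfs can also be pinned to a single version instead of letting it negotiate, to see exactly which version fails (vers=4.0 below is just an example; older nfs-utils may need vers=4,minorversion=0 instead):
# mount -t nfs -o vers=4.0 ser_ip:/ffs/run tmp -v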
After enabling rpc_debug and nfsd debug logging, I observed the NFS server logging the following:
[72245.940260] svc: server 00000000037849c6 waiting for data (to = 3600000)
[72251.432660] svc: svc_process dropit
[72251.432665] svc: xprt 00000000764063c0 dropped request
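For completeness, this is roughly how the rpc/nfsd debug logging was enabled on the server (flags per rpcdebug(8); they can be narrowed down as needed):
# rpcdebug -m rpc -s all
# rpcdebug -m nfsd -s all
# rpcdebug -m rpc -c all     # clear the flags again when done
# rpcdebug -m nfsd -c all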
The issue is intermittent: restarting the NFS server does not help, but after rebooting the node the issue seems to disappear.
Any clues would be a great help.
Thanks in advance!

Related

Error while mounting NFS share: mount.nfs: Remote I/O error

When mounting AIX NFS shares, you may get the following error:
mount.nfs: Remote I/O error
If the AIX NFS server supports the NFSv4 protocol, this may be due to a misconfiguration of NFSv4 on the server.
If the AIX NFS server does not support NFSv4, mount the NFS share with the option vers=3, e.g.:
# mount -o vers=3 server_ip_addr:dir local_mount_point
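One way to check which NFS versions the server actually registers (assuming rpcinfo is available on the client) is to query the server's portmapper:
# rpcinfo -p server_ip_addr
The nfs entries in the output show which protocol versions (2, 3, 4) the server offers.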

dovecot unable to start due to address already in use

I upgraded my Linux kernel and dovecot failed to start with the following error messages:
Error: service(managesieve-login): listen(*, 4190) failed: Address already in use
Error: service(pop3-login): listen(*, 110) failed: Address already in use
Error: service(pop3-login): listen(*, 995) failed: Address already in use
Error: service(imap-login): listen(*, 143) failed: Address already in use
Error: service(imap-login): listen(*, 993) failed: Address already in use
Fatal: Failed to start listeners
Strangely enough, I couldn't find any process bound to those ports. All of the commands below return nothing:
# netstat -tulpn | grep 110
# ss -tulpn |grep 110
# fuser 110/tcp
# lsof -i :110
I also tried to change the listen setting to my specific IP address and it still failed the same way.
Any idea how I can solve this problem? Here's my version info:
# uname -a
Linux ip-172-31-26-222 4.14.177-107.254.amzn1.x86_64 #1 SMP Thu May 7 18:30:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
# dovecot --version
2.2.36 (1f10bfa63)
Hi, it looks like you are on AWS, as I am. I recently updated via yum as well, and noticed that a new package named 'portreserve' had also been installed. I killed that process, left /etc/dovecot/dovecot.conf as it was, and then started Dovecot successfully. My mail clients were immediately able to reconnect. I hope that helps you.
I also restarted the portreserve daemon afterwards, since reserving ports still seems useful.
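A less drastic alternative to killing portreserve (a sketch, assuming the package ships a reservation file named 'dovecot' under /etc/portreserve/) is to release just the reserved mail ports and then start the service:
# ls /etc/portreserve/
# portrelease dovecot
# service dovecot start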

Slurm and Munge "Invalid Credential"

I'm installing slurm for the first time. I've installed the 19.05.1-2 tarball and used the configurator to make a very simple two-node cluster. The control node is sdc; the compute nodes (running slurmd) are sdc and sdc1. Both were rebuilt with Ubuntu 18.04.
I can start the controller and the compute node sdc, and I can successfully submit jobs with srun. That's great. However, when I start slurmd on the second node, sdc1, I get:
slurmd: error: Unable to register: Zero Bytes were transmitted or received
That quickly led me to my munge configuration. The munge log on the controller (sdc) shows "Invalid credential" every second. I triple-checked that munge.key is identical on both hosts, and I verified that ntp is running too.
So by hand I ran munge -s foobar | unmunge on sdc1, and of course that worked locally. Then I saved the munged text from sdc1 to a file on sdc and ran unmunge on it there. That gave me the "Invalid credential" error again.
Because of this I uninstalled and reinstalled munge on both systems, distributed the key and repeated that test with the same result.
I guess I'm missing something simple. I don't know what else to do to properly install munge.
It was a UID/GID mismatch between the nodes. Of course, it's mentioned in the installation guide.
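A quick way to check for this is to compare the relevant accounts on every node (the munge and slurm users are the ones that need consistent UID/GID across the cluster):
$ id munge
$ id slurm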
Did you remember to restart the munge daemon after copying the munge.key to /etc/munge? I got the same error after doing the following:
1: install slurm:
$ apt install -y slurm-client
2: copy slurm.conf (creating /etc/slurm-llnl first if needed):
$ cp slurm.conf /etc/slurm-llnl
3: copy the munge key to the client
(munge.key was copied earlier from the slurm server/slurmctld host):
$ cp munge.key /etc/munge
and then I got all the invalid credential errors and problems reported here and elsewhere, including the 'Zero Bytes' error on the client side:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with corresponding entries in the Slurm SERVER/slurmctld logs, like:
[SERVER]$ tail /var/log/munge/munged.log
2022-12-30 22:57:23 +0100 Notice: Running on ..
2022-12-30 23:01:11 +0100 Info: Invalid credential ...
and
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
[2022-12-30T23:01:11.440] error: Munge decode failed: Invalid credential
[2022-12-30T23:01:11.440] ENCODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] DECODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: REQUEST_PARTITION_INFO has authentication error: Invalid authentication credential
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: Protocol authentication error
All of this is fixed by rebooting the client, as suggested by others here, or, slightly less intrusively, by just restarting the client munge daemon:
[CLIENT]$ sudo systemctl restart munge.service
After that, munge on the client / unmunge on the server works, and it also fixes my main problem of getting the client to see the slurm server without the dreaded 'Zero Bytes' error, which looked like this:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with server log entries
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
...
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Invalid Protocol Version 9472 from uid=-1 at XX.XX.XX.XX:44150
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2022-12-30T23:17:14.027] error: slurm_receive_msg [XX.XX.XX.XX:44150]: Unspecified error
And, after munge restart, voilà:
[CLIENT] $ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
LocalQ* up infinite 1 idle XXX
For the examples above: the SERVER runs Ubuntu 20.04, and the CLIENTS run Ubuntu 20.04 (and 22.04, which the log says is incompatible with the SERVER's slurm version).
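For reference, the client-to-server munge check mentioned above can be done in one line (a sketch, assuming ssh access from the client to the server; 'slurm-server' is a placeholder hostname):
[CLIENT]$ munge -n | ssh slurm-server unmunge
If the credential decodes cleanly on the server, the keys match; an 'Invalid credential' error points at the key, clock skew, or the munge user's UID/GID setup.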

dockerd: Error running deviceCreate (CreatePool) dm_task_run failed

I'm building some CentOS VMs with VMware, with no internet access, so I've downloaded packages and made local repositories, including this one.
Then I installed docker-engine.x86_64, and when starting the docker daemon I get the following errors:
[root]# dockerd
DEBU[0000] docker group found. gid: 993
...
...
DEBU[0001] Error retrieving the next available loopback: open /dev/loop-control: no such device
ERRO[0001] There are no more loopback devices available.
ERRO[0001] [graphdriver] prior storage driver "devicemapper" failed: loopback attach failed
DEBU[0001] Cleaning up old mountid : start.
FATA[0001] Error starting daemon: error initializing graphdriver: loopback attach failed
After manually loading the loop module, which controls loop devices, with this command:
insmod /lib/modules/3.10.0-327.36.2.el7.x86_64/kernel/drivers/block/loop.ko
The error changes to :
[graphdriver] prior storage driver "devicemapper" failed: devicemapper: Error running deviceCreate (CreatePool) dm_task_run failed
I've read that it could be because there is not enough disk space, but I don't think that's it. Any idea?
[root]# df -k .
Filesystem              1K-blocks    Used Available Use% Mounted on
/dev/mapper/centos-root  51887356 2436256  49451100   5% /
I got the "There are no more loopback devices available" error, which stopped dockerd from running.
I fixed it by ensuring the storage driver was 'overlay':
# /usr/bin/dockerd -D --storage-driver=overlay
This was on Debian Jessie, with docker running as a systemd service/unit.
To make it permanent, I created a systemd drop-in:
$ cat /etc/systemd/system/docker.service.d/docker.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --storage-driver=overlay
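After creating or editing a drop-in like this, systemd has to re-read its unit files before the change takes effect (standard systemd steps, not specific to docker):
# systemctl daemon-reload
# systemctl restart docker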

rsnapshot on Linux fails with "returned 12 while processing"

I thought I had rsnapshot all set up properly, but after checking my logs the next day I found the following:
[05/Sep/2014:10:34:11] /usr/bin/rsnapshot daily: ERROR: /usr/bin/rsync returned 12 while processing john@192.168.0.102:/media/linuxstorage/docs/
What does return code "12" mean?
To see what was going on, I ran it manually and went off to do other things:
raspberrypi $ sudo rsnapshot daily
Well, lo and behold, it had been sitting there waiting for my password.
john@192.168.0.102's password:
Connection closed by 192.168.0.102
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(605) [Receiver=3.0.9]
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
----------------------------------------------------------------------------
ERROR: /usr/bin/rsync returned 12 while processing bgrissom@192.168.0.102:/medi/linuxstorage/docs/
I had changed the rsnapshot user from pi to root in /etc/crontab, and root did not have passwordless-ssh keys set up for the remote host. All I had to do to fix this was:
raspberrypi $ sudo bash
raspberrypi # ssh-copy-id john@192.168.0.102
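One way to confirm the key now works non-interactively before the next cron run (BatchMode makes ssh fail instead of prompting for a password):
raspberrypi # ssh -o BatchMode=yes john@192.168.0.102 true && echo "key login OK"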
In short: rsync's return code 12 ("error in rsync protocol data stream") in this context means something is wrong with authentication to the remote server.
I ran into this as well, and it seems this is the most common cause of that error:
ERROR: /usr/bin/rsync returned 12 while processing .....
Problem: rsnapshot uses rsync over ssh under the hood, and ssh can't connect because you probably never actually connected to that remote server from the machine where rsnapshot runs.
Solution: connect to the remote server at least once manually from that machine:
ssh remote_user@remote_server.domain
so that you confirm the host key and an entry is made in known_hosts.
After that, rsnapshot worked for me.
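A non-interactive alternative for seeding known_hosts (a sketch; run it as the same user rsnapshot runs as, and verify the fetched host key fingerprint out of band):
$ ssh-keyscan remote_server.domain >> ~/.ssh/known_hosts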
