Slurm and Munge "Invalid Credential"

I'm installing Slurm for the first time. I've installed the 19.05.1-2 tarball and used the configurator to make a very simple two-node cluster. The control node is sdc; the compute nodes (running slurmd) are sdc and sdc1. Both were recently rebuilt with Ubuntu 18.04.
I can start the controller and the compute node sdc, and I can successfully submit jobs with srun. That's great. However, when I start slurmd on the second node, sdc1, I get:
slurmd: error: Unable to register: Zero Bytes were transmitted or received
That quickly led me to my munge configuration. munge.log on the controller (sdc) shows "Invalid credential" every second. I triple-checked that the munge.key files on both hosts are identical. I also verified that ntp is running.
So by hand I ran munge -s foobar | unmunge on sdc1, and of course that worked locally. Then I saved the munged text from sdc1 to a file on sdc and tried to unmunge it there. That gave me the "Invalid credential" error again.
Because of this I uninstalled and reinstalled munge on both systems, distributed the key again, and repeated that test with the same result.
I guess I'm missing something simple. I don't know what else to do to properly install munge.

It turned out to be a UID/GID mismatch for the munge user between the nodes. Of course, it's mentioned in the installation guide.
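A quick way to confirm (or rule out) such a mismatch is to compare the munge account on every node. A minimal sketch, where the UID/GID values 114/121 are just placeholders:
[sdc]$ id munge               # e.g. uid=114(munge) gid=121(munge) groups=121(munge)
[sdc1]$ id munge              # must print exactly the same uid/gid as on sdc
# if they differ, align the IDs on the offending node, fix ownership, and restart munge:
[sdc1]$ sudo groupmod -g 121 munge
[sdc1]$ sudo usermod -u 114 -g 121 munge
[sdc1]$ sudo chown -R munge:munge /etc/munge /var/lib/munge /var/log/munge
[sdc1]$ sudo systemctl restart munge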

Did you remember to restart the munge daemon after copying the munge.key to /etc/munge? I got the same error when doing the following:
1: install slurm:
$ apt install -y slurm-client
2: copy slurm.conf
(you may need to create /etc/slurm-llnl beforehand):
$ cp slurm.conf /etc/slurm-llnl
3: copy munge key to client
(munge.key copied before from slurm server/slurmctld)
$ cp munge.key /etc/munge
and then I got all the invalid credential errors and problems reported here and in other reports, including the 'Zero Bytes' error on the client side
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with corresponding entries in the Slurm SERVER/slurmctld logs, like:
[SERVER]$ tail /var/log/munge/munged.log
2022-12-30 22:57:23 +0100 Notice: Running on ..
2022-12-30 23:01:11 +0100 Info: Invalid credential ...
and
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
[2022-12-30T23:01:11.440] error: Munge decode failed: Invalid credential
[2022-12-30T23:01:11.440] ENCODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] DECODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: REQUEST_PARTITION_INFO has authentication error: Invalid authentication credential
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: Protocol authentication error
All of this is fixed by rebooting the client, as suggested by others here, or, slightly less intrusively, by just restarting the client munge daemon
[CLIENT]$ sudo systemctl restart munge.service
and then munge on the client / unmunge on the server works. It also fixes my main problem of getting the client to see the slurm server without the dreaded 'Zero Bytes' error:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with server log entries
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
...
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Invalid Protocol Version 9472 from uid=-1 at XX.XX.XX.XX:44150
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2022-12-30T23:17:14.027] error: slurm_receive_msg [XX.XX.XX.XX:44150]: Unspecified error
And, after munge restart, voilà:
[CLIENT] $ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
LocalQ* up infinite 1 idle XXX
For the examples: SERVER Ubuntu 20.04, CLIENTS Ubuntu 20.04 (and 22.04, which seems to be incompatible with the SERVER's slurm version, according to the log).
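For completeness, the client-munge / server-unmunge round trip described above can be done in one line over ssh. A sketch, assuming you can ssh from the client to the slurmctld host (written here as SERVER):
[CLIENT]$ munge -n | ssh SERVER unmunge
# STATUS: Success (0) means both hosts share the key and their munge daemons agree on UID/GID and clock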

Related

nfs mount exiting with connection timed out

I am facing trouble with an NFS mount on one of our nodes.
Using nfsv4.
# nfsstat -s
Server rpc stats:
calls badcalls badfmt badauth badclnt
30484 0 0 0 0
Server nfs v4:
null compound
2 0% 30481 99%
Kernel version: Linux 4.19.255
gcc version: gcc version 8.4.0
Below are the error logs when trying to mount from the NFS server on the NFS client.
# mount ser_ip:/ffs/run tmp -v -t nfs
mount.nfs: timeout set for Fri Sep 24 11:22:40 2022
mount.nfs: trying text-based options 'vers=4.2,addr=ser_ip,clientaddr=cli_ip'
mount.nfs: mount(2): Invalid argument
mount.nfs: trying text-based options 'vers=4,minorversion=1,addr=ser_ip,clientaddr=cli_ip'
mount.nfs: mount(2): Invalid argument
mount.nfs: trying text-based options 'vers=4,addr=ser_ip,clientaddr=cli_ip'
mount.nfs: mount(2): Connection timed out
mount.nfs: Connection timed out
I checked by enabling rpc_debug and nfsd debug logs, and observed the NFS server logging:
[72245.940260] svc: server 00000000037849c6 waiting for data (to = 3600000)
[72251.432660] svc: svc_process dropit
[72251.432665] svc: xprt 00000000764063c0 dropped request
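For context, kernel debug messages like the ones above can be switched on with the rpcdebug utility from nfs-utils. A sketch of the kind of commands involved (run as root on the server; clear the flags afterwards because the output is very verbose):
# enable RPC- and nfsd-level debugging, then reproduce the mount and watch dmesg
rpcdebug -m rpc -s all
rpcdebug -m nfsd -s all
# turn the debugging back off when done
rpcdebug -m rpc -c all
rpcdebug -m nfsd -c all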
The issue is intermittent: even after restarting the NFS server I observe the same problem, but after rebooting the node the issue seems to disappear.
Any clues would be a great help.
Waiting for your responses.
Thanks in advance!

How to create an x server with Singularity

Overall, I am trying to render images using Unity on a remote cluster.
The cluster does not have an X server; I don't have sudo permissions, nor can I start a Docker container, but I can start a Singularity container.
My plan is to create a container that would simulate the X Server. I created the following Singularity definition file:
Bootstrap: docker
From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
%post
# xvfb for rendering in headless mode
apt-get update
apt-get install -y xvfb mesa-utils xorg
echo "allowed_users = anybody" > /etc/X11/Xwrapper.config
I started the container with the option --containall. From the container, I launched the command /usr/bin/X :0, but it failed with the following error:
Singularity xvfb.sif:~> /usr/bin/X :0
_XSERVTransmkdir: Owner of /tmp/.X11-unix should be set to root
X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-140-generic x86_64 Ubuntu
Current Operating System: Linux cooper 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.8.0-50-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Build Date: 08 April 2021 01:57:21PM
xorg-server 2:1.19.6-1ubuntu4.9 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/home/pierre-louis/.local/share/xorg/Xorg.0.log", Time: Wed May 26 09:17:05 2021
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE)
Fatal server error:
(EE) parse_vt_settings: Cannot open /dev/tty0 (No such file or directory)
(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/home/pierre-louis/.local/share/xorg/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.
No /dev/tty* devices exist. I then tried to launch startx, only to get the same error message.
How can I launch an X Server using a Singularity image?
As mentioned in a separate discussion, Xvfb is not supposed to be started through startx or /usr/bin/X but rather with the supplied run script.
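In practice that means running Xvfb (or its xvfb-run wrapper) directly inside the container, which needs neither a real tty nor root. A minimal sketch, assuming the xvfb and mesa-utils packages from the definition file above; the display number, screen geometry, and the Unity binary name are placeholders:
Singularity xvfb.sif:~> Xvfb :99 -screen 0 1280x1024x24 &
Singularity xvfb.sif:~> export DISPLAY=:99
Singularity xvfb.sif:~> glxinfo | head -5        # quick check that clients can talk to the virtual display
# alternatively, let the wrapper pick a display and manage the server's lifetime:
Singularity xvfb.sif:~> xvfb-run -a -s "-screen 0 1280x1024x24" ./my_unity_app.x86_64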

dovecot unable to start due to address already in use

I upgraded my Linux kernel and dovecot failed to start with the following error messages:
Error: service(managesieve-login): listen(*, 4190) failed: Address already in use
Error: service(pop3-login): listen(*, 110) failed: Address already in use
Error: service(pop3-login): listen(*, 995) failed: Address already in use
Error: service(imap-login): listen(*, 143) failed: Address already in use
Error: service(imap-login): listen(*, 993) failed: Address already in use
Fatal: Failed to start listeners
Strangely enough, I couldn't find any process bound to those port numbers. All of the commands below return nothing.
# netstat -tulpn | grep 110
# ss -tulpn |grep 110
# fuser 110/tcp
# lsof -i :110
I also tried to change the listen setting to my specific IP address and it still failed the same way.
Any idea how I can solve this problem? Here's my version info:
# uname -a
Linux ip-172-31-26-222 4.14.177-107.254.amzn1.x86_64 #1 SMP Thu May 7 18:30:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
# dovecot --version
2.2.36 (1f10bfa63)
Hi, it looks like you are using AWS, as I am. I recently updated via yum as well and noticed that a new package named 'portreserve' was also installed. I killed that process, left /etc/dovecot/dovecot.conf as it was before, and then started Dovecot successfully. I was also immediately able to reconnect my mail clients. I hope that helps you.
I also restarted the portreserve program afterwards, since it seems useful for limiting port access.
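For anyone hitting the same thing, a minimal sketch of the check-and-release sequence (it assumes portreserve's companion tool portrelease is installed and that a dovecot entry exists under /etc/portreserve; service names may differ per distribution):
# is portreserve running and sitting on the mail ports?
pgrep -a portreserve
ss -tlpn | grep -E ':(110|143|993|995|4190)\b'
# ask it to hand the dovecot ports back, then start dovecot
portrelease dovecot
service dovecot start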

GPFS : mmremote: Unable to determine the local node identity

I had a 4-node GPFS cluster up and running, and things were fine until last week, when the server hosting these RHEL setups went down. After the server was brought back up and the RHEL nodes were started again, one of the nodes' IPs got changed.
After that I am not able to use the node.
Simple commands like 'mmlscluster' and 'mmgetstate' fail with this error:
[root@gpfs3 ~]# mmlscluster
mmlscluster: Unable to determine the local node identity.
mmlscluster: Command failed. Examine previous error messages to determine cause.
[root@gpfs3 ~]# mmstartup
mmstartup: Unable to determine the local node identity.
mmstartup: Command failed. Examine previous error messages to determine cause.
mmshutdown fails with a different error:
[root@gpfs3 ~]# mmshutdown
mmshutdown: Unexpected error from getLocalNodeData: Unknown environmentType. Return code: 1
logs have this info:
Mon Feb 15 18:18:34 IST 2016: Node rebooted. Starting mmautoload...
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:18:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:19:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:20:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:21:35 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:22:35 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
mmautoload: The GPFS environment cannot be initialized.
mmautoload: Correct the problem and use mmstartup to start GPFS.
I tried changing the IP to the new one, but got the same error:
[root@gpfs1 ~]# mmchnode -N gpfs3 --admin-interface=xx.xx.xx.xx
Mon Feb 15 20:00:05 IST 2016: mmchnode: Processing node gpfs3
mmremote: Unable to determine the local node identity.
mmremote: Command failed. Examine previous error messages to determine cause.
mmremote: Unable to determine the local node identity.
mmremote: Command failed. Examine previous error messages to determine cause.
mmchnode: Unexpected error from checkExistingClusterNode gpfs3. Return code: 0
mmchnode: Command failed. Examine previous error messages to determine cause.
Can someone please help me in fixing this issue?
The easiest fix is probably to remove the node from the cluster (mmdelnode) and then add it back in (mmaddnode). You might need to mmdelnode -f.
If deleting and adding the node back in is not an option, try giving IBM support a call.
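A rough sketch of that sequence, run from a healthy node such as gpfs1 (the node name gpfs3 is taken from the output above; re-check quorum and license designations for your own cluster before doing this):
[root@gpfs1 ~]# mmdelnode -N gpfs3        # add -f if GPFS refuses because the node is unreachable
[root@gpfs1 ~]# mmaddnode -N gpfs3        # re-add the node now that its new IP resolves
[root@gpfs1 ~]# mmstartup -N gpfs3        # start GPFS on the re-added node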

rsnapshot on Linux fails with "returned 12 while processing"

I thought I had rsnapshot all setup properly, but after checking my logs the next day I found the following:
[05/Sep/2014:10:34:11] /usr/bin/rsnapshot daily: ERROR: /usr/bin/rsync returned 12 while processing john@192.168.0.102:/media/linuxstorage/docs/
What does return code "12" mean?
To see what was going on, I ran it manually and went off to do other things:
raspberrypi $ sudo rsnapshot daily
Well lo and behold, it had been sitting there waiting for my password.
john@192.168.0.102's password:
Connection closed by 192.168.0.102
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(605) [Receiver=3.0.9]
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
----------------------------------------------------------------------------
ERROR: /usr/bin/rsync returned 12 while processing bgrissom@192.168.0.102:/medi/linuxstorage/docs/
I had changed the rsnapshot user from pi to root in /etc/crontab, and root did not have the "ssh without a password" keys set up for the remote host. All I had to do to fix this was:
raspberrypi $ sudo bash
raspberrypi # ssh-copy-id john@192.168.0.102
The takeaway: return code "12" means there is something wrong with authentication to the remote server.
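To verify the fix, it helps to confirm that the account rsnapshot runs as (root via /etc/crontab in my case) can reach the remote host without any prompt. A minimal check, reusing the host from above:
raspberrypi # ssh -o BatchMode=yes john@192.168.0.102 true && echo "passwordless ssh OK"
# BatchMode=yes makes ssh fail instead of prompting, which is exactly how it behaves under cron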
I ran into this also and seems like this is the most common problem for getting that error:
ERROR: /usr/bin/rsync returned 12 while processing .....
Problem: rsnapshot uses rsync under the hood and can't connect because you probably never actually connected to that remote server.
Solution: You have to connect to that remote server at least once manually, through a terminal, from the machine where rsnapshot is running,
with: ssh remote_user@remote_server.domain
so that you confirm the connection and an entry is made in known_hosts!
After that rsnapshot worked for me.
