OpenMPI: Host key verification failed on SGE cluster

I recently installed OpenMPI version 2.0 on my SGE cluster. But when I submit a job I get "Host key verification failed", even though I can log in to that node (compute10) from the submit host without a password.
The error in the output file:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Wed Jan 30 15:58:53 EST 2019
Host key verification failed.
[file orca_main/gtoint.cpp, line 137]: ORCA finished by error termination in ORCA_GTOInt
My SGE script is below:
#!/bin/tcsh
#$ -q sge-queue@compute10
#$ -pe mpi 8
#$ -V
#$ -cwd
#$ -j y
#$ -l h_vmem=64G
date
setenv OMP_NUM_THREADS 8
/home/user/orca_4_0_1_2_linux_x86-64_openmpi202/orca ccl3.inp > ccl3.out
date
And my parallel environment mpi:
pe_name mpi
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /export/sge6.2_U7/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /export/sge6.2_U7/mpi/stopmpi.sh
allocation_rule $pe_slots
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE

After trying various things, updating OpenMPI to version 3.1.0 and building it with the options below solved the issue.
./configure --prefix=/usr/local --with-sge --enable-orterun-prefix-by-default
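As a quick sanity check (a sketch; it assumes the rebuilt ompi_info is the one on your PATH), you can confirm that the new build actually picked up grid engine support:

# gridengine components (e.g. "MCA ras: gridengine") should be listed
# when OpenMPI was configured --with-sge
ompi_info | grep -i gridengine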

Related

When I use the VS Code Remote-SSH extension to connect to my remote server, VS Code cannot install vscode-server on the host

This is the log from when VS Code tries to install vscode-server on the host.
I found that it gets the vscode-server commit id, as the following log line shows:
[13:07:27.334] Using commit id "f80445acd5a3dadef24aa209168452a3d97cc326" and quality "stable" for server
but it doesn't use this commit id in the wget download URL:
[13:07:28.420] > wget download failed
> https://update.code.visualstudio.com/commit:/server-darwin/stable:
[13:07:27.297] Log Level: 2
[13:07:27.298] remote-ssh#0.74.0
[13:07:27.298] darwin x64
[13:07:27.306] SSH Resolver called for "ssh-remote+devbox", attempt 1
[13:07:27.307] "remote.SSH.useLocalServer": true
[13:07:27.307] "remote.SSH.path": undefined
[13:07:27.308] "remote.SSH.configFile": undefined
[13:07:27.308] "remote.SSH.useFlock": true
[13:07:27.308] "remote.SSH.lockfilesInTmp": false
[13:07:27.308] "remote.SSH.localServerDownload": off
[13:07:27.309] "remote.SSH.remoteServerListenOnSocket": false
[13:07:27.309] "remote.SSH.showLoginTerminal": false
[13:07:27.309] "remote.SSH.defaultExtensions": []
[13:07:27.309] "remote.SSH.loglevel": 2
[13:07:27.310] "remote.SSH.enableDynamicForwarding": true
[13:07:27.310] "remote.SSH.enableRemoteCommand": false
[13:07:27.310] "remote.SSH.serverPickPortsFromRange": {}
[13:07:27.310] "remote.SSH.serverInstallPath": {}
[13:07:27.325] SSH Resolver called for host: devbox
[13:07:27.325] Setting up SSH remote "devbox"
[13:07:27.330] Acquiring local install lock: /var/folders/8f/x1b597tj715cn0x95bjqmy1m0000gp/T/vscode-remote-ssh-c4ea1055-install.lock
[13:07:27.333] Looking for existing server data file at /Users/bytedance/Library/Application Support/Code/User/globalStorage/ms-vscode-remote.remote-ssh/vscode-ssh-host-c4ea1055-f80445acd5a3dadef24aa209168452a3d97cc326-0.74.0/data.json
[13:07:27.334] Using commit id "f80445acd5a3dadef24aa209168452a3d97cc326" and quality "stable" for server
[13:07:27.340] Install and start server if needed
[13:07:27.344] PATH: /Users/bytedance/.yarn/bin:/Users/bytedance/.config/yarn/global/node_modules/.bin:/Users/bytedance/.nvm/versions/node/v14.15.1/bin:/Users/bytedance/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/puppetlabs/bin:/Library/Apple/usr/bin:/Applications/Mplus
[13:07:27.345] Checking ssh with "ssh -V"
[13:07:27.359] > OpenSSH_8.6p1, LibreSSL 2.8.3
[13:07:27.369] askpass server listening on /var/folders/8f/x1b597tj715cn0x95bjqmy1m0000gp/T/vscode-ssh-askpass-efdbcc89680acf75582ff2c0e6d258ff60bbf93f.sock
[13:07:27.369] Spawning local server with {"serverId":1,"ipcHandlePath":"/var/folders/8f/x1b597tj715cn0x95bjqmy1m0000gp/T/vscode-ssh-askpass-340d49ea7f83ee5441ed6e4eecae6d815a8c993e.sock","sshCommand":"ssh","sshArgs":["-v","-T","-D","63970","-o","ConnectTimeout=15","devbox"],"serverDataFolderName":".vscode-server","dataFilePath":"/Users/bytedance/Library/Application Support/Code/User/globalStorage/ms-vscode-remote.remote-ssh/vscode-ssh-host-c4ea1055-f80445acd5a3dadef24aa209168452a3d97cc326-0.74.0/data.json"}
[13:07:27.370] Local server env: {"DISPLAY":"1","ELECTRON_RUN_AS_NODE":"1","SSH_ASKPASS":"/Users/bytedance/.vscode/extensions/ms-vscode-remote.remote-ssh-0.74.0/out/local-server/askpass.sh","VSCODE_SSH_ASKPASS_NODE":"/Applications/Visual Studio Code.app/Contents/MacOS/Electron","VSCODE_SSH_ASKPASS_EXTRA_ARGS":"--ms-enable-electron-run-as-node","VSCODE_SSH_ASKPASS_MAIN":"/Users/bytedance/.vscode/extensions/ms-vscode-remote.remote-ssh-0.74.0/out/askpass-main.js","VSCODE_SSH_ASKPASS_HANDLE":"/var/folders/8f/x1b597tj715cn0x95bjqmy1m0000gp/T/vscode-ssh-askpass-efdbcc89680acf75582ff2c0e6d258ff60bbf93f.sock"}
[13:07:27.371] Spawned 14057
[13:07:27.515] > local-server-1> Spawned ssh, pid=14064
[13:07:27.520] stderr> OpenSSH_8.6p1, LibreSSL 2.8.3
[13:07:27.596] stderr> debug1: Server host key: ecdsa-sha2-nistp256 SHA256:b+9mK6ATDvqmXeFeXiRzqh4iICIEtuNptAfPeSuV4sI
[13:07:27.953] stderr> Authenticated to 10.227.84.41 ([10.227.84.41]:22).
[13:07:28.012] > Linux n227-084-041 4.14.81.bm.15-amd64 #1 SMP Debian 4.14.81.bm.15 Sun Sep 8 05:02:31 UTC 2019 x86_64
>
> The programs included with the Debian GNU/Linux system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
>
> Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
> permitted by applicable law.
[13:07:28.023] > ready: f8c7ecb7b7c3
[13:07:28.033] > Linux 4.14.81.bm.15-amd64 #1 SMP Debian 4.14.81.bm.15 Sun Sep 8 05:02:31 UTC 2019
[13:07:28.034] Platform: linux
[13:07:28.046] stderr> bash: line 1: syntax error near unexpected token `done'
[13:07:28.046] stderr> bash: line 1: `done'
[13:07:28.048] > Installing to ...
[13:07:28.048] stderr> zsh: parse error near `}'
[13:07:28.051] > f8c7ecb7b7c3%%1%%
[13:07:28.051] stderr> do_host_download:1: command not found: millis
[13:07:28.054] > Downloading with wget
[13:07:28.420] > wget download failed
> https://update.code.visualstudio.com/commit:/server-darwin/stable:
> 2022-02-23 13:07:30 ERROR 404: Not Found.
I have had this problem as well since this morning, and what was odd was that I could SSH from the terminal to the target host with no problem.
After some debugging, it seems the Remote - SSH extension is causing the trouble. Either of the following two options worked for me:
Downgrading the extension to 0.70.0. The current version (0.74.0 as of now) was updated just two days ago, and I think this update is causing the trouble.
If you would like to keep the current version, turning off remote.SSH.useLocalServer also works. If you're on a Mac, go to Code > Preferences > Settings (Cmd + ,), search for remote.SSH.useLocalServer, and you'll see the option, which is on by default. Turning it off did the trick for me too.
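For reference, a rough command-line sketch of both workarounds (the @version pinning syntax is an assumption about your VS Code CLI; recent releases support it):

# Option 1: pin the Remote - SSH extension to the older, working version
code --install-extension ms-vscode-remote.remote-ssh@0.70.0

# Option 2: keep 0.74.0 but disable the local server by adding this to settings.json:
#   "remote.SSH.useLocalServer": false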

How to create an X server with Singularity

Overall, I am trying to render images using Unity on a remote cluster.
The cluster does not have an X server; I don't have sudo permissions and can't start a Docker container, but I can start a Singularity container.
My plan is to create a container that would simulate the X Server. I created the following Singularity definition file:
Bootstrap: docker
From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
%post
# xvfb for rendering in headless mode
apt-get update
apt-get install -y xvfb mesa-utils xorg
echo "allowed_users = anybody" > /etc/X11/Xwrapper.config
I started the container with the option --containall. From the container, I launched the command /usr/bin/X :0, but it failed with the following error:
Singularity xvfb.sif:~> /usr/bin/X :0
_XSERVTransmkdir: Owner of /tmp/.X11-unix should be set to root
X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-140-generic x86_64 Ubuntu
Current Operating System: Linux cooper 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.8.0-50-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Build Date: 08 April 2021 01:57:21PM
xorg-server 2:1.19.6-1ubuntu4.9 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/home/pierre-louis/.local/share/xorg/Xorg.0.log", Time: Wed May 26 09:17:05 2021
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE)
Fatal server error:
(EE) parse_vt_settings: Cannot open /dev/tty0 (No such file or directory)
(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/home/pierre-louis/.local/share/xorg/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.
No /dev/tty* devices exist. I then tried to launch startx, only to get the same error message.
How can I launch an X Server using a Singularity image?
As mentioned in a separate discussion, Xvfb is not supposed to be started through startx or /usr/bin/X, but rather with the supplied xvfb-run wrapper script.
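For example, a minimal sketch using xvfb-run (assuming an image built from the definition file above; the display size and the glxinfo test command are arbitrary):

# start a throwaway virtual framebuffer X server and run a GL command under it;
# -a picks a free display number, -s passes arguments through to Xvfb
singularity exec --containall xvfb.sif xvfb-run -a -s "-screen 0 1280x1024x24" glxinfo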

Slurm and Munge "Invalid Credential"

I'm installing Slurm for the first time. I've installed the 19.05.1-2 tarball and used the configurator to make a very simple two-node cluster. The control node is sdc; the compute nodes (running slurmd) are sdc and sdc1. Both were rebuilt with Ubuntu 18.04.
I can start the controller and the compute node sdc, and also successfully submit jobs with srun. That's great. However, when I start slurmd on the second node, sdc1, I get:
slurmd: error: Unable to register: Zero Bytes were transmitted or received
That quickly led me to my munge configuration. Munge.log on the controller (sdc) shows "Invalid credential" every second. I triple-checked that munge.key on both hosts is identical. I verified that NTP is running too.
So by hand I ran munge -s foobar | unmunge on sdc1, and of course that worked locally. Then I saved the munged text from sdc1 to a file on sdc and tried to unmunge it there. That gave me the "Invalid credential" error again.
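(For reference, the same cross-node check can be done in one line, assuming SSH access from sdc1 to sdc:)

# encode a credential on sdc1 and decode it on sdc over SSH
munge -n | ssh sdc unmunge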
Because of this I uninstalled and reinstalled munge on both systems, distributed the key and repeated that test with the same result.
I guess I'm missing something simple. I don't know what else to do to properly install munge.
It was a UID/GID mismatch between the nodes. Of course, it's mentioned in the installation guide.
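A quick way to spot such a mismatch is to compare the service accounts on each node, for example:

# run on every node and compare the output; the uid and gid for munge (and slurm) must match across the cluster
id munge
id slurm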
Did you remember to restart the munge daemon after copying munge.key to /etc/munge? I got the same error after doing the following:
1: install slurm:
$ apt install -y slurm-client
2: copy slurm.conf
(perhaps create /etc/slurm-llnl beforehand):
$ cp slurm.conf /etc/slurm-llnl
3: copy munge key to client
(munge.key copied before from slurm server/slurmctld)
$ cp munge.key /etc/munge
and then I got all the invalid credential errors and problems reported here and in other reports, including the 'Zero Bytes' error on the client side:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with corresponding entries in the Slurm SERVER/slurmctld logs, like:
[SERVER]$ tail /var/log/munge/munged.log
2022-12-30 22:57:23 +0100 Notice: Running on ..
2022-12-30 23:01:11 +0100 Info: Invalid credential ...
and
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
[2022-12-30T23:01:11.440] error: Munge decode failed: Invalid credential
[2022-12-30T23:01:11.440] ENCODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] DECODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: REQUEST_PARTITION_INFO has authentication error: Invalid authentication credential
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: Protocol authentication error
All of this is fixed by rebooting the client, as suggested by others here, or, slightly less intrusively, by just restarting the client munge daemon:
[CLIENT]$ sudo systemctl restart munge.service
and then munge on the client / unmunge on the server works. It also fixes my main problem of getting the client to see the Slurm server, which had been failing with the dreaded 'Zero Bytes' error:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with server log entries
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
...
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Invalid Protocol Version 9472 from uid=-1 at XX.XX.XX.XX:44150
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2022-12-30T23:17:14.027] error: slurm_receive_msg [XX.XX.XX.XX:44150]: Unspecified error
And, after munge restart, voilĂ :
[CLIENT] $ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
LocalQ* up infinite 1 idle XXX
For these examples: SERVER is Ubuntu 20.04, CLIENTS are Ubuntu 20.04 (and 22.04, which seems to be incompatible with the SERVER's Slurm version, according to the log).

ejabberd shows an error while running configure during installation

I am trying to configure ejabberd on my server.
I have installed all dependencies and other needed things. Erlang was also installed successfully through an RPM; running it gives:
root@sXX-XX-XX-XX [~]# erl -smp disable
Erlang/OTP 18 [erts-7.1] [source-2882b0c] [64-bit] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V7.1 (abort with ^G)
1>
For the final installation of ejabberd, I followed the link below:
http://docs.ejabberd.im/admin/guide/installation/
I have also tried make clean, and after trying ./configure --enable-mysql I still get the error below.
root@sXX-XX-XX-XX [/etc/ejabberd_downloads/ejabberd]# ./configure
checking whether make sets $(MAKE)... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking for a sed that does not truncate output... /bin/sed
checking for erl... /usr/bin/erl
checking for erlc... /usr/bin/erlc
checking for erl... /usr/bin/erl
checking for erlc... /usr/bin/erlc
checking Erlang/OTP version...
Crash dump is being written to: erl_crash.dump...
Failed to create aux thread
./configure: line 2523: 1636 Aborted $ERLC conftest.erl
configure: error: "Could not compile Erlang/OTP version check program using '/usr/bin/erlc'"
The first few lines from the top of erl_crash.dump are below:
=erl_crash_dump:0.3
Thu Nov 19 01:31:57 2015
Slogan: Failed to create aux thread
System version: Erlang/OTP 18 [erts-7.1] [source-2882b0c] [64-bit] [smp:64:24] [async-threads:0] [hipe] [kernel-poll:false]
Compiled: Wed Sep 23 15:34:00 2015
Taints:
Atoms: 2005
Calling Thread: beam.smp
=scheduler:1
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING
Scheduler Sleep Info Aux Work: SET_TMO
Current Port:
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 1
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: NONEMPTY_NORMAL | NONEMPTY
Current Process:
=scheduler:2
The result of which erl is:
/usr/bin/erl
I am not able to trace the issue. Any reference will be very helpful. Thanks in advance.
It seems that your Erlang version is corrupted or old. Reinstall it and try again. For ejabberd 16.x, version 6.1 (Erlang/OTP 17.1) is required.
You can uninstall Erlang with the following command:
$ sudo apt-get purge erlang*
And install the latest Erlang from http://www.erlang.org/
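After reinstalling, a quick sanity check (a sketch, not ejabberd's own tooling) is to print the OTP release that erl will actually use:

# print the installed Erlang/OTP release and exit
erl -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'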

Google Cloud Hadoop nodes "not ssh-able" error

I ran the following commands on Cygwin, following https://cloud.google.com/hadoop/setting-up-a-hadoop-cluster.
gsutil.cmd mb -p [projectname] gs://[bucketname]
./bdutil -p [projectname] -n 2 -b [bucketname] -e hadoop2_env.sh
generate_config configuration.sh
./bdutil -e configuration.sh deploy
After deployment, I am getting the following errors:
...
Node 'hadoop-w-0' did not become ssh-able after 10 attempts
Node 'hadoop-w-1' did not become ssh-able after 10 attempts
Node 'hadoop-m' did not become ssh-able after 10 attempts
Command failed: wait ${SUBPROC} on line 308.
Exit code of failed command: 1
Detailed debug info available in file: /tmp/bdutil-20150120-103601-mDh/debuginfo.txt
The logs in debuginfo.txt look like this:
******************* Exit codes and VM logs *******************
Tue, Jan 20, 2015 10:18:09 AM: Exited 1 : gcloud.cmd --project=[projectname] --quiet --verbosity=info compute ssh hadoop-w-0 --command=exit 0 --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-a
Tue, Jan 20, 2015 10:18:09 AM: Exited 1 : gcloud.cmd --project=[projectname] --quiet --verbosity=info compute ssh hadoop-w-1 --command=exit 0 --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-a
Tue, Jan 20, 2015 10:18:09 AM: Exited 1 : gcloud.cmd --project=[projectname] --quiet --verbosity=info compute ssh hadoop-w-2 --command=exit 0 --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-a
Could you please help me resolve this issue? Thanks a lot.
You may need to look at the console output for your Hadoop instances: from within the Developers Console, go to Compute Engine > VM Instances > INSTANCE_NAME and scroll down to View Console Output.
Additionally, you can run:
$ gcloud compute instances get-serial-port-output INSTANCE_NAME
This should give you a better picture of what is going on behind the scenes when the instances are booted (check whether the SSH daemon has started and on which port, etc.).
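You can also re-run by hand the same readiness probe that bdutil performs (host name and zone taken from the log above), which usually surfaces the underlying SSH or key-propagation error directly:

# mimic bdutil's ssh-ability check against one of the failing nodes
gcloud compute ssh hadoop-w-0 --zone=us-central1-a --command="exit 0" --ssh-flag="-oConnectTimeout=30"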
