Error installing Slurm: slurmd could not be started - slurm

I am trying to install Slurm on a small two-PC system, but I got the following error when starting slurmd:
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
The output of systemctl status slurmd.service and journalctl -xe is as follows:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2020-12-04 13:18:51 CST; 4min 50s ago
Docs: man:slurmd(8)
Process: 26501 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: Starting Slurm node daemon...
12月 04 13:18:51 Y-Cluster-Node1 slurmd[26501]: fatal: Unable to determine this slurmd's NodeName
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: slurmd.service: Control process exited, code=exited status=1
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: slurmd.service: Failed with result 'exit-code'.
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: Failed to start Slurm node daemon.
12月 04 13:21:05 Y-Cluster-Node1 sshd[26624]: Disconnected from authenticating user root 150.158.213.234 port 54962 [preauth]
12月 04 13:21:23 Y-Cluster-Node1 sshd[26632]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=115.68.207.186 user=root
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Failed password for root from 115.68.207.186 port 58882 ssh2
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Received disconnect from 115.68.207.186 port 58882:11: Bye Bye [preauth]
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Disconnected from authenticating user root 115.68.207.186 port 58882 [preauth]
12月 04 13:21:25 Y-Cluster-Node1 sshd[26630]: Connection closed by 212.64.12.236 port 46106 [preauth]
12月 04 13:22:13 Y-Cluster-Node1 sshd[26635]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=118.25.24.84 user=root
12月 04 13:22:14 Y-Cluster-Node1 sshd[26637]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=111.125.70.22 user=root
12月 04 13:22:14 Y-Cluster-Node1 sshd[26635]: Failed password for root from 118.25.24.84 port 47018 ssh2
12月 04 13:22:15 Y-Cluster-Node1 sshd[26635]: Received disconnect from 118.25.24.84 port 47018:11: Bye Bye [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26635]: Disconnected from authenticating user root 118.25.24.84 port 47018 [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Failed password for root from 111.125.70.22 port 58216 ssh2
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Received disconnect from 111.125.70.22 port 58216:11: Bye Bye [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Disconnected from authenticating user root 111.125.70.22 port 58216 [preauth]
12月 04 13:22:16 Y-Cluster-Node1 sshd[26639]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=72.167.227.34 user=root
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Failed password for root from 72.167.227.34 port 56304 ssh2
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Received disconnect from 72.167.227.34 port 56304:11: Bye Bye [preauth]
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Disconnected from authenticating user root 72.167.227.34 port 56304 [preauth]
12月 04 13:22:32 Y-Cluster-Node1 sshd[26641]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=182.138.239.224 user=root
12月 04 13:22:34 Y-Cluster-Node1 sshd[26641]: Failed password for root from 182.138.239.224 port 48870 ssh2
12月 04 13:22:36 Y-Cluster-Node1 sshd[26641]: Received disconnect from 182.138.239.224 port 48870:11: Bye Bye [preauth]
12月 04 13:22:36 Y-Cluster-Node1 sshd[26641]: Disconnected from authenticating user root 182.138.239.224 port 48870 [preauth]
12月 04 13:22:56 Y-Cluster-Node1 sshd[26648]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=81.68.123.185 user=root
12月 04 13:22:58 Y-Cluster-Node1 sshd[26648]: Failed password for root from 81.68.123.185 port 60848 ssh2
12月 04 13:23:00 Y-Cluster-Node1 sshd[26648]: Received disconnect from 81.68.123.185 port 60848:11: Bye Bye [preauth]
12月 04 13:23:00 Y-Cluster-Node1 sshd[26648]: Disconnected from authenticating user root 81.68.123.185 port 60848 [preauth]
12月 04 13:23:02 Y-Cluster-Node1 sshd[26652]: Connection closed by 139.217.221.89 port 35808 [preauth]
12月 04 13:23:13 Y-Cluster-Node1 sshd[26654]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=159.65.1.41 user=root
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Failed password for root from 159.65.1.41 port 40538 ssh2
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Received disconnect from 159.65.1.41 port 40538:11: Bye Bye [preauth]
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Disconnected from authenticating user root 159.65.1.41 port 40538 [preauth]
12月 04 13:23:43 Y-Cluster-Node1 sshd[26656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=222.222.31.70 user=root
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Failed password for root from 222.222.31.70 port 35282 ssh2
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Received disconnect from 222.222.31.70 port 35282:11: Bye Bye [preauth]
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Disconnected from authenticating user root 222.222.31.70 port 35282 [preauth]
12月 04 13:24:02 Y-Cluster-Node1 sshd[26660]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=150.158.213.234 user=root
12月 04 13:24:04 Y-Cluster-Node1 sshd[26660]: Failed password for root from 150.158.213.234 port 36350 ssh2
12月 04 13:24:05 Y-Cluster-Node1 sshd[26660]: Received disconnect from 150.158.213.234 port 36350:11: Bye Bye [preauth]
12月 04 13:24:05 Y-Cluster-Node1 sshd[26660]: Disconnected from authenticating user root 150.158.213.234 port 36350 [preauth]
I tried to understand the problem; it looks like a connection issue where the control node (node1) cannot reach the compute node (node2).
I did some searching around, and some mentioned it could be due to a mismatch of UIDs and GIDs. As mentioned in the installation guideline, "Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster." I did not find any issues regarding UIDs/GIDs myself. Is there any way to check this? Could anyone give me a hand here?
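For reference, a rough way to compare the relevant accounts and clocks across the two nodes is shown below. The slurm and munge service accounts are the usual suspects; the account names and the Node2 hostname are taken from typical setups and from this post, so adjust them if yours differ:
# run from Node1, assuming SSH access to Node2
for u in y-cluster slurm munge; do
    diff <(getent passwd "$u") <(ssh Y-Cluster-Node2 getent passwd "$u") && echo "$u matches"
done
# clocks should also agree
date; ssh Y-Cluster-Node2 date
If the diff prints anything for a user, that UID/GID differs between the nodes.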
Some additional information:
Using "munge -n | unmunge" I got the following on both nodes:
y-cluster@Y-Cluster-Node1:~$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: Y-Cluster-Node1 (192.168.1.111)
ENCODE_TIME: 2020-12-04 15:00:18 +0800 (1607065218)
DECODE_TIME: 2020-12-04 15:00:18 +0800 (1607065218)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: y-cluster (1000)
GID: y-cluster (1000)
LENGTH: 0
y-cluster@Y-Cluster-Node2:~/.ssh$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: Y-Cluster-Node2 (192.168.1.112)
ENCODE_TIME: 2020-12-04 15:00:20 +0800 (1607065220)
DECODE_TIME: 2020-12-04 15:00:20 +0800 (1607065220)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: y-cluster (1000)
GID: y-cluster (1000)
LENGTH: 0
Both look fine: same UID, GID, and time.
From "slurmctld -Dcvvv" I get the following error; I wonder whether it has to do with the ownership of some log files?
y-cluster@Y-Cluster-Node1:~$ slurmctld -Dcvvv
slurmctld: debug: Log file re-opened
slurmctld: killing old slurmctld[4787]
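If you want to rule out file-ownership problems, a rough check is to read the paths configured in slurm.conf and look at who owns them. The /etc/slurm-llnl and /var/log/slurm-llnl paths below are the Debian/Ubuntu package defaults; substitute whatever the grep actually prints on your system:
grep -iE 'SlurmUser|LogFile|SpoolDir|StateSaveLocation' /etc/slurm-llnl/slurm.conf
# the directories should exist and be owned by the configured SlurmUser (or root)
ls -ld /var/log/slurm-llnl /var/spool/slurmctld /var/spool/slurmd
ls -l /var/log/slurm-llnl/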

Related

Unable to connect to an Azure Ubuntu VM through VS Code

We are getting the below issue while connecting to an Azure VM through VS Code, but we are able to connect through PuTTY when we use the .ppk file. We get the error below when using both the .ppk and .pem files with the SSH config in VS Code.
Error: Permission denied (publickey)
Below are the SSH logs:
root@VMWDEPOCEUS001:/var/log# tail -30 auth.log
Oct 27 10:19:12 VMWDEPOCEUS001 sshd[3087]: Failed password for invalid user ssh user from 156.163.33.75 port 56425 ssh2
Oct 27 10:19:14 VMWDEPOCEUS001 sshd[3087]: Failed password for invalid user ssh user from 156.163.33.75 port 56425 ssh2
Oct 27 10:19:15 VMWDEPOCEUS001 sshd[3087]: Connection reset by invalid user ssh user 156.163.33.75 port 56425 [preauth]
Oct 27 10:23:52 VMWDEPOCEUS001 sshd[3095]: Invalid user ssh user1 from 156.163.33.75 port 56590
Oct 27 10:23:57 VMWDEPOCEUS001 sshd[3095]: Failed none for invalid user ssh user1 from 156.163.33.75 port 56590 ssh2
Oct 27 10:24:00 VMWDEPOCEUS001 sshd[3095]: Failed password for invalid user ssh user1 from 156.163.33.75 port 56590 ssh2
Oct 27 10:24:04 VMWDEPOCEUS001 sshd[3095]: Failed password for invalid user ssh user1 from 156.163.33.75 port 56590 ssh2
Oct 27 10:24:05 VMWDEPOCEUS001 sshd[3095]: Connection reset by invalid user ssh user1 156.163.33.75 port 56590 [preauth]
Oct 27 10:25:26 VMWDEPOCEUS001 sshd[3099]: Accepted password for user1 from 156.163.33.75 port 56649 ssh2
Oct 27 10:25:26 VMWDEPOCEUS001 sshd[3099]: pam_unix(sshd:session): session opened for user user1 by (uid=0)
Oct 27 10:25:26 VMWDEPOCEUS001 systemd-logind[1238]: New session 6 of user user1.
Oct 27 10:25:28 VMWDEPOCEUS001 sshd[3167]: Accepted password for user1 from 156.163.33.75 port 56651 ssh2
Oct 27 10:25:28 VMWDEPOCEUS001 sshd[3167]: pam_unix(sshd:session): session opened for user user1 by (uid=0)
Oct 27 10:25:28 VMWDEPOCEUS001 systemd-logind[1238]: New session 7 of user user1.
Oct 27 10:27:00 VMWDEPOCEUS001 sshd[3258]: Invalid user ssh user1 from 156.163.33.75 port 26689
Oct 27 10:27:14 VMWDEPOCEUS001 sshd[3258]: pam_unix(sshd:auth): check pass; user unknown
Oct 27 10:27:14 VMWDEPOCEUS001 sshd[3258]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=156.163.33.75
Oct 27 10:27:16 VMWDEPOCEUS001 sshd[3258]: Failed password for invalid user ssh user1 from 156.163.33.75 port 26689 ssh2
Oct 27 10:27:30 VMWDEPOCEUS001 sshd[3258]: pam_unix(sshd:auth): check pass; user unknown
Oct 27 10:27:33 VMWDEPOCEUS001 sshd[3258]: Failed password for invalid user ssh user1 from 156.163.33.75 port 26689 ssh2
Oct 27 10:27:47 VMWDEPOCEUS001 sshd[3258]: pam_unix(sshd:auth): check pass; user unknown
Oct 27 10:27:49 VMWDEPOCEUS001 sshd[3258]: Failed password for invalid user ssh user1 from 156.163.33.75 port 26689 ssh2
Oct 27 10:27:49 VMWDEPOCEUS001 sshd[3258]: Connection reset by invalid user ssh user1 156.163.33.75 port 26689 [preauth]
Oct 27 10:27:49 VMWDEPOCEUS001 sshd[3258]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=156.163.33.75
Oct 27 10:31:22 VMWDEPOCEUS001 sshd[3279]: Invalid user ssh user1 from 156.163.33.75 port 56826
Oct 27 10:31:34 VMWDEPOCEUS001 sshd[3279]: pam_unix(sshd:auth): check pass; user unknown
Oct 27 10:31:34 VMWDEPOCEUS001 sshd[3279]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=156.163.33.75
Oct 27 10:31:35 VMWDEPOCEUS001 sshd[3279]: Failed password for invalid user ssh user1 from 156.163.33.75 port 56826 ssh2
Oct 27 10:32:09 VMWDEPOCEUS001 sshd[3279]: pam_unix(sshd:auth): check pass; user unknown
Oct 27 10:32:10 VMWDEPOCEUS001 sshd[3279]: Failed password for invalid user ssh user1 from 156.163.33.75 port 56826 ssh2
Expectation: connect to the Azure Ubuntu (18.04) VM using a VS Code SSH config file.
I tried to reproduce this in my environment and successfully connected to an Azure Ubuntu (18.04) VM over SSH using VS Code.
As per the docs, PuTTY is not supported and causes an error; install an OpenSSH client instead.
Switch to the root user:
sudo -s
Install ssh (or, if it is already installed, update it), enable it, and check whether its status is active, as below:
sudo apt-get install ssh
sudo apt-get update
systemctl enable ssh.service
systemctl status ssh.service
Generate an SSH key. Avoid ssh -i; use user@hostname when configuring the SSH host, as below.
Check whether the host, user, and hostname provided in your config file are correct.
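For example, a typical ~/.ssh/config entry used by the VS Code Remote-SSH extension looks like this (the host alias, address, user name, and key path below are placeholders, not values from the post):
Host azure-vm
    HostName <public-ip-or-dns-of-the-vm>
    User azureuser
    IdentityFile ~/.ssh/id_rsa
    Port 22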
When I tried to connect, I got the same error and an SSH log like the one below.
To resolve this issue:
First, try changing your password with sudo passwd root, then enable password authentication: run sudo nano /etc/ssh/sshd_config to open the configuration in the nano editor, as below.
In the editor, scroll down and set PermitRootLogin yes and PubkeyAuthentication yes, then save the file and exit.
Restart the SSH service by following this command:
sudo systemctl reload sshd
Then try to connect with ssh user@hostname; it should work.
If you are still facing the issue, check as root that ssh is installed and up to date and that its status is active as above, and in the editor check the following directives:
PasswordAuthentication
ChallengeResponseAuthentication
GSSAPIAuthentication yes
GSSAPICleanupCredentials no
UsePAM yes
For more detail, please refer to these links:
SSH Failed Permission Denied by phoenixnap
Connect over SSH with Visual Studio Code

Apache Error on CentOS fails to start or restart

The Apache web server fails to restart. The server had been running well and suddenly failed.
What would be the possible cause making httpd.service fail to start, and what is the solution?
Running apachectl configtest returns: symbol lookup error: /usr/local/apache/bin/httpd: undefined symbol: apr_crypto_init
Running systemctl status httpd.service:
httpd.service - Web server Apache
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2022-10-04 22:36:27 CST; 1min 24s ago
Process: 13030 ExecStop=/usr/local/apache/bin/apachectl graceful-stop (code=exited, status=127)
Process: 3911 ExecStart=/usr/local/apache/bin/apachectl start (code=exited, status=127)
Main PID: 851 (code=exited, status=0/SUCCESS)
Oct 04 22:36:27 hwsrv-985893.hostwindsdns.com systemd[1]: Starting Web server Apache...
Oct 04 22:36:27 hwsrv-985893.hostwindsdns.com apachectl[3911]: /usr/local/apache/bin/httpd: symbol lookup error: /usr/local/apache/bin/httpd: undefined symbol: apr_crypto_init
Oct 04 22:36:27 hwsrv-985893.hostwindsdns.com systemd[1]: httpd.service: control process exited, code=exited status=127
Oct 04 22:36:27 hwsrv-985893.hostwindsdns.com systemd[1]: Failed to start Web server Apache.
Oct 04 22:36:27 hwsrv-985893.hostwindsdns.com systemd[1]: Unit httpd.service entered failed state.
Oct 04 22:36:27 hwsrv-985893.hostwindsdns.com systemd[1]: httpd.service failed.
Running journalctl -xe:
Oct 04 22:51:54 hwsrv-985893.hostwindsdns.com kernel: net_ratelimit: 75 callbacks suppressed
Oct 04 22:51:56 hwsrv-985893.hostwindsdns.com sshd[4063]: Failed password for root from 61.177.172.114 port 36803 ssh2
Oct 04 22:51:56 hwsrv-985893.hostwindsdns.com sshd[4065]: Failed password for root from 218.92.0.195 port 33236 ssh2
Oct 04 22:51:56 hwsrv-985893.hostwindsdns.com sshd[4063]: Received disconnect from 61.177.172.114 port 36803:11: [preauth]
Oct 04 22:51:56 hwsrv-985893.hostwindsdns.com sshd[4063]: Disconnected from 61.177.172.114 port 36803 [preauth]
Oct 04 22:51:56 hwsrv-985893.hostwindsdns.com sshd[4063]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=61.177.172.114 user=root
Oct 04 22:51:56 hwsrv-985893.hostwindsdns.com sshd[4065]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "root"
Oct 04 22:51:59 hwsrv-985893.hostwindsdns.com sshd[4065]: Failed password for root from 218.92.0.195 port 33236 ssh2
Oct 04 22:51:59 hwsrv-985893.hostwindsdns.com sshd[4065]: Received disconnect from 218.92.0.195 port 33236:11: [preauth]
Oct 04 22:51:59 hwsrv-985893.hostwindsdns.com sshd[4065]: Disconnected from 218.92.0.195 port 33236 [preauth]
Oct 04 22:51:59 hwsrv-985893.hostwindsdns.com sshd[4065]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.92.0.195 user=root
Oct 04 22:51:59 hwsrv-985893.hostwindsdns.com kernel: net_ratelimit: 65 callbacks suppressed
Oct 04 22:52:05 hwsrv-985893.hostwindsdns.com kernel: net_ratelimit: 77 callbacks suppressed
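As a side note on the error itself: "undefined symbol: apr_crypto_init" usually means the httpd binary was built against a newer APR/APR-util than the shared libraries it is loading at runtime. A rough way to check which libraries the binary actually resolves (ldd and nm assumed available; the library path in the second command is a placeholder taken from the ldd output):
ldd /usr/local/apache/bin/httpd | grep -i apr
nm -D /path/to/libaprutil-1.so.0 | grep apr_crypto_init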

Job for httpd.service failed because the control process exited with error code. See "systemctl status httpd.service" and "journalctl -xe" for details

I am unable to restart my Apache server to successfully install the SSL certificates.
I get the following error:
Job for httpd.service failed because the control process exited with error code. See "systemctl status httpd.service" and "journalctl -xe" for details.
I have tried several articles, and the root cause seems to be the following:
Mar 29 13:05:09 localhost.localdomain httpd[1234546]: (98)Address already in use: AH00072: make_sock: could not bind to address [::]:80
Mar 29 13:05:09 localhost.localdomain httpd[1234546]: (98)Address already in use: AH00072: make_sock: could not bind to address 0.0.0.0:80
I was able to diagnose the issue and get the following output (also attached below), but I am unable to proceed further. Can you please help?
Server - AlmaLinux 8
Host - IONOS
Server version: Apache/2.4.37 (AlmaLinux)
-- Unit session-62994.scope has finished starting up.
--
-- Unit session-62994.scope has finished starting up.
--
-- The unit session-62994.scope has successfully entered the 'dead' state.
Mar 31 06:07:10 localhost.localdomain dhclient[1326]: XMT: Solicit on ens192, interval 110600ms.
Mar 31 06:07:10 localhost.localdomain dhclient[1326]: RCV: Advertise message on ens192 from fe80::250:56ff:fe8c:84c6.
Mar 31 06:07:10 localhost.localdomain dhclient[1326]: RCV: Advertise message on ens192 from fe80::250:56ff:fe9a:f13a.
Mar 31 06:07:30 localhost.localdomain sshd[1297516]: Invalid user sui from 167.99.68.65 port 48488
Mar 31 06:07:30 localhost.localdomain sshd[1297516]: pam_unix(sshd:auth): check pass; user unknown
Mar 31 06:07:30 localhost.localdomain sshd[1297516]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=167.99.68.65
Mar 31 06:07:32 localhost.localdomain sshd[1297516]: Failed password for invalid user sui from 167.99.68.65 port 48488 ssh2
Mar 31 06:07:34 localhost.localdomain sshd[1297516]: Received disconnect from 167.99.68.65 port 48488:11: Bye Bye [preauth]
Mar 31 06:07:34 localhost.localdomain sshd[1297516]: Disconnected from invalid user sui 167.99.68.65 port 48488 [preauth]
Mar 31 06:07:44 localhost.localdomain unix_chkpwd[1297520]: password check failed for user (root)
Mar 31 06:07:44 localhost.localdomain sshd[1297518]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=61.177.173.27 user=root
Mar 31 06:07:46 localhost.localdomain sshd[1297518]: Failed password for root from 61.177.173.27 port 58626 ssh2
Mar 31 06:07:46 localhost.localdomain unix_chkpwd[1297521]: password check failed for user (root)
[root@localhost ~]# ss --listening --tcp --numeric --processes
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1087,fd=10),("nginx",pid=1086,fd=10),("nginx",pid=1084,fd=10))
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1335,fd=5))
LISTEN 0 128 0.0.0.0:443 0.0.0.0:* users:(("nginx",pid=1087,fd=11),("nginx",pid=1086,fd=11),("nginx",pid=1084,fd=11))
LISTEN 0 128 [::]:22 [::]:* users:(("sshd",pid=1335,fd=7))
LISTEN 0 80 *:3306 *:* users:(("mysqld",pid=1098,fd=19))
Tried -
apachectl configtest - Result: syntax ok
setenforce 0
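For what it's worth, the ss output above already shows nginx bound to ports 80 and 443, which is exactly what produces the make_sock "Address already in use" errors. A quick way to confirm which process owns a given port (the same ss tool, just filtered; lsof shown as an alternative if it is installed):
ss -ltnp '( sport = :80 )'
sudo lsof -i :80 -sTCP:LISTEN
httpd will not be able to bind until nginx is stopped or one of the two servers is moved to different ports.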

Interpreting the auth.log on a Linux system: what qualifies as one login attempt?

Using Python 3.5, I am writing a bit of code to analyze /var/log/auth.log and identify a few events from it. I am on Ubuntu 17.04 with default settings for output to /var/log/auth.log.
I am attempting to quantify a failed login event. However, when I inspect the log file, it seems to me that a failed login event is logged multiple times. Is it safe to infer that all the lines below correspond to one failed login attempt as the call goes through the different layers of the system, or is each line below a separate failed login attempt?
Lines that I am inclined to attribute to one failed login attempt:
Jun 21 20:05:33 node1 sshd[24969]: Failed password for invalid user root from 221.194.47.252 port 43974 ssh2
Jun 21 20:05:38 node1 sshd[24969]: message repeated 2 times: [ Failed password for invalid user root from 221.194.47.252 port 43974 ssh2]
Jun 21 20:05:38 node1 sshd[24969]: Received disconnect from 221.194.47.252 port 43974:11: [preauth]
Jun 21 20:05:38 node1 sshd[24969]: Disconnected from 221.194.47.252 port 43974 [preauth]
Jun 21 20:05:38 node1 sshd[24969]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Jun 21 20:05:41 node1 sshd[24971]: User root from 221.194.47.252 not allowed because none of user's groups are listed in AllowGroups
Jun 21 20:05:41 node1 sshd[24971]: input_userauth_request: invalid user root [preauth]
Jun 21 20:05:42 node1 sshd[24971]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
More context:
Jun 21 20:05:33 node1 sshd[24969]: Failed password for invalid user root from 221.194.47.252 port 43974 ssh2
Jun 21 20:05:38 node1 sshd[24969]: message repeated 2 times: [ Failed password for invalid user root from 221.194.47.252 port 43974 ssh2]
Jun 21 20:05:38 node1 sshd[24969]: Received disconnect from 221.194.47.252 port 43974:11: [preauth]
Jun 21 20:05:38 node1 sshd[24969]: Disconnected from 221.194.47.252 port 43974 [preauth]
Jun 21 20:05:38 node1 sshd[24969]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Jun 21 20:05:41 node1 sshd[24971]: User root from 221.194.47.252 not allowed because none of user's groups are listed in AllowGroups
Jun 21 20:05:41 node1 sshd[24971]: input_userauth_request: invalid user root [preauth]
Jun 21 20:05:42 node1 sshd[24971]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Jun 21 20:05:44 node1 sshd[24971]: Failed password for invalid user root from 221.194.47.252 port 42071 ssh2
Jun 21 20:05:48 node1 sshd[24971]: message repeated 2 times: [ Failed password for invalid user root from 221.194.47.252 port 42071 ssh2]
Jun 21 20:05:49 node1 sshd[24971]: Received disconnect from 221.194.47.252 port 42071:11: [preauth]
Jun 21 20:05:49 node1 sshd[24971]: Disconnected from 221.194.47.252 port 42071 [preauth]
Jun 21 20:05:49 node1 sshd[24971]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Jun 21 20:05:51 node1 sshd[24976]: User root from 221.194.47.252 not allowed because none of user's groups are listed in AllowGroups
Jun 21 20:05:51 node1 sshd[24976]: input_userauth_request: invalid user root [preauth]
Jun 21 20:05:51 node1 sshd[24976]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Jun 21 20:05:54 node1 sshd[24976]: Failed password for invalid user root from 221.194.47.252 port 58648 ssh2
Jun 21 20:05:58 node1 sshd[24976]: message repeated 2 times: [ Failed password for invalid user root from 221.194.47.252 port 58648 ssh2]
Jun 21 20:05:59 node1 sshd[24976]: Received disconnect from 221.194.47.252 port 58648:11: [preauth]
Jun 21 20:05:59 node1 sshd[24976]: Disconnected from 221.194.47.252 port 58648 [preauth]
Jun 21 20:05:59 node1 sshd[24976]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Jun 21 20:06:02 node1 sshd[24980]: User root from 221.194.47.252 not allowed because none of user's groups are listed in AllowGroups
Jun 21 20:06:02 node1 sshd[24980]: input_userauth_request: invalid user root [preauth]
Jun 21 20:06:02 node1 sshd[24980]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.194.47.252 user=root
Should I go by the PID of the sshd process to determine one failed login attempt? I can't go by the port, since multiple failed login attempts can occur over one connection on the same port, and I am trying to be as granular as possible in counting failed login attempts for later analysis.
Any other ideas? My next step is to grep the sshd or PAM source to see what I can find.
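For what it's worth, grouping by the sshd PID looks like the most workable unit in this format: each PID corresponds to one connection, the password attempts within it are the "Failed password" line plus the count in the "message repeated N times" line, and the PAM and AllowGroups lines appear to refer to those same attempts rather than new ones. A rough shell sketch of that counting, based only on the log format shown above (the question uses Python, but the logic carries over):
awk '
  # one connection = one sshd PID; count "Failed password" lines per PID
  /sshd\[[0-9]+\]: Failed password/ {
      match($0, /sshd\[[0-9]+\]/)
      pid = substr($0, RSTART + 5, RLENGTH - 6)
      attempts[pid]++
  }
  # "message repeated N times: [ Failed password ..." hides N more attempts
  /sshd\[[0-9]+\]: message repeated [0-9]+ times: \[ Failed password/ {
      match($0, /sshd\[[0-9]+\]/)
      pid = substr($0, RSTART + 5, RLENGTH - 6)
      match($0, /repeated [0-9]+ times/)
      attempts[pid] += substr($0, RSTART + 9, RLENGTH - 15)
  }
  END {
      for (p in attempts) { total += attempts[p]; conns++ }
      printf "%d failed password attempts over %d connections\n", total, conns
  }
' /var/log/auth.log
Note that this counts only password failures; rejections such as the AllowGroups lines would need a separate rule if you want to treat them as attempts too.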

ArangoDB stops and won't restart after dev-xvdb times out

I have ArangoDB 3.1.16 installed on an AWS C4 instance. I have a Foxx service trying to run in production. It receives an average of 10 packets of 200 octets per second and returns a flow of 20 packets of 200 octets per second.
Each time I start running my process, the Foxx service runs with consistent performance for an hour and then suddenly stops. I no longer have access to my Foxx API: all requests get connection timeout errors and nothing is printed to the Foxx logs. I no longer have access to the web interface: the page just doesn't load.
After a minute or so, the Foxx logs show me an error message: 'ArangoError 18: lock timeout'.
After another minute the logs show me requests that are usually fast but took a very long time (WARNING {queries} slow query: took: 1770.862498).
Using "journalctl -xe", I learned that after a foreign IP tried to connect, I got: "Job dev-xvdb.device/start timed out".
I managed to restart Arango using:
ps -eaf |grep arangod
sudo kill #
sudo apt-get --reinstall install arangodb3=3.1.16
How can I solve this recurring issue?
"journalctl -xe" gives me:
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Failed with result 'exit-code'.
-- Subject: Unit arangodb3.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit arangodb3.service has begun starting up.
Apr 04 15:03:10 my-ip arangodb3[11481]: * Starting arango database server arangod
Apr 04 15:03:10 my-ip arangodb3[11481]: * database version check failed, maybe you need to run 'upgrade'?
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Control process exited, code=exited status=1
Apr 04 15:03:10 my-ip systemd[1]: Failed to start LSB: arangodb.
-- Subject: Unit arangodb3.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit arangodb3.service has failed.
--
-- The result is failed.
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Unit entered failed state.
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Failed with result 'exit-code'.
Apr 04 15:03:10 my-ip sudo[11346]: pam_unix(sudo:session): session closed for user root
Apr 04 15:03:17 my-ip sshd[11502]: Did not receive identification string from UNKNOWN IP 1
Apr 04 15:03:21 my-ip sshd[11503]: Connection closed by UNKNOWN IP 2 port 54736 [preauth]
Apr 04 15:03:21 my-ip sshd[11507]: Did not receive identification string from UNKNOWN IP 2
Apr 04 15:03:21 my-ip sshd[11506]: fatal: Unable to negotiate with UNKNOWN IP 2 port 54730: no matching host key type found. Their offer: ssh-dss [preauth]
Apr 04 15:03:21 my-ip sshd[11504]: Connection closed by UNKNOWN IP 2 port 54732 [preauth]
Apr 04 15:03:22 my-ip sshd[11505]: Connection closed by UNKNOWN IP 2 port 54734 [preauth]
Apr 04 15:03:40 my-ip systemd[1]: dev-xvdb.device: Job dev-xvdb.device/start timed out.
Apr 04 15:03:40 my-ip systemd[1]: Timed out waiting for device dev-xvdb.device.
-- Subject: Unit dev-xvdb.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit dev-xvdb.device has failed.
--
-- The result is timeout.
Apr 04 15:03:40 my-ip systemd[1]: Dependency failed for File System Check on /dev/xvdb.
-- Subject: Unit systemd-fsck@dev-xvdb.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit systemd-fsck@dev-xvdb.service has failed.
--
-- The result is dependency.
Apr 04 15:03:40 my-ip systemd[1]: Dependency failed for /mnt.
-- Subject: Unit mnt.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mnt.mount has failed.
--
-- The result is dependency.
Apr 04 15:03:40 my-ip systemd[1]: mnt.mount: Job mnt.mount/start failed with result 'dependency'.
Apr 04 15:03:40 my-ip systemd[1]: systemd-fsck@dev-xvdb.service: Job systemd-fsck@dev-xvdb.service/start failed with result 'dependency'.
Apr 04 15:03:40 my-ip systemd[1]: dev-xvdb.device: Job dev-xvdb.device/start failed with result 'timeout'.
I tried:
sudo curl --dump - -X GET http://127.0.0.1:8529/_api/version && echo
It gives me:
HTTP/1.1 401 Unauthorized
Www-Authenticate: Bearer token_type="JWT", realm="ArangoDB"
Server: ArangoDB
Connection: Keep-Alive
Content-Type: text/plain; charset=utf-8
Content-Length: 0
I tried:
ps auxw | fgrep arangod
It gives me:
root 10439 0.0 0.1 82772 8664 ? Ss 10:09 0:00 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
arangodb 10440 5.7 94.5 12901776 7242340 ? Sl 10:09 16:36 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
ubuntu 11339 0.0 0.0 12916 1000 pts/0 R+ 14:59 0:00 grep -F --color=auto arangod
arangod restart gives me:
2017-04-04T15:01:16Z [11344] INFO ArangoDB 3.1.16 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.2g 1 Mar 2016
2017-04-04T15:01:16Z [11344] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2017-04-04T15:01:16Z [11344] FATAL could not open shutdown file '/var/log/arangodb3/restart/SHUTDOWN': internal error
'service arangodb3 restart' gives me (after a short wait time):
Job for arangodb3.service failed because the control process exited with error code. See "systemctl status arangodb3.service" and "journalctl -xe" for details.
'systemctl status arangodb3.service' gives me:
arangodb3.service - LSB: arangodb
Loaded: loaded (/etc/init.d/arangodb3; bad; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2017-04-04 15:03:10 UTC; 34s ago
Docs: man:systemd-sysv-generator(8)
Process: 11352 ExecStop=/etc/init.d/arangodb3 stop (code=exited, status=0/SUCCESS)
Process: 11481 ExecStart=/etc/init.d/arangodb3 start (code=exited, status=1/FAILURE)
Tasks: 83
Memory: 6.5G
CPU: 73ms
CGroup: /system.slice/arangodb3.service
├─10439 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
└─10440 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
Apr 04 15:03:10 my-ip systemd[1]: Starting LSB: arangodb...
Apr 04 15:03:10 my-ip arangodb3[11481]: * Starting arango database server arangod
Apr 04 15:03:10 my-ip arangodb3[11481]: * database version check failed, maybe you need to run 'upgrade'?
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Control process exited, code=exited status=1
Apr 04 15:03:10 my-ip systemd[1]: Failed to start LSB: arangodb.
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Unit entered failed state.
From your log output it seems that the mounted disk volume goes away.
If the storage goes away from under any kind of database, there is no reasonable way to continue working.
Thus the effect you see is that ArangoDB isn't able to work with its data anymore - from its perspective the data simply isn't there anymore.
One effect observed by others is that I/O credits on AWS dry up, which could also be the reason for what you see above.
https://aws.amazon.com/blogs/aws/new-burst-balance-metric-for-ec2s-general-purpose-ssd-gp2-volumes/
If I understand that correctly, you get more credits if you choose a bigger volume size. If that doesn't help, you either need to scale down your test scenario or choose a different hosting approach that doesn't limit I/O operations.
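If you want to verify the burst-credit theory, the BurstBalance metric described in that post can be pulled per EBS volume with the AWS CLI, roughly like this (the volume ID and time window below are placeholders):
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2017-04-04T00:00:00Z --end-time 2017-04-04T23:59:59Z \
  --period 300 --statistics Average
A balance that drops toward zero around the time the Foxx service stalls would point at exhausted I/O credits.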
