GPFS : mmremote: Unable to determine the local node identity - linux

I had a 4-node GPFS cluster up and running, and things were fine until last week, when the server hosting these RHEL setups went down. After the server was brought back up and the RHEL nodes were restarted, one of the nodes' IPs had changed.
Since then I am not able to use that node.
Simple commands like 'mmlscluster' and 'mmgetstate' fail with this error:
[root@gpfs3 ~]# mmlscluster
mmlscluster: Unable to determine the local node identity.
mmlscluster: Command failed. Examine previous error messages to determine cause.
[root@gpfs3 ~]# mmstartup
mmstartup: Unable to determine the local node identity.
mmstartup: Command failed. Examine previous error messages to determine cause.
mmshutdown fails with a different error:
[root@gpfs3 ~]# mmshutdown
mmshutdown: Unexpected error from getLocalNodeData: Unknown environmentType. Return code: 1
The logs have this info:
Mon Feb 15 18:18:34 IST 2016: Node rebooted. Starting mmautoload...
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:18:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:19:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:20:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:21:35 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:22:35 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
mmautoload: The GPFS environment cannot be initialized.
mmautoload: Correct the problem and use mmstartup to start GPFS.
I tried changing the IP to the new one, but still got the same error:
[root@gpfs1 ~]# mmchnode -N gpfs3 --admin-interface=xx.xx.xx.xx
Mon Feb 15 20:00:05 IST 2016: mmchnode: Processing node gpfs3
mmremote: Unable to determine the local node identity.
mmremote: Command failed. Examine previous error messages to determine cause.
mmremote: Unable to determine the local node identity.
mmremote: Command failed. Examine previous error messages to determine cause.
mmchnode: Unexpected error from checkExistingClusterNode gpfs3. Return code: 0
mmchnode: Command failed. Examine previous error messages to determine cause.
Can someone please help me fix this issue?

The easiest fix is probably to remove the node from the cluster (mmdelnode) and then add it back in (mmaddnode). You might need to use mmdelnode -f.
If deleting and adding the node back in is not an option, try giving IBM support a call.
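If you go that route, the sequence would look roughly like this (a sketch only, assuming the broken node is gpfs3 and you run this from a healthy node such as gpfs1; check the mm* man pages before running anything destructive):
# remove the broken node; add -f if it cannot be contacted
mmdelnode -N gpfs3
# re-add it under its new IP/hostname, then start GPFS on it
mmaddnode -N gpfs3
mmstartup -N gpfs3
# verify the cluster sees the node again
mmlscluster
mmgetstate -a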

Related

Elasticsearch Enabling Remote Connection - Crashes AFTER Change

I just installed Filebeat, Logstash, Kibana and Elasticsearch, all running smoothly, to trial this product for additional monthly reports/monitoring, and noticed that every time I try to change the "/etc/elasticsearch/elasticsearch.yml" config file for remote web access, it basically crashes the service.
Just want to say I'm new to the forum and this product; my end goal for this question is to figure out how to allow remote connections to access Elasticsearch so I can guinea-pig and test without crashing it.
For reference, here is the error output when I run 'sudo systemctl status elasticsearch':
Dec 30 07:27:37 ubuntu systemd[1]: Starting Elasticsearch...
Dec 30 07:27:52 ubuntu systemd-entrypoint[4067]: ERROR: [1] bootstrap checks failed. You must address the points described in the following [1] lines before starting Elasticsearch.
Dec 30 07:27:52 ubuntu systemd-entrypoint[4067]: bootstrap check failure [1] of [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.se>
Dec 30 07:27:52 ubuntu systemd-entrypoint[4067]: ERROR: Elasticsearch did not exit normally - check the logs at /var/log/elasticsearch/elasticsearch.log
Dec 30 07:27:53 ubuntu systemd[1]: elasticsearch.service: Main process exited, code=exited, status=78/CONFIG
Dec 30 07:27:53 ubuntu systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Dec 30 07:27:53 ubuntu systemd[1]: Failed to start Elasticsearch.
Any help on this is greatly appreciated!
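In case it helps others landing here: the bootstrap check in that log fires because binding Elasticsearch to a non-loopback address switches it into production mode, which requires the discovery settings to be set explicitly. A minimal sketch of the relevant /etc/elasticsearch/elasticsearch.yml lines for a single test node (the 0.0.0.0 bind address is an assumption; restrict it to your LAN interface if possible):
network.host: 0.0.0.0        # allow remote connections
discovery.type: single-node  # satisfies the bootstrap check for a one-node setup
Then restart with sudo systemctl restart elasticsearch.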

Which process sends SIGKILL and terminates all SSH connections on/to my Namecheap Server?

I've been trying to troubleshoot this problem for some days now.
A couple of minutes after starting an SSH connection to my Namecheap server (on Mac/Windows/cPanel's "Terminal"), it crashes and gives the following error message:
Error: The connection to the server ended in failure at {TIME} PM. (SIGKILL)
and:
Exit Code: 137
I've tried to create some kind of log file for any SIGKILL signal, but it seems none can be made on a Namecheap server:
auditctl doesn't exist,
we can't get SystemTap because no package managers are available.
Details:
uname -a: Linux [-n] 2.6.32-954.3.5.lve1.4.78.el6.x86_64 #1 SMP Thu Mar 26 08:20:27 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I timed the interval between crashes: around 6 minutes.
I don't have very good knowledge of Linux servers and may not have included all the needed information, so please ask for any specifics!
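One thing worth knowing: SIGKILL cannot be caught or logged by the process that receives it, but a crude watchdog can snapshot the process table to a file every few seconds, so the last snapshot before a crash shows what was running. A sketch (file name and interval are arbitrary; nohup keeps the loop alive after the SSH session dies):
nohup sh -c 'while true; do
  date
  ps -eo pid,ppid,user,etime,comm
  sleep 5
done' >> "$HOME/ps_watch.log" 2>&1 &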

Slurm and Munge "Invalid Credential"

I'm installing Slurm for the first time. I've installed the 19.05.1-2 tarball and used the configurator to make a very simple two-node cluster. The control node is sdc; the compute nodes (running slurmd) are sdc and sdc1. Both were rebuilt with Ubuntu 18.04.
I can start the controller and the compute node sdc, and also successfully submit jobs with srun. That's great. However, when I start slurmd on the second node, sdc1, I get:
slurmd: error: Unable to register: Zero Bytes were transmitted or received
That quickly led me to my munge configuration. munged.log on the controller (sdc) shows "Invalid credential" every second. I triple-checked that munge.key on both hosts is identical. I verified that ntp is running too.
So by hand I ran munge -s foobar | unmunge on sdc1, and of course that worked locally. Then I saved the munged text from sdc1 to a file on sdc and tried unmunge. That did give me the "Invalid credential" error again.
Because of this I uninstalled and reinstalled munge on both systems, distributed the key, and repeated the test with the same result.
I guess I'm missing something simple. I don't know what else to do to properly install munge.
It was a UID/GID mismatch between the nodes. Of course it's mentioned in the installation guide.
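To check for that, compare the numeric IDs of the munge user on both hosts, and run munge's documented cross-host test (a sketch, assuming SSH access between sdc and sdc1):
# UID and GID of the munge user must match on every node
id munge              # run on sdc
ssh sdc1 id munge     # compare against sdc1
# encode a credential locally, decode it on the other host
munge -n | ssh sdc1 unmunge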
Did you remember to restart the munge daemon after copying the munge.key to /etc/munge? I got the same error doing:
1: install slurm:
$ apt install -y slurm-client
2: copy slurm.conf
(perhaps create /etc/slurm-llnl beforehand):
$ cp slurm.conf /etc/slurm-llnl
3: copy munge key to client
(munge.key copied before from the slurm server/slurmctld)
$ cp munge.key /etc/munge
and then I got all the invalid credential errors and problems reported here and in other reports, including the 'Zero Bytes' error on the client side:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with corresponding entries in the Slurm SERVER/slurmctld logs like:
[SERVER]$ tail /var/log/munge/munged.log
2022-12-30 22:57:23 +0100 Notice: Running on ..
2022-12-30 23:01:11 +0100 Info: Invalid credential ...
and
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
[2022-12-30T23:01:11.440] error: Munge decode failed: Invalid credential
[2022-12-30T23:01:11.440] ENCODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] DECODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: REQUEST_PARTITION_INFO has authentication error: Invalid authentication credential
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: Protocol authentication error
All of this is fixed by rebooting the client, as suggested by others here, or, slightly less intrusively, by just restarting the client munge daemon:
(CLIENT)$ sudo systemctl restart munge.service
and then munge on the client / unmunge on the server works. It also fixes my main problem of getting the client to see the Slurm server, which had been failing with the dreaded 'Zero Bytes' error:
[CLIENT]$ sinfo
slurm_load_partitions: Zero Bytes were transmitted or received
with server log entries
[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log
...
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Invalid Protocol Version 9472 from uid=-1 at XX.XX.XX.XX:44150
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2022-12-30T23:17:14.027] error: slurm_receive_msg [XX.XX.XX.XX:44150]: Unspecified error
And, after munge restart, voilà:
[CLIENT] $ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
LocalQ* up infinite 1 idle XXX
For the examples: SERVER Ubuntu 20.04, CLIENTS Ubuntu 20.04 (and 22.04, which seems to be incompatible with the SERVER's Slurm version, says the log).
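On the version point: before blaming munge, the 'Invalid Protocol Version' entries can be confirmed by comparing the Slurm versions on both ends (a quick sketch; both flags are standard Slurm CLI):
[CLIENT]$ sinfo --version
[SERVER]$ slurmctld -V
Slurm only tolerates a limited version skew between client and server, so if the versions differ by much, aligning the client packages with the server is the fix.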

Cask CDAP services started, but not running during installation

After going through the docs for installing CDAP on a MapR system (v6.0) and starting the CDAP services, I am finding that some CDAP services are not running after startup (https://docs.cask.co/cdap/current/en/admin-manual/installation/mapr.html#starting-cdap-services), despite the services' startup loop not showing any errors. The output after starting the services and checking their status is shown below:
[root@mapr007 conf]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:01 HST 2018 Starting CDAP Auth Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:04 HST 2018 Starting CDAP Kafka Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:07 HST 2018 Starting CDAP Master service on mapr007.org.local
Warning: Unable to determine $DRILL_HOME
Wed Nov 21 16:03:48 HST 2018 Ensuring required HBase coprocessors are on HDFS
Wed Nov 21 16:04:00 HST 2018 Running CDAP Master startup checks -- this may take a few minutes
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:04:15 HST 2018 Starting CDAP Router service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:04:17 HST 2018 Starting CDAP UI service on mapr007.org.local
[root@mapr007 conf]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i status ; done
/usr/bin/id: cannot find name for group ID 504
PID file /var/cdap/run/auth-server-cdap.pid exists, but process 12126 does not appear to be running
/usr/bin/id: cannot find name for group ID 504
CDAP Kafka Server running as PID 12653
/usr/bin/id: cannot find name for group ID 504
PID file /var/cdap/run/master-cdap.pid exists, but process 15789 does not appear to be running
/usr/bin/id: cannot find name for group ID 504
CDAP Router running as PID 16184
/usr/bin/id: cannot find name for group ID 504
CDAP UI running as PID 16308
Note that while there is an "Unable to determine $DRILL_HOME" warning, I don't think this should be a big problem, since I have set the explore.enabled value in cdap-site.xml to false.
Looking at cdap-site.xml, the web UI port does appear to be set to the default 11011, and yet I can't reach it (if only to check whether the UI would tell me more about any errors), despite the fact that it reports as running.
Checking some info about the PIDs, I see:
# looking at the processes that report as not running
[root@mapr007 conf.dist]# ps -p 12126
PID TTY TIME CMD
[root@mapr007 conf.dist]# ps -p 15789
PID TTY TIME CMD
# looking at the rest of the processes
[root@mapr007 conf.dist]# ps -p 12653
PID TTY TIME CMD
12653 ? 00:08:12 java
[root@mapr007 conf.dist]# ps -p 16184
PID TTY TIME CMD
16184 ? 00:03:02 java
[root@mapr007 conf.dist]# ps -p 16308
PID TTY TIME CMD
16308 ? 00:00:01 node
I also checked whether the default security.auth.server.bind.port was being used by some other service:
[root@mapr007 conf.dist]# netstat -anp | grep 10009
but nothing was detected.
Not sure where to start debugging from here, so any suggestions or information would be appreciated.
UPDATE
Restarting the services to try to get more logging data, I now see some errors (better than it just not complaining and then not working, I guess):
[root@mapr007 conf.dist]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i stop ; done
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:29 HST 2018 Stopping CDAP Auth Server ...
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:29 HST 2018 Stopping CDAP Kafka Server ....
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:30 HST 2018 Stopping CDAP Master ...
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:31 HST 2018 Stopping CDAP Router ....
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:32 HST 2018 Stopping CDAP UI ....
[root@mapr007 conf.dist]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:41 HST 2018 Starting CDAP Auth Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:44 HST 2018 Starting CDAP Kafka Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:47 HST 2018 Starting CDAP Master service on mapr007.org.local
Warning: Unable to determine $DRILL_HOME
Mon Nov 26 11:07:17 HST 2018 Ensuring required HBase coprocessors are on HDFS
Mon Nov 26 11:08:57 HST 2018 Running CDAP Master startup checks -- this may take a few minutes
[ERROR] Master startup checks failed. Please check /var/log/cdap/master-cdap-mapr007.org.local.log to address issues.
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:10:08 HST 2018 Starting CDAP Router service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:10:11 HST 2018 Starting CDAP UI service on mapr007.org.local
Checking the content of the /var/log/cdap/master-cdap-mapr007.org.local.log file, at the bottom I can see:
...
...
...
2018-11-26 11:10:06,996 - ERROR [main:c.c.c.m.s.MasterStartupTool#109] - YarnCheck failed with RuntimeException: Unable to get status of YARN nodemanagers. Please check that YARN is running and that the correct Hadoop configuration (core-site.xml, yarn-site.xml) and libraries are included in the CDAP master classpath.
java.lang.RuntimeException: Unable to get status of YARN nodemanagers. Please check that YARN is running and that the correct Hadoop configuration (core-site.xml, yarn-site.xml) and libraries are included in the CDAP master classpath.
at co.cask.cdap.master.startup.YarnCheck.run(YarnCheck.java:79) ~[co.cask.cdap.cdap-master-5.1.0.jar:na]
at co.cask.cdap.common.startup.CheckRunner.runChecks(CheckRunner.java:51) ~[co.cask.cdap.cdap-common-5.1.0.jar:na]
at co.cask.cdap.master.startup.MasterStartupTool.canStartMaster(MasterStartupTool.java:106) [co.cask.cdap.cdap-master-5.1.0.jar:na]
at co.cask.cdap.master.startup.MasterStartupTool.main(MasterStartupTool.java:96) [co.cask.cdap.cdap-master-5.1.0.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[na:1.8.0_181]
at co.cask.cdap.master.startup.YarnCheck.run(YarnCheck.java:76) ~[co.cask.cdap.cdap-master-5.1.0.jar:na]
... 3 common frames omitted
2018-11-26 11:10:07,006 - ERROR [main:c.c.c.m.s.MasterStartupTool#113] - Root cause: TimeoutException:
2018-11-26 11:10:07,006 - ERROR [main:c.c.c.m.s.MasterStartupTool#116] - Errors detected while starting up master. Please check the logs, address all errors, then try again.
Following the "CDAP services on Distributed CDAP aren't starting up due to an exception. What should I do?" FAQ in the docs did not seem to help (https://docs.cask.co/cdap/current/en/faqs/cdap.html#cdap-services-on-distributed-cdap-aren-t-starting-up-due-to-an-exception-what-should-i-do).
Will continue debugging, but would appreciate any opinion on these new errors.
Restarting the ResourceManager and NodeManager services on the cluster seems to have resolved this error. This was done mostly on a guess by another dev, based only on the fact that the error was related to CDAP being unable to connect to YARN despite the cluster's RM and NM services running fine.
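For anyone hitting the same thing, the state the startup check depends on can be confirmed directly from the CDAP master host with the standard YARN CLI (a quick sketch):
# should list the nodemanagers and their states (RUNNING, etc.)
yarn node -list -all
# confirms the ResourceManager itself is reachable
yarn application -list
If these hang or time out, the YarnCheck timeout above is expected, and restarting the RM/NM services (as done here) is a reasonable next step.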
Furthermore, the CDAP installation docs for enabling Kerberos (https://docs.cask.co/cdap/current/en/admin-manual/installation/mapr.html#enabling-kerberos) specify using a special keyword, _HOST, e.g.
<property>
<name>cdap.master.kerberos.keytab</name>
<value>/etc/security/keytabs/cdap.service.keytab</value>
</property>
<property>
<name>cdap.master.kerberos.principal</name>
<value><cdap-principal>/_HOST@EXAMPLE.COM</value>
</property>
where _HOST is not just a doc placeholder, but a special keyword that is supposed to be filled in automatically (e.g. see https://mapr.com/docs/60/Hive/Config-HiveMetastoreForKerberos.html and https://mapr.com/docs/60/SecurityGuide/Config-YARN-Kerberos.html).
Apparently, for MapR client nodes (i.e. non-control or non-data nodes: nodes simply running the MapR client package to interact with the cluster), this does not work, and the Kerberos principal's server host name must be given explicitly (pretty sure the docs exist, but I can't find them at this time). This was discovered when further examining the logs and seeing that the CDAP services were trying to connect to _HOST@us.org instead of, say, the.actual.domain@us.org.
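In other words, on such a client node the principal property ends up written out with the real host instead of _HOST, roughly like this (host and realm below are illustrative assumptions following the example above):
<property>
<name>cdap.master.kerberos.principal</name>
<value><cdap-principal>/the.actual.domain@us.org</value>
</property>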

Elasticsearch connection error in Ubuntu 16.04

On my Ubuntu machine, when I run the command curl -X GET 'http://localhost:9200' to test the connection, it shows the following message:
curl: (7) Failed to connect to localhost port 9200: Connection refused
When I check the server status with sudo systemctl status elasticsearch, it shows the following:
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2016-11-20 16:32:30 BDT; 44s ago
Docs: http://www.elastic.co
Process: 8653 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefa
Process: 8649 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 8653 (code=exited, status=1/FAILURE)
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,579 main ERROR Null object returned for RollingFile in Appenders.
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,579 main ERROR Null object returned for RollingFile in Appenders.
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,580 main ERROR Unable to locate appender "rolling" for logger config "root"
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,580 main ERROR Unable to locate appender "index_indexing_slowlog_rolling" for logge
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,581 main ERROR Unable to locate appender "index_search_slowlog_rolling" for logger
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,581 main ERROR Unable to locate appender "deprecation_rolling" for logger config "o
Nov 20 16:32:29 bahar elasticsearch[8653]: [2016-11-20T16:32:25,592][WARN ][o.e.c.l.LogConfigurator ] ignoring unsupported logging configuration
Nov 20 16:32:30 bahar systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Nov 20 16:32:30 bahar systemd[1]: elasticsearch.service: Unit entered failed state.
Nov 20 16:32:30 bahar systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
This error comes from the path and log settings in elasticsearch.yml (/etc/elasticsearch/elasticsearch.yml).
Comment out those path settings and the error should go away.
That means Elasticsearch is not running. And from what I see, there is a problem with starting it. Check your Elasticsearch configuration.
To check whether Elasticsearch is running, run the following command:
$ ps aux | grep elasticsearch
If Elasticsearch has not started, check your Java environment, then download a new Elasticsearch package and install it again:
1. Check that Java is correctly installed:
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
If your Java version is lower than 1.7, install a newer one.
2. Download the Elasticsearch install package and unpack it:
$ tar -zxvf elasticsearch-2.3.3.tar.gz
3. Run Elasticsearch:
$ cd elasticsearch-2.3.3
$ ./bin/elasticsearch
Usually it's a write-permission issue on the log directory (default /var/log/elasticsearch); use ls -l to check the permissions, and change the mode to 777 for the log directory and files if necessary.
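For illustration, the check and fix might look like this (a sketch; chmod 777 is the blunt fix from above, while chown to the elasticsearch user is a narrower alternative, assuming the stock package layout):
$ ls -l /var/log/elasticsearch
$ sudo chmod -R 777 /var/log/elasticsearch
# or, less permissive:
$ sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch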
Long story short: a system reboot might get it OK.
It has been a while since the question was asked. Anyway, I ran into a similar problem recently.
The Elasticsearch service on one of my nodes died, with errors similar to those posted in the question when restarting the service. It said the log folder to write to was a read-only file system. But those files and directories are indeed owned by user elasticsearch (version 5.5, deployed on CentOS 6.5), so there should not have been a read-only problem.
I checked and didn't find a clue, so I just rebooted the system. After rebooting, everything went all right without any further tuning: the elasticsearch service started on boot as configured, it found the cluster and all the other nodes, and the cluster health status turned green after a little while.
I guess the root cause might have been some hardware failure in my case. All data and logs managed by the Elasticsearch cluster are stored on a 2TB SSD drive mounted on each node, and our hardware team had just recovered from an external storage failure; all the nodes restarted during that recovery. Chances are some lagging issues caused the problem.
