Cask CDAP services started, but not running during installation - mapr

After going through the docs for installing CDAP on a MapR system (v6.0) and starting the CDAP services (https://docs.cask.co/cdap/current/en/admin-manual/installation/mapr.html#starting-cdap-services), I am finding that some CDAP services are not running after startup, despite the services' startup loop not showing any errors. The output after starting the services and checking their status is shown below:
[root@mapr007 conf]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:01 HST 2018 Starting CDAP Auth Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:04 HST 2018 Starting CDAP Kafka Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:07 HST 2018 Starting CDAP Master service on mapr007.org.local
Warning: Unable to determine $DRILL_HOME
Wed Nov 21 16:03:48 HST 2018 Ensuring required HBase coprocessors are on HDFS
Wed Nov 21 16:04:00 HST 2018 Running CDAP Master startup checks -- this may take a few minutes
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:04:15 HST 2018 Starting CDAP Router service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:04:17 HST 2018 Starting CDAP UI service on mapr007.org.local
[root@mapr007 conf]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i status ; done
/usr/bin/id: cannot find name for group ID 504
PID file /var/cdap/run/auth-server-cdap.pid exists, but process 12126 does not appear to be running
/usr/bin/id: cannot find name for group ID 504
CDAP Kafka Server running as PID 12653
/usr/bin/id: cannot find name for group ID 504
PID file /var/cdap/run/master-cdap.pid exists, but process 15789 does not appear to be running
/usr/bin/id: cannot find name for group ID 504
CDAP Router running as PID 16184
/usr/bin/id: cannot find name for group ID 504
CDAP UI running as PID 16308
Note that while there is an "Unable to determine $DRILL_HOME" warning, I don't think this should be a big problem, since I have added explore.enabled to cdap-site.xml and set it to false.
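For reference, this is roughly what the added cdap-site.xml entry looks like:
<property>
  <name>explore.enabled</name>
  <value>false</value>
</property>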
Looking at cdap-site.xml, the web UI port does appear to be set to the default 11011, and yet I can't reach the UI (if only to check whether it would tell me more about any errors), despite the fact that it reports as running.
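A quick way to confirm whether anything is actually listening on that port (a minimal sketch; the hostname is taken from the output above):
curl -v http://mapr007.org.local:11011/
netstat -anp | grep 11011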
Checking some info about the PIDs, I see:
# looking at the processes that report as not running
[root@mapr007 conf.dist]# ps -p 12126
PID TTY TIME CMD
[root@mapr007 conf.dist]# ps -p 15789
PID TTY TIME CMD
# looking at the rest of the processes
[root@mapr007 conf.dist]# ps -p 12653
PID TTY TIME CMD
12653 ? 00:08:12 java
[root@mapr007 conf.dist]# ps -p 16184
PID TTY TIME CMD
16184 ? 00:03:02 java
[root@mapr007 conf.dist]# ps -p 16308
PID TTY TIME CMD
16308 ? 00:00:01 node
I also checked whether the default security.auth.server.bind.port (10009) was being used by some other service:
[root@mapr007 conf.dist]# netstat -anp | grep 10009
but nothing was detected.
I'm not sure where to start debugging from here, so any suggestions or information would be appreciated.
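The next place to look is probably the services' individual logs (a minimal sketch, assuming the default /var/log/cdap location and the <service>-cdap-<host>.log naming pattern that shows up in the update below):
ls -l /var/log/cdap/
tail -n 100 /var/log/cdap/auth-server-cdap-mapr007.org.local.log
tail -n 100 /var/log/cdap/master-cdap-mapr007.org.local.log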
UPDATE
Restarting the services to try to get more logging data, I am now seeing some errors (better than the services silently failing, I guess):
[root@mapr007 conf.dist]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i stop ; done
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:29 HST 2018 Stopping CDAP Auth Server ...
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:29 HST 2018 Stopping CDAP Kafka Server ....
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:30 HST 2018 Stopping CDAP Master ...
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:31 HST 2018 Stopping CDAP Router ....
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:32 HST 2018 Stopping CDAP UI ....
[root@mapr007 conf.dist]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:41 HST 2018 Starting CDAP Auth Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:44 HST 2018 Starting CDAP Kafka Server service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:47 HST 2018 Starting CDAP Master service on mapr007.org.local
Warning: Unable to determine $DRILL_HOME
Mon Nov 26 11:07:17 HST 2018 Ensuring required HBase coprocessors are on HDFS
Mon Nov 26 11:08:57 HST 2018 Running CDAP Master startup checks -- this may take a few minutes
[ERROR] Master startup checks failed. Please check /var/log/cdap/master-cdap-mapr007.org.local.log to address issues.
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:10:08 HST 2018 Starting CDAP Router service on mapr007.org.local
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:10:11 HST 2018 Starting CDAP UI service on mapr007.org.local
Checking the content of the /var/log/cdap/master-cdap-mapr007.org.local.log file, at the bottom I can see:
...
...
...
2018-11-26 11:10:06,996 - ERROR [main:c.c.c.m.s.MasterStartupTool#109] - YarnCheck failed with RuntimeException: Unable to get status of YARN nodemanagers. Please check that YARN is running and that the correct Hadoop configuration (core-site.xml, yarn-site.xml) and libraries are included in the CDAP master classpath.
java.lang.RuntimeException: Unable to get status of YARN nodemanagers. Please check that YARN is running and that the correct Hadoop configuration (core-site.xml, yarn-site.xml) and libraries are included in the CDAP master classpath.
at co.cask.cdap.master.startup.YarnCheck.run(YarnCheck.java:79) ~[co.cask.cdap.cdap-master-5.1.0.jar:na]
at co.cask.cdap.common.startup.CheckRunner.runChecks(CheckRunner.java:51) ~[co.cask.cdap.cdap-common-5.1.0.jar:na]
at co.cask.cdap.master.startup.MasterStartupTool.canStartMaster(MasterStartupTool.java:106) [co.cask.cdap.cdap-master-5.1.0.jar:na]
at co.cask.cdap.master.startup.MasterStartupTool.main(MasterStartupTool.java:96) [co.cask.cdap.cdap-master-5.1.0.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[na:1.8.0_181]
at co.cask.cdap.master.startup.YarnCheck.run(YarnCheck.java:76) ~[co.cask.cdap.cdap-master-5.1.0.jar:na]
... 3 common frames omitted
2018-11-26 11:10:07,006 - ERROR [main:c.c.c.m.s.MasterStartupTool#113] - Root cause: TimeoutException:
2018-11-26 11:10:07,006 - ERROR [main:c.c.c.m.s.MasterStartupTool#116] - Errors detected while starting up master. Please check the logs, address all errors, then try again.
Following the "CDAP services on Distributed CDAP aren't starting up due to an exception. What should I do?" FAQ in the docs did not seem to help (https://docs.cask.co/cdap/current/en/faqs/cdap.html#cdap-services-on-distributed-cdap-aren-t-starting-up-due-to-an-exception-what-should-i-do).
I will continue debugging, but would appreciate any opinions on these new errors.
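In the meantime, the YarnCheck failure can be sanity-checked from the CDAP node itself; a minimal sketch (the Hadoop conf path is an assumption for MapR 6.0, adjust as needed):
# confirm the ResourceManager can report its NodeManagers from this node
yarn node -list -all
# confirm the Hadoop configuration visible to CDAP points at the right ResourceManager
grep -A1 'yarn.resourcemanager' /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/yarn-site.xml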

Restarting the ResourceManager and NodeManager services on the cluster seems to have resolved this error. This was done mostly on a guess by another dev, based only on the fact that the error related to CDAP being unable to connect to YARN even though the cluster's RM and NM services were running fine.
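For anyone hitting the same thing, restarting those services on a MapR-managed cluster can be done along these lines (a sketch; the service names and node list are assumptions, adjust to your cluster):
maprcli node services -name resourcemanager -action restart -nodes mapr007.org.local
maprcli node services -name nodemanager -action restart -nodes mapr007.org.local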
Furthermore, the CDAP installation docs for enabling Kerberos (https://docs.cask.co/cdap/current/en/admin-manual/installation/mapr.html#enabling-kerberos) specify using a special keyword _HOST, e.g.:
<property>
  <name>cdap.master.kerberos.keytab</name>
  <value>/etc/security/keytabs/cdap.service.keytab</value>
</property>
<property>
  <name>cdap.master.kerberos.principal</name>
  <value><cdap-principal>/_HOST@EXAMPLE.COM</value>
</property>
where _HOST is not just a doc placeholder, but a special keyword that is supposed to be filled in automatically (e.g., see https://mapr.com/docs/60/Hive/Config-HiveMetastoreForKerberos.html and https://mapr.com/docs/60/SecurityGuide/Config-YARN-Kerberos.html).
Apparently, for MapR client nodes (i.e., nodes that are not control or data nodes, but simply run the MapR client package to interact with the cluster), this does not work and the Kerberos principal's server host name must be given explicitly (pretty sure docs for this exist, but I can't find them at this time). This was discovered when further examining the logs and seeing that the CDAP services were trying to connect to _HOST@us.org instead of, say, the.actual.domain@us.org.
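In other words, on such client nodes the principal ends up looking roughly like this (hypothetical values; substitute your own principal name, host, and realm):
<property>
  <name>cdap.master.kerberos.principal</name>
  <value>cdap/mapr007.org.local@EXAMPLE.COM</value>
</property>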

Related

Elasticsearch Enabling Remote Connection - Crashes AFTER Change

I just installed Filebeat, Logstash, Kibana, and Elasticsearch, all running smoothly, to trial this product for additional monthly reports/monitoring, and noticed that every time I try to change the /etc/elasticsearch/elasticsearch.yml config file for remote web access, the change basically crashes the service.
Just want to say I'm new to the forum and this product; my end goal for this question is to figure out how to allow remote connections to Elasticsearch so I can guinea-pig and test without crashing Elasticsearch.
For reference, here is the error output when I run sudo systemctl status elasticsearch:
Dec 30 07:27:37 ubuntu systemd[1]: Starting Elasticsearch...
Dec 30 07:27:52 ubuntu systemd-entrypoint[4067]: ERROR: [1] bootstrap checks failed. You must address the points described in the following [1] lines before starting Elasticsearch.
Dec 30 07:27:52 ubuntu systemd-entrypoint[4067]: bootstrap check failure [1] of [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.se>
Dec 30 07:27:52 ubuntu systemd-entrypoint[4067]: ERROR: Elasticsearch did not exit normally - check the logs at /var/log/elasticsearch/elasticsearch.log
Dec 30 07:27:53 ubuntu systemd[1]: elasticsearch.service: Main process exited, code=exited, status=78/CONFIG
Dec 30 07:27:53 ubuntu systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Dec 30 07:27:53 ubuntu systemd[1]: Failed to start Elasticsearch.
Any help on this is greatly appreciated!
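For what it's worth, the bootstrap check quoted above names the settings it wants; a minimal sketch of /etc/elasticsearch/elasticsearch.yml for a single-node trial (assuming Elasticsearch 7.x, and that binding to all interfaces is acceptable on your network):
network.host: 0.0.0.0
# suppresses the discovery bootstrap check for a one-node setup
discovery.type: single-node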

"failed to execute command: permission denied" Ubuntu 18.04.3 LTS

Trying to set up a game server for Ark on an old HP ProLiant running Ubuntu (version 18.04.3 LTS, 64-bit). Specs are 72GB RAM, Intel Xeon X5650 @ 2.67 GHz x2. I'm learning Ubuntu along the way, so I barely know what I'm doing and realize I could just be making some silly error... but I'm totally lost. I managed to get a lot done thanks to Google, but even Google can't seem to help me anymore.
I've been using multiple guides to help me set it up.
https://ark.gamepedia.com/Dedicated_Server_Setup#Linux_.28via_systemd.29
http://arksurvivalevolved.gamewalkthrough-universe.com/dedicatedservers/linux/Default.aspx
https://survivetheark.com/index.php?/forums/topic/87419-guide-cluster-setup/
I've gone over every step in those guides multiple times and at least managed to get to this point where I'm stuck at this "permission denied" error.
I've tried every solution presented under this Google search: https://www.google.com/search?q=linux+%22failed+to+execute+command%3A+permission+denied%22
Additionally, I've tried executing the command to start the server with and without "sudo".
My guess is that the file it's trying to access is not permissible for some reason, but I can't seem to find a working solution for me.
[Unit]
Description=ARK: Survival Evolved dedicated server
Wants=network-online.target
After=syslog.target network.target nss-lookup.target network-online.target
[Service]
ExecStartPre=/home/kinare/steamcmd +login anonymous +force_install_dir /home/kinare/ark +app_update 376030
ExecStart=/home/kinare/ark/ShooterGame/Binaries/Linux/ShooterGameServer.exe Ragnarok?SessionName="Togerland - PVE Ragnarok"?AltSaveDirectoryName=RagSave?Port=7777?QueryPort=27015 -NoTransferFromFiltering -exclusivejoin -clusterid=Togerland
ShooterGameServer.exe Aberration_P?SessionName="Togerland - PVE Aberration"?AltSaveDirectoryName=AbSave?Port=7779?QueryPort=27017 -NoTransferFromFiltering -exclusivejoin -clusterid=Togerland
WorkingDirectory=/home/kinare/ark/ShooterGame/Binaries/Linux
LimitNOFILE=500000
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s INT $MAINPID
User=steam
Group=steam
[Install]
WantedBy=multi-user.target
Only including 2 of 6 maps that are within the cluster there to save space, hopefully that's enough.
The expected result is that it doesn't fail to start... Error message:
ark-dedicated.service - ARK: Survival Evolved dedicated server
Loaded: loaded (/etc/systemd/system/ark-dedicated.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-10-18 15:35:19 EDT; 56s ago
Process: 6383 ExecStartPre=/home/kinare/steamcmd +login anonymous +force_install_dir /home/kinare/ark +app_update 376030 (code=exited, status=203/EXEC)
Oct 18 15:35:19 togerland-server systemd[1]: Starting ARK: Survival Evolved dedicated server...
Oct 18 15:35:19 togerland-server systemd[6383]: ark-dedicated.service: Failed to execute command: Permission denied
Oct 18 15:35:19 togerland-server systemd[6383]: ark-dedicated.service: Failed at step EXEC spawning /home/kinare/steamcmd: Permission denied
Oct 18 15:35:19 togerland-server systemd[1]: ark-dedicated.service: Control process exited, code=exited status=203
Oct 18 15:35:19 togerland-server systemd[1]: ark-dedicated.service: Failed with result 'exit-code'.
Oct 18 15:35:19 togerland-server systemd[1]: Failed to start ARK: Survival Evolved dedicated server.
Your systemd service uses the user and group steam:
...
User=steam
Group=steam
...
You are starting your ARK server from kinare's home directory:
ExecStart=/home/kinare/ark/ShooterGame/Binaries...
and your system log says 'Permission denied':
Oct 18 15:35:19 togerland-server systemd[6383]: ark-dedicated.service: Failed to execute command: Permission denied
Does the steam user have permission to read files in /home/kinare?
You can solve this in a few ways:
Give the steam user permission to read from /home/kinare:
# change the group of all files and dirs in /home/kinare to steam
chgrp -R steam /home/kinare
# give the group read rights on all files and dirs /home/kinare
chmod -R g+r /home/kinare
# allow the group to open folders under /home/kinare
find /home/kinare -type d -exec chmod 750 {} \;
Use a service account:
Move your ARK and Steam installs to the steam user's home (/home/steam) and change your unit file as needed (a rough sketch follows below). Keep in mind that you need to change the permissions of the files in /home/steam. This is the preferred option, since you use a service account instead of your admin user kinare.
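A rough sketch of what the relevant lines might look like after the move (illustrative only; the paths mirror the original unit with kinare swapped for steam):
[Service]
ExecStartPre=/home/steam/steamcmd +login anonymous +force_install_dir /home/steam/ark +app_update 376030
ExecStart=/home/steam/ark/ShooterGame/Binaries/Linux/ShooterGameServer.exe Ragnarok?SessionName="Togerland - PVE Ragnarok"?AltSaveDirectoryName=RagSave?Port=7777?QueryPort=27015 -NoTransferFromFiltering -exclusivejoin -clusterid=Togerland
WorkingDirectory=/home/steam/ark/ShooterGame/Binaries/Linux
User=steam
Group=steam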
Change the user and group used in your systemd service file:
User=kinare
Group=kinare
ARK will now run as the user kinare. This is less preferred; see:
https://unix.stackexchange.com/questions/314725/what-is-the-difference-between-user-and-service-account
Hope this helps, good luck.

Elasticsearch connection error in Ubuntu 16.04

On my Ubuntu machine, when I run the command curl -X GET 'http://localhost:9200' to test the connection, it shows the following message:
curl: (7) Failed to connect to localhost port 9200: Connection refused
When I check the server status with sudo systemctl status elasticsearch, it shows the following:
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2016-11-20 16:32:30 BDT; 44s ago
Docs: http://www.elastic.co
Process: 8653 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefa
Process: 8649 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 8653 (code=exited, status=1/FAILURE)
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,579 main ERROR Null object returned for RollingFile in Appenders.
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,579 main ERROR Null object returned for RollingFile in Appenders.
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,580 main ERROR Unable to locate appender "rolling" for logger config "root"
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,580 main ERROR Unable to locate appender "index_indexing_slowlog_rolling" for logge
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,581 main ERROR Unable to locate appender "index_search_slowlog_rolling" for logger
Nov 20 16:32:29 bahar elasticsearch[8653]: 2016-11-20 16:32:25,581 main ERROR Unable to locate appender "deprecation_rolling" for logger config "o
Nov 20 16:32:29 bahar elasticsearch[8653]: [2016-11-20T16:32:25,592][WARN ][o.e.c.l.LogConfigurator ] ignoring unsupported logging configuration
Nov 20 16:32:30 bahar systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Nov 20 16:32:30 bahar systemd[1]: elasticsearch.service: Unit entered failed state.
Nov 20 16:32:30 bahar systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
This error comes from the path and log settings in elasticsearch.yml (/etc/elasticsearch/elasticsearch.yml). Comment out those path entries and the error will be removed.
That means Elasticsearch is not running, and from what I see, there is a problem with starting it. Check your Elasticsearch configuration.
To check if Elasticsearch is running, run the following command:
$ ps aux|grep elasticsearch
If Elasticsearch has not started, check your Java environment, then download a new Elasticsearch and install it again:
1. Check if Java is correctly installed:
$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
If your Java version is lower than 1.7, switch to a newer one.
2. Download the Elasticsearch install package and unzip it:
$ tar -zxvf elasticsearch-2.3.3.gz
3. Run Elasticsearch:
$ cd elasticsearch-2.3.3
$ ./bin/elasticsearch
Usually it's a write-permission issue with the log directory (default /var/log/elasticsearch); use ls -l to check the permissions, and change the mode to 777 for the log directory and files if necessary.
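Before going all the way to 777, it may be worth checking who actually owns the directory; handing ownership back to the elasticsearch user is a narrower fix (a sketch, assuming the default package user and group names):
ls -ld /var/log/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch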
Long story short: a system reboot might get it OK.
It has been a while since the question was asked. Anyway, I ran into a similar problem recently.
The elasticsearch service on one of my nodes died, and restarting the service produced errors similar to those posted in the question. It said the log folder it needed to write to was on a read-only file system. But those files and directories are indeed owned by the user elasticsearch (version 5.5, deployed on CentOS 6.5), so there should not have been a read-only problem.
I checked and didn't find a clue, so I just rebooted the system. After rebooting, everything went fine without any further tuning: the elasticsearch service started on boot as configured, it found the cluster and all the other nodes, and the cluster health status turned green after a little while.
I guess the root cause might have been some hardware failure in my case. All data and logs managed by the elasticsearch cluster are stored on a 2 TB SSD drive mounted on each node, and our hardware team had just recovered from an external storage failure; all the nodes restarted during that recovery. Chances are some lingering issues from that caused the problem.

CentOS 7 - boot order needs to be changed in order for sge to start automatically

It seems like SGE tries to start before Lustre is mounted when the server boots, which causes an error and prevents SGE from starting automatically on reboot.
Can somebody tell me how to change the order at boot, so SGE starts after Lustre is mounted?
Error message from the log:
Aug 12 11:46:21 dragen1 systemd: Configuration file /usr/lib/systemd/system/sge_execd.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Aug 12 11:46:40 dragen1 sge_execd: error: SGE_ROOT directory "/cm/shared/apps/sge/2011.11p1" doesn't exist
Aug 12 11:46:40 dragen1 systemd: sge_execd.service: control process exited, code=exited status=1
Aug 12 11:46:40 dragen1 systemd: Unit sge_execd.service entered failed state.
Aug 12 11:46:40 dragen1 systemd: sge_execd.service failed
I added the following under [Unit] in the sge service file:
RequiresMountsFor=(Mount Point)
This fixed the problem.
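For reference, a sketch of what that looks like as a systemd drop-in (the mount point is an assumption based on the SGE_ROOT path in the log above):
# /etc/systemd/system/sge_execd.service.d/override.conf
[Unit]
RequiresMountsFor=/cm/shared
Run systemctl daemon-reload afterwards so systemd picks up the change.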

GPFS : mmremote: Unable to determine the local node identity

I had a 4-node GPFS cluster up and running, and things were fine till last week, when the server hosting these RHEL setups went down. After the server was brought back up and the RHEL nodes were started again, one of the nodes' IPs got changed.
After that I am not able to use that node; simple commands like mmlscluster and mmgetstate fail with this error:
[root@gpfs3 ~]# mmlscluster
mmlscluster: Unable to determine the local node identity.
mmlscluster: Command failed. Examine previous error messages to determine cause.
[root@gpfs3 ~]# mmstartup
mmstartup: Unable to determine the local node identity.
mmstartup: Command failed. Examine previous error messages to determine cause.
mmshutdown fails with a different error:
[root@gpfs3 ~]# mmshutdown
mmshutdown: Unexpected error from getLocalNodeData: Unknown environmentType. Return code: 1
The logs have this info:
Mon Feb 15 18:18:34 IST 2016: Node rebooted. Starting mmautoload...
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:18:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:19:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:20:34 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:21:35 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
Mon Feb 15 18:22:35 IST 2016 mmautoload: GPFS is waiting for daemon network
mmautoload: Unable to determine the local node identity.
mmautoload: The GPFS environment cannot be initialized.
mmautoload: Correct the problem and use mmstartup to start GPFS.
I tried changing the IP to the new one, but still get the same error:
[root@gpfs1 ~]# mmchnode -N gpfs3 --admin-interface=xx.xx.xx.xx
Mon Feb 15 20:00:05 IST 2016: mmchnode: Processing node gpfs3
mmremote: Unable to determine the local node identity.
mmremote: Command failed. Examine previous error messages to determine cause.
mmremote: Unable to determine the local node identity.
mmremote: Command failed. Examine previous error messages to determine cause.
mmchnode: Unexpected error from checkExistingClusterNode gpfs3. Return code: 0
mmchnode: Command failed. Examine previous error messages to determine cause.
Can someone please help me in fixing this issue?
The easiest fix is probably to remove the node from the cluster (mmdelnode) and then add it back in (mmaddnode). You might need to mmdelnode -f.
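A minimal sketch of that sequence, run from a healthy cluster node (gpfs3 is the affected node from the question; adjust as needed):
mmdelnode -N gpfs3      # add -f if the node cannot be contacted cleanly
mmaddnode -N gpfs3
mmstartup -N gpfs3      # start GPFS on the re-added node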
If deleting and adding the node back in is not an option, try giving IBM support a call.
