Zabbix agent unable to detect PID of the running process - linux

Some triggers report the process as unavailable, but when I check on the host the process is running fine. The trigger expression is:
{$hostname:proc.num[,,,/etc/alternatives/java].last()}=0
It works fine for some hosts, but on others the trigger fires, reports the process as unavailable, and sends the alert.
Affected host:
# ps ax | grep java
1717 ? Ssl 119:15 /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Djsse.enableSNIExtension=false -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=-1 --httpsPort=8443 --ajp13Port=8009 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --httpsCertificate=/var/lib/jenkins/.ssl/hostssl.crt --httpsPrivateKey=/var/lib/jenkins/.ssl/hostssl.key
Zabbix log:
2000:20160901:081336.721 Starting Zabbix Agent [$hostname]. Zabbix 2.2.8 (revision 51174).
2000:20160901:081336.721 using configuration file: /etc/zabbix/zabbix_agentd.conf
2002:20160901:081336.724 agent #0 started [collector]
2004:20160901:081336.724 agent #2 started [listener #2]
2005:20160901:081336.725 agent #3 started [listener #3]
2006:20160901:081336.725 agent #4 started [active checks #1]
2003:20160901:081336.725 agent #1 started [listener #1]
cat: /proc//status: No such file or directory
cat: /proc//status: No such file or directory
cat: /proc//status: No such file or directory
cat: /proc//status: No such file or directory
Host sending zabbix data properly:
# ps ax | grep java
2472 ? Ssl 1279:26 /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Djsse.enableSNIExtension=false -Dorg.apache.commons.jelly.tags.fmt.timeZone=Europe/Dublin -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=-1 --httpsPort=8443 --ajp13Port=8009 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --httpsCertificate=/var/lib/jenkins/.security/hostssl.crt --httpsPrivateKey=/var/lib/jenkins/.security/hostssl.key --httpsPort=8443
On this host the Zabbix log does not contain the line cat: /proc//status: No such file or directory
As I understand it, the PID of the process is not being discovered, so the trigger fires the alert action.
Is there any way to troubleshoot this further, to see why the Zabbix agent cannot detect the PID of the running process on the affected machines?

The problem is resolved now.
I used zabbix_get to query the Zabbix agent directly. That showed the agent could not see any processes owned by jenkins or by any other non-zabbix user.
Googling brought me to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1032691
Applying a custom SELinux policy resolved the issue.
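The answer does not show the commands used, so the following is only a sketch of how the check and the policy could be done. It assumes zabbix_get and the audit tools (ausearch, audit2allow) are installed and that the querying machine is allowed to talk to the agent. The function names and the module name zabbix_agent_local are hypothetical.

```shell
#!/bin/sh
# Build the agent item key used in the trigger for a command-line pattern.
proc_num_key() {
    printf 'proc.num[,,,%s]\n' "$1"
}

# Ask a running agent for the value, the same way the server would.
# Assumption: zabbix_get is installed and this machine is permitted by the agent.
agent_proc_count() {
    host=$1
    pattern=$2
    if ! command -v zabbix_get >/dev/null 2>&1; then
        echo "zabbix_get not installed" >&2
        return 127
    fi
    zabbix_get -s "$host" -k "$(proc_num_key "$pattern")"
}

# Turn recent AVC denials that mention zabbix into a local policy module.
# Run as root. The module name is hypothetical; review the generated .te file
# before loading the module.
build_agent_policy() {
    ausearch -m avc --start recent | grep zabbix | audit2allow -M zabbix_agent_local
}
```

For example, agent_proc_count 127.0.0.1 /etc/alternatives/java returning 0 while ps shows the process would match the symptom described above. After build_agent_policy, the module could be loaded with semodule -i zabbix_agent_local.pp.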

Related

mpirun launching a runtime script for thread binding

I am running a program under MPI; the executable is myapp_mpi. The command line below launches the app correctly:
mpirun --bind-to none -np 128 myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I now want to constrain the thread binding with a script, pin.sh. The command line below produces an error:
mpirun --bind-to none -np 128 pin.sh myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I get the following error
--------------------------------------------------------------------------
Open MPI tried to fork a new process via the "execve" system call but
failed. Open MPI checks many things before attempting to launch a
child process, but nothing is perfect. This error may be indicative
of another problem on the target host, or even something as silly as
having specified a directory for your application. Your job will now
abort.
Local host: machine001
Working dir: /home/user/myapp/bin
Application name: /home/user/myapp/bin/pin.sh
Error: Exec format error
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:
Error code: 1
Error name: (null)
Node: machine001
when attempting to start process rank 0.
--------------------------------------------------------------------------
125 total processes failed to start
As far as I can tell, the file locations are correct. Any clue?
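The question does not show pin.sh, so the cause is not certain. One thing worth ruling out: execve returns "Exec format error" for a file that is neither a binary nor a script whose first line starts with #!. The sketch below checks for that and writes a hypothetical wrapper. The one-core-per-local-rank mapping with taskset is an illustration only, not the asker's binding scheme.

```shell
#!/bin/sh
# Return 0 if the file starts with "#!", which execve needs to run a script.
has_shebang() {
    [ "$(head -c 2 "$1")" = '#!' ]
}

# Write a hypothetical pin.sh that pins each rank to one core and then
# replaces itself with the real program and its arguments.
write_pin_wrapper() {
    cat > "$1" <<'EOF'
#!/bin/bash
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI for each launched process.
core=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec taskset -c "$core" "$@"
EOF
    chmod +x "$1"
}
```

Running has_shebang /home/user/myapp/bin/pin.sh on the existing script shows whether the interpreter line is present; the script also needs to be executable.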

cannot start worker in apache spark due to wrong java version

I saw a similar question, but it didn't solve my problem, so I am posting this one. Please help if you can.
I have one worker node and a separate master in my Apache Spark configuration. After running start-all.sh, the web UI shows only the worker running on the master, not the one on the slave.
My /etc/hosts on the master node:
10.0.0.6 master
10.0.0.20 slave01
on master:
ll /etc/alternatives/java
lrwxrwxrwx. 1 root root 73 Jan 13 00:14 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64/jre/bin/java
On master, /opt/spark/conf/spark-env.sh contains:
export SPARK_MASTER_HOST='10.0.0.6'
export JAVA_HOME='/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64'
When I run start-all.sh, here is the output:
[sudip@master sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sudip-org.apache.spark.deploy.master.Master-1-master.out
slave01: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sudip-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
master: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sudip-org.apache.spark.deploy.worker.Worker-1-master.out
On master:
[sudip@master sbin]$ cat /opt/spark/logs/spark-sudip-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
/opt/spark/bin/spark-class: line 71: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64//bin/java: No such file or directory
I have no idea where it gets the "java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64" shown in the log. Any idea where I am going wrong?
Update:
I deleted the log file and created it again. Now the log file for the slave is empty even when I run start-all.sh or start-slaves.sh. I also don't see the slave entry in http://:8080
Thanks and Regards,
Sudip
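The path in the worker log differs from the JAVA_HOME set on master, so it presumably comes from the environment or the spark-env.sh of the node that wrote the log; spark-class builds the java path from JAVA_HOME. As a rough sketch for comparing nodes, the hypothetical helper below reports whether a JAVA_HOME candidate really contains bin/java.

```shell
#!/bin/sh
# Report whether a JAVA_HOME candidate contains an executable bin/java.
check_java_home() {
    dir=${1%/}   # drop a trailing slash, which shows up as "//bin/java" in the log
    if [ -x "$dir/bin/java" ]; then
        echo "ok: $dir/bin/java"
    else
        echo "missing: $dir/bin/java"
        return 1
    fi
}
```

On each node, sourcing /opt/spark/conf/spark-env.sh and then running check_java_home "$JAVA_HOME" shows whether that node points at a JDK that is actually installed there.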

Documentum.cmis.too many open files error

We moved our application from RHEL 6 to RHEL 7, and since the deployment we are seeing the following error in the catalina.properties; because of it, my VM link goes down frequently. We are using Documentum CMIS 16.4 on Tomcat 8.5.
The error details follow:
27-Nov-2018 01:57:00.536 SEVERE [https-jsse-nio-0.0.0.0-12510-Acceptor-0] org.apache.tomcat.util.net.NioEndpoint$Acceptor.run Socket accept failed
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:457)
at java.lang.Thread.run(Thread.java:748)
Here is what I have tried so far:
I increased the ulimit value from 1024 to 8192 for the specific user, rebooted the machine, and recycled the Tomcat service, but nothing changed. I made the change in the file /etc/security/limits.d/20-nproc.conf. Kindly help.
I don't have privileges to add a comment, so I am posting this as an answer. Find out how many files the process has open with the command
lsof -p <pid> | wc -l
That gives the count; run lsof -p <pid> without the wc -l to list the files and see which ones are not getting closed.
You can also check the limits of a running process with
cat /proc/<pid>/limits
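The two checks above can be combined into one line of output per process. This is a minimal sketch that reads /proc directly on Linux; the function name fd_usage is made up for illustration. One assumption worth checking on RHEL 7: if Tomcat is started by systemd, the files under /etc/security/limits.d (applied through PAM at login) may not reach it, and the LimitNOFILE setting of the service unit would be what counts.

```shell
#!/bin/sh
# Print the number of open file descriptors and the soft "Max open files"
# limit for a running process, as seen in /proc (Linux only).
fd_usage() {
    pid=$1
    # Count entries in the fd directory; needs permission to read it.
    open=$(( $(ls "/proc/$pid/fd" 2>/dev/null | wc -l) ))
    # Fourth field of the "Max open files" line is the soft limit.
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null)
    echo "pid=$pid open=$open soft_limit=$soft"
}
```

Calling fd_usage with the Tomcat PID shows whether the raised limit actually applies to the running process and how close the process is to it.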

icingaweb2 Permission denied

Please help me to solve this issue with icingaweb
icinga2: Can't send external Icinga command to the local command file "/var/run/icinga2/cmd/icinga2.cmd": Permission denied.
#0 /usr/share/icingaweb2/modules/monitoring/application/forms/Command/Object/ScheduleServiceDowntimeCommandForm.php(191): Icinga\Module\Monitoring\Command\Transport\CommandTransport->send(Object(Icinga\Module\Monitoring\Command\Object\ScheduleHostDowntimeCommand))
#1 /usr/share/icingaweb2/modules/monitoring/application/forms/Command/Object/ScheduleHostDowntimeCommandForm.php(108): Icinga\Module\Monitoring\Forms\Command\Object\ScheduleServiceDowntimeCommandForm->scheduleDowntime(Object(Icinga\Module\Monitoring\Command\Object\ScheduleHostDowntimeCommand), Object(Icinga\Web\Request))
#2 /usr/share/php/Icinga/Web/Form.php(1152): Icinga\Module\Monitoring\Forms\Command\Object\ScheduleHostDowntimeCommandForm->onSuccess()
#3 /usr/share/icingaweb2/modules/monitoring/library/Monitoring/Web/Controller/MonitoredObjectController.php(128): Icinga\Web\Form->handleRequest()
#4 /usr/share/icingaweb2/modules/monitoring/application/controllers/HostController.php(155): Icinga\Module\Monitoring\Web\Controller\MonitoredObjectController->handleCommandForm(Object(Icinga\Module\Monitoring\Forms\Command\Object\ScheduleHostDowntimeCommandForm))
#5 /usr/share/php/Zend/Controller/Action.php(516): Icinga\Module\Monitoring\Controllers\HostController->scheduleDowntimeAction()
#6 /usr/share/php/Icinga/Web/Controller/Dispatcher.php(76): Zend_Controller_Action->dispatch('scheduleDowntim...')
#7 /usr/share/php/Zend/Controller/Front.php(954): Icinga\Web\Controller\Dispatcher->dispatch(Object(Icinga\Web\Request), Object(Icinga\Web\Response))
#8 /usr/share/php/Icinga/Application/Web.php(384): Zend_Controller_Front->dispatch(Object(Icinga\Web\Request), Object(Icinga\Web\Response))
#9 /usr/share/php/Icinga/Application/webrouter.php(109): Icinga\Application\Web->dispatch()
#10 /usr/share/icingaweb2/public/index.php(4): require_once('/usr/share/php/...')
#11 {main}
In my case (CentOS 7) all I had to do was to ensure the icinga2 feature 'command' was enabled, and restart the service.
icinga2 feature enable command
systemctl restart icinga2.service
The error message is probably correct. You'll need to set up the correct Unix permissions for that file. The CentOS 7 packages do the right thing there, but for me the problem was related to SELinux. Check for SELinux denials to see if your commands are being denied:
ausearch -m avc --start recent
Check the context of the command file:
# ls -lZ /var/run/icinga2/cmd/icinga2.cmd
prw-rw----. icinga icingacmd system_u:object_r:var_run_t:s0 /var/run/icinga2/cmd/icinga2.cmd
I fixed this by installing the icinga2-selinux package after all the other configuration. In particular, you need to (re)install it after enabling the local (named pipe) command transport. After re-installing icinga2-selinux, the correct context should be:
# ls -lZ /var/run/icinga2/cmd/icinga2.cmd
prw-rw----. icinga icingacmd system_u:object_r:icinga2_command_t:s0 /var/run/icinga2/cmd/icinga2.cmd
Restart icinga2 and Apache.
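To compare the two contexts shown above without reading ls output by eye, the type field can be pulled out of the context string. A small sketch; the function name selinux_type is hypothetical.

```shell
#!/bin/sh
# Extract the type (third field) from an SELinux context string such as
# system_u:object_r:var_run_t:s0
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}
```

With GNU stat, selinux_type "$(stat -c %C /var/run/icinga2/cmd/icinga2.cmd)" should print icinga2_command_t once the icinga2-selinux package has been reinstalled, and var_run_t before that.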
Disabling SELinux will help.
Temporarily switch SELinux to permissive mode and try again:
setenforce 0
If it works, make the change permanent.
Edit /etc/selinux/config and make sure it contains
SELINUX=disabled

PHP exec(myexe) fails in PHP App, but not CLI. Fails Running Under User "apache"

I have a custom program (e.g. myexe) being executed by a web app using PHP's exec() function. It does not fail when run using the PHP CLI nor does myexe fail when run from the command line with me as a user. I have built myexe so that there are no memory issues when profiled using valgrind. myexe is about 26MB in size.
To simplify the situation, I have run myexe on the command line under the user 'apache' and reproduced the failure.
su -s /bin/sh apache -c "/usr/local/bin/myexe parm1 parm2..."
==> Segmentation fault (core dumped)
BUT when I change the user to myself and run the same command above, it works.
su -s /bin/sh mike -c "/usr/local/bin/myexe parm1 parm2..."
==> WORKS
Here's the error from the system log file:
Jul 9 18:26:15 DEVSTN-1 kernel: myexe[27352]: segfault at 7fffa2bf9ff8 ip 0000000000410324 sp 00007fffa2bfa000 error 6 in myexe[400000+5ae000]
Jul 9 18:26:16 DEVSTN-1 abrt[27353]: Saved core dump of pid 27352 (/usr/local/bin/myexe) to /var/spool/abrt/ccpp-2015-07-09-18:26:15-27352 (13631488 bytes)
Jul 9 18:26:16 DEVSTN-1 abrtd: Directory 'ccpp-2015-07-09-18:26:15-27352' creation detected
Jul 9 18:26:17 DEVSTN-1 abrtd: Executable '/usr/local/bin/myexe' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jul 9 18:26:17 DEVSTN-1 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-07-09-18:26:15-27352' exited with 1
Jul 9 18:26:17 DEVSTN-1 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-07-09-18:26:15-27352'
My configuration:
CentOS6 2.6.32-504.23.4.el6.x86_64
Apache/2.2.15 (CentOS)
PHP Version 5.3.3
Am I correct in assuming that PHP has nothing to do with the error?
What should I do next?
Correct; PHP has nothing to do with the error. This is a segmentation fault caused by invalid memory access (either overflowing a buffer, or accessing already-freed memory) in myexe. It seems to have saved a core dump to /var/spool/abrt/ccpp-2015-07-09-18:26:15-27352, so, try debugging with GDB:
gdb /usr/local/bin/myexe -c /var/spool/abrt/ccpp-2015-07-09-18:26:15-27352
(gdb) bt
And try to see where the executable is failing. To get useful output, it will need to be compiled with debugging symbols. If it doesn't fail running as root or a different user, or running in an interactive terminal, I'd look for bugs that could be triggered by being unable to open a file, unable to read an expected environment variable, etc. to help isolate your problem.
Running the executable under strace might help figure out what's going on as well.
Found the problem by starting a bash shell as user apache and running the program under gdb.
It turns out myexe was trying to create a directory under the user's home directory (/home/apache), which doesn't exist.
What helped me was knowing how to start a shell under a different user and using gdb.
Here's the command to start a shell under another user (apache):
su -s /bin/bash apache
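Since the root cause was a missing home directory, a quick check before reaching for gdb is whether the home directory recorded for the service user exists. A minimal sketch with hypothetical function names; it reads /etc/passwd only, so users defined in LDAP or similar directories are not covered.

```shell
#!/bin/sh
# Print the home directory recorded for a user.
# The optional second argument (passwd file path) exists only for testing.
home_of() {
    awk -F: -v u="$1" '$1 == u { print $6 }' "${2:-/etc/passwd}"
}

# Report whether that home directory exists on disk.
check_home() {
    home=$(home_of "$1" "${2:-/etc/passwd}")
    if [ -n "$home" ] && [ -d "$home" ]; then
        echo "ok: $home"
    else
        echo "missing: ${home:-no passwd entry}"
        return 1
    fi
}
```

For instance, check_home apache reports whether the directory the program would write to is there at all.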
