Trouble running a custom monit check, stderr maybe? - linux

Trying to run this check from monit, but it doesn't work. The gravity program sends its output to stderr. Could it be that monit doesn't handle this properly because of the way it exec's the check?
contents of system.monitrc:
check program gravityStatus with path /usr/local/bin/check.sh
with timeout 10 seconds
if status !=0 then alert
check.sh:
root@tiki:~# cat /usr/local/bin/check.sh
#!/bin/bash
#This will return zero if all good
/usr/bin/gravity status |& /usr/bin/jq .SyncInfo.catching_up | grep -q 'false'
Output:
Program 'gravityStatus'
status Status failed
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 1
last output parse error: Invalid numeric literal at line 1, column 6
data collected Tue, 01 Feb 2022 19:52:37
If I execute the contents of check.sh on the command line, the script works:
root@tiki:~# /usr/bin/gravity status |& /usr/bin/jq .SyncInfo.catching_up | grep -q 'false'
root@tiki:~# echo $?
0

I figured it out. I want to thank @boppy for his comment, it was very helpful. Here's what I did:
I changed the check.sh to just run 'gravity status' and then looked at the monit status. It says this:
Program 'gravityStatus'
status Status failed
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 2
last output panic: $HOME is not defined
goroutine 1 [running]:
github.com/cosmos/cosmos-sdk/simapp.init.0()
/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.44.5/simapp/app.go:182 +0x189
The problem was that gravity status was dying before it could send any output to the jq process. gravity has to look at $HOME/.gravity, where a bunch of configs are located. So the solution was to set $HOME to /root, which is where all the gravity files are set up.
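For reference, the fixed check.sh could look something like this (a sketch; the exact HOME value depends on where gravity's config lives, /root in my case):
#!/bin/bash
# This will return zero if all good.
# monit does not provide a HOME, and gravity panics without one,
# so set it explicitly to the directory that contains .gravity.
export HOME=/root
/usr/bin/gravity status |& /usr/bin/jq .SyncInfo.catching_up | grep -q 'false'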

Related

timeout in shell script and report those input with timeout

I would like to conduct analyses using the program Arlsumstat_64bit with thousands of input files.
Arlsumstat_64bit reads an input file (.arp) and writes to a result file (sumstat.out).
Each input appends a new line to the result file (sumstat.out) based on the argument "0 1".
Therefore, I wrote a shell script to process all the input files (*.arp) in the same folder.
However, if an input file contains errors, the shell script gets stuck without processing the remaining files. Therefore, I found the "timeout" command to deal with my issue.
I made a shell script as follows:
#!/bin/bash
for sp in $(ls *.arp) ;
do
echo "process start: $sp"
timeout 10 arlsumstat_64bit ${sp}.arp sumstat.out 1 0
rm -r ${sp}.res
echo "process done: $sp"
done
However, I still need to know which input files failed.
How could I make a list telling me which input files timed out?
See the man page for the timeout command http://man7.org/linux/man-pages/man1/timeout.1.html
If the command times out, and --preserve-status is not set, then exit
with status 124. Otherwise, exit with the status of COMMAND. If no
signal is specified, send the TERM signal upon timeout. The TERM
signal kills any process that does not block or catch that signal.
It may be necessary to use the KILL (9) signal, since this signal
cannot be caught, in which case the exit status is 128+9 rather than
124.
You should find out which exit codes are possible for the program arlsumstat_64bit. I assume it exits with status 0 on success; otherwise the script below will not work. If you need to distinguish between a timeout and other errors, the program should not itself use exit status 124, which timeout uses to indicate a timeout. You can then check the exit status of your command to distinguish between success, error, or timeout as necessary.
To keep the script simple I assume you don't need to distingish between timeout and other errors.
I added some comments where I modified your script to improve it or to show alternatives.
#!/bin/bash
# don't parse the output of ls
for sp in *.arp
do
echo "process start: $sp"
# instead of using "if timeout 10 arlsumstat_64bit ..." you could also run
# timeout 10 arlsumstat_64bit... and check the value of `$?` afterwards,
# e.g. if you want to distinguish between error and timeout.
# $sp will already contain .arp so ${sp}.arp is wrong
# use quotes in case a file name contains spaces
if timeout 10 arlsumstat_64bit "${sp}" sumstat.out 1 0
then
echo "process done: $sp"
else
echo "processing failed or timeout: $sp"
fi
# If the result for foo.arp is foo.res, the .arp must be removed
# If it is foo.arp.res, rm -r "${sp}.res" would be correct
# use quotes
rm -r "${sp%.arp}.res"
done
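If you do want a list of the inputs that timed out, you can capture the exit status and append the file name to a log. A sketch (assuming arlsumstat_64bit never exits with 124 itself; the timeout_list.txt and error_list.txt names are just examples):
#!/bin/bash
for sp in *.arp
do
    echo "process start: $sp"
    timeout 10 arlsumstat_64bit "${sp}" sumstat.out 1 0
    status=$?
    if [ "$status" -eq 124 ]; then
        # timeout killed this run; remember which input it was
        echo "$sp" >> timeout_list.txt
    elif [ "$status" -ne 0 ]; then
        # the program failed on its own
        echo "$sp" >> error_list.txt
    fi
    # adjust the .res name as discussed in the comments above
    rm -r "${sp%.arp}.res"
    echo "process done: $sp"
done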
Below code should work for you:
#!/bin/bash
for sp in *.arp
do
echo "process start: $sp"
timeout 10 arlsumstat_64bit "${sp}" sumstat.out 1 0
if [ $? -eq 0 ]
then
echo "process done successfully: $sp"
else
echo "process failed: $sp"
fi
echo "Deleting ${sp}.res"
rm -r "${sp}.res"
done

Get return code from command run on ssh tunnel [duplicate]

This question already has answers here:
Exit when one process in pipe fails
(2 answers)
Closed 4 years ago.
Even if mycode.sh has a non-zero exit code, this command returns 0 because the ssh connection was successful. How can I get the actual return code of the .sh on the remote server?
/home/mycode.sh '20'${ODATE} 1 | ssh -L 5432:localhost:5432 myuser@myremotehost cat
This is not related to SSH, but to how bash handles the exit status in pipelines. From the bash manual page:
The return status of a pipeline is the exit status of the last command, unless the pipefail option is enabled. If pipefail is enabled, the pipeline's return status is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully. If the reserved word ! precedes a pipeline, the exit status of that pipeline is the logical negation of the exit status as described above. The shell waits for all commands in the pipeline to terminate before returning a value.
If you want to check that there was an error in the pipeline due to any of the commands involved, just set the pipefail option:
set -o pipefail
your_pipeline_here
echo $? # Prints non-zero if something went wrong
It is not possible to actually send the exit status to the next command in the pipeline (in your case, ssh) without additional steps. If you really want to do that, the command will have to be split like this:
res="$(/home/mycode.sh '20'${ODATE} 1)"
if (( $? == 0 )); then
echo -n "$res" | ssh -L 5432:localhost:5432 myuser@myremotehost cat
else
# You can do anything with the exit status here - even pass it on as an argument to the remote command
echo "mycode.sh failed" >&2
fi
You may want to save the output of mycode.sh to a temporary file instead of the $res variable if it's too large.
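If you would rather keep the single pipeline, bash's PIPESTATUS array is another option (a sketch, not specific to ssh): it records the exit status of every command in the most recent pipeline.
/home/mycode.sh '20'${ODATE} 1 | ssh -L 5432:localhost:5432 myuser@myremotehost cat
# copy PIPESTATUS immediately; the next command overwrites it
statuses=("${PIPESTATUS[@]}")
if (( statuses[0] != 0 )); then
    echo "mycode.sh failed with status ${statuses[0]}" >&2
fi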
/home/mycode.sh is located on the local host.
the ssh command is running cat on the remote server.
All text printed to the standard output of the /home/mycode.sh is redirected to the cat standard input.
The man ssh reads:
EXIT STATUS
ssh exits with the exit status of the remote command or with 255 if an error occurred.
Conclusion: ssh exits with the EXIT STATUS of cat, or 255 if an error occurred.
If the /home/mycode.sh script prints shell commands to its standard output, they can be run on the remote server when cat is omitted:
/home/mycode.sh '20'${ODATE} 1 | ssh -L 5432:localhost:5432 myuser@myremotehost
In my test, the EXIT STATUS of the last command executed on the remote server is returned by ssh:
printf "%s\n" "uname -r" date "ls this_file_does_not_exist" |\
ssh -L 5432:localhost:5432 myuser@myremotehost ;\
printf "EXIT STATUS of the last command, executed remotely with ssh is %d\n" $?
4.4.0-119-generic
Wed Aug 29 02:55:04 EDT 2018
ls: cannot access 'this_file_does_not_exist': No such file or directory
EXIT STATUS of the last command, executed remotely with ssh is 2

SNMP Traphandle not working

This is my first time working with SNMP, but after reading the SNMP pages I'm still having trouble getting a simple shell script to run when receiving a trap.
My /etc/snmp/snmptrapd.conf file looks like this:
# Example configuration file for snmptrapd
#
# No traps are handled by default, you must edit this file!
#
disableAuthorization yes
authCommunity log,execute,net public
# the generic traps
traphandle default /usr/local/bin/snmptrapd.sh
The snmptrapd.sh script just says "hello".
#!/bin/sh
echo "hello"
The script is executable and runs when executed independently:
> /usr/local/bin/snmptrapd.sh
hello
The snmptrapd is running as a background process:
> ps -ef | grep snmp
root 29477 1 0 14:49 ? 00:00:00 /usr/sbin/snmptrapd -Lsd -p /var/run/snmptrapd.pid -Cc /etc/snmp/snmptrapd.conf
And yet when I send a trap locally using snmptrap nothing happens:
> snmptrap -v 2c -c public localhost "" NET-SNMP-EXAMPLES-MIB::netSnmpExampleHeartbeatNotification netSnmpExampleHeartbeatRate i 123456
>
Now it seems that the trap does get logged, because the system log file (/var/log/messages) has the following entry:
Aug 8 15:46:10 <server_name> snmptrapd[29477]: 2017-08-08 15:46:10 localhost
[UDP: [127.0.0.1]:44928->[127.0.0.1]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance =
Timeticks: (1338382434) 154 days, 21:43:44.34#011SNMPv2-MIB::snmpTrapOID.0 =
OID: NET-SNMP-EXAMPLES-MIB::netSnmpExampleHeartbeatNotification#011NET-SNMP-EXAMPLES-MIB::netSnmpExampleHeartbeatRate
= INTEGER: 123456
As far as I can see everything is set up correctly. If so, why is the trap handle not working and how can one check why the trap doesn't trigger the script?
Thanks in advance.
EDIT: When I added the -Ci option to the snmptrapd command line options I got the following error:
No log handling enabled - turning on stderr logging
: Unknown Object Identifier (Sub-id not found: (top) -> )
OK, so after looking around some more I found the answer.
The reason we are not seeing the output is that snmptrapd runs as a daemon and does not send its standard output to the console. One can replace the echo line with
echo "hello" > $HOME/output.txt
and the word 'hello' appears in the output.txt file.
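For a handler that does something more useful, snmptrapd passes the trap details to the traphandle script on standard input (the hostname, the source address, and then one OID/value pair per line), so a sketch like the following could append every trap to a file (the log path is just an example):
#!/bin/sh
# Hypothetical handler: log each trap that snmptrapd hands us.
LOGFILE=/tmp/traps.log
read -r host
read -r ip
echo "$(date): trap from $host ($ip)" >> "$LOGFILE"
# the remaining lines are "OID value" pairs
while read -r oid value; do
    echo "  $oid = $value" >> "$LOGFILE"
done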
See also http://www.linuxquestions.org/questions/linux-newbie-8/net-snmp-trap-handling-4175420577/
and
https://superuser.com/questions/823435/where-to-log-stdout-and-stderr-of-a-daemon

The Linux timeout command and exit codes

In a Linux shell script I would like to use the timeout command to end another command if some time limit is reached. In general:
timeout -s SIGTERM 100 command
But I also want my shell script to exit when the command fails for some reason. If the command fails early enough, the time limit will not be reached, and timeout will exit with exit code 0. Thus the error cannot be trapped with trap or set -e, at least I have tried it and it did not work. How can I achieve what I want to do?
Your situation isn't very clear because you haven't included your code in the post.
timeout does exit with the exit code of the command if it finishes before the timeout value.
For example:
timeout 5 ls -l non_existent_file
# outputs ERROR: ls: cannot access non_existent_file: No such file or directory
echo $?
# outputs 2 (which is the exit code of ls)
From man timeout:
If the command times out, and --preserve-status is not set, then
exit with status 124. Otherwise, exit with the status of COMMAND. If
no signal is specified, send the TERM signal upon timeout. The TERM
signal kills any process that does not block or catch that signal.
It may be necessary to use the KILL (9) signal, since this signal
cannot be caught, in which case the exit status is 128+9 rather than
124.
See BashFAQ105 to understand the pitfalls of set -e.
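Putting that together, rather than relying on set -e you can check the propagated status explicitly; a sketch (command is the placeholder from the question):
timeout -s SIGTERM 100 command
status=$?
if [ "$status" -ne 0 ]; then
    # the command failed early, or it timed out (124, or 128+9 if KILL was needed)
    echo "command failed with status $status" >&2
    exit "$status"
fi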

How to see if the process was killed?

When you want to set a time limit for a process, you can simply use timeout before the process:
timeout 1.5s COMMAND
This will kill the COMMAND if it was not done after 1.5 seconds.
I used that command in some bash scripts. How can I know whether a process finished before the time limit, or whether it was killed for exceeding the time limit?
The GNU timeout command normally returns a status code of 124 if the timeout was exceeded. Otherwise, it returns the status code returned by the command itself. So you can test the status code by grabbing the value of $? immediately after executing timeout:
timeout 1.5s COMMAND
status=$?
if ((status == 124)); then
    echo "command timed out"
elif ((status != 0)); then
    echo "command terminated in time, but returned an error status"
else
    echo "command terminated in time and reported success"
fi
If your command might itself return the status code 124, then you would have to use the --preserve-status option and check whether the command was terminated by the signal you tell timeout to send. See the timeout man page for details.
Add && echo >> time_limit.txt after COMMAND:
timeout 1.5s COMMAND && echo >> time_limit.txt
So, if you want to see if the COMMAND was killed, check the existence of file time_limit.txt. If that file exists, it means the command was NOT killed. Otherwise, the command was killed.
In bash script, you can check the existence of that file as follow:
if [[ -r time_limit.txt ]]; then
    echo "The command was NOT killed"
else
    echo "The command was killed"
fi
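Note that a leftover time_limit.txt from a previous run would make this check misleading, so it may help to remove the file before running the command, e.g.:
rm -f time_limit.txt
timeout 1.5s COMMAND && echo >> time_limit.txt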
