Nagios Custom Script for finding defunct processes - linux

Objective: Check for zombie processes, but if any are found to be owned by the account nagios, ignore them. If a zombie process is found to be owned by an account other than the nagios account, then issue an alert.
I currently have the following nagios check in nrds.cfg to find zombie processes.
command[Check_ZombieProcs]=/opt/tools/nagitem/libexec/check_procs -w 1 -c 2 -s Z
It generates output like this when it finds something:
PROCS CRITICAL: 3 processes with STATE = Z | procs=3;1;2;0;
To find out who owns these 3 defunct (zombie) processes, I can manually run the command:
ps -elf | grep Z | grep "defunct" which results in the output shown below.
0 Z nagios 9487 9948 0 80 0 - 0 do_exi Nov07 ? 00:00:00 [sh] <defunct>
0 Z nagios 28647 9949 0 80 0 - 0 do_exi Nov03 ? 00:00:00 [sh] <defunct>
0 Z nagios 67429 9947 0 80 0 - 0 do_exi Nov03 ? 00:00:00 [sh] <defunct>
My problem is that the defunct processes shown above end up clearing on their own and I want to exclude from the output any defunct zombie processes owned by the account nagios.
Since the check_procs nagios plugin does not have an argument to handle the exclusion of a user, I am creating the following script.
With the script below, what I want to accomplish is:
If status is equal to null or is equal to nagios, then state should be ok (green).
If status is equal to anything other than null or nagios, then state should be critical (red).
#!/bin/ksh
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
SCRIPTPATH=`echo $0 | /bin/sed -e 's,[\\/][^\\/][^\\/]*$,,'`
if [[ -f ${SCRIPTPATH}/utils.sh ]]; then
. ${SCRIPTPATH}/utils.sh # use nagios utils to set real STATE_* return values
fi
printvariables() {
echo "Variables:"
#Add all your variables at the end of the "for" line to display them in verbose
for i in EXITSTATUS EXITMESSAGE
do
echo -n "$i : "
eval echo \$${i}
done
echo
}
#Set to unknown in case of unexpected exit
EXITSTATUS=$STATE_UNKNOWN
EXITMESSAGE="UNKNOWN: Unexpected exit. You should check that everything is alright"
ZOMB=$(ps -elf | grep Z | grep "defunct" | awk {'print $3'} | sort -u)
if [ "$ZOMB" == "nagios" ];then
EXITSTATUS=$STATE_OK
EXITMESSAGE="OK - No Defunct processes found"
elif [ "$ZOMB" == "" ];then
EXITSTATUS=$STATE_OK
EXITMESSAGE="OK - No Defunct processes found"
else
EXITSTATUS=$STATE_CRITICAL
EXITMESSAGE="CRITICAL - Defunct processes found owned by $ZOMB"
fi
#Script end, display verbose information
if [[ $VERBOSE -eq 1 ]] ; then
printvariables
fi
echo ${EXITMESSAGE}
exit $EXITSTATUS
The command:
ps -elf | grep Z | grep "defunct" | awk {'print $3'}
has the potential to give output similar to the following which no doubt will cause an issue. I cannot replicate or thoroughly test as I only occasionally see a defunct process.
nagios
root
someuser
Note that in my script above I am using (see the sort -u):
ps -elf | grep Z | grep "defunct" | awk {'print $3'} | sort -u
to avoid getting the same account (like nagios) listed multiple times in the output. My script above does work, but I am concerned that it cannot handle the case where more than one account has defunct processes.
Appreciate any help.

This is more of a bash question than a nagios question, but I'll bite.
The first issue I see is that you're simply pushing all output from ps into $ZOMB when what you probably want to do is to iterate over each line (process).
You could do something like this:
For each line in the ps output, check if it's a zombie process. If not, discard it.
If yes, check whether the user is nagios, if so discard it.
If not, we have a problem, so exit 1 and say what the user and process is.
Example:
#!/bin/bash
ps -elf | while IFS= read -r line
do
user=$(echo "$line" | awk '{print $3}')
stat=$(echo "$line" | awk '{print $2}')
proc=$(echo "$line" | awk '{print $15}')
if [ "$stat" != "R" ]; then
# it's not running, we don't care
continue
elif [ "$user" == "root" ]; then
# it's root, we don't care
continue
else
# we do care! print info and exit 1 immediately
echo "Found process: $user '$proc'"
exit 1
fi
done
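In your case, "R" would be "Z" and "root" would be "nagios". Nagios treats exit code 1 as WARNING, so use exit 2 (STATE_CRITICAL in your script) if you want a critical alert.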
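To handle several accounts at once, the owner filtering can also be done in a single awk pass instead of a per-line shell loop. The following is a minimal sketch, not a drop-in plugin: the function names zombie_owners and check_zombies are made up for illustration, it assumes a ps that accepts -o stat= -o user= (as procps on Linux does), and it uses return code 2 for CRITICAL as in the STATE_* values from the question.

```shell
#!/bin/bash
# Hypothetical helper: read "STAT USER" pairs on stdin and print the unique
# owners of zombie entries as a comma-separated list, skipping one account.
zombie_owners() {
    local ignore="$1"
    awk -v ignore="$ignore" '$1 ~ /^Z/ && $2 != ignore { print $2 }' \
        | sort -u | paste -sd, -
}

# Hypothetical wrapper: feed the live process table through the filter and
# translate the result into a Nagios-style message and return code.
check_zombies() {
    local owners
    owners=$(ps -e -o stat= -o user= | zombie_owners nagios)
    if [ -z "$owners" ]; then
        echo "OK - No defunct processes found outside the nagios account"
        return 0
    fi
    echo "CRITICAL - Defunct processes found owned by: $owners"
    return 2
}
```

Because zombie_owners reads from stdin, it can be fed hand-written lines to try the multi-account case without waiting for a real zombie. In a plugin script, calling check_zombies and then exit $? would hand the state to Nagios.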

Related

Trying to get script working for Disk check and Deluge

I managed to get this script working earlier, but then it stopped working. Now I always get the error below, and the log file is empty.
My script was working fine until I changed { df -k ${fileSystem}|tail -n1 } to { quota -u |tail -n1 }, because quota shows the usage assigned to me instead of the entire seedbox.
Check Entire Seedbox disk free
user@hera:~/scripts$ df -k /home20/<user>|tail -n1
/dev/sdu1 15616058976 3311158640 12303321368 22% /home20
Check Seedbox Slot only disk free
user@hera:~/scripts$ quota -u <user>|tail -n1
/dev/sdu1 1629501728 1953497088 1953497088 4344 0 0
Log file
tail: deluge-disk-check.log: file truncated
[empty]
Error message
./deluge-disk-check.sh: line 23: let: freeSpacePct=100*/: syntax error: operand expected (error token is "/")
my script
#!/bin/bash
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3
exec 1>/home/<user>/scripts/deluge-disk-check.log 2>&1
# Adjust these parameters to your system
fileSystem="/home/<user>" # Filesystem you want to monitor
minFreeSpace1="78" # Seedbox Free space 1st threshold percentage
minFreeSpace2="77" # Seedbox Free space 2nd threshold percentage
checkInterval="3600" # Interval between checks in seconds
SERVICE='deluged'
while (:); do
# Get the output of df -k and put in a variable for parsing
dfOutPut=$(df -k ${fileSystem}|tail -n1)
# Extract the fields containing total and available 1K-blocks
totalBlocksKb=$(echo "${dfOutPut}" | awk '{print $2}')
availableBlocksKb=$(echo "${dfOutPut}" | awk '{print $4}')
# Calculate percentage of free space
let freeSpacePct=100\*${availableBlocksKb}/${totalBlocksKb}
# Check if free space percentage is below threshold value
if [ "${freeSpacePct}" -lt "${minFreeSpace1}" ]; then
date +'%Y-%m-%d %H:%M:%S'
echo "You only have ${freeSpacePct}% free space on seedbox"
# Check whether the instance of thread exists:
if ps ax | grep -v grep | grep $SERVICE > /dev/null
then
echo "Deluge is running, Is there free space on device?"
pkill deluge
echo -e "No, Trying to stop Deluge...\nDeluge stopped, exiting"
else
if [ "${freeSpacePct}" -lt "${minFreeSpace2}" ]; then
echo "refreshing data..."
echo "Deluge is not running, Is there free space on device?"
echo "Yes, Threshold value is now ${minFreeSpace2}%"
echo "Trying to restart Deluge ..." && app-deluge restart
echo "Deluge is running, exiting"
sleep ${checkInterval}
fi
fi
fi
done
On line 23:
let freeSpacePct=100\*${availableBlocksKb}/${totalBlocksKb}
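For arithmetic evaluation, use $(( ... )) like this:
freeSpacePct=$(( 100 * ${availableBlocksKb} / ${totalBlocksKb} ))
or more simply (without ${...} around variable names):
freeSpacePct=$(( 100 * availableBlocksKb / totalBlocksKb ))
Note that the error message shows 100*/, which means both variables were empty when the line ran. Changing the syntax alone will not help in that case, so also check what the df or quota command actually printed.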
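Since the error message shows that both operands were empty, it may help to validate the parsed values before doing the arithmetic. This is only a sketch under that assumption; percent_free is a hypothetical helper name, and it expects plain non-negative integers such as the 1K-block counts shown in the question.

```shell
#!/bin/bash
# Hypothetical helper: print the integer percentage available/total,
# or fail with a message if either value is missing or not a number.
percent_free() {
    local available="$1" total="$2"
    if ! [[ "$available" =~ ^[0-9]+$ && "$total" =~ ^[0-9]+$ ]] || (( total == 0 )); then
        echo "percent_free: expected two numbers, got '$available' and '$total'" >&2
        return 1
    fi
    echo $(( 100 * available / total ))
}

# Values taken from the df output line in the question (total, then available).
percent_free 12303321368 15616058976
```

With the df figures from the question this prints 78. For the quota line, the free amount would first have to be derived from the used and limit columns, which depends on the exact quota output format.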

How to get correct tail PID under multiple process in shell

I am implementing a monitor_log function that tails the most recent line of a running log and checks for a required string in a while loop. The timeout logic is that once the tail has been running for more than 300 seconds, the tail and the while loop pipeline must be closed. As mentioned in "How to timeout a tail pipeline properly on shell", even after monitor_log finishes it leaves the tail process running in the background, so I have to kill it manually; otherwise hundreds of calls to monitor_log would leave hundreds of tail processes.
The monitor_log function below kills the correct tail PID only in the single-process case, because tail_pid=$(ps -ef | grep 'tail' | cut -d' ' -f5) then returns the PID of the only tail process. In the multi-process case (process forks), for example several parallel calls to monitor_log, there are multiple tail processes running, and tail_pid=$(ps -ef | grep 'tail' | cut -d' ' -f5) cannot return exactly the tail PID that belongs to the current fork.
So is it possible to NOT leave a tail process behind when monitor_log finishes? Or, if a tail process is left, is there a proper way to find its correct PID (not the tail PID from another process) and kill it in the multi-process case?
function monitor_log() {
if [[ -f "running.log" ]]; then
# Tail the running log last line and keep check required string
while read -t 300 tail_line
do
if [[ "$tail_line" == "required string" ]]; then
capture_flag=1
fi
if [[ $capture_flag -eq 1 ]]; then
break;
fi
done < <(tail -n 1 -f "running.log")
# Not get correct tail PID under multiple process
tail_pid=$(ps -ef | grep 'tail' | cut -d' ' -f5)
kill -13 $tail_pid
fi
}
You may try below code:
function monitor_log() {
if [[ -f "running.log" ]]; then
# Tail the running log last line and keep check required string
while read -t 300 tail_line
do
if [[ "$tail_line" == "required string" ]]; then
capture_flag=1
fi
if [[ $capture_flag -eq 1 ]]; then
break;
fi
done < <(tail -n 1 -f "running.log")
tailpid=$!
kill $tailpid
fi
}
$! gives the process ID (PID) of the last job run in the background.
Update: as suggested in the comments, this would kill the subshell that triggers the tail.
Another way might be to add a timeout to the tail command, as below:
timeout 300s tail -n 1 -f "running.log"
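The timeout variant can be combined with the original read loop so that no PID lookup is needed at all. The sketch below assumes GNU coreutils timeout is installed; the extra parameters (log file, wanted string, time limit) are hypothetical additions so the function can be tried on any file, with running.log and 300 seconds being the values from the question.

```shell
#!/bin/bash
# Hypothetical variant of monitor_log: returns 0 when the wanted line appears,
# 1 when the time limit passes first, 2 when the log file does not exist.
monitor_log() {
    local log_file="$1" wanted="$2" limit="${3:-300}" line
    [[ -f "$log_file" ]] || return 2
    while IFS= read -r line; do
        if [[ "$line" == "$wanted" ]]; then
            return 0
        fi
    done < <(timeout "$limit" tail -n 1 -f "$log_file")
    # tail was stopped by timeout, so the loop saw end-of-file without a match
    return 1
}
```

A call such as monitor_log running.log "required string" 300 mirrors the original function. Each call owns its own timeout and tail pair, so parallel calls do not interfere with each other; after an early match the tail may linger until its time limit expires, at which point timeout terminates it.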

How can I compare the output of a command, (the output is a number) to a number? Trying to see if the output is greater than or equal to 1

I'm trying to run a script that essentially runs psg defunct | wc -l and gives output depending on if the output of that is greater than or equal to 1 or not.
However, I can't get the -ge flag to work to do so. I was wondering if anyone has any input as to what I'm doing wrong, and if you could help me correct it.
Here's the current script:
#!/bin/bash -x
export LANG="C"
#
# Created by Blake Smreker | b0s00dg
# Purpose is to assist users with defunct process tickets
#
#
menu ()
{
echo "What is the server name?"
read srvrname
badboi=$"$srvrname"
}
vars ()
{
psgwc=$"/u/bin/psg defunct | wc -l"
psgd=$"/u/bin/psg defunct"
die=$"/bin/kill -9 `psg defunct | awk '{print $3}'`"
belowmsg=$'echo The number of defunct processes is below threshold! Resolve the ticket and put the following in the ticket:'
abovemsg=$'echo There is more than 100 zombies! Getting you more information:'
}
connect ()
{
connectme="/usr/bin/dzdo -u oseho /bin/ssh -qo PreferredAuthentications=publickey root@$badboi"
}
main ()
{
${connectme} ". /u/data/environment; $psgwc"
if ${connectme} $psgwc -ge 1
then
$abovemsg
echo "Determine the owner of the below process. Then copy and paste the below, and send it to the team on Service Now."
echo "psg defunct | wc -l:"
${connectme} $psgwc
${connectme} $psgd
else
$belowmsg
echo "psg defunct | wc -l:"
${connectme} $psgwc
fi
}
menu
vars
connect
main
And here is the error message:
+ /usr/bin/dzdo -u oseho /bin/ssh -qo PreferredAuthentications=publickey root@oses4101 /u/bin/psg defunct '|' wc -l -ge 1
wc: illegal option -- g
usage: wc [-c | -m | -C] [-lw] [file ...]
It seems to think that the -ge I added is a part of the wc command. However, when I try to correct it, it doesn't work.
The issue lies in the if statement towards the bottom.
Putting the output of psg defunct | wc -l into a file, then using the file in test worked:
[ $(cat outputfile) -ge 1 ]
This let test read the file's contents as a number.
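The temporary file is not strictly needed: command substitution can capture the count directly and hand it to a numeric comparison. A minimal sketch follows; at_least_lines is a hypothetical helper name, and it works with any command that prints lines, not only psg.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the given command prints at least
# `threshold` lines. Usage: at_least_lines THRESHOLD COMMAND [ARGS...]
at_least_lines() {
    local threshold="$1" count
    shift
    count=$("$@" | wc -l)          # capture the number that wc prints
    (( count >= threshold ))       # arithmetic test sets the exit status
}

if at_least_lines 1 printf 'one\ntwo\n'; then
    echo "at least one line"
else
    echo "no lines"
fi
```

In the script from the question the same idea would be count=$(${connectme} "$psgwc") followed by if [ "$count" -ge 1 ], assuming the remote command prints nothing but the number.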

What's the actual value of an if-statement executing a command?

I do not quite understand what the following if-statement actually does, in the sense of what the condition outputs could possibly be:
if linux-command-1 | linux-command-2 | linux-command-3 > /dev/null
I understand the execution as the following:
Linux Command 1 gets executed, its output is PIPED into Linux-Command-2 as input.
Linux Command 2 gets executed with LC1's output as input, its output is PIPED into Linux-Command-3.
Linux Command 3 gets executed with LC2's output as input, its output is redirected into /dev/null, so it basically doesn't appear.
But what about the actual if-statement? What's responsible for it becoming true or false?
To further elaborate, here's an example with actual commands:
if ps ax | grep -v grep | grep terminator > /dev/null
then
echo "Success"
else
echo "Fail"
fi
I do know that the functionality behaves in the way that if any output occurs (Process is running) in that execution the condition is True, if nothing occurs (Process not running), the condition is False.
But I don't understand why or how it's coming to that conclusion? Is the shell if statement always expecting a string output as True?
I've also just discovered pgrep but the question would also remain if the statement were
if pgrep -f terminator > /dev/null
In your case you are testing the exit status of the last grep itself, which returns false (1) if there was no match and true (0) if there was one:
if ps ax | grep -v grep | grep terminator > /dev/null
then
echo "Success"
else
echo "Fail"
fi
You could use "-q" instead of the redirection to /dev/null:
if ps ax | grep -v grep | grep -q terminator
then
echo "Success"
else
echo "Fail"
fi
I would execute my command and test $?:
ps ax | grep -v grep | grep terminator
if [ $? -eq 0 ]; then
echo 'it is ok'
else
echo 'it is ko'
fi
If you do:
if linux-command-1 | linux-command-2 | linux-command-3 > /dev/null
only the exit status of the last command matters.
If every command is important, chain them with "&&" instead:
if ps -ae | grep 'bash' | grep 'pty0' && ls . >/dev/null; then
echo "bash is in the house"
fi
The following will fail because not_exist does not exist:
if ps -ae | grep 'bash' | grep 'pty0' && ls not_exist >/dev/null; then
echo "bash is in the house"
fi
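The rule that only the last command of a pipeline decides the branch can be checked with the built-ins true and false. This is a small sketch; the function names are made up for illustration, and set -o pipefail is a bash option (also available in ksh and zsh) that is not mentioned in the answers above.

```shell
#!/bin/bash
# Default behaviour: the pipeline's status is that of its last command (true),
# so the failure of the first command (false) is ignored by `if`.
default_branch() {
    if false | true; then echo "true branch"; else echo "false branch"; fi
}

# With pipefail, a failure anywhere in the pipeline makes the whole pipeline fail.
# The subshell keeps the option from leaking into the rest of the script.
pipefail_branch() {
    (
        set -o pipefail
        if false | true; then echo "true branch"; else echo "false branch"; fi
    )
}

default_branch     # prints "true branch"
pipefail_branch    # prints "false branch"
```

This also shows that if never looks at the text a command prints; it only looks at the exit status.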

Functionality of a start-stop-restart shell script

I am a shell scripting newbie trying to understand some code, but there are some lines that are too complex for me. The piece of code I'm talking about can be found here: https://gist.github.com/447191
Its purpose is to start, stop and restart a server. That's pretty standard stuff, so it's worth taking some time to understand it. I commented the lines where I am unsure about the meaning or that I don't understand at all, hoping that someone could give me some explanation.
#!/bin/bash
#
BASE=/tmp
PID=$BASE/app.pid
LOG=$BASE/app.log
ERROR=$BASE/app-error.log
PORT=11211
LISTEN_IP='0.0.0.0'
MEM_SIZE=4
CMD='memcached'
# Does this mean, that the COMMAND variable can adopt different values, depending on
# what is entered as parameter? "memcached" is chosen by default, port, ip address and
# memory size are options, but what is -v?
COMMAND="$CMD -p $PORT -l $LISTEN_IP -m $MEM_SIZE -v"
USR=user
status() {
echo
echo "==== Status"
if [ -f $PID ]
then
echo
echo "Pid file: $( cat $PID ) [$PID]"
echo
# ps -ef: Display uid, pid, parent pid, recent CPU usage, process start time,
# controlling tty, elapsed CPU usage, and the associated command of all other processes
# that are owned by other users.
# The rest of this line I don't understand, especially grep -v grep
ps -ef | grep -v grep | grep $( cat $PID )
else
echo
echo "No Pid file"
fi
}
start() {
if [ -f $PID ]
then
echo
echo "Already started. PID: [$( cat $PID )]"
else
echo "==== Start"
# Lock file that indicates that no 2nd instance should be started
touch $PID
# COMMAND is called as background process and ignores SIGHUP signal, writes its
# output to the LOG file.
if nohup $COMMAND >>$LOG 2>&1 &
# The pid of the last background is saved in the PID file
then echo $! >$PID
echo "Done."
echo "$(date '+%Y-%m-%d %X'): START" >>$LOG
else echo "Error... "
/bin/rm $PID
fi
fi
}
# I don't understand this function :-(
kill_cmd() {
SIGNAL=""; MSG="Killing "
while true
do
LIST=`ps -ef | grep -v grep | grep $CMD | grep -w $USR | awk '{print $2}'`
if [ "$LIST" ]
then
echo; echo "$MSG $LIST" ; echo
echo $LIST | xargs kill $SIGNAL
# Why this sleep command?
sleep 2
SIGNAL="-9" ; MSG="Killing $SIGNAL"
if [ -f $PID ]
then
/bin/rm $PID
fi
else
echo; echo "All killed..." ; echo
break
fi
done
}
stop() {
echo "==== Stop"
if [ -f $PID ]
then
if kill $( cat $PID )
then echo "Done."
echo "$(date '+%Y-%m-%d %X'): STOP" >>$LOG
fi
/bin/rm $PID
kill_cmd
else
echo "No pid file. Already stopped?"
fi
}
case "$1" in
'start')
start
;;
'stop')
stop
;;
'restart')
stop ; echo "Sleeping..."; sleep 1 ;
start
;;
'status')
status
;;
*)
echo
echo "Usage: $0 { start | stop | restart | status }"
echo
exit 1
;;
esac
exit 0
1)
COMMAND="$CMD -p $PORT -l $LISTEN_IP -m $MEM_SIZE -v" — -v in Unix tradition very often is a shortcut for --verbose. All those dollar signs are variable expansion (their text values are inserted into the string assigned to new variable COMMAND).
2)
ps -ef | grep -v grep | grep $( cat $PID ) - it's a pipe: ps redirects its output to grep which outputs to another grep and the end result is printed to the standard output.
grep -v grep means "take all lines that do not contain 'grep'" (grep itself is a process, so you need to exclude it from output of ps). $( $command ) is a way to run command and insert its standard output into this place of script (in this case: cat $PID will show contents of file with name $PID).
3) kill_cmd.
This function is an endless loop that tries to kill the LIST of 'memcached' process PIDs. On the first pass it sends the TERM signal (politely asking each process in $LIST to quit, saving its work and shutting down correctly) and gives the processes 2 seconds (sleep 2) to do their shutdown job. On every later pass it uses the KILL signal (-9), which slays the process immediately using OS facilities: if a process has not finished its shutdown work in 2 seconds, it's considered hung. After each kill attempt it removes the PID file if it still exists, and it quits the loop once no matching processes are left.
ps -ef | grep -v grep | grep $CMD | grep -w $USR | awk '{print $2}' prints all PIDs of processes with name $CMD ('memcached') and user $USR ('user'). The -w option of grep means 'the Whole word only' (this excludes situations where the sought name is a part of another name, like 'fakeuser'). awk is a little interpreter most often used to take word number N from every line of input (you can consider it a selector for a column of a text table). In this case, it prints the second word of every ps output line, which is the PID.
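The TERM-then-KILL escalation described for kill_cmd can be shown for a single process. This is a simplified sketch rather than a rewrite of the gist: stop_pid and its grace-period parameter are hypothetical, and it polls with kill -0 (which only tests whether the PID can be signalled) instead of re-reading the ps output.

```shell
#!/bin/bash
# Hypothetical helper: ask a process to stop, wait up to `grace` seconds,
# then force it. Usage: stop_pid PID [GRACE_SECONDS]
stop_pid() {
    local pid="$1" grace="${2:-2}" waited=0
    kill "$pid" 2>/dev/null || return 0      # default signal is TERM; done if already gone
    while kill -0 "$pid" 2>/dev/null && (( waited < grace )); do
        sleep 1                              # give the process time to shut down
        waited=$(( waited + 1 ))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null           # KILL cannot be caught or ignored
    fi
    return 0
}
```

For example, stop_pid "$(cat /tmp/app.pid)" 2 would mirror the 2-second pause of the original script. Note that a child of the calling shell that has exited but has not yet been waited for still answers kill -0, so the full grace period may elapse in that case.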
If you have any other questions, I'll add answers below.
Here is an explanation of the pieces of code you do not understand:
1.
# Does this mean, that the COMMAND variable can adopt different values, depending on
# what is entered as parameter? "memcached" is chosen by default, port, ip address and
# memory size are options, but what is -v?
COMMAND="$CMD -p $PORT -l $LISTEN_IP -m $MEM_SIZE -v"
In the man, near -v:
$ man memcached
...
-v Be verbose during the event loop; print out errors and warnings.
...
2.
# ps -ef: Display uid, pid, parent pid, recent CPU usage, process start time,
# controlling tty, elapsed CPU usage, and the associated command of all other processes
# that are owned by other users.
# The rest of this line I don't understand, especially grep -v grep
ps -ef | grep -v grep | grep $( cat $PID )
Print all processes details (ps -ef), exclude the line with grep (grep -v grep) (since you are running grep it will display itself in the process list) and filter by the text found in the file named $PID (/tmp/app.pid) (grep $( cat $PID )).
3.
# I don't understand this function :-(
kill_cmd() {
SIGNAL=""; MSG="Killing "
while true
do
## create a list with all the pid numbers filtered by command (memcached) and user ($USR)
LIST=`ps -ef | grep -v grep | grep $CMD | grep -w $USR | awk '{print $2}'`
## if $LIST is not empty... proceed
if [ "$LIST" ]
then
echo; echo "$MSG $LIST" ; echo
## kill all the processes in the $LIST (xargs will get the list from the pipe and put it at the end of the kill command; something like this < kill $SIGNAL $LIST > )
echo $LIST | xargs kill $SIGNAL
# Why this sleep command?
## some processes might take one or two seconds to perish
sleep 2
SIGNAL="-9" ; MSG="Killing $SIGNAL"
## if the file $PID still exists, delete it
if [ -f $PID ]
then
/bin/rm $PID
fi
## if list is empty
else
echo; echo "All killed..." ; echo
## get out of the while loop
break
fi
done
}
This function will kill all the processes related to memcached slowly and painfully (actually quite the opposite).
Above are the explanations.
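The effect of grep -v grep can be tried on canned input without a running server. The two sample lines below are invented to resemble ps -ef output, and filter_pid_lines is a hypothetical name for the same two-stage filter used in status().

```shell
#!/bin/bash
# Invented sample resembling `ps -ef` output: the server, plus the grep that
# is searching for the server's PID and therefore mentions it on its own line.
sample_ps='user      1234     1  0 10:00 ?        00:00:00 memcached -p 11211 -l 0.0.0.0 -m 4 -v
user      5678  4321  0 10:01 pts/0    00:00:00 grep 1234'

# Same filter as in status(): drop lines containing "grep", then keep
# lines that contain the PID text.
filter_pid_lines() {
    grep -v grep | grep -- "$1"
}

printf '%s\n' "$sample_ps" | filter_pid_lines 1234
```

Only the memcached line is printed. Because the match is on plain text, 1234 would also match a line containing 12345; asking ps for exactly one process with ps -p "$(cat "$PID")" avoids both that and the need for grep -v grep.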

Resources