Need a way to check content of file (time data) is more than 5 minutes or not - linux

I am creating a script to pull out the list of highest cpu consuming jobs on my server. I am taking the dump of processes in a flat file in below format:
202009080230,4.1,218579,padaemon,02:30,00:00:01,SD_RCC_FRU:wf_rcc_ds_CRS_FactSuspendedGdc_d.s_m_rcc_ds_Staging_Dimension_FactSuspendedGdc_ipi_d
The second data point(4.1) is my cpu utilization while the sixth one is the time taken(00:00:01).
I intend to filter out only the entries from this file that are greater than 75% and run for more than 00:05:00 i.e. 5 minutes. I wrote an if logic for the same like below:
var10=`ls -lrt /projects/Informatica/INFA_ADMIN/LogFiles/myfile.txt |awk '{ print $9 }'`
sed -i 's/ /,/g' $var10
for i in `cat $var10`
do
var12=`echo $i |cut -d',' -f2`
var16=`echo $i |cut -d',' -f6`
if [ $var12 -ge 75.0 ] && ("$var16" > "00:05:00");
then
<logic>
fi
done
the script is able to identify processes taking more than 75.0 cpu but failing in the second condition.
[ 136 -ge 75.0 ]
00:20:26
1> 00:05:00 cpu_report_email_prateek_test.sh[63]: 00:20:26: not found [No such file or directory]
Can you please guide how this can be fixed?

First, I understand that on line 1 you're trying to have the latest file. I would rather have the file created always with the same name, then at the end of the script, after processing it, mv to another name (like myfile.txt.20200101). So you'll always know that myfile.txt is the latest one. More efficient than doing ls. Then again, you could improve your line with : (it's the number one, not the letter "l")
var10=$(ls -1rt /projects/Informatica/INFA_ADMIN/LogFiles/myfile.txt)
From your line 2, I assume the original file is space delimited. Actually, you might use that to your advantage instead of adding commas. Also, using a for loop with "cat" is less performant than using a while loop. For the time comparison, I would use a little trick. Transforming your threshold and reading, in second since epoch then compared them.
MyTreshold=$(date -d 00:05:00 +"%s")
while read -r line; do
MyArray=( $line )
MyReading=$(date -d ${MyArray[5]} +"%s")
#the array starts a index 0
if [[ ${MyArray[1]} -ge 75 && $MyReading -gt $MyThreshold ]]; then
<LOGIC>
fi
done <$var10

The RHS of your && expression would expand $var16 and attempt to execute the result, redirecting the output to a file 00:05:00. Replace with [ $x -gt $y ]. For a numeric comparison, you'll need straight numbers (no ':' chars). You could modify from hh:mm:ss to total seconds then compare.

Related

bash/awk/unix detect changes in lines of csv files

I have a timestamp in this format:
(normal_file.csv)
timestamp
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
The dates are usually uniform, however, there are files with irregular dates pattern such as this example:
(abnormal_file.csv)
timestamp
19/02/2002
19/02/2003
19/02/2005
19/02/2006
In my directory, there are hundreds of files that consist of normal.csv and abnormal.csv.
I want to write a bash or awk script that detect the dates pattern in all files of a directory. Files with abnormal.csv should be moved automatically to a new, separate directory (let's say dir_different/).
Currently, I have tried the following:
#!/bin/bash
mkdir dir_different
for FILE in *.csv;
do
# pipe 1: detect the changes in the line
# pipe 2: print the timestamp column (first column, columns are comma-separated)
awk '$1 != prev {print ; prev = $1}' < $FILE | awk -F , '{print $1}'
done
If the timestamp in a given file is normal, then only one single timestamp will be printed; but for abnormal files, multiple dates will be printed.
I am not sure how to separate the abnormal files from the normal files, and I have tried the following:
do
output=$(awk 'FNR==3{print $0}' $FILE)
echo ${output}
if [[ ${output} =~ ([[:space:]]) ]]
then
mv $FILE dir_different/
fi
done
Or is there an easier method to detect changes in lines and separate files that have different lines? Thank you for any suggestions :)
Assuming that none of your "normal" CSV files have trailing newlines this should do the separation just fine:
#!/bin/bash
mkdir -p dir_different
for FILE in *.csv;
do
if awk '{a[$1]++}END{if(length(a)<=2){exit 1}}' "$FILE" ; then
echo mv "$FILE" dir_different
fi
done
After a dry-run just get rid of the echo :)
Edit:
{a[$1]++} This bit creates an array a that gets the first field of each line as an index, and that gets incremented every time the same value is seen.
END{if(length(a)<=2){exit 1}} This checks how many elements are in the array. If there there are less than 3 (which should be the case if there's always the same date and we only get 1 header, 1 date) exit the processing with 1.
"$FILE" is part of the bash script, not awk, and I quoted your variable out of habit, should you ever have files w/ spaces in their names you'll see why :)
So, a "normal" file contains only two different lines:
timestamp
dd/mm/yyyy
Testing if a file is normal is thus as simple as:
[ $(sort -u file.csv | wc -l) -eq 2 ]
This leads to the following possible solution:
#!/usr/bin/env bash
mkdir -p dir_different
for FILE in *.csv;
do
if [ $(sort -u "$FILE" | wc -l) -ne 2 ] ; then
echo mv "$FILE" dir_different
fi
done

Length comparison of one specific field in linux

I was trying to check the length of second field of a TSV file (hundreds of thousands of lines). However, it runs very very slowly. I guess it should be something wrong with "echo", but not sure how to do.
Input file:
prob name
1.0 Claire
1.0 Mark
... ...
0.9 GFGKHJGJGHKGDFUFULFD
So I need to print out what went wrong in the name. I tested with a little example using "head -100" and it worked. But just can't cope with original file.
This is what I ran:
for title in `cat filename | cut -f2`;do
length=`echo -n $line | wc -m`
if [ "$length" -gt 10 ];then
echo $line
fi
done
awk to rescue:
awk 'length($2)>10' file
This will print all lines having the second field length longer than 10 characters.
Note that it doesn't require any block statement {...} because if the condition is met, awk will by default print the line.
Try this probably:
cat file.tsv | awk '{if (length($2) > 10) print $0;}'
This should be a bit faster since the whole processing is done by the single awk process, while your solution starts 2 processes per loop iteration to make that comparison.
We can use awk if that helps.
awk '{if(length($2) > 10){print}}' filename
$2 here is 2nd field in filename which runs for every line. It would be faster.

How to monitor CPU usage automatically and return results when it reaches a threshold

I am new to shell script , i want to write a script to monitor CPU usage and if the CPU usage reaches a threshold it should print the CPU usage by top command ,here is my script , which is giving me error bad number and also not storing any value in the log files
while sleep 1;do if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]; then echo "$(top -n1)">>sys.log;fi;done
Your script HAS to be indented and stored to a file, especially if you are new to shell !
#!/bin/sh
while sleep 1
do
if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]
then
echo "$(top -n1)" >> sys.log
fi
done
Your condition looks a bit odd. It may work, but it looks really complex. Store intermediate results in variables, and evaluate them.
Then, you will immediately see the syntax error on the “-ge”.
You HAVE to store logfiles within an absolute path for security reasons. Use variables to simplify the reading.
#!/bin/sh
LOGFILE=/absolute_path/sy.log
WHOLEFILE=/absolute_path/sys.log
Thresold=80
while sleep 1
do
TOP="$(top -n1)"
CPU="$(echo $TOP | grep -i ^cpu | awk '{print $2}')"
echo $CPU >> $LOGFILE
if [ "$CPU" -ge "$Threshold" ] ; then
echo "$TOP" >> $WHOLEFILE
fi
done
You have a couple of errors.
If you write output to sy.log with a redirection then that output is no longer available to the shell. You can work around this with tee.
The dash before -ge must not be followed by a space.
Also, a few stylistic remarks:
grep x | awk '{y}' is a useless use of grep; this can usefully and more economically (as well as more elegantly) be rewritten as awk '/x/{y}'
echo "$(command)" is a useless use of echo -- not a deal-breaker, but you simply want command; there is no need to capture what it prints to standard output just so you can print that text to standard output.
If you are going to capture the output of top -n 1 anyway, there is no need really to run it twice.
Further notes:
If you know the capitalization of the field you want to extract, maybe you don't need to search case-insensitively. (I could not find a version of top which prints a CPU prefix with the load in the second field -- it the expression really correct?)
The shell only supports integer arithmetic. Is this a bug? Maybe you want to use Awk (which has floating-point support) to perform the comparison? This also allows for a moderately tricky refactoring. We make Awk output an exit code of 1 if the comparison fails, and use that as the condition for the if.
#!/bin/sh
while sleep 1
do
if top=$(top -n 1 |
awk -v thres="$Threshold" '1; # print every line
tolower($1) ~ /^cpu/ { print $2 >>"sy.log";
exitcode = ($2 >= thres ? 0 : 1) }
END { exit exitcode }')
then
echo "$top" >>sys.log
fi
done
Do you really mean to have two log files with nearly the same name, or is that a typo? Including a time stamp in the log might be useful both for troubleshooting and for actually using the log files.

Making a graph (asteriks) with Printf on Linux

I made a little program where I see the date, time and donwload speed every time in 10 sec. Before I made my program I installed speedtest-cli.
Now follows a code block:
!/bin/bash
while [ True ]
do
sleep 1
date=$(date "+%D +%T)
dlspeed=$(speedtest-cli --simple | egrep 'Download')
echo $date > bandwidth
echo $dlspeed >> bandwidth
content=$(cat bandwidth | sed ':a;N;$!ba;s/;\n/;/g')
echo $content >> output
done
When we run the program everything works nice. The first ouput is 14/12/2016 18:33:25 Download: 8.33 Mbits. Every 10 sec it shows me my download speed. I am using here a loop.
Now I need to use printf to make an graph based on the Download speed.So the output has to be 14/12/2016 18:33:25 Download: 8.33 Mbits ********.
My question. How to make this graph with asterisks and add them.
You can use a for loop in bash:
for i in {1..6} ; do
printf '*'
done
For a variable, you can use seq instead of curly brackets:
n=6
for i in $(seq $n) ; do
printf '*'
done

Trying to scrub 700 000 data against 15 million data

I am trying to scrub 700 000 data obtained from single file, which need to be scrubbed against a data of 15 million present in multiple files.
Example: 1 file of 700 000 say A. Multiple files pool which have 15 million call it B.
I want a pool B of files with no data of file A.
Below is the shell script I am trying to use it is working fine. But it is taking massive time of more than 8 Hours in scrubbing.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[#]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[#]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}
done
fi
done
Another solution clicked in my mind is to breakdown 700k data into smaller size files of 50K and send across 5-available servers, also POOL A will be available at each server.
Each server will serve for 2-Smaller file.
These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i option to backup the files. It may be necessary to do redirection explicitly if your sed does not support the -i option:
for file in *.csv
do
sed -f "$script" "$file" > "$file.tmp" &&
mv "$file.tmp" "$file"
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on antique HP-UX, that should be safe, but you may be able to increase the size of the script more, which will reduce the number of times you have to edit each file, which speeds up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to see what the upper limit. I'm not sure that you'd find doing all 700,000 lines at once is feasible — but it should be fastest if you can do it that way.

Resources