How to get frequency of logging using bash if each line contains a timestamp?

I have a program that writes to a text file during its operation. Each line in this text file consists of 5 parts:
Thread ID (a number)
A date in the format yyyy-mm-dd
A timestamp in the format 12:34:56.123456
A function name
Some useful comments printed out by the program
An example log line looks like this:
127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
The logs are printed in chronological order and I would like to know how to get the highest frequency of these logs. For example, I want to know at what minute or second of the day the program has the highest congestion.
Ideally I'd like an output that could tell me for example, "The highest logging frequency is between 22:04:00 and 22:05:00 with 10 log lines printed in this timeframe".

Let's consider this test file:
$ cat file.log
127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:20.424134 foobar clearing the programs cache
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127895 2020-07-30 22:05:40.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
127896 2020-07-30 22:06:40.424134 foobar recalibrating dankness
To get the most congested minutes, ranked in order:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
3 2020-07-30 22:05
2 2020-07-30 22:06
1 2020-07-30 22:04
22:05 appeared three times in the log file and is, thus, the most congested, followed by 22:06.
To get only the single most congested minute, add head. For example:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr | head -1
3 2020-07-30 22:05
Note that we select here based on the second and third fields. The presence of dates or times in the text of log messages will not confuse this code.
How it works
sub(/:[^:]*$/, "", $3) removes everything after minutes in the third field.
a[$2" "$3]++ counts the number of times that date and time (up to minutes) appeared.
After the whole file has been read, for (d in a)print a[d], d prints out the count and date for every date observed.
sort -nr sorts the output with the highest count at the top. (Alternatively, we could have awk do the sorting but sort -nr is simple and portable.)
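If you would rather let awk do all the work itself, here is a minimal sketch that tracks the maximum inside the END block instead of sorting (ties resolve arbitrarily):
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a) if (a[d]+0 > max+0) {max=a[d]; best=d}; print max, best}' file.log
3 2020-07-30 22:05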
To count down to the second
Instead of minutes resolution, we can get seconds resolution:
$ awk '{sub(/\.[^.]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
1 2020-07-30 22:06:40
1 2020-07-30 22:06:30
1 2020-07-30 22:05:40
1 2020-07-30 22:05:30
1 2020-07-30 22:05:20
1 2020-07-30 22:04:30
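To produce the exact sentence the question asks for, here is a minimal sketch built on the minute-resolution pipeline; it assumes GNU date for computing the end of the interval:
$ top=$(awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr | head -1)
$ count=${top%% *}; stamp=${top#* }
$ printf 'The highest logging frequency is between %s:00 and %s with %s log lines printed in this timeframe.\n' "${stamp#* }" "$(date -d "$stamp 1 minute" +%H:%M:%S)" "$count"
The highest logging frequency is between 22:05:00 and 22:06:00 with 3 log lines printed in this timeframe.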

With GNU utilities:
grep -o ' [0-9][0-9]:[0-9][0-9]' file.log | sort | uniq -c | sort -nr | head -n 1
Prints
frequency HH:MM
HH:MM is the hour and minute with the highest frequency, and frequency is that highest count. If you drop the | head -n 1, you will see the full list of minutes ordered by frequency.
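With the test file from above, for example, dropping head -n 1 shows the full ranking:
$ grep -o ' [0-9][0-9]:[0-9][0-9]' file.log | sort | uniq -c | sort -nr
      3  22:05
      2  22:06
      1  22:04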

Related

Extracting the user with the most files in a dir

I am currently working on a script that should receive standard input and output the user with the highest number of files in that directory.
I've written this so far:
#!/bin/bash
while read DIRNAME
do
ls -l $DIRNAME | awk 'NR>1 {print $4}' | uniq -c
done
and this is the output I get when I enter /etc, for instance:
26 root
1 dip
8 root
1 lp
35 root
2 shadow
81 root
1 dip
27 root
2 shadow
42 root
Now obviously root is winning in this case, but I don't want to output only this; I also want to sum the number of files and output only the user with the highest count.
Expected output for entering /etc:
root
Is there a simple way to filter the output I get now, so that the user with the highest sum is stored somehow?
ls -l /etc | awk 'BEGIN{FS=OFS=" "}{a[$4]+=1}END{ for (i in a) print a[i],i}' | sort -g -r | head -n 1 | cut -d' ' -f2
This snippet returns the group with the highest number of files in the /etc directory.
What it does:
ls -l /etc lists all the files in /etc in long form.
awk 'BEGIN{FS=OFS=" "}{a[$4]+=1}END{ for (i in a) print a[i],i}' sums the number of occurrences of unique words in the 4th column and prints the number followed by the word.
sort -g -r sorts the output in descending numerical order.
head -n 1 takes the first line.
cut -d' ' -f2 takes the second column, using a space as the delimiter.
Note: In your question, you are saying that you want the user with the highest number of files, but in your code you are referring to the 4th column which is the group. My code follows your code and groups on the 4th column. If you wish to group by user and not group, change {a[$4]+=1} to {a[$3]+=1}.
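For example, the user-based variant is:
ls -l /etc | awk 'BEGIN{FS=OFS=" "}{a[$3]+=1}END{ for (i in a) print a[i],i}' | sort -g -r | head -n 1 | cut -d' ' -f2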
Without unreliably parsing the output of ls:
read -r dirname
# List user owner of files in dirname
stat -c '%U' "$dirname"/* |
# Sort the list of users by name
sort |
# Count occurrences of user
uniq -c |
# Sort by higher number of occurrences numerically
# (first column numerically reverse order)
sort -k1nr |
# Get first line only
head -n1 |
# Keep only starting at character 9 to get user name and discard counts
cut -c9-
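If GNU find is available, a similar sketch avoids ls entirely; -maxdepth 1 keeps it to the directory itself, -type f restricts the count to regular files, and %u prints each file's owner (awk '{print $2}' is a bit more robust here than counting characters):
read -r dirname
find "$dirname" -maxdepth 1 -type f -printf '%u\n' | sort | uniq -c | sort -k1nr | head -n1 | awk '{print $2}'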
I have an awk script that reads standard input (or files named on the command line) and sums the first column for each unique name in the second column.
summer:
awk '
{ sum[ $2 ] += $1 }
END {
    for ( v in sum ) {
        print v, sum[v]
    }
}
' "$@"
Let's say we are using your example of /etc, feeding your uniq -c output into summer:
ls -l /etc | awk 'NR>1 {print $4}' | uniq -c | summer
yields:
dip 2
shadow 4
root 219
lp 1
I like to keep utilities general so I can reuse them for other purposes. Now you can just use sort and head to get the maximum result output by summer:
ls -l /etc | awk 'NR>1 {print $4}' | uniq -c | summer | sort -r -k2,2 -n | head -1 | cut -f1 -d' '
Yields:
root

Linux bash scripting: Sum one column using awk for overall cpu utilization and display all fields

The problem: I execute ps with pid, user, and other fields, and I am trying to use awk to sum the overall CPU utilization of the different processes.
Command:
$ ps -eo pid,user,state,comm,%cpu,command --sort=-%cpu | egrep -v '(0.0)|(%CPU)' | head -n10 | awk '
{ process[$4]+=$5; }
END{
    for (i in process)
    {
        printf($1" "$2" "$3" ""%-20s %s\n",i, process[i]" "$6) ;
    }
}' | sort -nrk 5 | head
The awk part sums the 5th column (%cpu) grouped by the process name (4th column).
Output:
10935 zbynda S dd 93.3 /usr/libexec/gnome-terminal-server
10935 zbynda S gnome-shell 1.9 /usr/libexec/gnome-terminal-server
10935 zbynda S sublime_text 0.6 /usr/libexec/gnome-terminal-server
10935 zbynda S sssd_kcm 0.2 /usr/libexec/gnome-terminal-server
As you can see, the fourth and fifth columns are fine, but the other columns just repeat a single entry from the ps output. I should see 4 different processes, as in the fourth column, but, for example, the last column shows the same command on every row.
How to get other entries from ps command? (not only the first entry)
Try this
ps -eo pid,user,state,comm,%cpu,command --sort=-%cpu | egrep -v '(0.0)|(%CPU)' | head -n10 | awk '
{ process[$4]+=$5; a1[$4]=$1; a2[$4]=$2; a3[$4]=$3; a6[$4]=$6 }
END{
    for (i in process)
    {
        printf(a1[i]" "a2[i]" "a3[i]" ""%-20s %s\n",i, process[i]" "a6[i]) ;
    }
}' | sort -nrk 5 | head
An END rule is executed only once, after all the input has been read.
Your printf uses $6, which retains the value from the last line read; I think you want to use i instead.
Of course $1, $2, and $3 have the same problem, so you will need to preserve the incoming values as well. Fixing this is left as an exercise for the student.

Check if values in a file are greater or equal in a bash script

I have file.txt, which includes:
2
10
60
90
Now how can I check if any number in that file is equal to or greater than 50, and then do something? In my case, that something is sending an email (that part I already have).
I have tried to do this with awk, but it does not work in a script.
The following command will output the greatest value of your file:
sort -nr file.txt | head -1
Then just compare it to the value of your choice and voilà. Something like:
if [ "$(sort -nr file.txt | head -1)" -ge 50 ]
then
<do something>
fi
Explanation:
sort -n sorts the file as numbers (otherwise 12 would be considered greater than 100).
sort -r reverse the sort (by default it displays lower numbers first, with -r it displays higher first).
head -1 displays only the first output.
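With the file above, for example:
$ sort -nr file.txt | head -1
90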
This will also do the job, printing every value that is at least 50:
$ awk '$1 >= 50 { print $1 }' file.txt
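To tie it back to sending an email, here is a minimal sketch; the recipient address is a placeholder and a working mail command is assumed:
#!/bin/bash
# Mail the file when any value reaches the threshold.
if [ "$(sort -nr file.txt | head -1)" -ge 50 ]; then
    # admin@example.com is a placeholder recipient
    mail -s "Value threshold reached" admin@example.com < file.txt
fi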

How to retrieve one day's data from a log file having multiple days of data

I have a zipped log file containing 3 days of data. I want to retrieve only one day's data. Currently, the code for calculating the sum of the data volume is given below.
Server_Sent_bl1=`gzcat $LOGDIR/blprxy1/archive"$i"/*.log.gz | nawk -F"|" '{sum+=$(NF -28)} END{print sum}'`
There are 3 logs; suppose all 3 logs contain data for 06/jul/2014. How do I retrieve Jul 6th's data from those 3 files and then sum up the data volume?
You could try this:
$ gzcat $LOGDIR/blprxy1/archive"$i"/*.log.gz | grep "06/jul/2014" | nawk -F"|" '{sum+=$(NF -28)} END{print sum}'
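The grep stage can also be folded into nawk itself by matching the date as a regular expression (with the slashes escaped):
$ gzcat $LOGDIR/blprxy1/archive"$i"/*.log.gz | nawk -F"|" '/06\/jul\/2014/ {sum+=$(NF-28)} END{print sum}'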

Cannot get this simple sed command

This sed command is described as follows:
Delete the cars that are $10,000 or more. Pipe the output of the sort into a sed to do this, by quitting as soon as we match a regular expression representing 5 (or more) digits at the end of a record (DO NOT use repetition for this):
So far the command is:
$ grep -iv chevy cars | sort -nk 5
I think I have to add another pipe at the end of that command which "quits as soon as we match a regular expression representing 5 or more digits at the end of a record".
I tried things like
$ grep -iv chevy cars | sort -nk 5 | sed "/[0-9][0-9][0-9][0-9][0-9]/ q"
and other variations within the // but nothing works! What is the command which matches a regular expression representing 5 or more digits and quits according to this question?
Nominally, you should add a $ before the second / to match 5 digits at the end of the record. If you omit the $, then any sequence of 5 digits will cause sed to quit, so if there is another number (a VIN, perhaps) before the price, it might match when you didn't intend it to.
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/q'
On the whole, it's safer to use single quotes around the regex, unless you need to substitute a shell variable into it (or unless the regex contains single quotes itself). You can also specify the repetition:
grep -iv chevy cars | sort -nk 5 | sed '/[0-9]\{5,\}$/q'
The \{5,\} part matches 5 or more digits. If for any reason that doesn't work, you might find you're using GNU sed and you need to do something like sed --posix to get it working in the normal mode. Or you might be able to just remove the backslashes. There certainly are options to GNU sed to change the regex mechanism it uses (as there are with GNU grep too).
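With a sed that supports -E for extended regular expressions (GNU and modern BSD sed both do), the repetition needs no backslashes:
grep -iv chevy cars | sort -nk 5 | sed -E '/[0-9]{5,}$/q'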
Another way.
As you didn't post a file sample, I did this as a guess.
Here I'm looking for lines with the word "chevy" where field 5 is less than 10000.
awk '/chevy/ {if ( $5 < 10000 ) print $0} ' cars
I forgot the -i flag of grep ... so the correct version is:
awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
$ cat > cars
Chevy 2 3 4 10000
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 10000
CHEVY 2 3 4 2000
Prevy 2 3 4 1000
Prevy 2 3 4 10000
$ awk 'BEGIN{IGNORECASE=1} /chevy/ {if ( $5 < 10000 ) print $0} ' cars
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 2000
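Note that IGNORECASE only works in GNU awk; a portable sketch lowercases the line before matching:
$ awk 'tolower($0) ~ /chevy/ && $5 < 10000' cars
Chevy 2 3 4 5000
chEvy 2 3 4 1000
CHEVY 2 3 4 2000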
Using d instead of q deletes every line ending in 5 digits rather than quitting; note that q prints the matching line before exiting, so the d version also drops that first match:
grep -iv chevy cars | sort -nk 5 | sed '/[0-9][0-9][0-9][0-9][0-9]$/d'
