How to retrieve one day's data from a log file holding multiple days of data - linux

I have a zipped log file () containing 3 days of data. I want to retrieve only one day's data. Currently the code for calculating the sum of data volume is as given below.
Server_Sent_bl1=`gzcat $LOGDIR/blprxy1/archive"$i"/*.log.gz | nawk -F"|" '{sum+=$(NF -28)} END{print sum}'`
There are 3 logs; suppose all 3 logs contain data for 06/jul/2014. How do I retrieve the Jul 6th data from those 3 files and then sum up the data volume?

You could try this:
$ gzcat $LOGDIR/blprxy1/archive"$i"/*.log.gz | grep "06/jul/2014" | nawk -F"|" '{sum+=$(NF -28)} END{print sum}'
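If you would rather keep the filtering and the summing in a single nawk invocation, the date test can go into the same script. A sketch, assuming the date really appears in every line in exactly the 06/jul/2014 form used above:
Server_Sent_bl1=`gzcat $LOGDIR/blprxy1/archive"$i"/*.log.gz | nawk -F"|" 'index($0, "06/jul/2014") {sum += $(NF - 28)} END {print sum + 0}'`
The + 0 makes the result a numeric 0 rather than an empty string when no line matches.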

Related

Linux Unique values count [closed]

I have a .csv file and I want to count the total number of values in column 5, but only where the corresponding value in column 8 is not equal to "999".
I have tried this, but I am not getting the desired output.
cat test.csv | sed "1 d" |awk -F , '$8 != 999' | cut -d, -f5 | sort | uniq | wc -l >test.txt
Note that the total number of records is more than 20K.
I do get the number of unique values, but the rows with 999 are not being excluded.
Can anyone help?
Sample Input:
Col1,Col2,Col3,Col4,Col5,Col7,Col8,Col9,Col10,Col11
1,0,0,0,ABCD,0,0,5436,0,0,0
1,0,0,0,543674,0,0,18999,0,0,0
1,0,0,0,143527,0,0,1336,0,0,0
1,0,0,0,4325,0,0,999,0,0,0
1,0,0,0,MCCDU,0,0,456,0,0,0
1,0,0,0,MCCDU,0,0,190,0,0,0
1,0,0,0,4325,0,0,190,0,0,0
What I want is to not count the value from col5 if the corresponding value in col8 == 999.
By "count total" I mean the total number of lines.
In the sample input above, the col5 values of lines 6 and 7 are the same; I need them to count as one.
I need to sort because Col5 values can be duplicated, and I need the total number of unique values.
Script:
awk 'BEGIN {FS=","} NR > 1 && $8 != 999 {uniq[$5]++} END {for (key in uniq) {print key, uniq[key]; total += uniq[key]}; print "Total: " total}' input.csv
Output:
543674 1
143527 1
ABCD 1
MCCDU 2
4325 1
Total: 6
With an awk that supports length(array) (e.g. GNU awk and some others):
$ awk -F',' '(NR>1) && ($8!=999){vals[$5]} END{print length(vals)}' test.csv
5
With any awk:
$ awk -F',' '(NR>1) && ($8!=999) && !seen[$5]++{ cnt++ } END{print cnt+0}' test.csv
5
The +0 in the END is so you get numeric output even if the input file is empty.
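If, as in the original pipeline, the count should end up in test.txt rather than on the terminal, the portable variant can simply be redirected (a minimal sketch reusing the command above):
$ awk -F',' '(NR>1) && ($8!=999) && !seen[$5]++{ cnt++ } END{print cnt+0}' test.csv > test.txt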

Wanted to count total attempts by day awk [duplicate]

I'm trying to count the number of occurrences for a list in awk. Currently I am able to get the total attempts for each user, but I want the total attempts per user per day. I have a txt file something like:
ID, Event, Date, Type, Message, API, User, Protocol, Attemps
1, ERROR, 30-NOV-20, 4, TEXT, 2, user1, GUI, 9
I used the awk below to count total attempts:
awk 'FNR == NR {count[$(NF-3)]++; next} {print $(NF-3), $3 "\t" count[$(NF-3)]}' file file
Can someone help me?
Expected output:
USER ATTEMPS DATE
user1 3 20-NOV-2020
user1 6 22-NOV-2020
user2 2 01-DEC-2020
user3 4 12-NOV-2020
user3 19 18-NOV-2020
This does not use only awk, but it should work if each line represents one attempt and you need the total per day:
awk -F, '{print $3}' file | sort | uniq -c
Edit: To get the totals per day and per user, you can do the following:
awk -F, '{print $3, $7}' file | sort | uniq -c
This should do it, though I couldn't test with more data:
$ awk -F', ' -v OFS='\t' '
NR==1 {print $7,$NF,$3; next}
NF {a[$7,$3]+=$NF}
END {for(k in a)
{split(k,ks,SUBSEP);
print ks[1],a[k],ks[2]}}' file
User Attemps Date
user1 9 30-NOV-20
awk -F, 'NR > 1 { map[$3]+=$NF } END { for (i in map) { print i" - "map[i] } }' file
Using awk (nothing GNU-specific is needed here), set the field delimiter to a comma and use the 3rd field (the date) as the index into an array, map, whose value is a running total of attempts ($NF). Once all lines are processed, we loop through the map array, printing each index and its value, i.e. the date and the attempt total.
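As a quick check against the single data row in the question, the command above would print something like this (the date keeps a leading blank because the sample separates fields with a comma and a space while the script splits on the comma alone):
 30-NOV-20 - 9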

How to get frequency of logging using bash if each line contains a timestamp?

I have a program that writes to a text file during its operation. In this text file, each line consists of five parts:
Thread ID (a number)
A date in the format yyyy-mm-dd
A timestamp in the format 12:34:56.123456
A function name
Some useful comments printed out by the program
An example log line would look something like this:
127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
The logs are printed in chronological order, and I would like to know how to find the highest frequency of these logs. For example, I want to know at what minute or second of the day the program has the highest congestion.
Ideally I'd like an output that could tell me for example, "The highest logging frequency is between 22:04:00 and 22:05:00 with 10 log lines printed in this timeframe".
Let's consider this test file:
$ cat file.log
127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:20.424134 foobar clearing the programs cache
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127895 2020-07-30 22:05:40.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
127896 2020-07-30 22:06:40.424134 foobar recalibrating dankness
To get the most congested minutes, ranked in order:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
3 2020-07-30 22:05
2 2020-07-30 22:06
1 2020-07-30 22:04
22:05 appeared three times in the log file and is, thus, the most congested, followed by 22:06.
To get only the top most congested minutes, add head. For example:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr | head -1
3 2020-07-30 22:05
Note that we select here based on the second and third fields. The presence of dates or times in the text of log messages will not confuse this code.
How it works
sub(/:[^:]*$/, "", $3) removes everything after minutes in the third field.
a[$2" "$3]++ counts the number of times that date and time (up to minutes) appeared.
After the whole file has been read, for (d in a)print a[d], d prints out the count and date for every date observed.
sort -nr sorts the output with the highest count at the top. (Alternatively, we could have awk do the sorting but sort -nr is simple and portable.)
To sort down to the second
Instead of minutes resolution, we can get seconds resolution:
$ awk '{sub(/\.[^.]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
1 2020-07-30 22:06:40
1 2020-07-30 22:06:30
1 2020-07-30 22:05:40
1 2020-07-30 22:05:30
1 2020-07-30 22:05:20
1 2020-07-30 22:04:30
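To turn the top result into a sentence of the kind the question asks for, the minute-level pipeline can be fed through one more awk. A sketch; the wording of the message is made up, and the field positions ($1 count, $2 date, $3 minute) follow from the output format above:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log |
      sort -nr | head -1 |
      awk '{printf "The highest logging frequency is in minute %s %s, with %d log lines in that timeframe.\n", $2, $3, $1}'
The highest logging frequency is in minute 2020-07-30 22:05, with 3 log lines in that timeframe.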
With GNU utilities:
grep -o ' [0-9][0-9]:[0-9][0-9]' file.log | sort | uniq -c | sort -nr | head -n 1
Prints
frequency HH:MM
HH:MM is the hour and minute at which the highest frequency occurs, and frequency is that highest count. If you drop the | head -n 1, you will see the full list of counts and minutes, ordered by frequency.
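If second-level resolution is wanted instead, the same pipeline works with a pattern that also captures the seconds (a sketch mirroring the awk variant above):
grep -o ' [0-9][0-9]:[0-9][0-9]:[0-9][0-9]' file.log | sort | uniq -c | sort -nr | head -n 1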

Bash column sum over a table of variable length

I'm trying to get the column sums (except for the first column) of a tab-delimited file containing numbers.
To find out the number of columns and store it in a variable, I use:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1
Next I want to calculate the sum of all numbers in that column and store it in a variable, again inside a for loop:
for c in 2:$col
do
num=$(cat file.txt | awk '{sum+$2 ; print $0} END{print sum}'| tail -n 1
done
This:
num=$(cat file.txt | awk '{sum+$($c) ; print $0} END{print sum}'| tail -n 1
on its own, with a fixed column number and without variable input, works fine, but I cannot get it to accept the for-loop variable.
Thanks for the support
P.S. It would also be fine if I could sum all columns (except the first one) at once, without the loop trouble.
Assuming you want the sums of the individual columns,
$ cat file
1 2 3 4
5 6 7 8
9 10 11 12
$ awk '
{for (i=2; i<=NF; i++) sum[i] += $i}
END {for (i=2; i<=NF; i++) printf "%d%s", sum[i], OFS; print ""}
' file
18 21 24
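If the loop-based approach from the question is preferred, the usual fix is to pass the column number into awk with -v instead of trying to expand a shell variable inside the single-quoted awk program. A sketch, assuming the file.txt name from the question:
cols=$(awk '{print NF}' file.txt | sort -nu | tail -n 1)   # widest row decides how many columns there are
for ((c = 2; c <= cols; c++)); do
    num=$(awk -v col="$c" '{sum += $col} END {print sum + 0}' file.txt)   # sum of column c
    echo "column $c: $num"
done
Inside awk, col holds the column number, so $col selects that field; no shell interpolation into the awk program is needed.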
In case you're not bound to awk, there's a nice tool for "command-line statistical operations" on textual files called GNU datamash.
With datamash, summing (probably the simplest operation of all) the 2nd column is as easy as:
$ datamash sum 2 < table
9
Assuming the table file holds tab-separated data like:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
To sum all columns from 2 to n use column ranges (available in datamash 1.2):
$ n=4
$ datamash sum 2-$n < table
9 12 15
To include headers, see the --headers-out option

Extract emails from file with more than 100 users

I can't quite wrap my head around this issue. I'm trying to output a file with a list of email addresses taken from a larger list of email addresses. If there are more than 100 email addresses for any given domain in that list, I need those emails written out to a file.
The emaillist.txt file will have:
5000 occurrences of userID@yahoo.com
2000 occurrences of userID@aol.com
100 occurrences of userID@rr.com
10 occurrences of userID@whatever.com
cut -d @ -f 2 emaillist.txt | sort | uniq -c | sort -rn
outputs
5000 yahoo.com
2000 aol.com
100 rr.com
10 whatever.com
Now that I know how many emails I have at each domain, I only want the new file to contain the email addresses from domains that have more than 100 users.
This should do what you want:
cut -d @ -f 2 email.txt | sort | uniq -c | awk '$1 >= 100 {print $2}' | while read e; do grep "@$e$" email.txt >> emailkeep.txt; done
Assuming your file contains only email addresses, the following awk would solve your problem. It reads the file twice: the first pass counts addresses per domain, and the second pass prints the lines whose domain has at least 100 addresses (adjust that threshold to whatever you need):
awk '{split($0, a, "@");} NR==FNR{mp[a[2]]++; next} (mp[a[2]]>=100)' emaillist.txt emaillist.txt
DEMO
lo@ubuntu:~$ cat emaillist.txt
userID@yahoo.com
userID1@yahoo.com
userID2@yahoo.com
userID@aol.com
userID@rr.com
userID@whatever.com
lo@ubuntu:~$ awk '{split($0, a, "@");} NR==FNR{mp[a[2]]++; next} (mp[a[2]]>1)' emaillist.txt emaillist.txt
userID@yahoo.com
userID1@yahoo.com
userID2@yahoo.com
