Grep logs for occurrences per second - linux

I am trying to search logs for a range of time looking for the number of occurrences a specific account has. For instance I am running this now:
sed '/23:50:28/,/23:55:02/!d' log.log | grep account_number | wc -l
This nicely returns the total number of entries this account has within the time frame. My question is: how can I also get a list of those occurrences broken down by each per-second time entry? Example:
23:50:28 - 2
23:50:29 - 1
23:50:30 - 3
etc.
Thanks

awk to the rescue!
awk '/23:50:28/,/23:55:02/{if(/account_number/) a[$1]++}
END{for(k in a) print k " - " a[k]}' log | sort
obviously not tested since there is no sample input.
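If you'd rather stay close to your original pipeline, sort | uniq -c produces the same per-second counts (a sketch, assuming the timestamp is the first space-separated field of each line; note uniq -c prints the count first, e.g. "2 23:50:28"):
sed '/23:50:28/,/23:55:02/!d' log.log | grep account_number | cut -d' ' -f1 | sort | uniq -c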

Related

grep for a substring

I have a file that has the following user names in random places in the file:
albert#ghhdh
albert#jdfjgjjg
john#jfkfeie
mike#fjfkjf
bill#fjfj
bill#fkfkfk
Usernames are the names to the left of the # symbol.
I want to use unix commands to grep the file for usernames, then make a count of unique usernames.
Therefore using the example above, the output should state that there are 4 unique users (I just need the count as the output, no words)
Can someone help me determine the correct count?
You could extract the words before the #, sort them, and count them:
cat test.txt | cut -d '#' -f 1 | sort | uniq -c
With test.txt :
albert#ghhdh
john#jfkfeie
bill#fjfj
mike#fjfkjf
bill#fkfkfk
albert#jdfjgjjg
It outputs:
2 albert
2 bill
1 john
1 mike
Note that the duplicate usernames don't have to be grouped in the input list.
If you're just interested in the count of unique users:
cat test.txt | cut -d '#' -f 1 | sort -u | wc -l
# => 4
Or shorter:
cut -d '#' -f 1 test.txt | sort -u | wc -l
Here is a solution that finds the usernames anywhere on the line (not just at the beginning), even if there are multiple usernames on a single line, and counts the unique ones:
grep -oE '\b[[:alpha:]_][[:alnum:]_.]*#' file | cut -f1 -d# | sort -u | wc -l
-o fetches only the matched portion
-E enables extended regex
\b[[:alpha:]_][[:alnum:]_.]*# matches usernames (a string following a word boundary \b that starts with a letter or underscore, followed by zero or more alphanumeric or other permitted characters, and ending with a #)
cut -f1 -d# extracts the username portion, which is then sorted and counted for unique names
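For instance, on the test.txt above, the first two stages produce the bare usernames, one per match:
$ grep -oE '\b[[:alpha:]_][[:alnum:]_.]*#' test.txt | cut -f1 -d#
albert
john
bill
mike
bill
albert
sort -u | wc -l then reduces that list to 4.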
Faster with one awk command, if awk is allowed:
awk -F"#" '!seen[$1]++{c++}END{print "Unique users =" c}'
Small explanation:
Using # as the delimiter (-F), awk's field 1 ($1) is the username.
For every field 1 that has not been seen before, we increase a counter c.
At the same time we increment seen[$1], so that when the same username is found again the "not seen" test no longer holds.
At the end we just print the counter of unique usernames.
As a plus, this solution does not require pre-sorting: duplicates are found even if the file is not sorted.
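The same one-liner, spread out with comments (an annotated sketch of the program above, run against the test.txt sample):
awk -F'#' '
!seen[$1]++ { c++ }                 # seen[$1] is 0 (false) only the first time a username appears
END { print "Unique users = " c }   # after the last line, print the count
' test.txt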

Create diff between two files based on specific column

I have the following problem.
Say I have 2 files:
A.txt
1 A1
2 A2
B.txt
1 B1
2 B2
3 B3
I want to make a diff based only on the values of the first column, so the result should be
3 B3
How can this problem be solved with bash in Linux?
awk is your friend
awk 'NR==FNR{f[$1];next}{if($1 in f){next}else{print}}' A.txt B.txt
or more simply
awk 'NR==FNR{f[$1];next}!($1 in f){print}' A.txt B.txt
or even more simply
awk 'NR==FNR{f[$1];next}!($1 in f)' A.txt B.txt
A bit of explanation will certainly help.
NR and FNR are awk built-in variables: NR is the total number of records (including the current one) processed so far across all input files, while FNR is the same count for the current file only, so they are equal only while the first file is being processed.
f[$1] creates the array f on first use and adds $1 as a key if that key doesn't yet exist. Since no value is assigned, f[$1] is auto-initialized to the empty string (zero in a numeric context), but that aspect isn't used in your case.
next skips the rest of the awk script and moves on to the next record.
Note that the {if($1 in f){next}else{print}} part is processed only for the second (and any subsequent) files.
$1 in f checks whether the key $1 exists in the array f.
The if-else-print part is self-explanatory.
Note that in the third version the {print} is omitted because the default action in awk is to print!
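Here is the same idiom, spread out with comments (just an annotated sketch of the one-liner above, not a different program):
awk '
NR == FNR {    # true only while reading the first file (A.txt)
    f[$1]      # remember $1 as a key in array f
    next       # skip the rest of the script for this record
}
!($1 in f)     # for B.txt: print lines whose $1 never appeared in A.txt
' A.txt B.txt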
awk 'NR==FNR{array[$1];next} !($1 in array)' a.txt b.txt
3 B3
Like this in bash, but only if you are really not interested in the second column at all:
diff <(cut -f1 -d" " A.txt) <(cut -f1 -d" " B.txt)
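If both files are sorted on the first column (as they are here), join offers another one-liner that keeps the full lines: it pairs on the first field by default, and -v2 prints the lines of the second file that have no match in the first:
join -v2 A.txt B.txt
3 B3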

what is the meaning of the delimiter in cut, and why does this command sort twice?

I am trying to understand this command, and since I only know the very basics, here is what I have found so far:
last | cut -d" " -f 1 | sort | uniq -c | sort
last searches back through the file /var/log/wtmp (or the file designated by the -f flag) and displays a list of all users logged in (and out) since that file was created.
cut is to show the desired column.
The option -d specifies what is the field delimiter that is used in the input file.
-f specifies which field you want to extract
1 is the field number to extract, I think, but I am not sure.
Then it sorts, and then:
uniq is helpful to remove or detect duplicate entries in a file.
If anyone can explain this command, and also explain why there are two sorts, I would appreciate it.
You are right in your explanation of cut: cut -d" " -f1 (no need for a space after the f) gets the first field of a stream, based on the delimiter " " (space).
Then why sort | uniq -c | sort?
From man uniq:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
That's why you need to sort the lines before piping to uniq. Finally, as the output of uniq is not sorted, you need to sort again to order the lines by their count (with sort -r, the most repeated items come first).
See an example of sort and uniq -c for a given file with repeated items:
$ seq 5 >>a
$ seq 5 >>a
$ cat a
1
2
3
4
5
1
2
3
4
5
$ sort a | uniq -c | sort <--- no repeated matches
2 1
2 2
2 3
2 4
2 5
$ uniq -c a | sort <---- repeated matches
1 1
1 1
1 2
1 2
1 3
1 3
1 4
1 4
1 5
1 5
Note you can do the sort | uniq -c all together with this awk:
last | awk '{a[$1]++} END{for (i in a) print i, a[i]}'
This stores the values of the first column as keys in the a[] array and increases the counter each time a value reappears. The END{} block prints the results, unsorted, so you could pipe to sort once more.
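For example, to list users with the most logins first (a sketch; printing the count before the name lets a plain sort -nr do the ordering):
last | awk '{a[$1]++} END{for (i in a) print a[i], i}' | sort -nr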
uniq -c is being used to create a frequency histogram. The reason for the second sort is that you are then sorting your histogram by frequency order.
The reason for the first sort is that uniq only compares each line to the previous one when deciding whether the line is unique or not.

Bash Script - Divide Column 2 by Column 3 in the middle but keep 1 and 4 on either side

I have a list that has an ID, population, area and province, that looks like this:
1:517000:405212:Newfoundland and Labrador
2:137900:5660:Prince Edward Island
3:751400:72908:New Brunswick
4:938134:55284:Nova Scotia
5:7560592:1542056:Quebec
6:12439755:1076359:Ontario
7:1170300:647797:Manitoba
8:996194:651036:Saskatchewan
9:3183312:661848:Alberta
10:4168123:944735:British Columbia
11:42800:1346106:Northwest Territories
12:31200:482443:Yukon Territories
13:29300:2093190:Nunavut
I need to display the names of the provinces with the lowest and highest population density (population/area). How can I divide column 2 by column 3 (to 2 decimal places) but keep the file information intact on either side (e.g. 1:1.28:Newfoundland and Labrador)? After that I figure I can just pump it into sort -t: -nk2 | head -n 1 and sort -t: -nrk2 | head -n 1 to pull them.
The recommended command given was grep.
Since you seem to have the sorting and extraction under control, here's an example awk script that should work for you:
#!/usr/bin/env awk -f
BEGIN {
    FS = ":"
    OFS = ":"
    OFMT = "%.2f"
}
{
    print $1, $2/$3, $4
}
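Saved as, say, density.awk (the file names here are just for illustration), it plugs straight into the sorts you already planned; OFMT="%.2f" is what rounds the printed division to 2 decimal places:
awk -f density.awk provinces.txt | sort -t: -nk2 | head -n 1
awk -f density.awk provinces.txt | sort -t: -nrk2 | head -n 1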

Bash - sum values from an array in one line

I have this array:
array=(1 2 3 4 4 3 4 3)
I can get the largest number with:
echo "num: $(printf "%d\n" ${array[#]} | sort -nr | head -n 1)"
#outputs 4
But I want to sum up all the 4's, meaning I want it to output 12 (there are 3 occurrences of 4) instead. Any ideas?
dc <<<"$(printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1) * p"
sort to get max value at end
uniq -c to get only unique values, with a count of how many times they appear
tail to get only the last line (with the max value and its count)
dc to multiply the value by the count
I picked dc for the multiplication step because it's RPN, so you don't have to split up the uniq -c output and insert anything in the middle of it - just add stuff to the end.
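For reference, here are the intermediate steps on the array above (the leading spaces from uniq -c don't bother dc):
$ printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1
      3 4
$ dc <<<'3 4 * p'
12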
Using awk:
$ printf "%d\n" "${array[#]}" | sort -nr | awk 'NR>1 && p!=$0{print x;exit;}{x+=$0;p=$0;}'
12
Using sort, the numbers are sorted (-n) in reverse (-r) order, and awk keeps summing the numbers until it finds one that differs from the previous; it then prints the running sum and exits.
You can do this with awk:
awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[#]}"
Setting RS (record separator) to space allows you to read your array entries as separate records.
sum[$0]+=$0 builds a map of cumulative sums for each distinct input value; if($0>max) max=$0 tracks the largest number seen so far; END{print sum[max]} prints the sum accumulated for the largest number.
<<<"${array[#]}" is a here-document that allows you to feed a string (in this case all elements of the array) as stdin into awk.
This way there is no piping or looping involved - a single command does all the work.
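A quick check on the example array:
$ array=(1 2 3 4 4 3 4 3)
$ awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[@]}"
12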
Using only bash:
s=${array[*]}          # join all the elements into one space-separated string
echo $(( ${s// /+} ))
Replace all spaces with a plus and evaluate using a double-parentheses arithmetic expression. (The intermediate variable is needed because ${array// /+} by itself expands only the first element; note also that this sums every element of the array, 24 here, rather than just the repeated maximum.)
