Extract part of Column Awk - linux

I am trying to count the number of occurrences per second of a search term in a log file. I've been using AWK, but the problem is that the timestamp is located in a column that also contains other information. Is it possible to get the number of occurrences per second by looking only at the time pattern (00:00:00 - 24:00:00)?
Data example:
[01/May/2018:23:59:59.532
[01/May/2018:23:59:59.848
[01/May/2018:23:59:59.851
[01/May/2018:23:59:59.911
[01/May/2018:23:59:59.923
[01/May/2018:23:59:59.986
[01/May/2018:23:59:59.988
[01/May/2018:23:59:59.756
[01/May/2018:23:59:59.786
[01/May/2018:23:59:59.883
So far I can extract the data easily enough using:
awk '/00:00:00/,/24:00:00/{if(/search_term/) a[$4]++} END{for(k in a) print k " - " a[k]}' file.log |sort
This will return:
[02/May/2018:10:40:05.903 - 1
[02/May/2018:10:40:05.949 - 1
[02/May/2018:10:40:05.975 - 1
[02/May/2018:10:40:05.982 - 2
[02/May/2018:10:40:06.022 - 1
[02/May/2018:10:40:06.051 - 1
[02/May/2018:10:40:06.054 - 1
[02/May/2018:10:40:06.086 - 1
[02/May/2018:10:40:06.094 - 1
[02/May/2018:10:40:06.126 - 1
What I'm aiming for is more like:
10:40:05 - 5
10:40:06 - 6
No idea if I'm even thinking about this correctly. New to AWK in general.

Use colon and dot as the field separators; then the hours are in column 2, the minutes in column 3 and the seconds in column 4:
awk -F'[:.]' '
{count[$2 ":" $3 ":" $4]++}
END {for (time in count) print time " - " count[time]}
' file
10:40:05 - 4
10:40:06 - 6
Output will not necessarily be sorted. If you're using GNU awk, use
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (time in count)
print time " - " count[time]
}
or simply pipe the output through sort.
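If you also need to filter on the search term from your original command, a minimal sketch that combines the two (assuming, as in your command, the bracketed timestamp is in $4 and search_term stands in for whatever you are actually matching):
awk '/search_term/ {count[substr($4, 14, 8)]++}
     END {for (t in count) print t " - " count[t]}' file.log | sort
substr($4, 14, 8) skips the fixed-width "[DD/Mon/YYYY:" prefix and keeps just the HH:MM:SS part, so all the milliseconds within one second fall into the same bucket.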

One thing you can do, if you are using GNU awk, is this:
awk 'BEGIN{FIELDWIDTHS = "1 11 1 12"} {print $4}' datetimes
Specify the field widths and $4 will give you the time, milliseconds included. If you don't care about the milliseconds, use FIELDWIDTHS = "1 11 1 8 4" instead, and $4 becomes just HH:MM:SS.
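Building on that, a sketch of the full per-second count using FIELDWIDTHS (GNU awk only; this assumes the lines start with the bracketed timestamp exactly as in the sample data):
awk 'BEGIN {FIELDWIDTHS = "1 11 1 8 4"}
     {count[$4]++}
     END {for (t in count) print t " - " count[t]}' file.log | sort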

You can use substr on the line to build the index of an array. For example, say you have this file:
cat 1.txt
[01/May/2018:23:59:59.532
[01/May/2018:01:59:59.848
[01/May/2018:02:59:59.851
[01/May/2018:02:59:59.911
[01/May/2018:02:59:59.923
[01/May/2018:02:00:59.986
You can use an awk command like this:
awk '{a[substr($0,index($0,":")+1,8)]++} END{for(i in a) print i" - "a[i]}' 1.txt
where substr($0, index($0,":")+1, 8) takes the 8 characters after the first ":" (the HH:MM:SS part) and uses that as the index of the array.
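Run against the file above, that prints (the order of a for (i in a) loop is unspecified, so the lines may appear in any order):
23:59:59 - 1
01:59:59 - 1
02:59:59 - 3
02:00:59 - 1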

Related

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split; just replacing will do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub will replace all occurrences; when no 3rd argument is given it operates on $0, but here it is restricted to $1.
The final 1 is a shortcut for {print} (an always-true pattern with an implied {print} action).
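A tiny demonstration of that gsub behaviour (a sketch; the sample line is made up):
printf 'a_b\tc_d\n' | awk 'BEGIN{FS=OFS="\t"}{gsub("_","-")}1'      # a-b  c-d  (no 3rd arg: acts on $0)
printf 'a_b\tc_d\n' | awk 'BEGIN{FS=OFS="\t"}{gsub("_","-",$1)}1'   # a-b  c_d  (3rd arg: acts on $1 only)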
Another awk approach, if the "_" appears only in the first column.
Split the input with the field-separator regex "[_\t]" and do a dummy assignment like $1=$1 in the main block, so that $0 is reconstructed with OFS="\t":
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks @Ed, updated from -F"[_\t]+" to -F"[_\t]"; without the +, consecutive separators are not collapsed, so empty fields are preserved.

awk print number of row only in uniq column

I have data set like this:
1 A
1 B
1 C
2 A
2 B
2 C
3 B
3 C
And I have a script which gives me:
the number of occurrences of the search string
the number of rows
awk -v search="A" \
'BEGIN{count=0} $2 == search {count++} END{print count "\n" NR}' input
That works perfectly fine.
I would like to add to my awk one liner number of unique lines from the first column.
So the output should be separated by \n:
2
8
3
I can do this in separate awk code, but I am not able to integrate it to my original awk code.
awk '{a[$1]++}END{for(i in a){print i}}' input | wc -l
Any idea how to integrate it into one awk solution, without piping?
Looks like you want this:
awk -v search="A" '{a[$1]++}
$2 == search {count++}
END{OFS="\n";print count+0, NR, length(a)}' file
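Run against the sample input above, that prints the three numbers in the requested order:
2
8
3
(count+0 makes sure a 0 is printed instead of an empty string when the search string never occurs; length(a) on an array is a widely supported extension, available in GNU awk among others.)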

Put into file : index of line and total numbers of specific pattern

I'm trying to come up with a command which writes to a file, for every line with more than 5 commas, the index of that line together with the number of commas in it.
Let's assume input:
'abc','abc','abc','abc,'abc,'abc
'abc','abc','abc','abc,'abc,'abc,'abc
'abc','abc','abc','abc,'abc,'abc,'abc,'abc'
So the first line has 5 commas, the second has 6 and the third has 7,
and the expected result:
Index: 2 Number of commas : 6
Index: 3 Number of commas : 7
I came up with this command, which appends the whole line to errors.csv when the line has more than 50 comma-separated fields:
awk -F , 'NF > 50' <filename.csv >> errors.csv
The hardest part for me is how to retrieve the index of the line and write it to the file.
Could you help me with this?
You can get this expected output using awk's NR and NF variables:
awk -F"," '{ if(NF > 6) printf("Index: %d Number of commas : %d\n", NR, NF-1); }' filename.csv
NR gives you the current record (line) number, and NF the number of fields in the record, so NF-1 is the number of commas.
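To put the report into a file, as in your original attempt, redirect the output (a sketch reusing the file names from your command):
awk -F"," 'NF > 6 { printf("Index: %d Number of commas : %d\n", NR, NF-1) }' filename.csv >> errors.csv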

script to compare two large 900 x 900 comma delimited files

I have tried awk but haven't been able to perform a diff for every cell, one at a time, on both files.
If you just want a rough answer, possibly the simplest thing is to do something like:
tr , \\n < file1 > /tmp/output
tr , \\n < file2 | diff - /tmp/output
That will convert each file to one column and run diff. You can compute the cells that differ from the line numbers of the output.
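For example, with 900 columns per row, a differing line number n from the diff output maps back to a cell like this (a sketch; n=1234 is just an example value):
n=1234
cols=900
echo "row $(( (n - 1) / cols + 1 )), column $(( (n - 1) % cols + 1 ))"
# prints: row 2, column 334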
Simplest way with awk, without accounting for newlines inside fields, quoted commas, etc. Note that a regular expression in RS, as used below, requires GNU awk.
Print the cells that are the same:
awk 'BEGIN{RS=",|"RS}a[FNR]==$0;{a[NR]=$0}' file{,2}
Print the cells that differ:
awk 'BEGIN{RS=",|"RS}FNR!=NR&&a[FNR]!=$0;{a[NR]=$0}' file{,2}
Print, for each cell, whether it is the same or different:
awk 'BEGIN{RS=",|"RS}FNR!=NR{print "cell"FNR (a[FNR]==$0?"":" not")" the same"}{a[NR]=$0}' file{,2}
Input
file
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
file2
1,2,3,4,5
2,7,1,9,12
1,1,1,1,12
Output
Same:
1
2
3
4
5
7
9
Different:
2
1
12
1
1
1
1
12
Same / different:
cell1 the same
cell2 the same
cell3 the same
cell4 the same
cell5 the same
cell6 not the same
cell7 the same
cell8 not the same
cell9 the same
cell10 not the same
cell11 not the same
cell12 not the same
cell13 not the same
cell14 not the same
cell15 not the same
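If you need to know which row and column differ rather than a flat cell number, a sketch in plain awk (no GNU-specific RS), assuming both files have the same number of rows and columns and no quoted commas:
awk -F',' 'FNR==NR {for (i = 1; i <= NF; i++) cell[FNR,i] = $i; next}
           {for (i = 1; i <= NF; i++)
                if (cell[FNR,i] != $i)
                    printf "row %d col %d: %s -> %s\n", FNR, i, cell[FNR,i], $i}' file file2
For the sample files above, the first reported difference is "row 2 col 1: 6 -> 2".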

Bash- sum values from an array in one line

I have this array:
array=(1 2 3 4 4 3 4 3)
I can get the largest number with:
echo "num: $(printf "%d\n" ${array[#]} | sort -nr | head -n 1)"
#outputs 4
But I want to sum up all the 4's, meaning I want it to output 12 (there are 3 occurrences of 4) instead. Any ideas?
dc <<<"$(printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1) * p"
sort to get max value at end
uniq -c to get only unique values, with a count of how many times they appear
tail to get only the last line (with the max value and its count)
dc to multiply the value by the count
I picked dc for the multiplication step because it's RPN, so you don't have to split up the uniq -c output and insert anything in the middle of it - just add stuff to the end.
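To make that concrete, the intermediate output for the example array looks like this (uniq -c left-pads the count, so the exact spacing may vary):
printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1
#      3 4
dc then evaluates "3 4 * p" and prints 12.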
Using awk:
$ printf "%d\n" "${array[@]}" | sort -nr | awk 'NR>1 && p!=$0{print x;exit;}{x+=$0;p=$0;}'
12
Using sort, the numbers are sorted (-n) in reverse (-r) order, and the awk keeps summing the numbers until it finds one that differs from the previous one.
You can do this with awk:
awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[@]}"
Setting RS (record separator) to space allows you to read your array entries as separate records.
sum[$0]+=$0 means sum is a map of cumulative sums for each input value; if($0>max) max=$0 tracks the largest number seen so far; END{print sum[max]} prints the sum for the largest number at the end.
<<<"${array[#]}" is a here-document that allows you to feed a string (in this case all elements of the array) as stdin into awk.
This way there is no piping or looping involved - a single command does all the work.
Using only bash (note this sums every element of the array, and the substitution has to act on a plain string, so join the array into one first):
all="${array[*]}"
echo $(( ${all// /+} ))
Replace all spaces with plus signs, and evaluate the result with an arithmetic $(( ... )) expression.
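If what you actually want is the 12 from the question (the sum of all occurrences of the maximum) with no external tools, a plain-bash sketch:
max=${array[0]}
for v in "${array[@]}"; do (( v > max )) && max=$v; done
sum=0
for v in "${array[@]}"; do (( v == max )) && (( sum += v )); done
echo "$sum"   # 12 for array=(1 2 3 4 4 3 4 3)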
