Find the durations and their average between the dataset in an interval in shell script - linux

This is related to my older question Find the durations and their maximum between the dataset in an interval in shell script
I have a dataset as:
ifile.txt
2
3
2
3
2
20
2
0
2
0
0
2
1
2
5
6
7
0
3
0
3
4
5
I would like to find the different durations and their averages between the 0 values, within intervals of 6 values.
My desired output is:
ofile.txt
6 5.33
1 2
1 2
1 2
5 4.2
1 3
3 4
Where
6 is the number of counts until the next 0 within the first 6 values (i.e. 2,3,2,3,2,20) and 5.33 is the average of those values;
1 is the number of counts until the next 0 within the next 6 values (i.e. 2,0,2,0,0,2) and 2 is the average;
the next 1 2 entries fall within that same set of 6 values;
5 is the number of counts until the next 0 within the next 6 values (i.e. 1,2,5,6,7,0) and 4.2 is the average of those values;
and so on.
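To make this concrete, the sample input splits into 6-value windows like this (runs of length 0 are dropped from my desired output):
window 1: 2 3 2 3 2 20 -> one run of 6 values, average 32/6 = 5.33
window 2: 2 0 2 0 0 2 -> runs of 1 (average 2), 1 (average 2), 0, and 1 (average 2)
window 3: 1 2 5 6 7 0 -> one run of 5 values, average 21/5 = 4.2
window 4: 3 0 3 4 5 -> runs of 1 (average 3) and 3 (average 4)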
As per the answer to my previous question, I was trying with this:
awk '
$0!=0{
count++
sum=sum+$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file | awk '!/^ /' | awk '$1 != 0'

EDIT3: One more try. The 2nd set of 6 lines is 2 0 2 0 0 2, so its output should be 1 2, 1 2, 0 0, 1 2 if zero-length runs are counted too (which I believe ideally should be the case); if so, try the following.
awk '
{
occur++
}
{
count=$0!=0?++count:count
sum+=$0
}
$0==0 || occur==6{
printf("%d %0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=prev=prev_count=""
if(occur==6){
occur=""
}
}
END{
if(occur){
printf("%d %0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file | awk '$1 != 0'
Output will be as follows:
6 5.33
1 2.00
1 2.00
1 2.00
5 4.20
1 3.00
3 4.00
The EDITs below may help in similar kinds of problems which are a bit different from this actual problem, so I am keeping them here in the post.
EDIT2: In case you don't want to RESET count whenever a zero occurs in Input_file, then try the following. This looks at strict 6-line windows and will NOT RESET its line counter when a zero occurs.
awk '
{
occur++
}
$0!=0{
count++
sum+=$0
found=prev_count=prev=""
}
$0==0 && occur!=6{
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=""
found=1
next
}
occur==6{
printf("%d,%0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
}
END{
if(!found){
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file
EDIT1: Could you please try the following; it is tested and written with the provided samples only.
awk '
{
occur++
}
$0!=0{
count++
sum+=$0
found=prev_count=prev=""
}
$0==0{
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
next
}
occur==6{
printf("%d,%0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
}
END{
if(!found){
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file
What the code takes care of:
If any 2 continuous lines have the value 0, it will print the previous count and average values for that line.
This will also take care of edge cases like:
a- In case the input does NOT end with a 0, it will check whether some values are left to print via the found flag I created.
b- In case the Input_file's last line does NOT fall on a multiple of 6, this case will also be covered by the END block's logic of checking the found flag.
Explanation: Adding a detailed explanation for above code.
awk ' ##Starting awk program from here.
{
occur++
}
$0!=0{ ##Checking condition if a line is NOT having zero value then do following.
count++ ##Increment variable count with 1 each time it comes here.
sum+=$0 ##Creating variable sum and keep adding current line value in it.
found=prev_count=prev="" ##Nullifying variables found, prev_count, prev here.
} ##Closing BLOCK for condition $0!=0 here.
$0==0{ ##Checking condition if a line is having value zero then do following.
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
prev=count?sum/count:0 ##Creating variable prev which will be sum/count or zero in case count variable is NULL.
prev_count=count ##Creating variable prev_count whose value is count.
count=sum=occur="" ##Nullify variables count and sum here.
found=1 ##Setting value 1 to variable found here.
next ##next will skip all further statements from here.
} ##Closing BLOCK for condition $0==0 here.
occur==6{ ##Checking if 6 lines have been read since the last reset; if so then do following.
printf("%d,%0.2f\n",count,count?sum/count:prev) ##Printing count and sum/count here, making sure the latter is NOT divided by 0.
prev=count?sum/count:0 ##Creating variable prev which will be sum/count or zero in case count variable is NULL.
prev_count=count ##Creating variable prev_count whose value is count.
count=sum=occur="" ##Nullifying variables count and sum here.
found=1 ##Setting value 1 to variable found here.
} ##Closing BLOCK for condition occur==6 here.
END{ ##Starting END block for this awk program here.
if(!found){ ##Checking condition if variable found is NULL then do following.
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
}
}
' Input_file ##Mentioning Input_file name here.

Another which didn't turn out quite the one-liner I was going for:
$ tail -n +2 file | # strip header with tail to not disturb NR
awk '
{
s+=$0 # sum them up
c++ # keep count
}
$0==0 || NR%6==0 { # act if zero or every 6th record
if($0==0) # remove zero effect on c
c--
if(s>0) # avoid division by zero
print c,s/c # output
s=c=0 # rinse and repeat
}
END { # the end-game if NR%6!=0
if($0==0)
c--
if(s>0)
print c,s/c
}'
Output:
6 5.33333
1 2
1 2
1 2
5 4.2
1 3
3 4
The tail removes the header from the file before feeding the data to awk; the idea is that the header won't show up in NR. If you don't have a header, just use awk ... file.
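For instance, to confirm the header no longer disturbs the record numbering, you can print NR after the tail (a quick check, assuming the file starts with a one-line header followed by the sample data):
$ tail -n +2 file | awk '{print NR": "$0}' | head -3
1: 2
2: 3
3: 2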

Related

Awk column value in file 1 is in the range of two columns in file 2

I modified the question based on the comments.
I would like to match two files: if $4 in file 1 is in the range of $3 and $4 in file 2, I would like to print the file 1 line with $6 from file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorted based on $4 of file 1).
File 1:
1 rs537182016 0 8674590 A C
1 rs575272151 0 69244805 G C
1 rs544419019 0 69244469 G C
1 rs354682 0 1268900 G C
File 2:
18 16 8674587 8784575 + ABAT
10349 17 69148007 69244815 - ABCA10
23461 17 69244435 69327182 - ABCA5
Output:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
I tried the following based on previous answers, but it did not work. The output is an empty file.
awk 'FNR == NR { val[$1] = $4 }
FNR != NR { if ($1 in val && val[$1] >= $3 && val[$1] <= $4)
print $1, val[$1], $6
}' file1 file2 > file3
Assumptions:
in the case of multiple matches OP has stated we only use the 'first' match; OP hasn't (yet) defined 'first' so I'm going to assume it means the order in which lines appear in file2 (aka the line number)
One awk idea:
awk '
FNR==NR { min[++n]=$3 # save 1st file values in arrays; use line number as index
max[n]=$4
col6[n]=$6
next
}
{ for (i=1;i<=n;i++) # loop through 1st file entries
if (min[i] <= $4 && $4 <= max[i]) { # if we find a range match then ...
print $0, col6[i] # print current line plus column #6 from 1st file and then ...
next # skip to next line of input; this keeps us from matching on additional entries from 1st file
}
print $0, "NA" # if we got here there was no match so print current line plus "NA"
}
' file2 file1
NOTE: make note of the order of the input files; the first answer (below) was based on an input of file1 file2; this answer requires the order of the input files to be flipped, ie, file2 file1
This generates:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
NOTE: following is based on OP's original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid ...
Assumptions:
in file1 both rs575272151 / 69244805 and rs544419019 / 69244469 match 2 different (overlapping) ranges from file2, but OP has only shown one set of matches in the expected output; from this I'm going to assume ...
once a match is found for an entry from file1, remove said entry from any additional matching; this will eliminate multiple matches for file1 entries
once a match is found for a line from file2 then stop looking for matches for that line (ie, go to the next input line from file2); this will eliminate multiple-matches for file2
OP has not provided any details on how to determine which multi-match to keep so we'll use the first match we find
One awk idea:
awk '
FNR==NR { val[$2]=$4; next }
{ for (i in val) # loop through list of entries from 1st file ...
if ($3 <= val[i] && val[i] <= $4) { # looking for a range match and if found ...
print $0,i # print current line plus 2nd field from 1st file and then ...
delete val[i] # remove 1st file entry from further matches and ...
next # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
}
}
' file1 file2
This generates:
18 16 8674587 8784575 + ABAT rs537182016
10349 17 69148007 69244815 - ABCA10 rs575272151
23461 17 69244435 69327182 - ABCA5 rs544419019
NOTES:
the for (i in val) construct is not guaranteed to process the array entries in a consistent order; the net result is that when there are multiple matches we simply match on the 'first' array entry provided by awk. If this 'random' nature of the for (i in val) construct is not acceptable then OP will need to update the question with additional details on how to handle multiple matches
for this particular case we actually generate the same output as expected by OP, but the assignments of rs575272151 and rs544419019 could just as easily be reversed (due to the nature of the for (i in val) construct)
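If a deterministic order is needed, GNU awk (4.0+) can impose an iteration order on for (i in val) via PROCINFO["sorted_in"]. As a gawk-only sketch (picking ascending numeric order of the stored positions is my assumption of what 'first' should mean):
awk '
BEGIN { PROCINFO["sorted_in"] = "@val_num_asc" } # gawk-only: iterate val[] by stored value, ascending
FNR==NR { val[$2]=$4; next }
{ for (i in val)
    if ($3 <= val[i] && val[i] <= $4) { print $0, i; delete val[i]; next }
}
' file1 file2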

Awk - Count Each Unique Value and Match Values Between Two Files without printing all fields of matched line [duplicate]

This question already has answers here:
Awk - Count Each Unique Value and Match Values Between Two Files
(2 answers)
Closed 2 years ago.
I have two files. First I am trying to get the count of each unique value in column 4 of File1.
And then match each unique value against the quoted value in the 2nd file.
File1's column 4 holds each unique value, and File2's last column contains the quoted value that I need to match between the two files.
So essentially, I am trying to take each unique value and its count from column 4 of File1, and print it if there is a match in File2.
File1
1 2 3 6 5
2 3 4 5 1
3 5 7 6 1
2 3 4 6 2
4 6 6 5 1
File2
1 2 3 hello "6"
1 3 3 hi "5"
needed output
total count of hello,6 : 3
total count of hi,5 : 2
My test code
awk 'NR==FNR{a[$4]++}NR!=FNR{gsub(/"/,"",$2);b[$2]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
I believe I should be able to do this with awk, but for some reason I am really struggling with this one.
Thanks
With your shown samples, could you please try the following.
awk '
FNR==NR{
count[$4]++
next
}
{
gsub(/"/,"",$NF)
}
($NF in count){
print "total count of "$(NF-1)","$NF" : "count[$NF]
}
' file1 file2
Sample output will be as follows.
total count of hello,6 : 3
total count of hi,5 : 2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1 is being read.
count[$4]++ ##Creating array count with the 4th field as index and increasing its value by 1 each time the same index is seen.
next ##next will skip all further statements from here.
}
{
gsub(/"/,"",$NF) ##Globally substituting " with NULL in last field of current line.
}
($NF in count){ ##Checking condition if last field is present in count then do following.
print "total count of "$(NF-1)","$NF" : "count[$NF]
##Printing string 2nd last field, last field : and count value here as per request.
}
' file1 file2 ##Mentioning Input_file names here.

Sum of 2nd and 3rd column for same value in 1st column

I want to sum the values in the 2nd and 3rd columns for rows with the same value in the 1st column
1555971000 6 1
1555971000 0 2
1555971300 2 0
1555971300 3 0
Output would be like
1555971000 6 3
1555971300 5 0
I have tried the below command:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
but this seems to handle only one column.
Here is another way, reading Input_file 2 times; it provides output in the same sequence as the Input_file.
awk 'FNR==NR{a[$1]+=$2;b[$1]+=$3;next} ($1 in a){print $1,a[$1],b[$1];delete a[$1]}' Input_file Input_file
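With the samples above this prints the sums in the input's original order:
1555971000 6 3
1555971300 5 0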
If the data is in file 'd', without sorting (tried on GNU awk):
awk 'BEGIN{f=1} {if($1==a||f){b+=$2;c+=$3;f=0} else{print a,b,c;b=$2;c=$3} a=$1} END{print a,b,c}' d
With sorting (GNU awk):
awk '{w[NR]=$0} END{asort(w);f=1;for(;i++<NR;){split(w[i],v);if(v[1]==a||f){f=0;b+=v[2];c+=v[3]} else{print a,b,c;b=v[2];c=v[3];} a=v[1]} print a,b,c;}' d
You can do it with awk by saving the fields of the first record, then for each subsequent record comparing whether the first field matches; if so, adding the contents of fields two and three and continuing. If the first field fails to match, output the previously saved field and the running sums (skipping the very first record, where nothing has been saved yet), e.g.
awk '{
if ($1 == a) {
b+=$2; c+=$3;
}
else {
if (a != "") print a, b, c;
a=$1; b=$2; c=$3;
}
} END { print a, b, c; }' file
With your input in file, you can copy and paste the foregoing into your terminal and obtain the following:
Example Use/Output
$ awk '{
> if ($1 == a) {
> b+=$2; c+=$3;
> }
> else {
> if (a != "") print a, b, c;
> a=$1; b=$2; c=$3;
> }
> } END { print a, b, c; }' file
1555971000 6 3
1555971300 5 0
Using awk Arrays
A shorter, more succinct alternative using arrays, which does not require your input to be in sorted order:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
(same output)
Using arrays allows the summing of columns for matching first fields to work equally well if your data file contained the following lines in random order, e.g.
1555971300 2 0
1555971000 0 2
1555971000 6 1
1555971300 3 0
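For example, feeding those shuffled lines straight in (note that the output order of for (i in a) is unspecified, so pipe through sort if you need a fixed order):
$ printf '1555971300 2 0\n1555971000 0 2\n1555971000 6 1\n1555971300 3 0\n' |
> awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }'
1555971000 6 3
1555971300 5 0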
Another awk that works regardless of the order of records, whether or not they are sorted:
awk '{r[$1]++}
r[$1]==1{o[++c]=$1}
{f[$1]+=$2;s[$1]+=$3}
END{for(i=1;i<=c;i++){print o[i],f[o[i]],s[o[i]]}}' file
Assuming when you wrote:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
you meant to write:
awk '{ b[$1]+=$2 } END{ for (i in b) print i,b[i] }'
It shouldn't be a huge leap to figure out:
$ awk '{ b[$1]+=$2; c[$1]+=$3 } END{ for (i in b) print i,b[i],c[i] }' file
1555971000 6 3
1555971300 5 0
Please get the book "Effective Awk Programming", 4th Edition, by Arnold Robbins and just read a paragraph or 2 about fields and arrays.

Find duplicate lines based on column and print both lines and their numbers with awk

I have the following file:
userID PWD_HASH
test 1234
admin 1234
user 6789
abcd 5555
efgh 6666
root 1234
Using AWK,
I need to find both original lines and their duplicates with row numbers,
so that I get the output like:
NR $0
1 test 1234
2 admin 1234
6 root 1234
I have tried the following, but it does not print the correct row number with NR:
awk 'n=x[$2]{print NR" "n;print NR" "$0;} {x[$2]=$0;}' file.txt
Any help would be appreciated!
$ awk '
($2 in a) { # look for duplicates in $2
if(a[$2]) { # if found
print a[$2] # output the first, stored one
a[$2]="" # mark it outputed
}
print NR,$0 # print the duplicated one
next # skip the storing part that follows
}
{
a[$2]=NR OFS $0 # store the first of each with NR and full record
}' file
Output (with the header in file):
2 test 1234
3 admin 1234
7 root 1234
Using GAWK, you can do this with the construct below:
awk '
NR>1 {
    a[$2][NR-1 " " $0]
}
END {
    for (i in a)
        if (length(a[i]) > 1)
            for (j in a[i])
                print j
}
' Input_File.txt
Create a 2-dimensional array.
In the first dimension, store PWD_HASH, and in the second dimension, store the line number (NR-1) concatenated with the whole line ($0).
To display only the duplicated ones, you can use the length(a[i]) > 1 condition.
Could you please try the following.
awk '
FNR==NR{
a[$2]++
b[$2,FNR]=FNR==1?FNR:(FNR-1) OFS $0
next
}
a[$2]>1{
print b[$2,FNR]
}
' Input_file Input_file
Output will be as follows.
1 test 1234
2 admin 1234
6 root 1234
Explanation: Following is the explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE the first time Input_file is read.
a[$2]++ ##Creating an array named a whose index is $2 and incrementing its value by 1 each time the same index is seen.
b[$2,FNR]=FNR==1?FNR:(FNR-1) OFS $0 ##Creating array b indexed by $2,FNR; it stores the (header-adjusted) line number along with the whole line.
next ##Using next for skipping all further statements from here.
}
a[$2]>1{ ##Checking condition where the value of a[$2] is greater than 1; this executes during the 2nd read of Input_file.
print b[$2,FNR] ##Printing value of array b whose index is $2,FNR here.
}
' Input_file Input_file ##Mentioning Input_file(s) names here 2 times.
Without using awk, but GNU coreutils tools:
tail -n+2 file | nl | sort -k3n | uniq -D -f2
tail removes the first line.
nl adds line numbers.
sort sorts based on the 3rd field.
uniq prints only the duplicates, comparing from the 3rd field on (-f2 skips the first two fields).
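With the sample file, the pipeline should print something along these lines (nl's default format pads the line numbers):
     1	test 1234
     2	admin 1234
     6	root 1234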

awk key-value issue for array

I have run into an awk array issue; details below:
[~/temp]$ cat test.txt
1
2
3
4
1
2
3
Then I want to count the frequency of each number.
[~/temp]$ awk 'num[$1]++;END{for (i in num){printf("%s\t%-s\n", num[i],i)|"sort -r -n -k1"} }' test.txt
1
2
3
2 3
2 2
2 1
1 4
As you can see, the first 3 lines '1 2 3' come out with no count value. Why?
Thanks for your answer.
An awk statement consists of a pattern and a related action. An omitted pattern matches every record of input. An omitted action is an alias for {print $0}, i.e. output the current record, which is what you are getting. Looking at the first part of your program:
$ awk 'num[$1]++' file
1
2
3
Let's change that a bit to understand what happens there:
$ awk '{print "NR:",NR,"num["$1"]++:",num[$1]++}' file
NR: 1 num[1]++: 0
NR: 2 num[2]++: 0
NR: 3 num[3]++: 0
NR: 4 num[4]++: 0
NR: 5 num[1]++: 1
NR: 6 num[2]++: 1
NR: 7 num[3]++: 1
Since you are using the postfix operator num[$1]++ in the pattern, on records 1-4 it evaluates to 0 before its value is incremented. The output would be different with the prefix operator ++num[$1], which first increments the variable and then evaluates it; that would output every record of input, not just the last three, which is what you were getting.
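For example, with the prefix form the pattern evaluates to at least 1 on every record, so every record prints:
$ awk '++num[$1]' file
1
2
3
4
1
2
3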
The correct way would've been to use num[$1]++ as an action, not as a pattern:
$ awk '{num[$1]++}' file
Put your "per line" part in {} i.e. { num[$1]++; }
awk programs a a collection of [pattern] { actions } (the pattern is optional, the {} is not). Seems that in your case your line is being treated as the pattern.
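Putting it together, the original one-liner with the increment moved into an action block should print only the counts; a sketch (with GNU sort, equal counts fall back to reverse whole-line order, so the order of ties may vary by sort implementation):
$ awk '{num[$1]++} END{for (i in num){printf("%s\t%-s\n", num[i],i)|"sort -r -n -k1"} }' test.txt
2	3
2	2
2	1
1	4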
