Awk - Count Each Unique Value and Match Values Between Two Files without printing all fields of matched line [duplicate] - linux

This question already has answers here:
Awk - Count Each Unique Value and Match Values Between Two Files
(2 answers)
Closed 2 years ago.
I have two files. First I am trying to get the count of each unique value in column 4 of File1.
Then I need to match each of those unique values against the quoted value in File2.
File1's column 4 holds the unique values, and File2's last column holds the quoted value that I need to match between the two files.
So essentially, I am trying to take each unique value and its count from column 4 of File1, and report it if there is a match in File2.
File1
1 2 3 6 5
2 3 4 5 1
3 5 7 6 1
2 3 4 6 2
4 6 6 5 1
File2
1 2 3 hello "6"
1 3 3 hi "5"
needed output
total count of hello,6 : 3
total count of hi,5 : 2
My test code
awk 'NR==FNR{a[$4]++}NR!=FNR{gsub(/"/,"",$2);b[$2]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
I believe I should be able to do this with awk, but for some reason I am really struggling with this one.
Thanks

With your shown samples, could you please try the following.
awk '
FNR==NR{
count[$4]++
next
}
{
gsub(/"/,"",$NF)
}
($NF in count){
print "total count of "$(NF-1)","$NF" : "count[$NF]
}
' file1 file2
Sample output will be as follows.
total count of hello,6 : 3
total count of hi,5 : 2
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1 is being read.
count[$4]++ ##Creating array count indexed by the 4th field and increasing its value by 1 here.
next ##next will skip all further statements from here.
}
{
gsub(/"/,"",$NF) ##Globally substituting " with NULL in last field of current line.
}
($NF in count){ ##Checking condition if last field is present in count then do following.
print "total count of "$(NF-1)","$NF" : "count[$NF]
##Printing the 2nd-last field, the last field, and the count value here, as requested.
}
' file1 file2 ##Mentioning Input_file names here.
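As an aside, the FNR==NR idiom used above generalizes to any two-file lookup. A minimal sketch (the file names lookup.txt and data.txt are made up for illustration):
awk '
FNR==NR{ keys[$1]; next }  ##While reading the first file, remember its 1st field as an array key.
($1 in keys)               ##While reading the second file, print only lines whose 1st field was seen before.
' lookup.txt data.txt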

Related

Merging two txt files based on a common column with different row numbers

I would like to merge two whitespace-delimited files, without sorting them first, based on the "phenotype" column. File 1 contains the same phenotype several times, while file 2 has each phenotype only once. I need to match the "phenotype" from file 1 and pull in the corresponding "Category" from file 2.
File 1:
chr pos pval_EAS phenotype FDR
1 1902906 0.234 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 1475898 0.221 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 568899 0.433 continuous-4566-both_sexes-irnt.tsv.gz 1
2 2435478 0.113 continuous-4566-both_sexes-irnt.tsv.gz 1
4 1223446 0.112 phecode-554-both_sexes-irnt.tsv.gz 0.345
4 3456573 0.0003 phecode-554-both_sexes-irnt.tsv.gz 0.989
File 2:
phenotype Category
biomarkers-30600-both_sexes-irnt.tsv.bgz Metabolic
continuous-4566-both_sexes-irnt.tsv.gz Neoplasms
phecode-554-both_sexes-irnt.tsv.gz Immunological
I tried the following, but I don't get the desired output:
awk -F' ' 'FNR==NR{a[$1]=$4; next} {print $0 a[$6]}' file2 file1 > file3
With your shown samples, please try the following.
awk 'FNR==NR{arr[$1]=$2;next} ($4 in arr){print $0,arr[$4]}' file2 file1
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file2 is being read.
arr[$1]=$2 ##Creating array arr with index of $1 and value is $2.
next ##next will skip all further statements from here.
}
($4 in arr){ ##Checking condition if 4th field is in arr then do following.
print $0,arr[$4] ##Printing current line along with value of arr with 4th field as index number.
}
' file2 file1 ##Mentioning Input_file names here.
Bonus solution: in case you also want to print the non-matching lines, with N/A appended, then try the following.
awk 'FNR==NR{arr[$1]=$2;next} {print $0,(($4 in arr)?arr[$4]:"N/A")}' file2 file1

Find the durations and their average between the dataset in an interval in shell script

This is related to my older question Find the durations and their maximum between the dataset in an interval in shell script
I have a dataset as:
ifile.txt
2
3
2
3
2
20
2
0
2
0
0
2
1
2
5
6
7
0
3
0
3
4
5
I would like to find the durations and their averages between the 0 values, within intervals of 6 values.
My desire output is:
ofile.txt
6 5.33
1 2
1 2
1 2
5 4.2
1 3
3 4
Where
6 is the number of values until the next 0 within the first 6 values (i.e. 2,3,2,3,2,20), and 5.33 is their average;
1 is the number of values until the next 0 within the next 6 values (i.e. 2,0,2,0,0,2), and 2 is the average;
the next 1 2 lines fall within the same 6 values;
5 is the number of values until the next 0 within the next 6 values (i.e. 1,2,5,6,7,0), and 4.2 is their average;
And so on
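For clarity, here is the same input grouped into windows of 6 values, with the zero-delimited runs inside each window (my own restatement of the requirement; the empty run between the consecutive zeros in window 2 is the 0-count line filtered out at the end):
window 1: 2 3 2 3 2 20 -> one run of 6 values, average 32/6 = 5.33
window 2: 2 0 2 0 0 2  -> runs of 1 (avg 2), 1 (avg 2) and 1 (avg 2)
window 3: 1 2 5 6 7 0  -> one run of 5 values, average 21/5 = 4.2
window 4: 3 0 3 4 5    -> runs of 1 (avg 3) and 3 (avg 4)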
As per the answer to my previous question, I was trying with this:
awk '
$0!=0{
count++
sum=sum+$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file | awk '!/^ /' | awk '$1 != 0'
EDIT3: One more try. The 2nd set of 6 lines is 2 0 2 0 0 2, so its output should arguably be 1 2, 1 2, 0 0, 1 2; if that is the case (which I believe it ideally should be), then try the following.
awk '
{
occur++
}
{
count=$0!=0?++count:count
sum+=$0
}
$0==0 || occur==6{
printf("%d %0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=prev=prev_count=""
if(occur==6){
occur=""
}
}
END{
if(occur){
printf("%d %0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file | awk '$1 != 0'
Output will be as follows:
6 5.33
1 2.00
1 2.00
1 2.00
5 4.20
1 3.00
3 4.00
The EDITs below may help with similar kinds of problems that are a bit different from this actual problem, so I am keeping them here in the post.
EDIT2: In case you don't want to RESET the count whenever a zero occurs in the Input_file, then try the following. This continuously looks only at 6-line windows and will NOT RESET its count.
awk '
{
occur++
}
$0!=0{
count++
sum+=$0
found=prev_count=prev=""
}
$0==0 && occur!=6{
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=""
found=1
next
}
occur==6{
printf("%d,%0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
}
END{
if(!found){
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file
EDIT1: Could you please try the following; it is written and tested with the provided samples only.
awk '
{
occur++
}
$0!=0{
count++
sum+=$0
found=prev_count=prev=""
}
$0==0{
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
next
}
occur==6{
printf("%d,%0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
}
END{
if(!found){
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file
What the code takes care of:
It handles the case where two consecutive lines both have the value 0: in that case it prints the previous count and average values for that line.
It also takes care of edge cases like:
a- In case the input does NOT end with a 0, it checks via the found flag I created whether there are still values left to print.
b- In case the Input_file's last line number is NOT divisible by 6, this is also covered by the END block's logic of checking the found flag.
Explanation: adding a detailed explanation of the above code.
awk ' ##Starting awk program from here.
{
occur++ ##Incrementing occur with 1 on each line, to track the position inside the current 6-line window.
}
$0!=0{ ##Checking condition if a line is NOT having zero value then do following.
count++ ##Increment variable count with 1 each time it comes here.
sum+=$0 ##Creating variable sum and keep adding current line value in it.
found=prev_count=prev="" ##Nullifying variables found, prev_count, prev here.
} ##Closing BLOCK for condition $0!=0 here.
$0==0{ ##Checking condition if a line is having value zero then do following.
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
prev=count?sum/count:0 ##Creating variable prev which will be sum/count or zero in case count variable is NULL.
prev_count=count ##Creating variable prev_count whose value is count.
count=sum=occur="" ##Nullify variables count and sum here.
found=1 ##Setting value 1 to variable found here.
next ##next will skip all further statements from here.
} ##Closing BLOCK for condition $0==0 here.
occur==6{ ##Checking condition if 6 lines have been seen since the last reset, then do following.
printf("%d,%0.2f\n",count,count?sum/count:prev) ##Printing count and sum/count here, making sure the latter is NOT divided by 0.
prev=count?sum/count:0 ##Creating variable prev which will be sum/count or zero in case count variable is NULL.
prev_count=count ##Creating variable prev_count whose value is count.
count=sum=occur="" ##Nullifying variables count and sum here.
found=1 ##Setting value 1 to variable found here.
} ##Closing BLOCK for condition occur==6 here.
END{ ##Starting END block for this awk program here.
if(!found){ ##Checking condition if variable found is NULL then do following.
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
}
}
' Input_file ##Mentioning Input_file name here.
Another which didn't turn out quite the one-liner I was going for:
$ tail -n +2 file | # strip header with tail to not disturb NR
awk '
{
s+=$0 # sum them up
c++ # keep count
}
$0==0 || NR%6==0 { # act if zero or every 6th record
if($0==0) # remove zero effect on c
c--
if(s>0) # avoid division by zero
print c,s/c # output
s=c=0 # rinse and repeat
}
END { # the end-game if NR%6!=0
if($0==0)
c--
if(s>0)
print c,s/c
}'
Output:
6 5.33333
1 2
1 2
1 2
5 4.2
1 3
3 4
The tail removes the header from the file before feeding the data to awk; the idea is that the header won't count towards NR. If you don't have a header, just use awk ... file.
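For reference, tail -n +2 simply prints from the second line onward, so a one-line header is dropped before awk ever sees it (a made-up three-line input):
$ printf 'header\n1\n2\n' | tail -n +2
1
2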

awk key-value issue for array

I ran into an awk array issue; details below:
[~/temp]$ cat test.txt
1
2
3
4
1
2
3
Then I want to count the frequency of each number.
[~/temp]$ awk 'num[$1]++;END{for (i in num){printf("%s\t%-s\n", num[i],i)|"sort -r -n -k1"} }' test.txt
1
2
3
2 3
2 2
2 1
1 4
As you can see, the first 3 lines of output are '1 2 3' with no count value. Why?
Thanks for your answers.
An awk statement consists of a pattern and a related action. An omitted pattern matches every record of input. An omitted action is an alias for {print $0}, i.e. output the current record, which is what you are getting. Looking at the first part of your program:
$ awk 'num[$1]++' file
1
2
3
Let's change that a bit to understand what happens there:
$ awk '{print "NR:",NR,"num["$1"]++:",num[$1]++}' file
NR: 1 num[1]++: 0
NR: 2 num[2]++: 0
NR: 3 num[3]++: 0
NR: 4 num[4]++: 0
NR: 5 num[1]++: 1
NR: 6 num[2]++: 1
NR: 7 num[3]++: 1
Since you are using the postfix operator num[$1]++ in the pattern, on records 1-4 it evaluates to 0 before its value is incremented. The output would be different if you used the prefix operator ++num[$1], which first increments the variable and then evaluates it; that would lead to outputting every record of input, not just the last three records, which is what you were getting.
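To see the difference, run the prefix version on the same test.txt; the expression is already 1 (true) on the first occurrence of each value, so every record prints:
$ awk '++num[$1]' test.txt
1
2
3
4
1
2
3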
The correct way would have been to use num[$1]++ as an action, not as a pattern:
$ awk '{num[$1]++}' file
Put your "per line" part in {} i.e. { num[$1]++; }
awk programs a a collection of [pattern] { actions } (the pattern is optional, the {} is not). Seems that in your case your line is being treated as the pattern.
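Putting it together, moving num[$1]++ into an action block leaves only the END output (the order among equal counts may vary with your sort implementation):
$ awk '{num[$1]++} END{for (i in num){printf("%s\t%-s\n", num[i],i)|"sort -r -n -k1"}}' test.txt
2 3
2 2
2 1
1 4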

how to print 3rd field in 3rd column itself

In my file I have 3 fields. I want to print the third field in the third column only, but the output ends up in the first column. Please check my file and output:
cat filename
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1,2,3,4,5,5
q,w,e,r t,y,g,t,i 9,8,7,6,5,5
I'm using the following command to print the third field only in the third column
cat filename |awk '{print $3}' |tr ',' '\n'
The output prints the 3rd field's values in the 1st column's place; I want them printed in the 3rd column's area only.
first field :-
---------------
1
2
3
4
5
5
expected output
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Input
[akshay#localhost tmp]$ cat file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1,2,3,4,5,5
q,w,e,r t,y,g,t,i 9,8,7,6,5,5
Script
[akshay#localhost tmp]$ cat test.awk
NR<3 || !NF{ print; next}
{
split($0,D,/[^[:space:]]*/)
c1=sprintf("%*s",length($1),"")
c2=sprintf("%*s",length($2),"")
split($3,A,/,/)
for(i=1; i in A; i++)
{
if(i==2)
{
$1 = c1
$2 = c2
}
printf("%s%s%s%s%d\n",$1,D[2],$2,D[3],A[i])
}
}
Output
[akshay#localhost tmp]$ awk -f test.awk file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Explanation
NR<3 || !NF{ print; next}
NR gives you the number of the record currently being processed; in short, the NR variable holds the current line number.
NF gives you the total number of fields in a record.
The next statement forces awk to immediately stop processing the
current record and go on to the next record.
If the line number is less than 3, or NF is zero (meaning the record has no fields, i.e. a blank line), print the current record and go to the next record.
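In isolation, that guard behaves like this (a made-up four-line input with a blank third line):
$ printf 'header\n------\n\ndata\n' | awk 'NR<3 || !NF{ print "kept:", $0; next } { print "split:", $0 }'
kept: header
kept: ------
kept:
split: data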
split($0,D,/[^[:space:]]*/)
Since we are interested in preserving the formatting, we save the separators between fields in array D here. If you have GNU awk you can make use of the 4th arg of split(): it lets you split the line into 2 arrays, one of the fields and the other of the separators between the fields, and then you can just operate on the fields array and print, using the separators array between the field-array elements, to rebuild the original $0.
c1=sprintf("%*s",length($1),"") and c2=sprintf("%*s",length($2),"")
Here the sprintf function is used to build a string of spaces as long as the corresponding field ($1 or $2).
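In isolation, the dynamic-width trick looks like this (GNU awk and most modern awks accept %*s; strict POSIX does not guarantee it):
$ awk 'BEGIN{ s = sprintf("%*s", 7, ""); print "[" s "]" }'
[       ]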
split($3,A,/,/)
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array. The first piece
is stored in array[1], the second piece in array[2], and so forth. The
string value of the third argument, fieldsep, is a regexp describing
where to split string (much as FS can be a regexp describing where to
split input records). If fieldsep is omitted, the value of FS is used.
split() returns the number of elements created.
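A tiny GNU awk illustration of that 4th argument (the input string is made up):
$ echo 'a,b;c' | gawk '{ n = split($0, f, /[,;]/, seps); for (i = 1; i < n; i++) printf "%s[%s]", f[i], seps[i]; print f[n] }'
a[,]b[;]c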
Loop for as long as i in A is true. I just came to know that i=1 and i++ control the order of traversal of the array; thanks to Ed Morton.
if(i==2)
{
$1 = c1
$2 = c2
}
When i = 1 we print a,b,c,d and d,e,f,g,h; in the following iterations we replace $1 and $2 with the c1 and c2 we created above, since you want them shown only once, as requested.
printf("%s%s%s%s%d\n",$1,D[2],$2,D[3],A[i])
Finally, print field1 ($1), the separator between field1 and field2 that we saved above (that is, D[2]), field2 ($2), the separator between field2 and field3 (D[3]), and the elements of array A one at a time, which we created with split($3,A,/,/).
$ cat tst.awk
NR<3 || !NF { print; next }
{
front = gensub(/((\S+\s+){2}).*/,"\\1","")
split($3,a,/,/)
for (i=1;i in a;i++) {
print front a[i]
gsub(/\S/," ",front)
}
}
$ awk -f tst.awk file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
The above uses GNU awk for gensub(); with other awks, use match()+substr(). It also uses \S and \s as shorthand for [^[:space:]] and [[:space:]].
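For reference, a minimal portable sketch of the same idea using match()+substr() and the POSIX character classes (untested beyond the shown sample):
NR<3 || !NF { print; next }
{
    match($0, /^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/)  # first two fields plus trailing spaces
    front = substr($0, 1, RLENGTH)
    n = split($3, a, /,/)
    for (i = 1; i <= n; i++) {
        print front a[i]
        gsub(/[^[:space:]]/, " ", front)  # blank out front for the remaining lines
    }
}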
Considering the columns are tab separated, I would say:
awk 'BEGIN{FS=OFS="\t"}
NR<=2 || !NF {print; next}
NR>2{n=split($3,a,",")
for (i=1;i<=n; i++)
print (i==1?$1 OFS $2:"" OFS ""), a[i]
}' file
This prints the 1st, 2nd and empty lines normally.
Then, it slices the 3rd field using the comma as separator.
Finally, it loops through the pieces, printing one per line; the first two columns are printed the first time only, then just the value.
Test
$ awk 'BEGIN{FS=OFS="\t"} NR<=2 || !NF {print; next} NR>2{n=split($3,a,","); for (i=1;i<=n; i++) print (i==1?$1 OFS $2:"" OFS ""), a[i]}' a
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Note the output is a bit ugly, since separating the columns with tabs aligns them like this.

write a two column file from two files using awk

I have two files of one column each
1
2
3
and
4
5
6
I want to write them into a single file with both columns, as:
1 4
2 5
3 6
I think it should be really simple with awk.
You could try paste -d ' ' <file1> <file2>. (Without -d ' ' the delimiter would be tab.)
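For the sample files above, that gives exactly the desired output:
$ paste -d ' ' file1 file2
1 4
2 5
3 6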
paste works okay for the example given, but it doesn't handle variable-length lines very well. A nice little-known coreutil, pr, provides a more flexible solution:
$ pr -mtw 4 file1 file2
1 4
2 5
3 6
A variable length example:
$ pr -mtw 22 file1 file2
10 4
200 5
300,000,00 6
And since you asked about awk here is one way:
$ awk '{a[FNR]=a[FNR]$0" "}END{for(i=1;i<=length(a);i++)print a[i]}' file1 file2
1 4
2 5
3 6
Using awk
awk 'NR==FNR { a[FNR]=$0;next } { print a[FNR],$0 }' file{1,2}
Explanation:
NR==FNR ensures our first action block runs for the first file only.
a[FNR]=$0 inserts the first file into array a, indexed by line number.
Once the first file is complete, we move to the second action block.
There we print each stored line of the first file along with the corresponding line of the second file.
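Running it on the sample files:
$ awk 'NR==FNR { a[FNR]=$0;next } { print a[FNR],$0 }' file{1,2}
1 4
2 5
3 6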
