Awk column value in file 1 is in the range of two columns in file 2 - linux

I modified the question based on the comments.
I would like to match two files: if $4 in file 1 falls within the range [$3, $4] of a line in file 2, I would like to print the file 1 line together with $6 from file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorting based on $4 of file 1).
File 1:
1 rs537182016 0 8674590 A C
1 rs575272151 0 69244805 G C
1 rs544419019 0 69244469 G C
1 rs354682 0 1268900 G C
File 2:
18 16 8674587 8784575 + ABAT
10349 17 69148007 69244815 - ABCA10
23461 17 69244435 69327182 - ABCA5
Output:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
I tried the following based on previous answers, but it did not work. The output is an empty file.
awk 'FNR == NR { val[$1] = $4 }
FNR != NR { if ($1 in val && val[$1] >= $3 && val[$1] <= $4)
print $1, val[$1], $6
}' file1 file2 > file3

Assumptions:
in the case of multiple matches OP has stated we only use the 'first' match; OP hasn't (yet) defined 'first' so I'm going to assume it means the order in which lines appear in file2 (aka the line number)
One awk idea:
awk '
FNR==NR { min[++n]=$3 # save 1st file values in arrays; use line number as index
max[n]=$4
col6[n]=$6
next
}
{ for (i=1;i<=n;i++) # loop through 1st file entries
if (min[i] <= $4 && $4 <= max[i]) { # if we find a range match then ...
print $0, col6[i] # print current line plus column #6 from 1st file and then ...
next # skip to next line of input; this keeps us from matching on additional entries from 1st file
}
print $0, "NA" # if we got here there was no match so print current line plus "NA"
}
' file2 file1
NOTE: pay attention to the order of the input files; the older answer (below) was based on an input of file1 file2, while this answer requires the order of the input files to be flipped, ie, file2 file1
This generates:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
NOTE: the following is based on OP's original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid ...
Assumptions:
in file1 both rs575272151 / 69244805 and rs544419019 / 69244469 match 2 different (overlapping) ranges from file2 but OP has only shown one set of matches in the expected output; from this I'm going to assume ...
once a match is found for an entry from file1, remove said entry from any additional matching; this will eliminate multiple matches for file1 entries
once a match is found for a line from file2 then stop looking for matches for that line (ie, go to the next input line from file2); this will eliminate multiple matches for file2
OP has not provided any details on how to determine which of multiple matches to keep so we'll use the first match we find
One awk idea:
awk '
FNR==NR { val[$2]=$4; next }
{ for (i in val) # loop through list of entries from 1st file ...
if ($3 <= val[i] && val[i] <= $4) { # looking for a range match and if found ...
print $0,i # print current line plus 2nd field from 1st file and then ...
delete val[i] # remove 1st file entry from further matches and ...
next # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
}
}
' file1 file2
This generates:
18 16 8674587 8784575 + ABAT rs537182016
10349 17 69148007 69244815 - ABCA10 rs575272151
23461 17 69244435 69327182 - ABCA5 rs544419019
NOTES:
the for (i in val) construct is not guaranteed to process the array entries in any particular order; the net result is that where there are multiple matches we simply match on the 'first' array entry awk happens to provide; if this 'random' nature of for (i in val) is not acceptable then OP will need to update the question with additional details on how to handle multiple matches
for this particular case we actually generate the same output as expected by OP, but the assignments of rs575272151 and rs544419019 could just as easily be reversed (due to the nature of the for (i in val) construct)
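If a deterministic order is needed, GNU awk can impose one on the for (i in val) loop via PROCINFO["sorted_in"]; a minimal sketch (gawk only, here using ascending string order of the rs-id indices as an assumed tie-break):
awk '
FNR==NR { val[$2]=$4; next }
{ PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: visit array indices in ascending string order
  for (i in val)
    if ($3 <= val[i] && val[i] <= $4) { print $0, i; delete val[i]; next }
}
' file1 file2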

Related

Find the durations and their average between the dataset in an interval in shell script

This is related to my older question Find the durations and their maximum between the dataset in an interval in shell script
I have a dataset as:
ifile.txt
2
3
2
3
2
20
2
0
2
0
0
2
1
2
5
6
7
0
3
0
3
4
5
I would like to find the different durations, and their averages, between the 0 values within each interval of 6 values.
My desire output is:
ofile.txt
6 5.33
1 2
1 2
1 2
5 4.2
1 3
3 4
Where
6 is the number of counts until the next 0 within the first 6 values (i.e. 2,3,2,3,2,20) and 5.33 is the average of those values;
1 is the number of counts until the next 0 within the next 6 values (i.e. 2,0,2,0,0,2) and 2 is the average;
the next 1 and 2 pairs are within the same 6 values;
5 is the number of counts until the next 0 within the next 6 values (i.e. 1,2,5,6,7,0) and 4.2 is the average of those values;
And so on
As per the answer to my previous question, I was trying with this:
awk '
$0!=0{
count++
sum=sum+$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file | awk '!/^ /' | awk '$1 != 0'
EDIT3: One more try. The 2nd set of 6 lines is 2 0 2 0 0 2, so its output should be 1 2, 1 2, 0 0, 1 2; if that is the case (which I believe it ideally should be), then try the following.
awk '
{
occur++
}
{
if($0!=0){ count++ }
sum+=$0
}
$0==0 || occur==6{
printf("%d %0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=""
if(occur==6){
occur=""
}
}
END{
if(occur){
printf("%d %0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file | awk '$1 != 0'
Output will be as follows:
6 5.33
1 2.00
1 2.00
1 2.00
5 4.20
1 3.00
3 4.00
The EDITs below may help with similar problems that are slightly different from the actual problem here, so I am keeping them in the post.
EDIT2: In case you don't want to RESET the 6-line window whenever a zero occurs in the Input_file, then try the following. It always looks at fixed blocks of 6 lines and does NOT RESET that line count on a zero.
awk '
{
occur++
}
$0!=0{
count++
sum+=$0
found=prev_count=prev=""
}
$0==0 && occur!=6{
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=""
found=1
next
}
occur==6{
printf("%d,%0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
}
END{
if(!found){
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file
EDIT1: Could you please try the following; it is tested and written with the provided samples only.
awk '
{
occur++
}
$0!=0{
count++
sum+=$0
found=prev_count=prev=""
}
$0==0{
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
next
}
occur==6{
printf("%d,%0.2f\n",count,count?sum/count:prev)
prev=count?sum/count:0
prev_count=count
count=sum=occur=""
found=1
}
END{
if(!found){
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev)
}
}
' Input_file
What the code takes care of:
If any 2 consecutive lines have a 0 value, it prints the previous count and average values for that line.
This also takes care of edge cases like:
a- In case the input does NOT end with a 0, it checks via the found flag whether there are still values to print.
b- In case the Input_file's number of lines is NOT divisible by 6, that case is also covered by the END block's check of the found flag.
Explanation: Adding a detailed explanation for above code.
awk ' ##Starting awk program from here.
{
occur++
}
$0!=0{ ##Checking condition if a line is NOT having zero value then do following.
count++ ##Increment variable count with 1 each time it comes here.
sum+=$0 ##Creating variable sum and keep adding current line value in it.
found=prev_count=prev="" ##Nullifying variables found, prev_count, prev here.
} ##Closing BLOCK for condition $0!=0 here.
$0==0{ ##Checking condition if a line is having value zero then do following.
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
prev=count?sum/count:0 ##Creating variable prev which will be sum/count or zero in case count variable is NULL.
prev_count=count ##Creating variable prev_count whose value is count.
count=sum=occur="" ##Nullifying variables count, sum and occur here.
found=1 ##Setting value 1 to variable found here.
next ##next will skip all further statements from here.
} ##Closing BLOCK for condition $0==0 here.
occur==6{ ##Checking if current line is fully divided with 6 then do following.
printf("%d,%0.2f\n",count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
prev=count?sum/count:0 ##Creating variable prev which will be sum/count or zero in case count variable is NULL.
prev_count=count ##Creating variable prev_count whose value is count.
count=sum=occur="" ##Nullifying variables count, sum and occur here.
found=1 ##Setting value 1 to variable found here.
} ##Closing BLOCK for condition occur==6 here.
END{ ##Starting END block for this awk program here.
if(!found){ ##Checking condition if variable found is NULL then do following.
printf("%d,%0.2f\n",count?count:prev_count,count?sum/count:prev) ##Printing count and count/sum here, making sure later is NOT getting divided by 0 too.
}
}
' Input_file ##Mentioning Input_file name here.
Another which didn't turn out quite the one-liner I was going for:
$ tail -n +2 file | # strip header with tail to not disturb NR
awk '
{
s+=$0 # sum them up
c++ # keep count
}
$0==0 || NR%6==0 { # act if zero or every 6th record
if($0==0) # remove zero effect on c
c--
if(s>0) # avoid division by zero
print c,s/c # output
s=c=0 # rinse and repeat
}
END { # the end-game if NR%6!=0
if($0==0)
c--
if(s>0)
print c,s/c
}'
Output:
6 5.33333
1 2
1 2
1 2
5 4.2
1 3
3 4
The tail removes the header from the file before feeding the data to awk; the idea is that the header won't show up in NR. If you don't have a header, just use awk ... file.
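If you'd rather not use tail, here is a sketch of the same idea with the offset handled inside awk (assuming exactly one header line):
awk '
NR==1 { next }               # skip the header, but NR still counts it ...
{
  s+=$0                      # sum them up
  c++                        # keep count
}
$0==0 || (NR-1)%6==0 {       # ... so shift NR by one to realign the 6-record window
  if($0==0)
    c--
  if(s>0)
    print c,s/c
  s=c=0
}
END {
  if($0==0)
    c--
  if(s>0)
    print c,s/c
}' file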

Sum of 2nd and 3rd column for same value in 1st column

I want to sum the values in the 2nd and 3rd columns for rows with the same value in the 1st column:
1555971000 6 1
1555971000 0 2
1555971300 2 0
1555971300 3 0
Output would be like
1555971000 6 3
1555971300 5 0
I have tried the command below:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
but this seems to be for only one column.
Here is another way, reading the Input_file 2 times; it will provide the output in the same sequence as the Input_file:
awk 'FNR==NR{a[$1]+=$2;b[$1]+=$3;next} ($1 in a){print $1,a[$1],b[$1];delete a[$1]}' Input_file Input_file
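The same logic broken out with comments for readability:
awk '
FNR==NR   { a[$1]+=$2; b[$1]+=$3; next }   # 1st pass: accumulate both sums per key
($1 in a) { print $1, a[$1], b[$1]         # 2nd pass: print the totals on first sight of a key ...
            delete a[$1] }                 # ... then delete it so duplicate keys are skipped
' Input_file Input_file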
If the data in 'd' is already grouped, without sorting (tried on GNU awk):
awk 'BEGIN{f=1} {if($1==a||f){b+=$2;c+=$3;f=0} else{print a,b,c;b=$2;c=$3} a=$1} END{print a,b,c}' d
With sorting, GNU awk:
awk '{w[NR]=$0} END{asort(w);f=1;for(;i++<NR;){split(w[i],v);if(v[1]==a||f){f=0;b+=v[2];c+=v[3]} else{print a,b,c;b=v[2];c=v[3];} a=v[1]} print a,b,c;}' d
You can do it with awk by first saving the fields of the first record, and then, for each subsequent record, checking whether the first field matches; if so, add the contents of fields two and three and continue. If the first field fails to match, output the saved first field and the running sums, e.g.
awk '{
if ($1 == a) {
b+=$2; c+=$3;
}
else {
print a, b, c; a=$1; b=$2; c=$3;
}
} END { print a, b, c; }' file
With your input in file, you can copy and paste the foregoing into your terminal and obtain the following:
Example Use/Output
$ awk '{
> if ($1 == a) {
> b+=$2; c+=$3;
> }
> else {
> print a, b, c; a=$1; b=$2; c=$3;
> }
> } END { print a, b, c; }' file
1555971000 6 3
1555971300 5 0
Using awk Arrays
A shorter more succinct alternative using arrays that does not require your input to be in sorted order would be:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
(same output)
Using arrays allows the summing of columns for like field1 to work equally well if your data file contained the following lines in random order, e.g.
1555971300 2 0
1555971000 0 2
1555971000 6 1
1555971300 3 0
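Note that for (i in a) visits the keys in an unspecified order, so the output order is arbitrary; if you need it sorted numerically by the first column, one portable option is to pipe through sort, e.g.:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file | sort -k1,1n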
Another awk that works regardless of the order of the records, whether or not they are sorted:
awk '{r[$1]++}
r[$1]==1{o[++c]=$1}
{f[$1]+=$2;s[$1]+=$3}
END{for(i=1;i<=c;i++){print o[i],f[o[i]],s[o[i]]}}' file
Assuming when you wrote:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
you meant to write:
awk '{ b[$1]+=$2 } END{ for (i in b) print i,b[i] }'
It shouldn't be a huge leap to figure out:
$ awk '{ b[$1]+=$2; c[$1]+=$3 } END{ for (i in b) print i,b[i],c[i] }' file
1555971000 6 3
1555971300 5 0
Please get the book "Effective Awk Programming", 4th Edition, by Arnold Robbins and just read a paragraph or 2 about fields and arrays.

How to print the last column from a specific row?

I've got a datafile which more or less looks like this:
*** some text ***
Results
1 50
2 -75
3 80
*** some text ***
What I'd like to do is:
Find the line that contains the string "Results".
List those three results but only the last column is significant.
Show only the positive ones.
I was trying to solve my problem with awk command which for each result looks like this:
res1=$(awk '/Results/{nr[NR+1]}; NR in nr' datafile | awk '{print $NF}')
I hoped to get the first positive results by:
if [ $res1 -gt 0 ]; then
echo "$res1"
fi
But instead of the expected result I got the error Integer expression expected, which leads to the conclusion that the variable res1 isn't a numeric value. Any idea how to define it properly?
Something like this might work:
$ awk '$0 == "Results" { f = 3; next } f && f-- && $NF > 0 { print $NF }' input
50
80
Basically, the variable f is set to 3 when the line Results is passed.
Then the last column of each following line is printed as long as f > 0 and $NF > 0 ($NF is the last column).
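The f && f-- countdown is a general idiom for acting on the N lines after a match; without the positive-value filter it reduces to:
$ awk '$0 == "Results" { f = 3; next } f && f--' input
1 50
2 -75
3 80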

split text file (Genome data) based on column values keeping header line

I have a big genome data file (.txt) in the format below. I would like to split it based on the chromosome column (chr1, chr2..chrX, chrY and so forth), keeping the header line in all split files. How can I do this using unix/linux commands?
genome data
variantId chromosome begin end
1 1 33223 34343
2 2 44543 46444
3 2 55566 59999
4 3 33445 55666
result
file.chr1.txt
variantId chromosome begin end
1 1 33223 34343
file.chr2.txt
variantId chromosome begin end
2 2 44543 46444
3 2 55566 59999
file.chr3.txt
variantId chromosome begin end
4 3 33445 55666
Is this data for the human genome (i.e. always 46 chromosomes)? If so, how's this:
for chr in $(seq 1 46)
do
head -n1 data.txt >chr$chr.txt
done
awk 'NR != 1 { print $0 >>("chr"$2".txt") }' data.txt
(This is a second edit, based on #Sasha's comment above.)
Note that the parens around ("chr"$2".txt") are apparently not needed on GNU awk, but they are on my OS X version of awk.
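A single-pass alternative that only creates files for chromosomes actually present in the data (a sketch; note that awks other than GNU awk may limit how many output files can be open at once):
awk 'NR==1 { hdr = $0; next }
     !seen[$2]++ { print hdr > ("chr" $2 ".txt") }   # first record for this chromosome: write the header
     { print > ("chr" $2 ".txt") }' data.txt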

how to print 3rd field in 3rd column itself

In my file I have 3 fields. I want to print the third field, one value per line, in the third column only, but the output ends up in the first column. Please check my file and output:
cat filename
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1,2,3,4,5,5
q,w,e,r t,y,g,t,i 9,8,7,6,5,5
I'm using the following command to print the third field only in the third column
cat filename |awk '{print $3}' |tr ',' '\n'
The OUTPUT is printing the 3rd field's values in the 1st field's place; I want them printed only in the 3rd field's area:
first field :-
---------------
1
2
3
4
5
5
expected output
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Input
[akshay#localhost tmp]$ cat file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1,2,3,4,5,5
q,w,e,r t,y,g,t,i 9,8,7,6,5,5
Script
[akshay#localhost tmp]$ cat test.awk
NR<3 || !NF{ print; next}
{
split($0,D,/[^[:space:]]*/)
c1=sprintf("%*s",length($1),"")
c2=sprintf("%*s",length($2),"")
split($3,A,/,/)
for(i=1; i in A; i++)
{
if(i==2)
{
$1 = c1
$2 = c2
}
printf("%s%s%s%s%d\n",$1,D[2],$2,D[3],A[i])
}
}
Output
[akshay#localhost tmp]$ awk -f test.awk file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Explanation
NR<3 || !NF{ print; next}
NR gives you the total number of records processed so far; in short, the NR variable holds the current line number.
NF gives you the total number of fields in a record.
The next statement forces awk to immediately stop processing the
current record and go on to the next record.
If the line number is less than 3, or NF is zero (meaning the record has no fields, i.e. a blank line), print the current record and go to the next record.
split($0,D,/[^[:space:]]*/)
Since we are interested in preserving the formatting, we save the separators between fields in array D here. If you have GNU awk you can make use of the 4th arg of split() - it lets you split the line into 2 arrays, one of the fields and the other of the separators between the fields; then you can just operate on the fields array and print the separators-array element between each fields-array element to rebuild the original $0.
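A minimal sketch of that GNU-awk variant (gawk only):
{
  n = split($0, flds, /[[:space:]]+/, seps)   # gawk: seps[i] is the separator after flds[i]
  rebuilt = ""
  for (i = 1; i <= n; i++)
    rebuilt = rebuilt flds[i] seps[i]         # glue fields back with their original separators
  # rebuilt == $0 here, so flds[] can be edited and reprinted with the original spacing
}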
c1=sprintf("%*s",length($1),"") and c2=sprintf("%*s",length($2),"")
Here sprintf is used to build a string of spaces the same length as the field ($1 or $2).
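For example, with $1 being "a,b,c,d" (length 7):
c1 = sprintf("%*s", length($1), "")   # "*" takes the width (7) from the argument list,
                                      # so c1 becomes a 7-character string of spaces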
split($3,A,/,/)
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array. The first piece
is stored in array[1], the second piece in array[2], and so forth. The
string value of the third argument, fieldsep, is a regexp describing
where to split string (much as FS can be a regexp describing where to
split input records). If fieldsep is omitted, the value of FS is used.
split() returns the number of elements created.
Loop for as long as i in A is true; the i=1 and i++ control the order of traversal of the array (thanks to Ed Morton for pointing that out).
if(i==2)
{
$1 = c1
$2 = c2
}
When i is 1 we print a,b,c,d and d,e,f,g,h; in the following iterations we replace the $1 and $2 values with the c1 and c2 we created above, since you want them shown only once, as requested.
printf("%s%s%s%s%d\n",$1,D[2],$2,D[3],A[i])
Finally print field1 ($1), the separator between field1 and field2 that we saved above (D[2]), field2 ($2), the separator between field2 and field3 (D[3]), and the elements of array A one by one, which we created with split($3,A,/,/).
$ cat tst.awk
NR<3 || !NF { print; next }
{
front = gensub(/((\S+\s+){2}).*/,"\\1","")
split($3,a,/,/)
for (i=1;i in a;i++) {
print front a[i]
gsub(/\S/," ",front)
}
}
$ awk -f tst.awk file
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
The above uses GNU awk for gensub(); with other awks use match()+substr(). It also uses \S and \s shorthand for [^[:space:]] and [[:space:]].
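A sketch of that match()+substr() replacement for the gensub() line (POSIX awk; the later gsub(/\S/...) would likewise become gsub(/[^[:space:]]/...)):
match($0, /([^[:space:]]+[[:space:]]+){2}/)   # first two fields plus their trailing spaces
front = substr($0, RSTART, RLENGTH)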
Considering the columns are tab separated, I would say:
awk 'BEGIN{FS=OFS="\t"}
NR<=2 || !NF {print; next}
NR>2{n=split($3,a,",")
for (i=1;i<=n; i++)
print (i==1?$1 OFS $2:"" OFS ""), a[i]
}' file
This prints the 1st, 2nd and empty lines normally.
Then, it slices the 3rd field using the comma as separator.
Finally, it loops through the pieces, printing each one on its own line; it prints the first two columns the first time, then just the value.
Test
$ awk 'BEGIN{FS=OFS="\t"} NR<=2 || !NF {print; next} NR>2{n=split($3,a,","); for (i=1;i<=n; i++) print (i==1?$1 OFS $2:"" OFS ""), a[i]}' a
1st field 2nd field 3rd field
--------- --------- -----------
a,b,c,d d,e,f,g,h 1
2
3
4
5
5
q,w,e,r t,y,g,t,i 9
8
7
6
5
5
Note the output is a bit ugly, since tab-separating the columns aligns them like this.
