AWK field substitution based on lookup table - linux

I am trying to replace values in column 1 of file1 using a lookup table. A sample (tab separated):
chr1 1243 A T 0.14
chr5 1432 G C 0.0006
chr10 731 T C 0.9421
chr11 98234 T G .000032
chr12 1284 A T 0.93428
chr17 941 G T 0.1111
chr19 134325 T C 0.00001
chr21 9824 T C 0.9
Lookup table:
chr1 NC_000001.11
chr2 NC_000002.12
chr3 NC_000003.12
chr4 NC_000004.12
chr5 NC_000005.10
chr6 NC_000006.12
chr7 NC_000007.14
chr8 NC_000008.11
chr9 NC_000009.12
chr10 NC_000010.11
chr11 NC_000011.10
chr12 NC_000012.12
chr13 NC_000013.11
chr14 NC_000014.9
chr15 NC_000015.10
chr16 NC_000016.10
chr17 NC_000017.11
chr18 NC_000018.10
chr19 NC_000019.10
chr20 NC_000020.11
chr21 NC_000021.9
chr22 NC_000022.11
script being used:
awk 'FNR==NR{a[$1]=$2;next} {for (i in a)sub(i,a[i]);print' lookup.txt file1 > new_table.txt
output with comment on which line is correct/incorrect (with right answer in brackets):
NC_000001.11 1243 A T 0.14 #correct
NC_000005.10 1432 G C 0.0006 #correct
NC_000001.110 731 T C 0.9421 #incorrect (NC_000010.11)
NC_000001.111 98234 T G .000032 #incorrect (NC_000011.10)
NC_000012.12 1284 A T 0.93428 #correct
NC_000001.117 941 G T 0.1111 #incorrect (NC_000017.11)
NC_000001.119 134325 T C 0.00001 #incorrect (NC_000019.10)
NC_000021.9 9824 T C 0.9 #correct
I don't understand the pattern of why it isn't working and would welcome any help with the awk script. I thought it was just those with double digits e.g. chr17 but then chr21 seems to work fine.
Many thanks

Shouldn't it be:
awk 'FNR==NR{a[$1]=$2;next}{$1=a[$1]}1' lookup.txt file1
?
Output:
NC_000001.11 1243 A T 0.14
NC_000005.10 1432 G C 0.0006
NC_000010.11 731 T C 0.9421
NC_000011.10 98234 T G .000032
NC_000012.12 1284 A T 0.93428
NC_000017.11 941 G T 0.1111
NC_000019.10 134325 T C 0.00001
NC_000021.9 9824 T C 0.9
Explanation:
# true as long as we are reading the first file, lookup.txt
FNR==NR {
# create a lookup array 'a' indexed by field 1 of lookup txt
a[$1]=$2
# don't process further actions
next
}
# because of the 'next' statement above, this will be only executed
# when we are processing the second file, file1
{
# translate field 1. use the value from the lookup array
$1=a[$1]
}
# always true. print the line
1
PS: If there's the possibility that entries can't be found in the lookup table, you could use a special text for them:
awk 'FNR==NR{a[$1]=$2;next}{$1=($1 in a)?a[$1]:"NOT FOUND "$1}1' lookup.txt file1
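Since the sample is described as tab separated, note that assigning to $1 makes awk rebuild the line with OFS, which is a single space by default. A minimal sketch, assuming both files really are tab-delimited, that keeps the tabs in the output and leaves unknown names untouched (the cut-down sample files, the extra chrX row, and the temporary directory `tmp` are made up for illustration):

```shell
tmp=$(mktemp -d)
printf 'chr1\tNC_000001.11\nchr10\tNC_000010.11\n' > "$tmp/lookup.txt"
printf 'chr1\t1243\tA\tT\t0.14\nchr10\t731\tT\tC\t0.9421\nchrX\t5\tA\tG\t0.5\n' > "$tmp/file1"

awk 'BEGIN { FS = OFS = "\t" }          # read and write tab-separated fields
     FNR == NR { a[$1] = $2; next }     # first file: build the lookup array
     ($1 in a) { $1 = a[$1] }           # second file: translate known names only
     1' "$tmp/lookup.txt" "$tmp/file1" > "$tmp/new_table.txt"

cat "$tmp/new_table.txt"
```

Rows whose first field is missing from the lookup table (chrX here) pass through unchanged instead of ending up with an empty first column.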

I believe sub is the problem in the OP's attempt (I have not checked thoroughly). This can be done simply with:
awk 'FNR==NR{arr[$1]=$2;next} ($1 in arr){$1=arr[$1];print}' lookup_table Input_file
Problem with the OP's attempt (for understanding only; do NOT run this to produce the desired output): the OP's code as posted does not look complete, so to dig out why it gives the wrong output I have rewritten it as follows.
awk 'FNR==NR{a[$1]=$2;next} {for (i in a){line=$0;if(sub(i,a[i])){print (Previous line)line">>>(array key)"i"....(array value)"a[i]"............(new line)"$0}}}' lookup_table Input_file
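It prints a line only when a substitution actually happens, which shows what is going wrong with the OP's code. (The original line appears twice at the start of each output line because Previous is an unset variable, so (Previous line)line is just line concatenated with itself.)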
chr1 1243 A T 0.14 chr1 1243 A T 0.14 >>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.11 1243 A T 0.14
chr5 1432 G C 0.0006chr5 1432 G C 0.0006>>>(array key)chr5....(array value)NC_000005.10............(new line)NC_000005.10 1432 G C 0.0006
chr10 731 T C 0.9421chr10 731 T C 0.9421>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.110 731 T C 0.9421
chr11 98234 T G .000032chr11 98234 T G .000032>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.111 98234 T G .000032
chr12 1284 A T 0.93428chr12 1284 A T 0.93428>>>(array key)chr12....(array value)NC_000012.12............(new line)NC_000012.12 1284 A T 0.93428
chr17 941 G T 0.1111chr17 941 G T 0.1111>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.117 941 G T 0.1111
chr19 134325 T C 0.00001chr19 134325 T C 0.00001>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.119 134325 T C 0.00001
chr21 9824 T C 0.9chr21 9824 T C 0.9>>>(array key)chr21....(array value)NC_000021.9............(new line)NC_000021.9 9824 T C 0.9
Here we can easily see that the old line chr1 1243 A T 0.14 becomes NC_000001.11 1243 A T 0.14 because the array key (chr1) is substituted with the array value (NC_000001.11). The same key chr1 also matches the start of chr10, chr11, chr17 and chr19, which is why those lines end up as NC_000001.110, NC_000001.111 and so on.
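The root cause can be reproduced in isolation: sub() treats its first argument as an unanchored regular expression, so the key chr1 also matches the beginning of chr10, and for (i in a) visits keys in an unspecified order. A small sketch (the variable names `bad` and `good` are only for illustration):

```shell
# Unanchored: the regex chr1 matches the first four characters of chr10.
bad=$(echo 'chr10 731 T C 0.9421' | awk '{ sub("chr1", "NC_000001.11") } 1')
echo "$bad"

# Comparing the whole field as a string avoids partial matches.
good=$(echo 'chr10 731 T C 0.9421' | awk '$1 == "chr1" { $1 = "NC_000001.11" } 1')
echo "$good"
```

The first command reproduces the NC_000001.110 result from the question; the second leaves the chr10 line alone, which is the behaviour an array lookup on $1 gives.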

It looks like sub is causing the issue, so simply prepend the lookup value for $1 to the line, followed by a space, and print the line with the shorthand 1. Note that this keeps the original chr name as the second column:
awk 'FNR==NR{a[$1]=$2;next} {$0=a[$1]" "$0 }1' lookup.txt file1 > new_table.txt

Related

How do I merge space-separated files of unequal length (inner join) using join in linux?

I have a simple problem: I have a space separated file and want to add a column from another space separated file, but this second file is longer. I want to perform an inner join (so only add the column and not rows). I want to do it with linux join (for efficiency reasons). I have seen similar questions but as I'm only a beginner I can't distill the information I need and apply it to my case.
I removed headers and sorted on the key column (first column in both files). I checked for duplicate keys (there were none).
join -1 1 -1 1 <(sort -k1 file1) <(sort -k1 file2) > file3
File 1:
rs1248851 C 655 0.7666 -0.8358 0.4033
rs1248857 G 654 1.069 0.4283 0.6684
rs1248860 G 656 1.052 0.3234 0.7464
rs12488651 G 652 1.246 1.343 0.1792
rs1248865 C 649 0.7419 -0.9125 0.3615
rs1248866 C 649 0.7696 -0.8053 0.4207
rs1248868 C 649 0.7717 -0.8317 0.4056
rs1248869 T 647 0.7878 -0.766 0.4437
File 2:
rs1248851 G
rs1248857 A
rs1248858 C
rs1248859 C
rs1248860 A
rs1248861 T
rs12488651 T
rs1248865 G
rs1248866 G
rs1248867 G
rs1248868 T
rs1248869 C
Expected result File 3:
rs1248851 C 655 0.7666 -0.8358 0.4033 G
rs1248857 G 654 1.069 0.4283 0.6684 A
rs1248860 G 656 1.052 0.3234 0.7464 A
rs12488651 G 652 1.246 1.343 0.1792 T
rs1248865 C 649 0.7419 -0.9125 0.3615 G
rs1248866 C 649 0.7696 -0.8053 0.4207 G
rs1248868 C 649 0.7717 -0.8317 0.4056 T
rs1248869 T 647 0.7878 -0.766 0.4437 C
Actual resulting error message:
join: /dev/fd/63:5: is not sorted: rs1248865 C 649 0.7419 -0.9125 0.3615
join: /dev/fd/62:8: is not sorted: rs1248865 G
The join command on linux seems more verbose, but still results in a file3 file with the expected results. join on macOS does not complain.
If you still want to use something else, you could try filtering file2 based on the keys present in file1, like this:
for i in $(cut -f1 -d' ' file1); do grep "^$i " file2 >> file2.filtered; done
And then use the original join you already had:
join -1 1 -1 1 <(sort -k1 file1) <(sort -k1 file2.filtered) > file3
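A likely cause of the "is not sorted" message is that sort -k1 uses a key running from field 1 to the end of the line, compared under the current locale, whereas join compares only the join field. Because rs1248865 is a prefix of rs12488651, the two orderings can disagree. A sketch, assuming cut-down copies of the sample files written to a temporary directory (`tmp` and the `.sorted` file names are made up), that restricts the sort key to field 1 and pins the locale for every command:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'rs1248860 G 656' 'rs12488651 G 652' 'rs1248865 C 649' > "$tmp/file1"
printf '%s\n' 'rs1248859 C' 'rs1248860 A' 'rs12488651 T' 'rs1248865 G' > "$tmp/file2"

# -k1,1 limits the key to the first field; LC_ALL=C gives plain byte order.
LC_ALL=C sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join on field 1 of both files; only keys present in both are printed.
LC_ALL=C join -1 1 -2 1 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"

cat "$tmp/file3"
```

The result comes out in byte order of the key, so rs1248865 is listed before rs12488651.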
Not sure about your logic with join or why it complains, but ...
awk 'NR==FNR{a[$1]=$0};NR!=FNR{if($1 in a){print a[$1],$2}}' file1 file2
rs1248851 C 655 0.7666 -0.8358 0.4033 G
rs1248857 G 654 1.069 0.4283 0.6684 A
rs1248860 G 656 1.052 0.3234 0.7464 A
rs12488651 G 652 1.246 1.343 0.1792 T
rs1248865 C 649 0.7419 -0.9125 0.3615 G
rs1248866 C 649 0.7696 -0.8053 0.4207 G
rs1248868 C 649 0.7717 -0.8317 0.4056 T
rs1248869 T 647 0.7878 -0.766 0.4437 C
Let's break the awk down: while NR (the record number across all input) equals FNR (the record number within the current file), we are reading the first file, and we save each line in an array indexed by its first column.
When we reach the second file, and its first column can be found in our previously created array, we print the line from the first file followed by the second column from the second file.

How to compare columns from different files and replace the value in one column?

I have two files:
input1
22 rs145072688 14431347 C G 0.3418 0.648 0.830 0.516 0.506 0.497 0.785 0.586
22 rs201725126 14432618 G A 0.8119 1.571 1.748 1.661 1.384 1.374 1.614 1.718
22 rs200579949 14433624 G A 0.8598 1.590 1.669 1.763 1.754 1.832 1.627 1.250
22 rs75454623 14433659 C A 0.7888 1.564 1.606 1.667 1.355 1.619 1.692 1.775
22 rs199856693 14433758 G A 0.9354 1.807 1.936 1.906 1.847 1.929 1.734 1.327
22 rs9604721 14434713 C T 0.9723 1.984 1.984 1.984 1.984 1.984 1.878 1.412
input2
rs145072688:10352:T:TA rs145072688
rs201725126:13116:T:G rs201725126
rs200579949:13118:A:G rs200579949
rs75454623:14930:A:G rs75454623
rs199856693:14933:G:A rs199856693
desired output:
22 rs145072688:10352:T:TA 14431347 C G 0.3418 0.648 0.830 0.516 0.506 0.497 0.785 0.586
22 rs201725126:13116:T:G 14432618 G A 0.8119 1.571 1.748 1.661 1.384 1.374 1.614 1.718
22 rs200579949:13118:A:G 14433624 G A 0.8598 1.590 1.669 1.763 1.754 1.832 1.627 1.250
22 rs75454623:14930:A:G 14433659 C A 0.7888 1.564 1.606 1.667 1.355 1.619 1.692 1.775
22 rs199856693:14933:G:A 14433758 G A 0.9354 1.807 1.936 1.906 1.847 1.929 1.734 1.327
22 rs9604721 14434713 C T 0.9723 1.984 1.984 1.984 1.984 1.984 1.878 1.412
So if the 2nd columns of both files match I want to replace the values in 2nd column in file input1 with values in 1st column from input2.
I tried this:
awk 'FNR==NR{a[$1]=$2;next} $2 in a{$2=a[$1]}1' input2 input1
and this
awk 'FNR==NR { F2[$2]=$2 ; next } $2 in F2 {$1 = F2[$1] ; print } ' input2 input1
Your first attempt is almost right.
awk '
FNR==NR { a[$2]=$1; next }
$2 in a { $2=a[$2] }
1
' input2 input1
$2 in a looks for $2 among the keys of a, not among its values,
so store column 2 of input2 as the key of a and column 1 as the value.
In the action of the second line, index a with the second column of input1 ($2), not the first ($1).
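To see why the array has to be keyed on column 2: awk's in operator tests array indices, never the stored values. A small self-contained sketch (the variable name `out` is only for illustration):

```shell
out=$(awk 'BEGIN {
    # key is the plain rs id, value is the long id
    a["rs75454623"] = "rs75454623:14930:A:G"
    # the in operator checks indices only
    print (("rs75454623" in a) ? "key found" : "key missing")
    print (("rs75454623:14930:A:G" in a) ? "value found" : "value missing")
}')
echo "$out"
```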

Expand each line of text file according to their corresponding numbers on linux

Can I convert the first format below into the second using just basic shell processing, awk, or sed on Linux?
This is a toy example:
This is the kind of text file I have; columns 2 and 3 form a range that is closed on the left and open on the right:
chr1 0 2 0
chr1 2 6 1.5
chr2 0 3 0
chr2 3 10 2.1
I want to convert it so that each position is described on its own line:
chr1 0 0
chr1 1 0
chr1 2 1.5
chr1 3 1.5
chr1 4 1.5
chr1 5 1.5
chr2 0 0
chr2 1 0
chr2 2 0
chr2 3 2.1
...
chr2 9 2.1
This can be done with awk:
awk '{for(i=$2;i<$3;i++)print $1,i,$4}' file
The loop runs i from the start of the range ($2) up to, but not including, the end ($3).
For each position it prints the name ($1), the position, and the value ($4), as requested.
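Run against the toy input from the question (written here to a temporary file; `tmp` and the output name `expanded` are made up), the one-liner gives one output line per position in each half-open range:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'chr1 0 2 0' 'chr1 2 6 1.5' 'chr2 0 3 0' 'chr2 3 10 2.1' > "$tmp/file"

# i runs from the start ($2) up to, but not including, the end ($3)
awk '{ for (i = $2; i < $3; i++) print $1, i, $4 }' "$tmp/file" > "$tmp/expanded"

cat "$tmp/expanded"
```

The four input ranges cover 2, 4, 3 and 7 positions, so the output has 16 lines.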
Another option is to use set and map operations with bedops, bedmap, and cut:
$ bedops --chop 1 foo.bed | bedmap --faster --echo --echo-map-id --delim "\t" - foo.bed | cut -f1,2,4 > answer.txt
Might offer some flexibility if other types of divisions and signal mapping are needed.

Use part of a column in one file as search term in other file

I have two files. The output file I am searching has earthquake locations and has the following format:
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
I want to use the last 6 digits of the first column (i.e. '090418' out of '19090418') with the first 3 digits of the second column (i.e. '072' out of '0728') as my search term. The file I am searching has the following format:
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
After I search the second file for the term, I need to figure out how many lines are in that section. So for this example, if I were searching the terms shown for the second file above, I want to know there are 2 lines for 090212132 and 5 lines for 090216182.
This is my first post, so please let me know how I can improve clarity or conciseness in my posts. Thanks for the help!
awk to the rescue!
$ awk 'NR==FNR{a[substr($1,3) substr($2,1,3)]; next}
{k=substr($3,1,9)}
k in a{a[k]++}
END{for(k in a) if(a[k]>0) print k,a[k]}' file1 file2
with your input files, there is no output as expected.
The answer karakfa suggested worked! My output looks like this:
100224194 7
100117172 18
091004005 11
090520220 10
090526143 21
090122033 20
Thanks for the help!
karakfa's answer, with explanation:
awk 'NR==FNR { # For first file
$1 = substr($1, 3); # Get last 6 characters from first col
$2 = substr($2, 1, 3); # Get first 3 characters from second col
a[$1 $2]; # Add to an array
next } # Move to next record in first file
# Start processing second file
{k = substr($3, 1, 9)} # Get first 9 character for third col
k in a {a[k]++} # If key in a, then increment the key
END {
for (k in a) # Iterate array
if (a[k] > 0) # If pattern was matched
print k, a[k] # print the pattern and num occurrence
}' file1 file2

How to use AWK to unique a table (keep the biggest value for each unique ID)?

I have a TAB delimited table like this (the first line is header):
symbol value chr start end
Arrb1 10 chr1 1000 2000
Arrb1 20 chr1 1000 2000
Arrb1 30 chr1 1000 2000
Myc 5 chr2 3000 4000
Actin 3 chr4 25000 30000
Actin 5 chr4 25000 30000
.
.
.
I want to deduplicate the table by the first column (symbol), and if there are multiple lines for the same symbol, keep the line with the biggest value (column 2). So the result should look like:
symbol value chr start end
Arrb1 30 chr1 1000 2000
Myc 5 chr2 3000 4000
Actin 5 chr4 25000 30000
.
.
.
Can I do it using AWK? Thanks!
awk -F'\t' 'NR==1{print}
NR>1{if(b[$1]<$2){ a[$1]=$0; b[$1]=$2 }}
END{for(x in a)print a[x]}' file
If there is no header, here is a shorter one:
sort -k1,1 -k2,2nr file |awk '!a[$1]++'
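Because for (x in a) visits keys in an unspecified order, the rows of the first command may come out shuffled. A sketch of a variant that keeps the header and prints each symbol in first-seen order, assuming tab-delimited input and numeric values in column 2 (the array names `best`, `row`, `order`, the sample file, and the output name `unique.txt` are made up for illustration):

```shell
tmp=$(mktemp -d)
printf 'symbol\tvalue\tchr\tstart\tend\n' > "$tmp/file"
printf 'Arrb1\t10\tchr1\t1000\t2000\nArrb1\t30\tchr1\t1000\t2000\nArrb1\t20\tchr1\t1000\t2000\n' >> "$tmp/file"
printf 'Myc\t5\tchr2\t3000\t4000\nActin\t3\tchr4\t25000\t30000\nActin\t5\tchr4\t25000\t30000\n' >> "$tmp/file"

awk -F'\t' '
    NR == 1 { print; next }                    # pass the header through
    !($1 in row) { order[++n] = $1 }           # remember first-seen order
    !($1 in row) || $2 + 0 > best[$1] {        # new symbol or bigger value
        best[$1] = $2 + 0
        row[$1] = $0
    }
    END { for (i = 1; i <= n; i++) print row[order[i]] }
' "$tmp/file" > "$tmp/unique.txt"

cat "$tmp/unique.txt"
```

Testing ($1 in row) before comparing also means the first row of a symbol is always stored, even when its value is zero or negative.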

Resources