Lookup in shell script - linux

I have two files as below:
cat file_1:
12 a
34 b
24 c
18 d
cat file_2:
x a
y c
z d
I want something like in shell script:
x a 12
y c 24
z d 18
File 1 and file 2 have different numbers of rows, and join is not working because I can't sort the files (they are already in the order the requirement demands; if I sort them again for join, the requirement will no longer be met).

@Maulik Patel: Try this as well.
awk 'FNR==NR{A[$2]=$0;next} ($2 in A){print A[$2] FS $1}' Input_file2 Input_file1
Very short description:
Here I am using the FNR==NR condition, which is TRUE only while the first file on the command line (Input_file2) is being read, and saving $0 (the current line's value) into array A, whose index is field 2.
Then, while reading the second file (Input_file1), I check whether its second field is present in array A and, if so, print the saved line followed by Input_file1's first field.
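If the output has to keep file_2's row order (the question says the files cannot be re-sorted), the same FNR==NR idea can be applied with the files swapped: load file_1 into the array and stream file_2. A minimal sketch, assuming whitespace-separated fields and unique keys in column 2 of file_1; lookup_join is a hypothetical helper name used only for illustration:

```shell
# lookup_join is a hypothetical helper, not part of the original answer.
# usage: lookup_join LOOKUP_FILE MAIN_FILE
lookup_join() {
    awk '
        FNR == NR { val[$2] = $1; next }   # first file: key (col 2) -> value (col 1)
        $2 in val { print $0, val[$2] }    # second file: append the value if the key is known
    ' "$1" "$2"
}

# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '12 a' '34 b' '24 c' '18 d' > "$tmp/file_1"
printf '%s\n' 'x a' 'y c' 'z d'           > "$tmp/file_2"

lookup_join "$tmp/file_1" "$tmp/file_2"
rm -rf "$tmp"
```

Rows of file_2 whose key is missing from file_1 are silently dropped; whether that is the desired behaviour is an assumption.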

Here is another way of doing it
join -1 2 -2 2 file_2 file_1 --nocheck-order -o 1.1,1.2,2.1
x a 12
y c 24
z d 18

Related

Looping through list of IDs to count matches in two columns

This is going to be a complicated one to explain so bear with me.
I am doing an all-vs-all blastp comparison of multiple proteins and want the number of shared proteins between the genomes.
I have a large file of the query id and sequence id, example:
A A 100
A A 100
A A 100
A B 74
A B 47
A B 67
A C 73
A C 84
A C 74
A D 48
A D 74
A D 74
B A 67
B A 83
B A 44
B B 100
The file continues like that. I'd like to count the number of occurrences of A in column 1 and B in column 2. I have found a way to do this with awk:
awk -F, '$1=="A" && $2=="A"' file | wc -l
However, I have hundreds of genomes and this would involve typing the awk script thousands of times to get the different combinations. I had added the IDs from column 1 to a text file and tried a loop to loop through all the IDs for all possible combinations
for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done
This is the output:
0
0
0
0
0
0
0
etc.
I'd like the output to be:
A A 60
A B 54
A C 34
A D 35
etc.
Any help would be appreciated.
If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:
awk -F, '{++a[$1 FS $2]} END{for(entry in a){print entry, a[entry]}}' file
A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3
This is doing the following:
Increment the count in array a for the item with the key constructed from the concatenation of the first two columns, separated by the field separator FS (comma): {++a[$1 FS $2]}
Once the file processing is done END, loop through the array calling each array entry entry, for (entry in a)
In the loop, print the key/entry and the value {print entry, a[entry]}
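As for why the loop in the question prints 0 every time: inside single quotes the shell never expands $i, so awk sees its own (unset) variable i and evaluates $i as $0, the whole line, which never equals $1 here. If you do want to query one pair at a time, a shell value can be passed in with -v. A minimal sketch, assuming a comma-separated file; count_pair is a hypothetical helper name:

```shell
# count_pair is a hypothetical helper, not part of the answer above.
# usage: count_pair QUERY_ID SUBJECT_ID FILE
count_pair() {
    awk -F, -v q="$1" -v s="$2" '
        $1 == q && $2 == s { n++ }     # count rows matching both IDs
        END { print q, s, n + 0 }      # n + 0 prints 0 when nothing matched
    ' "$3"
}

# Small made-up sample in the same layout as the question's file.
tmp=$(mktemp -d)
printf '%s\n' 'A,A,100' 'A,B,74' 'A,B,47' 'B,A,67' > "$tmp/file.csv"

for q in A B; do
    for s in A B; do
        count_pair "$q" "$s" "$tmp/file.csv"
    done
done
rm -rf "$tmp"
```

For hundreds of genomes the single-pass array approach is much cheaper, since this loop rereads the file once per pair.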
# Not trying to insult anyone: WHINY_USERS is a special environment
# variable recognized by mawk-1 to have array indices pre-sorted,
# somewhat similar to gawk's
#
#     PROCINFO["sorted_in"]="@ind_str_asc"
…input… | WHINY_USERS=1 \
mawk '{__[$!--NF]--} END { for(_ in __) { print _,-__[_] } }' OFS=',' FS='[, \t]+'
A,A,3
A,B,3
A,C,3
A,D,3
B,A,3
B,B,1
If there is a chance of more than 3 columns in the input, then do:
{m,n,g}awk '
BEGIN { _ += _ ^= FS = "["(OFS=",")" \t]+"
} { __[$!(NF=_)]++
} END {
for(_ in __) { print _, __[_] } }'
Assigning to NF makes awk rebuild the record with OFS (the same effect as the $1 = $1 idiom), which takes care of placing the comma between columns 1 and 2 instead of having to do it manually.
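For comparison, a more readable sketch of the same counting idea, assuming the first two fields are separated by commas, spaces or tabs. It pipes through sort for a stable order instead of relying on awk-specific array ordering; count_pairs_sorted is a hypothetical helper name:

```shell
# count_pairs_sorted is a hypothetical helper; it reads files given as
# arguments, or standard input when called without arguments.
count_pairs_sorted() {
    awk -F'[, \t]+' '
        { n[$1 "," $2]++ }                        # key: first two fields joined by a comma
        END { for (k in n) print k "," n[k] }     # emit key,count in arbitrary order
    ' "$@" | LC_ALL=C sort                        # sort outside awk for a portable order
}

printf '%s\n' 'A,A,100' 'A,A,100' 'A,B,74' 'B,A,67' | count_pairs_sorted
```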

Finding the rows sharing information

I have a file having a structure like below:
file1.txt:
1 10 20 A
1 10 20 B
1 10 20 E
1 10 20 F
1 12 22 C
1 13 23 X
2 33 45 D
2 48 49 D
2 48 49 E
I am trying to find out which letters share the same values in the 1st, 2nd and 3rd columns.
For example the output should be:
A
B
E
F
D
E
I am only able to count how many lines are unique via:
cut -f1,2,3 file1.txt | sort | uniq | wc -l
5
which does not give me anything related with the 4th column.
How do I get the letters in the fourth column that share the first three columns?
Following awk may help you here.
awk 'FNR==NR{a[$1,$2,$3]++;next} a[$1,$2,$3]>1' Input_file Input_file
Output will be as follows.
1 10 20 A
1 10 20 B
1 10 20 E
1 10 20 F
2 48 49 D
2 48 49 E
To get only the last field's value, change a[$1,$2,$3]>1 to a[$1,$2,$3]>1{print $NF}
process the file once:
awk '{k=$1 FS $2 FS $3}
k in a{a[k]=a[k]RS$4;b[k];next}{a[k]=$4}END{for(x in b)print a[x]}' file
process the file twice:
awk 'NR==FNR{a[$1,$2,$3]++;next}a[$1,$2,$3]>1{print $4}' file file
With the given example, both one-liners above give the same output:
A
B
E
F
D
E
Note the first one may generate the "letters" in different order.
using best of both worlds...
$ awk '{print $4 "\t" $1,$2,$3}' file | uniq -Df1 | cut -f1
A
B
E
F
D
E
This swaps the order of the fields, asks uniq to skip the first field and print duplicates only, then removes the compared fields.
or,
$ rev file | uniq -Df1 | cut -d' ' -f1
A
B
E
F
D
E
If the tag name is not a single character, you need to add | rev at the end.
NB. Both scripts assume the data is sorted on the compared keys already as in the input file.
Another one-pass:
$ awk ' {
    k=$1 FS $2 FS $3        # create array key
    if(k in a) {            # a is the not-yet-printed queue
        print a[k] ORS $NF  # once printed from a...
        b[k]=$NF            # move it to b
        delete a[k]         # delete from a
    }
    else if(k in b) {       # already-printed queue
        print $NF
    } else a[k]=$NF         # store to not-yet-printed queue a
}' file
A
B
E
F
D
E

How to divide a column based on the corresponding value in another file?

I have multiple files (66) and want to divide column 3 of each file by its corresponding value in info.file and insert the new value in column 4 of each file.
My manual code is:
awk '{print $4=$3/NUmber from info.file}1' file
But this takes me hours to do for each individual file. So I want to automate it for all files. Thanks
file1:
chrm name value
4 a 8
3 b 4
file2:
chrm name value
3 g 6
5 s 12
info.file:
file_name average
file1 8
file2 6
file3 10
output:
file1:
chrm name value new_value
4 a 8 1
3 b 4 0.5
file2:
chrm name value new_value
3 g 6 1
5 s 12 2
without error handling
$ awk 'NR==FNR {a[$1]=$2; next}
FNR==1 {out=FILENAME".new"; print $0, "new_value" > out; next}
{v=$NF/a[FILENAME]; $++NF=v; print > out}' info file1 file2
will generate updated files
$ head file{1,2}.new | column -t
==> file1.new <==
chrm name value new_value
4 a 8 1
3 b 4 0.5
==> file2.new <==
chrm name value new_value
3 g 6 1
5 s 12 2
Explanation
NR==FNR {a[$1]=$2; next} : scan the first file and save the file/value pairs in the associative array
FNR==1 : on the header line of each data file
out=FILENAME".new" : set an output filename
print $0, "new_value" > out : print the existing header with the new column name appended
v=$NF/a[FILENAME] : for every data line, scale the last field and assign the result to v
$++NF=v : increment the number of fields and assign the newly computed value to the last field
print > out : print the new line to the same file set before
info file1 file2 : the list of files should be preceded by the info file
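Since the one-liner above is explicitly without error handling, here is a hedged sketch of one way to add some: it skips the header of the info file, skips data files that have no usable (non-zero) divisor, and closes each output file before moving on, which matters in some awk implementations when there are many files. scale_files is a hypothetical helper name, and the sketch assumes the data files are passed by exactly the names listed in the info file:

```shell
# scale_files is a hypothetical helper, not part of the answer above.
# usage: scale_files INFO_FILE DATA_FILE...
scale_files() {
    awk '
        NR == FNR { if (FNR > 1) avg[$1] = $2; next }   # info file: name -> divisor, header skipped
        FNR == 1 {                                      # header line of each data file
            if (out != "") close(out)                   # release the previous output file
            out = FILENAME ".new"
            ok = (FILENAME in avg) && (avg[FILENAME] + 0 != 0)
            if (ok) {
                print $0, "new_value" > out
            } else {
                print "skipping " FILENAME ": no usable divisor" > "/dev/stderr"
            }
            next
        }
        ok { print $0, ($3 / avg[FILENAME]) > out }     # append column 3 divided by the divisor
    ' "$@"
}

# Sample data from the question, in a scratch directory.
tmp=$(mktemp -d)
(
    cd "$tmp" || exit 1
    printf '%s\n' 'file_name average' 'file1 8' 'file2 6' > info.file
    printf '%s\n' 'chrm name value' '4 a 8' '3 b 4'       > file1
    printf '%s\n' 'chrm name value' '3 g 6' '5 s 12'      > file2
    scale_files info.file file1 file2
    cat file1.new file2.new
)
rm -rf "$tmp"
```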
I have prepared the following double nested awk command for you:
awk 'NR>1{system("awk -v div="$2" -f div_column3.awk "$1" | column -t > new_"$1);}' info.file
with div_column3.awk being an awk script file with the following content:
$ cat div_column3.awk
NR==1{print $0" new_value"}NR>1{print $0" "$3/div}

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that share the same first two columns.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems like it's working, but since I am learning to code better, I was wondering if there is a better way to go about this. Plus, my code generates many files that I will have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand in the awk program: associative arrays and fields.
If you reference an awk array element that does not exist yet, it is created empty (it evaluates to 0 in a numeric context), ready for anything you put into it. You can use that as a counter.
You state If first two columns are equal...
The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.
You then state ...select top 3 based on descending order of 3rd column...
Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.
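awk is organized into blocks of condition {action}. The condition a[$1,$2]++<3 is true for the first three lines of each pattern: the post-increment returns the count seen so far (0, 1, 2) before adding one, so it turns false from the fourth line on.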
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
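The same counter works for any group size if the limit is passed in with -v. A minimal sketch, assuming whitespace-separated input; top_n_per_group is a hypothetical helper name:

```shell
# top_n_per_group is a hypothetical helper, not part of the answer above.
# usage: top_n_per_group N FILE
top_n_per_group() {
    sort -k1,1 -k2,2 -k3,3nr "$2" |          # group by columns 1-2, largest column 3 first
    awk -v n="$1" 'seen[$1, $2]++ < n'       # keep only the first n lines of each group
}

tmp=$(mktemp -d)
printf '%s\n' 'A A 10' 'A A 1' 'A A 2' 'A A 5' 'A B 1' 'A B 2' > "$tmp/data.txt"
top_n_per_group 2 "$tmp/data.txt"
rm -rf "$tmp"
```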
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold D. Robbins.
@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read -r c1 c2 n ; do
    if [[ $c1 == "$l1" && $c2 == "$l2" ]] ; then
        ((c++))
    else
        c=0
    fi
    if (( c < 3 )) ; then
        echo "$c1 $c2 $n"
        l1=$c1
        l2=$c2
    fi
done

How can I make awk match up lines in file 1 with the lines in file 2 based on some number ranges in file 2

I have the following two files:
file 1:
22
2
42
32
file 2:
1 10 valuea
11 20 valueb
21 30 valuec
31 40 valued
41 50 valuee
51 60 valuef
How can I make awk grab each value from file 1, match it up with file 2 based on whether it falls between the number range in columns 1 and 2 of file 2, and then print out column 3 from the matched column in file 2? The output would resemble the following:
valuec
valuea
valuee
valued
I tried using the following AWK command (based on what I found in this post: How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix?), but it does not seem to be working correctly.
#!/bin/bash
awk 'FNR == NR { val[$1] = $1 }
FNR != NR { if (val[$1] >= $1 && val[$1] <= $2)
print $3
}' file1 file2
Also I did not include it in here for obvious reasons, but for the actual application of this script, file 1 would include around 7,000 entries while file 2 would include 68,000 entries
alternative awk script
$ awk 'FNR == NR {a[$1]=$2; v[$1]=$3; next}
{for(k in a)
if(k+0<=$1 && $1+0<=a[k]) print v[k]}' file2 file1
valuec
valuea
valuee
valued
Note that file2 is given first. This will cover multiple range matches as well. The +0 forces a numerical comparison.
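With about 7,000 values and 68,000 ranges, looping over every range for every value means several hundred million comparisons. If the ranges are sorted by their start and do not overlap (an assumption, though the sample data is laid out that way), a binary search finds each match in a handful of steps. A minimal sketch; range_lookup is a hypothetical helper name, and unlike the loop above it reports at most one match per value:

```shell
# range_lookup is a hypothetical helper, not part of the answer above.
# Assumes RANGES_FILE is sorted by column 1 and its ranges do not overlap.
# usage: range_lookup RANGES_FILE VALUES_FILE
range_lookup() {
    awk '
        FNR == NR {                          # ranges file: store bounds and label by position
            lo[++n] = $1 + 0; hi[n] = $2 + 0; label[n] = $3
            next
        }
        {
            x = $1 + 0
            l = 1; r = n; hit = 0
            while (l <= r) {                 # classic binary search over the sorted ranges
                m = int((l + r) / 2)
                if (x < lo[m])      r = m - 1
                else if (x > hi[m]) l = m + 1
                else { hit = m; break }
            }
            if (hit) print label[hit]        # values outside every range print nothing
        }
    ' "$1" "$2"
}

# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' '22' '2' '42' '32' > "$tmp/file1"
printf '%s\n' '1 10 valuea' '11 20 valueb' '21 30 valuec' \
              '31 40 valued' '41 50 valuee' '51 60 valuef' > "$tmp/file2"
range_lookup "$tmp/file2" "$tmp/file1"
rm -rf "$tmp"
```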
