AWK script: Finding number of matches that each element in Col2 has in Col1 - linux

I want to compare two columns in a file as below using AWK; can someone help me, please?
e.g.
Col1 Col2
---- ----
2 A
2 D
3 D
3 D
3 A
7 N
7 M
1 D
1 R
Now I want to use AWK to implement the following algorithm to find matches between those columns:
list1[] <=== Col1
list2[] <=== Col2
NewList[]
for i in col2:
    d = 0
    for j in range(1, len(col2)):
        if i == list2[j]:
            d++
    NewList.append(list1[list2.index(i)])
Expected result:
A ==> 2 // means A matches two times in Col1
D ==> 4 // means D matches four times in Col1
....
So I want to write the above algorithm as an AWK script, but I find it too complicated as I haven't used AWK before.
Thank you very much for your help.

Not all that complicated: keep the count in an array indexed by the letter and print the array out at the end:
awk '{cnt[$2]++} END {for(c in cnt) print c, cnt[c]}' test.txt
# A 2
# D 4
# M 1
# N 1
# R 1
{cnt[$2]++} # For each row, get the second column and increase the
# value of the array at that position (ie cnt['A']++)
END {for(c in cnt) print c, cnt[c]}
# When all rows done (END), loop through the keys of the
# array and print key and array[key] (the value)

An alternative solution:
$ rev file | cut -c1 | sort | uniq -c
2 A
4 D
1 M
1 N
1 R
For the requested formatting, pipe to ... | sed -r 's/(\w) (\w)/\2 ==> \1/':
A ==> 2
D ==> 4
M ==> 1
N ==> 1
R ==> 1
Or, do everything in awk
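A sketch of that awk-only version (my illustration, reusing the counting idea above): printf produces the requested X ==> n format directly, with no sed or cut needed. Piping to sort just makes the output order deterministic, since for (c in cnt) visits keys in an unspecified order.

```shell
# Sample data from the question
printf '2 A\n2 D\n3 D\n3 D\n3 A\n7 N\n7 M\n1 D\n1 R\n' > test.txt

awk '{cnt[$2]++} END {for (c in cnt) printf "%s ==> %d\n", c, cnt[c]}' test.txt | sort
# A ==> 2
# D ==> 4
# M ==> 1
# N ==> 1
# R ==> 1
```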

Related

Looping through list of IDs to count matches in two columns

This is going to be a complicated one to explain so bear with me.
I am doing an all-vs-all blastp comparison of multiple proteins and want the number of shared proteins between the genomes.
I have a large file of the query id and sequence id, example:
A A 100
A A 100
A A 100
A B 74
A B 47
A B 67
A C 73
A C 84
A C 74
A D 48
A D 74
A D 74
B A 67
B A 83
B A 44
B B 100
The file continues like that. I'd like to count the number of occurrences of A in column 1 and B in column 2. I have found a way to do this with awk:
awk -F, '$1=="A" && $2=="A"' file | wc -l
However, I have hundreds of genomes, and this would involve typing the awk command thousands of times to cover the different combinations. I added the IDs from column 1 to a text file and tried a loop over all the IDs for all possible combinations:
for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done
This is the output:
0
0
0
0
0
0
0
etc.
I'd like the output to be:
A A 60
A B 54
A C 34
A D 35
etc.
Any help would be appreciated.
If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:
awk -F, '{++a[$1 FS $2]} END{for(entry in a){print entry, a[entry]}}' file
A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3
This is doing the following:
Increment the count in array a for the item with the key constructed from the concatenation of the first two columns, separated by the field separator FS (comma): {++a[$1 FS $2]}
Once the file processing is done END, loop through the array calling each array entry entry, for (entry in a)
In the loop, print the key/entry and the value {print entry, a[entry]}
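A minor addition (not part of the original answer): since for (entry in a) visits keys in an unspecified order, piping through sort makes the output deterministic:

```shell
# Sample pairs in CSV form, as assumed by -F,
printf 'A,A,100\nA,A,100\nA,A,100\nA,B,74\nA,B,47\nA,B,67\nA,C,73\nA,C,84\nA,C,74\nA,D,48\nA,D,74\nA,D,74\nB,A,67\nB,A,83\nB,A,44\nB,B,100\n' > file

awk -F, '{++a[$1 FS $2]} END{for(entry in a){print entry, a[entry]}}' file | sort
# A,A 3
# A,B 3
# A,C 3
# A,D 3
# B,A 3
# B,B 1
```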
…input… | WHINY_USERS=1 \ # Not trying to insult anyone -
# this is a special shell parameter
# recognized by mawk-1 to have array
# indices pre-sorted, somewhat similar to gawk's
#
# PROCINFO["sorted_in"]="#ind_str_asc"
mawk '{__[$!--NF]--} END { for(_ in __) { print _,-__[_] } }' OFS=',' FS='[, \t]+'
A,A,3
A,B,3
A,C,3
A,D,3
B,A,3
B,B,1
If there's a chance of more than 3 columns in the input, then do:
{m,n,g}awk '
BEGIN { _ += _ ^= FS = "["(OFS=",")" \t]+"
} { __[$!(NF=_)]++
} END {
for(_ in __) { print _, __[_] } }'
Letting awk rebuild the record (the $1 = $1 / NF-assignment trick) takes care of placing the comma between columns 1 and 2 instead of having to insert it manually.
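A plainer sketch of that rebuild trick (my illustration, not the original answer's code): assigning to NF, like assigning $1 = $1, makes awk rejoin the fields with OFS, so truncating to two fields yields a comma-separated key for free. POSIX awks, including gawk and mawk, rebuild $0 on NF assignment.

```shell
# Sample input; FS accepts comma-, space- or tab-separated data
printf 'A A 100\nA A 100\nA A 100\nA B 74\nB B 100\n' > file

awk 'BEGIN { FS = "[, \t]+"; OFS = "," }
     { NF = 2; a[$0]++ }                    # NF=2 truncates and rebuilds $0 with OFS
     END { for (e in a) print e, a[e] }' file | sort
# A,A,3
# A,B,1
# B,B,1
```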

Is there a way to make permutations for file names in a for loop in linux bash?

The idea is that you have 3 text files, let's name them A, B and C, each containing a single column of strings (the content doesn't matter in this example). You want to join these three pairwise, so you'll have a join for A - B, another for B - C and a last one for A - C, as if it were a permutation.
Let's make a graphic example.
The individual code would be
join -1 1 -2 1 A.txt B.txt > AB.txt
and so on for the other 2
Imagine A has
100
101
102
104
B has
101
103
104
105
C has
100
103
104
105
So A - B comparison (AB.txt) would be:
101
104
A - C comparison (AC.txt):
100
104
B - C comparison (BC.txt):
103
105
And you'll have three output file named after the comparisons AB.txt, AC.txt and BC.txt
A solution might look like this:
#!/usr/bin/env bash
# Read positional parameters into array
list=("$@")
# Loop over all but the last element
for ((i = 0; i < ${#list[@]} - 1; ++i)); do
    # Loop over the elements starting with the first after the one i points to
    for ((j = i + 1; j < ${#list[@]}; ++j)); do
        # Run the join command and redirect to constructed filename
        join "${list[i]}" "${list[j]}" > "${list[i]%.txt}${list[j]%.txt}".txt
    done
done
Notice that the -1 1 -2 1 is the default behaviour for join and can be skipped.
The script has to be called with the filenames as the parameters:
./script A.txt B.txt C.txt
A function that does nothing but generate the possible combinations of two among its arguments:
#!/bin/bash
combpairs() {
    local a b
    until [ $# -lt 2 ]; do
        a="$1"
        for b in "${@:2}"; do
            echo "$a - $b"
        done
        shift
    done
}
combpairs A B C D E
A - B
A - C
A - D
A - E
B - C
B - D
B - E
C - D
C - E
D - E
I would put the files in an array, and use the index like this:
files=(a.txt b.txt c.txt) # or files=(*.txt)
for ((i = 0; i < ${#files[@]}; i++)); do
    f1=${files[i]} f2=${files[i+1]:-$files}
    join -1 1 -2 1 "$f1" "$f2" > "${f1%.txt}${f2%.txt}.txt"
done
Using echo join to debug (and quoting >), this is what would be executed:
join -1 1 -2 1 a.txt b.txt > ab.txt
join -1 1 -2 1 b.txt c.txt > bc.txt
join -1 1 -2 1 c.txt a.txt > ca.txt
Or for six files:
join -1 1 -2 1 a.txt b.txt > ab.txt
join -1 1 -2 1 b.txt c.txt > bc.txt
join -1 1 -2 1 c.txt d.txt > cd.txt
join -1 1 -2 1 d.txt e.txt > de.txt
join -1 1 -2 1 e.txt f.txt > ef.txt
join -1 1 -2 1 f.txt a.txt > fa.txt
With LC_ALL=C set, files=(*.txt) would use all .txt files in the current directory, sorted by name, which may be relevant.
One in GNU awk:
$ gawk '{
    a[ARGIND][$0]                      # hash all files to arrays
}
END {                                  # after hashing
    for (i in a)                       # form pairs
        for (j in a)
            if (i < j) {               # avoid self and duplicate comparisons
                f = ARGV[i] ARGV[j] ".txt"    # form output filename
                print ARGV[i], ARGV[j] > f    # output pair info
                for (k in a[i])
                    if (k in a[j])
                        print k > f           # output matching records
            }
}' a b c
Output, for example:
$ cat ab.txt
a b
101
104
All files are hashed in the memory in the beginning so if the files are huge, you may run out of memory.
Another variation
declare -A seen
for a in {A,B,C}; do
    for b in {A,B,C}; do
        [[ $a == $b || -v seen[$a$b] || -v seen[$b$a] ]] && continue
        seen[$a$b]=1
        comm -12 "$a.txt" "$b.txt" > "$a$b.txt"
    done
done
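One caveat worth noting for this variant: comm expects its inputs sorted. If the .txt files might not be, sorting copies first keeps it working (my addition, with hypothetical file names):

```shell
# Hypothetical unsorted inputs
printf '104\n100\n101\n102\n' > A.txt
printf '105\n101\n103\n104\n' > B.txt

sort A.txt > A.sorted
sort B.txt > B.sorted
comm -12 A.sorted B.sorted > AB.txt   # lines common to both files
cat AB.txt
# 101
# 104
```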

Finding the rows sharing information

I have a file having a structure like below:
file1.txt:
1 10 20 A
1 10 20 B
1 10 20 E
1 10 20 F
1 12 22 C
1 13 23 X
2 33 45 D
2 48 49 D
2 48 49 E
I am trying to find out which letters have the same information in the 1st, 2nd and 3rd columns.
For example the output should be:
A
B
E
F
D
E
I am only able to count how many lines are unique via:
cut -f1,2,3 file1.txt | sort | uniq | wc -l
5
which does not give me anything related with the 4th column.
How do I get the letters in the fourth column that share the first three columns?
The following awk may help you here:
awk 'FNR==NR{a[$1,$2,$3]++;next} a[$1,$2,$3]>1' Input_file Input_file
Output will be as follows.
1 10 20 A
1 10 20 B
1 10 20 E
1 10 20 F
2 48 49 D
2 48 49 E
To get only the last field's value, change a[$1,$2,$3]>1 to a[$1,$2,$3]>1{print $NF}.
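Spelled out, that variant reads (same two-pass idea as the answer above, printing only the letters):

```shell
# Sample file1.txt from the question
printf '1 10 20 A\n1 10 20 B\n1 10 20 E\n1 10 20 F\n1 12 22 C\n1 13 23 X\n2 33 45 D\n2 48 49 D\n2 48 49 E\n' > file1.txt

awk 'FNR==NR{a[$1,$2,$3]++;next} a[$1,$2,$3]>1{print $NF}' file1.txt file1.txt
# A
# B
# E
# F
# D
# E
```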
process the file once:
awk '{k = $1 FS $2 FS $3}
     k in a {a[k] = a[k] RS $4; b[k]; next}
     {a[k] = $4}
     END {for (x in b) print a[x]}' file
process the file twice:
awk 'NR==FNR{a[$1,$2,$3]++;next}a[$1,$2,$3]>1{print $4}' file file
With the given example, both one-liners above give same output:
A
B
E
F
D
E
Note the first one may generate the "letters" in different order.
Using the best of both worlds...
$ awk '{print $4 "\t" $1,$2,$3}' file | uniq -Df1 | cut -f1
A
B
E
F
D
E
Swap the order of the fields, ask uniq to skip the first field and print only duplicates, then remove the compared fields.
or,
$ rev file | uniq -Df1 | cut -d' ' -f1
A
B
E
F
D
E
If the tag name is not a single character, you need to add | rev at the end.
NB. Both scripts assume the data is sorted on the compared keys already as in the input file.
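If the input is not already grouped on the first three columns, sorting it first restores that assumption (a sketch; -D and -f are GNU uniq options, as above):

```shell
# Same sample data as the question, deliberately shuffled
printf '2 48 49 D\n1 10 20 A\n1 12 22 C\n1 10 20 B\n2 33 45 D\n1 10 20 E\n1 13 23 X\n2 48 49 E\n1 10 20 F\n' > file

# Sort numerically on the three key columns, then reuse the uniq pipeline
sort -k1,1n -k2,2n -k3,3n file | awk '{print $4 "\t" $1,$2,$3}' | uniq -D -f1 | cut -f1
# A
# B
# E
# F
# D
# E
```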
Another one-pass:
$ awk '{
    k = $1 FS $2 FS $3        # create array key
    if (k in a) {             # a is the not-yet-printed queue
        print a[k] ORS $NF    # once printed from a...
        b[k] = $NF            # move it to b
        delete a[k]           # delete from a
    }
    else if (k in b) {        # already-printed queue
        print $NF
    } else a[k] = $NF         # store to not-yet-printed queue a
}' file
A
B
E
F
D
E

Lookup in shell script

I have two files as below:
cat file_1:
12 a
34 b
24 c
18 d
cat file_2:
x a
y c
z d
I want something like in shell script:
x a 12
y c 24
z d 18
File 1 and file 2 has different number of rows and join is not working as I can't sort the files (the files are already sorted for the requirement if I sort again for join then requirement will not served).
@Maulik Patel: Try also:
awk 'FNR==NR{A[$2]=$0;next} ($2 in A){print A[$2] FS $1}' Input_file2 Input_file1
Very short description:
The condition FNR==NR is TRUE while the first file given, Input_file2, is being read; during that pass, $0 (the current line) is saved into array A, indexed by field 2.
Then, while reading Input_file1, I check whether its second field occurs in array A, and print the stored Input_file2 line followed by Input_file1's first field.
Here is another way of doing it
join -1 2 -2 2 file_2 file_1 --nocheck-order -o 1.1,1.2,2.1
x a 12
y c 24
z d 18

awk difference between subsequent lines

This is a great example of how to solve the problem if I want to print differences between subsequent lines of a single column:
awk 'NR>1{print $1-p} {p=$1}' file
But how would I do it if the file has multiple columns (an unknown number) and I want the differences for all of them, e.g. (note that the number of columns is not necessarily 3; it can be 10, 15 or more):
col1 col2 col3
---- ---- ----
1 3 2
2 4 10
1 9 -3
. . .
the output would be:
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
. . .
Instead of saving just the first column, save the entire line; you can then split it and print the differences using a loop:
awk 'NR>1 {for (i = 1; i <= NF; i++) printf "%d ", $i - a[i]; print ""}
     {split($0, a)}' file
If you need the column title then you can print it using BEGIN.
$ awk 'NR<3; NR>3{for (i=1;i<=NF;i++) printf "%d%s", $i-p[i], (i<NF?OFS:ORS)} {split($0,p)}' file | column -t
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
