I have tab-delimited text files in which the common rows between them are to be found, based on columns 1 and 2 as key columns.
Sample files:
file1.txt
aba 0 0
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
I would like to get the rows common to 2 files or to 3 files, using columns 1 and 2 as the key to search on. For rows matching on columns 1 and 2, reporting the first occurrence found in any file would do the job.
Sample output for rows common in 2 files:
abc 1 1
Sample output for rows common in 3 files:
aba 0 0
xxx 0 0
In the real scenario I have to specify different values for the number of files. Can anybody suggest a generalized solution where I can pass the number of files in which a row has to be common?
I have this piece of code which looks for rows common in all files.
awk '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
num_files = ARGC -1
for ( key in arr ) {
if ( arr[key] < num_files ) { continue }
split( line[ key ], line_arr, SUBSEP )
for ( i = 1; i <= length( line_arr ); i++ ) {
printf "%s\n", line_arr[ i ]
}
}
}
' *.txt > commoninall.txt
This should work:
cat file[123].txt | sort | awk 'BEGIN{FS="\t"; V1=""; V2=""}
{if (V1==$1 && V2==$2) { b=b+1 } else
{ print b":"$0; b=1; V1=$1; V2=$2} }' |grep "2:"|awk '
BEGIN{FS=":"} {print $2}'
I cat all files into one stream, sort the lines, check whether the first two tab-separated columns are equal (if they are, print the line), and then filter out the duplicated lines.
BTW: I took the nice file[123].txt globbing idea from William Pursell's comment.
This should work too.
I put all the lines into an array (b) keyed by the first two values, and accumulate the number of repetitions in a. If the count is greater than 1, the line is printed from b, which holds the last line saved for that column1/column2 combination:
cat *.txt | awk -F" " '{a[$1$2]=a[$1$2]+1; b[$1$2]=$0} END{ for (i in a){if(a[i]>1){print b[i]}}}'
Is it ok too?
EDIT
To show all lines in all files, you need just a little more:
cat *.txt | awk -F" " '{a[$1$2]=a[$1$2]+1; c=a[$1$2]; b[$1$2c]=$0} END{ for (i in a){if(a[i]>1){for(c=1; c<=a[i];++c){print b[i c]}}}}'
Many thanks to @PeterPaulKiefer for the cat *.txt idea.
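A side note on both of these one-liners: a[$1$2] concatenates the two key fields with no separator, so distinct keys such as "ab c1" and "abc 1" would collide. A safer sketch of the same idea subscripts with a comma, which makes awk join the parts with SUBSEP internally:
cat *.txt | awk '{a[$1,$2]++; b[$1,$2]=$0} END{for (k in a) if (a[k]>1) print b[k]}'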
Related
I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values as the first column and the longitude values at the top, which looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33.
I can do it manually for a small matrix, but I can't work out how to get the output in my desired format. I was writing a script along the following lines, but it is completely useless.
echo 20 > latitude.txt
for i in `seq 1 250`;do
i1=$(( i + 0.33 )) #bash can't recognize fractions
echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`;do
j1=$(( j + 0.33 ))
echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
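As an aside, the fractional-increment problem on its own can be solved without bash arithmetic, for example with GNU seq (a sketch only; it fixes the number generation but not the paste/cat layout issue, which the answers below handle). The upper bounds are padded slightly so rounding cannot drop the last value:
seq 20.00 0.33 118.68 > latitude.txt     # 300 values: 20.00, 20.33, ..., 118.67
seq 100.00 0.33 182.18 > longitude.txt   # 250 values: 100.00, 100.33, ..., 182.17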
$ cat tst.awk
BEGIN {
lat = 100
lon = 20
latWid = lonWid = 6
latDel = lonDel = 0.33
latFmt = lonFmt = "%*.2f"
}
NR==1 {
printf "%*s", latWid, ""
for (i=1; i<=NF; i++) {
printf lonFmt, lonWid, lon
lon += lonDel
}
print ""
}
{
printf latFmt, latWid, lat
lat += latDel
for (i=1; i<=NF; i++) {
printf "%*s", lonWid, $i
}
print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The following awk may also help you here.
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {col+=.33;$1=$1;print col OFS $0}' OFS="\t" Input_file
Adding a non-one-liner form of the above solution too:
awk -v col=100 -v row=20 '
FNR==1{
printf OFS;
for(i=1;i<=NF;i++){
printf row OFS;
row=row+.33;
};
print ""
}
{
col+=.33;
$1=$1;
print col OFS $0
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file
With perl
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a to auto-split input on whitespace, result saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 for the first line of input
map {20.00+$_*0.33} 0..$#F iterates based on the size of the @F array; for each iteration we get a value from the expression inside {}, where $_ will be 0, 1, etc. up to the last index of @F
print join "\t", "", map... uses the tab separator to print an empty element followed by the results of map
For all the lines, print the contents of the @F array prefixed with the result of 100+(0.33*$i++), where $i is initially 0 in numeric context. Again, tab is used as the separator while joining these values.
Use sprintf if formatting is needed; also, $, can be initialized instead of using join:
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt
I have tab-delimited text files in which the common rows between them are to be found, based on columns 1 and 2 as key columns.
Sample files:
file1.txt
aba 0 0
aba 0 0 1
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
aba 0 0 0 1
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
The code below does this, but returns rows only if the key columns are found in all N files (3 files in this case).
awk '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
num_files = ARGC -1
for ( key in arr ) {
if ( arr[key] < num_files ) { continue }
split( line[ key ], line_arr, SUBSEP )
for ( i = 1; i <= length( line_arr ); i++ ) {
printf "%s\n", line_arr[ i ]
}
}
}
' *.txt > commoninall.txt
Output:
xxx 0 0
aba 0 0
aba 0 0 1
However, now I would like to get the output if the key columns are found in 'x' of the files.
For example, x=2, i.e. rows which are common to two files based on key columns 1 and 2. The output in this case would be:
xyz 0 0
abc 1 1
In the real scenario I have to specify different values for x. Can anybody suggest an edit to this, or a new solution?
First attempt
I think you just need to modify the END block a little, and the command invocation:
awk -v num_files=${x:-0} '
…
…script as before…
…
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
'
Basically, this takes a command line parameter based on $x, defaulting to 0, and assigns it to the awk variable num_files. In the END block, the code checks for num_files being zero, and resets it to the number of files passed on the command line. (Interestingly, the value in ARGC discounts any -v var=value options and either a command line script or -f script.awk, so the ARGC-1 term remains correct. The array ARGV contains awk (or whatever name you invoked it with) in ARGV[0] and the files to be processed in ARGV[1] through ARGV[ARGC-1].) The loop then checks for the required number of matches and prints as before. You can change == to >= if you want the 'or more' option.
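If you want to see that behaviour for yourself, a quick sketch (with hypothetical file names; the exact string stored in ARGV[0] varies between awk implementations) is:
$ awk -v x=1 'BEGIN { print "ARGC =", ARGC; for (i = 0; i < ARGC; i++) print "ARGV[" i "] =", ARGV[i] }' file1.txt file2.txt
ARGC = 3
ARGV[0] = awk
ARGV[1] = file1.txt
ARGV[2] = file2.txt
Neither the -v option nor the program text is counted, so ARGC-1 is the number of data files.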
Does it work?
I observed in a comment:
I'm not clear what you are asking. I took it that your code was working for the example with three files and producing the right answer. I simply suggested how to modify the working code to handle N files and at least M of them sharing an entry. I have just realized, while typing this, that there is a bit more work to do. An entry could be missing from the first file but present in the others, and will therefore need to be processed. It is easy to report all occurrences in every file, or the first occurrence in any file. It is harder to report all occurrences in only the first file that contains a given key.
The response was:
It is perfectly fine to report first occurrence in any file and need not be only from the first file. However, the issue with the suggested modification is, it is producing the same output for different values of x.
That's curious: I was able to get sane output from the amended code with different values for the number of files where the key must appear. I used this shell script. The code in the awk program up to the END block is the same as in the question; the only change is in the END processing block.
#!/bin/bash
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1;;
esac
done
shift $(($OPTIND - 1))
awk -v num_files=${num_files:-$#} '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] (line[$1,$2] ? SUBSEP : "") $0
next
}
FNR == 1 { delete found }
{ if (arr[$1,$2] && ! found[$1,$2]) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
' "$#"
Sample runs (data files from question):
$ bash common.sh file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 3 file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 2 file?.txt
$ bash common.sh -n 1 file?.txt
abc 0 1
abd 1 1
$
That shows different answers depending on the value specified via -n. Note that this only shows lines that appear in the first file and appear in exactly N files in total. The only key that appears in two files (abc/1) does not appear in the first file, so it is not listed by this code which stops paying attention to new keys after the first file is processed.
Rewrite
However, here's a rewrite, using some of the same ideas, but working more thoroughly.
#!/bin/bash
# SO 30428099
# Given that the key for a line is the first two columns, this script
# lists all appearances in all files of a given key if that key appears
# in N different files (where N defaults to the number of files). For
# the benefit of debugging, it includes the file name and line number
# with each line.
usage()
{
echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1
}
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) usage;;
esac
done
shift $(($OPTIND - 1))
if [ "$#" = 0 ]
then usage
fi
# Record count of each key, regardless of file: keys
# Record count of each key in each file: key_file
# Count of different files containing each key: files
# Accumulate line number, filename, line for each key: lines
awk -v num_files=${num_files:-$#} '
{
keys[$1,$2]++;
if (++key_file[$1,$2,FILENAME] == 1)
files[$1,$2]++
#printf "%s:%d: Key (%s,%s); keys = %d; key_file = %d; files = %d\n",
# FILENAME, FNR, $1, $2, keys[$1,$2], key_file[$1,$2,FILENAME], files[$1,$2];
sep = lines[$1,$2] ? RS : ""
#printf "B: [[\n%s\n]]\n", lines[$1,$2]
lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
#printf "A: [[\n%s\n]]\n", lines[$1,$2]
}
END {
#print "END"
for (key in files)
{
#print "Key =", key, "; files =", files[key]
if (files[key] == num_files)
{
#printf "TAG\n%s\nEND\n", lines[key]
print lines[key]
}
}
}
' "$#"
Sample output (given the data files from the question):
$ bash common.sh file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 2 file?.txt
file2.txt 5 abc 1 1
file3.txt 5 abc 1 1
$ bash common.sh -n 1 file?.txt
file1.txt 3 abc 0 1
file3.txt 1 xyx 0 0
file1.txt 4 abd 1 1
file2.txt 1 xyz 0 0
$ bash common.sh -n 3 file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 4 file?.txt
$
You can fettle this to give the output you want (probably dropping the file name and line number). If you only want the lines from the first file containing a given key, you only add the information to lines when files[$1,$2] == 1, as sketched below. You can separate the recorded information with SUBSEP instead of RS and OFS if you prefer.
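For example, the 'first file only' variant would amount to guarding the accumulation with the per-key file count, something like this (a sketch of the tweak only, slotted into the main block of the script above):
sep = lines[$1,$2] ? RS : ""
if (files[$1,$2] == 1)
    lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0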
Can't you simply use uniq to search for repeated lines in your files?
Something like:
cat file1.txt file2.txt file3.txt | sort | uniq -d
For your complete scenario, you could use uniq -c to get the number of repetitions for each line, and filter that with grep.
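A minimal sketch of that idea (note that sort | uniq matches whole lines, which is stricter than matching on just columns 1 and 2):
sort file1.txt file2.txt file3.txt | uniq -c | grep -E '^ *3 '
The grep keeps only lines whose count is exactly 3; the count prefix is still attached to the output, so strip it off if you don't want it.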
I have a 5-column file:
PS 6 15 0 1
PS 1 17 0 1
PS 4 18 0 1
that I would like to get it in this 7-column format:
PS.15 PS 6 N 1 0 1
PS.17 PS 1 P 1 0 1
PS.18 PS 4 N 1 0 1
Creating 6 of the 7 columns just requires grabbing values directly (and sometimes applying small arithmetic) from columns in the original file. However, creating one column (column 4) requires an if-else statement.
Specifically, to create new columns 1, 2, 3, I use:
cat File | awk '{print $1"."$3"\t"$1"\t"$2}'
and to create new columns 5, 6,7, I use:
cat testFileB | awk '{print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
and to create new column 4, I use:
cat testFileB | awk '{if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N";}'
These three statements work fine independently and get me what I want (the correct values for the columns, all separated by tabs). However, when I try to apply them simultaneously (to create all 7 columns at once), I can only do so with unwanted newlines (instead of tabs) before and after column 4 (the if/else statement column):
For instance, my attempt to simultaneously create columns 1, 2, 3, 4:
cat File | awk '{print $1"."$3"\t"$1"\t"$2; if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N";}'
results in unwanted new lines before column 4:
PS.15 PS 6
N
PS.17 PS 1
P
PS.18 PS 4
Similarly, my attempt to simultaneously create columns 4, 5, 6, 7:
cat File | awk '{if ($2 == 1 || $2 == 2 || $2 == 3) print "P"; else print "N"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
results in unwanted new lines after column 4:
N
1 0 1
P
1 0 1
N
1 0 1
Is there a solution so that I can create all 7 columns at once, and there are only tabs between them (no new lines)?
If you don't want automatic line feeds, you can just use printf instead of print. I'm not quite sure if you want a tab separating the N and the 1 or not, but that's easy enough to adjust (see the note after the output below):
cat testfile | awk '{printf "%s.%s\t%s\t%s\t",$1,$3,$1,$2; if ($2 == 1 || $2 == 2 || $2 == 3) printf "P"; else printf "N"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}'
PS.15 PS 6 N1 0 1
PS.17 PS 1 P1 0 1
PS.18 PS 4 N1 0 1
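If you do want a tab after the P/N column as well, the adjustment is just adding it to those two printf calls, along these lines:
awk '{printf "%s.%s\t%s\t%s\t",$1,$3,$1,$2; if ($2 == 1 || $2 == 2 || $2 == 3) printf "P\t"; else printf "N\t"; print $4+$5"\t"$4/($4+$5)"\t"$5/($4+$5)}' testfile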
Simply set your OFS (instead of repeating a \t all across the line), and use the ternary operator to print P or N:
$ awk -v OFS='\t' '{s=$4+$5;print $1"."$3,$1,$2,($2~/^[123]$/?"P":"N"),s,$4/s,$5/s}' file
PS.15 PS 6 N 1 0 1
PS.17 PS 1 P 1 0 1
PS.18 PS 4 N 1 0 1
I have a multi-column file composed of the single-digit values 1, 2 and 3. There are long runs of the same value in each column, and sometimes the value switches from one to another. I want to count how many times this switch happens in every column. For example, in column 1 the value changes from 1 to 2 to 3 to 1, so there are 3 switches and the output should be 3. The second column contains 2s all the way down, so the number of changes is 0 and the output is 0.
My input file has 4000 columns, so it is impossible to do this by hand. The file is space-separated.
For example:
Input:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 ......
3 2 2 1 2
3 2 2 1 1
1 2 2 1 1
1 2 2 1 2
1 2 2 1 1
Desired output:
3 ## column 1 switch times
0 ## column 2 switch times
3 .....
0
5
I was using:
awk '{print $1}' <inputfile> | uniq | wc -l
awk '{print $2}' <inputfile> | uniq | wc -l
awk '{print $3}' <inputfile> | uniq | wc -l
....
This executes one column at a time. It gives me the output "4" for the first column, and afterwards I just calculate 4-1=3 to get my desired output. But is there a way I can write this awk command into a loop, execute it on each column, and write the output to one file?
Thanks!
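For what it's worth, a direct translation of that per-column pipeline into a shell loop could look like the sketch below (inputfile is a placeholder; it rereads the file once per column, so for 4000 columns the single-pass awk answers that follow are far more efficient):
ncols=$(awk 'NR == 1 { print NF; exit }' inputfile)
for ((i = 1; i <= ncols; i++)); do
    # count the runs in column i, then subtract 1 to get the number of switches
    runs=$(awk -v c="$i" '{ print $c }' inputfile | uniq | wc -l)
    echo $((runs - 1))
done > switch_counts.txt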
awk tells you how many fields are in a given row in the variable NF, so you can create two arrays to keep track of the information you need. One array will keep the value seen in the previous row for the given column. The other will count the number of switches in a given column. You'll also keep track of the maximum number of columns (and set the counts for new columns to zero so that they are printed appropriately in the output at the end if the number of switches is 0 for that column). You'll also make sure you don't count the transition from an empty string to a non-empty string, which happens when the column is encountered for the first time.
If, in fact, the file uniformly has the same number of columns, that will only affect the first row of data. If subsequent rows actually have more columns than the first line, then the code adds them. If a column stops appearing for a bit, I've assumed it should resume where it left off (as if the missing columns had the same value as before). You can decide on different rules; it could count as two transitions (from number to blank and from blank to number, too). If that's the case, you have to modify the counting code. Or, perhaps more sensibly, you could decide that irregular numbers of columns are simply not allowed, in which case you can bail out early if the number of columns in the current row is not the same as in the previous row (beware blank lines, or are they outlawed too?).
And you won't try writing the whole program on one line because it will be incomprehensible and it really isn't necessary.
awk '{ if (NF > maxNF)
{
for (i = maxNF + 1; i <= NF; i++)
count[i] = 0;
maxNF = NF;
}
for (i = 1; i <= NF; i++)
{
if (col[i] != "" && $i != col[i])
count[i]++;
col[i] = $i;
}
}
END {
for (i = 1; i <= maxNF; i++)
print count[i];
}' data-file-with-4000-columns
Given your sample data (with the dots removed), the output from the script is as requested:
3
0
3
0
5
This alternative data file with jagged rows:
1 2 3 1 2
1 2 2 1 3
1 2 3 1 2
2 2 2 1 2
2 2 2 1 2 1 1 1
3 2 2 1 2 2 1
3 2 2 1 1
1 2 2 1 1 2 2 1
1 2 2 1
1 2 2 1 1 3
produces the output:
3
0
3
0
3
2
1
0
Which is correct according to the rules I formulated — but if you decide you want different rules to cover the data, you can end up with different answers.
If you used printf("%d\n", count[i]); in the final loop, you'd not need to set the count values to zero in a loop. You pays your money and takes your pick.
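Concretely, that suggestion amounts to an END block like the fragment below; the inner loop that zeroes count entries for new columns could then be dropped (the maxNF update must stay), because printf's %d renders an unset count[i] as 0:
END {
    for (i = 1; i <= maxNF; i++)
        printf("%d\n", count[i]);
}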
Use a loop and keep an array for each of the column current value and another array for the corresponding count:
awk '{for(i=0;i<5;i++) if(c[i]!=$(i+1)) {c[i]=$(i+1); t[i]++}} END{for(i=0;i<5;i++)print t[i]-1}' filename
Note that this assumes that the column values are not zero. If you happen to have zero values, then just initialize the array c to some unique value that will not be present in the file, as sketched below.
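One way to read that suggestion, as a sketch (the sentinel string just has to be something that can never appear as a column value):
awk 'BEGIN{for(i=0;i<5;i++)c[i]="<none>"} {for(i=0;i<5;i++) if(c[i]!=$(i+1)) {c[i]=$(i+1); t[i]++}} END{for(i=0;i<5;i++)print t[i]-1}' filename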
Coded out longhand for ease of viewing; SaveColx and CountColx should really be arrays. I'd print the column number itself in the results, at least for checking :-)
BEGIN {
SaveCol1 = " "
CountCol1 = 0
CountCol2 = 0
CountCol3 = 0
CountCol4 = 0
CountCol5 = 0
}
{
if ( SaveCol1 == " " ) {
SaveCol1 = $1
SaveCol2 = $2
SaveCol3 = $3
SaveCol4 = $4
SaveCol5 = $5
next
}
if ( $1 != SaveCol1 ) {
CountCol1++
SaveCol1 = $1
}
if ( $2 != SaveCol2 ) {
CountCol2++
SaveCol2 = $2
}
if ( $3 != SaveCol3 ) {
CountCol3++
SaveCol3 = $3
}
if ( $4 != SaveCol4 ) {
CountCol4++
SaveCol4 = $4
}
if ( $5 != SaveCol5 ) {
CountCol5++
SaveCol5 = $5
}
}
END {
print CountCol1
print CountCol2
print CountCol3
print CountCol4
print CountCol5
}
I have a file like this:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474642 0 0
AX-75474643 0.25 0.820513
AX-75448113 1 0
AX-75474641 1 1
and I want to select the rows where columns 2 and 3 are not both equal to each other and to 0 or 1. (I.e. if columns 2 and 3 are equal but the value is 0.5, or any other number except 0 and 1, I would still like to keep that row.)
so the output would be:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474643 0.25 0.820513
AX-75448113 1 0
I know how to write the command to select the rows where columns 2 and 3 are equal to each other and equal to 0 or 1, which is this:
awk '$2=$3==1 || $2=$3==0' test.txt | wc -l
but I want exactly the opposite: to select every row that is not in the output of the above command!
Thanks, I hope I was able to explain what I want
This might work for you, if I've understood your requirements correctly.
awk ' $2 != $3 { print; next } $2 == $3 && $2 != 0 && $2 != 1 { print }' INPUTFILE
See it in action at Ideone.com
This might work for you:(?)
awk '($2==0 || $2==1) && ($3==0 || $3==1) && $2==$3{next}1' file