compare columns from different files and print those that DO NOT match - linux

I have two files, file1 and file2. I want to compare columns $1, $2, $3 and $4 of file1 with columns $1, $2, $3 and $4 of file2, and print those rows of file2 that do not match any row in file1.
E.g.
file1
aaa bbb ccc 1 2 3
aaa ccc eee 4 5 6
fff sss sss 7 8 9
file2
aaa bbb ccc 1 f a
mmm nnn ooo 1 d e
aaa ccc eee 4 a b
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
fff sss sss 7 5 6
I want to have as output:
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
I have seen questions asked here about finding and printing the rows that do match, but not vice versa, i.e. those that DO NOT match.
Thank you!

Use the following script:
awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
k is the concatenation of columns 1, 2, 3 and 4, delimited by FS, and is later used as a key into the lookup array a. NR==FNR is true only while reading file1, so the array a gets populated with those keys while file1 is read.
For the remaining lines of input (file2), !(k in a) checks whether the key is absent from a. If that evaluates to true, awk prints the line.
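Running it against the sample files from the question gives exactly the desired rows:
$ awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n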

Here is another approach using join, usable if you know which characters can occur in the data. join needs its inputs sorted on the join field, so the helper function sorts after building the key:
$ function f(){ sed 's/ /~/g;s/~/ /4g' "$1" | sort; }; join -v2 <(f file1) <(f file2) |
sed 's/~/ /g'
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n
This creates a key field by concatenating the first four fields (with a ~ character, but any character that cannot occur in the data will do), uses join -v2 to find the entries of file2 that have no match in file1, and splits the synthetic key field back apart. Note that s/~/ /4g (replace from the 4th occurrence onward) is a GNU sed extension.
However, the best way is the awk solution, with a slight fix:
$ awk 'NR==FNR{a[$1,$2,$3,$4]; next} !(($1,$2,$3,$4) in a)' file1 file2
The ($1,$2,$3,$4) subscript is awk's native multi-dimensional index; it joins the values with the SUBSEP character (the unlikely byte \034 by default) instead of FS.

No doubt that the awk solution from @hek2mgl is better than this one, but for information this is also possible using uniq, sort, and rev:
rev file1 file2 | sort -k3 | uniq -u -f2 | rev
rev reverses each line of both files, right to left.
sort -k3 sorts the lines, skipping the first two fields (which are the reversed last two columns).
uniq -u -f2 prints only the lines that are unique, again skipping the first two fields for the comparison.
The final rev turns the lines back around.
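With the sample files from the question, the pipeline indeed yields only the unmatched rows of file2:
$ rev file1 file2 | sort -k3 | uniq -u -f2 | rev
mmm nnn ooo 1 d e
ppp qqq rrr 4 e a
sss ttt uuu 7 m n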
This solution sorts the lines of both files, which may or may not be desired. Note also that it cannot tell which file a line came from: a row of file1 with no match in file2 would appear in the output too.


Swap column x of tab-separated values file with column x of second tsv file

Let's say I have:
file1.tsv
Foo\tBar\tabc\t123
Bla\tWord\tabc\tqwer
Blub\tqwe\tasd\tqqq
file2.tsv
123\tzxcv\tAAA\tqaa
asd\t999\tBBB\tdef
qwe\t111\tCCC\tabc
And I want to overwrite column 3 of file1.tsv with column 3 of file2.tsv to end up with:
Foo\tBar\tAAA\t123
Bla\tWord\tBBB\tqwer
Blub\tqwe\tCCC\tqqq
What would be a good way to do this in bash?
Take a look at this awk:
awk 'FNR==NR{a[NR]=$3;next}{$3=a[FNR]}1' OFS='\t' file{2,1}.tsv > output.tsv
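Here file{2,1}.tsv is brace expansion for file2.tsv file1.tsv, so file2 is read first. The same program, spelled out with comments (adding an explicit -F'\t' is a sensible hardening in case fields ever contain spaces):
awk -F'\t' '
    FNR==NR { a[NR] = $3; next }  # first file read (file2.tsv): remember its 3rd column per line number
    { $3 = a[FNR] }               # second file (file1.tsv): overwrite its 3rd column
    1                             # always-true pattern: print the rebuilt line
' OFS='\t' file2.tsv file1.tsv > output.tsv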
If you want to use just bash, it takes a little more effort:
while IFS=$'\t' read -r a1 a2 _ a4; do
IFS=$'\t' read -ru3 _ _ b3 _
printf '%s\t%s\t%s\t%s\n' "$a1" "$a2" "$b3" "$a4"
done <file1.tsv 3<file2.tsv >output.tsv
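In this loop, 3<file2.tsv opens file2.tsv on file descriptor 3 and read -ru3 reads from that descriptor, so each iteration consumes one line from each file in lockstep; the _ variables discard the fields that are not needed.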
Output:
Foo Bar AAA 123
Bla Word BBB qwer
Blub qwe CCC qqq
Another way to do this, with a correction pointed out by @PesaThe:
paste -d$'\t' <(cut -d$'\t' -f1,2 file1.tsv) <(cut -d$'\t' -f3 file2.tsv) <(cut -d$'\t' -f4 file1.tsv)
The output will be:
Foo Bar AAA 123
Bla Word BBB qwer
Blub qwe CCC qqq

How to compare two columns in same file and store the difference in new file with the unchanged column according to it?

Row Actual Expected
1 AAA BBB
2 CCC CCC
3 DDD EEE
4 FFF GGG
5 HHH HHH
I want to compare actual and expected and store the difference in a file. Like
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
I have used awk -F, '{if ($2!=$3) {print $1,$2,$3}}' Sample.csv, but it only compares int values, not string values.
You can use awk to do this:
awk '{if($2!=$3) print $0}' oldfile > newfile
where
$2 and $3 are the second and third columns
!= means the second and third columns do not match
$0 means the whole line
> newfile redirects the output to a new file
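If the file really is comma-separated, as the name Sample.csv suggests, the same test just needs the field separator set (a sketch; it assumes no quoted fields containing commas):
awk -F, '$2 != $3' Sample.csv > newfile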
I prefer an awk solution (it can handle more fields and is easier to understand), but you could use
sed -r '/\t([^\t]*)\t\1$/d' Sample.csv
assuming the columns are tab-separated.
Assuming the file uses tab (or some other delimiter) to separate the columns, tsv-filter from eBay's TSV Utilities supports this type of field comparison directly. For the file above:
$ tsv-filter --header --ff-str-ne 2:3 file.tsv
Row Actual Expected
1 AAA BBB
3 DDD EEE
4 FFF GGG
The --ff-str-ne option compares two fields in a row for non-equal strings.
Disclaimer: I'm the author.

printf format specifiers in awk do not work for multiple parameters

I'm trying to write a Bash script named example7 which accepts as parameters a file name (let's call it file 1) and a list of
numbers (below we'll call it list 1). The program needs to print the columns from
file 1 after aligning them to the right or left according to the numbers in list 1. (This is achievable using awk's printf.)
Example
Suppose the contents of an F1 file are:
A abcd ddd eee zz tt
ab gggwe 12 88 iii jjj
yaara yyzz 12abcd xyz x y z
After running the program by command:
example7 F1 -8 -7 6 4
Output:
A abcd ddd eee
ab gggwe 12 88
yaara yyzz 12abcd xyz
In the example above, between A and abcd there are 7 spaces, between abcd and ddd
there are 6 spaces, and between ddd and eee
there is one space.
Another example:
After running the program by command:
example7 F1 -8 -7 6 4 5
Output:
A abcd ddd eee zz
ab gggwe 12 88 iii
yaara yyzz 12abcd xyz x
In the example above, between A and abcd there are 7 spaces, between abcd and ddd
there are 6 spaces, between ddd and eee
there is one space, between eee and zz there are 3 spaces, between 88 and iii
there are two spaces, and between xyz and x there are 4 spaces.
I've tried doing something like this:
file=$1
shift
awk '{printf "%'$1's\n" ,$1}' $file
but it only works for one number and one column, and I don't know how to do it for multiple columns and multiple parameters.
Any help will be appreciated.
Set an awk variable to all the remaining parameters, then split it and loop over them.
file=$1
shift
awk -v sizes="$*" '{words = split(sizes, s); for(i = 1; i <= words; i++) printf("%" s[i] "s", $i); print ""; }' "$file"
It's generally wrong to try to substitute a shell variable directly into an awk script. You should prefer to set an awk variable using -v, and then use awk's own string concatenation operation, as I did with s[i].
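As a quick check, assuming the three lines above are saved as example7 and made executable, the first example from the question produces the described spacing:
$ ./example7 F1 -8 -7 6 4
A       abcd      ddd eee
ab      gggwe      12  88
yaara   yyzz   12abcd xyz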

how to sort a file according to another file?

Is there a Unix one-liner or some other quick way on Linux to sort a file according to the permutation set by the sorting of another file?
i.e.:
file1: (separated by CRLFs, not spaces)
2
3
7
4
file2:
a
b
c
d
sorted file1:
2
3
4
7
so the result of this one liner should be
sorted file2:
a
b
d
c
paste file1 file2 | sort | cut -f2
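To see what is happening, paste glues the two files together line by line, sort orders the pairs by the first column, and cut -f2 keeps only the second column. Note that plain sort compares lexicographically; for numeric keys of varying width (e.g. 10 vs 2), use sort -n instead.
$ paste file1 file2
2	a
3	b
7	c
4	d
$ paste file1 file2 | sort | cut -f2
a
b
d
c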
Below is a perl one-liner that will print the contents of file2 based on the sorted input of file1.
perl -n -e 'BEGIN{our($x,$t,@a)=(0,1,)}if($t){$a[$.-1]=$_}else{$a[$.-1].=$_ unless($.>$x)};if(eof){$t=0;$x=$.;close ARGV};END{foreach(sort @a){($j,$l)=split(/\n/,$_,2);print qq($l)}}' file1 file2
Note: If the files are different lengths, the output will only print up to the shortest file length.
For example, if file-A has 5 lines and file-B has 8 lines then the output will only be 5 lines.

In a *nix environment, how would I group columns together?

I have the following text file:
A,B,C
A,B,C
A,B,C
Is there a way, using standard *nix tools (cut, grep, awk, sed, etc), to process such a text file and get the following output:
A
A
A
B
B
B
C
C
C
You can do:
tr , \\n
and that will generate
A
B
C
A
B
C
A
B
C
which you could sort.
Unless you want to pull the first column then second then third, in which case you want something like:
awk -F, '{for(i=1;i<=NF;++i) print i, $i}' | sort -sk1 | awk '{print $2}'
To explain this, the first part generates
1 A
2 B
3 C
1 A
2 B
3 C
1 A
2 B
3 C
the second part will stably sort (so the internal order is preserved)
1 A
1 A
1 A
2 B
2 B
2 B
3 C
3 C
3 C
and the third part will strip the numbers.
You could use a shell for-loop combined with cut if you know the number of columns in advance. Here is an example using bash syntax:
for i in {1..3}; do
cut -d, -f $i file.txt
done
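For the three-column sample above (saved as file.txt), this prints each column in turn:
$ for i in {1..3}; do cut -d, -f $i file.txt; done
A
A
A
B
B
B
C
C
C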
Try:
awk 'BEGIN {FS=","} /([A-C],)+([A-C])?/ {for (i=1;i<=NF;i++) print $i}' YOURFILE | sort
Note that the /([A-C],)+([A-C])?/ guard is tied to the sample's A-C values; drop or adjust it for other data.
