Comparing files by fields in bash (Linux)

I have two arbitrary files:
==> file1 <==
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 eggplant
==> file2 <==
11110 abcdefg
11111 apple-pie
11112 banana-cake
11113 chocolate
11115 egg
11116 fruit
For the purposes of comparing these files, I only care about the number in the first column; the words after the space are unimportant.
I want to be able to readily identify numbers that are missing from each file.
For example, file 1 has no 11116 and file 2 has no 11114.
If I sort the files together I can get a complete list:
$ sort file*
11110 abcdef
11110 abcdefg
11111 apple
11111 apple-pie
11112 banana
11112 banana-cake
11113 carrot
11113 chocolate
11114 date
11115 egg
11115 eggplant
11116 fruit
I can get a list of all the numbers by running it through uniq, comparing only the first five characters of each line:
$ sort file* | uniq -w5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 egg
11116 fruit
That's a list of all numbers 11110-11116.
I can get a list of uniques and duplicates by asking uniq to filter those for me:
duplicates (numbers that appear in both files):
$ sort file* | uniq -dw5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11115 egg
unique numbers, or numbers that only appear in one file:
$ sort file* | uniq -uw5
11114 date
11116 fruit
I would like something that has output resembling:
# shows numbers that do not exist in this file
$ sort file* | <is missing>
==> file1 <==
11116 fruit
==> file2 <==
11114 date
It could do the reverse and show which numbers are missing from the OTHER file; either case is workable:
# shows numbers that do exist ONLY in this file
$ sort file* | <has unique>
==> file1 <==
11114 date
==> file2 <==
11116 fruit
The first field will contain ~30 alphanumeric characters.
The files in question contain thousands of entries and the majority of entries are expected to be in both files.
The arbitrary data to the right of the number is relevant and needs to remain.
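An aside (my addition, not from the original post): uniq -w is a GNU extension and hard-codes the key width, so -w5 won't carry over to ~30-character fields of varying length. An awk filter keyed on the whole first field is width-agnostic; a sketch of equivalents for the two uniq calls above:
$ sort file* | awk '!seen[$1]++'        # like uniq -w5: first line per key
$ sort file* | awk '{c[$1]++; l[$1]=$0} END {for (k in c) if (c[k] == 1) print l[k]}' | sort   # like uniq -uw5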
I had the idea of:
1. generate a complete list of numbers
2. compare that list with file1, searching for unique entries
3. compare that list with file2, searching for unique entries
But I can't work out how to do that on a single line:
sort file* | uniq -w5 | sort file1 | uniq -uw5
sort file* | uniq -w5 | sort file2 | uniq -uw5
However, these don't work: sort ignores its stdin when given file arguments, so the output of the first uniq never gets merged into the re-sort of file1/file2...
The solution I came up with was to save the complete list of numbers to a file:
$ sort file* | uniq -w5 > all
and then run that against each file individually. That works; I just couldn't piece it together on one line:
$ sort all file1 | uniq -uw5
11116 fruit
$ sort all file2 | uniq -uw5
11114 date
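(A retrospective sketch, assuming bash: process substitution can stand in for the intermediate all file, so the two steps do fit on one line.)
$ sort <(sort file* | uniq -w5) file1 | uniq -uw5
11116 fruit
$ sort <(sort file* | uniq -w5) file2 | uniq -uw5
11114 date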
I am now working on incorporating join. (Thanks, Kamil!)
Edit: I never got to take it any further myself; @Shawn gave it to me in one very short line:
join -j1 -v1 file1 file2
After I have two compiled lists in the format I require, a join performed on the files spits out the required answer. From my code examples above:
$ join -j1 -v1 file1 file2
11114 date
$ join -j1 -v2 file1 file2
11116 fruit
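One caveat worth noting here (my addition): join expects both inputs to be sorted on the join field. The sample files already are; if yours are not, sort them inline with process substitution:
$ join -j1 -v1 <(sort file1) <(sort file2)
11114 date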
A real-world example:
I thought I would generate a real world example of what I have been working on. Take 5 arbitrary files:
lorem1.txt
lorem2.txt
lorem3.txt
lorem4.txt
lorem5.txt
and make a backup of them. I have modified one bit in lorem2.txt and removed lorem4.txt from the backup (consider it a new file or, for whatever reason, just a missing file):
test$ tree
.
├── data
│   ├── lorem1.txt
│   ├── lorem2.txt
│   ├── lorem3.txt
│   ├── lorem4.txt
│   └── lorem5.txt
└── data-backup
    ├── lorem1.txt
    ├── lorem2.txt
    ├── lorem3.txt
    └── lorem5.txt
2 directories, 9 files
mad#test$ md5deep data/* | sort > hash1
mad#test$ md5deep data-backup/* | sort > hash2
mad#test$ head hash*
==> hash1 <==
44da5caec444b6f00721f499e97c857a /test/data/lorem1.txt
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
a00edd450c533091e0f62a06902545a4 /test/data/lorem5.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data/lorem3.txt
==> hash2 <==
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
44da5caec444b6f00721f499e97c857a /test/data-backup/lorem1.txt
a00edd450c533091e0f62a06902545a4 /test/data-backup/lorem5.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data-backup/lorem3.txt
Running our joins:
join 1
mad#test$ join -j1 -v1 hash*
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
From our two sets of hash files, joining them verified against the first file, we see that the matching hashes of lorem2.txt and lorem4.txt are missing from the second file (lorem2.txt because we changed a bit, and lorem4.txt because we didn't copy it, or we deleted the file from the backup).
Doing the reverse join, we can see that lorem2.txt exists; it's just that the hash is incorrect:
join 2
mad#test$ join -j1 -v2 hash*
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
Using my sort and uniq examples from earlier, I could get similar results, but the join above is much better: join 1 shows us files we need to revisit, and join 2 specifically shows us which hashes are incorrect.
Sorting by name and showing unique names (way outside the scope of the original question) can show us files that are missing from the backup. In this example, I rewrite the backup filenames so they mimic the original filenames, merge them with the original list, and sort on the names only, not the hashes. This shows files that are missing from the backup:
test$ sort -k2 hash1 <(sed 's/data-backup/data/g' hash2) | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
If we had a file that contained all the hashes (allhashes):
test$ sort -k2 hash1 allhashes | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
Thanks again to everyone who helped me formulate this. It has turned into a real time- and life-saver.

Using GNU awk, you can make use of this approach:
awk 'ARGIND < ARGC-1 {
    a[ARGIND][$1] = 1        # remember which IDs each real file contains
    next
} {
    # last argument is the sorted union: check each line against every file
    for (i=1; i<ARGC-1; i++)
        if (!a[i][$1])
            print ARGV[i] ":", $0
}' file1 file2 <(sort file1 file2)
file2: 11114 date
file1: 11116 fruit

Only in file1:
grep -wFf <(comm -23 <(cut -d' ' -f1 f1 | sort) <(cut -d' ' -f1 f2 | sort)) f1
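The reverse direction (lines only in f2) is the same pattern with comm -13 (my addition, using the same hypothetical f1/f2 names):
$ grep -wFf <(comm -13 <(cut -d' ' -f1 f1 | sort) <(cut -d' ' -f1 f2 | sort)) f2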

This awk version only takes one pass through each file:
It assumes that there are no duplicate IDs in a file.
awk '
NR == FNR {f1[$1] = $0; next}                           # first file: index whole lines by ID
!($1 in f1) {printf "only in %s: %s\n", FILENAME, $0}   # in file2 but not file1
$1 in f1 {delete f1[$1]}                                # in both files: forget it
END {for (id in f1) printf "only in %s: %s\n", ARGV[1], f1[id]}   # leftovers: only in file1
' file1 file2
outputs:
only in file2: 11116 fruit
only in file1: 11114 date

You can use diff between the two files. However, if you diff these files directly, all the lines will be listed:
$ diff file1 file2
1,6c1,6
< 11110 abcdef
< 11111 apple
< 11112 banana
< 11113 carrot
< 11114 date
< 11115 eggplant
---
> 11110 abcdefg
> 11111 apple-pie
> 11112 banana-cake
> 11113 chocolate
> 11115 egg
> 11116 fruit
But you only care about the leading numbers, so compare just the first fields:
$ diff <(cut -d' ' -f1 file1) <(cut -d' ' -f1 file2)
5d4
< 11114
6a6
> 11116
If the files are not sorted, then add a sort:
$ diff <(cut -d' ' -f1 file1 | sort) <(cut -d' ' -f1 file2 | sort)
5d4
< 11114
6a6
> 11116

Related

Join two files in Linux

I am trying to join two files but they don't have the same number of lines. I need to join them by the second column.
File1:
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
File2:
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
The output for paired lines should look like:
San Noor#Steve White
The output for unpairable lines should look like:
Sara Moheeb#NA
(File3 after joining should contain 5 lines and look as follows.)
San Noor#Steve White
Maria Shiry#Hight Look
John Smith#The Heaven is more
Viki Sam#The Roma Seen
Sara Moheeb#NA
I have tried to join these two files using this command:
join -t '#' -j2 -e "NA" <(sort -t '#' -k2 File1) <(sort -t '#' -k2 File2) > File3
It says that both files are not sorted. Also, I need a way to fill in missing values after join.
Extract the relevant columns and paste them together:
paste -d '#' <(cut -d '#' -f2 file1) <(cut -d '#' -f2 file2)
Well, but this will fail in the NA case, when one file has fewer lines than the other. You could pipe it through awk to replace each empty field with the string NA, as sketched below.
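Putting that together (my sketch of the suggestion above; the awk rewrites empty fields and the trailing 1 prints every line):
$ paste -d'#' <(cut -d'#' -f2 file1) <(cut -d'#' -f2 file2) | awk -F'#' -v OFS='#' '{for (i=1; i<=NF; ++i) if ($i == "") $i = "NA"} 1'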
So I guess your method is a possible one, but you have nothing to "join" the files on. So join on an imaginary column of line numbers:
join -t'#' -eNA -a1 -a2 -o1.2,2.2 <(cut -d'#' -f2 file1 | nl -w1 -s'#') <(cut -d'#' -f2 file2 | nl -w1 -s'#')
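Run against the sample files, the nl line numbers pair rows positionally, -a1 -a2 keep unpairable rows from either side, and -e NA with -o fills the missing field, so this should print the five desired lines (my reading of the flags, not output captured from the original post):
San Noor#Steve White
Maria Shiry#Hight Look
John Smith#The Heaven is more
Viki Sam#The Roma Seen
Sara Moheeb#NA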

diff 2 files with an output that does not include extra lines

I have 2 files, test and test1, and I would like to diff them without the output having the extra characters 2a3, 4a6, 6a9 shown below.
test:
mangoes
apples
banana
peach
mango
strawberry
test1:
mangoes
apples
blueberries
banana
peach
blackberries
mango
strawberry
star fruit
When I diff both files:
$ diff test test1
2a3
> blueberries
4a6
> blackberries
6a9
> star fruit
How do I get the output as:
$ diff test test1
blueberries
blackberries
star fruit
A solution using comm:
comm -13 <(sort test) <(sort test1)
Explanation
comm - compare two sorted files line by line
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
As we only need the lines unique to the second file test1, -13 is used to suppress the unwanted columns.
Process substitution is used to feed in the sorted files.
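For reference, a run against the sample files (my addition); because the inputs are sorted, the lines come out in lexicographic order rather than test1's original order:
$ comm -13 <(sort test) <(sort test1)
blackberries
blueberries
star fruit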
You can use grep to filter out everything except the changed lines:
$ diff test test1 | grep '^[<>]'
> blueberries
> blackberries
> star fruit
If you want to remove the direction indicators that indicate which file differs, use sed:
$ diff test test1 | sed -n 's/^[<>] //p'
blueberries
blackberries
star fruit
(But it may be confusing not to see which file differs...)
You can use awk:
awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
NR==FNR means the first file on the command line (i.e. test) is currently being processed,
a[$0] keeps each record in an array named a,
next means read the next line without doing anything else,
!($0 in a) means if the current line does not exist in a, print it.
Note that this needs no sorting, so unlike the comm approach it preserves test1's original line order.

Bash: join by numeric column

If I want to use join on my Ubuntu machine, I need to first sort both files lexicographically (according to join --help), and only then join them:
tail -n +2 meta/201508_1 | sort -k 1b,1 > meta.txt
tail -n +2 keywords/copy | sort -k 1b,1 > keywords.txt
join meta.txt keywords.txt -1 1 -2 1 -t $'\t'
(I also remove the header from both of them using tail.)
But instead of sorting files lexicographically, I would like to sort them numerically: the first column in both files is an ID.
tail -n +2 meta/201508_1 | sort -k1 -n > meta.txt
tail -n +2 keywords/copy.txt | sort -k1 -n > keywords.txt
And then join them. But join considers these files unsorted:
join: meta.txt:10: is not sorted: 1023 301000 en
join: keywords.txt:2: is not sorted: 10 keyword1
If I add --nocheck-order to join, it doesn't join properly; it outputs just one line.
How do I join two files on their numerical ID in bash?
Sample (columns are tab-separated):
file 1
id volume lang
1 10 en
2 20 en
5 30 en
6 40 en
10 50 en
file 2
id keyword
4 kw1
2 kw2
10 kw3
1 kw4
3 kw5
desired output
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
Both of these work. The first one uses a bare sort -b (the form recommended in the man page on the Mac):
join <(sed 1d file1 | sort -b) <(sed 1d file2 | sort -b) | sort -n
The Linux man page recommends sort -k 1b,1:
join <(sed 1d file1 | sort -k 1b,1) <(sed 1d file2 | sort -k 1b,1) | sort -n
In any case, you need to sort them lexicographically to join them. At the end you can still sort the result numerically.
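One wrinkle (my note, not part of the original answer): the desired output puts the keyword before the volume, which join's default column order won't produce; the -o flag can reorder the output fields:
$ join -o 0,2.2,1.2,1.3 <(sed 1d file1 | sort -b) <(sed 1d file2 | sort -b) | sort -n
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en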
You can ditch join and use awk instead:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{print $1, a[$1], $2, $3}' file2 file1 | column -t
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
It is probably already in the order that you want (as per the ID column in file1). However, if you need specific sorting, you can do:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{
print $1, a[$1], $2, $3}' file2 file1 | sort -nk1 | column -t
Note that column -t is there to produce tabular formatted output.

Bash: remove the same lines in two files

I have an issue with counting the number of differing strings.
I have two files, for example:
file1 :
aaa1
aaa4
bbb3
ccc2
and
file2:
bbb3
ccc2
aaa4
How do I get the value 1 from this (1 in this case because of the string aaa1)?
I have one approach, but it counts not only the differing strings; it also takes the order of the rows into account:
diff file1 file2 | grep "<" | wc -l
Thanks.
You can use grep -v -c with other options, like this:
grep -cvwFf file2 file1
1
Options used are:
-c - get the count of matches
-v - invert matches
-w - full word match (to avoid partial matches)
-F - fixed string match
-f - Use a file for matching patterns
As far as I understand your requirements, sorting the files prior to the diff is a quick solution:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | egrep "^[<>]" | wc -l
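An alternative that avoids diff's markers entirely (my sketch): comm -23 prints only the lines unique to the first sorted file, which can be counted directly:
$ comm -23 <(sort file1) <(sort file2) | wc -l
1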

Two text file comparison with grep

I have two files (a.txt, b.txt)
a.txt is a list of English words (one word in every row)
b.txt contains in every row: a number, a space character, and a 5-65 character long string
(for example b.txt can contain: 1234 dsafaaraehawada)
I would like to know which rows in b.txt contain words from a.txt, and how many of them?
Example input:
a.txt
green
apple
bar
b.txt
1212 greensdsdappleded
12124 dfsfsd
123 bardws
output:
2 1212 greensdsdappleded
1 123 bardws
The first row contains 'green' and 'apple' (2).
The second row contains nothing.
The third row contains 'bar' (1).
That's all I would like to know.
The code (by Mr. Barmar):
grep -F -o -f a.txt b.txt | sort | uniq -c | sort -nr
But it needs to be modified: as written, it counts how many times each word matches across the whole file, and it drops the row contents.
Try something like this:
awk 'NR==FNR{A[$1]; next} {t=0; for (i in A) t+=gsub(i,"&",$2)} t{print t, $0}' a.txt b.txt
Try something like this:
awk '
NR==FNR { list[$1]++; next }
{
cnt=0
for(word in list) {
if(index($2,word) > 0)
cnt++
}
if(cnt>0)
print cnt,$0
}' a.txt b.txt
Test:
$ cat a.txt
green
apple
bar
$ cat b.txt
1212 greensdsdappleded
12124 dfsfsd
123 bardws
$ awk '
NR==FNR { list[$1]++; next }
{
cnt=0
for(word in list) {
if(index($2,word) > 0)
cnt++
}
if(cnt>0)
print cnt,$0
}' a.txt b.txt
2 1212 greensdsdappleded
1 123 bardws
