diff 2 files with an output that does not include extra lines - linux

I have 2 files test and test1 and I would like to do a diff between them without the output having extra characters 2a3, 4a6, 6a9 as shown below.
mangoes
apples
banana
peach
mango
strawberry
test1:
mangoes
apples
blueberries
banana
peach
blackberries
mango
strawberry
star fruit
when I diff both the files
$ diff test test1
2a3
> blueberries
4a6
> blackberries
6a9
> star fruit
How do I get the output as
$ diff test test1
blueberries
blackberries
star fruit

A solution using comm:
comm -13 <(sort test) <(sort test1)
Explanation
comm - compare two sorted files line by line
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files
As we only need the lines unique to the second file test1, -13 is used to suppress the unwanted columns.
Process Substitution is used to get the sorted files.

You can use grep to filter out lines that are not different text:
$ diff file1 file2 | grep '^[<>]'
> blueberries
> blackberries
> star fruit
If you want to remove the direction indicators that indicate which file differs, use sed:
$ diff file1 file2 | sed -n 's/^[<>] //p'
blueberries
blackberries
star fruit
(But it may be confusing to not see which file differs...)

You can use awk
awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
NR==FNR means currently first file on the command line (i.e. test) is being processed,
a[$0] keeps each record in array named a,
next means read next line without doing anything else,
!($0 in a) means if current line does not exist in a, print it.

Related

comparing files by fields in bash

I have two arbitrary files:
==> file1 <==
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 eggplant
==> file2 <==
11110 abcdefg
11111 apple-pie
11112 banana-cake
11113 chocolate
11115 egg
11116 fruit
For the sake of comparison of these files, I only care about the number in the first column, the words after the break are unimportant.
I want to be able to readily identify numbers that are missing from each file.
For example, file 1 has no 11116 and file 2 has no 11114.
If I sort the files together I can get a complete list:
$ sort file*
11110 abcdef
11110 abcdefg
11111 apple
11111 apple-pie
11112 banana
11112 banana-cake
11113 carrot
11113 chocolate
11114 date
11115 egg
11115 eggplant
11116 fruit
I can get a list of all the numbers by running it through uniq and only comparing the length of the number:
$ sort file* | uniq -w5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 egg
11116 fruit
That's a list of all numbers 11110-11116.
I can get a list of uniques and duplicates by asking uniq to filter those for me:
duplicates (numbers that appear in both files):
$ sort file* | uniq -dw5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11115 egg
unique numbers, or numbers that only appear in one file:
$ sort file* | uniq -uw5
11114 date
11116 fruit
I would like something that has output resembling:
# shows numbers that do not exist in this file
$ sort file* | <is missing>
==> file1 <==
11116 fruit
==> file2 <==
11114 date
It could do the reverse and show what numbers are missing from the OTHER file, each case is workable:
# shows numbers that do exist ONLY in this file
$ sort file* | <has unqie>
==> file1 <==
11114 date
==> file2 <==
11116 fruit
The first field will contain ~30 alphanumeric characters.
The files in question contain thousands of entries and the majority of entries are expected to be in both files.
The arbitrary data to the right of the number is relevant and needs to remain.
I had the idea of:
generate a complete list of numbers
compare that list with file1 searching for unique entries
compare that list with file2 searching for unique entries
But I can't work out how to do that on a single line:
sort file* | uniq -w5 | sort file1 | uniq -uw5
sort file* | uniq -w5 | sort file2 | uniq -uw5
However, the output of the first uniq doesn't get merged in with the resorting of file1/2...
The solution I came up with was to create the output of all the numbers:
$ sort file* | uniq -w5
and then run that against each file individually, that does work. I just couldn't piece it together on one line:
$ sort all file1 | uniq -uw5
11116 fruit
$ sort all file2 | uniq -uw5
11114 date
I am now working on incorporating join, thanks Kamil
edit: I never got to go any further myself, #Shawn gave it to me in one very short line:
join -j1 -v1 file1 file2
After I have two compiled lists in the format I require, a join performed on the files spits out the required answer. From my code examples above:
$join -j1 -v1 file1 file2
11114 date
$ join -j1 -v2 file1 file2
11116 fruit
A real world Example:
I thought I would generate a real world example of what I have been working on. Take 5 arbitrary files:
lorem1.txt
lorem2.txt
lorem3.txt
lorem4.txt
lorem5.txt
and make a backup of them. I have modified one bit in lorem2.txt and I removed `lorem4.txt from the backup (consider it a new file, or for whatever reason, it is just a missing file):
test$ tree
.
├── data
│   ├── lorem1.txt
│   ├── lorem2.txt
│   ├── lorem3.txt
│   ├── lorem4.txt
│   └── lorem5.txt
└── data-backup
├── lorem1.txt
├── lorem2.txt
├── lorem3.txt
└── lorem5.txt
2 directories, 9 files
mad#test$ md5deep data/* | sort > hash1
mad#test$ md5deep data-backup/* | sort > hash2
mad#test$ head hash*
==> hash1 <==
44da5caec444b6f00721f499e97c857a /test/data/lorem1.txt
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
a00edd450c533091e0f62a06902545a4 /test/data/lorem5.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data/lorem3.txt
==> hash2 <==
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
44da5caec444b6f00721f499e97c857a /test/data-backup/lorem1.txt
a00edd450c533091e0f62a06902545a4 /test/data-backup/lorem5.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data-backup/lorem3.txt
Running our joins:
join 1
mad#test$ join -j1 -v1 hash*
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
From our two sets of hash files, joining them verified against the first file, we see that the matching hashes of lorem2.txt and lorem4.txtare missing from the second file. (lorem2because we changed a bit, andlorem4` because we didn't copy, or we deleted the file from the backup).
Doing the reverse join we can see lorem2 exists, it's just that the hash is incorrect:
join 2
mad#test$ join -j1 -v2 hash*
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
Using my sort and uniq examples from earlier, I could get similar results, but the join above is much better. join1 shows us files we need to revisit, join2 specifically shows us what hashes are incorrect.
sort by name and show uniq names (which was way outside the scope of the original question) can show us files that are missing from the backup. In this example, I convert the backup filenames so they mimic the original filenames, merge/sort them with the original filenames and sort based only on the names, not the hashes. This will show files that are missing from the backup:
test$ sort -k2 hash1 <(sed 's/data-backup/data/g' hash2) | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
if we had a file that contained all the hashes:
test$ sort -k2 hash1 allhashes | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
Thanks again to everyone who helped me formulate this. It has turned into a real life and time saver.
Using gnu awk, you can make use of this approach:
awk 'ARGIND < ARGC-1 {
a[ARGIND][$1] = 1
next
} {
for (i=1; i<ARGC-1; i++)
if (!a[i][$1])
print ARGV[i] ":", $0
}' file1 file2 <(sort file1 file2)
file2: 11114 date
file1: 11116 fruit
Only in file1:
grep `comm -23 <(cut -d \ -f 1 f1 | sort) <(cut -d \ -f 1 f2 | sort)` f1
This awk version only takes one pass through each file:
It assumes that there are no duplicate IDs in a file.
awk '
NR == FNR {f1[$1] = $0; next}
!($1 in f1) {printf "only in %s: %s\n", FILENAME, $0}
$1 in f1 {delete f1[$1]}
END {for (id in f1) printf "only in %s: %s\n", ARGV[1], f1[id]}
' file1 file2
ouputs
only in file2: 11116 fruit
only in file1: 11114 date
You can use diff between 2 files. However, if you diff these files all the lines will listed.
$ diff file1 file2
1,6c1,6
< 11110 abcdef
< 11111 apple
< 11112 banana
< 11113 carrot
< 11114 date
< 11115 eggplant
---
> 11110 abcdefg
> 11111 apple-pie
> 11112 banana-cake
> 11113 chocolate
> 11115 egg
> 11116 fruit
But you only care about the leading numbers.
$ diff <(cut -d' ' -f1 file1) <(cut -d' ' -f1 file2)
5d4
< 11114
6a6
> 11116
If the files are not sorted then add a sort
$ diff <(cut -d' ' -f1 file1 | sort) <(cut -d' ' -f1 file2 | sort)
5d4
< 11114
6a6
> 11116

Linux command or/and script for duplicate lines retrieval

I would like to know if there's an easy way way to locate duplicate lines in a text file that contains many entries (about 200.000 or more) and output a file with the duplicates' line numbers, keeping the source file intact. For instance, I got a file with tweets like this:
1. i got red apple
2. i got red apple in my stomach
3. i got green apple
4. i got red apple
5. i like blue bananas
6. i got red apple
7. i like blues music
8. i like blue bananas
9. i like blue bananas
I want the output to be a separate file like this:
4
6
8
9
where numbers will indicate the lines with duplicate entries (excluding the first occurrence of the duplicates). Also note that the matching pattern must be exactly the same sentence (like line 1 is different than line 2, 5 is different than 7 and so on).
Everything I could find with sort | uniq doesn't seem to match the whole sentence but only the first word of the sentence so I'm considering if an awk script would be better for this task or if there is another type of command that can do that.
I also need the first file to be intact (not sorted or reordered in any way) and get only the line numbers as shown above because I want to manually delete these lines from two files. The first file contains the tweets and the second the hashtags of these tweets, so I want to delete the lines that contain duplicate tweets in both files, keeping the first occurrence.
You can try this awk:
awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
As per comment,
awk '$0 in a{print NR} {a[$0]++}' file
Output:
$ awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
4
8
$ awk '$0 in a{print NR} {a[$0]++}' file
4
6
8
9
you could use python script for doing the same.
f = open("file")
lines = f.readlines()
count = len (lines)
i=0
ignore = []
for i in range(count):
if i in ignore:
continue
for j in range(count):
if (j<= i):
continue
if lines[i] == lines[j]:
ignore.append(j)
print j+1
output :
4
6
8
9
Here is a method combining a few command line tools:
nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' |
cut -f 1
This
numbers the lines with nl, left adjusted with no leading zeroes (-n ln)
sorts them (ignoring the the first field, i.e., the line number) with sort
finds duplicate lines, ignoring the first field with uniq; the --all-repeated=prepend adds an empty line before each group of duplicate lines
removes all the empty lines and the first one of each group of duplicates with sed
removes everything but the line number with cut
This is what the output looks like at the different stages:
$ nl -n ln file
1 i got red apple
2 i got red apple in my stomach
3 i got green apple
4 i got red apple
5 i like blue bananas
6 i got red apple
7 i like blues music
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2
3 i got green apple
1 i got red apple
4 i got red apple
6 i got red apple
2 i got red apple in my stomach
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
7 i like blues music
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend
1 i got red apple
4 i got red apple
6 i got red apple
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}'
4 i got red apple
6 i got red apple
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
4
6
8
9

Comparing two files using Awk

I have two text files one with a list of ids and another one with some id and corresponding values.
File 1
abc
abcd
def
cab
kac
File 2
abcd 100
def 200
cab 500
kan 400
So, I want to compare both the files and fetch the value of matching columns and also keep all the id from File 1 and assign "NA" to the ids that don't have a value in File2
Desired output
abc NA
abcd 100
def 200
cab 500
kac NA
PS: Only Awk script/One-liners
The code I'm using to print matching columns:
awk 'FNR==NR{a[$1]++;next}a[$1]{print $1,"\t",$2}'
$ awk 'NR==FNR{a[$1]=$2;next} {print $1, ($1 in a? a[$1]: "NA") }' file2 file1
abc NA
abcd 100
def 200
cab 500
kac NA
Using join and sort (hopefully portable):
export LC_ALL=C
sort -k1 file1 > /tmp/sorted1
sort -k1 file2 > /tmp/sorted2
join -a 1 -e NA -o 0,2.2 /tmp/sorted1 /tmp/sorted2
In bash you can use here-files in a single line:
LC_ALL=C join -a 1 -e NA -o 0,2.2 <(LC_ALL=C sort -k1 file1) <(LC_ALL=C sort -k1 file2)
Note 1, this gives output sorted by 1st column:
abc NA
abcd 100
cab 500
def 200
kac NA
Note 2, the commands may work even without LC_ALL=C. Important is that all sort and join commands are using the same locale.

How to extract some missing rows by comparing two different files in linux?

I have two diferrent files which some rows are missing in one of the files. I want to make a new file including those non-common rows between two files. as and example, I have following files:
file1:
id1
id22
id3
id4
id43
id100
id433
file2:
id1
id2
id22
id3
id4
id8
id43
id100
id433
id21
I want to extract those rows which exist in file2 but do not in file1:
new file:
id2
id8
id21
any suggestion please?
Use the comm utility (assumes bash as the shell):
comm -13 <(sort file1) <(sort file2)
Note how the input must be sorted for this to work, so your delta will be sorted, too.
comm uses an (interleaved) 3-column layout:
column 1: lines only in file1
column 2: lines only in file2
column 3: lines in both files
-13 suppresses columns 1 and 2, which prints only the values exclusive to file2.
Caveat: For lines to be recognized as common to both files they must match exactly - seemingly identical lines that differ in terms of whitespace (as is the case in the sample data in the question as of this writing, where file1 lines have a trailing space) will not match.
cat -et is a command that visualizes line endings and control characters, which is helpful in diagnosing such problems.
For instance, cat -et file1 would output lines such as id1 $, making it obvious that there's a trailing space at the end of the line (represented as $).
If instead of cleaning up file1 you want to compare the files as-is, try:
comm -13 <(sed -E 's/ +$//' file1 | sort) <(sort file2)
A generalized solution that trims leading and trailing whitespace from the lines of both files:
comm -13 <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file1 | sort) \
<(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file2 | sort)
Note: The above sed commands require either GNU or BSD sed.
Edit: I only wanted to change 1 character but 6 is the minimum... Delete this...
You can try to sort both files then count the duplicate rows and select only those row where the count is 1
sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'

how to sort a file according to another file?

Is there a unix oneliner or some other quick way on linux to sort a file according to a permutation set by the sorting of another file?
i.e.:
file1: (separated by CRLFs, not spaces)
2
3
7
4
file2:
a
b
c
d
sorted file1:
2
3
4
7
so the result of this one liner should be
sorted file2:
a
b
d
c
paste file1 file2 | sort | cut -f2
Below is a perl one-liner that will print the contents of file2 based on the sorted input of file1.
perl -n -e 'BEGIN{our($x,$t,#a)=(0,1,)}if($t){$a[$.-1]=$_}else{$a[$.-1].=$_ unless($.>$x)};if(eof){$t=0;$x=$.;close ARGV};END{foreach(sort #a){($j,$l)=split(/\n/,$_,2);print qq($l)}}' file1 file2
Note: If the files are different lengths, the output will only print up to the shortest file length.
For example, if file-A has 5 lines and file-B has 8 lines then the output will only be 5 lines.

Resources