Comparing two files in linux terminal - linux

There are two files called "a.txt" and "b.txt" both have a list of words. Now I want to check which words are extra in "a.txt" and are not in "b.txt".
I need a efficient algorithm as I need to compare two dictionaries.

if you have vim installed,try this:
vimdiff file1 file2
or
vim -d file1 file2
you will find it fantastic.

Sort them and use comm:
comm -23 <(sort a.txt) <(sort b.txt)
comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly, if they are already sorted you don't need this.

If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in timecommand) this approach vs some of the other answers here:
git diff --no-index a.txt b.txt
# ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s
diff a.txt b.txt
# ~2.6s
sdiff a.txt b.txt
# ~2.7s
vimdiff a.txt b.txt
# ~3.2s
comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.
Update 2018-03-25 You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.

Try sdiff (man sdiff)
sdiff -s file1 file2

You can use diff tool in linux to compare two files. You can use --changed-group-format and --unchanged-group-format options to filter required data.
Following three options can use to select the relevant group for each option:
'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.
E.g: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt
[root#vmoracle11 tmp]# cat file1.txt
test one
test two
test three
test four
test eight
[root#vmoracle11 tmp]# cat file2.txt
test one
test three
test nine
[root#vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
test two
test four
test eight

You can also use: colordiff: Displays the output of diff with colors.
About vimdiff: It allows you to compare files via SSH, for example :
vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

Also, do not forget about mcdiff - Internal diff viewer of GNU Midnight Commander.
For example:
mcdiff file1 file2
Enjoy!

Use comm -13 (requires sorted files):
$ cat file1
one
two
three
$ cat file2
one
two
three
four
$ comm -13 <(sort file1) <(sort file2)
four

You can also use:
sdiff file1 file2
To display differences side by side within your terminal!

diff a.txt b.txt | grep '<'
can then pipe to cut for a clean output
diff a.txt b.txt | grep '<' | cut -c 3

Here is my solution for this :
mkdir temp
mkdir results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
cat ~/temp/american-english-dictionary | wc -l > ~/results/count-american-english-dictionary
cat ~/temp/british-english-dictionary | wc -l > ~/results/count-british-english-dictionary
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english

Using awk for it. Test files:
$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one
The awk:
$ awk '
NR==FNR { # process b.txt or the first file
seen[$0] # hash words to hash seen
next # next word in b.txt
} # process a.txt or all files after the first
!($0 in seen)' b.txt a.txt # if word is not hashed to seen, output it
Duplicates are outputed:
four
four
To avoid duplicates, add each newly met word in a.txt to seen hash:
$ awk '
NR==FNR {
seen[$0]
next
}
!($0 in seen) { # if word is not hashed to seen
seen[$0] # hash unseen a.txt words to seen to avoid duplicates
print # and output it
}' b.txt a.txt
Output:
four
If the word lists are comma-separated, like:
$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three
you have to do a couple of extra laps (forloops):
awk -F, ' # comma-separated input
NR==FNR {
for(i=1;i<=NF;i++) # loop all comma-separated fields
seen[$i]
next
}
{
for(i=1;i<=NF;i++)
if(!($i in seen)) {
seen[$i] # this time we buffer output (below):
buffer=buffer (buffer==""?"":",") $i
}
if(buffer!="") { # output unempty buffers after each record in a.txt
print buffer
buffer=""
}
}' b.txt a.txt
Output this time:
four
five,six

Related

linux merge multiple files, but skip lines starting with '#'

I have a list of 10 files that I want to merge into one file.
file1.txt
file2.txt
...
file10.txt
I normally do this with cat
cat file*.txt > merged_file.txt
However, I don't want the lines starting with '#' to be included in the merged_file.txt. How do I do this?
Something like this:
cat file*.txt | egrep -v '^#.*$' > merged_file.txt

Comparing two files and applying the differences

on a Linux based system, I can easily compare two files, e.g.:
diff file1.txt file2.txt
...and see the difference between them.
What if I want to take all lines that are unique to file2.txt and apply them to file1.txt so that file1.txt will now contain everything it had + lines from file2.txt that it didn't have before? Is there an easy way to do it?
Using patch
You can use diff's output to create a patch file.
diff original_file file_with_new_lines > patch_file
You can edit patch_file to keep only the additions, since you only want the new lines.
Then you can use the patch command to apply this patch file:
patch original_file patch_file
If you don't mind appending the sorted diff to your file, you can use comm:
cat file1.txt <(comm -13 <(sort f1.txt) <(sort f2.txt)) > file1.txt.patched
or
comm -13 <(sort f1.txt) <(sort f2.txt) | cat file1.txt - > file1.txt.patched
This will append the unique lines from file2.txt to file1.txt.

Compare two files in unix and add the delta to one file

Both files has lines of string and numeric data minimum of 2000 lines.
How to add non duplicate data from file2.txt to file1.txt.
Basically file2 has the new data lines but we also want to ensure we are not adding duplicate lines to file1.txt.
File1.txt > this is the main data file
File2.txt > this file has the new data we want to add to file1
thanks,
Sort the two files together with the -u option to remove duplicates.
sort -u File1.txt File2.txt > NewFile.txt && mv NewFile.txt File1.txt
Another option if the file is sorted, just to have some choice (and I like comm :) )
comm --check-order --output-delimiter='' -13 File1.txt File2.txt >> File1.txt
use awk:
awk '!a[$0]++' File1.txt File2.txt
You can use grep, like this:
# grep those lines from file2 which are not in file1
grep -vFf file1 file2 > new_file2
# append the results to file1
cat new_file2 >> file1

shell script to compare two files and write the difference to third file

I want to compare two files and redirect the difference between the two files to third one.
file1:
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
In case any file has # before /opt/c/c.sql, it should skip #
file2:
/opt/c/c.sql
/opt/a/a.sql
I want to get the difference between the two files. In this case, /opt/b/b.sql should be stored in a different file. Can anyone help me to achieve the above scenarios?
file1
$ cat file1 #both file1 and file2 may contain spaces which are ignored
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
/opt/h/m.sql
file2
$ cat file2
/opt/c/c.sql
/opt/a/a.sql
Do
awk 'NR==FNR{line[$1];next}
{if(!($1 in line)){if($0!=""){print}}}
' file2 file1 > file3
file3
$ cat file3
/opt/b/b.sql
/opt/h/m.sql
Notes:
The order of files passed to awk is important here, pass the file to check - file2 here - first followed by the master file -file1.
Check awk documentation to understand what is done here.
You can use some tools like cat, sed, sort and uniq.
The main observation is this: if the line is in both files then it is not unique in cat file1 file2.
Furthermore in cat file1 file2| sort, all doubles are in sequence. Using uniq -u we get unique lines and have this pipe:
cat file1 file2 | sort | uniq -u
Using sed to remove leading whitespace, empty and comment lines, we get this final pipe:
cat file1 file2 | sed -r 's/^[ \t]+//; /^#/ d; /^$/ d;' | sort | uniq -u > file3

How to find which line is missing in another file

on Linux box
I have one file as below
A.txt
1
2
3
4
Second file as below
B.txt
1
2
3
6
I want to know what is inside A.txt but not in B.txt
i.e. it should print value 4
I want to do that on Linux.
awk 'NR==FNR{a[$0]=1;next}!a[$0]' B A
didn't test, give it a try
Use comm if the files are sorted as your sample input shows:
$ comm -23 A.txt B.txt
4
If the files are unsorted, see #Kent's awk solution.
You can also do this using grep by combining the -v (show non-matching lines), -x (match whole lines) and -f (read patterns from file) options:
$ grep -v -x -f B.txt A.txt
4
This does not depend on the order of the files - it will remove any lines from A that match a line in B.
(An addition to #rjmunro's answer)
The proper way to use grep for this is:
$ grep -F -v -x -f B.txt A.txt
4
Without the -F flag, grep interprets PATTERN, read from B.txt, as a basic regular expression (BRE), which is undesired here, and can cause troubles. -F flag makes grep treat PATTERN as a set of newline-separated strings. For instance:
$ cat A.txt
&
^
[
]
$ cat B.txt
[
^
]
|
$ grep -v -x -f B.txt A.txt
grep: B.txt:1: Invalid regular expression
$ grep -F -v -x -f B.txt A.txt
&
Using diff:
diff --changed-group-format='%<' --unchanged-group-format='' A.txt B.txt

Resources