How to display only different rows using diff (bash) - linux

How can I display only different rows using diff in a separate file?
For example, file number 1 contains the lines:
1;john;125;3
1;tom;56;2
2;jack;10;5
File number 2 contains the following lines:
1;john;125;3
1;tom;58;2
2;jack;10;5
How can I make the following happen?
1;tom;58;2

a.txt:
1;john;125;3
1;tom;56;2
2;jack;10;5
b.txt:
1;john;125;3
1;tom;58;2
2;jack;10;5
Use comm:
comm -13 a.txt b.txt
1;tom;58;2
The command-line options to comm are pretty straightforward:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
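For example, with no columns suppressed, comm prints all three columns for the sample files (column 2 is indented by one tab, column 3 by two):
$ comm a.txt b.txt
		1;john;125;3
1;tom;56;2
	1;tom;58;2
		2;jack;10;5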

Here's a simple solution that I think is better than diff:
sort file1 file2 | uniq -u
sort file1 file2 concatenates the two files and sorts the combined result
uniq -u prints the unique lines (that do not repeat). It requires the input to be pre-sorted.
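One caveat: this prints the lines unique to either file (the symmetric difference), not just those from one file. With the a.txt and b.txt above:
$ sort a.txt b.txt | uniq -u
1;tom;56;2
1;tom;58;2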

Assuming you want to retain only the lines unique to file 2 you can do:
comm -13 file1 file2
Note that the comm command expects the two files to be in sorted order.
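If the files are not sorted yet, you can sort them on the fly with process substitution:
comm -13 <(sort file1) <(sort file2)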

Using group format specifiers you can suppress printing of unchanged lines and print only the changed lines:
diff --changed-group-format="%>" --unchanged-group-format="" file1 file2
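With the a.txt and b.txt shown above, this prints the changed line as it appears in the second file:
$ diff --changed-group-format="%>" --unchanged-group-format="" a.txt b.txt
1;tom;58;2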

Related

How to compare the columns of file1 to the columns of file2, select matching values, and output to new file using grep or unix commands

I have two files, file1 and file2, where the target_id compose the first column in both.
I want to compare file1 to file2, and only keep the rows of file1 which match the target_id in file2.
file2:
target_id
ENSMUST00000128641.2
ENSMUST00000185334.7
ENSMUST00000170213.2
ENSMUST00000232944.2
Any help would be appreciated.
% grep -x -f file1 file2 resulted in no output in my terminal
First, sample data that actually shows overlaps between the files:
file1.csv:
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000178862.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000179664.2,0,0
ENSMUST00000177564.2,0,0
file2.csv:
target_id
ENSMUST00000178537.2
ENSMUST00000196221.2
ENSMUST00000177564.2
Your grep command, but swapped:
$ grep -F -f file2.csv file1.csv
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000177564.2,0,0
Edit: we can add the -F flag since this is a fixed-string search; it also protects against the . being treated as a regex metacharacter (where it would match any character). Thanks to @Sundeep for the recommendation.
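Note that grep -F -f matches the IDs anywhere in the line, so an ID that happened to appear in a later column would also match. To anchor the match to the first comma-separated field, a sketch with awk on the same sample files:
awk -F, 'NR==FNR { ids[$1]; next } $1 in ids' file2.csv file1.csv
This reads file2.csv first (NR==FNR), stores its first column as array keys, then prints the rows of file1.csv whose first field is among those keys (the header row matches too, since both files share the target_id header).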

How to compare two text files for the same exact text using BASH Script?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File1.txt
ami-1234567
ami-1234567654
ami-23456
File-2.txt
ami-1234567654
ami-23456
ami-2345678965
I want all the lines of file2.txt that also appear in file1.txt.
This is literally my first comment, so I hope it works,
but you can try using diff:
diff file1.txt file2.txt
Did you try join?
join -o 0 File1.txt File2.txt
ami-1234567654
ami-23456
Remark: for join to work correctly, it needs your files to be sorted, which seems to be the case with your sample.
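If they were not sorted, you could sort them on the fly with process substitution:
join -o 0 <(sort File1.txt) <(sort File2.txt)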
Just another option:
$ comm -1 -2 <(sort file1.txt) <(sort file2.txt)
The options specify that "unique" lines from the first file (-1) and the second file (-2) should be omitted.
This is basically the same as
$ join <(sort file1.txt) <(sort file2.txt)
Note that the sorting in both examples happens on the fly, without creating intermediate temp files.
I don't know if I understood you properly, but you can try sorting the files first (after extracting the data):
sort file1 > file1.sorted
sort file2 > file2.sorted
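Once sorted, you can get the lines common to both, for example with comm:
comm -12 file1.sorted file2.sorted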

Find common lines between two files

File 1:
6
9219045
71608707
105853666
106000373
106000464
106000814
106001204
106001483
106002054
File 2:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
55555,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQ5C9LG75v8+ENmmteRa/bBHVxAH4ABQAAAFBgXjgKk6KvTg4FiPfWF/7Ittzk/MpmlBecYkc9Bc+3mAV7R58rcl1hGkFdk3MagFXjUsunbE0qcV+Gy+DwhUWpBYDpA3p9q9oO8zwDJfFqCHQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
74292,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQPxjL0KWZoaYxWY7clP57tnVxAH4ABQAAAFB6WiMY05SU2WiYqaC7CzwMP2kQ51ec9mkIPh7R4fz2LPwfT8VNpAwH0QLM3I497D2JLfK13S6S90dxpU1ny2VBwaU4imxVchwo7YrcvwvEZXQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
File 1 has only one column and I am sorting the file with the command sort -n file1
File 2 has three columns and I am sorting the file with the command sort -t "," -k 1n,1 file2, which sorts on the basis of the first column.
Now, I want to find the rows in file2 whose first column matches a line in file1.
Commands that I have tried:
grep -w -f file1 file2
join -t "," -1 1 -2 1 -o 2.2 file1 file2
But I am not getting the desired results. Please provide me with an alternate approach. File 1 has 7,124,458 rows and File 2 has 42,987,432 rows.
Use awk:
awk -F, 'FNR == NR { ++a[$0]; next } $1 in a' file1 file2
While reading file1 (where FNR == NR holds), each line is stored as a key in the array a; after that, a row of file2 is printed when its first comma-separated field is one of those keys.
Output:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
join(1) assumes both files are sorted alphabetically on the join fields. Try sorting the inputs without -n.
(To be more precise, it depends on the LC_COLLATE setting. If you are sorting for the benefit of two programs talking to each other, it is probably more reliable to set LC_ALL=C for both join and sort to avoid any glitches due to locale settings.)
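Putting that together for the files in the question (adapting the join command you already tried), a sketch:
LC_ALL=C sort file1 > file1.sorted
LC_ALL=C sort -t, -k1,1 file2 > file2.sorted
LC_ALL=C join -t, -1 1 -2 1 -o 2.2 file1.sorted file2.sorted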

How to select uncommon rows from two large files using linux (in terminal)?

Both have two columns: names and IDs. (The files are in xls or txt format.)
File 1:
AAA K0125
ccc K0234
BMN_a K0567
BMN_c K0567
File 2:
AKP K0897
BMN_a K0567
ccc K0234
I want to print uncommon rows using these two files.
How can it be done using the Linux terminal?
Try something like this:
join -t ' ' -j 1 -v 1 file1 file2
Considering the two files are sorted. This prints the rows of file1 with no match in file2; add -v 2 as well to also get the unmatched rows of file2.
First sort both files and then use the comm utility with the -3 option:
sort file1 > file1_sorted
sort file2 > file2_sorted
comm -3 file1_sorted file2_sorted
A portion from man comm
-3 suppress column 3 (lines that appear in both files)
Output:
AAA K0125
AKP K0897
BMN_c K0567
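Note that comm preserves its column layout, so the lines unique to the second file (here AKP K0897) are prefixed with a tab in the real output. Since this data contains no tabs of its own, you can strip the indentation safely:
comm -3 file1_sorted file2_sorted | tr -d '\t'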

Comparing two files in linux terminal

There are two files called "a.txt" and "b.txt" both have a list of words. Now I want to check which words are extra in "a.txt" and are not in "b.txt".
I need an efficient algorithm as I need to compare two dictionaries.
If you have vim installed, try this:
vimdiff file1 file2
or
vim -d file1 file2
You will find it fantastic.
Sort them and use comm:
comm -23 <(sort a.txt) <(sort b.txt)
comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly, if they are already sorted you don't need this.
If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach vs some of the other answers here:
git diff --no-index a.txt b.txt
# ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s
diff a.txt b.txt
# ~2.6s
sdiff a.txt b.txt
# ~2.7s
vimdiff a.txt b.txt
# ~3.2s
comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.
Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.
Try sdiff (man sdiff)
sdiff -s file1 file2
You can use the diff tool in Linux to compare two files, with the --changed-group-format and --unchanged-group-format options to filter the required data.
The following three specifiers select the relevant group for each option:
'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.
E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt
[root@vmoracle11 tmp]# cat file1.txt
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
test two
test four
test eight
You can also use colordiff, which displays the output of diff with colors.
About vimdiff: it allows you to compare files via SSH, for example:
vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html
Also, do not forget about mcdiff, the internal diff viewer of GNU Midnight Commander.
For example:
mcdiff file1 file2
Enjoy!
Use comm -13 (requires sorted files):
$ cat file1
one
two
three
$ cat file2
one
two
three
four
$ comm -13 <(sort file1) <(sort file2)
four
You can also use:
sdiff file1 file2
to display differences side by side within your terminal!
diff a.txt b.txt | grep '^<'
You can then pipe to cut for a clean output:
diff a.txt b.txt | grep '^<' | cut -c 3-
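Alternatively, sed can filter and strip the "< " prefix in one step:
diff a.txt b.txt | sed -n 's/^< //p'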
Here is my solution for this:
mkdir ~/temp
mkdir ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
cat ~/temp/american-english-dictionary | wc -l > ~/results/count-american-english-dictionary
cat ~/temp/british-english-dictionary | wc -l > ~/results/count-british-english-dictionary
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
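The same three result files can also be produced with comm on sorted copies; a sketch, assuming the same paths as above (the .sorted filenames are placeholders, and LC_ALL=C keeps sort and comm in agreement):
LC_ALL=C sort ~/temp/american-english-dictionary > ~/temp/american.sorted
LC_ALL=C sort ~/temp/british-english-dictionary > ~/temp/british.sorted
LC_ALL=C comm -12 ~/temp/american.sorted ~/temp/british.sorted > ~/results/common-english
LC_ALL=C comm -23 ~/temp/american.sorted ~/temp/british.sorted > ~/results/unique-american-english
LC_ALL=C comm -13 ~/temp/american.sorted ~/temp/british.sorted > ~/results/unique-british-english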
Using awk for it. Test files:
$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one
The awk:
$ awk '
NR==FNR { # process b.txt or the first file
seen[$0] # hash words to hash seen
next # next word in b.txt
} # process a.txt or all files after the first
!($0 in seen)' b.txt a.txt # if word is not hashed to seen, output it
Duplicates are output:
four
four
To avoid duplicates, add each newly encountered word in a.txt to the seen hash:
$ awk '
NR==FNR {
seen[$0]
next
}
!($0 in seen) { # if word is not hashed to seen
seen[$0] # hash unseen a.txt words to seen to avoid duplicates
print # and output it
}' b.txt a.txt
Output:
four
If the word lists are comma-separated, like:
$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three
you have to do a couple of extra laps (for loops):
awk -F, ' # comma-separated input
NR==FNR {
for(i=1;i<=NF;i++) # loop all comma-separated fields
seen[$i]
next
}
{
for(i=1;i<=NF;i++)
if(!($i in seen)) {
seen[$i] # this time we buffer output (below):
buffer=buffer (buffer==""?"":",") $i
}
if(buffer!="") { # output non-empty buffers after each record in a.txt
print buffer
buffer=""
}
}' b.txt a.txt
Output this time:
four
five,six
