How to select uncommon rows from two large files using Linux (in terminal)?

Both files have two columns: names and IDs. (The files are in xls or txt format.)
File 1:
AAA K0125
ccc K0234
BMN_a K0567
BMN_c K0567
File 2:
AKP K0897
BMN_a K0567
ccc K0234
I want to print the uncommon rows from these two files.
How can this be done in the Linux terminal?

Try something like this:
join -j 1 -v 1 -v 2 file1 file2
This assumes the two files are already sorted on the join field; -v 1 prints the lines from file1 that have no match in file2, and -v 2 does the reverse.
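If the files aren't sorted yet, you can sort them on the fly with process substitution (a bash-specific sketch of the same command):
join -j 1 -v 1 -v 2 <(sort file1) <(sort file2)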

First sort both files, then use the comm utility with the -3 option:
sort file1 > file1_sorted
sort file2 > file2_sorted
comm -3 file1_sorted file2_sorted
A portion from man comm
-3 suppress column 3 (lines that appear in both files)
Output (lines unique to file2 are indented with a tab):
AAA K0125
AKP K0897
BMN_c K0567
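To get a flat list without the tab indentation, you can strip it, e.g.:
comm -3 file1_sorted file2_sorted | tr -d '\t'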

Related

How to compare two text files for the same exact text using BASH Script?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File1.txt
ami-1234567
ami-1234567654
ami-23456
File-2.txt
ami-1234567654
ami-23456
ami-2345678965
I want all the lines of File-2.txt that also appear in File1.txt.
This is literally my first comment, so I hope it works,
but you can try using diff:
diff file1.txt file2.txt
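diff shows the differences, though, not the matches. If you only want the lines common to both files, grep can do that directly by treating File1.txt as a list of fixed strings (-F) matched against whole lines (-x):
grep -Fxf File1.txt File-2.txt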
Did you try join?
join -o 0 File1.txt File2.txt
ami-1234567654
ami-23456
Remark: for join to work correctly, it needs your files to be sorted, which seems to be the case with your sample.
Just another option:
$ comm -1 -2 <(sort file1.txt) <(sort file2.txt)
The options specify that "unique" lines from the first file (-1) and the second file (-2) should be omitted.
This is basically the same as
$ join <(sort file1.txt) <(sort file2.txt)
Note that the sorting in both examples happens via process substitution, without creating intermediate temp files.
I don't know if I understand you properly, but you can try sorting the files (after extracting them):
sort file1 > file1.sorted
sort file2 > file2.sorted
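and then compare the sorted results, for example keeping only the common lines with comm -12 (a sketch building on the commands above):
comm -12 file1.sorted file2.sorted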

Find common lines between two files

File 1:
6
9219045
71608707
105853666
106000373
106000464
106000814
106001204
106001483
106002054
File 2:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
55555,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQ5C9LG75v8+ENmmteRa/bBHVxAH4ABQAAAFBgXjgKk6KvTg4FiPfWF/7Ittzk/MpmlBecYkc9Bc+3mAV7R58rcl1hGkFdk3MagFXjUsunbE0qcV+Gy+DwhUWpBYDpA3p9q9oO8zwDJfFqCHQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
74292,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQPxjL0KWZoaYxWY7clP57tnVxAH4ABQAAAFB6WiMY05SU2WiYqaC7CzwMP2kQ51ec9mkIPh7R4fz2LPwfT8VNpAwH0QLM3I497D2JLfK13S6S90dxpU1ny2VBwaU4imxVchwo7YrcvwvEZXQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
File 1 has only one column and I am sorting it with the command sort -n file1.
File 2 has three columns and I am sorting it with the command sort -t "," -k 1n,1 file2, which sorts on the basis of the first column.
Now, I want to find the rows in file2 whose first column matches a line in file1.
Commands that I have tried:
grep -w -f file1 file2
join -t "," -1 1 -2 1 -o 2.2 file1 file2
But I am not getting the desired results. Please provide me with an alternate approach. File 1 has 7,124,458 rows and File 2 has 42,987,432 rows.
Use awk (while reading file1, the first block records each line in a lookup table; for file2, a line is printed when its first comma-separated field is in that table):
awk -F, 'FNR == NR { ++a[$0]; next } $1 in a' file1 file2
Output:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
join(1) assumes both files are sorted alphabetically on the join fields. Try sorting the inputs without -n.
(To be more precise, it depends on the LC_COLLATE setting. If you are sorting for the benefit of two programs talking to each other, it is probably more reliable to set LC_ALL=C for both join and sort to avoid any glitches due to locale settings.)
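For example, a sketch of the full pipeline with a pinned locale (using the filenames and the -o 2.2 output spec from the question):
export LC_ALL=C
sort file1 > file1.sorted
sort -t ',' -k 1,1 file2 > file2.sorted
join -t ',' -1 1 -2 1 -o 2.2 file1.sorted file2.sorted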

How to find common words in multiple files

I have 4 text files that contain server names as follows (each file has about 400 lines with various server names):
Server1
Server299
Server140
Server15
I would like to compare the files; what I want to find is the server names common to all 4 files.
I've got no idea where to start - I've got access to Excel and Linux bash. Any clever ideas?
I've used VLOOKUP in Excel to compare 2 columns, but I don't think this can be used for 4 columns?
One way would be to say:
cat file1 file2 file3 file4 | sort | uniq -c | awk '$1==4 {print $2}'
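One caveat: if a server name can appear twice within a single file, it would inflate the count, so it is safer to de-duplicate each file first, e.g.:
cat <(sort -u file1) <(sort -u file2) <(sort -u file3) <(sort -u file4) | sort | uniq -c | awk '$1==4 {print $2}'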
Another way:
comm -12 <(comm -12 <(comm -12 <(sort file1) <(sort file2)) <(sort file3)) <(sort file4)

Merge two files on Linux keeping only lines that appear in both files

In Linux, how can I merge two files and only keep lines that have a match in both files?
Each line is separated by a newline (\n).
So far, I've found that I can sort them and then use comm -12. Is this the best approach (assuming it's correct)?
fileA contains
aaa
bbb
ccc
ddd
fileB contains
aaa
ddd
eee
and I'd like a new file to contain
aaa
ddd
Provided both input files are lexicographically sorted, you can indeed use comm:
$ comm -12 fileA fileB > fileC
If that's not the case, you should sort your input files first:
$ comm -12 <(sort fileA) <(sort fileB) > fileC

How to display only different rows using diff (bash)

How can I display only the differing rows using diff, writing them to a separate file?
For example, file number 1 contains the lines:
1;john;125;3
1;tom;56;2
2;jack;10;5
File number 2 contains the following lines:
1;john;125;3
1;tom;58;2
2;jack;10;5
How can I make the following happen?
1;tom;58;2
a.txt:
1;john;125;3
1;tom;56;2
2;jack;10;5
b.txt:
1;john;125;3
1;tom;58;2
2;jack;10;5
Use comm:
comm -13 a.txt b.txt
1;tom;58;2
The command-line options to comm are pretty straightforward:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
Here's a simple solution that I think is better than diff:
sort file1 file2 | uniq -u
sort file1 file2 concatenates the two files and sorts the result.
uniq -u prints the unique lines (those that do not repeat); it requires the input to be pre-sorted. Note that this prints the differing lines from both files (here both 1;tom;56;2 and 1;tom;58;2), not just the ones from file2.
Assuming you want to retain only the lines unique to file2, you can do:
comm -13 file1 file2
Note that the comm command expects the two files to be in sorted order.
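If they aren't sorted yet, process substitution handles that inline:
comm -13 <(sort file1) <(sort file2)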
Using group format specifiers, you can suppress the printing of unchanged lines and print only the lines from changed groups:
diff --changed-group-format="%>" --unchanged-group-format="" file1 file2
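With the a.txt/b.txt sample above, this prints the changed line as it appears in the second file:
1;tom;58;2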
