Comparing two unsorted lists in Linux, listing the lines unique to the second file

I have 2 files with a list of numbers (telephone numbers).
I'm looking for a method of listing the numbers in the second file that are not present in the first file.
I've tried various methods:
comm (I get some weird sorting errors)
fgrep -v -x -f second-file.txt first-file.txt (I'm unsure of the result; there should be more)

grep -Fxv -f first-file.txt second-file.txt
This looks for all lines in second-file.txt that don't match any line in first-file.txt. It might be slow if the files are large.
Also, once you sort the files, comm should work as well; note that comm wants lexicographic order, so use plain sort rather than sort -n even though the data is numeric. What error does it give? Try this:
comm -23 second-file-sorted.txt first-file-sorted.txt
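For concreteness, here's a quick sketch with made-up numbers (the file contents below are hypothetical):
$ printf '555123\n555456\n555789\n' > first-file.txt
$ printf '555456\n555999\n' > second-file.txt
$ grep -Fxv -f first-file.txt second-file.txt
555999
$ sort first-file.txt > first-file-sorted.txt
$ sort second-file.txt > second-file-sorted.txt
$ comm -23 second-file-sorted.txt first-file-sorted.txt
555999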

You need to use comm:
comm -13 first.txt second.txt
will do the job.
P.S. The order of the first and second file on the command line matters.
You may also need to sort the files first:
comm -13 <(sort first.txt) <(sort second.txt)
Do not add the -n option to sort even if the files are numeric: comm requires lexicographic order (see the next answer).
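To see what -13 suppresses, here is comm's full three-column output on a pair of hypothetical files: column 1 holds lines unique to the first file, column 2 (one tab in) lines unique to the second, and column 3 (two tabs in) lines common to both.
$ printf '111\n222\n' > first.txt
$ printf '222\n333\n' > second.txt
$ comm first.txt second.txt
111
                222
        333
$ comm -13 first.txt second.txt
333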

This should work:
comm -13 <(sort file1) <(sort file2)
Note that sort -n (numeric) does not work with comm, which expects lexicographic (alphanumeric) order:
f1.txt
1
2
21
50
f2.txt
1
3
21
50
21 should appear in the third column (lines common to both files):
#WRONG
$ comm <(sort -n f1.txt) <(sort -n f2.txt)
                1
2
21
        3
        21
                50
#OK
$ comm <(sort f1.txt) <(sort f2.txt)
                1
2
                21
        3
                50

cat f1.txt f2.txt | sort | uniq > file3
Note that this produces the sorted union of the two files, not the lines unique to the second file.
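A variant of this idea that does answer the question is the "list the first file twice" trick (a sketch; it assumes neither file contains internal duplicates):
$ # uniq -u keeps lines occurring exactly once in the sorted stream;
$ # listing f1.txt twice guarantees its lines occur at least twice,
$ # so only lines found solely in f2.txt survive
$ sort f1.txt f1.txt f2.txt | uniq -u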

Related

Comparing two files and applying the differences

On a Linux-based system, I can easily compare two files, e.g.:
diff file1.txt file2.txt
...and see the difference between them.
What if I want to take all lines that are unique to file2.txt and apply them to file1.txt, so that file1.txt will now contain everything it had plus the lines from file2.txt that it didn't have before? Is there an easy way to do it?
Using patch
You can use diff's output to create a patch file.
diff original_file file_with_new_lines > patch_file
You can edit patch_file to keep only the additions, since you only want the new lines.
Then you can use the patch command to apply this patch file:
patch original_file patch_file
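For example (a minimal sketch with two hypothetical files):
$ printf 'a\nb\n' > original_file
$ printf 'a\nb\nc\n' > file_with_new_lines
$ diff original_file file_with_new_lines > patch_file
$ cat patch_file
2a3
> c
$ patch original_file patch_file
patching file original_file
$ cat original_file
a
b
c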
If you don't mind appending the sorted diff to your file, you can use comm:
cat file1.txt <(comm -13 <(sort file1.txt) <(sort file2.txt)) > file1.txt.patched
or
comm -13 <(sort file1.txt) <(sort file2.txt) | cat file1.txt - > file1.txt.patched
This appends the lines unique to file2.txt after the contents of file1.txt. The result goes to file1.txt.patched because redirecting straight back into file1.txt would truncate it before it is read.

Removing lines from one CSV file that appear in another

I have 2 CSV files: file1 contains 1000 email addresses, and file2 contains 150 email addresses that already exist in file1.
I wonder if there is a Linux command to remove those 150 emails from file1?
I tested this:
grep -vf file2.csv file1.csv > file3.csv
and it works. (For exact matching, grep -Fvx -f file2.csv file1.csv is safer: without -F the dots in the addresses are treated as regex metacharacters, and without -x partial-line matches are removed too.)
This should work, with the added benefit of providing sorted output:
comm -23 <(sort file1) <(sort file2)
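A quick sketch with made-up addresses:
$ cat file1.csv
alice@example.com
bob@example.com
carol@example.com
$ cat file2.csv
bob@example.com
$ comm -23 <(sort file1.csv) <(sort file2.csv)
alice@example.com
carol@example.com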

How to find common words in multiple files

I have 4 text files that contain server names as follows (each file has about 400 lines with various server names):
Server1
Server299
Server140
Server15
I would like to compare the files and what I want to find is server names common to all 4 files.
I've got no idea where to start. I've got access to Excel and Linux bash. Any clever ideas?
I've used VLOOKUP in Excel to compare 2 columns, but I don't think that can be used for 4 columns.
One way would be to say:
cat file1 file2 file3 file4 | sort | uniq -c | awk '$1==4 {print $2}'
This counts how many times each name occurs across all four files and keeps the ones seen 4 times, so it assumes a name appears at most once per file.
Another way:
comm -12 <(comm -12 <(comm -12 <(sort file1) <(sort file2)) <(sort file3)) <(sort file4)
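If a server name might repeat within a single file, a duplicate-safe variant of the first pipeline (a sketch) de-duplicates each file before counting:
$ cat <(sort -u file1) <(sort -u file2) <(sort -u file3) <(sort -u file4) | sort | uniq -c | awk '$1==4 {print $2}'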

How to display only different rows using diff (bash)

How can I display only different rows using diff in a separate file?
For example, file number 1 contains the lines:
1;john;125;3
1;tom;56;2
2;jack;10;5
File number 2 contains the following lines:
1;john;125;3
1;tom;58;2
2;jack;10;5
How do I make the following happen?
1;tom;58;2
a.txt:
1;john;125;3
1;tom;56;2
2;jack;10;5
b.txt:
1;john;125;3
1;tom;58;2
2;jack;10;5
Use comm:
comm -13 a.txt b.txt
1;tom;58;2
The command-line options to comm are pretty straightforward:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
Here's a simple solution that I think is better than diff:
sort file1 file2 | uniq -u
sort file1 file2 concatenates the two files and sorts the result
uniq -u prints the unique lines (that do not repeat). It requires the input to be pre-sorted.
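Note that on the question's a.txt/b.txt this prints the differing row from both files, not just the one from file 2:
$ sort a.txt b.txt | uniq -u
1;tom;56;2
1;tom;58;2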
Assuming you want to retain only the lines unique to file 2 you can do:
comm -13 file1 file2
Note that the comm command expects the two files to be in sorted order.
Using diff's group format specifiers, you can suppress the printing of unchanged lines and print only the changed lines from the second file:
diff --changed-group-format="%>" --unchanged-group-format="" file1 file2
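With the a.txt/b.txt example above, this prints only the second file's version of the changed row:
$ diff --changed-group-format="%>" --unchanged-group-format="" a.txt b.txt
1;tom;58;2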

Finding Set Complement in Unix

Given these two files:
$ cat A.txt
3
5
1
2
4
$ cat B.txt
11
1
12
3
2
I want to find the numbers that are in A but NOT in B.
What's the Unix command for it?
I tried this, but it seems to fail:
comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g'
comm -2 -3 <(sort A.txt) <(sort B.txt)
should do what you want, if I understood you correctly.
Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:
$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
You can try this:
$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4
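The same idiom spelled out (a sketch; FNR equals NR only while awk is reading its first file argument):
awk '
    FNR==NR { seen[$0]; next }   # first file (B.txt): remember every line
    !($0 in seen)                # second file (A.txt): print lines not in B
' B.txt A.txt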
Note that the awk solution works, but it retains duplicate lines from A (those that aren't in B).
Also note that comm doesn't compute a true set difference; if a line is repeated in A, and repeated fewer times in B, comm will leave the "extra" line(s) in the result:
$ cat A.txt
120
121
122
122
$ cat B.txt
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122
If this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):
$ comm -23 <(sort -u A.txt) <(sort B.txt)
120
I wrote a program recently called Setdown that does Set operations from the cli.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
It's pretty cool and you should check it out. I personally don't recommend using ad-hoc commands that were not built for the job to perform set operations; that approach won't work well when you need to do many set operations or when you have set operations that depend on each other. Not only that, but Setdown lets you write set operations that depend on other set operations.
Note: I think Setdown is much better than comm simply because Setdown does not require you to correctly sort your inputs. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit, because the number of times I have forgotten to sort the files I passed into comm is beyond count.
Here is another way to do it with join:
join -v1 <(sort A.txt) <(sort B.txt)
From the documentation on join:
‘-v file-number’
Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
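With the question's A.txt and B.txt this gives:
$ join -v1 <(sort A.txt) <(sort B.txt)
4
5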
