Finding Set Complement in Unix

Given these two files:
$ cat A.txt
3
5
1
2
4
$ cat B.txt
11
1
12
3
2
I want to find the lines that are in A but not in B.
What's the Unix command for it?
I tried this, but it seems to fail:
comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g'

comm -2 -3 <(sort A.txt) <(sort B.txt)
should do what you want, if I understood you correctly.
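With the files from the question, this should print the following (a quick check, assuming bash for the <( ) process substitution):
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
5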
Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:
$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4

You can try this awk one-liner, which reads B.txt into an array and then prints the lines of A.txt that are not in it:
$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4

Note that the awk solution works, but retains duplicates in A (which aren't in B); a de-duped variant is sketched below.
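A minimal hedged sketch, piping through sort -u to de-dupe (note this also changes the output order):
$ awk 'FNR==NR{a[$0];next} !($0 in a)' B.txt A.txt | sort -u
4
5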
Also note that comm doesn't compute a true set difference; if a line is repeated in A, and repeated fewer times in B, comm will leave the "extra" line(s) in the result:
$ cat A.txt
120
121
122
122
$ cat B.txt
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122
If this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):
$ comm -23 <(sort -u A.txt) <(sort B.txt)
120

I recently wrote a program called Setdown that does set operations from the CLI.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
It's pretty cool and you should check it out. I personally don't recommend using ad-hoc commands that were not built for the job to perform set operations; that won't work well when you really need to do many set operations, or when some set operations depend on each other. Not only that, but Setdown lets you write set operations that depend on other set operations.
Note: I think that Setdown is much better than comm simply because Setdown does not require you to correctly sort your inputs. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit, because the number of times I have forgotten to sort the files I passed into comm is beyond count.

Here is another way to do it with join:
join -v1 <(sort A.txt) <(sort B.txt)
From the documentation on join:
‘-v file-number’
Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
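For example, with the A.txt and B.txt from the question (assuming GNU join and bash process substitution), this prints the lines of A.txt that have no match in B.txt:
$ join -v1 <(sort A.txt) <(sort B.txt)
4
5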

Related

How to compare two text files for the same exact text using BASH Script?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File1.txt
ami-1234567
ami-1234567654
ami-23456
File-2.txt
ami-1234567654
ami-23456
ami-2345678965
I want all the data in file2.txt that also appears in file1.txt.
This is literally my first comment, so I hope it works,
but you can try using diff:
diff file1.txt file2.txt
Did you try join?
join -o 0 File1.txt File2.txt
ami-1234567654
ami-23456
Remark: for join to work correctly, it needs your files to be sorted, which seems to be the case with your sample.
Just another option:
$ comm -1 -2 <(sort file1.txt) <(sort file2.txt)
The options specify that lines "unique" to the first file (-1) and to the second file (-2) should be omitted.
This is basically the same as
$ join <(sort file1.txt) <(sort file2.txt)
Note that in both examples the sorting happens on the fly, without creating an intermediate temp file, in case you don't want to bother with one.
I don't know if I understood you properly, but you can try sorting the files (after extracting the data):
sort file1 > file1.sorted
sort file2 > file2.sorted

Merging files in reverse

I am working with logs; there are multiple files.
Let's assume the following files have this content:
file1
1
file2
2
file3
3
By using the command cat file*, the result would be:
1
2
3
But I am looking for something different: when I use a command with the glob file*, I want the output to be something like this:
3
2
1
Could someone help me, please?
Pass the output of cat to tac:
$ cat file*
1
2
3
$ cat file* | tac
3
2
1
You may call
ls -1r file* | xargs cat
in order to specify the order of the files. Its output is different from the tac solution, since each individual logfile stays in its original order. (This may or may not be the output you actually want.)
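If the filenames might contain whitespace, a hedged bash alternative avoids parsing the output of ls by expanding the glob into an array and walking it backwards:
$ files=(file*)
$ for ((i=${#files[@]}-1; i>=0; i--)); do cat "${files[$i]}"; done
3
2
1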

How to find common words in multiple files

I have 4 text files that contain server names as follows (each file has about 400 lines with various server names):
Server1
Server299
Server140
Server15
I would like to compare the files; what I want to find is the server names common to all 4 files.
I've got no idea where to start. I've got access to Excel and Linux bash. Any clever ideas?
I've used VLOOKUP in Excel to compare 2 columns, but I don't think that can be used for 4 columns.
One way would be to say:
cat file1 file2 file3 file4 | sort | uniq -c | awk '$1==4 {print $2}'
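Note that this assumes each server name appears at most once per file. If a file might contain duplicates, a hedged variant de-dupes each file before counting:
$ for f in file1 file2 file3 file4; do sort -u "$f"; done | sort | uniq -c | awk '$1==4 {print $2}'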
Another way:
comm -12 <(comm -12 <(comm -12 <(sort file1) <(sort file2)) <(sort file3)) <(sort file4)
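With many files the nesting gets unwieldy; here is a hedged bash sketch that intersects incrementally instead (the scratch files common.txt and tmp.txt are my own names):
$ sort file1 > common.txt
$ for f in file2 file3 file4; do comm -12 common.txt <(sort "$f") > tmp.txt && mv tmp.txt common.txt; done
$ cat common.txt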

Comparing two files in linux terminal

There are two files called "a.txt" and "b.txt", and both have a list of words. Now I want to check which words are in "a.txt" but not in "b.txt".
I need an efficient algorithm, as I need to compare two dictionaries.
If you have vim installed, try this:
vimdiff file1 file2
or
vim -d file1 file2
You will find it fantastic.
Sort them and use comm:
comm -23 <(sort a.txt) <(sort b.txt)
comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted, you don't need this.
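A quick illustration of the default three-column output, using two hypothetical word files (columns 2 and 3 are indented with tabs):
$ printf 'apple\nbanana\n' > x.txt
$ printf 'banana\ncherry\n' > y.txt
$ comm x.txt y.txt
apple
                banana
        cherry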
If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file name strings in each, I benchmarked this approach (with the built-in time command) against some of the other answers here:
git diff --no-index a.txt b.txt
# ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s
diff a.txt b.txt
# ~2.6s
sdiff a.txt b.txt
# ~2.7s
vimdiff a.txt b.txt
# ~3.2s
comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.
Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.
Try sdiff (man sdiff)
sdiff -s file1 file2
You can use the diff tool on Linux to compare two files, using the --changed-group-format and --unchanged-group-format options to filter the required data.
The following three format specifiers can be used to select the relevant group for each option:
'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.
E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt
[root@vmoracle11 tmp]# cat file1.txt
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
test two
test four
test eight
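By symmetry, the '%>' format gives the lines that exist only in file2.txt; with the same sample files this should print only "test nine":
diff --changed-group-format='%>' --unchanged-group-format='' file1.txt file2.txt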
You can also use colordiff, which displays the output of diff with colors.
About vimdiff: it also allows you to compare files over SSH, for example:
vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html
Also, do not forget about mcdiff, the internal diff viewer of GNU Midnight Commander.
For example:
mcdiff file1 file2
Enjoy!
Use comm -13 (requires sorted files):
$ cat file1
one
two
three
$ cat file2
one
two
three
four
$ comm -13 <(sort file1) <(sort file2)
four
You can also use:
sdiff file1 file2
to display differences side by side in your terminal!
diff a.txt b.txt | grep '^<'
You can then pipe to cut for a clean output:
diff a.txt b.txt | grep '^<' | cut -c 3-
Here is my solution for this:
mkdir ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# line counts of each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# words common to both (-F fixed strings, -x whole-line match, -f patterns from file)
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# words unique to each dictionary (-v inverts the match)
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
Using awk for it. Test files:
$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one
The awk:
$ awk '
NR==FNR {                  # process b.txt, the first file
    seen[$0]               # hash each word into seen
    next                   # move on to the next word in b.txt
}                          # process a.txt, i.e. all files after the first
!($0 in seen)' b.txt a.txt # if a word is not hashed in seen, output it
Duplicates are output:
four
four
To avoid duplicates, add each newly seen word in a.txt to the seen hash:
$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {  # if the word is not hashed in seen
    seen[$0]     # hash unseen a.txt words to seen to avoid duplicates
    print        # and output it
}' b.txt a.txt
Output:
four
If the word lists are comma-separated, like:
$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three
you have to do a couple of extra laps (for loops):
awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop over all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
            seen[$i]         # this time we buffer the output (below):
            buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt
Output this time:
four
five,six

Comparing two unsorted lists in linux, listing the unique in the second file

I have 2 files with a list of numbers (telephone numbers).
I'm looking for a method of listing the numbers in the second file that are not present in the first file.
I've tried the various methods with:
comm (getting some weird sorting errors)
fgrep -v -x -f second-file.txt first-file.txt (unsure of the result, there should be more)
grep -Fxv -f first-file.txt second-file.txt
This basically looks for all lines in second-file.txt that don't match any line in first-file.txt. It might be slow if the files are large.
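A quick worked example with hypothetical number lists (-F fixed strings, -x whole-line match, -v invert, -f read patterns from a file):
$ cat first-file.txt
5550001
5550002
$ cat second-file.txt
5550002
5550003
$ grep -Fxv -f first-file.txt second-file.txt
5550003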
Also, once you sort the files (lexicographically; comm requires lexicographic order, so avoid sort -n), comm should also have worked. What error does it give? Try this:
comm -23 second-file-sorted.txt first-file-sorted.txt
You need to use comm:
comm -13 first.txt second.txt
will do the job.
P.S. The order of the first and second file on the command line matters.
You may also need to sort the files first:
comm -13 <(sort first.txt) <(sort second.txt)
Note that comm requires lexicographic sort order, so do not add -n to sort here; see the demonstration below.
This should work
comm -13 <(sort file1) <(sort file2)
It seems sort -n (numeric) cannot be used with comm, which expects its input in plain lexicographic (alphanumeric) sort order.
f1.txt
1
2
21
50
f2.txt
1
3
21
50
21 should appear in the third column:
#WRONG
$ comm <(sort -n f1.txt) <(sort -n f2.txt)
                1
2
21
        3
        21
                50
#OK
$ comm <(sort f1.txt) <(sort f2.txt)
                1
2
                21
        3
                50
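If you also want the resulting difference in numeric order, a hedged workaround is to sort lexicographically for comm and re-sort the output numerically:
$ comm -13 <(sort f1.txt) <(sort f2.txt) | sort -n
3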
cat f1.txt f2.txt | sort | uniq > file3
(Note that this writes the sorted union of both files to file3, not the lines unique to the second file.)
