How to find which line is missing in another file - linux

On a Linux box,
I have one file as below:
A.txt
1
2
3
4
Second file as below
B.txt
1
2
3
6
I want to know what is inside A.txt but not in B.txt
i.e. it should print value 4
I want to do that on Linux.

awk 'NR==FNR{a[$0]=1;next}!a[$0]' B.txt A.txt
I haven't tested this, so give it a try.
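Since the one-liner above is posted as untested, here is a minimal sketch that runs it against the sample data from the question. The scratch directory and the variable names (workdir, only_in_a) are assumptions made for this illustration only.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 > "$workdir/A.txt"
printf '%s\n' 1 2 3 6 > "$workdir/B.txt"

# NR==FNR holds only while awk reads the first file (B.txt):
# store each of its lines as a key of the array "a", then skip to the next line.
# For the second file (A.txt), !a[$0] is true for lines never stored, so they are printed.
only_in_a=$(awk 'NR==FNR { a[$0] = 1; next } !a[$0]' "$workdir/B.txt" "$workdir/A.txt")
echo "$only_in_a"

rm -r "$workdir"
```

One caveat: if the first file is empty, NR==FNR stays true while awk reads the second file, so nothing is printed at all.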

Use comm if the files are sorted as your sample input shows:
$ comm -23 A.txt B.txt
4
If the files are unsorted, see @Kent's awk solution.

You can also do this using grep by combining the -v (show non-matching lines), -x (match whole lines) and -f (read patterns from file) options:
$ grep -v -x -f B.txt A.txt
4
This does not depend on the order of the files - it will remove any lines from A that match a line in B.

(An addition to @rjmunro's answer)
The proper way to use grep for this is:
$ grep -F -v -x -f B.txt A.txt
4
Without the -F flag, grep interprets each PATTERN read from B.txt as a basic regular expression (BRE), which is undesired here and can cause trouble. The -F flag makes grep treat PATTERN as a set of newline-separated fixed strings. For instance:
$ cat A.txt
&
^
[
]
$ cat B.txt
[
^
]
|
$ grep -v -x -f B.txt A.txt
grep: B.txt:1: Invalid regular expression
$ grep -F -v -x -f B.txt A.txt
&
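A missing -F does not always produce an error message. The sketch below, using made-up file contents (not from the original answer), shows a case where the result is silently wrong because a dot in B.txt acts as a regex wildcard.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'abc' 'a.c' > "$workdir/A.txt"
printf '%s\n' 'a.c' > "$workdir/B.txt"

# Without -F, the dot in "a.c" matches any character, so "abc" is removed as well.
# grep exits with status 1 when it prints nothing, hence "|| true".
as_regex=$(grep -v -x -f "$workdir/B.txt" "$workdir/A.txt" || true)

# With -F, "a.c" is a literal string and only the identical line is removed.
as_fixed=$(grep -F -v -x -f "$workdir/B.txt" "$workdir/A.txt" || true)

echo "regex patterns: '$as_regex'"
echo "fixed strings:  '$as_fixed'"

rm -r "$workdir"
```

Here the regex run prints nothing, while the fixed-string run correctly reports abc as present only in A.txt.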

Using diff:
diff --changed-group-format='%<' --unchanged-group-format='' A.txt B.txt

Related

Why does Linux grep not give the correct count for line breaks?

On Ubuntu 10.04.4 LTS, I did the following small test and got a surprising result:
First, I created a file with 5 lines and named it a.txt:
echo -e "1\n2\n3\n4\n5" > a.txt
$ cat a.txt
1
2
3
4
5
Then I ran wc to count the number of lines:
$ wc -l a.txt
5 a.txt
However, when I ran grep to count the number of lines that have line breaks, I got an answer that I did not understand:
$ grep -c -P '\n' a.txt
3
My question is: how does grep get this number? Shouldn't it be 4?
Please Read The Fine Manual!
seq 1 5 | wc -l
5
seq 1 5 | grep -ac $'\n'
5
I don't understand where the problem is!?
seq 1 5 | hd
00000000 31 0a 32 0a 33 0a 34 0a 35 0a |1.2.3.4.5.|
Explanation:
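The -a switch tells grep to process the input as text even if it looks like binary data.
The $'\n' syntax is expanded by bash itself before grep runs. This makes it possible to pass control characters as arguments to any command under bash.
grep cannot see the newline character inside a line; it matches patterns within each line. Here the pattern it receives is a single newline, which it reads as a list of empty patterns, and an empty pattern matches every line. That is why the count is 5.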
Consider using grep -c -P '$' a.txt to match the ending of each line.
The newline character is not part of lines. grep uses the newline character as the record separator, and removes it from the lines, so that patterns with $ work as expected. For example, to search for lines ending with foo you can use the pattern foo$ instead of foo\n$. That would be very inconvenient.
So grep -c -P '\n' a.txt should give you 0. If you're getting 3, that sounds extremely strange, but perhaps it can be explained by the "highly experimental" remark in man grep:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE, see
below). This is highly experimental and grep -P may warn of
unimplemented features.
I'm on Debian Wheezy, which is much more recent than Ubuntu 10.04. If -P is "highly experimental" today, it's not too difficult to imagine it was buggy on older systems. This is just a guess though.
To count the number of newlines, use wc -l, not a grep -c hack.
Btw, interestingly:
$ printf hello >> a.txt
$ wc -l a.txt
5 a.txt
$ grep -c '' a.txt
6
That is, printf doesn't print a newline, so after we append "hello" to a.txt, there won't be a newline at the end of the file. So wc -l counts newline characters, not exactly "lines", and grep '' (empty string) matches all lines.
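To make the difference concrete, here is a small sketch that counts the same file three ways. The awk variant is an addition for comparison and was not part of the original answer; the file lives in a scratch directory created only for this example.

```shell
workdir=$(mktemp -d)
# Five complete lines followed by "hello" without a trailing newline.
printf '1\n2\n3\n4\n5\nhello' > "$workdir/a.txt"

newlines=$(wc -l < "$workdir/a.txt" | tr -d ' ')        # counts newline characters
grep_lines=$(grep -c '' "$workdir/a.txt")              # counts lines, including the unterminated one
awk_lines=$(awk 'END { print NR }' "$workdir/a.txt")   # counts records, including the unterminated one

echo "wc -l: $newlines, grep -c '': $grep_lines, awk NR: $awk_lines"

rm -r "$workdir"
```

Which number is "right" depends on whether an unterminated final line should count as a line for your purposes.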
I think you want to use
$ grep -c -P "." a.txt
5
$ echo "6" >> a.txt
$ grep -c -P "." a.txt
6
$ cat a.txt
1
2
3
4
5
6

Exact grep -f command in Linux

I have 2 txt files in Linux.
A.txt contents (each line will contain a number):
1
2
3
B.txt contents (each line will contain a number):
1
2
3
10
20
30
grep -f A.txt B.txt results below:
1
2
3
10
20
30
Is there a way to grep in such a way I will get only the exact match, i.e. not 10, 20, 30?
Thanks in advance
For exact match use -x switch
grep -x -f A.txt B.txt
EDIT: If you don't want grep's regex capabilities and need the search patterns treated as fixed strings, use the -F switch:
grep -xF -f A.txt B.txt
As anubhava pointed out, grep -x matches the whole line. There is another switch, -w, for matching whole words. So grep -wf A.txt B.txt will show a line of B.txt if a word from A.txt matches a word in that line.
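The difference between -x and -w only shows up when a line holds more than one word. A short sketch, with B.txt contents invented for the illustration (the "2 apples" line is not in the original question):

```shell
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$workdir/A.txt"
printf '%s\n' 1 10 '2 apples' 30 > "$workdir/B.txt"

# -x: the pattern must match the entire line.
whole_line=$(grep -x -F -f "$workdir/A.txt" "$workdir/B.txt" || true)

# -w: the pattern must match a whole word somewhere in the line.
whole_word=$(grep -w -F -f "$workdir/A.txt" "$workdir/B.txt" || true)

echo "with -x:"; echo "$whole_line"
echo "with -w:"; echo "$whole_word"

rm -r "$workdir"
```

With -x only the line 1 is printed; with -w the line 2 apples is printed too, while 10 and 30 are rejected by both.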
Try adding the -w flag:
grep -wf A.txt B.txt
This gives you the exact result, shown below:
1
2
3
Thanks
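You can also use -L to list the files that contain no match. Note that in the command below a.txt is taken as the pattern, not as a file of patterns, so b.txt is listed simply because none of its lines matches the text a.txt: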
# cat a.txt
1
2
3
# cat b.txt
1
2
3
10
20
30
# grep -L a.txt b.txt
b.txt

Comparing two files in linux terminal

There are two files called "a.txt" and "b.txt" both have a list of words. Now I want to check which words are extra in "a.txt" and are not in "b.txt".
I need an efficient algorithm as I need to compare two dictionaries.
If you have vim installed, try this:
vimdiff file1 file2
or
vim -d file1 file2
You will find it fantastic.
Sort them and use comm:
comm -23 <(sort a.txt) <(sort b.txt)
comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted you don't need this.
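The column suppression described above can be sketched as follows. The word lists are made up for the example, and the sorted copies are written to scratch files as a stand-in for the <(sort ...) syntax, so the snippet also runs in shells without process substitution.

```shell
workdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$workdir/a.txt"
printf '%s\n' fig kiwi apple > "$workdir/b.txt"

# comm expects sorted input, so sort both lists first.
sort "$workdir/a.txt" > "$workdir/a.sorted"
sort "$workdir/b.txt" > "$workdir/b.sorted"

only_a=$(comm -23 "$workdir/a.sorted" "$workdir/b.sorted")   # suppress columns 2 and 3
only_b=$(comm -13 "$workdir/a.sorted" "$workdir/b.sorted")   # suppress columns 1 and 3
both=$(comm -12 "$workdir/a.sorted" "$workdir/b.sorted")     # suppress columns 1 and 2

echo "only in a: $only_a"
echo "only in b: $only_b"
echo "in both:"; echo "$both"

rm -r "$workdir"
```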
If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach vs some of the other answers here:
git diff --no-index a.txt b.txt
# ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s
diff a.txt b.txt
# ~2.6s
sdiff a.txt b.txt
# ~2.7s
vimdiff a.txt b.txt
# ~3.2s
comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.
Update 2018-03-25 You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.
Try sdiff (man sdiff)
sdiff -s file1 file2
You can use the diff tool in Linux to compare two files. The --changed-group-format and --unchanged-group-format options let you filter the output down to the data you need.
Each option accepts one of the following three values to select what is printed for that group:
'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) print nothing for that group.
E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt
[root@vmoracle11 tmp]# cat file1.txt
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
test two
test four
test eight
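For the opposite direction, the '%>' value can be sketched the same way. This assumes GNU diff, since the group-format options are GNU extensions; the file contents are the ones from the example above, recreated in a scratch directory.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$workdir/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$workdir/file2.txt"

# '%>' prints the FILE2 side of every changed group; unchanged groups print nothing.
# diff exits with status 1 when the files differ, hence "|| true".
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$workdir/file1.txt" "$workdir/file2.txt" || true)
echo "$only_in_file2"

rm -r "$workdir"
```

Note that diff compares the files positionally, so unlike comm or grep -f the result depends on line order.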
You can also use: colordiff: Displays the output of diff with colors.
About vimdiff: It allows you to compare files via SSH, for example:
vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html
Also, do not forget about mcdiff - Internal diff viewer of GNU Midnight Commander.
For example:
mcdiff file1 file2
Enjoy!
Use comm -13 (requires sorted files):
$ cat file1
one
two
three
$ cat file2
one
two
three
four
$ comm -13 <(sort file1) <(sort file2)
four
You can also use:
sdiff file1 file2
To display differences side by side within your terminal!
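diff a.txt b.txt | grep '^<'
You can then pipe to cut for clean output (note the trailing dash: -c 3- keeps everything from the third character onward, whereas -c 3 would keep only the third character):
diff a.txt b.txt | grep '^<' | cut -c 3-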
Here is my solution for this :
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
cat ~/temp/american-english-dictionary | wc -l > ~/results/count-american-english-dictionary
cat ~/temp/british-english-dictionary | wc -l > ~/results/count-british-english-dictionary
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
Using awk for it. Test files:
$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one
The awk:
$ awk '
NR==FNR {            # process b.txt or the first file
    seen[$0]         # hash words to hash seen
    next             # next word in b.txt
}                    # process a.txt or all files after the first
!($0 in seen)' b.txt a.txt # if word is not hashed to seen, output it
Duplicates are output:
four
four
To avoid duplicates, add each newly met word in a.txt to seen hash:
$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {      # if word is not hashed to seen
    seen[$0]         # hash unseen a.txt words to seen to avoid duplicates
    print            # and output it
}' b.txt a.txt
Output:
four
If the word lists are comma-separated, like:
$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three
you have to do a couple of extra laps (for loops):
awk -F, '                        # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)           # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
            seen[$i]             # this time we buffer output (below):
            buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {             # output non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt
Output this time:
four
five,six

how can I move all lines beginning in 'foobar' to the end of a file?

Say I have a script with a number of lines beginning foobar
I would like to move all of the lines to the end of the document while keeping their order
e.g. go from:
# There's a Polar Bear
# In our Frigidaire--
foobar['brangelina'] <- 2
# He likes it 'cause it's cold in there.
# With his seat in the meat
foobar['billybob'] <- 1
# And his face in the fish
to
# There's a Polar Bear
# In our Frigidaire--
# He likes it 'cause it's cold in there.
# With his seat in the meat
# And his face in the fish
foobar['brangelina'] <- 2
foobar['billybob'] <- 1
This is as far as I have gotten:
grep foobar file.txt > newfile.txt
sed -i 's/foobar//g' foo.txt
cat newfile.txt > foo.txt
This might work:
sed '/^foobar/{H;$!d;s/.*//};$G;s/\n*//' input_file
EDIT: Amended for the corner case when foobar is on the last line
This will do:
grep -v ^foobar file.txt > tmp1.txt
grep ^foobar file.txt > tmp2.txt
cat tmp1.txt tmp2.txt > newfile.txt
rm tmp1.txt tmp2.txt
The -v option returns all the lines which do not match the given pattern. The ^ marks the beginning of a line, so ^foobar matches lines beginning with foobar.
grep -v ^foobar file.txt > file1.txt
grep ^foobar file.txt > file2.txt
cat file2.txt >> file1.txt
grep -v ^foobar file.txt >newfile.txt
grep ^foobar file.txt >>newfile.txt
No need for a temporary file.
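The grep answers above read the file twice. As an alternative, here is a single-pass sketch in awk; this is a swapped-in technique, not one of the posted answers, and the variable name held and the shortened sample file are assumptions for the illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/file.txt" <<'EOF'
# There's a Polar Bear
foobar['brangelina'] <- 2
# He likes it 'cause it's cold in there.
foobar['billybob'] <- 1
# And his face in the fish
EOF

# Hold back lines starting with foobar (in order), print everything else
# immediately, then emit the held lines once the input is exhausted.
awk '/^foobar/ { held = held $0 ORS; next }
     { print }
     END { printf "%s", held }' "$workdir/file.txt" > "$workdir/newfile.txt"

result=$(cat "$workdir/newfile.txt")
echo "$result"

rm -r "$workdir"
```

The held lines are kept in memory, so this suits files where the matching lines are a manageable share of the total.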
You can also do:
vim file.txt -c 'g/^foobar/m$' -c 'wq'
The -c switch means an Ex command follows, the g command operates on all lines containing the given pattern, and the action here is m$, which means “move to end of file” (it preserves order). wq means “save and exit vim”.
If this is too slow you can also prevent vim from reading vimrc:
vim -u NONE file.txt -c 'g/^foobar/m$' -c 'wq'

Omitting the first line from any Linux command output

I have a requirement where I'd like to omit the first line from the output of ls -latr "some path", since I need to remove the total 136 line from the below output.
So I wrote ls -latr /home/kjatin1/DT_901_linux//autoInclude/system | tail -q, which excluded the first line, but when the folder is empty it does not omit it. Please tell me how to omit the first line of any Linux command's output.
The tail program can do this:
ls -lart | tail -n +2
The -n +2 means “start passing through on the second line of output”.
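A quick sketch of the behaviour, including the empty-folder case from the question. The two functions are hypothetical stand-ins for ls -l output, so the result does not depend on a real directory.

```shell
# Stand-ins for "ls -l" output of a populated and of an empty directory.
listing() {
    printf '%s\n' 'total 136' \
        '-rw-r--r-- 1 user group 10 Jan  1 00:00 a.txt' \
        '-rw-r--r-- 1 user group 20 Jan  1 00:00 b.txt'
}
empty_listing() {
    printf '%s\n' 'total 0'
}

# tail -n +2 starts output at line 2, whatever line 1 contains.
without_header=$(listing | tail -n +2)
empty_case=$(empty_listing | tail -n +2)

echo "$without_header"
echo "empty directory gives: '$empty_case'"
```

Because the first line is dropped by position rather than by content, the same pipe works for any command, not only ls.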
Pipe it to awk:
awk '{if(NR>1)print}'
or sed
sed -n '1!p'
ls -lart | tail -n +2 #argument means starting with line 2
This is a quick hacky way: ls -lart | grep -v ^total.
Basically, remove any lines that start with "total", which in ls output should only be the first line.
A more general way (for anything):
ls -lart | sed "1 d"
sed "1 d" deletes the first line and prints everything else.
You can use awk command:
For command output use pipe: | awk 'NR>1'
For output of file: awk 'NR>1' file.csv
