Comparing two text files with each other - linux

If I had two text files, for example:
file1.txt
apple
orange
pear
banana
file2.txt
banana
pear
How would I remove every line that appears in file2.txt from file1.txt,
so that file1.txt would be left with:
apple
orange

grep -v -F -f file2.txt file1.txt
-v lists only the lines of file1.txt that do not match any pattern, and -f takes the patterns from a file, in this case file2.txt. -F interprets each pattern as a fixed string rather than a regular expression; the patterns are separated by newlines, and any of them may match.
The grep command is built in on OS X and Linux. On Windows you'll have to install it, for example via Cygwin.
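To check the command against the sample data from the question, something like the following can be used. The scratch directory and the variable names are only for illustration. The -x flag (match whole lines only) is an optional addition that is not part of the answer above; it may help when a phrase in file2.txt could also occur inside a longer line of file1.txt.

```shell
# Recreate the sample files in a scratch directory (illustrative only).
tmp=$(mktemp -d)
printf '%s\n' apple orange pear banana > "$tmp/file1.txt"
printf '%s\n' banana pear > "$tmp/file2.txt"

# -v: invert the match, -F: fixed strings, -f: read patterns from file2.txt
result=$(grep -v -F -f "$tmp/file2.txt" "$tmp/file1.txt")
printf '%s\n' "$result"

# Optional: -x matches whole lines only, so "pear" would not remove "pearl".
result_exact=$(grep -v -x -F -f "$tmp/file2.txt" "$tmp/file1.txt")

rm -r "$tmp"
```

To update file1.txt itself, redirect the output to a new file and move it over the original; redirecting straight into file1.txt would truncate it before grep reads it.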

combine file1 not file2
On Debian and derivatives, combine can be found in the moreutils package.

If the files are huge and sorted (comm requires sorted input), comm may be preferable to the more general grep solution proposed by Ivan, since it reads both files line by line and thus does not need to load the entirety of file2.txt into memory (or search it for each line).
comm -23 file1-sorted.txt file2-sorted.txt
-2 suppresses the lines found only in file2-sorted.txt and -3 suppresses the lines common to both, leaving the lines unique to file1-sorted.txt. (comm -3 on its own would also print the lines unique to file2-sorted.txt, indented with a tab.)
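A sketch of the whole pipeline, including the sorting step that comm depends on. The file names follow the answer; the scratch directory and the variable name are illustrative. It uses comm -23, which prints only the lines unique to the first file.

```shell
tmp=$(mktemp -d)
printf '%s\n' apple orange pear banana > "$tmp/file1.txt"
printf '%s\n' banana pear > "$tmp/file2.txt"

# comm expects both inputs sorted in the current locale's collation order.
sort "$tmp/file1.txt" > "$tmp/file1-sorted.txt"
sort "$tmp/file2.txt" > "$tmp/file2-sorted.txt"

# -2 drops lines found only in file2, -3 drops lines common to both.
only_in_file1=$(comm -23 "$tmp/file1-sorted.txt" "$tmp/file2-sorted.txt")
printf '%s\n' "$only_in_file1"

rm -r "$tmp"
```

Note that the result comes out in sorted order, not in the original order of file1.txt.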

Related

How to make a strict match with awk

I am querying one file with another. The files look like this:
File1:
Angela S Darvill| text text text text
Helen Stanley| text text text text
Carol Haigh S|text text text text .....
File2:
Carol Haigh
Helen Stanley
Angela Darvill
This command:
awk 'NR==FNR{_[$1];next} ($1 in _)' File2.txt File1.txt
returns lines that overlap, but it does not require a strict match. With a strict match, only Helen Stanley should have been returned.
How do I restrict awk to strict matches?
With your shown samples, please try the following. You were on the right track; you need to do two things: use the whole line as the index in array a while reading File2.txt, and set the field separator to | before awk starts reading File1.txt.
awk -F'|' 'NR==FNR{a[$0];next} $1 in a' File2.txt File1.txt
The command above doesn't work for me (I am on a Mac; I don't know whether that matters), but
awk 'NR==FNR{_[$0];next} ($1 in _)' File2.txt FS="|" File1.txt
worked well.
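One possible reason an exact comparison fails is trailing whitespace in File2.txt, since "Helen Stanley " and "Helen Stanley" are different array keys. The sketch below trims trailing blanks before storing each name. The trailing spaces in the test data, the scratch directory, and the array name names are assumptions made for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Angela S Darvill| text text text text' \
              'Helen Stanley| text text text text' \
              'Carol Haigh S|text text text text' > "$tmp/File1.txt"
# Trailing spaces added on purpose (an assumption about the real data).
printf '%s\n' 'Carol Haigh ' 'Helen Stanley  ' 'Angela Darvill' > "$tmp/File2.txt"

matches=$(awk -F'|' '
  NR == FNR { sub(/[ \t]+$/, ""); names[$0]; next }  # File2: trimmed line as key
  $1 in names                                        # File1: exact match on field 1
' "$tmp/File2.txt" "$tmp/File1.txt")
printf '%s\n' "$matches"

rm -r "$tmp"
```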
You can also use grep, with the lines of File2.txt turned into a list of regexes, to make an exact match.
You can use sed to prepare the patterns. Here is an example:
sed -E 's/[ \t]*$//; s/^(.*)$/^\1|/' File2.txt
^Carol Haigh|
^Helen Stanley|
^Angela Darvill|
...
Then use process substitution to pass that sed output as the -f argument to grep:
grep -f <(sed -E 's/[ \t]*$//; s/^(.*)$/^\1|/' File2.txt) File1.txt
Helen Stanley| text text text text
Since your example File2.txt has trailing spaces, the sed has s/[ \t]*$//; as the first substitution. If your actual file does not have those trailing spaces, you can do:
grep -f <(sed -E 's/.*/^&|/' File2.txt) File1.txt
Ed Morton brings up a good point that grep will still interpret RE meta-characters in File2.txt. You can use the flag -F so only literal strings are used:
grep -F -f <(sed -E 's/.*/&|/' File2.txt) File1.txt
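For shells without process substitution, the same idea can be sketched with a temporary pattern file. The file name patterns.txt and the scratch directory are illustrative, and the trailing spaces in the test data are an assumption. Because -F drops the ^ anchor, a pattern can match anywhere on a line that is followed by |, which is fine as long as | only appears as the delimiter.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Angela S Darvill| text text text text' \
              'Helen Stanley| text text text text' \
              'Carol Haigh S|text text text text' > "$tmp/File1.txt"
printf '%s\n' 'Carol Haigh ' 'Helen Stanley  ' 'Angela Darvill' > "$tmp/File2.txt"

# Strip trailing whitespace, then append the delimiter to each name so that
# "Carol Haigh" cannot match "Carol Haigh S|".
sed 's/[[:space:]]*$//; s/$/|/' "$tmp/File2.txt" > "$tmp/patterns.txt"

# -F treats every pattern as a literal string, so "|" is not special.
matches=$(grep -F -f "$tmp/patterns.txt" "$tmp/File1.txt")
printf '%s\n' "$matches"

rm -r "$tmp"
```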

cat command merging two *.txt with missing columns mac osx

I would like to use the cat command to join several *.txt files under Mac OS X.
My first file, file1.txt, looks like:
a;b;c;d
1;2;3;4
second file2.txt:
a;b
5;6
7;8
what I want:
a;b;c;d
1;2;3;4
5;6;;
7;8;;
My question: can I skip the header of the second file in the output file? And how does cat deal with the missing columns? Does it write NaNs?
Maybe this command could do it?
head -1 file1.txt > all.txt;
tail -n +2 -q file*.txt >> all.txt
I don't think the cat command alone will deal with removing the headers or mark any missing columns, since all it does is concatenate files. But if you know the highest possible number of columns, you can do something like this:
cat file1.txt <( tail -n+2 file2.txt ) | gawk -F';' -v OFS=';' '{NF=4}1'
Where NF=4 is the highest number of columns (in your example, 4).
The command above concatenates file1.txt with a header-less version of file2.txt, using the output of a subcommand as input (the <( ) operator). You can use <( ) as many times as you want, once for each file you want to concatenate. The final command, gawk, was adapted from another answer; it pads out the column delimiters for you.
(note: use brew install gawk if gawk isn't found; Mac OS X's awk won't work)
If dropping the header line entirely (including the first one) doesn't bother you and you don't want to use cat, you could do:
gawk -F';' -v OFS=';' '{NF=4}1' file*.txt | grep -v '^a;b'
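A variant that may avoid the gawk dependency: instead of assigning to NF, it assigns empty strings to the missing fields, which as far as I know rebuilds the record in any awk. It also keeps the first header and skips the header of every later file. The variable names ncols and merged and the scratch directory are illustrative; ncols=4 is the known maximum number of columns from the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a;b;c;d' '1;2;3;4' > "$tmp/file1.txt"
printf '%s\n' 'a;b' '5;6' '7;8' > "$tmp/file2.txt"

merged=$(awk -F';' -v OFS=';' -v ncols=4 '
  FNR == 1 && NR != 1 { next }                   # skip headers of later files
  { for (i = NF + 1; i <= ncols; i++) $i = "" }  # pad short rows with empty fields
  { print }
' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$merged"

rm -r "$tmp"
```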

Append a file in the middle of another file in bash

I need to insert a file at a specific location in another file.
I have the line number. My file is:
file1.txt:
I
am
Cookie
While the second one is
file2.txt:
a
black
dog
named
So, after the solution, file1.txt should be like
I
am
a
black
dog
named
Cookie
The solution should be compatible with the presence of characters like " and / in both files.
Any tool is ok as long as it's native (I mean, no new software installation).
Another option apart from what RavinderSingh13 suggested using sed:
To add the text of file2.txt into file1.txt after a specific line:
sed -i '2 r file2.txt' file1.txt
Output:
I
am
a
black
dog
named
Cookie
To add the file after a line matching a pattern instead:
sed -i '/^YourPattern/ r file2.txt' file1.txt
Please try the following.
awk 'FNR==3{system("cat file2.txt")} 1' file1.txt
Output will be as follows.
I
am
a
black
dog
named
Cookie
Explanation: while reading file1.txt, awk checks whether the current line number is 3. If it is, it calls system(), which runs a shell command, here cat file2.txt, so the contents of file2.txt are printed just before line 3. The trailing 1 is an always-true pattern that prints every line of file1.txt. The result is file1.txt with file2.txt inserted after line 2. Note that this writes to standard output and leaves file1.txt itself unchanged.
How about
head -2 file1 && cat file2 && tail -1 file1
These numbers fit the three-line example. For other files, you can count the lines in file1 to decide the head and tail parameters using
wc -l file1
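The head/tail idea can be generalized so that only the insertion line needs to be known. This is a sketch: the variable n, the file merged.txt, and the scratch directory are illustrative names. It uses tail -n +K, which prints from line K to the end, so the length of file1.txt does not have to be counted.

```shell
tmp=$(mktemp -d)
printf '%s\n' I am Cookie > "$tmp/file1.txt"
printf '%s\n' a black dog named > "$tmp/file2.txt"

n=2  # insert file2.txt after this line of file1.txt
{
  head -n "$n" "$tmp/file1.txt"            # lines 1..n
  cat "$tmp/file2.txt"                     # the inserted file
  tail -n +"$((n + 1))" "$tmp/file1.txt"   # lines n+1..end
} > "$tmp/merged.txt" &&
  mv "$tmp/merged.txt" "$tmp/file1.txt"    # write first, then replace the original

combined=$(cat "$tmp/file1.txt")
printf '%s\n' "$combined"

rm -r "$tmp"
```

Since the file contents are only copied and never interpreted, characters such as " and / in either file need no special handling.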

How to remove lines contained in file 1 from file 2 if in file 2 they are prefixed?

I have the following situation:
source.txt
ID1:email1#domain1.com
ID2:email2#domain2.com
ID3:email3#domain3.com
...
IDs are numeric strings, e.g. 1234, 23412, 897... (one or more digits).
exclude.txt
emailX#domainX.com
emailY#domainY.com
emailZ#domainZ.com
...
i.e. only emails, no IDs.
I want to remove all lines from source.txt which contain emails listed in exclude.txt, preserving the ID:email pairs for the lines which are not removed.
How can I do that with linux command line tools (or simple bash script if needed)?
You can do it easily with awk:
awk -F":" 'NR==FNR{a[$1];next}(!($2 in a))' exclude.txt source.txt
Alternative with grep:
grep -v -F -f exclude.txt source.txt
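Use grep with care: even with -F (fixed strings rather than regexes), it matches substrings anywhere on the line, so an excluded address that is contained in a longer address removes that line too. You might also need to add the -w option to grep (word matching).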
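The difference between the two approaches shows up when one address is a substring of another. The addresses and IDs below are made up for illustration, as are the scratch directory and the variable names.

```shell
tmp=$(mktemp -d)
printf '%s\n' '101:alice#example.com' \
              '102:bob#example.com' \
              '103:jbob#example.com' > "$tmp/source.txt"
printf '%s\n' 'bob#example.com' > "$tmp/exclude.txt"

# awk compares the whole second field, so only ID 102 is dropped.
kept_awk=$(awk -F':' 'NR == FNR { skip[$1]; next } !($2 in skip)' \
  "$tmp/exclude.txt" "$tmp/source.txt")

# grep -F matches substrings, so ID 103 is dropped as well.
kept_grep=$(grep -v -F -f "$tmp/exclude.txt" "$tmp/source.txt")

printf 'awk kept:\n%s\ngrep kept:\n%s\n' "$kept_awk" "$kept_grep"

rm -r "$tmp"
```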

diff command to get number of different lines only

Can I use the diff command to find out how many lines differ between two files?
I don't want the contextual difference, just the total number of lines that are different between two files. Best if the result is just a single integer.
diff can do all the first part of the job but no counting; wc -l does the rest:
diff -y --suppress-common-lines file1 file2 | wc -l
Yes you can, and in true Linux fashion you can use a number of commands piped together to perform the task.
First you need to use the diff command, to get the differences in the files.
diff file1 file2
This will give you a list of changes. The ones you're interested in are the lines prefixed with a '>' symbol, i.e. the lines that appear only in file2.
You use the grep tool to filter these out as follows:
diff file1 file2 | grep "^>"
Finally, once you have a list of the changes you're interested in, you simply use the wc command in line mode to count the number of changes.
diff file1 file2 | grep "^>" | wc -l
and you have a perfect example of the philosophy that Linux is all about.
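A small sketch of the counting step on made-up data. It uses grep -c in place of grep | wc -l, which prints a bare integer, and it also counts the '<' lines (present only in file1) for comparison. The file contents, the file changes.diff, and the variable names are illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\n' a b c d > "$tmp/file1"
printf '%s\n' a x c d e > "$tmp/file2"

# diff exits 1 when the files differ and 2 on error; accept only 0 or 1.
diff "$tmp/file1" "$tmp/file2" > "$tmp/changes.diff" || [ "$?" -eq 1 ]

added=$(grep -c '^>' "$tmp/changes.diff")    # lines only in file2
removed=$(grep -c '^<' "$tmp/changes.diff")  # lines only in file1
printf 'only in file2: %s\nonly in file1: %s\n' "$added" "$removed"

rm -r "$tmp"
```

Here the changed line (b became x) is counted once on each side, and the appended line e is counted only for file2.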
