How to compare two files in shell [duplicate] - linux

I have two files, file1 and file2. I want to compare file1 with file2 and generate a file3 which contains the lines in file2 that are also present in file1.

You may try this:
awk 'FNR==NR{a[$0];next} ($0 in a)' file2 file1
Use grep
$ grep -w -f file1 file2
-f tells grep to obtain patterns from a file.
-w matches only whole words.
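For instance, with made-up file contents (apple and banana are just placeholders, not from the question), the grep variant prints the lines of file2 that contain a whole word listed in file1:
$ cat file1
apple
banana
$ cat file2
apple pie
cherry tart
banana
$ grep -w -f file1 file2
apple pie
banana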
EDIT:
You may also try this:
fgrep -x -f file2 -v file1
-x matches whole lines only.
-f tells grep to obtain patterns from a file.
-v inverts the match, printing the non-matching lines.
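This EDIT variant prints the lines of file1 that do not occur as whole lines in file2 (fgrep is the historical name for grep -F). A small illustration with invented contents:
$ cat file1
apple
banana
cherry
$ cat file2
banana
$ fgrep -x -f file2 -v file1
apple
cherry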

Related

Extracting lines from a file containing keywords that are present in another file [duplicate]

File1 contains the keywords to pick, found after the 2nd comma (e.g. GOLD, BRO, ...)
File2 is the file the lines should be extracted from
File1:
ABC,123,GOLD,20171201,GOLDFUTURE
ABC,467,SILVER,20171201,SILVERFUTURE
ABC,987,BRO,20171201,BROFUTURE
File2:
XYZ,32,RUBY,20171201,RUBY
XYZ,33,GOLD,20171201,GOLD
XYZ,34,CEMENT,20171201,CEMENT
XYZ,35,PILLAR,20171201,pillar
XYZ,36,CNBC,20171201,CNBC
XYZ,37,CBX,20171201,CBX
XYZ,38,BRO,20171201,BRO
I want Linux commands (awk, sed, cat, grep, etc.) to get this output file:
XYZ,33,GOLD,20171201,GOLD
XYZ,38,BRO,20171201,BRO
I have found these commands online:
1. grep -F -f File1 File2
2. awk 'FNR==NR {a[$0];next} ($NF in a)' File1 File2
3. awk 'FNR==NR {a[$0];next} ($0 in a)' File1 File2
4. diff File1 File2
In point 3 I am picking up whole lines from File1 for the comparison; is there any way to pick up only the keyword after the comma? Or is there any way to set a field separator in the awk command of point 2?
Could you please try the following and let me know if this helps you.
awk -F, 'FNR==NR{a[$3];next} ($3 in a)' File1 File2
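For readability, here is the same one-liner spread over several lines with comments; nothing changes functionally, the comments are just my annotation of what each part does:
awk -F, '               # -F, sets "," as the field separator for both files
FNR==NR {               # true only while reading the first file (File1)
    a[$3]               # remember its 3rd field, the keyword (GOLD, BRO, ...)
    next                # then move on to the next line of File1
}
($3 in a)               # File2: print the line when its 3rd field was seen in File1
' File1 File2
With the File1 and File2 shown above, this prints the GOLD and BRO lines of File2.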

shell script to compare two files and write the difference to third file

I want to compare two files and redirect the difference between the two files to a third one.
file1:
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
If either file has a # before /opt/c/c.sql, the # should be skipped.
file2:
/opt/c/c.sql
/opt/a/a.sql
I want to get the difference between the two files. In this case, /opt/b/b.sql should be stored in a different file. Can anyone help me to achieve the above scenarios?
file1
$ cat file1 #both file1 and file2 may contain spaces which are ignored
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
/opt/h/m.sql
file2
$ cat file2
/opt/c/c.sql
/opt/a/a.sql
Do
awk 'NR==FNR{line[$1];next}
{if(!($1 in line)){if($0!=""){print}}}
' file2 file1 > file3
file3
$ cat file3
/opt/b/b.sql
/opt/h/m.sql
Notes:
The order of the files passed to awk is important here: pass the file to check against (file2 here) first, followed by the master file (file1).
Check awk documentation to understand what is done here.
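The question also mentions that a # in front of an entry such as /opt/c/c.sql should be skipped. One possible way to fold that into the same awk approach is the sketch below; the comment-skipping rule is my addition and assumes that commented-out lines should simply be ignored:
awk '
/^[[:space:]]*#/ { next }            # ignore commented-out lines in either file
NR==FNR          { line[$1]; next }  # first file given (file2): remember its entries
!($1 in line) && $0 != ""            # second file (file1): print entries missing from file2
' file2 file1 > file3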
You can use some tools like cat, sed, sort and uniq.
The main observation is this: if a line is in both files, then it is not unique in the output of cat file1 file2.
Furthermore, in cat file1 file2 | sort, all duplicates end up next to each other. Using uniq -u we keep only the unique lines and get this pipe:
cat file1 file2 | sort | uniq -u
Using sed to remove leading whitespace, empty and comment lines, we get this final pipe:
cat file1 file2 | sed -r 's/^[ \t]+//; /^#/ d; /^$/ d;' | sort | uniq -u > file3
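With the sample file1 and file2 from above, that pipeline produces:
$ cat file1 file2 | sed -r 's/^[ \t]+//; /^#/ d; /^$/ d;' | sort | uniq -u > file3
$ cat file3
/opt/b/b.sql
/opt/h/m.sql
Keep in mind that uniq -u keeps the lines that are unique to either file (the symmetric difference), so a line that appears only in file2 would also end up in file3.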

grep a large list against a large file

I am currently trying to grep a large list of ids (~5,000) against an even larger CSV file (3,000,000 lines).
I want all the CSV lines that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind:
use the -x option if there is a need to match the entire line in the second file
use -F if the first file has strings, not patterns
use -w to prevent partial matches when not using the -x option
This post has a great discussion on this topic (grep -f on large files):
Fastest way to find lines of a file from another larger file in Bash
And this post talks about grep -vf:
grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
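As a small demonstration of the field-matching form (the filter.txt and data.txt contents below are invented purely for illustration):
$ cat filter.txt
11
23
$ cat data.txt
a,11,x
b,99,y
c,23,z
$ awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt
a,11,x
c,23,z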
You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:
ugrep -F -f the_ids.txt huge.csv
This works with GNU grep too, but I expect ugrep to run several times faster.

Separating a joined file to original files in Linux

I know that to append or join multiple files in Linux, we can use the command: cat file1 >> file2.
But I couldn't find any command to separate file1 from file2 after joining them. In other words, I want both original files, file1 and file2, back again. I tried the split command, but it just chops a file into multiple pieces of the same size.
Is there a way to do it?
There is no such command, since no information about what was file1 or file2 is retained. The new combined file is just a data stream.
In order to "split" them back up, you need rules about how to do so (such as, how many bytes file1 and file2 were).
When you perform the concatenation, the system doesn't keep track of how the resulting file was created. So it has no way of remembering where the original split was located in that file.
Can you explain what you are trying to do?
No problem, as long as you still have file1:
$ echo foobar >file1
$ echo blah >file2
$ cat file1 >> file2
$ truncate -s $(( $(stat -c '%s' file2) - $(stat -c '%s' file1) )) file2
$ cat file2
blah
Also, instead of stat -c '%s' filename you can use wc -c filename | cut -f 1 -d ' ', which is longer but more portable.
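For instance, a wc-based variant of the same truncate call could look like this (using input redirection so wc prints only the byte count and no cut is needed; this rewrite is mine, not from the original answer):
$ truncate -s $(( $(wc -c < file2) - $(wc -c < file1) )) file2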

How to search for lines from one file in another in linux

I have file1 and file2. I need to search for each line of file1 in file2 and print it if it exists there. If it does not exist, I need a way to know that there is no matching entry at all. How can I achieve this with Linux commands?
Just need grep:
grep -f file1 file2
If you want to process file1 as fixed strings rather than regex patterns, the -F option needs to be added.
grep -F -f file1 file2
grep can do this task.
grep -f File1 File2
However other commands like diff and cmp can also be used.
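If you also need to detect the case where nothing matches, grep's exit status can help: it exits with 0 when at least one line was selected and with 1 when none was. A minimal sketch (matches.txt and the messages are just example names):
if grep -F -f file1 file2 > matches.txt; then
    echo "matching lines written to matches.txt"
else
    echo "no line of file1 exists in file2"
fi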

Resources