Get lines of file1 which are not in file2 - linux

I have two long but sorted files. How can I get all lines of the first file that are not in the second file?
file1
0000_aaa_b
0001_bccc_b
0002_bcc <------ file2 does not have this line
0003_aaa_d
0006_xxx
...
file2
0000_aaa_b
0001_bccc_b
0003_aaa_d
0006_xxx
...

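This is what the comm command is for:
$ comm -23 file1 file2
0002_bcc
The -2 and -3 options together suppress the lines unique to file2 and the lines common to both files, leaving only the lines unique to file1.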
From man comm:
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
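To make the column options concrete, here is a minimal sketch that recreates the sample files in a scratch directory and compares -3 with -23. The extra line 0007_new in file2 is a hypothetical addition, not part of the question, included only so that file2 has a unique line of its own.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 0000_aaa_b 0001_bccc_b 0002_bcc 0003_aaa_d 0006_xxx > "$tmpdir/file1"
# 0007_new is a hypothetical extra line so that file2 has something unique.
printf '%s\n' 0000_aaa_b 0001_bccc_b 0003_aaa_d 0006_xxx 0007_new > "$tmpdir/file2"

# -3 alone hides only the common lines: lines unique to file2 still appear,
# indented by one tab.
both_sides=$(LC_ALL=C comm -3 "$tmpdir/file1" "$tmpdir/file2")

# -23 hides the common lines and the lines unique to file2,
# so only lines unique to file1 remain.
only_in_file1=$(LC_ALL=C comm -23 "$tmpdir/file1" "$tmpdir/file2")

echo "comm -3:";  echo "$both_sides"
echo "comm -23:"; echo "$only_in_file1"

rm -r "$tmpdir"
```

With this data, -23 prints only 0002_bcc, while -3 also prints the tab-indented 0007_new.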

Just run a diff on them:
diff -c file1 file2
The -c (for "context") flag prints each changed region in context format, with three lines of context around it by default, so the output contains more than just the lines missing from file2.

Related

Match columns 1, 2 and 5 of file1 against columns 1, 2 and 3 of file2 and output the matching rows from file2; the second file is gzipped (.gz)

file1
3 1234581 A C rs123456
file2 (gzipped, .gz)
1 1256781 rs987656 T C
3 1234581 rs123456 A C
22 1792471 rs928376 G T
output
3 1234581 rs123456 A C
I tried
zcat file2.gz | awk 'NR==FNR{a[$1,$2,$5]++;next} a[$1,$2,$3]' file1.txt - > output.txt
but it is not working
Please try the following awk code with your sample data. Use zcat to decompress the .gz file and pipe it to awk as the second input (-), which awk reads after it has finished reading file1.
zcat your_file.gz | awk 'FNR==NR{arr[$1,$2,$5];next} (($1,$2,$3) in arr)' file1 -
Fixes to the OP's attempt:
There is no need to increment the array value while reading file1. The mere existence of the index is enough.
While reading file2 (piped in by zcat), just check whether the respective fields are present as an index in the array; if they are, the line is printed.
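The approach can be tried end to end with the sample rows from the question. This is a sketch under a few assumptions: the scratch directory and file names are made up for the demonstration, and gzip -dc is used in place of zcat because it behaves the same way for .gz files on most systems.

```shell
#!/bin/sh
# Build the sample inputs; file2 is stored gzipped as in the question.
tmpdir=$(mktemp -d)
printf '%s\n' '3 1234581 A C rs123456' > "$tmpdir/file1"
printf '%s\n' \
  '1 1256781 rs987656 T C' \
  '3 1234581 rs123456 A C' \
  '22 1792471 rs928376 G T' | gzip -c > "$tmpdir/file2.gz"

# First input (file1): remember the key built from columns 1, 2 and 5.
# Second input (-, the decompressed stream): print rows whose
# columns 1, 2 and 3 form a key seen in file1.
matched=$(gzip -dc "$tmpdir/file2.gz" |
  awk 'FNR==NR { arr[$1,$2,$5]; next } (($1,$2,$3) in arr)' "$tmpdir/file1" -)

echo "$matched"
rm -r "$tmpdir"
```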

Iterate over two files in linux || column comparison

We have two files, File1 and File2.
File1 columns
Name Age
abc 12
bcd 14
File2 Columns
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop and then check whether they are the same.
Note: the number of rows in both files is the same, and I am using a .sh shell script.
First make a temporary file from file1 that should be identical to file2.
The name field might contain spaces, so remove everything up to the last space.
Once that is done, you can compare the files.
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
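If a literal single loop is wanted, as the question asks, one possible sketch pairs the rows with paste and reads them in a while loop. The value 15 in file2 is a hypothetical change to the sample data so that one row differs, and the report wording is made up for illustration.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf '%s\n' 'Name Age' 'abc 12' 'bcd 14' > "$tmpdir/file1"
# 15 instead of 14 is a hypothetical mismatch for demonstration.
printf '%s\n' 'Age' '12' '15' > "$tmpdir/file2"

tab=$(printf '\t')

# paste joins line N of file1 and line N of file2 with a tab;
# tail -n +2 skips the header row.
report=$(paste "$tmpdir/file1" "$tmpdir/file2" | tail -n +2 |
  while IFS=$tab read -r left right; do
    age=${left##* }   # keep only the text after the last space of the file1 row
    if [ "$age" = "$right" ]; then
      echo "same: $age"
    else
      echo "differ: $age vs $right"
    fi
  done)

echo "$report"
rm -r "$tmpdir"
```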

Extract lines from File2 already found File1

Using the Linux command line, I need to output the lines from text file2 that are also found in file1.
File1:
C
A
G
E
B
D
H
F
File2:
N
I
H
J
K
M
D
L
A
Output:
A
D
H
Thanks!
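The tool you are looking for is grep.
Say your inputs are in the files file1 and file2. The -f option reads the patterns from file1, -F treats them as fixed strings rather than regular expressions, and -x requires a pattern to match the whole line.
grep -Fxf file1 file2
will return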
H
D
A
A more flexible tool to use would be awk
awk 'NR==FNR{lines[$0]++; next} $1 in lines'
Example
$ awk 'NR==FNR{lines[$0]++; next} $1 in lines' file1 file2
H
D
A
What does it do?
NR==FNR{lines[$0]++; next}
NR==FNR checks whether the record number within the current file (FNR) equals the overall record number (NR). This is true only while reading the first file, file1.
lines[$0]++ creates an entry in the associative array lines with the whole line ($0) of file1 as the index.
$1 in lines is evaluated only for the second file, because of the next in the previous action. It checks whether the first field of the file2 line (here the same as the whole line) is an index in the array lines; if so, the default action of printing the entire line is taken.
Awk is more flexible than grep because you can compare any column of file1 with any column of file2 and choose to print any column rather than the entire line.
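As an illustration of that flexibility, the following sketch matches the second column of one file against a list of keys and prints only the first column of the matching rows. The file names keys and data and their contents are hypothetical, chosen only for the demonstration.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical inputs: a key list and a three-column data file.
printf '%s\n' A D H > "$tmpdir/keys"
printf '%s\n' 'row1 H x' 'row2 Z y' 'row3 A z' > "$tmpdir/data"

# First file: store each key as an array index.
# Second file: if column 2 is a stored key, print column 1 only.
picked=$(awk 'NR==FNR { seen[$1]; next } $2 in seen { print $1 }' \
  "$tmpdir/keys" "$tmpdir/data")

echo "$picked"
rm -r "$tmpdir"
```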
This is what the comm utility does, but you have to sort the files first. To get the lines common to both files:
comm -12 <(sort File1) <(sort File2)

Bash: grep selected text from a file

I have two files. file1:
abc/def/ghi/ss/sfrere/sfs
xyz/pqr/sef/ert/wwqwq/bh
file2:
ind abc def
bcf pqr sss
I want to grep lines from file1 such that at least two words from a single line of file2 appear on one line of file1. In this case the answer would be the first line of file1, because abc and def are both present in it.
This should do the trick:
awk 'FNR==NR{a[$1];next}{for(i in a){c=0;for(j=1;j<=NF;j++){if(index(i,$j)>0)c++}if(c>=2)print i}}' file1.txt file2.txt
Explanation
FNR==NR{a[$1];next} reads file1.txt first and stores its lines as indexes of a (the lines contain no spaces, so $1 is the whole line).
for(i in a) loops over the stored lines.
c=0 resets the counter that tracks how many columns matched.
for(j=1;j<=NF;j++) loops over the columns of the current line of file2.txt.
if(index(i,$j)>0)c++ increments the counter if a column of file2.txt occurs in the line from file1.txt.
if(c>=2)print i applies your condition: if at least 2 columns match, the line from file1.txt is printed.
This is the most straightforward way I could think of; I'm sure there are crazier ways to do this.
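The same logic, spread over several lines and run against the sample data, might look like the sketch below. The variable names stored, line and count are renamed for readability, and $0 is stored instead of $1, which makes no difference here because the file1 lines contain no spaces. Note that index() does substring matching, so a word also counts if it occurs inside a longer path component.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf '%s\n' 'abc/def/ghi/ss/sfrere/sfs' 'xyz/pqr/sef/ert/wwqwq/bh' > "$tmpdir/file1.txt"
printf '%s\n' 'ind abc def' 'bcf pqr sss' > "$tmpdir/file2.txt"

hits=$(awk '
  # First file: remember every line.
  FNR == NR { stored[$0]; next }
  # Second file: for each stored line, count how many words
  # of the current line occur in it as substrings.
  {
    for (line in stored) {
      count = 0
      for (j = 1; j <= NF; j++)
        if (index(line, $j) > 0) count++
      if (count >= 2) print line
    }
  }
' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

echo "$hits"
rm -r "$tmpdir"
```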
For a huge file, build a temporary pattern file for grep -E (egrep) from the contents of file2. This assumes each line of file2 holds three space-separated words:
sed 's/\([^ ]*\) \([^ ]*\) \([^ ]*\)/(\1.*\2)|(\2.*\1)|(\1.*\3)|(\3.*\1)|(\2.*\3)|(\3.*\2)/' file2 >/tmp/file2.egrep
grep -E -f /tmp/file2.egrep file1
rm /tmp/file2.egrep
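To see what the generated pattern file contains, the sketch below runs the same sed expression on the sample file2 and then uses the result to filter file1. The scratch directory and the name patterns are assumptions made for the demonstration.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf '%s\n' 'abc/def/ghi/ss/sfrere/sfs' 'xyz/pqr/sef/ert/wwqwq/bh' > "$tmpdir/file1"
printf '%s\n' 'ind abc def' 'bcf pqr sss' > "$tmpdir/file2"

# Turn each three-word line into an alternation of all ordered word pairs.
sed 's/\([^ ]*\) \([^ ]*\) \([^ ]*\)/(\1.*\2)|(\2.*\1)|(\1.*\3)|(\3.*\1)|(\2.*\3)|(\3.*\2)/' \
  "$tmpdir/file2" > "$tmpdir/patterns"

first_pattern=$(head -n 1 "$tmpdir/patterns")
echo "$first_pattern"

# A line of file1 is printed if any generated pattern matches it.
matches=$(grep -E -f "$tmpdir/patterns" "$tmpdir/file1")
echo "$matches"

rm -r "$tmpdir"
```

Each pattern matches a line that contains any two of the three words in either order, which is why the words in file2 must not contain regular expression metacharacters.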

Best way to print rows not common to two large files in Unix

I have two files in the following format.
File1: It contains 4 columns. The first field is an ID in text format and the remaining columns are also text values.
id1 val12 val13 val14
id2 val22 val23 val24
id3 val32 val33 val34
File2: It contains only IDs.
id1
id2
Output
id3 val32 val33 val34
My question is: how can I find the rows of the first file whose ID (first field) does not appear in the second file? Both files are pretty large: file1 contains 42 million rows (8 GB) and file2 contains 33 million IDs. The order of the IDs in the two files might not be the same.
Assuming the two files are sorted by id, then something like
join -t ' ' -j 1 -v 1 file1 file2
should do it.
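Since the question says the IDs may not be in the same order, the files would need sorting before join can be used. A minimal sketch, assuming a scratch directory and using LC_ALL=C so that sort and join agree on the ordering:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Sample rows from the question, deliberately out of order.
printf '%s\n' 'id3 val32 val33 val34' 'id1 val12 val13 val14' 'id2 val22 val23 val24' > "$tmpdir/file1"
printf '%s\n' id2 id1 > "$tmpdir/file2"

# join needs both inputs sorted on the join field.
LC_ALL=C sort -k1,1 "$tmpdir/file1" > "$tmpdir/file1.sorted"
LC_ALL=C sort "$tmpdir/file2" > "$tmpdir/file2.sorted"

# -v 1 prints only the lines of the first file that have no partner
# in the second file (the join field defaults to field 1).
unmatched=$(LC_ALL=C join -v 1 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

echo "$unmatched"
rm -r "$tmpdir"
```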
You could do it like this with awk:
awk 'FNR == NR { h[$1]; next } !($1 in h)' file2 file1
The first block gathers the IDs from file2 into the hash h. The last part, !($1 in h), triggers the default action ({ print $0 }) when the ID is not present in file2. Testing with in, rather than !h[$1], avoids creating a new array entry for every ID looked up from file1.
I don't claim that this is the "best" way to do it because best can include a number of trade-off criteria, but here's one way:
You can use grep's -f option to read the search patterns from File2, with -v to print only the non-matching lines of File1:
grep -v -f File2 File1 > output
And as @glennjackman suggests, one way to force the ID to match at the beginning of the line:
grep -vf <(sed 's/^/^/' File2) File1
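One caveat is that a pattern such as ^id1 still matches a line starting with id10. A possible refinement, sketched below, anchors both ends of the ID by also appending the field separator to each pattern. It assumes the fields are separated by a single space and that the IDs contain no regular expression metacharacters; the id10 row is a hypothetical addition to the sample data.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# id10 is a hypothetical extra row that exposes the prefix problem.
printf '%s\n' \
  'id1 val12 val13 val14' \
  'id2 val22 val23 val24' \
  'id3 val32 val33 val34' \
  'id10 valA valB valC' > "$tmpdir/File1"
printf '%s\n' id1 id2 > "$tmpdir/File2"

# Anchored at the start only: ^id1 also removes the id10 row.
sed 's/^/^/' "$tmpdir/File2" > "$tmpdir/start_only"
start_only=$(grep -v -f "$tmpdir/start_only" "$tmpdir/File1")

# Anchored at the start and followed by the separator: "^id1 " leaves id10 alone.
sed 's/.*/^& /' "$tmpdir/File2" > "$tmpdir/both_ends"
both_ends=$(grep -v -f "$tmpdir/both_ends" "$tmpdir/File1")

echo "start only:"; echo "$start_only"
echo "both ends:";  echo "$both_ends"
rm -r "$tmpdir"
```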
