Best way to print rows not common to two large files in Unix - linux

I have two files which are in the following format.
File1 - It contains 4 columns. The first field is an ID in text format and the rest of the columns are also text values.
id1 val12 val13 val14
id2 val22 val23 val24
id3 val32 val33 val34
File2 - In file two I only have IDs.
id1
id2
Output
id3 val32 val33 val34
My question is: how do I find the rows from the first file whose ID (first field) does not appear in the second file? Both files are pretty large: file1 contains 42 million rows (8 GB) and file2 contains 33 million IDs. The order of IDs in the two files might not be the same.

Assuming the two files are sorted by id, then something like
join "-t " -j 1 -v 1 file1 file2
should do it.
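Since the question says the order of IDs might differ, a minimal sketch that sorts both files first (assuming space-separated fields as in the sample; LC_ALL=C makes sort and join agree on collation):
LC_ALL=C sort -k1,1 file1 > file1.sorted
LC_ALL=C sort file2 > file2.sorted
# -v 1 prints the lines of file1.sorted whose join field has no match in file2.sorted
join "-t " -j 1 -v 1 file1.sorted file2.sorted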

You could do it like this with awk:
awk 'FNR == NR { h[$1] = 1; next } !h[$1]' file2 file1
The first block gathers the IDs from file2 into the h array. The final pattern (!h[$1]) triggers the default action ({ print $0 }) for every row of file1 whose ID was not present in file2.
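With the sample data above, this should print:
id3 val32 val33 val34
Note that the h array is held in memory, so with 33 million IDs this needs a machine with enough RAM to hold them all.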

I don't claim that this is the "best" way to do it because best can include a number of trade-off criteria, but here's one way:
You can do this with the -f option to specify File2 as the file containing search patterns to grep:
grep -v -f File2 File1 > output
And as @glennjackman suggests, one way to force the id to match at the beginning of the line:
grep -vf <(sed 's/^/^/' File2) File1
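A further refinement, as a sketch (assuming the fields are separated by single spaces): also require a space after the ID, so that an ID like id1 does not accidentally match a longer ID such as id10:
# turn each ID into an anchored pattern like '^id1 '
grep -vf <(sed 's/.*/^& /' File2) File1
Be aware that grep -f with 33 million patterns can be very slow and memory-hungry; the join or awk approaches above scale better.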

Related

Iterate over two files in linux || column comparison

We have two files, File1 and File2.
File 1 columns
Name Age
abc 12
bcd 14
File2 Columns
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop, and then check if they are the same.
Note: the number of rows in both files is the same, and I am using an sh shell script.
First make a temporary file from file1 that should be the same as file2. The name field might contain spaces, so remove everything up to and including the last space, keeping only the last column. When you have done this you can compare the files:
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
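If you specifically want a single loop as asked, a minimal sketch (assuming the header lines have been removed, the names contain no spaces, and the rows line up):
# read file1 on stdin and file2 on fd 3, one line of each per iteration
while read -r name age && read -r age2 <&3; do
    if [ "$age" = "$age2" ]; then
        echo "match: $age"
    else
        echo "differ: $age vs $age2"
    fi
done < file1 3< file2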

Find rows with the same value in a column in two files

I've got two files (millions of rows)
File1.txt, ~4k rows
some_key1 some_text1
some_key2 some_text2
...
some_keyn some_textn
File2.txt, ~20 M rows
some_key11 some_key11 some_text1
some_key22 some_key22 some_text2
...
some_keynn some_keynn some_textn
When there is an exact match between column 2 in File1.txt and column 3 in File2.txt, I want to print out the particular rows from both files.
EDIT
I've tried this (I forgot to include it originally) but it doesn't work:
awk 'NR{a[$2]}==FNR{b[$3]}'$1 in a{print $1}' file1.txt file2.txt
You need to fix your awk program.
To print all records in file2 if field 2 (file1) exists in field 3 (file2):
awk 'NR==FNR{A[$2];next}$3 in A' file1.txt file2.txt
some_key11 some_key11 some_text1
some_key22 some_key22 some_text2
...
some_keynn some_keynn some_textn
To print just field 1 in file2 if field 2 (file1) exists in field 3 (file2):
awk 'NR==FNR{A[$2];next}$3 in A{ print $1 }' file1.txt file2.txt
some_key11
some_key22
...
some_keynn
Let's say your dataset is big in both dimensions, rows and columns. Then you want to use join. To use join, you have to sort your data first. Something along these lines:
<File1.txt sort -k2,2 > File1-sorted.txt
<File2.txt sort -k3,3 -S1G > File2-sorted.txt
join -1 2 -2 3 File1-sorted.txt File2-sorted.txt > matches.txt
The sort -k2,2 means 'sort whole rows so that the values of the second column are in ascending order'. The join -1 2 means 'the key in the first file is the second column'.
If your files are bigger than, say, 100 MB it pays off to assign additional memory to sort via the -S option. The rule of thumb is to assign 1.3 times the size of the input to avoid any disk swapping by sort, but only if your system can handle that.
If one of your data files is very small (say up to 100 lines), you can consider doing something like
<File2.txt grep -F -f <( <File1.txt cut -d' ' -f2 ) > File2-matches.txt
to avoid the sort, but then you'd have to look up the 'keys' from that file.
The decision of which one to use is very similar to the choice between a 'hash join' and a 'merge join' in the database world.
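Since File1.txt here is only ~4k rows, an in-memory hash join with awk is also practical. A sketch (assuming whitespace-separated fields and unique keys in File1.txt) that prints the matching rows from both files, as the question asks:
# load File1.txt into memory keyed by column 2, then stream File2.txt once
awk 'NR==FNR { row1[$2] = $0; next }        # File1.txt: remember whole row by its key
     $3 in row1 { print row1[$3]; print }   # File2.txt: on a match, print both rows
    ' File1.txt File2.txt > matches.txt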

Join the original sorted files, include 2 fields in one file and 1 field in 2nd file

I need help with a Linux command.
I have 2 files, StockSort and SalesSort. They are sorted and they have 3 fields each. I know how to join on 1 field in the 1st file and 1 field in the 2nd file, but I can't get the right syntax for joining on two fields in the 1st file and only 1 field in the second file. I also need to save the result in a new file.
So far I have this command, but it doesn't work. I think the mistake is in the "2,3" part, where I need to combine two fields from the 1st file.
join -1 2,3 -2 2 StockSort SalesSort >FinalReport
StockSort file
3976:diode:350
4105:resistor:750
4250:resistor:500
SalesSort file
3976:120:net
4105:250:chg
5500:100:pde
Output should be like this:
3976:350:120
4105:750:250
4250:500:100
You can try
join -t: -o 1.1,1.3,2.2 StockSort SalesSort
where
-t sets the column separator
-o sets the output format (a comma-separated list of filenumber.fieldnumber)
Here is an awk alternative:
$ awk 'BEGIN{ FS=OFS=":"}
FNR==NR {Stock[$1]=$3; next}
$1 in Stock{ print $1,Stock[$1],$2}' StockSort SalesSort
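With the sample files above, both commands should print only the IDs present in both files:
3976:350:120
4105:750:250
Note that the expected output in the question pairs the unmatched rows 4250 and 100, which a key-based join will not do.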

Linux - putting lines that contain a string at a specific column in a new file

I want to pull all rows from a text file in linux which contain a specific number (in this case 9913) in a specific column (column 4). This is a tab-delimited file, so I am calling this a column, though I am not sure it is.
In some cases, there is only one number in column 4, but in other lines there are multiple numbers in this column (ex. 9913; 4444; 5555). I would like to get any rows for which the number 9913 appears in the 4th column (whether or not it is the only number or in a list). How do I put all lines which contain the number 9913 in column 4 and put them in their own file?
Here is an example of what I have tried:
cat file.txt | grep 9913 > newFile.txt
The result is a mixture of the following:
CDR1as CDR1as ENST00000003100.8 9913 AAA-GGCAGCAAGGGACUAAAA (line that I want)
CDR1as CDR1as ENST00000399139.1 9606 GUCCCCA................(line I don't want)
I do not get any results when selecting on a specific column. As shown below, awk does not seem to recognize the columns, and I get blank files:
awk '$4 == "9913"' file.txt > newfile.txt
gives me an empty newfile.txt.
Thanks
This is one way of doing it
awk '$4 == "9913" {print $0}' file.txt > newfile.txt
or just
awk '$4 == "9913"' file.txt > newfile.txt
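If column 4 can hold several numbers (e.g. 9913; 4444; 5555), an exact comparison with == will miss those rows. A sketch (assuming the file really is tab-delimited) that matches 9913 alone or inside such a list, without matching longer numbers like 59913:
awk -F'\t' '$4 ~ /(^|[; ])9913([;,]|$)/' file.txt > newfile.txt
Setting -F'\t' also stops awk from splitting the multi-number column at its internal spaces, which is one reason the default field splitting can make $4 look wrong.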

Get lines of file1 which are not in file2

I have two long but sorted files. How do I get all the lines of the first file which are not in the second file?
file1
0000_aaa_b
0001_bccc_b
0002_bcc <------ file2 does not have this line
0003_aaa_d
0006_xxx
...
file2
0000_aaa_b
0001_bccc_b
0003_aaa_d
0006_xxx
...
This is what the comm command is for. Since you only want the lines unique to file1, suppress columns 2 and 3:
$ comm -23 file1 file2
0002_bcc
From man comm:
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
Just run a diff on them:
diff -c file1 file2
The -c (for "context") flag displays only the sections that differ, together with a few lines of surrounding context (three by default; use -C N to choose the amount).
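If you only want the lines from file1, a minimal sketch (relying on diff's default output, where lines unique to the first file are prefixed with "< "):
# keep only the "< " lines and strip the prefix
diff file1 file2 | sed -n 's/^< //p'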
