Find rows with the same value in a column in two files - linux

I've got two files (millions of rows)
File1.txt, ~4k rows
some_key1 some_text1
some_key2 some_text2
...
some_keyn some_textn
File2.txt, ~20 M rows
some_key11 some_key11 some_text1
some_key22 some_key22 some_text2
...
some_keynn some_keynn some_textn
When there is an exact match between column 2 in File1.txt and column 3 in File2.txt, I want to print out the particular rows from both files.
EDIT
I've tried this (I forgot to include it originally) but it doesn't work:
awk 'NR{a[$2]}==FNR{b[$3]}'$1 in a{print $1}' file1.txt file2.txt

You need to fix your awk program (the quotes are unbalanced and the NR==FNR test is garbled).
To print all records in file2 if field 2 (file1) exists in field 3 (file2):
awk 'NR==FNR{A[$2];next}$3 in A' file1.txt file2.txt
some_key11 some_key11 some_text1
some_key22 some_key22 some_text2
...
some_keynn some_keynn some_textn
To print just field 1 in file2 if field 2 (file1) exists in field 3 (file2):
awk 'NR==FNR{A[$2];next}$3 in A{ print $1 }' file1.txt file2.txt
some_key11
some_key22
...
some_keynn
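The question actually asks for the matching rows from both files; a minimal sketch along the same lines, keeping each file1 line in an array keyed on its second field (the array name line1 is just illustrative):
awk 'NR==FNR{line1[$2]=$0; next} $3 in line1{print line1[$3]; print}' file1.txt file2.txt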

Let's say your dataset is big in both dimensions - rows and columns. Then you want to use join. To use join, you have to sort your data first. Something along those lines:
<File1.txt sort -k2,2 > File1-sorted.txt
<File2.txt sort -k3,3 -S1G > File2-sorted.txt
join -1 2 -2 3 File1-sorted.txt File2-sorted.txt > matches.txt
The sort -k2,2 means 'sort whole rows so that the values of the second column are in ascending order'. The join -1 2 means 'the key in the first file is the second column'.
If your files are bigger than, say, 100 MB, it pays off to assign additional memory to the sort via the -S option. The rule of thumb is to assign 1.3 times the size of the input to avoid any disk swapping by sort, but only if your system can handle that.
If one of your data files is very small (say up to 100 lines), you can consider doing something like
<File2.txt grep -F -f <( <File1.txt cut -d' ' -f2 ) > File2-matches.txt
to avoid the sort, but then you'd have to look up the 'keys' from that file.
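A sketch of that lookup, assuming space-separated columns and that the keys sit in column 3 of the matched File2 rows:
grep -F -f <( cut -d' ' -f3 File2-matches.txt | sort -u ) File1.txt  # File1 rows containing one of the matched keys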
The decision of which one to use is very similar to choosing between a 'hash join' and a 'merge join' in the database world.

Related

Iterate over two files in linux || column comparison

We have two files, File1 and File2.
File1 columns
Name Age
abc 12
bcd 14
File2 Columns
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop and then check if they are the same.
Note: the number of rows in both files is the same, and I am using a .sh shell script.
First make a temporary file from file1 that should be the same as file2.
The Name field might contain spaces, so remove everything up to the last space.
When you have done this you can compare the files.
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
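If you really want a row-by-row comparison in a single pass rather than a diff, a minimal sketch with paste and awk (assuming the Name field contains no embedded spaces):
# after paste, File1's Age is field 2 and File2's Age is field 3
paste file1 file2 | awk '$2 != $3 { print "mismatch on line " NR ": " $0 }'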

Join the original sorted files, include 2 fields in one file and 1 field in 2nd file

I need help with a Linux command.
I have 2 files, StockSort and SalesSort. They are sorted and they have 3 fields each. I know how to join on 1 field in the 1st file and 1 field in the 2nd file, but I can't get the right syntax for joining two fields in the 1st file with only 1 field in the second file. I also need to save the result in a new file.
So far I have this command, but it doesn't work. I think the mistake is in the "2,3" part, where I need to combine two fields from the 1st file.
join -1 2,3 -2 2 StockSort SalesSort >FinalReport
StockSort file
3976:diode:350
4105:resistor:750
4250:resistor:500
SalesSort file
3976:120:net
4105:250:chg
5500:100:pde
Output should be like this:
3976:350:120
4105:750:250
4250:500:100
You can try
join -t: -o 1.1,1.3,2.2 StockSort SalesSort
where
-t sets the column separator
-o is the output format (a comma-separated list of filenumber.fieldnumber)
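With the sample data above, that join should print something like
3976:350:120
4105:750:250
since 4250 and 5500 have no counterpart in the other file and an inner join drops them.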
Here is an awk:
$ awk 'BEGIN{ FS=OFS=":"}
FNR==NR {Stock[$1]=$3; next}
$1 in Stock{ print $1,Stock[$1],$2}' StockSort SalesSort

sort and remove duplicate based on different columns in a file

I have a file in which there are three columns (yyyy-mm-dd hh:mm:ss.000 12-digit number):
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891
I want to first sort the file based on the date-time(first two columns) and then have to remove the rows having duplicate numbers (third column). So after this the above file will look like:
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890
I have used sort with a key and an awk command (as below), but the results aren't correct. (I am not sure which entries are being removed, as the files I am processing are too big.)
Commands:
sort -k1 inputFile > sortedInputFile
awk '!seen[$3]++' sortedInputFile > outputFile
I am not sure how to do this.
If you want to keep the earliest instance of each 3rd column entry, you can sort twice; the first time to group duplicates and the second time to restore the sort by time, after duplicates are removed. (The following assumes a default sort works with both dates and values and that all lines have three columns with consistent whitespace.)
sort -k3 -k1,2 inputFile | uniq -f2 | sort > sortedFile
The -f2 option to uniq tells it to start the comparison at the end of the second field, so that the date fields are not considered.
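With the sample rows above, the first sort plus uniq -f2 should leave (before the final sort restores time order):
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891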
If milliseconds don't matter, the following is another approach, which removes the milliseconds and then performs sort and uniq:
awk '{print $1" "substr($2,1,index($2,".")-1)" "$3 }' file1.txt | sort | uniq
Here is one in awk. It groups on $3 and stores the earliest timestamp for each value, but the output order is random, so the output should be piped to sort.
$ awk '
a[$3] == "" || a[$3] > ($1 OFS $2) { a[$3] = $1 OFS $2 }
END{ for(i in a) print a[i], i }
' file # | sort goes here
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891

Create diff between two files based on specific column

I have the following problem.
Say I have 2 files:
A.txt
1 A1
2 A2
B.txt
1 B1
2 B2
3 B3
I want to make a diff based only on the values of the first column, so the result should be
3 B3
How can this problem be solved with bash in Linux?
awk is your friend
awk 'NR==FNR{f[$1];next}{if($1 in f){next}else{print}}' A.txt B.txt
or more simply
awk 'NR==FNR{f[$1];next}!($1 in f){print}' A.txt B.txt
or even more simply
awk 'NR==FNR{f[$1];next}!($1 in f)' A.txt B.txt
A bit of explanation will certainly help:
NR and FNR are awk built-in variables. NR is the total number of records (including the current one) processed so far across all files, while FNR is the number of records processed so far in the current file; they are equal only while the first file is being processed.
f[$1] creates the array f on first use and adds $1 as a key if that key doesn't exist yet. Since no value is assigned, f[$1] is auto-initialized to the null string (zero), but that value isn't used in your case.
next moves on to the next record without processing the rest of the awk script.
Note that the {if($1 in f){next}else{print}} part is processed only for the second file (and any subsequent files).
$1 in f checks whether the key $1 exists in the array f.
The if-else-print part is self-explanatory.
Note that in the third version, the {print} is omitted because awk's default action is to print the record.
awk 'NR==FNR{array[$1];next} !($1 in array)' a.txt b.txt
3 B3
Like this in bash but only if you are really not interested in the second column at all:
diff <(cut -f1 -d" " A.txt) <(cut -f1 -d" " B.txt)
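If you do want to keep the second column in the output, a join sketch should also work here, assuming both files are already sorted on the first column as in the example:
join -v 2 A.txt B.txt  # print lines of B.txt whose first field has no match in A.txt
which prints 3 B3.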

Best way to print rows not common to two large files in Unix

I have two files which are of the following format.
File1 - It contains 4 columns. The first field is an ID in text format and the rest of the columns are also text values.
id1 val12 val13 val14
id2 val22 val23 val24
id3 val32 val33 val34
File2 - In file two I only have IDs.
id1
id2
Output
id3 val32 val33 val34
My question is: how do I find rows from the first file whose ID (first field) does not appear in the second file? Both files are pretty large, with file1 containing 42 million rows (8 GB) and file2 containing 33 million IDs. The order of IDs in the two files might not be the same.
Assuming the two files are sorted by id, something like
join "-t " -j 1 -v 1 file1 file2
should do it.
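If the files are not already sorted by id (the question notes the order might differ), a sketch that sorts them on the fly first:
join -j 1 -v 1 <(sort -k1,1 file1) <(sort -k1,1 file2)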
You could do it like this with awk:
awk 'FNR == NR { h[$1] = 1; next } !h[$1]' file2 file1
The first block gathers ids from file2 into the h hash. The last part (!h[$1]) executes the default block ({ print $0 }) if the id wasn't present in file2.
I don't claim that this is the "best" way to do it because best can include a number of trade-off criteria, but here's one way:
You can do this with the -f option to specify File2 as the file containing search patterns to grep:
grep -v -f File2 File1 > output
And as @glennjackman suggests, one way to force the id to match at the beginning of the line:
grep -vf <(sed 's/^/^/' File2) File1
