Join two TSV files with inner join - Linux

I have 2 TSV files:
TSV file 1:
A B
hello 0.5
bye 0.4
TSV file 2:
C D
hello 1
country 5
I want to join the two TSV files on file1.A = file2.C.
How can I do it with the join command on Linux?
Hoping to get this:
Text B D
hello 0.5 1
bye 0.4
country 5
I'm not getting any output with this:
join -j 1 <(sort -k1 file1.tsv) <(sort -k1 file2.tsv)

A little hairy, but here is a solution using awk and associative arrays. Keys present in both files are printed twice, once per loop, so the final sort | uniq removes the duplicates.
awk 'FNR == 1 {h[length(h) + 1] = $2}            # grab the second header field of each file
FILENAME ~ /test1.tsv/ && FNR > 1 {t1[$1]=$2}    # key -> value map for the first file
FILENAME ~ /test2.tsv/ && FNR > 1 {t2[$1]=$2}    # key -> value map for the second file
END{print "Text\t"h[1]"\t"h[2];                  # combined header line
for(x in t1){print x"\t"t1[x]"\t"t2[x]}          # one row per key in file 1
for(x in t2){print x"\t"t1[x]"\t"t2[x]}}' test1.tsv test2.tsv |
sort | uniq
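For what it's worth, the outer join shown in the question can also be produced with GNU join itself. A sketch, assuming tab-separated files, with the headers stripped before joining and re-added by hand (-a1 -a2 keeps the unpairable lines from both files, which is what makes this an outer rather than an inner join):
printf 'Text\tB\tD\n'
join -t $'\t' -a1 -a2 -e '' -o 0,1.2,2.2 \
    <(tail -n +2 file1.tsv | sort) \
    <(tail -n +2 file2.tsv | sort)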

File1
$ cat file1
A B
hello 0.5
bye 0.4
File2
$ cat file2
C D
hello 1
country 5
Output
$ awk 'NR==1{print "Text","B","D"}FNR==1{next}FNR==NR{A[$1]=$2;next}{print $0,(f=$1 in A ? A[$1] : ""); if(f)delete A[$1]}END{for(i in A)print i,"",A[i]}' OFS='\t' file2 file1
Text B D
hello 0.5 1
bye 0.4
country 5
More Readable Version
awk '
# Print the header when NR == 1; this is true only for the first line of the first file
NR==1{print "Text","B","D"}
# FNR is the number of records relative to the current input file.
# When awk reads multiple input files, NR gives the total number
# of records across all of them, while FNR resets for each file.
# So on the first line of each file, stop processing and move on
# to the next line; this simply skips the header of each input file.
FNR==1{
next
}
# FNR==NR is only true while reading the first file (file2)
FNR==NR{
# Build an associative array keyed on the first column of the file,
# where each element is the second column
A[$1]=$2
# Skip all following blocks and process the next line
next
}
{
# Check whether the index ($1 = column 1) from the second file (file1) exists in array A.
# If it exists, variable f is set to 1 (true), otherwise 0 (false).
# Print the current line, followed by the matching element of array A
# if there is one, or an empty string otherwise.
print $0,( f=$1 in A ? A[$1] : "" )
# Delete array element corresponding to index $1, if f is true
if(f)delete A[$1]
}
# Finally, in the END block, print the remaining array elements one by one:
# the keys from file2 that do not exist in file1
END{
for(i in A)
print i,"",A[i]
}
' OFS='\t' file2 file1

In your title you state you want to perform an inner join. Your example output suggests you want an outer join.
If you want an inner join as the title suggests, I recommend you use eBay's fabulous tsv-utils, particularly the tsv-join command, as follows:
tsv-join -H --filter-file 1.tsv --key-fields A --data-fields C --append-fields B 2.tsv
No awk magic needed, just a simple, well-documented command with easily understandable options.
The above produces a proper inner join; you'd just need to rename the join key to Text:
C D B
hello 1 0.5
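If you want the key column in the header to read Text, a small sed pass over the result will do it (a sketch, assuming the header shown above):
tsv-join -H --filter-file 1.tsv --key-fields A --data-fields C --append-fields B 2.tsv |
    sed '1s/^C/Text/'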

Related

Print columns 1 and 3 for cars cheaper than a given price using awk

I have this file that contains the car name, colour and price:
Toyota#Red#4500
Sedan#Blue#2600
Hyunda#Black#5000
Dudge#White#3900
Lymozeen#Black#2400
The output should display the car name and price for every car cheaper than 5000, sorted by price:
Lymozeen#2400
Sedan#2600
Dudge#3900
Toyota#4500
I have tried the following code:
awk '{if($3 <= 5000)print $1,$3}' myfile
I'd suggest breaking this up. First, sort the content of the file based on the value of the third column, then select the lines of interest with your condition. Here's how:
awk -F'#' '{ print $NF, $0 }' myfile | sort -n | awk -F'[# ]' '{ if ($1 < 5000) print $2 "#" $4 }'
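Equivalently, you can filter first and then sort on the price field, which keeps the pipeline a little shorter; a sketch assuming the same #-delimited file:
awk -F'#' '$3 < 5000 { print $1 "#" $3 }' myfile | sort -t'#' -k2,2n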
One in GNU awk:
$ gawk '
BEGIN {
FS=OFS="#" # set field separators
}
$3<5000 { # if less than 5k
a[NR]=$3 # on NR hash price
b[NR]=$1 # on NR hash brand
}
END { # in the end
PROCINFO["sorted_in"]="#val_num_asc" # set for traverse order
for(i in a) # loop in ascending price order
print b[i],a[i] # output
}' file
Output:
Lymozeen#2400
Sedan#2600
Dudge#3900
Toyota#4500

Replace a pattern in file2 with a pattern from file1 if a contingency is met

I have two tab-delimited data files. file1 looks like:
cluster_j_72 cluster-32 cluster-32 cluster_j_72
cluster_j_75 cluster-33 cluster-33 cluster_j_73
cluster_j_8 cluster-68 cluster-68 cluster_j_8
and file2 looks like:
NODE_148 67545 97045 cluster-32
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster-68
For a given row in file1, I would like to confirm that columns 2 and 3, as well as columns 1 and 4, are identical. If this is the case, then I would like to take that row's value from column 2 (file1), find it in file2, and replace it with the value from column 1 (file1). Thus the new output of file2 would look like this (note that because columns 1 and 4 don't match for cluster-33 in file1, the pattern is not replaced in file2):
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
I have been able to get the contingency correct (here printing the value from file1 I'd like to use to replace a value in file2):
awk '{if($2==$3 && $1==$4){print $1}}' file1
If I could get sed to draw values ($2 and $1) from file1 while searching in file2, this would work:
sed 's/$2(from file1)/$1(from file1)/' file2
But I don't seem to be able to nest this sed in the previous awk statement, nor get sed to look for a pattern originating in a different file from the one it's searching.
Thanks!
You never need sed when you're using awk since awk can do anything that sed can do.
This might be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
if ( ($1 == $4) && ($2 == $3) ) {
map[$2] = $1
}
next
}
$4 in map { $4 = map[$4] }
{ print }
$ awk -f tst.awk file1 file2
NODE_148 67545 97045 cluster_j_72
NODE_221 1 42205 cluster-33
NODE_168 1 24506 cluster_j_8
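The same logic also works as a one-liner if you'd rather not keep a script file around:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{if($1==$4 && $2==$3)map[$2]=$1; next} $4 in map{$4=map[$4]} 1' file1 file2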

Finding matches in 2 files and printing the field above the match

File1:
2987571 2988014
4663633 4668876
4669084 4669827
4669873 4670130
4670212 4670604
4670604 4672469
4672502 4672621
4672723 4673088
4673102 4673518
4673521 4673895
4679698 4680174
5756724 5757680
5757937 5758506
5758855 5759202
5759940 5771528
5772524 5773063
5773005 5773106
5773063 5773452
5773486 5773776
5773836 5774189
File2:
gene complement(6864294..6865061)
/locus_tag="HCH_06747"
CDS complement(6864294..6865061)
/locus_tag="HCH_06747"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33372.1"
/translation="MIKQLVRPLFTGKGPNFSELSAKECGVGEYQLRYKLPGNTIHIG
MPDAPVPARVNLNADLFDSYGPKKLYNRTFVQMEFEKWAYKGRFLQGDSGLLSKMSLH
IDVNHAERHTEFRKGDLDSLELYLKKDLWNYYETERNIDGEQGANWEARYEFDHPDEM
RAKGYVPPDTLVLVRLPEIYERAPINGLEWLHYQIRGEGIPGPRHTFYWVYPMTDSFY
LTFSFWMTTEIGNRELKVQEMYEDAKRIMSMVELRKE"
gene complement(6865197..6865964)
/locus_tag="HCH_06748"
CDS complement(6865197..6865964)
/locus_tag="HCH_06748"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33373.1"
/translation="MIKQIVRPLFTGKGPNFSELNVKECGIGDYLLRYKLPGNTIDIG
MPDAPVPSRVNLNADLFDSYDPKKLYNRTFVQMEFEWWAYRGLFLQGDSGLLSKMSLH
IDVNRINPNSPLGGSDLESLETYLREDYWDYYEAEKNIDGVPGSNWQKRYDFDNPDEV
RAKGYIPVRRLVLVLLPEIYVKERINDVEWLHYSIDGEGIAGTNITYYWAYPLTNNYY
LTFSFRTTTELGRNEQRYQRMLEDAKQIMSMVELCKG"
gene complement(6865961..6867109)
/locus_tag="HCH_06749"
CDS complement(6865961..6867109)
The goal here is to take each number from the first file's first column and see if that number appears in the second file. If it does, I want to print the "/locus_tag" line right above the match in file2.
For example, if file1 contains 6864294, and this number is also present in file2, then I'd like to print: /locus_tag="HCH_06747"
Here's a rough sample:
awk '
NR==FNR { # hash file 1 to a
a[$1]
next
}
{
q=$0
while(match($0,/[0-9]+/)) { # find all numeric strings
if((substr($0,RSTART,RLENGTH) in a)) # test if it is in a
print p # and output previous record p
$0=substr($0,RSTART+RLENGTH) # remove match from record
}
p=q # store current record to p
}' file1 file2
/locus_tag="HCH_06747"
I tried this, and I think it will work:
for i in $(awk '{print $1 "\n" $2}' file1)
do
grep -m1 -A1 "$i" file2 | tail -1
done
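If the numbers can occur as substrings of longer numbers, a slightly more defensive variant of the same idea quotes the values and matches whole words only; this is a sketch, not tested against the full data:
while read -r c1 c2; do
    for n in "$c1" "$c2"; do
        grep -m1 -w -A1 -- "$n" file2 | tail -n 1
    done
done < file1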

Emacs: how to concatenate two rows together to form a unique identifier? [duplicate]

Input, where the unique identifier is specified by the first two rows:
L1_I L1_I C-14 <---| unique identifier
WWPTH WWPT WWPTH <---| on two rows
1 2 3
Goal: how to concatenate the rows?
L1_IWWPTH L1_IWWPT C-14WWPTH <--- unique identifier
1 2 3
P.S. I will accept the simplest and most elegant solution.
Assuming that the input is in a file called file:
$ awk 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next} 1' file
L1_IWWPTH L1_IWWPT C-14WWPTH
1 2 3
How it works
NR==1{for (i=1;i<=NF;i++) a[i]=$i;next}
For the first line, save all the column headings in the array a. Then, skip over the rest of the commands and jump to the next line.
NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next}
For the second line, print all the column headings, merging together the ones from the first and second rows. Then, skip over the rest of the commands and jump to the next line.
1
1 is awk's cryptic shorthand for printing the line as-is. This is done for all lines after the second.
Tab-separated columns with possible missing columns
If columns are tab-separated:
awk -F'\t' 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%s\t",a[i] $i;print"";next} 1' file
If you plan to use Python, you can use zip in the following way:
rows = [['L1_I', 'L1_I', 'C-14'], ['WWPTH', 'WWPT', 'WWPTH'], [1, 2, 3]]
output = [[i + j for i, j in zip(rows[0], rows[1])]] + rows[2:]
print(output)
output:
[['L1_IWWPTH', 'L1_IWWPT', 'C-14WWPTH'], [1, 2, 3]]
#!/usr/bin/awk -f
# Remember the fields of the first line.
NR == 1 {
split($0, a)
next
}
# On the second line, print each field prefixed by the saved field
# from the same column, left-justified in a 20-character column.
NR == 2 {
for (b in a)
printf "%-20s", a[b] $b
print ""
next
}
# Print all remaining lines unchanged.
1
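Saved as, say, concat-headers.awk (a filename chosen here purely for illustration) and made executable, the script runs directly:
chmod +x concat-headers.awk
./concat-headers.awk file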

Scalable way of deleting all lines from a file where the line starts with one of many values

Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is on the order of 100,000 variable values. The file to be mangled is on the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index of array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
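For completeness, here is what that RE variant might look like, guarding against the leading "|" (escaping of RE metacharacters is still left as an exercise):
awk '
NR==FNR {                                # first file: build an alternation of all values
    list = (list == "" ? $0 : list "|" $0)
    next
}
$0 !~ ("^(" list ")")                    # print lines that do not start with any value
' file1 file2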
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Let's see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
You can use comm to display the lines that are not common to both files, like this:
comm -3 file1 file2
Will print:
C
Note that comm compares whole lines, so this removes only exact matches, not lines that merely begin with one of the values. Also notice that for this to work, both files have to be sorted; if they aren't, you can work around that with
comm -3 <(sort file1) <(sort file2)
