Faster solution to compare files in bash - linux

file1:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
and file2:
NR_024540 11
I need to find each match of file2's first column in file1 and print the whole file1 line plus the second column of file2.
So the output is:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
My solution is very slow in bash:
#!/bin/bash
while read line; do
    c=$(echo $line | awk '{print $1}')
    d=$(echo $line | awk '{print $2}')
    grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2
I would prefer ANY faster bash or awk solution. The output format can be modified, but it must keep all the information (the column order may differ).
EDIT:
Right now the fastest solution appears to be the one from @chepner:
#!/bin/bash
while read -r c d; do
    grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output

In a single awk command:
awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
A more readable multi-line version:
FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1] = $2
    next
}
# On 'file1' do
{
    # Iterate through the hash-map
    for (i in map) {
        # If the element from the hash-map matches the line as a
        # regex, append the mapped value as a last field and print
        if ($0 ~ i) {
            $(NF+1) = map[i]
            print
            next
        }
    }
}
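To run the multi-line version, save it to a file (merge.awk is just an illustrative name here) and pass it to awk with -f:
awk -f merge.awk file2 file1 > output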

Another solution using join and sed, under the assumption that file1 and file2 are sorted:
join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output
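To see what the sed preprocessing does, here is the first sample line after the substitution; the duplicated ID becomes field 4, the key join matches on:
$ sed -r 's/[^ _]+_[^_]+/& &/' file1 | head -1
chr1 14361 14829 NR_024540 NR_024540_0_r_DDX11L1,WASH7P_468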
If the output order doesn't matter, with awk:
awk 'FNR==NR{d[$1]=$2; next}
{split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1
you get,
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

Try this:
cat file2
NR_024540 11
NR_024541 12
cat file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14361 14829 NR_024542_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
Performance (tested with 55,000 records):
time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1
real 0m0.16s
user 0m0.14s
sys 0m0.01s
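To reproduce a benchmark at roughly this scale, a quick editorial sketch (not from the answer; big_file1 is a hypothetical name, and the test data simply repeats the five sample rows):
yes "$(cat file1)" | head -n 55000 > big_file1
time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 big_file1 > output1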

You are starting a lot of external programs unnecessarily. Let read split the incoming line from file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.
while read -r c d; do
    awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output
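As a small editorial illustration of how read -r c d splits each file2 line on whitespace without any external command (bash here-string, demonstration only):
$ read -r c d <<< 'NR_024540 11'
$ printf '%s\n' "$c" "$d"
NR_024540
11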

If the searched string is always the same length (length("NR_024540")==9):
awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1
Explained:
NR==FNR {                          # process file2
    a[$1] = $2                     # hash record using $1 as the key
    next                           # skip to the next record
}
(i=substr($4,1,9)) && (i in a) {   # take the first 9 bytes of $4 as i and look it up in a
    print $0, a[i]                 # output if found
}
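One caveat (editorial, not from the answer): the combined pattern relies on the assignment (i=substr($4,1,9)) evaluating as true, which holds here because every ID starts with "NR". An equivalent form that avoids depending on that:
awk 'NR==FNR{a[$1]=$2; next} {i=substr($4,1,9)} i in a {print $0, a[i]}' file2 file1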

awk -F '[[:blank:]_]+' '
FNR==NR { a[$2]=$3 ;next }
{ if ( $5 in a ) $0 = $0 " " a[$5] }
7
' file2 file1
Comment:
Using _ as an extra field separator makes the names easy to compare in both files (only the numeric part is used as the key).
The 7 is just for fun: any non-zero value works as an always-true pattern, so the line is printed.
No field is modified (no $(NF+1), ...), so the original format is kept and only the referenced number is appended.
A smaller one-liner (optimized for code size), assuming file1 contains no empty lines (that assumption is mandatory here). If the separator is only spaces, you can replace [:blank:] with a space character:
awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1
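To see how that field separator carves up a file1 line (demonstration only):
$ echo 'chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468' | awk -F '[[:blank:]_]+' '{for (i=1; i<=NF; i++) print i, $i}'
1 chr1
2 14361
3 14829
4 NR
5 024540
6 0
7 r
8 DDX11L1,WASH7P
9 468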

No awk or sed needed. This assumes file2 is only one line:
n="`cut -f 2 file2`" ; while read x ; do echo "$x $n" ; done < file1

Related

Appending the line even though there is no match with awk

I am trying to compare two files and append another column when a certain condition is satisfied.
file1.txt
1 101 111 . BCX 123
1 298 306 . CCC 234
1 299 305 . DDD 345
file2.txt
1 101 111 BCX P1#QQQ
1 299 305 DDD P2#WWW
The output should be:
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
What I can do is restrict this to only the lines that have a match:
awk 'NR==FNR{ a[$1,$2,$3,$4]=$5; next }{ s=SUBSEP; k=$1 s $2 s $3 s $5 }k in a{ print $0,a[k] }' file2.txt file1.txt
1 101 111 . BCX 123 P1#QQQ
1 299 305 . DDD 345 P2#WWW
But then, I am missing the second line in file1.
How can I still keep it even though there is no match with file2 regions?
If you want to print every line, you need your print command not to be limited by your condition.
awk '
NR==FNR {
    a[$1,$2,$3,$4] = $5; next
}
{
    s = SUBSEP; k = $1 s $2 s $3 s $5
}
k in a {
    $6 = $6 ";" a[k]
}
1' file2.txt file1.txt
The 1 is shorthand that says "print every line". It's a condition (without command statements) that always evaluates "true".
The k in a condition simply replaces your existing 6th field with the concatenated one. If the condition is not met, the replacement doesn't happen, but we still print because of the 1.
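A minimal demonstration of the idiom:
$ printf 'a\nb\n' | awk '1'
a
b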
The following awk may help you do the same:
awk 'FNR==NR{a[$1,$2,$3,$4]=$NF;next} (($1,$2,$3,$5) in a){print $0";"a[$1,$2,$3,$5];next} 1' file2.txt file1.txt
Output will be as follows.
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
Another awk:
$ awk ' {t=5-(NR==FNR); k=$1 FS $2 FS $3 FS $t}
NR==FNR {a[k]=$NF; next}
k in a {$0=$0 ";" a[k]}1' file2 file1
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
The last component of the key is either the 4th or the 5th field, depending on whether we are reading the first input file (file2, 4th field) or the second (file1, 5th field); set it accordingly and use a single k variable in the script. Note that
t=5-(NR==FNR)
can be written more conventionally as
t=NR==FNR?4:5

Sorting a file and putting them in different files

I am trying to sort a file that has different genomic regions, where each region has a letter-and-number combination of its own.
I want to sort the whole file by genomic location (columns 1, 2, 3), and whenever these three columns are the same,
extract those rows into a new separate file.
My input is:
1.txt
chr1 10 20 . . 00000 ACTGBACA
chr1 10 20 . + 11111 AACCCCHQ
chr1 18 40 . . 0 AA12KCCHQ
chr7 22 23 . . 21 KLJMWQKD
chr7 22 23 . . 8 XJKFIRHFBF24
chrX 199 201 . . KK AVJI24
What I am expecting is:
chr1.10-20.txt
chr1 10 20 ACTGBACA
chr1 10 20 AACCCCHQ
chr1.18-40.txt
chr1 18 40 AA12KCCHQ
chr7.22-23.txt
chr7 22 23 KLJMWQKD
chr7 22 23 XJKFIRHFBF24
chrX.199-201.txt
chrX 199 201 AVJI24
I was experimenting with splitting a document with awk, but it does not do what I want:
awk -F, '{print > $1$2$3".txt"}' 1.txt
It gives me file names made of the whole row, and inside the files it is again the whole row, even though I only need columns 1, 2, 3 and 7.
>ls
1.txt
chr1 10 20 . + 11111 AACCCCHQ.txt
chr7 22 23 . . 21 KLJMWQKD.txt
chrX 199 201 . . KK AVJI24.txt
chr1 10 20 . . 00000 ACTGBACA.txt
chr1 18 40 . . 0 AA12KCCHQ.txt
chr7 22 23 . . 8 XJKFIRHFBF24.txt
>cat chr1\ \ \ \ 10\ \ 20\ .\ +\ 11111\ AACCCCHQ.txt
chr1 10 20 . + 11111 AACCCCHQ
I would appreciate if you can show me how to fix the file names and its content.
Take a look at this:
#!/bin/sh
INPUT="$1"
while read -r LINE; do
    GEN_LOC="$(echo "$LINE" | tr -s ' ' '.' | cut -d '.' -f 1,2,3)"
    echo "$LINE" | tr -s ' ' | cut -d ' ' -f 1,2,3,6,7 >> "${GEN_LOC}.txt"
done < "$INPUT"
This script takes an input file in the format you posted and reads it line by line. For each line it squeezes the whitespace into dots and cuts the result down to fields 1, 2 and 3 to build the filename (stored in $GEN_LOC). It then squeezes the whitespace again and appends fields 1, 2, 3, 6 and 7 of the line to a file named ${GEN_LOC}.txt. If multiple lines map to the same filename, they simply accumulate there. Note that this does not take previous runs into account: if you run it twice, it will keep appending to the existing files. Hope this helps!
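For comparison, a single-awk sketch of the same task (editorial addition; it assumes whitespace-separated input and that the sequence is always the last column). It builds the requested chr1.10-20.txt-style names directly and closes each file to avoid hitting the open-file limit:
awk '{
    fn = $1 "." $2 "-" $3 ".txt"   # e.g. chr1.10-20.txt
    print $1, $2, $3, $NF >> fn    # keep only the location and the sequence
    close(fn)
}' 1.txt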

How to remove n characters from a specific column using sed/awk/perl

I have the following tab delimited data:
chr1 3119713 3119728 MA05911Bach1Mafk 839 +
chr1 3119716 3119731 MA05011MAFNFE2 860 +
chr1 3120036 3120051 MA01502Nfe2l2 866 +
What I want to do is remove the first 7 characters from the 4th column,
resulting in:
chr1 3119713 3119728 Bach1Mafk 839 +
chr1 3119716 3119731 MAFNFE2 860 +
chr1 3120036 3120051 Nfe2l2 866 +
How can I do that?
Note the output needs to be also TAB separated.
I'm stuck with the following code, which removes characters from the start of the line rather than from the 4th column:
sed 's/^.\{7\}//' myfile.txt
awk 'BEGIN{FS=OFS="\t"} { $4 = substr($4, 8); print }' myfile.txt
perl -anE'$F[3] =~ s/.{7}//; say join "\t", @F' data.txt
or
perl -anE'substr $F[3],0,7,""; say join "\t", @F' data.txt
With sed
$ sed -E 's/^(([^\t]+\t){3}).{7}/\1/' myfile.txt
chr1 3119713 3119728 Bach1Mafk 839 +
chr1 3119716 3119731 MAFNFE2 860 +
chr1 3120036 3120051 Nfe2l2 866 +
-E use extended regular expressions, to avoid having to use \ for (){}. Some sed versions might need -r instead of -E
^(([^\t]+\t){3}) capture the first three columns, easy to change number of columns if needed
.{7} characters to delete from 4th column
\1 the captured columns
Use -i option for in-place editing
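For example, editing in place while keeping a backup (editorial example; note that \t inside a bracket expression is a GNU sed feature):
sed -E -i.bak 's/^(([^\t]+\t){3}).{7}/\1/' myfile.txt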
With perl you can use \K for a variable-length positive lookbehind:
perl -pe 's/^([^\t]+\t){3}\K.{7}//' myfile.txt

Process multiple files and append them in linux/unix

I have over 100 files with at least 5-8 columns (tab-separated) in each file. I need to extract the first three columns from each file, add a fourth column with some predefined text, and append them all into one file.
Let's say I have 3 files: file001.txt, file002.txt, file003.txt.
file001.txt:
chr1 1 2 15
chr2 3 4 17
file002.txt:
chr1 1 2 15
chr2 3 4 17
file003.txt:
chr1 1 2 15
chr2 3 4 17
combined_file.txt:
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
For simplicity I kept file contents same.
My script is as follows:
#!/bin/bash
for i in {1..3}; do
    j=$(printf '%03d' $i)
    awk 'BEGIN { OFS="\t"}; {print $1,$2,$3}' file${j}.txt | awk -v k="$j" 'BEGIN {print $0"\t$k”}' | cat >> combined_file.txt
done
But the script is giving the following errors:
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
Can some one help me to figure it out?
You don't need two different awk scripts. And you don't use $ to refer to variables in awk; that is used to refer to input fields (i.e. $k means the field whose number is stored in the variable k).
for i in {1..3}; do
    j=$(printf '%03d' $i)
    awk -v k="$j" -v OFS='\t' '{print $1, $2, $3, k}' file$j.txt
done > combined_file.txt
As pointed out in the comments, your problem is that you're trying to use smart quotes (”) as if they were double quotes. Once you fix that, though, you don't need a loop or any of that other complexity; all you need is:
$ awk 'BEGIN{FS=OFS="\t"} {$NF="f"ARGIND} 1' file*
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
The above uses GNU awk for ARGIND.
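If your awk lacks ARGIND (a GNU extension), a portable sketch can count the input files as they start:
awk 'BEGIN{FS=OFS="\t"} FNR==1{f++} {$NF="f" f} 1' file*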

Modifying text using awk

I am trying to modify text files using awk. There are three columns and I want to delete part of the text in the first column:
range=chr1 20802865 20802871
range=chr1 23866528 23866534
to
chr1 20802865 20802871
chr1 23866528 23866534
How can I do this?
I've tried awk '{ substr("range=chr*", 7) }' and awk '{sub(/[^[:space:]]*\\/, "")}1' but it deletes all the contents of the file.
Set the field separator as = and print the second field:
# With awk
$ awk -F= '{print $2}' file
chr1 20802865 20802871
chr1 23866528 23866534
# Or with cut
$ cut -d= -f2 file
chr1 20802865 20802871
chr1 23866528 23866534
# How about grep
$ grep -Po '(?<==).*' file
chr1 20802865 20802871
chr1 23866528 23866534
# Temp file needed
$ cut -d= -f2 file > tmp; mv tmp file
awk, cut and grep all require a temporary file if you want to store the changes back into the original file; a better solution is to use sed:
sed -i 's/range=//' file
This substitutes range= with nothing and the -i means the changes are done in-place so no need to handle the temporary files stuff as sed does it for you.
If you don't need to use awk, you can use sed, which I find a bit simpler. Hopefully you are familiar with regex operators, like ^ and ..
$ cat awkens
range=chr1 20802865 20802871
range=chr1 23866528 23866534
$ sed 's/^range=//' awkens
chr1 20802865 20802871
chr1 23866528 23866534
It looks like you are using tabs instead of spaces as delimiters in your file, so:
awk 'BEGIN{FS="[=\t]"; OFS="\t"} {print $2, $3, $4}' input_file
or
awk 'BEGIN{FS="[=\t]"; OFS="\t"} {$1=""; sub(/^\t/, ""); print}' input_file
