Process multiple files and append them in linux/unix

I have over 100 files, each tab-separated with at least 5-8 columns. I need to extract the first three columns from each file, add a fourth column with some predefined text, and append everything into one file.
Let's say I have 3 files: file001.txt, file002.txt, file003.txt.
file001.txt:
chr1 1 2 15
chr2 3 4 17
file002.txt:
chr1 1 2 15
chr2 3 4 17
file003.txt:
chr1 1 2 15
chr2 3 4 17
combined_file.txt:
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
For simplicity I kept the file contents the same.
My script is as follows:
#!/bin/bash
for i in {1..3}; do
j=$(printf '%03d' $i)
awk 'BEGIN { OFS="\t"}; {print $1,$2,$3}' file${j}.txt | awk -v k="$j" 'BEGIN {print $0"\t$k”}' | cat >> combined_file.txt
done
But the script is giving the following errors:
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
Can someone help me figure it out?

You don't need two different awk scripts. And you don't use $ to refer to variables in awk; it's used to refer to input fields (i.e. $k means the field whose number is stored in the variable k).
for i in {1..3}; do
j=$(printf '%03d' $i)
awk -v k="$j" -v OFS='\t' '{print $1, $2, $3, k}' file$j.txt
done > combined_file.txt
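Note that the single redirection after done also replaces the >> inside the loop: combined_file.txt is opened once and truncated at the start of each run, so results from a previous run can't linger.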

As pointed out in the comments, your problem is that you're using typographic quotes (”) as if they were ASCII double quotes ("). Once you fix that, though, you don't need a loop or any of that other complexity; all you need is:
$ awk 'BEGIN{FS=OFS="\t"} {$NF="f"ARGIND} 1' file*
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
The above uses GNU awk for ARGIND.
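If your awk is not GNU awk, a minimal sketch of the same idea that counts the input files by hand (assuming every input file is non-empty, since the counter only advances on a file's first record):
$ awk 'BEGIN{FS=OFS="\t"} FNR==1{n++} {$NF="f" n} 1' file*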

Related

Expand each line of text file according to their corresponding numbers on linux

Can I transform the first format into the second one with just basic shell processing, awk, or sed on linux?
This is a toy example:
This is the kind of text file I have: col2 and col3 describe a range, closed on the left and open on the right, and the last column is a value,
chr1 0 2 0
chr1 2 6 1.5
chr2 0 3 0
chr2 3 10 2.1
I want to transform it to describe each position individually:
chr1 0 0
chr1 1 0
chr1 2 1.5
chr1 3 1.5
chr1 4 1.5
chr1 5 1.5
chr2 0 0
chr2 1 0
chr2 2 0
chr2 3 2.1
...
chr2 9 2.1
This can be done with awk:
awk '{for(i=$2;i<$3;i++)print $1,i,$4}' file
$2 and $3 are the start and end of the range, respectively, and the loop prints one output line for each position in the half-open range.
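If your real files are tab-separated and the output should be too (an assumption the toy example doesn't settle), set the separators explicitly:
awk 'BEGIN{FS=OFS="\t"} {for(i=$2;i<$3;i++)print $1,i,$4}' file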
Another option is to use set and map operations with bedops, bedmap, and cut:
$ bedops --chop 1 foo.bed | bedmap --faster --echo --echo-map-id --delim "\t" - foo.bed | cut -f1,2,4 > answer.txt
This might offer some flexibility if other types of divisions and signal mapping are needed.

Appending the line even though there is no match with awk

I am trying to compare two files and append another column where a certain condition is satisfied.
file1.txt
1 101 111 . BCX 123
1 298 306 . CCC 234
1 299 305 . DDD 345
file2.txt
1 101 111 BCX P1#QQQ
1 299 305 DDD P2#WWW
The output should be:
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
What I can do is restrict this to only the lines that have a match:
awk 'NR==FNR{ a[$1,$2,$3,$4]=$5; next }{ s=SUBSEP; k=$1 s $2 s $3 s $5 }k in a{ print $0,a[k] }' file2.txt file1.txt
1 101 111 . BCX 123 P1#QQQ
1 299 305 . DDD 345 P2#WWW
But then, I am missing the second line in file1.
How can I still keep it even though there is no match with file2 regions?
If you want to print every line, you need your print command not to be limited by your condition.
awk '
    NR==FNR {
        a[$1,$2,$3,$4]=$5; next
    }
    {
        s=SUBSEP; k=$1 s $2 s $3 s $5
    }
    k in a {
        $6=$6 ";" a[k]
    }
    1' file2.txt file1.txt
The 1 is shorthand that says "print every line". It's a condition (without command statements) that always evaluates "true".
The k in a condition simply replaces your existing 6th field with the concatenated one. If the condition is not met, the replacement doesn't happen, but we still print because of the 1.
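With the sample files above, this prints:
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW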
The following awk may also help:
awk 'FNR==NR{a[$1,$2,$3,$4]=$NF;next} (($1,$2,$3,$5) in a){print $0";"a[$1,$2,$3,$5];next} 1' file2.txt file1.txt
The output will be as follows:
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
Another awk:
$ awk '{t=5-(NR==FNR); k=$1 FS $2 FS $3 FS $t}
       NR==FNR {a[k]=$NF; next}
       k in a  {$0=$0 ";" a[k]} 1' file2 file1
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
The last component of the key is either the 4th or the 5th field, depending on whether we are reading file2 (given first on the command line, where the ID is the 4th field) or file1 (where it is the 5th); set it accordingly and use a single k variable in the script. Note that
t=5-(NR==FNR)
can be written more conventionally as
t=(NR==FNR)?4:5
since NR==FNR evaluates to 1 while reading the first file and 0 afterwards.

How to remove n characters from a specific column using sed/awk/perl

I have the following tab-delimited data:
chr1 3119713 3119728 MA05911Bach1Mafk 839 +
chr1 3119716 3119731 MA05011MAFNFE2 860 +
chr1 3120036 3120051 MA01502Nfe2l2 866 +
What I want to do is remove the first 7 characters from the 4th column.
Resulting in
chr1 3119713 3119728 Bach1Mafk 839 +
chr1 3119716 3119731 MAFNFE2 860 +
chr1 3120036 3120051 Nfe2l2 866 +
How can I do that?
Note the output also needs to be TAB-separated.
I'm stuck with the following code, which removes characters from the start of the line rather than from the 4th column:
sed 's/^.\{7\}//' myfile.txt
awk 'BEGIN{FS=OFS="\t"} { $4 = substr($4, 8); print }' myfile.txt
Setting FS and OFS to a tab keeps the output TAB-separated, as required.
perl -anE'$F[3] =~ s/.{7}//; say join "\t", @F' data.txt
or
perl -anE'substr $F[3],0,7,""; say join "\t", @F' data.txt
With sed
$ sed -E 's/^(([^\t]+\t){3}).{7}/\1/' myfile.txt
chr1 3119713 3119728 Bach1Mafk 839 +
chr1 3119716 3119731 MAFNFE2 860 +
chr1 3120036 3120051 Nfe2l2 866 +
-E uses extended regular expressions, to avoid having to escape (){} with \. Some sed versions might need -r instead of -E.
^(([^\t]+\t){3}) captures the first three columns; it's easy to change the number of columns if needed.
.{7} is the characters to delete from the 4th column.
\1 puts back the captured columns.
Use the -i option for in-place editing.
With perl you can use \K for a variable-length positive lookbehind:
perl -pe 's/^([^\t]+\t){3}\K.{7}//' myfile.txt
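perl also supports in-place editing, so the same one-liner can rewrite the file directly:
perl -i -pe 's/^([^\t]+\t){3}\K.{7}//' myfile.txt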

Faster solution to compare files in bash

file1:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
and file2:
NR_024540 11
I need to find each match of file2's first column in file1 and print the whole file1 line plus the second column of file2.
So the output is:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
My solution is very slow in bash:
#!/bin/bash
while read line; do
c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2
I would prefer any FASTER bash or awk solution. The output can be modified, but it must keep all the information (the column order can differ).
EDIT:
Right now the fastest version looks like this, based on @chepner's suggestion:
#!/bin/bash
while read -r c d; do
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output
In a single awk command:
awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
A more readable version as a multi-liner:
FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array 'map'
    for (i in map) {
        # If the element from the hash-map matches the line as a
        # regex, append the mapped value as a new last field and
        # print the line
        if ($0 ~ i) {
            $(NF+1)=map[i]
            print
            next
        }
    }
}
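Saved to a file (the name join.awk here is arbitrary), the multi-line version runs the same way:
awk -f join.awk file2 file1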
Another solution using join and sed, under the assumption that file1 and file2 are sorted:
join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output
If the output order doesn't matter, to use awk
awk 'FNR==NR{d[$1]=$2; next}
{split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1
you get:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
Try this:
cat file2
NR_024540 11
NR_024541 12
cat file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14361 14829 NR_024542_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
Performance (tested with 55,000 records):
time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1
real 0m0.16s
user 0m0.14s
sys 0m0.01s
You are starting a lot of external programs unnecessarily. Let read split the incoming line from file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.
while read -r c d; do
awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output
If the searched string is always the same length (length("NR_024540")==9):
awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1
Explained:
NR==FNR {                         # process file2
    a[$1]=$2                      # hash the record, using $1 as the key
    next                          # skip to the next record
}
(i=substr($4,1,9)) && (i in a) {  # store the first 9 bytes of $4 in i and search for it in a
    print $0, a[i]                # output if found
}
awk -F '[[:blank:]_]+' '
    FNR==NR { a[$2]=$3; next }
    { if ($5 in a) $0 = $0 " " a[$5] }
    7
' file2 file1
Comments:
Using _ as an extra field separator makes the IDs easy to compare in both files (using only the number part).
The 7 is just for fun; any non-zero pattern value prints the line.
The fields are never reassigned (no $(NF+1)), so the original format is kept and only the referenced number is appended.
A smaller one-liner (optimized for code size) uses the assignment $0=$0" "a[$5] itself as the pattern: its result is a non-empty string, which counts as true, so the line is printed (this assumes the mandatory lines in file1 are non-empty). If the separators are only spaces, you can replace [[:blank:]] with a space character:
awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1
No awk or sed needed. This assumes file2 is only one line:
n=$(cut -f 2 file2); while read -r x; do echo "$x $n"; done < file1

Difference between column sum in file 1 and column sum in file 2

I have this input (file 1 and file 2 shown side by side; file 1 has three columns, file 2 has two):
file 1 file 2
A 10 222 77.11 11
B 20 2222 1.215 22
C 30 22222 12.021 33
D 40 222222 145.00 44
The output I need is (11+22+33+44) - (10+20+30+40) = 110 - 100 = 10.
Thank you in advance for your help.
Here you go:
paste file1.txt file2.txt | awk '{s1+=$5; s2+=$2} END {print s1-s2}'
Or better yet (it was clever of @falsetru's answer to do the summing with a single variable):
paste file1.txt file2.txt | awk '{sum+=$5-$2} END {print sum}'
If you want to work with column N in file1.txt and column M in file2.txt (N and M standing in for real column numbers), this might be "easier", but less efficient:
paste <(awk '{print $N}' file1.txt) <(awk '{print $M}' file2.txt) | awk '{sum+=$2-$1} END {print sum}'
It's easier in the sense that you don't have to count the column's position within the pasted line, but less efficient because of the extra awk sub-processes.
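For instance, with the sample data the value columns are column 2 in each file (assuming file1.txt and file2.txt hold only the data rows, with no header line):
paste <(awk '{print $2}' file1.txt) <(awk '{print $2}' file2.txt) | awk '{sum+=$2-$1} END {print sum}'
This prints 10.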
Using awk:
$ cat test.txt
file 1 file 2
A 10 222 77.11 11
B 20 2222 1.215 22
C 30 22222 12.021 33
D 40 222222 145.00 44
$ tail -n +2 test.txt | awk '{s += $5 - $2} END {print s}'
10
