Replacing value in column with another value in txt file using awk - linux

I am new to Linux and awk scripting. I have a tab-delimited txt file like the following:
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 175 187 Sat 150 167
I would like to replace only the value in the last row, second column (175) with the last row's 5th column plus one (150+1), so that my final output looks like this:
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
I tried awk '$2=$5+1' file.txt but it changes all the values in the second column, which I don't want. I want to replace only 175 with 150+1. Kindly guide me.

The difficulty is that, unlike sed, awk does not tell us when we are working on the last row. Here is one work-around:
$ awk 'NR>1{print last} {last=$0} END{$0=last;$2=$5+1;print}' OFS='\t' file.txt
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
This works by keeping the previous line in the variable last. In more detail:
NR>1{print last}
For every row, except the first, print last.
last=$0
Update the value of last.
END{$0=last; $2=$5+1; print}
When we have reached the end of the file, restore the saved last line, update field 2, and print.
OFS='\t'
Set the field separator on output to a tab.
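For contrast, sed's $ address selects the last line directly, though sed cannot do the field arithmetic, so the replacement value has to be hardcoded (a sketch valid only for this sample data, assuming GNU sed):
$ sed '$ s/\t175\t/\t151\t/' file.txt   # $ addresses the last line; 151 is hardcoded
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167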
Alternate method
This approach reads the file twice, first to count the number of lines and the second time to change the last row. Consequently, this is less efficient but it might be easier to understand:
$ awk -v n="$(wc -l <file.txt)" 'NR==n{$2=$5+1} 1' OFS='\t' file.txt
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
Changing the first row instead
$ awk 'NR==1{$2=$5+1} 1' OFS='\t' file.txt
AAA 151 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 175 187 Sat 150 167
Changing the first row and the last row
$ awk 'NR==1{$2=$5+1} NR>1{print last} {last=$0} END{$0=last;if(NR>1)$2=$5+1;print}' OFS='\t' file.txt
AAA 151 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167

@John1024's answer is very informative.
awk has a builtin getline function for reading the next record.
It returns 1 on success, 0 at end of file, and -1 on an error, so a loop that reads until getline fails leaves us holding the last line:
awk '{
    line = $0
    # read ahead; print the buffered line while there is a next record
    while (getline > 0) {
        print line
        line = $0
    }
    # getline returned 0 (end of file): line holds the last record
    $0 = line; $2 = $5 + 1
    print
}' OFS='\t' file.txt

Related

Remove \r character from string pattern matched in AWK

I'm quite new to AWK so apologies for the basic question. I've found many references for removing Windows end-of-line characters from files, but none that first match a regular expression and then remove the Windows end-of-line characters from the match.
I have a file named infile.txt that contains a line like so:
...
DATAFILE data5v.dat
...
Within a shell script I want to capture the filename argument data5v.dat from this infile.txt and remove any carriage return character, \r, IF present. The carriage return may not always be present. So I have to match a word and then remove the \r subsequently.
I have tried the following but it is not working how I expect:
FILENAME=$(awk '/DATAFILE/ { print gsub("\r", "", $2) }' $INFILE)
Can I store the string returned from matching my regex /DATAFILE/ in a variable within my AWK statement to subsequently apply gsub?
File names can contain whitespace, including \rs, blanks, and tabs, so to do this robustly you can't remove all \rs with gsub(), and you can't rely on there being any one field, e.g. $2, that contains the whole file name.
If your input fields are tab-separated you need:
awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' file
or this otherwise:
awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' file
The above assumes your file names don't start with spaces and don't contain newlines.
To test any solution for robustness try:
printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '...' | cat -TEv
and make sure that the output looks like it does below:
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$
Note the blank, ^M (CR), and ^I (tab) in the middle of the file name, as they should be, but no ^M at the end of the line.
If your version of cat doesn't support -T or -E then do whatever you normally do to look for non-printing chars, e.g. od -c or vi the output.
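For example, with od -c on the same test line (a sketch; the leading numbers are octal byte offsets):
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' | od -c
0000000   f   o   o       \r  \t   b   a   r  \n
0000012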
With GNU awk, would you please try the following:
FILENAME=$(awk -v RS='\r?\n' '/DATAFILE/ {print $2}' "$INFILE")
echo "$FILENAME"
It sets the record separator RS to the regular expression \r?\n: a sequence of zero or one \r followed by \n. (A regexp RS is a GNU awk extension.)
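A quick way to verify that the \r is gone (a sketch using a fabricated test line in the question's format, and GNU awk):
$ printf 'DATAFILE\tdata5v.dat\r\n' | awk -v RS='\r?\n' '/DATAFILE/ {print $2}' | cat -Ev
data5v.dat$
With the default RS the output would end in ^M$ instead.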
As a side note, it is not recommended to use all-uppercase names for user variables in the shell, because they may conflict with reserved environment variable names.
Awk applies every rule in the script to each input line, so you can simply remove the carriage return first and then apply the other logic to the line. For example,
FILENAME=$(awk '/\r/ { sub(/\r/, "") }
/DATAFILE/ { print $2 }' "$INFILE")
See also When to wrap quotes around a shell variable.
Who says you need gnu-awk:
gecho -ne "test\r\nabc\n\rdef\n" | mawk NF=NF FS='\r' OFS='' | odview
0000000 1953719668 1667391754 1717920778 10
t e s t \n a b c \n d e f \n
164 145 163 164 012 141 142 143 012 144 145 146 012
t e s t nl a b c nl d e f nl
116 101 115 116 10 97 98 99 10 100 101 102 10
74 65 73 74 0a 61 62 63 0a 64 65 66 0a
0000015
gawk's POSIX mode (-P) is also fine with it:
gecho -ne "test\r\nabc\n\rdef\n" | gawk -Pe NF=NF FS='\r' OFS='' | odview
0000000 1953719668 1667391754 1717920778 10
t e s t \n a b c \n d e f \n
164 145 163 164 012 141 142 143 012 144 145 146 012
t e s t nl a b c nl d e f nl
116 101 115 116 10 97 98 99 10 100 101 102 10
74 65 73 74 0a 61 62 63 0a 64 65 66 0a
0000015

How to print the file name along with the data coming from it?

I have the below-mentioned files in path_1:
fb1.tril.cap
fb2.tril.cap
fb3.tril.cap
For example, the data in file fb1.tril.cap looks like this:
AT99565 150 500 (DEST 81)
AT99565 101 501 (DEST 883)
AT99565 152 502 (419)
For example, the data in file fb2.tril.cap looks like this:
AT99565 103 1503 (DEST 165)
AT99565 104 154 (DEST 199)
For example, the data in file fb3.tril.cap looks like this:
RT61446 80 863 (DEST 968)
RT20447 32 39 (DEST 570)
RT51224 73 74 (592)
I wrote the code shown below to print my required fields:
while read file_name
do
    cat ${file_name} | awk -F' ' '$4 == "(DEST" { print $1, $2, $3, $5 }' |
        awk -F')' '{print $1, $2, $3, $4}' | uniq >> output.csv
done < path_1
I'm getting the below output:
AT99565 150 500 81
AT99565 101 501 883
AT99565 103 1503 165
AT99565 104 154 199
RT61446 80 863 968
RT20447 32 39 570
But I want to print the file name along with the data, so I can see which file each row came from, like shown below:
AT99565 150 500 81 fb1.tril.cap
AT99565 101 501 883 fb1.tril.cap
AT99565 103 1503 165 fb2.tril.cap
AT99565 104 154 199 fb2.tril.cap
RT61446 80 863 968 fb3.tril.cap
RT20447 32 39 570 fb3.tril.cap
Can anyone help me complete this by printing the file name as well along with the data? Thanks in advance.
First, I am not able to test this, but the following might work for you. The original attempt had two problems: $file_name is not visible inside the awk script, and the field containing (DEST is $4, not $5. Reading the file directly lets awk's builtin FILENAME supply the name:
while read file_name
do
    # read the file directly (no cat) so awk's builtin FILENAME holds its name
    awk '$4 == "(DEST" { sub(/\)/, "", $5); print $1, $2, $3, $5, FILENAME }' \
        "${file_name}" | uniq >> output.csv
done < path_1
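By the way, the loop isn't strictly necessary: a single awk invocation can process all the files at once using the builtin FILENAME variable (a sketch, assuming the fb*.tril.cap names from the question; add uniq if you need it):
awk '$4 == "(DEST" { gsub(/[()]/, ""); print $1, $2, $3, $5, FILENAME }' fb*.tril.cap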
A sed one-liner:
sed '/(DEST/!d; s/[()]\|DEST//g; F' fb*.tril.cap | sed -n 'h;n;G;s/\n/ /gp'
How it works:
s/[()]\|DEST//g: Instead of parsing (DEST and such, just remove them; /(DEST/!d first deletes the lines the question doesn't want. What's left after that are the four desired items.
Then use sed's F command (GNU sed) to print the current input file name.
Since F prints the file name immediately, on its own line, a 2nd sed invocation is needed to join each pair of lines.
If the output spacing is too uneven, add a tr to squeeze the runs of spaces:
sed '/(DEST/!d; s/[()]\|DEST//g; F' fb*.tril.cap | sed -n 'h;n;G;s/\n/ /gp' |
tr -s ' '
Using a Perl one-liner:
> ls -l fb*tril*cap
-rw-r--r-- 1 aaaaa bbbbb 77 Dec 6 09:20 fb1.tril.cap
-rw-r--r-- 1 aaaaa bbbbb 58 Dec 6 09:21 fb2.tril.cap
-rw-r--r-- 1 aaaaa bbbbb 74 Dec 6 09:21 fb3.tril.cap
> perl -lane ' print "$_ $ARGV" if $F[3]=~/\(DEST/ and s/\(DEST //g and s/\)//g ' fb*tril*cap
AT99565 150 500 81 fb1.tril.cap
AT99565 101 501 883 fb1.tril.cap
AT99565 103 1503 165 fb2.tril.cap
AT99565 104 154 199 fb2.tril.cap
RT61446 80 863 968 fb3.tril.cap
RT20447 32 39 570 fb3.tril.cap
>

Appending the line even though there is no match with awk

I am trying to compare two files and append another column when a certain condition is satisfied.
file1.txt
1 101 111 . BCX 123
1 298 306 . CCC 234
1 299 305 . DDD 345
file2.txt
1 101 111 BCX P1#QQQ
1 299 305 DDD P2#WWW
The output should be:
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
What I can do is handle only the lines having a match:
awk 'NR==FNR{ a[$1,$2,$3,$4]=$5; next }{ s=SUBSEP; k=$1 s $2 s $3 s $5 }k in a{ print $0,a[k] }' file2.txt file1.txt
1 101 111 . BCX 123 P1#QQQ
1 299 305 . DDD 345 P2#WWW
But then, I am missing the second line in file1.
How can I still keep it even though there is no match with file2 regions?
If you want to print every line, you need your print command not to be limited by your condition.
awk '
NR==FNR {
    a[$1,$2,$3,$4]=$5; next
}
{
    s=SUBSEP; k=$1 s $2 s $3 s $5
}
k in a {
    $6=$6 ";" a[k]
}
1' file2.txt file1.txt
The 1 is shorthand that says "print every line". It's a condition (without command statements) that always evaluates "true".
The k in a condition simply replaces your existing 6th field with the concatenated one. If the condition is not met, the replacement doesn't happen, but we still print because of the 1.
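As a quick illustration of the idiom, the always-true pattern with the default action copies its input unchanged, like cat (shown here on the question's file1.txt):
$ awk 1 file1.txt
1 101 111 . BCX 123
1 298 306 . CCC 234
1 299 305 . DDD 345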
The following awk may help with the same:
awk 'FNR==NR{a[$1,$2,$3,$4]=$NF;next} (($1,$2,$3,$5) in a){print $0";"a[$1,$2,$3,$5];next} 1' file2.txt file1.txt
Output will be as follows.
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
another awk
$ awk ' {t=5-(NR==FNR); k=$1 FS $2 FS $3 FS $t}
NR==FNR {a[k]=$NF; next}
k in a {$0=$0 ";" a[k]}1' file2 file1
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
The last component of the key is either the 4th or the 5th field, depending on whether we are reading the first or the second input file; set it accordingly and use a single k variable in the script. Note that
t=5-(NR==FNR)
can be written more conventionally as
t=NR==FNR?4:5
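Comparisons in awk evaluate to 1 (true) or 0 (false), which is what makes the subtraction form work; a quick sketch:
$ awk 'BEGIN { print 5-(1==1), 5-(1==0) }'
4 5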

awk with zero output

I have four columns and I would like to do this:
INPUT=
429 0 10 0
287 115 89 64
0 629 0 10
542 0 7 0
15 853 0 12
208 587 5 4
435 203 12 0
604 411 27 3
0 232 0 227
471 395 5 5
802 706 15 15
1288 1135 11 23
1063 386 13 2
603 678 7 14
0 760 0 11
awk '{if (($2+$4)/($1+$3)<0.2 || ($1+$3)==0) print $0; else if (($1+$3)/($2+$4)<0.2 || ($2+$4)==0) print $0; else print $0}' INPUT
But I get this error message:
awk: cmd. line:1: (FILENAME=- FNR=3) fatal: division by zero attempted
Even though I have added the condition:
...|| ($1+$3)==0...
Can somebody explain to me what I am doing wrong?
Thank you so much.
PS: print $0 is just for illustration.
Move the "($1+$3) == 0" to be the first clause of the if statement. Awk will evalulate them in turn. Hence it still attempts the first clause of the if statement first, triggering the divide by zero attempt. If the first clause is true, it won't even attempt to evaulate the second one. So:-
awk '{if (($1+$3)==0 || ($2+$4)/($1+$3)<0.2) print $0; else if (($2+$4)==0 || ($1+$3)/($2+$4)<0.2) print $0; else print $0}' INPUT
You're already dividing by zero in your conditional statement ($1+$3 is 0 on the ninth line of your input); that's where the error comes from. You should change the ordering in your conditional statement: first verify that $1+$3 != 0, and only then use it as a divisor in the next condition.
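A minimal demonstration of the short-circuiting (assuming GNU awk, whose error message matches the one in the question):
$ echo '0 5 0 5' | awk '{ if ($1+$3 == 0 || ($2+$4)/($1+$3) < 0.2) print "guarded" }'
guarded
$ echo '0 5 0 5' | awk '{ if (($2+$4)/($1+$3) < 0.2 || $1+$3 == 0) print "unguarded" }'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted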

Problems combining awk scripts

I am trying to use awk to parse a tab-delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have the smaller total sum of the other 4 columns. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything useful yet, but the first part removes the second duplicate, and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with something like [remove the first or second duplicate depending on which count total is larger].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command, which keeps, for each gene id, the duplicate row whose four counts sum to the larger total, while preserving the original row order:
awk 'NR == 1 {
    print;                    # pass the header row through untouched
    next
} {
    s = $2+$3+$4+$5           # total of the four count columns
} s >= sum[$1] {
    sum[$1] = s;              # remember the largest sum seen for this gene id
    if (!($1 in rows))
        a[++n] = $1;          # record first-appearance order of the ids
    rows[$1] = $0             # keep the current winning row for this id
} END {
    for(i=1; i<=n; i++)
        print rows[a[i]]      # print the survivors in original input order
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204
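As a spot check, the two duplicated ids resolve to the rows with the larger totals (a sketch, assuming the output above was saved to a file named parsed.txt):
$ awk '$1=="lmo0330" || $1=="lmo0506"' parsed.txt   # parsed.txt is a hypothetical name
lmo0330 1 1 1 3
lmo0506 151 232 60 204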
