Need help formatting a line using sed - linux

The following is a line that I want to split into tab-separated parts.
>VFG000676(gb|AAD32411)_(lef)_anthrax_toxin_lethal_factor_precursor_[Anthrax_toxin_(VF0142)]_[Bacillus_anthracis_str._Sterne]
The output that I want is:
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
I used this command:
grep '>' x.fa | sed 's/^>\(.*\) (gi.*) \(.*\) \[\(.*\)\].*/\1\t\2\t\3/' | sed 's/ /_/g' > output.tsv
but the output is not what I want.
UPDATE: I finally fixed the issue by using the following code:
grep '>' VFs_no_block.fa | sed 's/^>\(.*\)\((.*)\) \((.*)\) \(.*\) \(\[.*(.*)]\) \(\[.*]\).*/\1\t\2\t\3\t\4\t\5\t\6/' | sed 's/ /_/g' > VFDB_annotation_reference.tsv

Change OFS="\\t" to OFS="\t" if you really wanted literal tabs:
$ cat tst.awk
BEGIN { OFS="\\t" }
{
    c = 0
    # grab the next [...] block, (...) block, or run of other characters
    while ( match($0, /\[[^][]+\]|\([^)(]+\)|[^][)(]+/) ) {
        tgt = substr($0, RSTART, RLENGTH)
        gsub(/^_+|_+$/, "", tgt)            # trim leading/trailing underscores
        if (tgt != "") {
            printf "%s%s", (c++ ? OFS : ""), tgt
        }
        $0 = substr($0, RSTART+RLENGTH)     # consume the matched piece
    }
    print
}
$ awk -f tst.awk file
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
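To reproduce the question's pipeline with this script, one way (assuming the x.fa and output.tsv names from the first attempt above, and after switching OFS to "\t" in the BEGIN block if real tabs are wanted) would be:
grep '>' x.fa | awk -f tst.awk > output.tsv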

Difficulty creating .txt files from a loop in bash

I have this data:
cat >data1.txt <<'EOF'
2020-01-27-06-00;/dev/hd1;100;/
2020-01-27-12-00;/dev/hd1;100;/
2020-01-27-18-00;/dev/hd1;100;/
2020-01-27-06-00;/dev/hd2;200;/usr
2020-01-27-12-00;/dev/hd2;200;/usr
2020-01-27-18-00;/dev/hd2;200;/usr
EOF
cat >data2.txt <<'EOF'
2020-02-27-06-00;/dev/hd1;120;/
2020-02-27-12-00;/dev/hd1;120;/
2020-02-27-18-00;/dev/hd1;120;/
2020-02-27-06-00;/dev/hd2;230;/usr
2020-02-27-12-00;/dev/hd2;230;/usr
2020-02-27-18-00;/dev/hd2;230;/usr
EOF
cat >data3.txt <<'EOF'
2020-03-27-06-00;/dev/hd1;130;/
2020-03-27-12-00;/dev/hd1;130;/
2020-03-27-18-00;/dev/hd1;130;/
2020-03-27-06-00;/dev/hd2;240;/usr
2020-03-27-12-00;/dev/hd2;240;/usr
2020-03-27-18-00;/dev/hd2;240;/usr
EOF
I would like to create a .txt file for each filesystem (so hd1.txt, hd2.txt, hd3.txt and hd4.txt) and put in each .txt file the sum of the values for that FS from each dataX.txt. I have some difficulty explaining what I want in English, so here is an example of the desired result.
Expected content for the output file hd1.txt:
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
Expected content for the file hd2.txt:
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
The implementation I've currently tried:
for i in $(cat *.txt | awk -F';' '{print $2}' | cut -d '/' -f3| uniq)
do
    cat *.txt | grep -w $i | awk -F';' -v date="$(cat *.txt | awk -F';' '{print $1}' | cut -d'-' -f-2 | uniq )" '{sum+=$3} END {print date";"$2";"sum}' >> $i
done
But it doesn't work.
Can you show me how to do that?
Because the format seems to be so regular, you can delimit the input with multiple separators and parse it easily in awk:
awk -v FS='[-;/]' '
prev != $9 {                 # the size column changes whenever the device (or file) changes
    if (length(output)) {
        print output >> fileoutput
    }
    prev = $9
    sum = 0
}
{
    # $1-$2 = year-month, $7/$8 = dev/hdX, $9 = size, $11 = mount point
    sum += $9
    output = sprintf("%s-%s;/%s/%s;%d;/%s", $1, $2, $7, $8, sum, $11)
    fileoutput = $8 ".txt"
}
END {
    print output >> fileoutput
}
' *.txt
Tested on repl, this generates:
+ cat hd1.txt
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
+ cat hd2.txt
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
Alternatively, you could use -v FS=';' and split() to break up the first and second columns to extract the year, the month, and the hdX name.
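A rough sketch of that alternative (assuming the same dataX.txt layout as above; the array names are illustrative, and for (k in ...) does not guarantee output order):
awk -F';' '
{
    split($1, d, "-")                # d[1]=year, d[2]=month
    split($2, p, "/")                # p[3]="hd1", "hd2", ...
    key = d[1] "-" d[2] ";" $2       # e.g. "2020-01;/dev/hd1"
    sum[key] += $3
    path[key] = $4
    out[key]  = p[3] ".txt"
}
END {
    for (k in sum)
        print k ";" sum[k] ";" path[k] > out[k]
}' *.txt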
If you want a bash solution, I suggest inverting the loops: first iterate over files, then over the identifiers in the second column.
for file in *.txt; do
    prev=
    output=
    while IFS=';' read -r date dev num path; do
        hd=$(basename "$dev")
        if [[ "$hd" != "${prev:-}" ]]; then
            if ((${#output})); then
                printf "%s\n" "$output" >> "$fileoutput"
            fi
            sum=0
            prev="$hd"
        fi
        sum=$((sum + num))
        output=$(
            printf "%s;%s;%d;%s" \
                "$(cut -d'-' -f1-2 <<<"$date")" \
                "$dev" "$sum" "$path"
        )
        fileoutput="${hd}.txt"
    done < "$file"
    printf "%s\n" "$output" >> "$fileoutput"
done
You could also translate the awk almost 1:1 to bash by using IFS='-;/' in the while read loop, as sketched below.
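A minimal, untested sketch of that translation (the placeholder variable names for the unused fields are mine):
for file in *.txt; do
    prev=
    output=
    while IFS='-;/' read -r y m _d _h _min _ _dev hd size _ mnt; do
        if [[ "$size" != "$prev" ]]; then
            [[ -n "$output" ]] && printf '%s\n' "$output" >> "$fileoutput"
            prev=$size
            sum=0
        fi
        (( sum += size ))
        output="$y-$m;/dev/$hd;$sum;/$mnt"
        fileoutput="$hd.txt"
    done < "$file"
    printf '%s\n' "$output" >> "$fileoutput"
done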

Translate Chinese to urlencoding in awk

I have a .txt file, and each line contains Chinese text. I want to convert the Chinese to URL encoding.
How can I do that?
txt.file
http://wiki.com/ 中文
http://wiki.com/ 中国
target.file
http://wiki.com/%E4%B8%AD%E6%96%87
http://wiki.com/%E4%B8%AD%E5%9B%BD
I found a shell script way to approach it like this:
echo '中文' | tr -d '\n' | xxd -plain | sed 's/\(..\)/%\1/g' | tr '[a-z]' '[A-Z]'
So I wanted to embed it in awk like this, but it failed:
awk -F'\t' '{
a=system("echo '"$2"'| tr -d '\n' | xxd -plain | \
sed 's/\(..\)/%\1/g' | tr '[a-z]' '[A-Z]");
print $1a
}' txt.file
I also tried another way, writing an external function and calling it from awk, like this, but it failed again:
zh2url()
{
    echo $1 | tr -d '\n' | xxd -plain | sed 's/\(..\)/%\1/g' | tr '[a-z]' '[A-Z]'
}
export -f zh2url
awk -F'\t' "{a=system(\"zh2url $2\");print $1a}" txt.file
Please implement it with awk command because I actually have another thing need to handle in awk at the same time.
With GNU awk for co-processes, etc.:
$ cat tst.awk
function xlate(old,   cmd, new) {        # cmd and new are local variables
    cmd = "xxd -plain"
    printf "%s", old |& cmd              # feed the field to the co-process
    close(cmd, "to")                     # close its stdin so xxd can finish
    if ( (cmd |& getline rslt) > 0 ) {
        new = toupper(gensub(/../, "%&", "g", rslt))
    }
    close(cmd)
    return new
}
BEGIN { FS="\t" }
{ print $1 xlate($2) }
$ awk -f tst.awk txt.file
http://wiki.com/%E4%B8%AD%E6%96%87
http://wiki.com/%E4%B8%AD%E5%9B%BD
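If you'd rather not rely on a co-process (or on GNU awk at all), here is a rough sketch that percent-encodes the second tab-separated field entirely inside awk; it assumes the file is UTF-8 and that awk sees bytes, so run it under the C locale, e.g. LC_ALL=C awk -f pct.awk txt.file (pct.awk is just an illustrative name):
BEGIN {
    FS = "\t"
    for (i = 1; i < 256; i++)            # lookup table: one-byte string -> byte value
        ord[sprintf("%c", i)] = i
}
{
    enc = ""
    for (i = 1; i <= length($2); i++)    # encode every byte of field 2 (fine for all-Chinese fields)
        enc = enc sprintf("%%%02X", ord[substr($2, i, 1)])
    print $1 enc
}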

tr command in awk to change the column values

I am using the tr command inside awk in my shell script to mask data. In the example file below, only the first line is affected when I use tr inside awk. When I use the same thing in a while loop and call the awk command inside it, it works fine, but it takes a very long time to complete. My requirement now is to mask several columns (for example $1, $5, $9) in the same file (file.txt), the change should affect the whole file and not just the first line, and I want the masking to be as fast as possible. Please advise.
cat file.txt
========
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
OUTPUT
awk -F"," -v OFS="," '{ "echo \""$1"\" | tr \"a-c\" \"e-f\" | tr \"0-5\" \"6-9\"" | getline $1 }7' file.txt
effffhs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
Expected output
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek,lskjsjshsh
effffhs,degehek
effffhs,degehek,lskjsjshsh
The code you found runs an external shell command pipeline on each input line. Like you discovered, that's an awfully inefficient way to do what you are asking. Awk isn't really an ideal choice for this task at all. Maybe try Perl.
perl -F, -lane 'for (0, 4, 8) { $F[$_] =~ tr/a-c/e-f/; $F[$_] =~ tr/0-5/6-9/ } print join(",", @F)' file
The -F, option is like in Awk, but Perl doesn't automatically split the input line. With -a it does, splitting into an array named @F, and with -n it loops over all input lines. The -l is a convenience that removes the newline from each input line and adds one back when you print.
Notice how the columns are numbered from zero, not from one as in Awk; so the indices in the for loop access the first, fifth, and ninth elements of @F.
You forgot to close() the command after every invocation. Here's the correct way to write it:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    cmd = "echo '" $1 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
    $1 = ( (cmd | getline line) > 0 ? line : $1 )
    close(cmd)
    print
}
$ awk -f tst.awk file
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek,lskjsjshsh
effffhs,degehek
effffhs,degehek,lskjsjshsh
You also didn't protect yourself from getline failures, hence the extra complexity around the getline call, see http://awk.info/?tip/getline.
Given your comments, this shows how to modify multiple fields (1, 3, and 5 in this case) simultaneously:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    cmd = "echo '" $0 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
    new = ( (cmd | getline line) > 0 ? line : $1 )
    close(cmd)
    split(new, tmp)
    for (i in tmp) {
        if (i ~ /^(1|3|5)$/) {
            $i = tmp[i]
        }
    }
    print
}
$ cat file
abc,abc,abc,abc,abc
abc,abc,abc,abc,abc,abc,abc
abc,abc,abc,abc,abc,abc
abc,abc,abc,abc
$ awk -f tst.awk file
eff,abc,eff,abc,eff
eff,abc,eff,abc,eff,abc,abc
eff,abc,eff,abc,eff,abc
eff,abc,eff,abc
To handle quotes in the input data:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    gsub(/'/, SUBSEP)
    cmd = "echo '" $0 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
    new = ( (cmd | getline line) > 0 ? line : $1 )
    close(cmd)
    split(new, tmp)
    for (i in tmp) {
        if (i ~ /^(1|3|5)$/) {
            $i = tmp[i]
        }
    }
    gsub(SUBSEP, "'")
    print
}
$ cat file
a'c,abc,a"c,abc,abc
abc,a'c,abc,a"c,abc,abc,abc
abc,abc,abc,abc,abc,abc
abc,abc,abc,abc
$ awk -f tst.awk file
e'f,abc,e"f,abc,eff
eff,a'c,eff,a"c,eff,abc,abc
eff,abc,eff,abc,eff,abc
eff,abc,eff,abc
If you don't have any particular control char that's guaranteed not to appear in your input, you can create a non-existent string to use instead of SUBSEP above by using the technique described at the end of https://stackoverflow.com/a/29237745/1745001
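For illustration, here is a rough two-pass sketch of that idea (not necessarily the exact technique from the linked answer): scan the input once to record every byte that occurs, then pick an unused control byte as the placeholder for the second pass. The file is read twice, e.g. awk -f mask.awk file file (mask.awk is an illustrative name, and it assumes at least one control byte is unused):
NR == FNR {                         # pass 1: remember every byte seen
    for (i = 1; i <= length($0); i++)
        seen[substr($0, i, 1)] = 1
    next
}
FNR == 1 {                          # start of pass 2: pick an unused control byte
    for (c = 1; c <= 31; c++) {
        if (c == 9 || c == 10 || c == 13) continue   # skip tab, LF, CR
        ch = sprintf("%c", c)
        if (!(ch in seen)) { sep = ch; break }
    }
}
{
    gsub(/'/, sep)
    # ... the same masking logic as above goes here ...
    gsub(sep, "'")
    print
}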

How to extract words between two characters in linux?

I have the following stored in a file named tmp.txt
user/config/jars/content-config-factory-3.2.0.0.jar
I need to store this word in a variable:
$variable=content-config-factory
I have written the following:
while read line
do
    var=$(echo $line | awk 'BEGIN{FS="\/"; OFS=" "} {print $NF}' )
    var=$(echo $var | awk 'BEGIN{FS="-"; OFS=" "} {print $(1)}' )
    echo $var
done < tmp.txt
This returns the result "content" instead of "content-config-factory".
Can anyone please tell me how to extract a word between two characters from a string efficiently?
An awk solution would be:
awk -F/ '{sub("-[^-]+$", "", $NF); print $NF}'
Test
$ echo "user/config/jars/content-config-factory-3.2.0.0.jar" | awk -F/ '{sub("-[^-]+$", "", $NF); print $NF}'
content-config-factory
You can also try this way to get your expected result:
variable=$(sed 's:.*/\(.*\)-.*:\1:' FileName)
echo $variable
Output:
content-config-factory
You could use grep:
grep -oP '(?<=/)[^/]*(?=-\d+\.)' file
Example:
$ var=$(echo 'user/config/jars/content-config-factory-3.2.0.0.jar' | grep -oP '(?<=/)[^/]*(?=-\d+\.)')
$ echo "$var"
content-config-factory

Strip whitespace in AWK

I have tried to remove all the whitespace but it's not working.
awk -F, -v OFS=", " 'NR==1 {
print $0,"FILENAME,DATE_LOADED,TEST";
next
}
{
line=$0
key=echo "${11//[[:space:]]/}" "${12//[[:space:]]/}" "${57//[[:space:]]/}"
key | getline
k=$0
cmd="md5 <<<"k
cmd | getline
md5sum=$0
print line, ENVIRON["FILE"], ENVIRON["ISODATE"], md5sum
}' $FILE > $NAME"_ready.csv"
If I try this, it throws errors. I have tried all the options and am really at a loss here.
key=echo $11$12$57 | tr -d
If you want an awk-only solution, this one-liner will remove all the spaces from the file:
awk '{gsub(" +","");print $0}' test.text
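If the real goal is the one in the question (stripping whitespace only from specific columns such as $11, $12 and $57 before building a key), a sketch along these lines may be closer; the column list and the $FILE variable are taken from the question:
awk -F',' -v OFS=',' '{
    n = split("11 12 57", cols, " ")          # columns to clean
    for (j = 1; j <= n; j++)
        if (cols[j] <= NF)                    # skip columns the line does not have
            gsub(/[[:space:]]+/, "", $(cols[j]))
    print
}' "$FILE"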
