Translate Chinese to urlencoding in awk - linux

I have a .txt file in which each line contains Chinese text. I want to convert the Chinese to URL encoding. How can I do that?
txt.file
http://wiki.com/ 中文
http://wiki.com/ 中国
target.file
http://wiki.com/%E4%B8%AD%E6%96%87
http://wiki.com/%E4%B8%AD%E5%9B%BD
I found a shell script way to approach it like this:
echo '中文' | tr -d '\n' | xxd -plain | sed 's/\(..\)/%\1/g' | tr '[a-z]' '[A-Z]'
So I tried to embed it in awk like this, but it failed:
awk -F'\t' '{
a=system("echo '"$2"'| tr -d '\n' | xxd -plain | \
sed 's/\(..\)/%\1/g' | tr '[a-z]' '[A-Z]");
print $1a
}' txt.file
I also tried another way, writing an external function and calling it from awk, but that failed too:
zh2url()
{
echo $1 | tr -d '\n' | xxd -plain | sed 's/\(..\)/%\1/g' | tr '[a-z]' '[A-Z]'
}
export -f zh2url
awk -F'\t' "{a=system(\"zh2url $2\");print $1a}" txt.file
Please implement it with awk, because I have other processing to handle in awk at the same time.
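
As an aside on why the attempts above do not print the encoding: awk's system() returns only the command's exit status, not its output. To capture output inside awk you open the command as a pipe and read it with getline. A minimal sketch, using the POSIX od utility in place of xxd so it runs anywhere:

```shell
# system() only returns the exit status; to capture a command's output
# in awk, open the command as a pipe and read it with getline.
awk 'BEGIN {
    cmd = "printf %s hi | od -An -tx1"   # hex-dump the bytes of "hi"
    cmd | getline out                    # read the first output line
    close(cmd)
    gsub(/[ \t]/, "", out)               # od pads with whitespace
    print out                            # prints 6869
}'
```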

With GNU awk for co-processes, etc.:
$ cat tst.awk
function xlate(old, cmd, new) {
    cmd = "xxd -plain"
    printf "%s", old |& cmd
    close(cmd, "to")
    if ( (cmd |& getline rslt) > 0 ) {
        new = toupper(gensub(/../, "%&", "g", rslt))
    }
    close(cmd)
    return new
}
BEGIN { FS="\t" }
{ print $1 xlate($2) }
$ awk -f tst.awk txt.file
http://wiki.com/%E4%B8%AD%E6%96%87
http://wiki.com/%E4%B8%AD%E5%9B%BD
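
If spawning xxd once per line is a concern, the encoding can also be done entirely inside awk. This is a sketch of an alternative, not part of the answer above: it assumes you run under a bytewise locale (LC_ALL=C) so that length() and substr() walk bytes rather than multibyte characters:

```shell
# Build a byte -> ordinal lookup table once, then percent-encode every
# byte of the second field. LC_ALL=C makes awk treat input as raw bytes.
LC_ALL=C awk '
BEGIN {
    FS = "\t"
    for (i = 1; i < 256; i++)      # skip 0: NUL cannot occur in a line
        ord[sprintf("%c", i)] = i
}
{
    enc = ""
    for (i = 1; i <= length($2); i++)
        enc = enc sprintf("%%%02X", ord[substr($2, i, 1)])
    print $1 enc
}' txt.file
```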

Related

How can I add a new line at the end of the output? (Linux help)

I am using this code
cut -c1 | tr -d '\n'
to print the first letter of every line. The problem is, I need a newline at the end, but only at the end, after the word "caroline" (these are the contents of the testfile).
Cannot use AWK, basename, grep, egrep, fgrep or rgrep
Use echo, which supplies the trailing newline itself:
echo "$( cut -c1 | tr -d '\n' )"
cut -c1 | tr -d '\n'; echo
Try using the awk utility, something like the following:
awk -F\| '$1 > 0 { print substr($1,1,1)}' testfile.txt
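
A quick check of the cut/echo variant above on a throwaway file (the three-line testfile here is made up for the demo):

```shell
# Only one newline is emitted, at the very end of the combined output.
printf 'cat\narc\nrow\n' > testfile.txt
cut -c1 testfile.txt | tr -d '\n'; echo
```

This prints car followed by a single newline.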

Optimize Multiline Pipe to Awk in Bash Function

I have this function:
field_get() {
while read data; do
echo $data | awk -F ';' -v number=$1 '{print $number}'
done
}
which can be used like this:
cat filename | field_get 1
in order to extract the first field from some piped in input. This works but I'm iterating on each line and it's slower than expected.
Does anybody know how to avoid this iteration?
I tried to use:
stdin=$(cat)
echo $stdin | awk -F ';' -v number=$1 '{print $number}'
but the line breaks get lost and it treats all the stdin as a single line.
IMPORTANT: I need to pipe in the input because in general I am NOT just cat-ing a file. Assume the input is multiline; that is the actual problem. I know I can use "awk something filename", but that won't help me.
Just lose the while. Awk is a while loop in itself:
field_get() {
awk -F ';' -v number=$1 '{print $number}'
}
$ echo 1\;2\;3 | field_get 2
2
Update:
Not sure what you mean by your comment on multiline pipe and file but:
$ cat foo
1;2
3;4
$ cat foo | field_get 1
1
3
Use either stdin or a file:
field_get() {
awk -F ';' -v number="$1" '{print $number}' "${2:-/dev/stdin}"
}
Test Results:
$ field_get() {
awk -F ';' -v number="$1" '{print $number}' "${2:-/dev/stdin}"
}
$ echo '1;2;3;4' >testfile
$ field_get 3 testfile
3
$ echo '1;2;3;4' | field_get 2
2
No need to use a while loop feeding awk; awk itself can read the input file. Here $1 is the argument passed to your script.
cat script.ksh
awk -v field="$1" '{print $field}' Input_file
./script.ksh 1
This is a job for the cut command:
cut -d';' -f1 somefile
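
A quick demonstration of the cut form on two sample lines:

```shell
# -d';' sets the field delimiter, -f1 keeps only the first field of
# each input line; cut processes every line, so no loop is needed.
printf '1;2\n3;4\n' | cut -d';' -f1
```

This prints 1 and 3 on separate lines.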

Need help on formatting a line using sed

The following is the line that I want to split into tab-separated parts.
>VFG000676(gb|AAD32411)_(lef)_anthrax_toxin_lethal_factor_precursor_[Anthrax_toxin_(VF0142)]_[Bacillus_anthracis_str._Sterne]
the output that I want is
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
I used this command
grep '>' x.fa | sed 's/^>\(.*\) (gi.*) \(.*\) \[\(.*\)\].*/\1\t\2\t\3/' | sed 's/ /_/g' > output.tsv
but the output is not what I want.
UPDATE: I finally fixed the issue with the following code:
grep '>' VFs_no_block.fa | sed 's/^>\(.*\)\((.*)\) \((.*)\) \(.*\) \(\[.*(.*)]\) \(\[.*]\).*/\1\t\2\t\3\t\4\t\5\t\6/' | sed 's/ /_/g' > VFDB_annotation_reference.tsv
Change OFS="\\t" to OFS="\t" if you really wanted literal tabs:
$ cat tst.awk
BEGIN { OFS="\\t" }
{
    c = 0
    while ( match($0, /\[[^][]+\]|\([^)(]+\)|[^][)(]+/) ) {
        tgt = substr($0, RSTART, RLENGTH)
        gsub(/^_+|_+$/, "", tgt)
        if (tgt != "") {
            printf "%s%s", (c++ ? OFS : ""), tgt
        }
        $0 = substr($0, RSTART+RLENGTH)
    }
    print ""
}
$ awk -f tst.awk file
>VFG000676\t(gb|AAD32411)\t(lef)\tanthrax_toxin_lethal_factor_precursor\t[Anthrax_toxin_(VF0142)]\t[Bacillus_anthracis_str._Sterne]
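
To see what the match() alternation in the script above does, here is a tiny demo on a made-up string: it peels off, left to right, either a [...] block, a (...) block, or a run of other characters:

```shell
# Each iteration of the loop consumes exactly one token from the front
# of the line: a bracketed block, a parenthesized block, or plain text.
echo '>A(b)_c_[d]' | awk '
{
    while (match($0, /\[[^][]+\]|\([^)(]+\)|[^][)(]+/)) {
        print substr($0, RSTART, RLENGTH)     # the token just matched
        $0 = substr($0, RSTART + RLENGTH)     # drop it and continue
    }
}'
```

This prints >A, (b), _c_, and [d] on separate lines.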

Sed to replace tab with Control-A in shell script

I am working with data, trying to convert a tab-delimited file to control-A delimited in a shell script. On the command line, I would enter control-A by typing the sequence control-v, control-a.
Here is my code in a .sh:
echo "tab delimited query here" | sed 's/ /'\001'/g' > $OUTPUT_FILE
However, this doesn't work. I've also tried the following:
'\x001'
x1
'\x1'
You can use tr here:
echo $'tab\tdelimited\tquery\there' | tr '\t' $'\x01'
To demonstrate that it has been replaced:
echo $'tab\tdelimited\tquery\there' | tr '\t' $'\x01' | cat -vte
Output:
tab^Adelimited^Aquery^Ahere$
sed alternative:
echo $'tab\tdelimited\tquery\there' | sed $'s/\t/\x01/g'
awk alternative:
echo $'tab\tdelimited\tquery\there' | awk -F '\t' -v OFS='\x01' '{$1=$1} 1'
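
If the shell does not support $'...' quoting (plain POSIX sh), the control character can be generated with printf and spliced into sed. This is a portability sketch, not part of the answers above:

```shell
# Build the literal characters once, then substitute tab -> \001.
# cat -vte makes the result visible: \001 shows as ^A, line ends as $.
TAB=$(printf '\t')
CTRL_A=$(printf '\001')
printf 'tab\tdelimited\n' | sed "s/$TAB/$CTRL_A/g" | cat -vte
```

This prints tab^Adelimited$.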

Using awk to modify output

I have a command that is giving me the output:
/home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08 counted-file.xml.20131003-083611
I need the output to be:
ea66574ff0daad6d0406f67e4571ee08 counted-file.xml
The closest I got was:
$ echo /home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08 counted-file.xml.20131003-083611 | awk '{ printf "%s", $1 }; END { printf "\n" }'
/home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08
I'm not familiar with awk, but I believe this is the command I want to use. Anyone have any ideas?
Or just a sed oneliner:
echo /home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08 counted-file.xml.20131003-083611 \
| sed -E 's/.*:(.*\.xml).*/\1/'
$ echo "/home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08 counted-file.xml.20131003-083611" |
cut -d: -f2 |
cut -d. -f1-2
ea66574ff0daad6d0406f67e4571ee08 counted-file.xml
Note that this relies on the dot . being present as in counted-file.xml.
$ awk -F[:.] -v OFS="." '{print $2,$3}' <<< "/home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08 counted-file.xml.20131003-083611"
ea66574ff0daad6d0406f67e4571ee08 counted-file.xml
Not sure if this is OK for you:
sed 's/^.*:\(.*\)\.[^.]*$/\1/'
with your example:
kent$ echo "/home/konnor/md5sums:ea66574ff0daad6d0406f67e4571ee08 counted-file.xml.20131003-083611"|sed 's/^.*:\(.*\)\.[^.]*$/\1/'
ea66574ff0daad6d0406f67e4571ee08 counted-file.xml
this grep line works too:
grep -Po ':\K.*(?=\..*?$)'
