What's the function of `'` in `printf "%x" "'你"`? - linux

I want to get the hexadecimal value of 你. Someone told me to use printf "%x" "'你", but I don't understand what the ' does in printf "%x" "'你". Why put ' before 你?

From the bash manual:
Arguments to non-string format specifiers are treated as C constants, except that a leading plus or minus sign is allowed, and if the leading character is a single or double quote, the value is the ASCII value of the following character.
%x is a numeric specifier, not a string one, so this section applies. The documentation is a bit wrong (or outdated) when it speaks about ASCII values, but it's correct in spirit: an argument of '你 evaluates to the numerical value of the Unicode code point of 你 (without the quote, printf would reject the argument as an invalid number, since 你 isn't a number). That code point value is then formatted in hexadecimal by %x.
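As a quick check of the behaviour described above, here is a minimal sketch. The helper name codepoint_hex is made up for illustration, and the result for 你 assumes bash running in a UTF-8 locale; in other locales or shells the value may differ.

```bash
# Hypothetical helper: print the numeric value of the first character
# of $1 in hexadecimal, using printf's leading-quote rule.
codepoint_hex() {
  printf '%x\n' "'$1"
}

codepoint_hex A    # 41
codepoint_hex 你   # 4f60 in a UTF-8 locale
```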

Related

Printing a string containing utf-8 byte sequences on perl

I'm new to perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í
I tried using a regex to replace each unescaped $ with \x
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape \, $ and )
Within perl it just prints the literal string.
My workaround is feeding it into bash's printf and it succeeds... unless there's a literal "\x" in the string
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions I consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in perl?
The following substitution seems to work for your input:
s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;
Capture \$ or \\, if matched, replace them with $ or \. Otherwise, capture $.. and convert to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8.
use Encode qw(decode);  # decode() comes from the core Encode module
my $chars = decode('UTF-8', $bytes);

Select sequences in a fasta file with more than 300 aa and "C" occurs at least 4 times

I have a fasta file which contains protein sequences. I'd like to select sequences with more than 300 amino acids and Cysteine (C) amino acid appears more than 4 times.
I've used this command to select sequences with more than 300 aa:
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
Some sequence example:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*
I do not know bioawk, but I assume it is identical to awk with some initial parsing and constant definitions.
I would proceed as follows. Assuming you want to find the sequences that are longer than 300 and contain the letter C more than 4 times, you could do:
bioawk -c fastx '
(length($seq) > 300) && (gsub("C","C",$seq)>4) {
print ">"$name; print $seq
}' 72hDOWN-fasta.fasta
but this assumes that $seq holds the full sequence.
The idea behind it is the following. The gsub command performs substitutions in a string and returns the number of substitutions it made. Hence, if we substitute every "C" with "C", the string is left unchanged, but we get back the number of "C"s it contains.
From the POSIX standard IEEE Std 1003.1-2017:
gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like
the ed utility global substitute) in $0 or in the in argument,
when specified.
sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in
and return the number of substitutions. An <ampersand> ( &
) appearing in the string repl shall be replaced by the string from in
that matches the ERE. An <ampersand> preceded with a
<backslash> shall be interpreted as the literal
<ampersand> character. An occurrence of two consecutive
<backslash> characters shall be interpreted as just a single
literal <backslash> character. Any other occurrence of a
<backslash> (for example, preceding any other character) shall
be treated as a literal <backslash> character. Note that if repl
is a string literal (the lexical token STRING; see Grammar), the
handling of the <ampersand> character occurs after any lexical
processing, including any lexical <backslash>-escape sequence
processing. If in is specified and it is not an lvalue (see
Expressions in awk), the behavior is undefined. If in is omitted, awk
shall use the current record ($0) in its place.
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
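The counting trick can be tried on a toy sequence. This is a minimal sketch using plain awk rather than bioawk, on a made-up input string; the helper name count_c is hypothetical, and it assumes gsub behaves as the POSIX text quoted above describes.

```bash
# Count occurrences of "C" by replacing each one with itself;
# gsub returns the number of substitutions it made on $0.
count_c() {
  printf '%s\n' "$1" | awk '{ print gsub(/C/, "C") }'
}

count_c ACCGCC   # 4
```

In the bioawk command the same count is compared against the threshold in the pattern part, so only records that pass both tests are printed.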

How to remove 0 in the power of a scientific notation in printf

In a shell script, I want 23.343 to be output as 2.334E+1. Using printf "%.3E\n" 23.343, I get 2.334E+01. Is there a trick to make printf drop that 0? And is there a way to make printf output the same number as 0.233E+02?
With printf itself, you don't:
e, E The double argument is rounded and converted in the style [-]d.ddde±dd where there is one digit before the decimal-point character
... The exponent always contains at least two digits; if the value is zero, the exponent is 00.
But in e.g. Bash, you can remove the zero from the exponent manually:
$ x=$(printf "%.3E\n" 23.343)
$ echo "${x/E+0/E+}"
2.334E+1
(I missed the other part of the question.)
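The substitution above only handles a positive exponent. A minimal sketch that also covers negative exponents might look like this; the function name strip_exp_zero is hypothetical, and it assumes bash and a locale that uses . as the decimal point.

```bash
# Hypothetical helper: format $1 with %.3E, then drop a single leading
# zero from the exponent (E+01 -> E+1, E-03 -> E-3).
strip_exp_zero() {
  local x
  x=$(printf '%.3E' "$1")
  x=${x/E+0/E+}   # positive exponent below 10
  x=${x/E-0/E-}   # negative exponent above -10
  printf '%s\n' "$x"
}

strip_exp_zero 23.343    # 2.334E+1
strip_exp_zero 0.00123   # 1.230E-3
```

Two-digit exponents such as E+10 are left alone, because neither pattern matches them.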

Bash: How to check if the last three string characters equals '***'

I know how to do it for one character:
[ "${filename: -1}" == "*" ]
Is it possible to do it for more?
Why stop at -1?
The manual pages of bash give the answer:
${parameter:offset}
${parameter:offset:length}
If offset evaluates to a number less than zero, the value is used as
an offset in characters from the end of the value of parameter. If
length evaluates to a number less than zero, it is interpreted as an
offset in characters from the end of the value of parameter rather
than a number of characters, and the expansion is the characters
between offset and that result. Note that a negative offset must be
separated from the colon by at least one space to avoid being confused
with the ‘:-’ expansion.
Therefore
[ "${filename: -3}" == "***" ]
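Put together, the check could be sketched as follows. The function name and the sample file names are made up, and the [[ ]] pattern form at the end is an alternative added for comparison, not something taken from the manual excerpt above.

```bash
# Hypothetical helper: succeed if $1 ends in three literal asterisks.
ends_with_three_stars() {
  # ${1: -3} is the last three characters; the space before - is required.
  [ "${1: -3}" = "***" ]
}

ends_with_three_stars 'report***' && echo "match"
ends_with_three_stars 'report*'   || echo "no match"

# Alternative with pattern matching: the quoted part is taken literally.
[[ 'report***' == *'***' ]] && echo "match"
```

Quoting both sides matters here, since an unquoted * would otherwise be treated as a glob.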

Bash: ${string:$i:1} what does this mean?

This is the script. It reverses a string entered by the user:
#!/bin/bash
read -p "Enter string:" string
len=${#string}
for (( i=$len-1; i>=0; i-- ))
do
# "${string:$i:1}" extracts a single character from string.
reverse="$reverse${string:$i:1}"
done
echo "$reverse"
I don't understand the following part of the script. What is this? Looks like some kind of extended variable interpolation.
${string:$i:1}
In bash, something like ${string:3:1} means: take the substring starting at position 3 (0-based, so the 4th character) with a length of 1 character.
for example:
string=abc
then ${string:0:1} equals a and ${string:2:1} equals c.
The script uses the value of the variable $i as the offset, so it just takes the character at position $i.
It's substring expansion.
from the man pages:
${parameter:offset:length}
Substring Expansion. Expands to up to length characters of parameter starting at the character specified by offset. If length is omitted, expands to the
substring of parameter starting at the character specified by offset. length and offset are arithmetic expressions (see ARITHMETIC EVALUATION below). If
offset evaluates to a number less than zero, the value is used as an offset from the end of the value of parameter. Arithmetic expressions starting with a -
must be separated by whitespace from the preceding : to be distinguished from the Use Default Values expansion. If length evaluates to a number less than
zero, and parameter is not @ and not an indexed or associative array, it is interpreted as an offset from the end of the value of parameter rather than a
number of characters, and the expansion is the characters between the two offsets. If parameter is @, the result is length positional parameters beginning
at offset. If parameter is an indexed array name subscripted by @ or *, the result is the length members of the array beginning with ${parameter[offset]}.
A negative offset is taken relative to one greater than the maximum index of the specified array. Substring expansion applied to an associative array produces
undefined results. Note that a negative offset must be separated from the colon by at least one space to avoid being confused with the :- expansion.
Substring indexing is zero-based unless the positional parameters are used, in which case the indexing starts at 1 by default. If offset is 0, and the positional
parameters are used, $0 is prefixed to the list.
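A few concrete expansions may make the manual text easier to follow. This is a minimal sketch with a made-up value; the variable and function names are illustrative, and the negative length form assumes bash 4.2 or later.

```bash
string=abcdef

first=${string:0:1}     # a     (offset 0, length 1)
middle=${string:2:3}    # cde   (offset 2, length 3)
tail=${string: -2}      # ef    (negative offset counts from the end)
inner=${string:1:-1}    # bcde  (negative length is an offset from the end)

# The loop from the question, wrapped in a function.
reverse_string() {
  local s=$1 out= i
  for (( i=${#s}-1; i>=0; i-- )); do
    out+=${s:i:1}       # one character at offset i
  done
  printf '%s\n' "$out"
}

reverse_string "$string"   # fedcba
```

Because offset and length are arithmetic expressions, i can be written without a leading $ inside the expansion.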

Resources