Bash: how to hex-encode a UTF-8 string?

I have a variable holding a string of UTF-8 text. I want to get a string like \xAA\xBB\xCC or, it seems, encoded as \Uxxxxxxxx or some such. How can I do this?

I can do it with Python 3.7:
def stou(x):
    s = ''
    for i in x:
        s = s + '\\U' + hex(ord(i))[2:]
    return s
But I'd like to solve it with native bash methods and/or standard, almost-native Linux utilities like base64 or find. I'm trying to create a file server, and in the usual format I have problems with space characters, so I'm trying to find another method to store the names.

Using perl:
$ echo -ne "12345 =\n= me + Дварфы" | perl -0777 -CS -nE 'say map { sprintf "\\U%x", $_ } unpack "U*"'
\U31\U32\U33\U34\U35\U20\U3d\Ua\U3d\U20\U6d\U65\U20\U2b\U20\U414\U432\U430\U440\U444\U44b
Basically, this reads all of its standard input as one UTF-8 encoded chunk, converts each codepoint to a number, and prints the numbers in base 16 with a leading \U before each one.
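If perl is not available, the \xAA form from the question can be produced with coreutils alone. A minimal sketch (note this emits one \x escape per UTF-8 byte, not per codepoint, so it is not the \U form):

```shell
# Hex-encode each UTF-8 byte of a string as \xAA using only od, tr and sed.
s='me + Дварфы'
printf '%s' "$s" \
  | od -An -v -tx1 \
  | tr -s ' ' '\n' \
  | sed '/^$/d; s/^/\\x/' \
  | tr -d '\n'
echo
```

Because it is byte-oriented, the result is safe to store and can be turned back into the original string with printf '%b'.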

Related

how to "decdump" a string in bash?

I need to convert a string into a sequence of decimal ASCII codes using a bash command.
example:
for the string 'abc' the desired output would be 979899, where a=97, b=98 and c=99 in decimal ASCII.
I was able to achieve this with hex ASCII codes using xxd.
printf '%s' 'abc' | xxd -p
which gives me the result: 616263
where a=61, b=62 and c=63 in hexadecimal ASCII.
Is there an equivalent to xxd that gives the result in decimal instead of hex?
If you don't mind the results being merged into one line, please try the following:
echo -n "abc" | xxd -p -c 1 |
while read -r line; do
    echo -n "$(( 16#$line ))"
done
Result:
979899
str=abc
printf '%s' $str | od -An -tu1
The -An gets rid of the address column, which od normally outputs, and the -tu1 prints each input byte as an unsigned decimal integer. Note that it assumes one character is one byte, so it won't work with Unicode, JIS or the like.
If you really don't want spaces in the result, pipe it further into tr -d ' '.
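Putting the two notes together, a one-liner sketch of the od-based decimal dump with all separators stripped (byte-oriented, so ASCII input only):

```shell
# Decimal ASCII dump of a string, with od's spacing and newlines removed.
str=abc
printf '%s' "$str" | od -An -tu1 | tr -d ' \n'
echo
```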
Unicode Solution
What makes this problem annoying is that you have to delimit the characters when converting from hex to decimal: you can't do a simple char-to-hex-to-dec conversion over the whole string, because some characters' hex representations are longer than others.
Both of these solutions are Unicode-aware and use each character's codepoint. In both, a newline is chosen as the separator for clarity; change it to '' for no separator.
Bash
sep='\n'
charAry=($(printf 'abc🎶' | grep -o .))
for i in "${charAry[@]}"; do
    printf "%d$sep" "'$i"
done && echo
97
98
99
127926
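The "'$i" step in the loop above relies on a POSIX printf rule: a numeric argument beginning with a quote yields the codepoint of the character that follows it. A quick sketch:

```shell
# A leading single or double quote makes printf %d print the character's
# codepoint (POSIX printf numeric-argument rule).
printf '%d\n' "'A"      # prints 65
printf '%d\n' "'€"      # multibyte characters work in a UTF-8 locale
```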
Python (in Bash)
Here, we use a list comprehension to convert every character to a decimal number (ord), join the results into a string, and print it. sys.stdin.read() lets us use Python inline and take input from a pipe. If you replace input with your intended string, this solution is cross-platform.
printf '%s' 'abc🎶' | python -c "
import sys
input = sys.stdin.read()
sep = '\n'
print(sep.join([str(ord(i)) for i in input]))"
97
98
99
127926
Edit: If all you care about is using hex regardless of encoding, use @user1934428's answer

Printing a string containing utf-8 byte sequences on perl

I'm new to perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í
I tried using a regex to replace each unescaped $ with \x
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape "\", "$" and ")"
Within perl it just prints the literal string.
My workaround is feeding it into bash's printf and it succeeds... unless there's a literal "\x" in the string
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions i consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in perl?
The following substitution seems to work for your input:
s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;
Capture \$ or \\; if matched, replace with $ or \. Otherwise, capture $.. and convert it to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8.
$chars = decode('UTF-8', $bytes);

Perl Program to Print Unicode From Hex Value

I am starting up with Perl and confused on how to render unicode characters given a hex string variable.
#!/usr/bin/perl
use warnings;
foreach my $i (0..10000) {
my $hex = sprintf("%X", $i);
print("unicode of $i is \x{$hex}\n");
}
print("\x{2620}\n");
print("\x{BEEF}\n");
Gives me the warning: Illegal hexadecimal digit '$' ignored at perl.pl line 9.
and no value prints for \x{$hex}
Both chr($num) and pack('W', $num) produce a string consisting of the single character with the specified value, just like "\x{XXXX}" does.
As such, you can use
print("unicode of $i is ".chr(hex($hex))."\n");
or just
print("unicode of $i is ".chr($i)."\n");
Note that your program makes no sense without
use open ':std', ':encoding(UTF-8)';
Yup. You can't do that: no variable interpolation is allowed inside \x{...}. You can use chr() to get that character, though.
Randal's answer is correct. For more info, you might want to read perluniintro.
From there, you can find, for example:
At run-time you can use:
use charnames ();
my $hebrew_alef_from_name
= charnames::string_vianame("HEBREW LETTER ALEF");
my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0");

Interpolating ASCII with utf8 gives error in open()

As stated in the title, the problem seems to be that I have one string read from an ASCII file, and another that is utf8; when I use interpolation to form a string, and then pass that string to open(), it seems to get munged, and I get an error. Here is a minimal example:
#!/usr/bin/perl
use open ":encoding(utf8)";
use strict;
open (FILE,"<u");
my $p = <FILE>;
$p =~ s/\s+$//;
close FILE;
print "p=",$p,"\n";
if ($p eq "cat") {print "yes\n"} else {print "no\n"}
my $file = "påminnelser"; # note the circle over the "a"
my $x = "$p <$file |";
print "x=$x\n";
open (FILE, $x);
close FILE;
It seems to make a difference that the string $p is read from the external file u, which looks like this:
cat
My code is utf8, while file u is ASCII, according to the 'file' utility:
---- rintintin a $ file u
u: ASCII text
---- rintintin a $ file bug.pl
bug.pl: Perl script, UTF-8 Unicode text executable
The result looks like this:
---- rintintin a $ ./bug.pl
p=cat
yes
x=cat <påminnelser |
sh: 1: cannot open påminnelser: No such file
The filename has been munged somewhere inside the call to open(). Although $p eq "cat" is true, if I simply set $p="cat" in the code rather than reading it from the file, the error goes away. I would guess that this is because my source code file is utf8.
Can anyone explain what is happening here and how to fix it?
[EDIT] As described in my comment on Dmitri Chubarov's answer, it turns out that my minimal example actually didn't correctly represent the bug in my original program. This question describes the actual bug: Should perl's File::Glob always be post-filtered through utf8::decode?
You should add
use utf8;
pragma to your script in order for the Perl source text to be interpreted as UTF-8.
By default Perl source is interpreted as a stream of bytes, therefore the
my $file = "påminnelser"
is turned into a string of bytes that is interpreted according to the default encoding.

Unicode-aware strings(1) program

Does anybody have a code sample for a unicode-aware strings program? Programming language doesn't matter. I want something that essentially does the same thing as the unix command "strings", but that also functions on unicode text (UTF-16 or UTF-8), pulling runs of english-language characters and punctuation. (I only care about english characters, not any other alphabet).
Thanks!
Do you just want to use it, or do you for some reason insist on the code?
On my Debian system, it seems the strings command can do this out of the box. See the excerpt from the manpage:
--encoding=encoding
Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859,
etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
for finding wide character strings.
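A quick way to see the -e option in action on UTF-16LE data (the sample file is created on the spot; strings defaults to runs of at least 4 characters):

```shell
# strings defaults to 7-bit ASCII runs; -e l finds 16-bit little-endian ones.
printf 'h\0e\0l\0l\0o\0' > sample.bin   # "hello" as UTF-16LE, no BOM
strings -e l sample.bin
rm -f sample.bin
```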
Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.
byte b;
int i = 0;
while (!endOfInput()) {
    b = getNextByte();
LoopBegin:
    if (!isEnglish(b)) {
        if (i > 0)       // report successful match of length i
            i = 0;
        continue;
    }
    if (endOfInput()) break;
    if ((b = getNextByte()) != 0)
        goto LoopBegin;
    i++; // found another character
}
This should work for little-endian.
I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings (UTF-8 is variable-width).
Remember that by default, characters outside ASCII need extra strings options. This includes almost all non-English strings.
Nevertheless, the "-e S" (single 8-bit chars) output does include UTF-8 characters.
I wrote a very simple (opinionated) Perl script that applies a
"strings -e S ... | iconv ..." pipeline to the input files.
I believe it is easy to tune for specific restrictions.
Usage: utf8strings [options] file*
#!/usr/bin/perl -s
our ($all, $windows, $enc); ## use -all to ignore the "3-letter word" restriction
use strict;
use utf8::all;
$enc = "ms-ansi" if $windows;
$enc = "utf8" unless $enc;   ## default encoding=utf8
my $iconv = "iconv -c -f $enc -t utf8 |";
for (@ARGV) { s/(.*)/strings -e S '$1'| $iconv/; }
my $word = qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i; # adapt this to your case
while (<>) {
    # next if /regular expressions for common garbage/;
    print if ($all or /$word/);
}
In some situations, this approach produces some extra garbage.
