I'm new to perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í.
I tried using a regex to replace each unescaped $ with \x:
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape "\", "$" and ")"
Within Perl it just prints the literal string.
My workaround is feeding it into bash's printf, which succeeds... unless there's a literal "\x" in the string:
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions I consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in perl?
The following substitution seems to work for your input:
s/\\([\$\\)])|\$(..)/$2 ? chr hex $2 : $1/ge;
Capture \$, \\ or \); if matched, replace the pair with the bare $, \ or ). Otherwise, capture $.. and convert it to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8.
$chars = decode('UTF-8', $bytes);
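Putting the pieces together, a minimal sketch using the sample string from the question (the escaped string is hard-coded here; in your script it would come out of the .msf file):

$ perl -MEncode=decode -E '
    my $folder = q{aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\\\x08};   # sample from the question
    $folder =~ s/\\([\$\\)])|\$(..)/$2 ? chr hex $2 : $1/ge;
    binmode STDOUT, ":encoding(UTF-8)";
    say decode("UTF-8", $folder);'
aaa$AAñb☺í\x08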
Related
I have a variable holding a string of UTF-8 text. I want to get a string like \xAA\xBB\xCC, or perhaps encoded as \Uxxxxxxxx or some such. How can I do this?
I can do it with Python 3 (.7):
def stou(x):
    s = ''
    for i in x:
        s = s + '\\U' + hex(ord(i))[2:]
    return s
But I'd like to solve it with native bash methods and/or standard, near-ubiquitous Linux utilities like base64 or find. I'm building a file server, and in their usual form file names with space characters cause me problems, so I'm looking for another representation to store them.
Using perl:
$ echo -ne "12345 =\n= me + Дварфы" | perl -0777 -CS -nE 'say map { sprintf "\\U%x", $_ } unpack "U*"'
\U31\U32\U33\U34\U35\U20\U3d\Ua\U3d\U20\U6d\U65\U20\U2b\U20\U414\U432\U430\U440\U444\U44b
Basically, reads all of its standard input as one UTF-8 encoded chunk, converts each codepoint to a number, and prints them out in base 16 with a leading \U before each one.
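If you ever need to go the other way (say, to recover the original file name), the reverse is a similar one-liner. A sketch, assuming the input is exactly the back-to-back \U form produced above (the greedy hex match relies on each escape ending at the next backslash or the end of the string):

$ printf '%s\n' '\U31\U32\U33\U34\U35\U20\U414\U432\U430\U440\U444\U44b' |
  perl -CO -pE 's/\\U([0-9a-fA-F]+)/chr hex $1/ge'
12345 Дварфы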
I have a function analyze_text : string -> unit to analyze a text. As a result (most of the time), ./analyze aText launches the function with that argument.
let usage_msg = "./analyze [options] TEXT" in
Arg.parse options analyze_text usage_msg;
However, I realize that when the text contains special characters like ", ' or !, it is not read correctly. Does anyone know a way to quote the text properly and pass it to the function?
The shell treats many characters specially. You can suppress their special meaning by enclosing your input in single quotes.
$ echo 'a*$b"$c"!d'
a*$b"$c"!d
If your input itself contains a single quote, you'll have to enclose that quote in double quotes and concatenate it with the rest of the input, which stays enclosed in single quotes.
e.g. You want to print: He$l!o Wo$r'ld
You can do it like:
$ echo 'He$l!o Wo$r'"'"'ld'
He$l!o Wo$r'ld
In your case, the culprit is not your OCaml code but the behavior of your shell, e.g., bash. When you enter text at the bash command line prompt, many characters have a special meaning, e.g., ", ', $, \ and so on. To suppress the special meaning of a character in bash you can either escape it with a backslash, e.g., \$, \\, \', or delimit the text with single quotes (but you will still need to escape single quotes inside the single-quotes-delimited text).
The general approach is that when your input is actual text or data, not a sequence of commands and options, you should read the input from a file or from the standard input channel. This also helps when the size of the input is large, as most shells limit (sometimes significantly) the total number of characters that can be passed through the command line. In vanilla OCaml, you can read a whole file into a single string using the following simple code:
let read_file filename =
  let buf = Buffer.create 4096 in
  let chan = open_in filename in
  begin
    try while true do Buffer.add_channel buf chan 4096 done
    with End_of_file -> ()
  end;
  Buffer.contents buf
Then you don't need to deal with any special characters, as your input will be the file and no shell in between will do any interpretations. You can even analyze binary data with that.
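A hypothetical session, assuming analyze is adapted to take a file name and read it with read_file (analyze and essay.txt are placeholder names):

$ printf '%s\n' 'He said: "do not $PANIC!"' > essay.txt   # single quotes keep everything literal
$ ./analyze essay.txt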
I need to convert a string into a sequence of decimal ascii code using bash command.
example:
for the string 'abc' the desired output would be 979899 where a=97, b=98 and c=99 in ascii decimal code.
I was able to achieve this with ascii hex code using xxd.
printf '%s' 'abc' | xxd -p
which gives me the result: 616263
where a=61, b=62 and c=63 in ascii hexadecimal code.
Is there an equivalent to xxd that gives the result in ascii decimal code instead of ascii hex code?
If you don't mind the results being merged into one line, please try the following:
echo -n "abc" | xxd -p -c 1 |
while read -r line; do
echo -n "$(( 16#$line ))"
done
Result:
979899
str=abc
printf '%s' $str | od -An -tu1
The -An gets rid of the address line, which od normally outputs, and the -tu1 treats each input byte as unsigned integer. Note that it assumes that one character is one byte, so it won't work with Unicode, JIS or the like.
If you really don't want spaces in the result, pipe it further into tr -d ' '.
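If perl is acceptable, the same merged output can be produced in one go (like od, this works byte-wise, so the same Unicode caveat applies):

$ printf '%s' abc | perl -0777 -ne 'print map { ord } split //'
979899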
Unicode Solution
What makes this problem annoying is that you have to process characters one at a time when converting from hex to decimal: you can't do a simple char-to-hex-to-dec conversion over the whole string, because some characters' hex representations are longer than others.
Both of these solutions are compatible with Unicode and use a character's code point. In both solutions, a newline is chosen as the separator for clarity; change this to '' for no separator.
Bash
sep='\n'
charAry=($(printf 'abc🎶' | grep -o .))
for i in "${charAry[@]}"; do
printf "%d$sep" "'$i"
done && echo
97
98
99
127926
Python (in Bash)
Here, we use a list comprehension to convert every character to a decimal number (ord), join the result into a string, and print it. sys.stdin.read() lets us use Python inline and take input from a pipe. If you replace input with your intended string, this solution is cross-platform.
printf '%s' 'abc🎶' | python -c "
import sys
input = sys.stdin.read()
sep = '\n'
print(sep.join([str(ord(i)) for i in input]))"
97
98
99
127926
Edit: If all you care about is using hex regardless of encoding, use @user1934428's answer
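For comparison, the same per-code-point output from a perl one-liner (assuming UTF-8 input):

$ printf '%s' 'abc🎶' | perl -CS -0777 -nE 'say for map { ord } split //'
97
98
99
127926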
bash allows $'string' expansion. My man bash says:
Words of the form $'string' are treated specially.
The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard.
Backslash escape sequences, if present, are decoded as follows:
\a alert (bell)
\b backspace
\e, \E an escape character
\f form feed
\n new line
\r carriage return
\t horizontal tab
\v vertical tab
\\ backslash
\' single quote
\" double quote
\nnn the eight-bit character whose value is the octal value nnn (one to three digits)
\xHH the eight-bit character whose value is the hexadecimal value HH (one or two hex digits)
\cx a control-x character
The expanded result is single-quoted, as if the dollar sign had not been present.
But why does bash not convert $'\0' and $'\x0' into a null character?
Is it documented? Is there a reason? (Is it a feature or a limitation or even a bug?)
$ hexdump -c <<< _$'\0'$'\x1\x2\x3\x4_'
0000000 _ 001 002 003 004 _ \n
0000007
echo gives the expected result:
> hexdump -c < <( echo -e '_\x0\x1\x2\x3_' )
0000000 _ \0 001 002 003 _ \n
0000007
My bash version
$ bash --version | head -n 1
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Why does echo $'foo\0bar' not behave like echo -e 'foo\0bar'?
It's a limitation. bash does not allow string values to contain interior NUL bytes.
Posix (and C) character strings cannot contain interior NULs. See, for example, the Posix definition of character string (emphasis added):
3.92 Character String
A contiguous sequence of characters terminated by and including the first null byte.
Similarly, standard C is reasonably explicit about the NUL character in character strings:
§5.2.1p2 …A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
Posix explicitly forbids the use of NUL (and /) in filenames (XBD 3.170) and in environment variables (XBD 8.1: "... are considered to end with a null byte.").
In this context, shell command languages, including bash, tend to use the same definition of a character string, as a sequence of non-NUL characters terminated by a single NUL.
You can pass NULs freely through bash pipes, of course, and nothing stops you from assigning a shell variable to the output of a program which outputs a NUL byte. However, the consequences are "unspecified" according to Posix (XSH 2.6.3 "If the output contains any null bytes, the behavior is unspecified."). In bash, the NULs are removed, unless you insert a NUL into a string using bash's C-escape syntax ($'\0'), in which case the NUL will end up terminating the value.
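Both behaviors are easy to see from the resulting string lengths (recent bash versions also print a warning about the ignored null byte in the first case):

$ x=$(printf 'foo\0bar'); echo "${#x}"   # NUL stripped from command substitution
6
$ y=$'foo\0bar'; echo "${#y}"            # NUL terminates the $'...' value
3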
On a practical note, consider the difference between the two following ways of attempting to insert a NUL into the stdin of a utility:
$ # Prefer printf to echo -n
$ printf $'foo\0bar' | wc -c
3
$ printf 'foo\0bar' | wc -c
7
$ # With %b the data is an argument, which is better for strings which might contain %
$ printf %b 'foo\0bar' | wc -c
7
But why does bash not convert $'\0' and $'\x0' into a null character?
Because a null character terminates a string.
$ echo $'hey\0you'
hey
It is a null character, but it depends on what you mean by that.
The null character represents an empty string, which is what you get when you expand it. It is a special case and I think that is implied by the documentation but not actually stated.
In C binary zero '\0' terminates a string and on its own also represents an empty string. Bash is written in C, so it probably follows from that.
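You can check both claims directly: $'\0' alone expands to an empty string, and dropping it into the middle of a word removes nothing around it:

$ s=$'\0'; echo "${#s}"
0
$ echo _$'\0'_
__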
Edit: POSIX mentions a null string in a number of places. In the "Base definitions" it defines a null string as:
3.146 Empty String (or Null String)
A string whose first byte is a null byte.
Does anybody have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I want something that does essentially the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English-language characters and punctuation. (I only care about English characters, not any other alphabet.)
Thanks!
Do you just want to use it, or do you for some reason insist on the code?
On my Debian system, it seems the strings command can do this out of the box. See this excerpt from the manpage:
--encoding=encoding
    Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859, etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful for finding wide character strings.
Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.
byte b;
int i = 0;
while (!endOfInput()) {
    b = getNextByte();
LoopBegin:
    if (!isEnglish(b)) {
        if (i > 0) // report successful match of length i
            i = 0;
        continue;
    }
    if (endOfInput()) break;
    if ((b = getNextByte()) != 0) // not followed by a 0x00 high byte: retry with this byte
        goto LoopBegin;
    i++; // found another character
}
This should work for little-endian.
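The same little-endian scan fits in a perl one-liner; a sketch, where [ -~] is the printable-ASCII range, runs shorter than 4 characters are ignored, and s///r needs perl 5.14+:

$ perl -0777 -nE 'say $1 =~ s/\0//gr while /((?:[ -~]\0){4,})/g' file.bin   # file.bin is a placeholder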
I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings (UTF-8 is variable-width).
Remember that by default, characters outside ASCII need extra strings options. This includes almost all non-English strings.
Nevertheless, the "-e S" (single 8-bit characters) output does include UTF-8 characters.
I wrote a very simple (opinionated) Perl script that applies a
"strings -e S ... | iconv ..." pipeline to the input files.
I believe it is easy to tune it for specific restrictions.
Usage: utf8strings [options] file*
#!/usr/bin/perl -s
our ($all, $windows, $enc);  ## use -all to ignore the "3-letter word" restriction
use strict;
use utf8::all;
$enc = "ms-ansi" if $windows;  ## -windows reads the input as "ms-ansi" (CP1252)
$enc = "utf8" unless $enc;     ## default encoding = utf8
my $iconv = "iconv -c -f $enc -t utf8 |";
for (@ARGV) { s/(.*)/strings -e S '$1' | $iconv/; }  # turn each file argument into a pipe for <>
my $word = qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i;  # adapt this to your case
while (<>) {
    # next if /regular expression for common garbage/;
    print if ($all or /$word/);
}
In some situations, this approach produces some extra garbage.
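Example invocations (the file names are placeholders; -all and -windows are the script's own switches, parsed by perl -s):

$ ./utf8strings server.log
$ ./utf8strings -windows old-export.dat   # input is CP1252 instead of UTF-8
$ ./utf8strings -all dump.bin             # keep lines even without a 3-letter word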