Unicode-aware strings(1) program - string

Does anybody have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I want something that essentially does the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English-language characters and punctuation. (I only care about English characters, not any other alphabet.)
Thanks!

Do you just want to use it, or do you for some reason insist on the code?
On my Debian system, it seems the strings command can do this out of the box. See the excerpt from the manpage:
--encoding=encoding
Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859,
etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
for finding wide character strings.
Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.
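byte b;
int i = 0;                        // length of the current run, in characters
while (!endOfInput()) {
    b = getNextByte();
LoopBegin:
    if (!isEnglish(b)) {
        if (i > 0) report(i);     // report successful match of length i (report() is a placeholder)
        i = 0;
        continue;
    }
    if (endOfInput()) break;
    if ((b = getNextByte()) != 0) {
        if (i > 0) report(i);     // high byte was not zero, so the run ends here
        i = 0;
        goto LoopBegin;           // re-examine b as a possible first byte
    }
    i++;                          // found another character
}
if (i > 0) report(i);             // report a run that reaches the end of input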
This should work for little-endian.
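Since the question says the language doesn't matter, here is a minimal Python sketch of the same idea using regular expressions over raw bytes. The function name `find_strings` and the `min_len` parameter are made up for illustration, and "English characters" is assumed to mean printable ASCII plus tab; this is not how strings(1) itself is implemented.

```python
import re

# Printable ASCII (space through tilde) plus tab, written as a bytes regex class.
_CHAR = rb"[\x20-\x7e\t]"


def find_strings(data: bytes, min_len: int = 4):
    """Yield (encoding, text) pairs for runs of printable ASCII.

    Two layouts are searched: one byte per character (ASCII, which is
    also what English text looks like in UTF-8), and UTF-16LE, where
    every character is followed by a zero byte.
    """
    narrow = re.compile(_CHAR + rb"{%d,}" % min_len)
    wide = re.compile(rb"(?:" + _CHAR + rb"\x00){%d,}" % min_len)
    for m in narrow.finditer(data):
        yield "ascii", m.group().decode("ascii")
    for m in wide.finditer(data):
        yield "utf-16le", m.group().decode("utf-16-le")


if __name__ == "__main__":
    sample = b"\x01\x02hello world\x00\xff" + "wide text".encode("utf-16-le")
    for encoding, text in find_strings(sample):
        print(encoding, text)
```

For big-endian UTF-16 the zero byte would come before each character instead of after it, so the `wide` pattern would need the `\x00` moved in front of the character class.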

I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings (UTF-8 is a variable-width encoding).
Remember that by default, characters outside ASCII need extra strings options. This includes almost all non-English-language strings.
Nevertheless, the output of "-e S" (single 8-bit characters) includes UTF-8 characters.
I wrote a very simple (opinionated) Perl script that applies a
"strings -e S ... | iconv ..." to the input files.
I believe it is easy to tune it for specific restrictions.
Usage: utf8strings [options] file*
#!/usr/bin/perl -s
our ($all,$windows,$enc); ## use -all to ignore the "3 letters word" restriction
use strict;
use utf8::all;
$enc = "ms-ansi" if $windows; ## -windows: treat the input as ms-ansi
$enc = "utf8" unless $enc ; ## default encoding=utf8
my $iconv = "iconv -c -f $enc -t utf8 |";
for (@ARGV){ s/(.*)/strings -e S '$1'| $iconv/;}
my $word=qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i; # adapt this to your case
while(<>){
# next if /regular expressions for common garbage/;
print if ($all or /$word/);
}
In some situations, this approach produces some extra garbage.
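The same filter-by-word idea can be sketched without the external strings and iconv commands. This is only a rough Python illustration, assuming the input is mostly UTF-8; the name `utf8_strings`, the `min_len` parameter, and the punctuation set in the pattern are all assumptions to adapt to your case.

```python
import re


def utf8_strings(data: bytes, min_len: int = 3):
    """Return runs of letters, digits, spaces and common punctuation.

    Bytes that are not valid UTF-8 are turned into U+FFFD by
    errors="replace"; that character is not matched by the pattern,
    so it simply ends the current run.
    """
    text = data.decode("utf-8", errors="replace")
    pattern = re.compile(r"[\w .,;:!?'\"()-]{%d,}" % min_len)
    return [m.group() for m in pattern.finditer(text)]


if __name__ == "__main__":
    blob = b"\x00\x01" + "Olá mundo".encode("utf-8") + b"\xff\xfe"
    print(utf8_strings(blob))
```

Because `\w` matches any Unicode letter, this keeps accented words too; restricting the class to `[A-Za-z]` plus punctuation would give the English-only behaviour asked for in the question.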

Related

Perl: Converting strings to Unicode

I have a MySql database that stores strings with the Unicode characters encoded using an XML type format (i.e., &#nnnnn; ). An example of one of these strings would be: &#27010;&#36848; which represents the Unicode characters: 概述
Perl lets me make this conversion in my application if I hard-code the strings in the format:
\x{6982}\x{8ff0} or even: \N{U+6982}\N{U+8ff0}
To me it seems like a simple matter of changing the format from &#nnnnn; to \x{nnnn}
The Perl application seems to require hex numbers whereas the MySql is outputting integers.
I wanted to do this simple conversion in Regex. So I matched the integer using:
m/\&\#(\d{3,5});/;
Then I converted the match to hex using:
sprintf('{%04x}',$1)
Then I added in the necessary: \x{ }
I was easily able to create strings that contained: "\x{6982}\x{8ff0}"
But none of them were printed by the application as Unicode. They were simply printed as they were created: symbols and text.
I found out that if you hard-coded these strings into the program, Perl would "interpolate" them into Unicode characters. But if they were created as a string, the "interpolation" did not take place.
I tried to force the interpolation by using various functions such as:
Encode::decode('UTF-8', "some string" );
Encode::encode('UTF-8', "some string" );
But that wasn't what those functions were intended for.
I also tried to use Perl's manual string interpolation
$v="${ \($v) }";
But that did not convert the string "\x{6982}\x{8ff0}" into Unicode. It simply remained the same string as before.
I came across an example using "eval()".
while($unicodeString =~ m/\&\#(\d{3,5});/) {
$_=$unicodeString; ## in the XML form of (spaces added so you could see it here): & #27010; & #36848;
m/\&\#(\d{3,5});/; ## Matches the integer number in the Unicode
my $y=q(\x).sprintf('{%04x}',$1); ## Converts the integer to hex and adds the \x{}
my $v = eval qq{"$y"}; ## Performs the interpolation of the string to get the Unicode
$unicodeString =~ s/\&\#(\d{3,5});/$v/; ## Replaces the old code with the new Unicode character
}
This conversion works now. But I am not happy with the repeated use of eval() to convert each character: one-at-a-time. I could build my string in the While loop and then simply eval() the new string. But I would prefer to only eval() those small strings that were specifically matched in Regex.
Is there a better way of converting an XML string (with Unicode characters shown as integers) into a string that contains the actual Unicode characters?
How can I easily go from a string that contains the following text written as numeric character references (&#nnnnn;):
我认识到自己的长处和短处,并追求自我发展。
to one with:
我认识到自己的长处和短处,并追求自我发展。
The documents I need to convert contain thousands of these characters.
Here is a simple example of how you can replace the unicode escapes using the chr function:
use feature qw(say);
use strict;
use warnings;
use open qw( :encoding(utf-8) :std );
my $str = "&#27010;&#36848;";
$str =~ s/&#(\d+);/chr $1/eg;
printf "%vX\n", $str;
say $str;
Output:
6982.8FF0
概述
I didn't find a module that decodes XML entities, because they are normally only found in XML, and the XML parser handles them. But it's pretty easy to recreate.
use feature qw( say state );
sub decode_xml_entities_inplace {
state $ents = {
amp => "&",
lt => "<",
gt => ">",
quot => '"',
apos => "'",
};
$_[0] =~ s{
&
(?: \# (?: x([0-9a-fA-F]+)
| ([0-9]+)
)
| (\w+)
)
;
}{
if (defined($1)) { chr(hex($1)) }
elsif (defined($2)) { chr($2) }
else { $ents->{$3} // $& }
}xeg;
}
my $s = "&#27010;&#36848;";
decode_xml_entities_inplace($s);
say $s;
Of course, if you simply need to handle the decimal numeric entities, the above simplifies to
use feature qw( say );
my $s = "&#27010;&#36848;";
$s =~ s{ &\# ([0-9]+) ; }{ chr($1) }xeg;
say $s;
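For comparison, the same substitution technique looks like this in Python. It is only a sketch: the function name `decode_numeric_entities` is made up, and it handles just the decimal and hexadecimal numeric forms, leaving named entities such as &amp; untouched.

```python
import re

# &#NNNN; (decimal) or &#xHHHH; (hexadecimal) numeric character references.
_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Replace each numeric character reference with the character it names."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # chr() raises ValueError for values above 0x10FFFF; not handled here.
        return chr(code)
    return _ENTITY.sub(repl, text)


if __name__ == "__main__":
    print(decode_numeric_entities("&#27010;&#36848;"))
```

As in the Perl versions above, the conversion is a single pass with a callback per match, so no eval is needed. Python's standard library also has `html.unescape`, which decodes numeric and named HTML entities in one call.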

Bash. How to convert UTF-8 to hex encode?

I have a variable holding a string of UTF-8 text. I want to get a string like \xAA\xBB\xCC, or something encoded as \Uxxxxxxxx or some such. How can I do that?
I could do it with Python 3 (3.7):
def stou(x):
    s = ''
    for i in x:
        s = s + '\\U' + hex(ord(i))[2:]
    return s
But I'd like to solve it with native bash methods and/or standard, almost-native Linux utilities such as base64 or find. I'm just trying to create a file server, and in the usual format I have problems with space characters, so I'm trying to find another way to store the names.
Using perl:
$ echo -ne "12345 =\n= me + Дварфы" | perl -0777 -CS -nE 'say map { sprintf "\\U%x", $_ } unpack "U*"'
\U31\U32\U33\U34\U35\U20\U3d\Ua\U3d\U20\U6d\U65\U20\U2b\U20\U414\U432\U430\U440\U444\U44b
Basically, it reads all of its standard input as one UTF-8 encoded chunk, converts each codepoint to a number, and prints them out in base 16 with a leading \U before each one.
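The question also mentions the byte-oriented \xAA\xBB\xCC form. A small Python sketch of that variant, with the reverse direction included so names can be restored, might look like the following; the function names are hypothetical, and the escaping works on UTF-8 bytes rather than on code points as the Perl one-liner does.

```python
import re


def utf8_hex_escape(text: str) -> str:
    """Encode text as UTF-8 and write every byte as \\xNN."""
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))


def utf8_hex_unescape(escaped: str) -> str:
    """Inverse of utf8_hex_escape: collect the \\xNN bytes and decode them."""
    raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", escaped))
    return raw.decode("utf-8")


if __name__ == "__main__":
    escaped = utf8_hex_escape("a Д")
    print(escaped)
    print(utf8_hex_unescape(escaped))
```

Every byte becomes exactly four ASCII characters, so the result contains no spaces and can be split unambiguously, at the cost of a fourfold size increase for plain ASCII names.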

Printing a string containing utf-8 byte sequences on perl

I'm new to perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í
I tried using a regex to replace each unescaped $ with \x
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape "\ $ )"
Within perl it just prints the literal string.
My workaround is feeding it into bash's printf and it succeeds... unless there's a literal "\x" in the string
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions i consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in perl?
The following substitution seems to work for your input:
s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;
Capture \$ or \\, if matched, replace them with $ or \. Otherwise, capture $.. and convert to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8.
use Encode qw(decode);
$chars = decode('UTF-8', $bytes);
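The two steps above (unescape to bytes, then decode the bytes as UTF-8) can be sketched in Python as well. The function name `decode_mork_string` is made up, and the sketch assumes the escaping rules quoted in the question: a backslash before $, ) or \ makes it literal, and $XX stands for one byte.

```python
import re

# Either a backslash-escaped literal ($, \ or ")") or a $XX byte escape.
_TOKEN = re.compile(rb"\\([$\\)])|\$([0-9A-Fa-f]{2})")


def decode_mork_string(raw: bytes) -> str:
    """Undo Mork escaping on a byte string, then decode the result as UTF-8."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)                 # escaped literal character
        return bytes([int(match.group(2), 16)])   # $XX -> single byte
    return _TOKEN.sub(repl, raw).decode("utf-8")


if __name__ == "__main__":
    sample = rb"aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08"
    print(decode_mork_string(sample))
```

Working on bytes until the very end is what lets multi-byte sequences such as $E2$98$BA come out as one character; decoding each $XX on its own would give three separate characters instead.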

Perl Program to Print Unicode From Hex Value

I am starting up with Perl and confused on how to render unicode characters given a hex string variable.
#!/usr/bin/perl
use warnings;
foreach my $i (0..10000) {
my $hex = sprintf("%X", $i);
print("unicode of $i is \x{$hex}\n");
}
print("\x{2620}\n");
print("\x{BEEF}\n");
Gives me the warning: Illegal hexadecimal digit '$' ignored at perl.pl line 9.
and no value prints for \x{$hex}
Both chr($num) and pack('W', $num) produce a string consisting of the single character with the specified value, just like "\x{XXXX}" does.
As such, you can use
print("unicode of $i is ".chr(hex($hex))."\n");
or just
print("unicode of $i is ".chr($i)."\n");
Note that your program makes no sense without
use open ':std', ':encoding(UTF-8)';
Yup. You can't do that. No variable interpolation allowed in the middle of a \x like that. You can use chr() to get that character though.
Randal's answer is correct. For more info, you might want to read perluniintro.
From there, you can find, for example:
At run-time you can use:
use charnames ();
my $hebrew_alef_from_name
= charnames::string_vianame("HEBREW LETTER ALEF");
my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0");

How can I convert a string to a number in Perl?

I have a string which holds a decimal value in it and I need to convert that string into a floating point variable. So an example of the string I have is "5.45" and I want a floating point equivalent so I can add .1 to it. I have searched around the internet, but I only see how to convert a string to an integer.
You don't need to convert it at all:
% perl -e 'print "5.45" + 0.1;'
5.55
This is a simple solution:
Example 1
my $var1 = "123abc";
print $var1 + 0;
Result
123
Example 2
my $var2 = "abc123";
print $var2 + 0;
Result
0
Perl is a context-based language. It doesn't do its work according to the data you give it. Instead, it figures out how to treat the data based on the operators you use and the context in which you use them. If you do numbers sorts of things, you get numbers:
# numeric addition with strings:
my $sum = '5.45' + '0.01'; # 5.46
If you do strings sorts of things, you get strings:
# string replication with numbers:
my $string = ( 45/2 ) x 4; # "22.522.522.522.5"
Perl mostly figures out what to do and it's mostly right. Another way of saying the same thing is that Perl cares more about the verbs than it does the nouns.
Are you trying to do something and it isn't working?
Google led me here while searching on the same question phill asked (sorting floats), so I figured it would be worth posting the answer despite the thread being kind of old. I'm new to Perl and am still getting my head wrapped around it, but brian d foy's statement "Perl cares more about the verbs than it does the nouns." above really hits the nail on the head. You don't need to convert the strings to floats before applying the sort. You need to tell the sort to sort the values as numbers and not strings.
i.e.
my @foo = ('1.2', '3.4', '2.1', '4.6');
my @foo_sort = sort {$a <=> $b} @foo;
See http://perldoc.perl.org/functions/sort.html for more details on sort
As I understand it int() is not intended as a 'cast' function for designating data type it's simply being (ab)used here to define the context as an arithmetic one. I've (ab)used (0+$val) in the past to ensure that $val is treated as a number.
$var += 0
is probably what you want. Be warned, however: if $var is a string that cannot be converted to a number, you'll get a warning (when warnings are enabled), and $var will be reset to 0:
my $var = 'abc123';
print "var = $var\n";
$var += 0;
print "var = $var\n";
logs
var = abc123
Argument "abc123" isn't numeric in addition (+) at test.pl line 7.
var = 0
Perl really only has three types: scalars, arrays, and hashes. And even that distinction is arguable. ;) The way each variable is treated depends on what you do with it:
% perl -e "print 5.4 . 3.4;"
5.43.4
% perl -e "print '5.4' + '3.4';"
8.8
In comparisons it makes a difference whether a scalar is a number or a string. And it is not always decidable. I can report a case where Perl retrieved a float in "scientific" notation and used it a few lines below in a comparison:
use strict;
....
next unless $line =~ /and your result is:\s*(.*)/;
my $val = $1;
if ($val < 0.001) {
print "this is small\n";
}
And here $val was not interpreted as numeric for e.g. "2e-77" retrieved from $line. Adding 0 (or 0.0 for good ole C programmers) helped.
Perl is weakly typed and context based. Many scalars can be treated both as strings and numbers, depending on the operators you use.
$a = 7*6; $b = 7x6; print "$a $b\n";
You get 42 777777.
There is a subtle difference, however. When you read numeric data from a text file into a data structure, and then view it with Data::Dumper, you'll notice that your numbers are quoted. Perl treats them internally as strings.
Read: $my_hash{$1} = $2 if /(.+)=(.+)\n/;
Dump: 'foo' => '42'
If you want unquoted numbers in the dump:
Read: $my_hash{$1} = $2+0 if /(.+)=(.+)\n/;
Dump: 'foo' => 42
After $2+0 Perl notices that you've treated $2 as a number, because you used a numeric operator.
I noticed this whilst trying to compare two hashes with Data::Dumper.
