Converting Unicode escaped characters (\u#### format) into UTF-8 in VBA?

I would like to convert the escaped special characters in the "text" variable back to normal characters in VBA:
Dim text As String
text = "Cs\u00fct\u00f6rt\u00f6k"
text = Encoding.utf8.GetString(Encoding.ASCII.GetBytes(text))
MsgBox text
'Csütörtök would be the correct result
But with the above code, Excel 2013 gives me an error about the "Encoding" method; it can't parse it.
It should be working just like this online converter if you put in the text value:
http://www.rapidmonkey.com/unicodeconverter/reverse.jsp
Is there any good solution for this problem? A one-line code maybe?
Thanks in advance!

The Encoding class does not decode escape sequences; you have to do that manually by parsing the string yourself. For that matter, VB strings use UTF-16, so you do not need the Encoding class at all. Simply replace characters 2-7 of the original string (counting from zero, "\u00fc") with a single &H00FC character, replace characters 9-14 ("\u00f6") with a single &H00F6 character, and so on, and you are done. Each \uXXXX sequence represents a single UTF-16 code unit, which for characters like these is also a single Unicode code point.
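The manual parsing described above can be sketched as follows. This is written in Python rather than VBA, purely to illustrate the replacement logic; the function name decode_u_escapes is made up for this example, and the sketch assumes every escape has exactly four hex digits and does not combine surrogate pairs. In VBA the same loop would presumably build each character with ChrW.

```python
import re

# Matches a backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(s: str) -> str:
    """Replace each \\uXXXX sequence with the character it names.

    Hypothetical helper for illustration; surrogate pairs are not combined.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)

# Raw string so the backslashes stay literal, as they are in the VBA variable.
decoded = decode_u_escapes(r"Cs\u00fct\u00f6rt\u00f6k")
print(decoded)  # Csütörtök
```

Text without any escape sequences passes through unchanged, so the function can be applied to every string without checking first.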

Related

Ursina Unicode Display

When I try to paste a Unicode character into the Text entity as text, it comes up with the error:
:text(warning): No definition in for character U+21d5
I have also tried to use the Unicode string itself inside the text string.
Does anyone know how to do it?

Convert Unicode Escape to Hebrew text

I have the following text in a json file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
I get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which, as bytes, is almost the correct text. If I were able to remove only one backslash from each pair and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how I changed each double backslash to a single backslash), then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so and couldn't manage to write a piece of code that does this for me (rather than manually, as I just showed).
Any help is much appreciated.
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
.encode('latin-1')
.decode('raw_unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer already interprets the escape sequences for you. That is, after the text element is loaded by a JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; it might be that the latin-1 step no longer works properly. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII, as it would only make everything worse.
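The JSON-deserializer variant described above can be sketched like this. The variable names are made up for the example, and the sketch assumes the file contains exactly the escaped string from the question and nothing else.

```python
import json

# The literal JSON text from the question; a raw string keeps the backslashes.
raw_json = (
    r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa '
    r'\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"'
)

# json.loads interprets the \uXXXX escapes, giving one character per UTF-8 byte.
broken = json.loads(raw_json)

# latin-1 maps code points 0-255 back to the same byte values,
# and those bytes are the real UTF-8 encoding of the Hebrew text.
fixed = broken.encode("latin-1").decode("utf-8")
print(fixed)  # אחוזת פולג
```

If the deserialized string contains any character above U+00FF, the encode("latin-1") step raises UnicodeEncodeError, which is one way to detect values that were not broken in this particular way.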

Is there a standard function that converts multi-line text into ASCII escaped form and vice versa?

UPDATE:
The right terminology has been escaping me. What I am looking for is a function that converts multi-line ASCII text into ASCII escaped form.
Is there a standard function that converts multi-line text into ASCII escaped form and vice versa?
I need to store multi-line text as name=value pairs, basically like .ini files, where the value is ASCII escaped text which fits on a single line, but I prefer a format that doesn't use numeric codes to express the non-printing characters, if such a format exists.
The multi-line text can be long, up to 65K in length.
How about using Base64?
Base64 is used to encode e-mail attachments. It can convert any kind of data into a string drawn from 64 characters: upper- and lower-case letters (52), the digits 0 to 9 (10), "+" and "/", with "=" used for padding.
Large pictures (over 1 MB) can be encoded with Base64, so 65K characters should not cause trouble.
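As a rough illustration of the Base64 suggestion, here is a minimal Python sketch. The helper names to_single_line and from_single_line are invented for this example, and it assumes the text is encoded as UTF-8 before being Base64-encoded.

```python
import base64

def to_single_line(text: str) -> str:
    """Encode arbitrary multi-line text as one line of Base64 (hypothetical helper)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def from_single_line(value: str) -> str:
    """Reverse to_single_line."""
    return base64.b64decode(value).decode("utf-8")

multi_line = "first line\nsecond line\n\tindented third line"
encoded = to_single_line(multi_line)   # contains no line breaks
restored = from_single_line(encoded)   # identical to multi_line
print(f"Value={encoded}")
```

The trade-off is that the stored value is no longer human-readable, and it grows by about a third, so 65K of text becomes roughly 87K of Base64.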
In Windows .ini files, you can use the whole section to store multiline data.
[key1]
several lines
of data
[key2]
another
Read it with GetPrivateProfileSection. To get a list of keys, use GetPrivateProfileSectionNames.

How do I create special characters used in XML in my C# code

I have the following string, read from an XML attribute:
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
In my C# application it shows up nicely formatted like this
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
This is the form I see in the debugger, a combobox, or on this forum (if I don't indent to specify code).
What I want to do is specify the same string as a C# variable and have it show up nicely formatted when the application runs. Unfortunately, all I get is the string as I literally typed it.
I have tried to play around with converting the encoding from ASCII to UTF8 with no luck. How can I get this special character properly formatted, and where can I find a list of these symbols?
Those are called XML entities. Use HttpUtility.HtmlDecode to decode them back to plain text, as you would like. Credit goes to the question "C#, function to replace all html special characters with normal text characters" for how to convert entities in C#.
Note that converting from ASCII to UTF8 (and Unicode etc.) is called changing the character set and is usually done when specific characters are in the string. For instance, if your strings contained Chinese characters, you couldn't use ASCII. In this simple case you shouldn't need to convert character sets, because C# strings use the Unicode character set by default and XML entities are Unicode based (I believe).
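To show what entity decoding does, here is a small Python sketch using the standard html module, playing the role HttpUtility.HtmlDecode plays in C#. The entity form &#8221; used below is an assumption: the exact form in the asker's XML attribute is not shown, and it could equally be a named or hexadecimal entity.

```python
import html

# Assumed entity-encoded form of the attribute value (numeric reference for U+201D).
encoded = "OnTrak 4-3/4&#8221;, 6-3/4&#8221;, 8-1/4&#8221; / MPR"

# Decode the character references back to the characters they stand for.
decoded = html.unescape(encoded)

# The same string written directly in source code with a Unicode escape,
# which avoids the need to decode anything at run time.
literal = "OnTrak 4-3/4\u201d, 6-3/4\u201d, 8-1/4\u201d / MPR"

print(decoded == literal)  # True
```

C# string literals accept the same \u201D style of escape, so a string defined in code can use the escape directly instead of the XML entity.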

Replace characters in a string to fit browser, C#, From UTF8 encoding

I need to replace the characters å, ä, ö to browser friendly chars. For example ä should become %E4.
I tried weirdString = Uri.EscapeUriString(weirdString);
But it doesn't convert the åäö to the right escapes. Help please?
Edit: Tried this:
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] asciicharacters = Encoding.UTF8.GetBytes("vägen");
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, asciicharacters);
string finalString = ascii.GetString(asciiArray);
string fixedAddrString = HttpUtility.HtmlEncode(finalString);
If you need the characters to display on the page and not part of a URL, you should use Server.HtmlEncode.
var encodedString = Server.HtmlEncode(myString);
HTML encoding makes sure that text is displayed correctly in the browser and not interpreted by the browser as HTML. For example, if a text string contains a less than sign (<) or greater than sign (>), the browser would interpret these characters as the opening or closing bracket of an HTML tag. When the characters are HTML encoded, they are converted to the strings &lt; and &gt;, which causes the browser to display the less than sign and greater than sign correctly.
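The effect of HTML encoding described above can be seen in a short Python sketch. It uses the standard html.escape function as a stand-in for Server.HtmlEncode; the sample text is made up.

```python
import html

# Text that a browser would otherwise try to parse as markup.
raw = "if a < b and b > c"

# html.escape replaces <, > and & (and, by default, quotes) with entities.
escaped = html.escape(raw)
print(escaped)  # if a &lt; b and b &gt; c
```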
Update:
Since you are using UTF-8, these characters are escaped to UTF-8, not ASCII.
You need to convert the string from UTF-8 to ASCII, using the Encoding classes before you try to escape them. That is, if you do want the ASCII values to come up.
See here.
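The difference between the two kinds of escape is easiest to see side by side. The following Python sketch uses urllib.parse.quote only to illustrate the byte-level behaviour, and it assumes the %E4 form the question asks for is the ISO-8859-1 (Latin-1) encoding of ä. In .NET, the HttpUtility.UrlEncode overload that takes an Encoding argument is probably the closest equivalent.

```python
from urllib.parse import quote

word = "vägen"

# Percent-encoding the UTF-8 bytes: ä is two bytes, so it becomes two escapes.
utf8_escaped = quote(word, encoding="utf-8")

# Percent-encoding the Latin-1 bytes: ä is the single byte 0xE4.
latin1_escaped = quote(word, encoding="latin-1")

print(utf8_escaped)    # v%C3%A4gen
print(latin1_escaped)  # v%E4gen
```

Which form is correct depends on the encoding the receiving server expects, so %E4 only works if the other side decodes the URL as Latin-1.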
Try HttpUtility.HtmlDecode.
If this doesn't work either, you can do a simple string.Replace for these characters.

Resources