Convert Unicode Escape to Hebrew text - python-3.x

I have the following text in a json file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
no matter which encoding/decoding i use i don't seem to get it right with
Python 3.
if for example ill try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
i get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which in bytecode is almost the correct text, if i was able to remove only one backslash and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how i changed double slash to single slash) then
text.decode('utf-8')
would yield the correct text in Hebrew.
but i am struggling to do so and couldn't manage to create a piece of code which will do that for me (and not manually as i just showed...)
any help much appreciated...

This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
.encode('latin-1')
.decode('raw_unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer would already interpret the escape sequences for you. That is, after the text element is loaded by JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of unicode, whereas your strangely broken text encodes each byte of UTF-8 encoding as an ASCII-escape of an UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time, it might be that the latin-1 doesn't work properly any more. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII, it would only make everything worse.

Related

Which encoding can be used to mess up common English alphabet [a-zA-Z]?

I have a input string which consists of common English alphabet and space and its original encoding is utf-8
I want to find a encoding which can obfuscate this string. In other words, I want to purposefully apply this encoding to "mess up" my original input string. I have tried several in this list : java supported encoding. But the original input string prints out unchanged.

Reading a text file with unicode characters - Python3

I am trying to read a text file which has unicode characters (u) and other tags (\n, \u) in the text, here is an example:
(u'B9781437714227000962', u'Definition\u2014Human papillomavirus
(HPV)\u2013related proliferation of the vaginal mucosa that leads to
extensive, full-thickness loss of maturation of the vaginal
epithelium.\n')
How can remove these unicode tags using python3 in Linux operating system?
To remove unicode escape sequence (or better: to translate them), in python3:
a.encode('utf-8').decode('unicode_escape')
The decode part will translate the unicode escape sequences to the relative unicode characters. Unfortunately such (un-)escape do no work on strings, so you need to encode the string first, before to decode it.
But as pointed in the question comment, you have a serialized document. Try do unserialize it with the correct tools, and you will have automatically also the unicode "unescaping" part.

How do I get from Éphémère to Éphémère in Python3?

I've tried all kinds of combinations of encode/decode with options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in Autoit), but I know for a fact the information is in there because at least one online utf decoder got it right. On the online converter website, I specified the file as utf8 and the output as utf16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
É at the place of É
é at the place of é
è at the place of è
The following table summarises what's going on:
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character à (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings here.
So, that's why your decoding renders É instead of É. Idem for the other non-ASCII characters é and è.
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')

Is there a standard function that converts multi-line text into ASCII escaped form and vice versa?

UPDATE:
The right terminology has been escaping me. What I am looking for is a function that converts multi-line ASCII text into ASCII escaped form.
Is there a standard function that converts multi-line text into ASCII escaped form and vice versa?
I need to store multi-line text as name=value pairs, basically line .ini files, where Value is ASCII escaped text which fits on a single line, but I prefer the format that doesn't use numeric codes to express the non-printing characters if such a format exists.
The multi-line text can be long, up to 65K in length.
How about to use Base64?
Base64 is used to encode attached files of E-mail. Base64 can convert any kinds of data into strings made of characters upto 64 kinds (Upper and Lower case alphabet(52 kinds),0 to 9 (10 kinds), "-" and "+").
Large picture (over 1MB) can be encoded by Base64, so 65K charactes may not make trouble.
In Windows .ini files, you can use the whole section to store multiline data.
[key1]
several lines
of data
[key2]
another
Read it with GetPrivateProfileSection. To get a list of keys, use GetPrivateProfileSectionNames.

How do I create spacial characters used in XML in my C# code

I have the following string, read from an XML attribute:
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
In my C# application it shows up nicely formatted like this
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
This is the form I see in the debugger, a combobox, or on this forum (if I don't indent to specify code).
What I want to do is specify the same string as a C# variable and have it show up nicely formatted when the application runs. Unfortunately, all I get is the string as I literally typed it.
I have tried to play around with converting the encoding from ASCII to UTF8 with no luck. How can I get this special character properly formatted, and where can I find a list of these symbols?
Those are called XML entities. Use HttpUtility.HtmlDecode to decode them back to plain text like you would like. Credit goes to C#, function to replace all html special characters with normal text characters for how to convert entities in C#
Note that converting from ASCII to UTF8 (and Unicode etc.) is called changing the character set and is usually done when specific characters are in the string. For instance if you strings contained Chinese characters you couldn't use ASCII. In this simple case you shouldn't need to convert character sets because C# strings are Unicode character set by default and XML entities are Unicode based (I believe).

Resources