I am trying to read a text file that contains Unicode string prefixes (u'...') and escape sequences (\n, \uXXXX) in the text. Here is an example:
(u'B9781437714227000962', u'Definition\u2014Human papillomavirus
(HPV)\u2013related proliferation of the vaginal mucosa that leads to
extensive, full-thickness loss of maturation of the vaginal
epithelium.\n')
How can I remove these Unicode escapes using Python 3 on Linux?
To remove the Unicode escape sequences (or better: to translate them) in Python 3:
a.encode('utf-8').decode('unicode_escape')
The decode part translates the Unicode escape sequences into the corresponding Unicode characters. Unfortunately, this (un-)escaping does not work directly on strings, so you need to encode the string to bytes first before decoding it.
But as pointed out in the question comments, you have a serialized document. Try to deserialize it with the correct tools, and the Unicode "unescaping" will happen automatically.
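For instance, since each entry in your sample looks like the repr of a Python 2 tuple, ast.literal_eval can deserialize it. A sketch, assuming each serialized tuple sits on one line of the file; "input.txt" is a placeholder name:
import ast

with open('input.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        doc_id, text = ast.literal_eval(line)  # u'...' literals are accepted in Python 3
        print(doc_id, text)  # \u2014, \n etc. are already real characters here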
Related
I use the Accumulo shell to inspect tables. I find I can insert English, but when I insert Chinese, I see garbled output in the shell. How do I deal with this problem? Can I set UTF-8 in Accumulo?
The answer is no. Unfortunately, displaying UTF-8 characters is not currently possible. Accumulo primarily deals with raw bytes, not Strings of Characters. The shell currently (up to 2.0.0-alpha-2, at least) has very limited capabilities to display wide Unicode characters. The shell's behavior is to, for convenience, show printable 7-bit ASCII characters; the rest are shown as hex-encoded form.
While the current shell capabilities are limited, Accumulo is an open source project that would welcome patches to better support printable UTF-8 characters.
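Roughly, the display rule is: keep printable 7-bit ASCII, show every other byte in hex-escaped form. A small Python illustration of that idea (not Accumulo's actual code):
def render(value: bytes) -> str:
    # Keep printable 7-bit ASCII; hex-escape everything else.
    return ''.join(chr(b) if 0x20 <= b <= 0x7e else '\\x%02x' % b for b in value)

print(render('中文'.encode('utf-8')))  # \xe4\xb8\xad\xe6\x96\x87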
I have the following text in a json file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with
Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa
\u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
I get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which, as bytes, is almost the correct text. If I could remove just one backslash from each pair and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how I changed the double backslashes to single backslashes), then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so and couldn't manage to write a piece of code that does this for me (rather than by hand, as I just showed...).
Any help is much appreciated...
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what is literally saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
.encode('latin-1')
.decode('raw_unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer already interprets those escape sequences for you. That is, after the text element has been loaded by the JSON deserializer, it is sufficient to encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; the latin-1 round trip might then no longer work properly. And please don't apply this transformation to the JSON file itself unless the JSON contains only ASCII; it would only make everything worse.
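Putting it together, a minimal end-to-end sketch, assuming the broken string above is the only content (the variable names are mine):
import json

raw = r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"'

# Option 1: let the JSON parser resolve the \uXXXX escapes, then undo the
# mojibake with a latin-1 round trip.
print(json.loads(raw).encode('latin-1').decode('utf-8'))   # אחוזת פולג

# Option 2: work on the raw, unparsed text as shown above.
print(raw.strip('"')
         .encode('latin-1')
         .decode('raw_unicode_escape')
         .encode('latin-1')
         .decode('utf-8'))                                 # אחוזת פולג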
UPDATE:
The right terminology has been escaping me. What I am looking for is a function that converts multi-line ASCII text into ASCII escaped form.
Is there a standard function that converts multi-line text into ASCII escaped form and vice versa?
I need to store multi-line text as name=value pairs, basically like .ini files, where the value is ASCII-escaped text that fits on a single line. I would prefer a format that doesn't use numeric codes to express the non-printing characters, if such a format exists.
The multi-line text can be long, up to 65K in length.
How about using Base64?
Base64 is used to encode e-mail attachments. It can convert any kind of data into a string drawn from a set of 64 characters: the upper- and lower-case letters (52), the digits 0 to 9 (10), plus "+" and "/".
Pictures larger than 1 MB can be encoded with Base64, so 65K characters should not cause any trouble.
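A minimal Python sketch of packing multi-line text into a single-line value and back:
import base64

multi_line = "first line\nsecond line\ttab and trailing newline\n"

value = base64.b64encode(multi_line.encode('utf-8')).decode('ascii')
print(value)    # one line with no newlines, safe to store as name=value

restored = base64.b64decode(value).decode('utf-8')
assert restored == multi_line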
In Windows .ini files, you can use the whole section to store multiline data.
[key1]
several lines
of data
[key2]
another
Read it with GetPrivateProfileSection. To get a list of keys, use GetPrivateProfileSectionNames.
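Calling that API from Python could look roughly like this (a Windows-only sketch; the path and section name are placeholders, and I have not verified how the API treats lines that contain no '='):
import ctypes

# Use a full path; otherwise the API looks for the file in the Windows directory.
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
buf = ctypes.create_unicode_buffer(65536)
n = kernel32.GetPrivateProfileSectionW('key1', buf, len(buf), r'C:\data\settings.ini')

# The section comes back as NUL-separated entries ending with a double NUL.
lines = [entry for entry in buf[:n].split('\x00') if entry]
print('\n'.join(lines))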
In Linux, I am converting UTF-8 to ISO-8859-1 file using the following command:
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT input.txt > out.txt
After the conversion, when I open out.txt,
¿Quién Gómez has been turned into ¿Quien Gomez.
Why are é and ó (and others) not converted correctly?
There are (at least) two ways to represent the accented letter é in Unicode: as a single code point U+00E9, LATIN SMALL LETTER E WITH ACUTE, and as a two-character sequence e (U+0065) followed by U+0301, COMBINING ACUTE ACCENT.
Your input file uses the latter representation, which iconv apparently is unable to translate to Latin-1 (ISO-8859-1). With the //TRANSLIT suffix, it passes the unaccented e through unmodified and drops the combining character.
You'll probably need to convert the input so it doesn't use combining characters, replacing the sequence U+0065 U+0301 with the single code point U+00E9 (2 bytes in UTF-8). Either that, or arrange for whatever generates your input file to use that form in the first place.
So that's the problem; I don't currently know exactly how to correct it.
Keith, you are right. I found the answer from Sergiusz Wolicki on the Oracle Community forum. I am quoting his answer word for word here, and posting it for anybody who may have this problem. "The problem is that your data is stored in the Unicode decomposed form, which is legal but seldom used for Western European languages. 'é' is stored as 'e' (0x65=U+0065) plus combining acute accent (0xcc, 0x81 = U+0301). Most simple conversion tools, including standard Oracle client/server conversion, do not account for this and do not convert a decomposed characters into a pre-composed character from ISO 8859-1. They try to convert each of the two codes independently, yielding 'e' plus some replacement for the accent character, which does not exist in ISO 8859-1. You see the result correctly in SQL Developer because there is no conversion involved and SQL Developer rendering code is capable of combining the two codes into one character, as expected.
As 'é' and 'ó' have pre-composed forms available in both Unicode and ISO 8859-1, the work around is to add COMPOSE function to your query. Thus, set NLS_LANG as I advised previously and add COMPOSE around column expressions to your query." Thank you very much, Keith
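The same pre-composition can also be done outside the database, for example in Python with unicodedata.normalize('NFC', ...), which maps 'e' + U+0301 to U+00E9 before the Latin-1 conversion. A sketch; the file names are assumptions:
import unicodedata

with open('input.txt', encoding='utf-8') as f:
    text = f.read()

composed = unicodedata.normalize('NFC', text)   # 'e' + U+0301 -> U+00E9

with open('out.txt', 'w', encoding='iso-8859-1', errors='replace') as f:
    f.write(composed)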
When pasting text from outside sources into a plain-text editor (e.g. TextMate or Sublime Text 2) a common problem is that special characters are often pasted in as well. Some of these characters render fine, but depending on the source, some might not display correctly (usually showing up as a question mark with a box around it).
So this is actually 2 questions:
Given a special character (e.g., ’ or ♥), can I determine the UTF-8 character code used to display that character from inside my text editor, and/or convert those characters to their character codes?
For those "extra-special" characters that come in as garbage, is there any way to figure out what encoding was used to display that character in the source text, and can those characters somehow be converted to UTF-8?
My favorite site for looking up characters is fileformat.info. They have a great Unicode character search that includes a lot of useful information about each character and its various encodings.
If you see the question mark with a box, that means you pasted something that can't be interpreted, often because it's not legal UTF-8 (not every byte sequence is legal UTF-8). One possibility is that it's UTF-16 with an endian mode that your editor isn't expecting. If you can get the full original source into a file, the file command is often the best tool for determining the encoding.
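For question 1, if you have a Python prompt handy, you can also inspect a pasted character directly (a small sketch, independent of any particular editor):
import unicodedata

ch = '’'                     # the example character is U+2019
print(hex(ord(ch)))          # 0x2019 -> the code point U+2019
print(unicodedata.name(ch))  # RIGHT SINGLE QUOTATION MARK
print(ch.encode('utf-8'))    # b'\xe2\x80\x99' -> the UTF-8 byte sequence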
At &what I built a tool focused on searching for characters. It indexes all the Unicode and HTML entity tables, but also supplements them with hacker dictionaries and a database of keywords I've collected, so you can search for words like heart, quot, weather, umlaut, hash, or cloverleaf and get what you want. By focusing on search, it avoids having to hunt around the Unicode pages, which can be frustrating. Give it a try.