Python3 - Convert unicode literals string to unicode string - python-3.x

From the command line parameters (sys.argv) I receive a string of Unicode escape sequences like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'
For example this script uni.py:
import sys
print(sys.argv[1])
command line:
python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021
output:
\u041f\u0440\u0438\u0432\u0435\u0442\u0021
I want to convert it to unicode string 'Привет!'

You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.
If you just print the string, you'll get the correct result, assuming your terminal supports the characters.
print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')
This will print:
Привет!
UPDATE
After updating your question it became clear to me that the mentioned string is not really a string literal (or unicode literal), but input from the command line. In that case you could use the "unicode-escape" encoding to get the result you want.
Note that encoding works from Unicode to bytes, and decoding works from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which maps each code point from 0 to 255 to the byte with the same value.
The following code will print the correct result for your example:
text = sys.argv[1].encode('latin-1').decode('unicode-escape')
print(text)
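As a minimal sketch, the same two-step conversion can be wrapped in a helper. The name decode_unicode_escapes is made up for illustration, and the sketch assumes the input contains only characters in the Latin-1 range (anything outside it makes the encode step raise UnicodeEncodeError):

```python
def decode_unicode_escapes(raw):
    """Interpret literal \\uXXXX sequences in raw (hypothetical helper)."""
    # latin-1 turns code points 0-255 into the same byte values, so the
    # backslash sequences reach the unicode-escape decoder unchanged.
    return raw.encode("latin-1").decode("unicode-escape")


# The argument below stands in for sys.argv[1]; the raw string keeps the
# backslashes literal, just as they arrive from the shell.
text = decode_unicode_escapes(r"\u041f\u0440\u0438\u0432\u0435\u0442\u0021")
```

Here text ends up as the seven-character string from the question rather than the 42-character escaped form.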
UPDATE 2
Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this method expects a proper Python literal, including the quotes. You could do something like this to work around that:
text = ast.literal_eval("'" + sys.argv[1] + "'")
But note that this breaks if your input string contains a quote. I think it's a bit of a hack, since the method is probably not intended for this purpose. The unicode-escape approach is simpler and more robust. However, the best solution depends on what you're building.
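To make the quote problem concrete, here is a small sketch. The helper names parse_escaped and try_parse are hypothetical, and returning None on failure is just one possible choice:

```python
import ast


def parse_escaped(raw):
    """Wrap raw in single quotes and evaluate it as a Python string literal."""
    return ast.literal_eval("'" + raw + "'")


def try_parse(raw):
    """Like parse_escaped, but return None when raw does not form a valid literal."""
    try:
        return parse_escaped(raw)
    except (SyntaxError, ValueError):
        return None


ok = try_parse(r"\u041f\u0440")   # two Cyrillic letters
broken = try_parse("it's")        # the embedded quote ends the literal early
```

With these inputs, ok is a two-character string and broken is None, because the wrapped text 'it's' is no longer a single valid literal.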

Related

How to properly decode a utf-8 string with mixed-in octal escapes?

When parsing /proc/self/mountinfo on Linux, some fields of each line (each line describes one mount) may well contain UTF-8 encoded characters. Since the mountinfo line format separates fields by spaces, mountinfo escapes at least the space and \ (backslash) characters as "\040" and "\134" (literally!). How can I convert a field value ("/tmp/a\ 😀", Python string '/tmp/a\\134\\040😀') back into a non-escaped string?
Is there a better way than the following rather involved one (from https://stackoverflow.com/a/26311382)? That is, with less encoding/decoding chaining?
>>> s='/tmp/a\\134\\040😀'
>>> s.encode().decode('unicode-escape').encode('latin-1').decode('utf-8')
'/tmp/a\\ 😀'
PS: Don't ask why anyone sane would use such path names; this is just for illustrative purposes ;)
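One possible alternative with no encode/decode chaining is a regular expression that replaces each three-digit octal escape directly. This is only a sketch: unescape_mountinfo is a made-up name, and it assumes every escape stands for an ASCII character (such as space or backslash), so each escape maps to exactly one code point:

```python
import re

# A backslash followed by exactly three octal digits, e.g. \040 or \134.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def unescape_mountinfo(field):
    """Replace \\ooo octal escapes in a mountinfo field (hypothetical helper)."""
    # re.sub does not rescan its own replacements, so a decoded backslash
    # can never combine with the digits that follow it.
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), field)


s = '/tmp/a\\134\\040😀'
path = unescape_mountinfo(s)
```

Because the string is never converted to bytes, characters such as the emoji pass through untouched.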

.encode("utf-8") does not seem to support my emoji?

Sorry, this is likely a simple solution. In a server response, I am trying to return an emoji. Right now, I have this line:
return [b"Hello World " “👋”.encode(“utf-8”)]
However, I get the following error:
return [b"Hello World " “👋”.encode(“utf-8”)]
^
SyntaxError: invalid character in identifier
What I'd like to see is: Hello World 👋
The problem is that a byte string b'...' cannot contain a character which does not fit into a byte. But you don't need a byte string here anyway; encode will convert a string to bytes - that's what it does.
Try
return ["Hello World “👋”".encode("utf-8")]
The quotes in your question were wacky; I assume you want curly quotes around the emoji and honest-to-GvR quotes around the Python string literals.
In Python, you can put (almost) any Unicode character in a string literal.
Also, you can use most Unicode letters in identifiers, eg. if you think it's appropriate to define a variable α (Greek lower-case alpha).
But you can't use "fancy quotes" for delimiting string literals.
Look carefully: the double quotes around the emoji (and also around utf-8) aren't straight ASCII quotes, but rather typographical ones – the ones word processors substitute when typing a double quote in text.
Make sure you use a proper programming editor or an IDE for coding.
Then the line will look like this:
return [b"Hello World " "👋".encode("utf-8")]
Edit:
I realise that this still doesn't work: you cannot mix byte string and Unicode string literals (even though here the Unicode literal gets converted to bytes later).
Instead, you have to concatenate them with the + operator:
return [b"Hello World " + "👋".encode("utf-8")]
Or use a single string literal, like tripleee suggested.
There are several problems. Your code:
return [b"Hello World " “👋”.encode(“utf-8”)]
You are using “ and ” twice, instead of the proper double quote character ("). You should use a proper code editor, or disable automatic character conversion. Also be careful when copying and pasting code.
As the error shows, the problem is not the emoji in the string but an invalid identifier (a syntax error): Python found characters it does not recognize outside of any string.
But if you correct to:
return [b"Hello World " "👋".encode("utf-8")]
you still get an error: SyntaxError: cannot mix bytes and nonbytes literals.
This is because Python merges adjacent string literals before calling encode, but one is a b-string and the other is a normal string.
So you could use one of the following:
return [b"Hello World " + "👋".encode("utf-8")] # the + operator makes the order of operations explicit
or encode a single string literal as a whole:
return ["Hello World 👋".encode("utf-8")]
Note that you cannot put the emoji directly into a b-string, even though Python 3 uses UTF-8 as the default source encoding: bytes literals may contain only ASCII characters, so b"Hello World 👋" fails with SyntaxError: bytes can only contain ASCII literal characters.
Non-ASCII bytes have to be written with escapes (for example b"\xf0\x9f\x91\x8b" for this emoji in UTF-8) or produced with encode().
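As a quick sketch of the concatenation approach (hello_body and mixing_literals_fails are hypothetical names used only for this illustration), the + version produces a single bytes object, while adjacent bytes and str literals are rejected when the code is compiled:

```python
def hello_body():
    # Encode the str part first, then join the two bytes objects with +.
    return [b"Hello World " + "👋".encode("utf-8")]


def mixing_literals_fails():
    # Adjacent bytes and str literals are merged at compile time, which
    # fails before .encode() could ever run.
    try:
        compile('b"Hello World " "wave".encode("utf-8")', "<demo>", "eval")
    except SyntaxError:
        return True
    return False


body = hello_body()
```

Decoding body[0] as UTF-8 gives back the greeting with the emoji at the end.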

Python 3: unescape literal hex, unicode and python escapes

What is the proper way to do the following:
accept any unicode string (this gets read in from a file that contains utf-8 encoded strings)
if the string contains literal escapes (sequences) of the form \xA0\xB0 (so in Python this will show as \\xA0\\xB0) replace them with the actual characters or with some fallback (e.g. a space) if this is an invalid code
if the string contains literal unicode character escapes of the form \u0008 (so in Python this will show as \\u0008) replace with the actual character represented by that code
if the string contains Python string escapes of the form \n or \t (so in Python this will show as \\n and \\t), replace with the actual character represented by that code
As a bonus, also replace all HTML entities of the form &#a0;
all actual unicode characters should remain unchanged
Basically, something that converts back all the rubbish one sometimes gets when people have created a file in the wrong way and it now contains the literal escapes instead of the code represented by those escapes.
This has GOT to be a solved problem, but all I can find are rather clumsy and incomplete solutions.
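One way this could be approached is a single regular expression pass followed by html.unescape from the standard library. The sketch below is not a complete solution: unescape_text is a made-up name, it treats \xHH as the code point U+00HH (an assumption), it only knows the simple escapes \n, \r, \t and \\, and it only handles standard HTML character references.

```python
import html
import re

# \xHH, \uHHHH, or one of a few simple escapes, all as literal text.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|[nrt\\])")
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}


def unescape_text(text):
    """Replace literal escape sequences and HTML references (hypothetical helper)."""

    def replace(match):
        body = match.group(1)
        if len(body) > 1:
            # x41 or u0041: drop the prefix letter and parse the hex digits.
            return chr(int(body[1:], 16))
        return _SIMPLE[body]

    # Real (non-escaped) Unicode characters never match and stay unchanged.
    return html.unescape(_ESCAPE.sub(replace, text))


cleaned = unescape_text(r"caf\xe9 \u0041\tend &#xa0;&amp;")
```

Anything that does not match one of the patterns, such as a lone backslash followed by some other letter, is left as it is.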

Convert Unicode Escape to Hebrew text

I have the following text in a JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
i get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which, as bytes, is almost the correct text. If I were able to remove one backslash from each pair and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how I changed each double backslash to a single backslash) then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so and couldn't manage to write code that does this for me (rather than manually, as I just showed).
Any help is much appreciated.
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
    .encode('latin-1')
    .decode('raw_unicode_escape')
    .encode('latin-1')
    .decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer would already interpret the escape sequences for you. That is, after the text element is loaded by the JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; the latin-1 step might not work properly any more. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII, because otherwise it would only make everything worse.

Why doesn't print() in Python 3.2 seem to be defaulting to UTF-8?

I'm writing scripts to clean up unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that the print statement, at least, still is not defaulting to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:
print(label, msg)
... in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position 38: character maps to <undefined>
If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:
print(label, msg.encode())
This also works for printing objects or lists containing unicode strings--something I often have to do when debugging--since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj) ? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.
Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)
Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*
However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())
b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'
I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.
print doesn't default to any encoding; it uses whatever encoding the output device (like a console) claims to support, which Python exposes as sys.stdout.encoding. Your console encoding appears to be non-Unicode, so print tries to encode your Unicode strings in that encoding, and fails. The easiest way to get around this is to tell the console to use UTF-8 (for example with export LC_ALL=en_US.UTF-8 on Unix systems).
A cleaner way to proceed is to use only Unicode strings inside your script, and to deal with encoded data only when you interact with the "outside" world. That is, when you have input to decode or output to encode.
To do so, every time you read something, decode it, and every time you output something, encode it.
For your regex, use the re.UNICODE flag (in Python 3 it is already the default for str patterns).
I know that this doesn't exactly answer your question point by point, but I think that applying such a methodology should keep you safe from encoding issues.
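To see how the stream's encoding, rather than print itself, decides the outcome, here is a small sketch. encode_like_print is a hypothetical helper that prints into an in-memory stream; cp1252 is only an assumed example of a 'charmap' console encoding, since the question does not say which one was in use:

```python
import io


def encode_like_print(text, encoding, errors="strict"):
    """Return the bytes print() emits on a stream with the given encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)  # same call as usual, only the target differs
    stream.flush()
    return buffer.getvalue()


as_utf8 = encode_like_print("\u0968", "utf-8")
as_escaped = encode_like_print("\u0968", "ascii", errors="backslashreplace")

try:
    encode_like_print("\u0968", "cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True
```

The same character prints fine on a UTF-8 stream and fails on the charmap one. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") changes the encoding of the real stdout from inside the script, and the PYTHONIOENCODING environment variable does so from outside; whether the console then displays the characters correctly still depends on the console.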

Resources