What is the proper way to do the following:
accept any unicode string (this gets read in from a file that contains utf-8 encoded strings)
if the string contains literal escape sequences of the form \xA0\xB0 (so in Python this will show as \\xA0\\xB0), replace them with the actual characters, or with some fallback (e.g. a space) if the code is invalid
if the string contains literal unicode character escapes of the form \u0008 (so in Python this will show as \\u0008), replace them with the actual character represented by that code
if the string contains Python string escapes of the form \n or \t (so in Python this will show as \\n and \\t), replace them with the actual character represented by that code
As a bonus, also replace all HTML entities of the form &#a0;
all actual unicode characters should remain unchanged
Basically, something that converts back all the rubbish one sometimes gets when people have created a file in the wrong way and it now contains the literal escapes instead of the code represented by those escapes.
This has GOT to be a solved problem, but all I can find are rather clumsy and incomplete solutions.
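One way to cover the cases listed above in a single pass is a regular expression combined with html.unescape from the standard library. This is only a minimal sketch: the name unescape_literal and its fallback parameter are made up for illustration, only \xNN, \uNNNN, \UNNNNNNNN, \n, \t, \r and \\ are recognised, HTML entities are limited to the forms html.unescape understands (such as &#xa0; or &#160;), and surrogate pairs written as two \u escapes are not recombined.

```python
import html
import re

# Simple one-letter escapes handled by this sketch.
_SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\"}

# One alternative per escape style; exactly one group is set per match.
_ESCAPE_RE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|([ntr\\]))"
)


def unescape_literal(text, fallback=" "):
    """Replace literal backslash escapes and HTML entities in text.

    Real (already decoded) Unicode characters are left untouched.
    """

    def repl(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits is not None:
            try:
                return chr(int(hex_digits, 16))
            except (ValueError, OverflowError):
                # Code point outside the Unicode range: use the fallback.
                return fallback
        return _SIMPLE[match.group(4)]

    # Backslash escapes first, then HTML entities.
    return html.unescape(_ESCAPE_RE.sub(repl, text))
```

Because the substitution runs in a single left-to-right pass, an escaped backslash followed by n stays a backslash and an n instead of turning into a newline.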
Related
When parsing /proc/self/mountinfo on Linux, some fields of each line (each line describes one mount) may very well contain UTF-8 encoded characters. Since mountinfo separates fields by spaces, it escapes at least " " (space) and "\" (backslash) as "\040" and "\134" (literally!). How can I convert a field value ("/tmp/a\ π", Python string '/tmp/a\\134\\040π') back into a non-escaped string?
Is there a better way than the following rather involved one (from https://stackoverflow.com/a/26311382)? That is, with less encoding/decoding chaining?
>>> s='/tmp/a\\134\\040π'
>>> s.encode().decode('unicode-escape').encode('latin-1').decode('utf-8')
'/tmp/a\\ π'
PS: Don't ask why anyone sane would use such path names; this is just for illustrational purposes ;)
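If only the octal escapes need undoing, a single regular-expression substitution avoids the encode/decode chain. The following is a sketch under the assumption that mountinfo only ever escapes ASCII characters (such as space and backslash) as a backslash followed by three octal digits; unescape_mountinfo is an illustrative name, not an existing API.

```python
import re

# A backslash followed by exactly three octal digits, e.g. \040 or \134.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def unescape_mountinfo(field):
    """Turn \\NNN octal escapes in a mountinfo field back into characters.

    Assumes every escaped value is an ASCII code point; non-ASCII text
    that is already decoded (such as 'π') passes through unchanged.
    """
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), field)
```

The substitution is done in one pass, so the backslash produced from \134 is not re-read as the start of another escape.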
From command line parameters (sys.argv) I receive a string of Unicode escape sequences like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'
For example this script uni.py:
import sys
print(sys.argv[1])
command line:
python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021
output:
\u041f\u0440\u0438\u0432\u0435\u0442\u0021
I want to convert it to the unicode string 'Привет!'
You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.
If you just print the string, you'll get the correct result, assuming your terminal supports the characters.
print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')
This will print:
Привет!
UPDATE
After updating your question it became clear to me that the mentioned string is not really a string literal (or unicode literal), but input from the command line. In that case you could use the "unicode-escape" encoding to get the result you want.
Note that encoding works from Unicode to bytes, and decoding works from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which transparently converts Unicode code points up to U+00FF to bytes.
The following code will print the correct result for your example:
import sys
text = sys.argv[1].encode('latin-1').decode('unicode-escape')
print(text)
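One caveat: encode('latin-1') raises UnicodeEncodeError as soon as the argument contains a character above U+00FF. A possible workaround, sketched here with an illustrative helper name (decode_escapes), is the backslashreplace error handler, which turns such characters into escapes that the unicode-escape decoder then restores.

```python
def decode_escapes(arg):
    """Decode literal \\uXXXX-style escapes in a str.

    Characters that do not fit into latin-1 are first rewritten as
    backslash escapes, so they survive the round trip unchanged.
    """
    as_bytes = arg.encode("latin-1", "backslashreplace")
    return as_bytes.decode("unicode-escape")
```

With this variant, input that mixes real Cyrillic letters with literal escapes is handled as well as input made up of escapes only.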
UPDATE 2
Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this method expects a proper Python literal, including the quotes. You could do something like this to solve it:
text = ast.literal_eval("'" + sys.argv[1] + "'")
But note that this would break if your input string contained a quote. I think it's a bit of a hack, since the method is probably not intended for this purpose. The unicode-escape approach is simpler and more robust. However, what the best solution is depends on what you're building.
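To make the quote problem concrete, here is a small sketch. The wrapper name parse_as_literal is hypothetical, and returning None on failure is just one possible choice.

```python
import ast


def parse_as_literal(raw):
    """Wrap raw in single quotes and parse it as a Python string literal.

    Returns None when the wrapped text is not a valid literal, which
    happens for example when raw itself contains a single quote.
    """
    try:
        return ast.literal_eval("'" + raw + "'")
    except (SyntaxError, ValueError):
        return None
```

An argument such as it's ends the literal early, so the parser rejects it and the wrapper returns None.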
Sorry, this is likely a simple solution. In a server response, I am trying to return an emoji. Right now, I have this line:
return [b"Hello World " “π”.encode(“utf-8”)]
However, I get the following error:
return [b"Hello World " “π”.encode(“utf-8”)]
^
SyntaxError: invalid character in identifier
What I'd like to see is: Hello World π
The problem is that a byte string b'...' cannot contain a character which does not fit into a byte. But you don't need a byte string here anyway; encode will convert a string to bytes - that's what it does.
Try
return ["Hello World “π”".encode("utf-8")]
The quotes in your question were wacky; I assume you want curly quotes around the emoji and honest-to-GvR quotes around the Python string literals.
In Python, you can put (almost) any Unicode character in a string literal.
Also, you can use most Unicode letters in identifiers, e.g. if you think it's appropriate to define a variable α (Greek lower-case alpha).
But you can't use "fancy quotes" for delimiting string literals.
Look carefully: the double quotes around the emoji (and also around utf-8) aren't straight ASCII quotes, but rather typographical ones, the ones word processors substitute when typing a double quote in text.
Make sure you use a proper programming editor or an IDE for coding.
Then the line will look like this:
return [b"Hello World " "π".encode("utf-8")]
Edit:
I realise that this still doesn't work: you cannot mix byte string and Unicode string literals (even though here the Unicode literal gets converted to bytes later).
Instead, you have to concatenate them with the + operator:
return [b"Hello World " + "π".encode("utf-8")]
Or use a single string literal, like tripleee suggested.
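The two working forms mentioned in this answer produce the same bytes. A small check, with hypothetical function names and with the emoji written as an escape (U+1F600 is only a stand-in here, since any emoji behaves the same way):

```python
# Stand-in emoji, written as an escape so the source stays ASCII.
EMOJI = "\U0001F600"


def body_concat():
    """Bytes literal joined with the encoded emoji via the + operator."""
    return [b"Hello World " + EMOJI.encode("utf-8")]


def body_single():
    """One str value, encoded to bytes in a single step."""
    return [("Hello World " + EMOJI).encode("utf-8")]
```

Both functions return a one-element list of bytes, which decodes back to the original text with .decode("utf-8").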
There are several problems. Your code:
return [b"Hello World " “π”.encode(“utf-8”)]
As you can see, you are using “ and ” twice, instead of the proper double quote character ("). You should use a proper code editor, or disable automatic character conversion. Take care when copying and pasting, too.
As the error shows, the problem is not the emoji in the string but an identifier problem (it is a syntax error): the curly quotes are unknown characters outside any string.
But if you correct to:
return [b"Hello World " "π".encode("utf-8")]
you still get an error: SyntaxError: cannot mix bytes and nonbytes literals.
This is because Python merges adjacent string literals before calling encode, but one is a b-string and the other is a normal string.
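That this error arises at compile time can be checked without running any server code. The helper below (compiles, an illustrative name) feeds source text to the built-in compile() and reports whether it parses. The emoji is replaced by a plain ASCII letter to keep the example focused on the mixing of literal types.

```python
def compiles(source):
    """Return True if source is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# Adjacent bytes and str literals are merged first, which is rejected.
mixed = 'b"Hello World " "x".encode("utf-8")'
# With an explicit +, encode() applies to the str literal only.
explicit = 'b"Hello World " + "x".encode("utf-8")'
```

Here compiles(mixed) is False and compiles(explicit) is True.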
So you could use one of the following:
return [b"Hello World " + "π".encode("utf-8")] # the + forces the order of operations
or encode a single string literal:
return ["Hello World π".encode("utf-8")]
Note that the character cannot go directly inside a b-string: bytes literals accept only ASCII characters, so b"Hello World π" fails with SyntaxError: bytes can only contain ASCII literal characters.
Python 3 uses UTF-8 as the default source encoding, but that only determines how the source file is decoded; it does not make non-ASCII characters legal in a b-string.
Note: one can also force source code to be in another encoding, in which case the file must actually be saved in that encoding for the emoji in the string literal to be read correctly.
UPDATE:
The right terminology has been escaping me. What I am looking for is a function that converts multi-line ASCII text into ASCII escaped form.
Is there a standard function that converts multi-line text into ASCII escaped form and vice versa?
I need to store multi-line text as name=value pairs, basically like .ini files, where the value is ASCII escaped text which fits on a single line, but I prefer a format that doesn't use numeric codes to express the non-printing characters, if such a format exists.
The multi-line text can be long, up to 65K in length.
How about using Base64?
Base64 is used, for example, to encode e-mail attachments. It can convert any kind of data into a string drawn from 64 characters (upper- and lower-case letters (52), the digits 0 to 9 (10), "+" and "/"), plus "=" for padding.
Large pictures (over 1 MB) can be encoded with Base64, so 65K characters should not cause trouble.
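As an illustration of the idea, here is a sketch in Python using the standard base64 module. The function names are made up for this example, and UTF-8 is an assumed text encoding.

```python
import base64


def to_single_line(text):
    """Encode multi-line text as one line of Base64 (ASCII only)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_single_line(value):
    """Reverse to_single_line and recover the original text."""
    return base64.b64decode(value).decode("utf-8")
```

The encoded value is about a third longer than the input and may end in "=" padding, so a name=value parser should split on the first "=" only.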
In Windows .ini files, you can use the whole section to store multiline data.
[key1]
several lines
of data
[key2]
another
Read it with GetPrivateProfileSection. To get a list of keys, use GetPrivateProfileSectionNames.
I have the following string, read from an XML attribute:
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
In my C# application it shows up nicely formatted like this
"OnTrak 4-3/4”, 6-3/4”, 8-1/4” / MPR"
This is the form I see in the debugger, a combobox, or on this forum (if I don't indent to specify code).
What I want to do is specify the same string as a C# variable and have it show up nicely formatted when the application runs. Unfortunately, all I get is the string as I literally typed it.
I have tried to play around with converting the encoding from ASCII to UTF8 with no luck. How can I get this special character properly formatted, and where can I find a list of these symbols?
Those are called XML entities. Use HttpUtility.HtmlDecode to decode them back to plain text. Credit goes to the question "C#, function to replace all html special characters with normal text characters" for how to convert entities in C#.
Note that converting from ASCII to UTF8 (and Unicode etc.) is called changing the character set and is usually done when specific characters are in the string. For instance, if your strings contained Chinese characters you couldn't use ASCII. In this simple case you shouldn't need to convert character sets, because C# strings use the Unicode character set by default and XML entities are Unicode based (I believe).