Decode a Unicode-escaped character from a PyQt5 QLabel widget? - python-3.x

I am trying to read text from a QLineEdit that might contain Unicode escape sequences, show it in a QLabel, and have the proper characters displayed, using PyQt5 and Python 3.4.
I tried many different things that I read here on Stack Overflow but couldn't find a working solution for Python 3.
def on_pushButton_clicked(self):
    text = self.lineEdit.text()
    self.label.setText(text)
Now if I do something like this:
decodedText = str("dsfadsfa \u2662 \u8f1d \u2662").encode("utf-8")
self.label.setText(decodedText.decode("utf-8"))
This does display the proper characters. If I apply the same approach in the method above, I get the escape sequences instead.
I don't understand the difference between the str returned by QLineEdit's text() and str("\u2662"). Why does one produce the proper characters while the other doesn't?

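The difference is that "\u2662" isn't a string containing a Unicode escape; it's a string literal containing a Unicode escape, which Python converts to the character ♢ when it parses the source. A string that actually contains the same escape, which is what text() returns when the user types it, would be written "\\u2662".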
>>> import codecs
>>> codecs.getdecoder('unicode-escape')('\\u2662')
('♢', 6)
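Applied to the question, the text returned by QLineEdit's text() has to be decoded explicitly before it is passed to setText(). What follows is a minimal sketch, assuming only four-digit \uXXXX escapes need handling; decode_unicode_escapes is a hypothetical helper name, not part of PyQt5 or the standard library. It uses a regular expression rather than the unicode-escape codec so that non-ASCII characters already present in the input pass through unchanged.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_unicode_escapes(text):
    """Replace each literal \\uXXXX sequence in text with its character."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# A raw string stands in for what the user typed into the QLineEdit.
typed = r'dsfadsfa \u2662 \u8f1d \u2662'
print(decode_unicode_escapes(typed))
```

In the slot from the question, this would be used as self.label.setText(decode_unicode_escapes(self.lineEdit.text())). Escapes for characters outside the Basic Multilingual Plane are not covered by this sketch.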

Related

How to delete unwanted icon characters in Python

Is there a way to strip icon characters from a string automatically?
input: This is String 💋👌✅this is string✅✍️string✍️✔️
desired output: This is String this is stringstring
replace('💋👌✅', '') is not an option, because the icon characters differ from string to string and we don't know them in advance.
Try this:
import re
# Compiled once: miscellaneous symbols and dingbats, pictographs and emoticons, transport symbols.
RE_EMOJI = re.compile('[\U00002600-\U000027BF]|[\U0001f300-\U0001f64F]|[\U0001f680-\U0001f6FF]')
def strip_emoji(text):
    return RE_EMOJI.sub('', text)
print(strip_emoji('This is String 💋👌✅this is string✅✍️string✍️✔️'))
Consider using the re module in Python to replace characters that you don't want. Something like:
import re
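# Keep only ASCII letters and whitespace. Inside [...] the characters (, | and ) are literal, so they are left out.
re.sub(r'[^a-zA-Z\s]', '', my_string)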
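Another option is to filter by Unicode category instead of listing code-point ranges. This is a minimal sketch, assuming the icons to remove all fall in the "Symbol, other" (So) category; strip_symbols is a hypothetical name. It also drops the variation selectors U+FE00 to U+FE0F, which follow some emoji invisibly and would otherwise stay in the string.

```python
import unicodedata

def strip_symbols(text):
    """Drop 'Symbol, other' characters and emoji variation selectors."""
    kept = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            continue  # pictographs, dingbats and similar symbols
        if '\ufe00' <= ch <= '\ufe0f':
            continue  # variation selectors that trail some emoji
        kept.append(ch)
    return ''.join(kept)

print(strip_symbols('This is String 💋👌✅this is string✅✍️string✍️✔️'))
```

Unlike the ASCII-only regular expression, this keeps accented and non-Latin letters, but it also removes non-emoji symbols in the same category, such as © and °.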

Removing Narrow 'No-Break Space' Unicode Characters (U+00A0) in python nlp

Non-breaking spaces are printed as whitespace, but handled internally as \xa0. How do I remove all these characters at once?
So far I've replaced it directly:
text = text.replace('\u202f','')
text = text.replace('\u200d','')
text = text.replace('\xa0','')
But each time I scrape sentences from an external source, these characters are different. How do I remove them all at once?
You can use regular expression substitution instead.
If you want to replace all whitespace, you can just use:
import re
text = re.sub(r'\s', '', text)
This includes all Unicode whitespace, as described in the answer to a related question.
According to that answer (at the time of writing), the Unicode code points recognized as whitespace (i.e. matched by \s) in Python regular expressions are these:
0x0009
0x000A
0x000B
0x000C
0x000D
0x001C
0x001D
0x001E
0x001F
0x0020
0x0085
0x00A0
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202F
0x205F
0x3000
This looks as if it will fit your needs.
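Two caveats apply when using \s on scraped text: it also matches ordinary spaces and newlines, and U+200D from the question is a zero-width joiner rather than whitespace, so \s does not match it. Here is a minimal sketch that handles both, assuming the goal is to keep words separated by a single plain space; normalize_spaces and the list of zero-width characters are illustrative choices, not taken from the answer.

```python
import re

# Zero-width characters that \s does not match (assumed list, extend as needed).
_ZERO_WIDTH = re.compile('[\u200b\u200c\u200d\u2060\ufeff]')
_WHITESPACE_RUN = re.compile(r'\s+')

def normalize_spaces(text):
    """Remove zero-width characters and collapse Unicode whitespace to ' '."""
    text = _ZERO_WIDTH.sub('', text)
    return _WHITESPACE_RUN.sub(' ', text).strip()

print(normalize_spaces('scraped\xa0text\u202fwith\u200d odd  spaces'))
```

Note that this also collapses newlines into spaces; if line breaks matter, the pattern would need to exclude them.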

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Is there any way to tell RequestParser which encoding the string uses?
Here is my code for writing the text to disk:
with open('log_msg.txt', 'w+') as f:
    f.write(json.dumps(msg))
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So I'm pretty sure that RequestParser() tries to be too smart and can't handle the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter arrives as UTF-8. So when it sees a field in another encoding (among other UTF-8 fields!), it tries to decode it as UTF-8 and fails. As a result, every Cyrillic character becomes U+FFFD (the replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
import re
from flask import request
# Inside the request handler:
raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.
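The symptom can be reproduced without Flask. The following is a minimal sketch, assuming the parser decodes the bytes as UTF-8 and substitutes invalid bytes; that matches the description above but is not confirmed from the reqparse source. The sample text is chosen only for illustration.

```python
# Sample Cyrillic text, chosen only for illustration.
original = 'Привет !'
raw = original.encode('cp1251')  # bytes as the client would send them

# Decoding with the wrong codec and errors='replace' loses the Cyrillic letters.
as_utf8 = raw.decode('utf-8', errors='replace')
# Decoding with the codec the sender used recovers the text.
as_cp1251 = raw.decode('cp1251')

print(ascii(as_utf8))
print(as_cp1251)
```

Once the bytes have been replaced with U+FFFD the original text cannot be recovered, which is why re-encoding the parsed value or changing the output file's encoding made no difference; the raw request body has to be decoded with the right codec instead.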

How to display chinese character in 65001 in python?

I am on Windows 7 with Python 3.3.
import os
os.system("chcp 936")
fh=open("test.ch","w",encoding="utf-8")
fh.write("你")
fh.close()
os.system("chcp 65001")
fh=open("test.ch","r",encoding="utf-8").read()
print(fh)
Äã
>>> print(fh.encode("utf-8"))
b'\xe4\xbd\xa0'
How can I display the Chinese character 你 under code page 65001?
If your terminal is capable of displaying the character directly (which it may not be due to font issues) then it should Just Work(tm).
>>> hex(65001)
'0xfde9'
>>> u"\ufde9"
'\ufde9'
>>> print(u"\ufde9")
﷩
To avoid the use of literals, note that in Python 3, at least, the chr() function will take a code point and return the associated Unicode character. So this works too, avoiding the need to do hex conversions.
>>> print(chr(65001))
﷩
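Note that 65001 in the question is a Windows code page number (the one for UTF-8), whereas chr() takes a Unicode code point, so the two numbers mean different things. The sketch below separates the code point of 你 from its byte encodings. The last step is one plausible reading of the garbled output in the question, namely GBK bytes (code page 936) being shown as Latin-1 characters; that is an assumption, not something the source confirms.

```python
ch = '你'

print(hex(ord(ch)))        # the code point, independent of any encoding
print(ch.encode('utf-8'))  # bytes under code page 65001 (UTF-8)
print(ch.encode('gbk'))    # bytes under code page 936 (GBK)

# Reading the GBK bytes as Latin-1 reproduces the garbled text from the question.
garbled = ch.encode('gbk').decode('latin-1')
print(garbled)
```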

How to display international scripts in QLabels?

I would like to display Indic, Arabic and Hebrew scripts in a QLabel, specifying the font type. When I try to pass a UTF-8 encoded string into a QLabel, the script is not rendered properly.
What is the correct way to display international (non-alphabetic) scripts in a QLabel?
Setting the text of a QLabel to a unicode string (unicode in python2, str in python3) should work fine.
In python2 you can use QString.fromUtf8 to convert a utf8-encoded str to a unicode QString, or .decode('utf-8') to get a python unicode object.
In PyQt4 on python3 QString is gone, since str is already unicode, so just use that.
For instance, in python2 with PyQt4 (in python3, s is already unicode and the decode step is dropped):
s = "اردو"
d = s.decode('utf-8')
label = QtGui.QLabel(d)
font = QtGui.QFont("Sheherazade", 40)
label.setFont(font)
