How to display international scripts in QLabels?

I would like to display Indic, Arabic and Hebrew scripts in a QLabel, specifying the font type. When I try to pass a UTF-8 encoded string into a QLabel, the script is not rendered properly.
What is the correct way to display international (non-alphabetic) scripts in a QLabel?

Setting the text of a QLabel to a unicode string (unicode in Python 2, str in Python 3) should work fine.
In Python 2 you can use QString.fromUtf8 to convert a UTF-8-encoded str to a unicode QString, or .decode('utf-8') to get a Python unicode object.
In PyQt4 on Python 3, QString is gone, since str is already unicode — so just use that.
For instance:
# Python 2 / PyQt4
from PyQt4 import QtGui

s = "اردو"              # a UTF-8 encoded byte string (plain str in Python 2)
d = s.decode('utf-8')   # decode to a unicode object
label = QtGui.QLabel(d)
font = QtGui.QFont("Sheherazade", 40)
label.setFont(font)
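In Python 3 the decode step disappears, because str is already unicode. A minimal sketch of the str/bytes relationship (plain Python, no Qt needed):

```python
s = "اردو"                 # already a str (unicode) in Python 3
b = s.encode("utf-8")      # bytes — what Python 2's plain str would hold
assert b.decode("utf-8") == s
print(len(s), len(b))      # 4 characters, 8 bytes
```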


Decode a Python string

Sorry for the generic title.
I am receiving a string from an external source: txt = external_func()
I am copying/pasting the output of various commands to make sure you see what I'm talking about:
In [163]: txt
Out[163]: '\\xc3\\xa0 voir\\n'
In [164]: print(txt)
\xc3\xa0 voir\n
In [165]: repr(txt)
Out[165]: "'\\\\xc3\\\\xa0 voir\\\\n'"
I am trying to transform that text to UTF-8 (?) to have txt = "à voir\n", and I can't see how.
How can I do transformations on this variable?
You can encode your txt to a bytes-like object using the str.encode method.
This bytes object can then be decoded with the unicode_escape codec.
Now you have your string with all escape sequences parsed — but decoded as Latin-1. You still have to encode it with latin-1 and then decode it again as utf-8.
>>> txt = '\\xc3\\xa0 voir\\n'
>>> txt.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'à voir\n'
The codecs module also has an undocumented function called escape_decode:
>>> import codecs
>>> codecs.escape_decode(bytes('\\xc3\\xa0 voir\\n', 'utf-8'))[0].decode('utf-8')
'à voir\n'
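Broken into individual steps, the chain in the first snippet does the following:

```python
txt = '\\xc3\\xa0 voir\\n'              # literal backslash sequences, not real bytes
step1 = txt.encode('utf-8')             # bytes, escapes still literal
step2 = step1.decode('unicode_escape')  # escapes parsed, but bytes read as Latin-1: 'Ã\xa0 voir\n'
step3 = step2.encode('latin-1')         # back to the real UTF-8 bytes: b'\xc3\xa0 voir\n'
print(step3.decode('utf-8'))            # à voir
```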

python curses addstr y-offset: strange behavior with unicode

I am having trouble with python3 curses and unicode:
#!/usr/bin/env python3
import curses
import locale

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

def doStuff(stdscr):
    offset = 3
    stdscr.addstr(0, 0, "わたし")
    stdscr.addstr(0, offset, 'hello', curses.A_BOLD)
    stdscr.getch()  # pauses until a key is hit

curses.wrapper(doStuff)
I can display unicode characters just fine, but the y-offset argument to addstr ("offset" in my code) is not acting as expected; my screen displays "わたhello" instead of "わたしhello"
In fact the offset has very strange behavior:
- 0:hello
- 1:わhello
- 2:わhello
- 3:わたhello
- 4:わたhello
- 5:わたしhello
- 6:わたしhello
- 7:わたし hello
- 8:わたし hello
- 9:わたし hello
Note that the offset is not in bytes, since the characters are 3-byte unicode characters:
>>> len("わ".encode('utf-8'))
3
>>> len("わ")
1
I'm running python 4.8.3 and curses.version is "b'2.2'".
Does anyone know what's going on or how to debug this?
Thanks in advance.
You're printing 3 double-width characters. That is, each of those takes up two cells.
The length of the string in characters (or bytes) is not necessarily the same as the number of cells used for each character.
Python curses is just a thin layer over ncurses.
I'd expect the characters at offsets 1, 3 and 5 to be erased by putting a character onto the second cell of those double-width characters (ncurses is supposed to do this...), but that detail could be a bug in the terminal emulator.
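You can check whether a character is double-width without third-party packages via the standard library's unicodedata module ('W' means East Asian Wide, i.e. two cells):

```python
import unicodedata

print(unicodedata.east_asian_width("わ"))  # 'W'  — wide, occupies two cells
print(unicodedata.east_asian_width("a"))   # 'Na' — narrow, one cell
```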
Based on the response from Thomas, I found the wcwidth package (https://pypi.python.org/pypi/wcwidth) which has a function to return the length of a unicode string in cells.
Here's a full working example:
#!/usr/bin/env python3
import curses
import locale
from wcwidth import wcswidth

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

def doStuff(stdscr):
    foo = "わたし"
    offset = wcswidth(foo)  # width in cells, not in characters
    stdscr.addstr(0, 0, foo)
    stdscr.addstr(0, offset, 'hello', curses.A_BOLD)
    stdscr.getch()  # pauses until a key is hit

curses.wrapper(doStuff)

How to handle utf-8 text with Python 3?

I need to parse various text sources and then print / store it somewhere.
Every time a non-ASCII character is encountered, I can't print it correctly, as it gets converted to bytes, and I have no idea how to get back the correct characters.
(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)
The following is a code example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')
print(title)
file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()
I'd like to print and write in a file the RSS title (BBC Japanese - ホーム) but instead the result is this:
b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'
Both on screen and file. Is there a proper way to do this ?
In Python 3, bytes and str are two different types, and str is used to represent any kind of text (it is already unicode). When you encode() something, you convert its str representation to its bytes representation for a specific encoding.
In your case, in order to get the decoded strings, you just need to remove the encode('utf-8') part:
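For example, the bytes repr shown in the question is exactly what encoding produces, and decoding reverses it:

```python
title = "BBC Japanese - ホーム"
encoded = title.encode("utf-8")
print(encoded)                  # b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'
print(encoded.decode("utf-8"))  # BBC Japanese - ホーム
```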
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')
print(title)
file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()
On a Windows console set to a Chinese code page, print() in Python 3 encodes the string with the console's encoding ('gbk'), and characters that cannot be represented in it cause an error.
To drop such characters before printing, you can round-trip the string through gbk with errors='ignore':
print(A.encode('gbk', 'ignore').decode('gbk'))
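The errors='ignore' round-trip simply drops whatever the target codec cannot represent; the same pattern illustrated with ASCII as the target encoding:

```python
s = "à voir"
print(s.encode('ascii', 'ignore').decode('ascii'))  # ' voir' — the 'à' is dropped
```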
JSON data to Unicode support for Japanese characters
import json

def jsonFileCreation(messageData, fileName):
    with open(fileName, "w", encoding="utf-8") as outfile:
        json.dump(messageData, outfile, indent=8, sort_keys=False, ensure_ascii=False)
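With ensure_ascii=False, the json module keeps non-ASCII characters as-is instead of escaping them; a quick check with json.dumps:

```python
import json

data = {"greeting": "こんにちは"}
print(json.dumps(data, ensure_ascii=False))  # {"greeting": "こんにちは"}
print(json.dumps(data))                      # {"greeting": "\u3053\u3093\u306b\u3061\u306f"}
```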

Decode an Unicode escaped character from PyQt5 QLabel widget?

I am trying to read in a text sequence from a QLineEdit that might contain Unicode escape sequence and print it out to a QLabel and display the proper character in PyQt5 and Python 3.4.
I tried many different things that I read here on stackoverflow but couldn't find a working solution for Python 3.
def on_pushButton_clicked(self):
    text = self.lineEdit.text()
    self.label.setText(text)
Now if I do something like this:
decodedText = str("dsfadsfa \u2662 \u8f1d \u2662").encode("utf-8")
self.label.setText(decodedText.decode("utf-8"))
This does print out the proper characters. If I apply the same to the above method I get the escaped sequences.
I don't get what is the difference between the str() returned by QLineEdit's text() and the str("\u2662"). Why does the one encode the characters properly and the other one doesn't?
The difference is that "\u2662" isn't a string with a Unicode escape, it's a string literal with a Unicode escape. A string with the same Unicode escape would be "\\u2662".
3>> codecs.getdecoder('unicode-escape')('\\u2662')
('♢', 6)
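So to turn the escaped text a user typed into a QLineEdit into the real characters, decode it with the unicode_escape codec — the conversion itself is plain Python:

```python
import codecs

typed = "\\u2662"  # what the user typed: backslash, 'u', four hex digits (6 characters)
real = codecs.decode(typed, "unicode_escape")
print(real)        # ♢
```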

How to display Chinese characters in 65001 in Python?

I am on Win7 + Python 3.3.
import os
os.system("chcp 936")
fh=open("test.ch","w",encoding="utf-8")
fh.write("你")
fh.close()
os.system("chcp 65001")
fh=open("test.ch","r",encoding="utf-8").read()
print(fh)
Äã
>>> print(fh.encode("utf-8"))
b'\xe4\xbd\xa0'
How can I display the Chinese character 你 in 65001?
If your terminal is capable of displaying the character directly (which it may not be due to font issues) then it should Just Work(tm).
>>> hex(65001)
'0xfde9'
>>> u"\ufde9"
'\ufde9'
>>> print(u"\ufde9")
﷩
To avoid the use of literals, note that in Python 3, at least, the chr() function will take a code point and return the associated Unicode character. So this works too, avoiding the need to do hex conversions.
>>> print(chr(65001))
﷩
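chr() and ord() are inverses, which makes it easy to go back and forth between code points and characters:

```python
c = chr(65001)      # the character at code point U+FDE9
print(hex(ord(c)))  # 0xfde9
assert c == "\ufde9"
```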
