I am having trouble with python3 curses and unicode:
#!/usr/bin/env python3
import curses
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
def doStuff(stdscr):
    offset = 3
    stdscr.addstr(0, 0, "わたし")
    stdscr.addstr(0, offset, 'hello', curses.A_BOLD)
    stdscr.getch() # pauses until a key's hit
curses.wrapper(doStuff)
I can display Unicode characters just fine, but the x-offset (column) argument to addstr ("offset" in my code) is not acting as expected; my screen displays "わたhello" instead of "わたしhello".
In fact the offset has very strange behavior:
- 0:hello
- 1:わhello
- 2:わhello
- 3:わたhello
- 4:わたhello
- 5:わたしhello
- 6:わたしhello
- 7:わたし hello
- 8:わたし hello
- 9:わたし hello
Note that the offset is not in bytes, since the characters are 3-byte unicode characters:
>>>len("わ".encode('utf-8'))
3
>>> len("わ")
1
I'm running Python 3.4.3 and curses.version is "b'2.2'".
Does anyone know what's going on or how to debug this?
Thanks in advance.
You're printing 3 double-width characters. That is, each of those takes up two cells.
The length of the string in characters (or bytes) is not necessarily the same as the number of cells used for each character.
Python curses is just a thin layer over ncurses.
I'd expect the characters at offsets 1, 3 and 5 to be erased by putting a character onto the second cell of those double-width characters (ncurses is supposed to do this), but that detail could be a bug in the terminal emulator.
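To make the difference between characters and cells concrete, here is a rough sketch that estimates the cell width using only the standard library. The function name cell_width is hypothetical, and the rule it applies (two cells for East Asian wide or fullwidth characters, zero for combining marks, one otherwise) is a simplifying assumption; it ignores control characters and other special cases.

```python
import unicodedata

def cell_width(text):
    """Estimate how many terminal cells `text` occupies (approximation)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the cell of the preceding character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two cells
        else:
            width += 1  # everything else is assumed to take one cell
    return width

print(len("わたし"), cell_width("わたし"))  # 3 characters, 6 cells
```

Under these assumptions, "hello" would have to start at column 6, not column 3, to land directly after "わたし".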
Based on the response from Thomas, I found the wcwidth package (https://pypi.python.org/pypi/wcwidth) which has a function to return the length of a unicode string in cells.
Here's a full working example:
#!/usr/bin/env python3
import curses
import locale
from wcwidth import wcswidth
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
def doStuff(stdscr):
    foo = "わたし"
    offset = wcswidth(foo)
    stdscr.addstr(0, 0, foo)
    stdscr.addstr(0, offset, 'hello', curses.A_BOLD)
    stdscr.getch() # pauses until a key's hit
curses.wrapper(doStuff)
I'm trying to clean my SQLite database using Python. First I loaded the data using this code:
import sqlite3, pandas as pd
con = sqlite3.connect("DATABASE.db")
df = pd.read_sql_query("SELECT TITLE from DOCUMENT", con)
This gave me the dirty words. For example, for "Conciliaci\363n" I want to get "Conciliacion". I used this code:
df['TITLE']=df['TITLE'].apply(lambda x: x.decode('iso-8859-1').encode('utf8'))
I got b'' in blank cells, and still got 'Conciliaci\\363n' too. So maybe I'm doing something wrong. How can I solve this problem? Thanks in advance.
It's unclear, but if your string contains a literal backslash and numbers like this:
>>> s= r"Conciliaci\363n" # A raw string to make a literal escape code
>>> s
'Conciliaci\\363n' # debug display of string shows an escaped backslash
>>> print(s)
Conciliaci\363n # printing prints the escape
Then this will decode it correctly:
>>> s.encode('ascii').decode('unicode-escape') # convert to byte string, then decode
'Conciliación'
If you want to lose the accent mark as your question shows, then decomposing the Unicode string, converting to ASCII ignoring errors, then converting back to a Unicode string will do it:
>>> s2 = s.encode('ascii').decode('unicode-escape')
>>> s2
'Conciliación'
>>> import unicodedata as ud
>>> ud.normalize('NFD',s2) # Make Unicode decomposed form
'Conciliación' # The ó is now an ASCII 'o' and a combining accent
>>> ud.normalize('NFD',s2).encode('ascii',errors='ignore').decode('ascii')
'Conciliacion' # accent isn't ASCII, so is removed
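The two steps above can be wrapped in one helper that is applied to every value in a column. This is only a sketch: the name clean_title is hypothetical, and passing non-string values (such as None for blank cells) through unchanged is an assumption about what the data contains. The "backslashreplace" error handler is used so that a value which already holds real non-ASCII characters does not raise during the ASCII encoding step.

```python
import unicodedata

def clean_title(value):
    """Decode literal escapes like \\363, then drop accent marks."""
    if not isinstance(value, str):
        return value  # leave None and other non-text cells untouched
    # Turn literal escape sequences into the characters they stand for.
    # backslashreplace converts any real non-ASCII character into an escape
    # first, so it survives the round trip through unicode-escape.
    decoded = value.encode("ascii", "backslashreplace").decode("unicode-escape")
    # Split accented letters into base letter + combining mark,
    # then discard everything that is not ASCII.
    decomposed = unicodedata.normalize("NFD", decoded)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

print(clean_title(r"Conciliaci\363n"))
```

With pandas, something like df['TITLE'].apply(clean_title) would presumably apply it to the whole column.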
I'm looking for a formatted byte string literal. Specifically, something equivalent to
name = "Hello"
bytes(f"Some format string {name}")
Possibly something like fb"Some format string {name}".
Does such a thing exist?
No. The idea is explicitly dismissed in PEP 498:
For the same reason that we don't support bytes.format(), you may
not combine 'f' with 'b' string literals. The primary problem
is that an object's __format__() method may return Unicode data
that is not compatible with a bytes string.
Binary f-strings would first require a solution for
bytes.format(). This idea has been proposed in the past, most
recently in PEP 461. The discussions of such a feature usually
suggest either
adding a method such as __bformat__() so an object can control how it is converted to bytes, or
having bytes.format() not be as general purpose or extensible as str.format().
Both of these remain as options in the future, if such functionality
is desired.
In 3.6+ you can do:
>>> a = 123
>>> f'{a}'.encode()
b'123'
You were actually super close in your suggestion; if you add an encoding kwarg to your bytes() call, then you get the desired behavior:
>>> name = "Hello"
>>> bytes(f"Some format string {name}", encoding="utf-8")
b'Some format string Hello'
Caveat: This works in 3.8 for me, but the note at the bottom of the Bytes Objects section in the docs seems to suggest that this should work with any method of string formatting in all of 3.x (using str.format() for versions <3.6, since f-strings were only added in 3.6, but the OP specifically asks about 3.6+).
From python 3.6.2 this percent formatting for bytes works for some use cases:
print(b"Some stuff %a. Some other stuff" % my_byte_or_unicode_string)
But as AXO commented:
This is not the same. %a (or %r) will give the representation of the string, not the string itself. For example b'%a' % b'bytes' will give b"b'bytes'", not b'bytes'.
Which may or may not matter depending on if you need to just present the formatted byte_or_unicode_string in a UI or if you potentially need to do further manipulation.
As noted here, you can format this way:
>>> name = b"Hello"
>>> b"Some format string %b World" % name
b'Some format string Hello World'
You can see more details in PEP 461
Note that in your example you could simply do something like:
>>> name = b"Hello"
>>> b"Some format string " + name
b'Some format string Hello'
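For comparison, the approaches from the answers above can be put side by side. This is a small sketch with hypothetical function names; it assumes Python 3.6+ for the f-string and UTF-8 as the target encoding.

```python
def via_fstring(name: str) -> bytes:
    # Format as text first, then encode the finished string.
    return f"Some format string {name}".encode("utf-8")

def via_percent_b(name: bytes) -> bytes:
    # %b inserts the bytes unchanged (the value must already be bytes-like).
    return b"Some format string %b" % name

def via_percent_a(value) -> bytes:
    # %a inserts the ASCII repr() of the value, quotes and prefix included.
    return b"Some format string %a" % value

print(via_fstring("Hello"))       # b'Some format string Hello'
print(via_percent_b(b"Hello"))    # b'Some format string Hello'
print(via_percent_a(b"Hello"))    # b"Some format string b'Hello'"
```

The last line shows the difference AXO points out: %a embeds the representation, so it is mainly useful for display or debugging output.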
This was one of the bigger changes made from python 2 to python3. They handle unicode and strings differently.
This is how you'd convert to bytes.
string = "some string format"
encoded = string.encode()  # encode() returns a new bytes object
print(encoded)
This is how you'd decode back to a string.
encoded.decode()
I got a better appreciation for how the handling of Unicode changed between Python 2 and 3 through this Coursera lecture by Charles Severance. You can watch the entire 17 minute video or fast forward to somewhere around 10:30 if you want to get to the differences between Python 2 and 3 and how they handle characters and specifically Unicode.
I understand your actual question is how you could format a string that has both strings and bytes.
inBytes = b"testing"
inString = 'Hello'
type(inString) #This will yield <class 'str'>
type(inBytes) #this will yield <class 'bytes'>
Here you can see that I have a string variable and a bytes variable.
This is how you would combine bytes and a string into one string.
formattedString = inString + ' ' + inBytes.decode()  # decode the bytes to str first
I'm using the following subprocess call to use a command line tool. The output of the command line tool isn't printed in one go; it is written to the command line progressively, over multiple lines, over a period of time. The tool is bs1770gain and the command would be "path\to\bs1770gain.exe" "-i" "\path\to\audiofile.wav". By using the --loglevel parameter you can include more data, but you cannot remove the progressive results being written to stdout.
I need stdout to return a human readable string (hence the stdout_formatted operation):
with subprocess.Popen(list_of_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
    stdout, stderr = proc.communicate()
    stdout_formatted = stdout.decode('UTF-8')
    stderr_formatted = stderr.decode('UTF-8')
However I can only view the variable as a human readable string if I print it e.g.
In [23]: print(stdout_formatted )
nalyzing ... [1/2] "filename.wav":
integrated: -2.73 LUFS / -20.27 LU [2/2]
"filename2.wav":
integrated: -4.47 LUFS / -18.53 LU
[ALBUM]:
integrated: -3.52 LUFS / -19.48 LU done.
In [24]: stdout_formatted
Out[24]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\.......
In [6]: stdout
Out[6]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\......
In [4]: type(stdout)
Out[4]: bytes
In [5]: type(stdout_formatted)
Out[5]: str
If you look carefully, the human readable chars are in the string (the first word is "analyzing").
I guessed that the stdout value needs decoding/encoding so I tried different ways:
stdout_formatted.encode("ascii")
Out[18]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g
stdout_formatted.encode("utf-8")
Out[17]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("utf-8")
Out[15]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("ascii")
Out[14]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
bytes(stdout).decode("ascii")
Out[13]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
I used a library called chardet to check the encoding of stdout:
import chardet
chardet.detect(stdout)
Out[26]: {'confidence': 1.0, 'encoding': 'ascii', 'language': ''}
I'm working on Windows 10 and am using Python 3.6 (the Anaconda package and its integrated Spyder IDE).
I'm kind of clutching at straws now - is it possible to capture in a variable what is displayed in the console when print is called, or to remove the unwanted NUL bytes from the stdout string?
You don't have UTF-8 data. You have UTF-16 data. UTF-16 uses at least two bytes for every character; characters in the ASCII and Latin-1 ranges (such as a) still use 2 bytes, but one of those bytes is always a \x00 NUL byte.
Because UTF-16 works in 2-byte units, the order of the two bytes starts to matter. Encoders can pick between the two options; one is called Little Endian, the other Big Endian. Normally, encoders then include a Byte Order Mark at the very start, so that the decoder knows which of the two order options to use when decoding.
Your posted data doesn't appear to include the BOM (I don't see the 0xFF and 0xFE bytes), but your data does look like it is using little-endian ordering. That fits with this being Windows; Windows always uses little-endian ordering for its UTF-16 output.
If your data does have the BOM present, you can just decode as 'utf-16'. If the BOM is missing, use 'utf-16-le':
>>> sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> sample.decode('utf-16-le')
'analyzin'
>>> import codecs
>>> (codecs.BOM_UTF16_LE + sample)
b'\xff\xfea\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> (codecs.BOM_UTF16_LE + sample).decode('utf-16')
'analyzin'
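The rule above (use 'utf-16' when a BOM is present, otherwise fall back to 'utf-16-le') can be sketched as a small helper. The name decode_utf16 is hypothetical, and defaulting to little-endian when no BOM is found is an assumption that matches the Windows output discussed here, not a general guarantee.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, with or without a leading byte order mark."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic codec reads the BOM, picks the byte order, and drops it.
        return data.decode("utf-16")
    # No BOM: assume little-endian, as produced on Windows.
    return data.decode("utf-16-le")

sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
print(decode_utf16(sample))
print(decode_utf16(codecs.BOM_UTF16_LE + sample))
```

In the subprocess code from the question, this would take the place of stdout.decode('UTF-8').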
I had the same problem with the progress information added by bs1770gain. It is not mentioned in the man page, but there is a "--suppress-progress" flag!
So I just used this command line and everything was OK ;)
bs1770gain --xml --suppress-progress "$filename"
Have you tried :
str(stdout_formatted).replace('\x00','')
?
I am trying to read in a text sequence from a QLineEdit that might contain Unicode escape sequence and print it out to a QLabel and display the proper character in PyQt5 and Python 3.4.
I tried many different things that I read here on stackoverflow but couldn't find a working solution for Python 3.
def on_pushButton_clicked(self):
    text = self.lineEdit.text()
    self.label.setText(text)
Now if I do something like this:
decodedText = str("dsfadsfa \u2662 \u8f1d \u2662").encode("utf-8")
self.label.setText(decodedText.decode("utf-8"))
This does print out the proper characters. If I apply the same to the above method I get the escaped sequences.
I don't get what is the difference between the str() returned by QLineEdit's text() and the str("\u2662"). Why does the one encode the characters properly and the other one doesn't?
The difference is that "\u2662" isn't a string with a Unicode escape, it's a string literal with a Unicode escape. A string with the same Unicode escape would be "\\u2662".
3>> codecs.getdecoder('unicode-escape')('\\u2662')
('♢', 6)
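For text typed by a user, which may mix real non-ASCII characters with literal \uXXXX sequences, a narrower alternative to the unicode-escape codec is to expand only those sequences with a regular expression. This is a swapped-in technique, not the one shown above; the name expand_u_escapes is hypothetical, and the sketch deliberately handles only four-digit \u escapes (no \U, \x, or surrogate pairs).

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def expand_u_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

typed = "dsfadsfa \\u2662 \\u8f1d \\u2662"  # what text() would return
print(expand_u_escapes(typed))
```

In the slot from the question, the result of expand_u_escapes(self.lineEdit.text()) could then be passed to setText.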
I am in win7 +python3.3.
import os
os.system("chcp 936")
fh=open("test.ch","w",encoding="utf-8")
fh.write("你")
fh.close()
os.system("chcp 65001")
fh=open("test.ch","r",encoding="utf-8").read()
print(fh)
Äã
>>> print(fh.encode("utf-8"))
b'\xe4\xbd\xa0'
How can i display the chinese character 你 in 65001?
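One possible reading of the "Äã" output, offered as an assumption rather than a confirmed diagnosis: the file contents are fine (the UTF-8 bytes shown are correct), but the character is written to the console in the cp936/GBK encoding and then displayed as if those bytes were Latin-1. The hypothetical helper below reproduces that kind of mismatch.

```python
def misread(text: str, written_as: str, shown_as: str) -> str:
    """Encode text with one codec and decode the bytes with another."""
    return text.encode(written_as).decode(shown_as)

print("你".encode("utf-8"))               # the bytes stored in the file
print(misread("你", "gbk", "latin-1"))    # GBK bytes shown as Latin-1
```

If this reading is right, the problem lies in the encoding Python uses for console output, not in how the file was written or read.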
If your terminal is capable of displaying the character directly (which it may not be due to font issues) then it should Just Work(tm).
>>> hex(65001)
>>> u"\ufde9"
'\ufde9'
>>> print(u"\ufde9")
To avoid the use of literals, note that in Python 3, at least, the chr() function will take a code point and return the associated Unicode character. So this works too, avoiding the need to do hex conversions.
>>> print(chr(65001))