Seemingly simple question: How do I print() a string in Python3? Should be a simple:
print(my_string)
But that doesn't work. Depending on the content of my_string, your environment variables and the OS you use, that can throw a UnicodeEncodeError exception:
>>> print("\u3423")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u3423' in position 0: ordinal not in range(128)
Is there a clean portable way to fix this?
To expand a bit: The problem here is that a Python 3 string contains Unicode characters (code points), while the terminal can be using any encoding. If you are lucky, your terminal can handle all the characters contained in the string and everything will be fine; if your terminal can't (e.g. somebody set LANG=C), you get an exception.
If you manually encode a string in Python3 you can supply an error handler that ignores or replaces unencodable characters:
"\u3423".encode("ascii", errors="replace")
For print() I don't see an easy way to plug in an error handler, and even if there is one, a blanket error handler seems like a terrible idea as it would modify the data. A conditional error handler might work (i.e. check isatty() and decide what to do based on that), but it seems awfully hacky to go through all that trouble just to print() a string, and I am not even sure it wouldn't fail in some cases.
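For illustration, such a conditional handler could look roughly like this; this is only a sketch, and the helper name and the policy are made up rather than an established recipe:
import sys

def print_for_humans(text):
    # Hypothetical helper: degrade gracefully only when writing to a terminal;
    # when output is piped or redirected, write the exact UTF-8 bytes so the
    # data is not modified.
    if sys.stdout.isatty():
        enc = sys.stdout.encoding or "ascii"
        sys.stdout.buffer.write(text.encode(enc, errors="replace") + b"\n")
    else:
        sys.stdout.buffer.write(text.encode("utf-8") + b"\n")
    sys.stdout.buffer.flush()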
A real-world example of this problem is this one:
Python3: UnicodeEncodeError only when run from crontab
The most practical way to solve this issue seems to be to force the output encoding to utf-8:surrogateescape. This will not only force UTF-8 output, but also ensure that surrogate-escaped strings, as returned by os.fsdecode(), can be printed without throwing an exception. On the command line this looks like this:
PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'print("\udcff")'
To do this from within the program itself, one has to reassign stdout and stderr; this can be done as follows (the line_buffering=True is important, as otherwise the output won't get flushed properly):
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="surrogateescape", line_buffering=True)
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8", errors="surrogateescape", line_buffering=True)
print("\udcff")
This approach will cause characters to be incorrectly displayed on terminals not set to UTF-8, but to me that seems strongly preferable to randomly throwing exceptions and to making it impossible to print filenames without corrupting them, as on Linux systems they might not be in any valid encoding at all.
I read in a few places that utf-8:surrogateescape might become the default in the future, but as of Python 3.6.0b2 that is not the case.
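On newer Python versions (3.7 and later), and assuming sys.stdout and sys.stderr are still the standard TextIOWrapper streams, the same switch can be done in place with reconfigure() instead of re-wrapping them:
import sys

# Python 3.7+: change the existing streams in place rather than replacing them.
sys.stdout.reconfigure(encoding="utf-8", errors="surrogateescape", line_buffering=True)
sys.stderr.reconfigure(encoding="utf-8", errors="surrogateescape", line_buffering=True)
print("\udcff")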
Is there a clean portable way to fix this?
Set PYTHONIOENCODING=<encoding>:<error_handler> e.g.,
$ PYTHONIOENCODING=utf-8 python your_script.py >output-in-utf-8.txt
In your case, I'd configure your environment (LANG, LC_CTYPE) to accept non-ascii input:
$ locale charmap
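To see what Python has actually picked up from the environment, here is a quick check using only the standard library:
import locale
import sys

print(sys.stdout.encoding)            # the encoding print() will use for stdout
print(locale.getpreferredencoding())  # what the locale settings imply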
The reason it is giving you an error is that it is trying to decipher what \u means, just like \r is the escape for a carriage return, \n for a newline, \t for a tab, etc.
If:
my_string = '\u112'
print(my_string)
That will give you an error. To print the '\' without Python trying to interpret the escape sequence, escape the backslash like so:
my_string = '\\u112'
print(my_string)
Output:
\u112
Related
I'm writing from pandas to csv like this:
df.to_csv(extractpath+'/extract_{}'.format(locname))
Now, when the locname variable contains Croatian characters, I get an error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u0161' in position 53: ordinal not in range(128)
The only workaround I have come up with is this:
df.to_csv(extractpath+'/extract_{}'.format(locname.encode('utf-8')))
However, although the error is now gone, the file names are no longer correct; for example they look like:
extract_b'Vara\xc5\xbedin' instead of extract_Varaždin
How can I properly solve the problem?
I had wrongly set the LANG variable in the shell without realizing it, so when I executed the Python script from a shell script it didn't work. It is fine now after correcting the LANG variable.
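For completeness, once the locale is set to something UTF-8-capable, the Unicode name can be passed through unchanged; here is a minimal sketch, where the DataFrame and extractpath are made up for illustration:
import pandas as pd

df = pd.DataFrame({"value": [1, 2]})
extractpath = "/tmp"        # hypothetical output directory
locname = "Varaždin"

# With a UTF-8 locale (e.g. LANG=en_US.UTF-8) no .encode() call is needed.
df.to_csv(extractpath + "/extract_{}".format(locname))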
I used Bash to run a Python file. The Python file should read a UTF-8 file and display it in the terminal. It gives me a bunch of ▒'s ("Aaron▒s" instead of "Aaron's"). Here's the code:
# It reads text from a text file (done).
f = open("draft.txt", "r", encoding="utf8")
# It handles apostrophes and non-ASCII characters.
print(f.read())
I've tried different combinations of:
read formats with the open function ("r" and "rb")
strip() and rstrip() method calls
decode() method calls
text file encoding (specifically ANSI, Unicode, Unicode big endian, and UTF-8).
It still doesn't display apostrophes (" ' ") properly. How do I make it display apostrophes instead of ▒'s?
The issue is with Git Bash. If I switch to PowerShell, Python displays the apostrophes (Aaron's) perfectly. The garbled output (Aaron▒s) appears only with Git Bash. I'll give more details if I learn more about it.
Update: @jasonharper and @entpnerd suggested that the draft.txt apostrophe might be "apostrophe-ish" and not a legitimate apostrophe. I compared the draft.txt apostrophe (copied and pasted from a Google Doc) with an apostrophe entered directly. They look different (’ vs. '). In xxd, the value for the apostrophe-ish character is 92. An actual apostrophe is 27. Git Bash only supports the latter (unless there's just something I need to configure, which is more likely).
Second Update: Clarified that I'm using Git Bash. I wasn't aware that there were multiple terminals (is that the right way of putting it?) that ran Bash.
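If the immediate goal is just a readable apostrophe on a limited terminal, one workaround sketch is to normalize the typographic apostrophe (U+2019, which is byte 0x92 in Windows-1252) after reading the file. This assumes draft.txt really is UTF-8 encoded:
# Workaround sketch: replace the typographic apostrophe (U+2019) with a plain
# ASCII apostrophe so a terminal that cannot render it still shows the text.
# If the file is actually Windows-1252 (where the apostrophe is byte 0x92),
# open it with encoding="cp1252" instead.
with open("draft.txt", "r", encoding="utf8") as f:
    text = f.read()
print(text.replace("\u2019", "'"))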
Consider the following terminal command line
python3 -c 'print("hören")'
In most terminals this prints "hören" (German for "to hear"); in some terminals you get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6'
in position 1: ordinal not in range(128)
In my Python 3 program I don't want a simple print() to be able to raise an exception like this; instead, I'd rather output only those characters that will not raise an exception.
So my question is: how can I output (Unicode) strings in Python 3 while ignoring unencodable characters?
Some notes
What I've tried so far
I tried using sys.stdout.write instead of print, but the encoding problem can still occur.
I tried encoding the string into bytes via
bytes=line.encode('utf-8')
This never raises an exception on print, but even in capable terminals non-ASCII characters are then displayed as escaped byte values rather than as the characters themselves.
I tried using the decode method with the 'ignore' parameter:
bytes=line.encode('utf-8')
decoded=bytes.decode('utf-8', 'ignore')
print(decoded)
But the problem is not the decoding of the string but the encoding in the print function.
Here are some terminals which appear not to be capable of displaying all characters:
bash shell inside Emacs on macOS.
Receiving a "printed" string in Applescript via do shell script, e.g.:
set txt to do shell script "/usr/local/bin/python3 -c \"print('hören')\" "
Update: These terminals all return the value US-ASCII from locale.getpreferredencoding().
My preferred way is to set the PYTHONIOENCODING variable depending on the terminal you are using.
For UTF-8-enabled terminals you can do:
export PYTHONIOENCODING='utf-8'
For printing '?'s in ASCII terminals, you can do:
export PYTHONIOENCODING='ascii:replace'
Or even better, if you don't care about the encoding, you should be able to do:
export PYTHONIOENCODING=':replace'
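For reference, this is what the replace handler does at the codec level: characters the target encoding cannot represent become '?', so print() no longer raises an exception:
>>> "hören".encode("ascii", errors="replace")
b'h?ren'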
I am trying to make a program that converts Arabic diacritics and letters into the Latin script. The letters work well in the program, but the diacritics cannot be converted, as I get an error every time I run the program.
At the beginning I put the diacritics alone as keys, but that did not work for me. Please see the last key: it contains َ , which is a diacritic, but it does not work properly like the letters do:
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
    end_word = []
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("الوَ")
However, when I tried to fix the problem by using letters with the diacritics attached as keys, the program resulted in the same error:
the dictionary:
ArEn = {'ا':'A', 'ل':'L', "وَ":"Wa"}
the error:
Traceback (most recent call last):
File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 10, in <module>
convert("الوَ")
File "C:\Users\Abdulaziz\Desktop\converter AR to EN SC.py", line 5, in convert
end_word.append(ArEn[lit[i]])
KeyError: 'و'
The chances are that there is rather a bug in the code editor you are using for coding Python than in Python itself.
Since you are using Python 3.x, from the running program's point of view the diacritics are just single characters, like any others, and there should be no issues at all.
From the code editor's point of view, there are issues such as whether or not to advance one character position when displaying certain special Unicode characters, and the " character itself may be shown out of place - when one tries to manually correct the position of the ", one could place it out of order, leaving the special character actually outside the quoted string.
The fact you could solve the issue by re-editing the file suggests that is indeed what happened.
One way to avoid this is to escape certain special characters - especially ones that have different display rules - with the "\uxxxx" Unicode codepoint escape sequence. This will save you or other people from having issues when editing your file again in the future, since even if you get it working now, the editor may show the characters incorrectly when the file is opened again, and by trying to fix it one might break the syntax again.
You can use a table on the web or Python 3's interactive prompt to get the Unicode codepoint of each character, ensuring that the code part of the program is displayed in a deterministic way in any editor. (If you add the diacritical character as a comment on the same line, it will actually enhance the readability of your code - enormously so if it is ever supposed to be edited by non-Arabic speakers.)
So, for your declaration above, I used this snippet to extract the codepoints:
>>> ArEn = {'ا':'A', 'ل':'L', "و": "W", "َ":"a"}
>>> [print(hex(ord(yy)), yy) for yy in ArEn.keys()]
0x648 و
0x644 ل
0x64e َ
0x627 ا
Which allows me to declare the dictionary like this:
ArEn = {
    "\u0648": "W",  # و
    "\u0644": "L",  # ل
    "\u064e": "a",  # fatha
    "\u0627": "A",  # ا
}
(And yes, I had trouble displaying the characters on my terminal, just as I said you probably had in your editor while getting these - the fatha ("\u064e" - "a") character is tricky! :-) )
An alternative to putting the codepoints in your code is to use Python's unicodedata module to discover and then use the actual character names - this can enhance readability further, and maybe by exploring unicodedata you will find that you don't even have to create this dictionary manually, but can use that module instead:
>>> import unicodedata
>>> [print("\\u{:04x} - '{}' - {}".format(ord(yy), unicodedata.name(yy), yy)) for yy in ArEn.keys()]
\u0648 - 'ARABIC LETTER WAW' - و
\u0644 - 'ARABIC LETTER LAM' - ل
\u064e - 'ARABIC FATHA' - َ
\u0627 - 'ARABIC LETTER ALEF' - ا
And from these full text names, you can get back to the character with the unicodedata.lookup function:
>>> unicodedata.lookup("ARABIC LETTER LAM")
'ل'
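As a rough sketch of the "maybe you don't have to build the dictionary by hand" idea above - the helper name and the name-based heuristic are assumptions, not an established transliteration scheme:
import unicodedata

def latin_hint(ch):
    # Hypothetical helper: derive a crude Latin hint from the character's
    # Unicode name, e.g. 'ARABIC LETTER WAW' -> 'W', 'ARABIC FATHA' -> 'fatha'.
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC LETTER "):
        return name[len("ARABIC LETTER "):][0]  # first letter of the letter name
    if name.startswith("ARABIC "):
        return name[len("ARABIC "):].lower()    # e.g. 'fatha' for a diacritic
    return ch

print("".join(latin_hint(c) for c in "\u0627\u0644\u0648\u064e"))  # ALWfatha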
notes:
1) This requires Python 3 - for Python 2 one might try to prefix each string with u"" - but anyone dealing with these characters is far better off using Python 3, since Unicode support is one of its big improvements.
2) This also requires a terminal with good support for Unicode characters using the "utf-8" encoding - I am on a Linux system with the "konsole" terminal. On Windows, the IDLE Python prompt might work, but not the cmd Python prompt.
You might need proper indentation in python:
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word = []
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("اُلوَ")
Update: I just noticed, after years, that the letters and diacritics were put together in the first try. When I separated them, the program worked.
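Indeed, a quick check at the interactive prompt shows that the letter-plus-diacritic form is two separate code points, which is why indexing the input string one character at a time never matches a combined key:
>>> key = "\u0648\u064e"        # waw followed by a fatha; displays as "وَ"
>>> len(key)
2
>>> [hex(ord(c)) for c in key]
['0x648', '0x64e']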
I just solved the problem!
I am not really sure if it was a mistake in Python or something else, but as far as I know Python does not support Arabic very well. Or maybe I made a mistake in the program above.
I kept writing the same program and suddenly it worked very well.
I even added different diacritics and they worked properly.
def convert(lit):
    ArEn = {'ا':'A', 'ل':'L', "و":"W", "َ":"a", "ُ":"w", "":""}
    end_word = []
    for i in range(len(lit)):
        end_word.append(ArEn[lit[i]])
    jon = ""
    print(jon.join(end_word))
convert("اُلوَ")
The result is
AwLWa
I'm writing scripts to clean up Unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that print(), at least, still does not default to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:
print(label, msg)
... in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position
38: character maps to <undefined>
If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:
print(label, msg.encode())
This also works for printing objects or lists containing unicode strings--something I often have to do when debugging--since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj) ? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.
Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)
Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*
However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):
msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())
b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'
I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.
print doesn't default to any encoding; it just uses whatever encoding the output device (like a console) claims to support. Your console encoding appears to be non-Unicode, so print tries to encode your Unicode strings in that encoding, and fails. The easiest way to get around this is to tell the console to use UTF-8 (like export LC_ALL=en_US.UTF-8 on Unix systems).
The easier way to proceed is to only use unicode in your script, and only use encoded data when you want to interact with the "outside" world. That is, when you have input to decode or output to encode.
To do so, every time you read something, use decode; every time you output something, use encode.
For your regexes, use the re.UNICODE flag.
I know that this doesn't exactly answer your question point by point, but I think that applying such a methodology should keep you safe from encoding issues.
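A small sketch of that decode-on-input approach for the regex part - the file name here is made up for illustration, and note that in Python 3, str patterns are already Unicode-aware, so re.UNICODE mainly documents intent:
import re

# Decode on input: read the regex file as text with an explicit encoding.
with open("regexes.txt", encoding="utf-8") as f:
    patterns = [line.rstrip("\n") for line in f if line.strip()]

# Keep everything as str inside the program and compile with re.UNICODE.
compiled = [re.compile(p, re.UNICODE) for p in patterns]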