Handling print() UnicodeEncodeError - python-3.x

I extract Unicode text from a website and print strings that can contain non-ASCII characters. Printing those raises a UnicodeEncodeError. How should I handle this case? I need a solution that requires minimal changes to the existing code.
I tried encoding the strings (by wrapping sys.stdout) with UTF-8, but then I get a TypeError, since the original sys.stdout.write() method expects str, not bytes.
I don't want the characters to be lost. The user may want to pipe the output into a file, which would in that case be UTF-8 encoded.
Edit: Setting PYTHONIOENCODING=utf8 fixes the problem, but isn't there a way I can imitate this behaviour from within Python? Since UTF-8 doesn't match the Windows console encoding, some characters look strange, but that is still far better than the program crashing.
Any way to set up the output from Python so I don't need to modify environment variables?
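One way to imitate PYTHONIOENCODING from inside the program is to re-wrap sys.stdout around its underlying binary buffer. This is only a minimal sketch, not something from the original thread; the errors="replace" policy is my own choice and simply keeps undisplayable characters from crashing the program:

import io
import sys

# Replace sys.stdout with a UTF-8 writer over the original binary buffer.
# errors="replace" substitutes characters the console cannot display
# instead of raising UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8",
                              errors="replace", line_buffering=True)

print("Fran\u00e7ois")  # no UnicodeEncodeError, even on a cp1252 console

On Python 3.7+ the same effect is available without re-wrapping, via sys.stdout.reconfigure(encoding="utf-8", errors="replace").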

Related

Convert unicode text to single byte ascii in Python

I have an input file whose data I need to process. The file is in UTF-16, even though every single character in it is just a standard ASCII character.
I can NOT change the input file so that it doesn't use useless double-byte characters to represent 100% English-language single-character data. I need to convert this in Python, on Windows. (Please, no non-Python solutions, thank you.)
I want my Python program to act on these strings and output a file which is NOT double-byte; I just want standard ASCII strings (one byte per character).
I've googled a lot and seen all sorts of related questions, but not mine. I'm frustrated with not being able to solve this seemingly very simple question and need.
EDIT: Here is the program I got to work. It is absurd. There must be an easier way. The chr(10) references in the code are there because the input has lines and I couldn't find a non-absurd way to do simple readline/writeline calls.
with open('Unicode.txt','r') as input:
    with open('ASCII.txt','w') as output:
        for line in input.readlines():
            codelist=[code for code in line.encode('ascii','ignore') if code not in (0,10)]
            if codelist:
                output.write(''.join([chr(code) for code in codelist]+[chr(10)]))
Question solved after reading a hint from @Mark Ransom.
with open('unicode.txt','r',encoding='UTF-16') as input:
    with open('ascii.txt','w',encoding='ascii') as output:
        output.write(input.read())
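If the file is large, or might after all contain the odd non-ASCII character, a line-by-line variant of the same idea works too. This is just a sketch; the errors='ignore' option mirrors the encode('ascii','ignore') from the original attempt and silently drops anything that isn't ASCII:

with open('unicode.txt', 'r', encoding='UTF-16') as infile:
    with open('ascii.txt', 'w', encoding='ascii', errors='ignore') as outfile:
        for line in infile:
            # each decoded line is re-encoded as ASCII on write;
            # unencodable characters are dropped rather than raising an error
            outfile.write(line)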

√ not recognized in Terminal

For a class of mine I have to make a very basic calculator. I want to write the code in such a way that the user can just enter what they want to do (e.g. √64), press return, and get the answer. I wrote this:
if '√' in operation:
    squareRoot = operation.replace('√','')
    squareRootFinal = math.sqrt(int(squareRoot))
When I run this in IDLE, it works like a charm. However, when I run this in Terminal I get the following error:
SyntaxError: Non-ASCII character '\xe2' in file x.py on line 50, but no encoding declared;
Any suggestions?
Just declare the encoding. Python is being a bit cautious here and not guessing the encoding of your text file. IDLE is a text editor and so has already guessed the encoding and stored things internally as Unicode, which it can pass directly to the Python interpreter.
Put
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
at the top of the file. (It's pretty unlikely nowadays that your encoding is not UTF-8.)
For reference: in the past, files could use a variety of different encodings, which means the same text could be stored in different ways in binary when written to disk. Almost all encodings have the same interpretation of bytes 0 to 127 (the ASCII subset), but if any other bytes occur in the file, their meaning is potentially ambiguous.
However, in recent years, UTF-8 has become by far the most common encoding, so it's almost always a safe guess.
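For context, here is roughly what the top of the file could look like once the declaration is in place. This is only a sketch built around the snippet from the question; the input prompt and the print call are placeholders, not part of the original code:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import math

# e.g. the user types: √64
operation = input('Enter an operation: ')

if '√' in operation:
    squareRoot = operation.replace('√', '')
    squareRootFinal = math.sqrt(int(squareRoot))
    print(squareRootFinal)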

json.dump to a gzip file in python 3

I'm trying to write an object to a gzipped JSON file in one step (minimising code, and possibly saving memory). My initial idea (Python 3) was this:
import gzip, json
with gzip.open("/tmp/test.gz", mode="wb") as f:
    json.dump({"a": 1}, f)
This, however, fails: TypeError: 'str' does not support the buffer interface, which I think has to do with the string not being encoded to bytes. So what is the proper way to do this?
Current solution that I'm unhappy with:
Opening the file in text-mode solves the problem:
import gzip, json
with gzip.open("/tmp/test.gz", mode="wt") as f:
    json.dump({"a": 1}, f)
However, I don't like text mode. In my mind (and maybe this is wrong, but supported by this), text mode is used to fix line endings. That shouldn't be an issue here, because JSON doesn't have line endings, but I don't like it (possibly) messing with my bytes, it is (possibly) slower because it's looking for line endings to fix, and (worst of all) I don't understand why something about line endings fixes my encoding problems.
Off-topic: I should have dug into the docs a bit further than I initially did.
The python docs show:
Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding (the default being UTF-8). 'b' appended to the mode opens the file in binary mode: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text.
I don't fully agree that the result of a JSON encode is a string (I think it should be a sequence of bytes, since the format explicitly defines that it uses UTF-8 encoding), but I noticed this before. So I guess text mode it is.
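If you prefer to keep the file handle in binary mode, one alternative (just a sketch; whether this is nicer than text mode is a matter of taste) is to encode the JSON string to bytes yourself before writing:

import gzip, json

with gzip.open("/tmp/test.gz", mode="wb") as f:
    # json.dumps returns a str; encoding it gives the bytes that a
    # binary-mode gzip file expects
    f.write(json.dumps({"a": 1}).encode("utf-8"))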

How to write excel file with special characters through Perl script?

I am writing an Excel file through Perl code. When I insert the data into an XML file and view it in a browser, I see the correct data with special characters, but when I write the same data to an Excel file, it shows garbage characters.
For example:
(word from XML file on browser) Gràcia - (word from Excel file) Grà cia
I am using 'Spreadsheet::XLSX' for reading Excel files and 'Excel::Writer::XLSX' for writing them.
I also need help finding the encoding of the Excel fields.
Do you have any idea? Thanks in advance.
This seems very much like a UTF-8 to ISO-8859-1 conversion going wrong: it looks like a string that contains UTF-8, but is not marked as being UTF-8, is being passed to $worksheet->write(). Since http://metacpan.org/pod/Excel::Writer::XLSX#UNICODE-IN-EXCEL claims to handle Unicode correctly, it seems to be a problem with your input string, not with the write method itself.
As you don't post any code, and don't tell us where your strings come from, I can't tell why the strings aren't marked correctly.
You can probably get away with
Encode::_utf8_on($str)
before passing your strings to $worksheet->write(), but this might just as well break other things if not all of your strings really are UTF-8. Basically the answer is: get the UTF-8 flag on your strings right when you read them.

Troubles with text encoding

I'm having some trouble with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a ByteString. The problem is that this yields garbled output:
"Project - François Dubois".
What am I missing here?
If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient, because Data.Text[.Lazy] holds UTF-16 encoded text, and the two code units 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you UTF-8-encode the text, these are converted to the byte sequences c3 83 ([195,131]) and c2 a7 ([194,167]).
The most likely way for getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or another 8-bit encoding; 8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
If avoiding the situation is not possible, the easiest ways of fixing it are
replacing the occurrences of the offending sequence with the desired one before encoding:
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.
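The underlying mix-up is not specific to Haskell. Purely for illustration (this is Python, not part of the answer above), the same bytes-read-as-Latin-1 story and the "reinterpret the code units as bytes" repair look like this:

# "François" encoded as UTF-8 but decoded as Latin-1 gives the mojibake
garbled = "François".encode("utf-8").decode("latin-1")
print(garbled)    # FranÃ§ois

# Re-encoding as Latin-1 recovers the original UTF-8 bytes, which can then
# be decoded correctly; this mirrors the Char8.pack / unpack trick above
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)   # François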
I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S.: I'm creating an answer since it is easier to include descriptive links.

Resources