Python 3.1 server-side can't output Unicode string to client - python-3.x

I'm using a free web host and choosing not to work with any Python framework, and am stuck trying to print Chinese characters saved in the source file (using Emacs to save the file encoded in UTF-8) to the resulting HTML page. I thought Unicode "just works" in Python 3.1, so I am baffled. I found three solutions that aren't working. I might just be missing a detail or two.
The host is Alwaysdata, and it has been straightforward to use, so I have little clue about details of how they put together the parts. All I do is upload or edit (with ssh) Python files to a www folder, change permissions, point a browser to the right URL, and it works.
My first attempt, which works in local IDLE (and also in the server's interactive Python shell, which makes me even more confused about why it fails when the output goes to a browser):
#!/usr/bin/python3.1
mystr = "你好世界"
print("Content-Type: text/html\n\n")
print("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""")
print(mystr)
The error is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
Then I tried
print(mystr.encode("utf-8"))
resulting in no error, but the following undesired output to the browser:
b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Third, I added the following lines but got an error:
import sys
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Finally, replacing print with f.write:
import codecs
f = codecs.open(sys.stdout, "w", "utf-8")
mystr = "你好世界"
...
f.write(mystr)
error:
TypeError: invalid file: <_io.TextIOWrapper name='<stdout>'
encoding='ANSI_X3.4-1968'>
How do I get the output to work? Do I need to use a framework for a quick fix?

It sounds like you are using CGI, which is a stupid API as it's using stdout, made for output to humans, to output to your browser. This is the basic source of your problems.
You need to encode it in UTF-8, and then write to sys.stdout.buffer instead of sys.stdout.
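For example, a minimal sketch of that approach (assuming a CGI script under Python 3.1, where sys.stdout.buffer exposes the underlying binary stream):
#!/usr/bin/python3.1
import sys

mystr = "你好世界"
page = ('<!DOCTYPE html>\n'
        '<html><head><meta charset="utf-8"></head>\n'
        '<body>' + mystr + '</body></html>\n')

# Bypass the ASCII-configured text layer: encode to UTF-8 yourself
# and write the raw bytes to the underlying binary buffer.
sys.stdout.buffer.write(b'Content-Type: text/html; charset=utf-8\n\n')
sys.stdout.buffer.write(page.encode('utf-8'))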
And after that, get yourself a web framework. Really, you'll be a lot happier.

Related

Python3 - urllib.request.urlopen and readlines to utf-8?

Consider this example:
import urllib.request # Python3 URL loading
filelist_url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
filelist_fobj = urllib.request.urlopen(filelist_url)
#filelist_fobj_fulltext = filelist_fobj.read().decode('utf-8')
#print(filelist_fobj_fulltext) # ok, works
lines = filelist_fobj.readlines()
print(type(lines[0]))
This code prints the type of the first entry of the list returned by readlines() on the file object for the urlopen()'d URL:
<class 'bytes'>
... and in fact, all of the entries in the returned list are of the same type.
I am aware that I could do .read().decode('utf-8') as in the commented lines, and then split that result on \n. However, I'd like to know: is there otherwise any way to use urlopen with .readlines() and get a list of (UTF-8) strings?
urllib.request.urlopen returns a http.client.HTTPResponse object, which implements the io.BufferedIOBase interface, which returns bytes.
The io module provides TextIOWrapper, which can wrap a BufferedIOBase object (or other similar objects) to add an encoding. The wrapped object's readlines method returns str objects decoded according to the coding you specified when you created the TextIOWrapper, so if you get the encoding right, everything will work. (On Unix-like systems, utf-8 is the default encoding, but apparently that's not the case on Windows. So if you want portability, you need to provide an encoding. I'll get back to that in a minute.)
So the following works fine:
>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
>>> with urlopen(url) as response:
...     lines = TextIOWrapper(response, encoding='utf-8').readlines()
...
>>> for line in lines[:5]: print(type(line), line.strip())
...
<class 'str'> The following are the graphical (non-control) characters defined by
<class 'str'> ISO 8859-1 (1987). Descriptions in words aren't all that helpful,
<class 'str'> but they're the best we can do in text. A graphics file illustrating
<class 'str'> the character set should be available from the same archive as this
<class 'str'> file.
It's worth noting that both the HTTPResponse object and the TextIOWrapper which wraps it implement the iterator protocol, so you can use a loop like for line in TextIOWrapper(response, ...): instead of saving the entire web page using readlines(). The iterator protocol can be a big win because it lets you start processing the web page before it has all been downloaded.
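For instance, a streaming variant of the example above (a sketch; the five-line cutoff is just for illustration):
from io import TextIOWrapper
from urllib.request import urlopen

url = "https://www.w3.org/TR/PNG/iso_8859-1.txt"
with urlopen(url) as response:
    # Decoding happens as bytes arrive, so processing can start
    # before the whole page has been downloaded.
    for n, line in enumerate(TextIOWrapper(response, encoding='utf-8')):
        if n == 5:
            break
        print(line.strip())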
Since I work on a Linux system, I could have left out the encoding='utf-8' argument to TextIOWrapper, but regardless, the assumption is that I know the file is UTF-8 encoded. That's a pretty safe assumption, but it's not universally valid. According to the W3Techs survey (updated daily, at least when I wrote this answer), 97.6% of websites use UTF-8 encoding, which means that roughly one in 40 does not. (If you restrict the survey to what W3Techs considers the top 1,000 sites, the percentage increases to 98.7%. But that's still not universal.)
Now, the conventional wisdom, which you'll find in a number of SO answers, is that you should dig the encoding out of the HTTP headers, which you can fairly easily do:
>>> # Tempting though this is, DO NOT DO IT. See below.
>>> with urlopen(url) as response:
...     lines = TextIOWrapper(response,
...                           encoding=response.headers.get_content_charset()
...                          ).readlines()
...
Unfortunately, that will only work if the website declares the content encoding in the HTTP headers, and many sites prefer to put the encoding in a meta tag. So when I tried the above with a randomly-selected Windows-1252-encoded site (taken from the W3Techs survey), it failed with an encoding error:
>>> with urlopen(win1252_url) as response:
...     lines = TextIOWrapper(response,
...                           encoding=response.headers.get_content_charset()
...                          ).readlines()
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 346: invalid continuation byte
Note that although the page is encoded in Windows-1252, that information was not provided in the HTTP headers, so TextIOWrapper chose the default encoding, which on my system is UTF-8. If I supply the correct encoding, I can read the page without problems, letting me see the encoding declaration in the page itself.
>>> with urlopen(win1252_url) as response:
...     lines = TextIOWrapper(response,
...                           encoding='Windows-1252'
...                          ).readlines()
...
>>> print(lines[3].strip())
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
Clearly, if the encoding is declared in the content itself, it's not possible to set the encoding before reading the content. So what to do in these cases?
The most general solution, and the simplest to code, appears to be the well-known BeautifulSoup package, which is capable of using a variety of techniques to detect the character encoding. Unfortunately, that requires parsing the entire page, which is a much more time-consuming task than just reading lines.
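As a sketch of that approach (assuming the third-party beautifulsoup4 package; soup.original_encoding reports what the encoding detector settled on):
from urllib.request import urlopen
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

with urlopen(url) as response:
    # Feeding bytes lets BeautifulSoup's "Unicode, Dammit" machinery
    # sniff the encoding (BOM, meta tags, heuristics) while parsing.
    soup = BeautifulSoup(response.read(), 'html.parser')
print(soup.original_encoding)  # e.g. 'windows-1252'
lines = soup.get_text().splitlines()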
Another option would be to read the first kilobyte or so of the web page as bytes and then try to find a meta tag. Content providers are supposed to put the meta tag close to the beginning of the page, and it certainly has to come before the first non-ASCII character. If you don't find a meta tag and no character encoding is declared in the HTTP headers, you could then try a heuristic encoding detector on the bytes of the file already read.
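Here is a rough sketch of that option (the regex and the one-kilobyte peek are deliberate simplifications, not a robust HTML parser; HTTPResponse supports peek() because it implements the buffered-I/O interface):
import re
from io import TextIOWrapper
from urllib.request import urlopen

def sniff_charset(response, default=None):
    # Peek at the first kilobyte without consuming it from the stream.
    head = response.peek(1024)[:1024]
    # Look for charset=... as it appears in a meta tag.
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', head)
    return m.group(1).decode('ascii') if m else default

with urlopen(win1252_url) as response:
    # Prefer the page's own declaration, fall back to the HTTP header,
    # then to a last-resort default.
    enc = (sniff_charset(response)
           or response.headers.get_content_charset()
           or 'utf-8')
    lines = TextIOWrapper(response, encoding=enc).readlines()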
The one thing you shouldn't do is rely on the character encoding declared in the HTTP header, regardless of the many suggestions to do so which you will find here and elsewhere on the web. As we've already seen, the headers often don't contain this information; even when they do, it is often wrong, because for a web designer it's easier to declare the encoding in the page itself than to reconfigure the server to send the correct headers. So you can't really rely on the HTTP header, and you should only use it if you have no other information to go on.

Google Translate API returns non UTF8 characters

Resolved
See the end of this post for the solution.
Good evening.
I'm trying to play with the Google Translate v3 API, and I've run into a mystifying encoding issue.
I do this:
def translate_text_langueTarget(texteToTranslate, langueTarget):
    parent = client.location_path(project_id, location)
    langueOrigin = detect_language(texteToTranslate)
    if (langueOrigin == "en" and langueTarget == "en"):
        return(texteToTranslate)
    try:
        response = client.translate_text(
            parent=parent,
            contents=[texteToTranslate],
            mime_type='text/plain',
            source_language_code=langueOrigin,
            target_language_code=langueTarget)
        translatedTexte = str(response.translations)[19:-3]
    except:
        translatedTexte = "Sorry my friend, the translation is lost on the internet"
    print(response)
    print(type(response))
    print(response.translations)
    print(type(response.translations))
    return(translatedTexte)
I call this with
stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)
And I expect to get "préférer" as the answer.
But I obtain:
"pr\303\251f\303\251rer"
I tried to track down this error with a bit of debugging in my code, using:
print(response)
print(type(response))
print(response.translations)
print(type(response.translations))
I think it's an encoding problem, but I can't find an answer to it.
I work in Python and my script is tagged with:
#! /usr/bin/env python3
# coding: utf-8
in the header
Do you have any ideas?
Resolved.
I used:
translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
Apparently, the response from the API is HTML-encoded (UTF-8 text wrapped in HTML character entities).
The solution is simple.
import html
print(sf)
# Vinken rejoindra le conseil d&#39;administration en novembre.
print(html.unescape(sf))
# Vinken rejoindra le conseil d'administration en novembre.
+Info https://stackoverflow.com/a/48805931/4752223
The Google Translate API gives you UTF-8 text.
You got c3 a9 (303 251 as octal numbers), which is really é, as expected.
So your code takes the correct UTF-8 text and writes it out with what may be the wrong encoding.
This line is just a myth and is not useful here (it only declares the encoding of the source file):
# coding: utf-8
If you want your code to interpret input and output as UTF-8, you should say so explicitly. With your code, I assume one problem is that you use print (better to write into a file): on Windows, terminals are not UTF-8 by default, but use the old "Windows ANSI" encoding, also known as Windows-1252.
So write into a file (with an explicit UTF-8 encoding), or change the terminal settings to get a UTF-8 terminal. In addition, you may have escape sequences in the results. Results written in octal form smell wrong to me: that is not standard Python behaviour (Python would complain about a wrong encoding), so you may need to parse the response to translate the escape sequences.
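For instance, a minimal sketch of the file-writing option (the filename is just an assumption for illustration):
# Write the translation with an explicit encoding, so the result does
# not depend on the terminal's Windows-1252 settings.
with open('translation.txt', 'w', encoding='utf-8') as f:
    f.write(translatedTexte)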

how do I read a text file into my .py file with Flask and then get a line-by-line array? [duplicate]

This question already has an answer here:
Get path relative to executed flask app
(1 answer)
Closed 3 years ago.
Please see here for my command line output: (https://i.stack.imgur.com/l8Qbl.jpg)
It says the file /static/poemInput.txt doesn't exist, but as you can see, it clearly exists when I execute ls static. Is there a problem with how I'm naming the files?
context:
I'm very new to flask but I have a python app that I want to deploy online. I am just trying to import the text file I use in the app, however it isn't found.
I know the text file should be in the static folder, and I used url_for as required.
I was getting the errors specified here, so my with open block is inside a with app.test_request_context(): block.
EDIT
I tried what was suggested by Luan Nguyen and used the app.open_resource() function, but then I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128). With Python's open function (as in my original code) this was fixed by setting the encoding to latin1; how can I do this with Flask's open_resource function? I tried doing f.encode('latin1') but got the error: '_io.TextIOWrapper' object has no attribute 'encode'.
Essentially: how do I read a text file into my .py file with Flask and then get a line-by-line array?
The problem
The problem is with your open call. url_for returned the string '/static/poemInput.txt'. When you pass this string directly to Python's open, it will look for the file at <system_root>/static/poemInput.txt, not at <your_project_directory>/static/poemInput.txt.
Solution
Given you might have a running Flask instance, you should be using Flask's open_resource function. With this structure:
/myapplication.py
/schema.sql
/static
    /poemInput.txt
    /style.css
/templates
    /layout.html
    /index.html
You could do something like:
with app.open_resource('static/poemInput.txt') as f:
    contents = f.read()
    do_something_with(contents)
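As for the UnicodeDecodeError mentioned in the question's edit: open_resource opens the file in binary mode by default, so one way to choose the encoding (a sketch using the standard library, not a Flask-specific option) is to wrap the stream in io.TextIOWrapper:
import io

with app.open_resource('static/poemInput.txt') as f:
    # open_resource returns a binary file object; wrap it to decode
    # as latin-1, the encoding that worked with plain open().
    lines = io.TextIOWrapper(f, encoding='latin-1').read().splitlines()
do_something_with(lines)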

Encoding issue, from html form data, to python print

I'm getting in Python 3 the data from an HTML form. To simplify to the maximum, my Python code looks like this:
#!/usr/bin/python3
import cgi
form = cgi.FieldStorage()
print('Content-type: text/html; charset=utf-8\n')
data = form.getvalue('nom')
print(data)
Now it prints (as it's supposed to) the name filled in the HTML form; however, when that name has an accent (for example Valérie), the accented character is printed as a ? (in this case Python prints Val?rie).
I know it's an encoding problem (Python being notorious for this), and I've searched quite a bit (encode, decode, locale, etc.), but unfortunately didn't get it to work. If anyone knows how to fix this and have it print Valérie, I'd really appreciate it ;-)
EDIT: got it to work using print(data.encode('utf-8').decode('latin-1')). (Presumably this works because Latin-1 maps the first 256 code points one-to-one to bytes: the round trip yields a string whose code points are exactly the UTF-8 bytes, stdout's Latin-1-compatible codec writes those bytes through unchanged, and the browser decodes them as UTF-8.)
Take care.
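A more robust alternative than round-tripping through Latin-1 is the approach from the first answer above: skip the misconfigured text layer and write UTF-8 bytes straight to sys.stdout.buffer (a sketch reusing the question's form field):
#!/usr/bin/python3
import sys
import cgi

form = cgi.FieldStorage()
data = form.getvalue('nom') or ''

# Emit the header and body as UTF-8 bytes, independent of whatever
# locale the web server configured for stdout.
sys.stdout.buffer.write(b'Content-type: text/html; charset=utf-8\n\n')
sys.stdout.buffer.write(data.encode('utf-8'))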

cgi python3 problems with encoding

I created a CGI script (running at localhost with Apache) which loads text from a textarea; then I work with that text. I have problems with characters like š, ť, é, ...: they are not displayed correctly. I tried it in many ways. Here is one version of my short code, in which I am just searching for the right way to deal with it.
#!C:/Python33/python
# -*- coding: UTF-8 -*-
import cgi
import cgitb
cgitb.enable()
form = cgi.FieldStorage()
if form.getvalue('textcontent'):
    text_content = form.getvalue('textcontent')
else:
    text_content = ""
print ("Content-type:text/html")
print ()
print("<!DOCTYPE html>")
print ("<html>")
print ("<head>")
print("<meta charset='UTF-8'></meta>")
print ("</head>")
print ("<body>")
print ("<form>")
print ("text_area:<br />")
print ("<textarea name='textcontent' rows='5' cols='20'></textarea>")
print ("<br />")
print ("<input type='submit' value='submit form' />")
print ("</form>")
print("<p>")
print(text_content)
print("</p>")
print ("</body>")
print ("</html>")
This version uses UTF-8; when I write something to the textarea and submit it, it looks like this:
čítam -> ��tam
When I use latin-1 as the Python encoding and utf-8 as the charset in the HTML part, it works like this:
časa -> časa (correctly)
but characters with an accent mark (for example áno) return an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
sys.stdout.encoding reports cp1250 (I work under Windows) and sys.getdefaultencoding() returns utf-8.
I also tried text_content = (form.getvalue('textcontent')).encode('utf-8'); for example, for the word číslo the result is b'\xef\xbf\xbd\xef\xbf\xbdslo'.
I don't know how to handle this problem.
I need číslo -> číslo, for example.
UPDATE: Now I use UTF-8 both for Python and as the HTML encoding. Work with the text (comparing words with the dictionary, ...) seems to go well, so the only remaining problem is that the output looks like ��tam; I need to make it look like čítam instead.
UPDATE 2: When the encoding is UTF-8 in Python and in the browser too, it displays �s; when I change the browser encoding to cp1250, it displays correctly, but when I refresh the site or click the Submit button it raises UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd'.
UPDATE 3: I tried it on Linux, and after a few problems I found out that the Apache server is using the wrong encoding (ascii), but I haven't managed to fix this yet. I modified /etc/apache2/envvars to set LANG="sk_SK.UTF-8" but got a warning in the terminal from gedit that the edit was not good. So the encoding is still ascii.
Write your form this way:
<form accept-charset="utf-8">
Putting accept-charset="utf-8" in your forms can solve this problem.
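On the server side it may also help (a sketch, untested against this Apache setup) to give FieldStorage an explicit encoding and to rewrap stdout so that print() emits UTF-8 regardless of the cp1250 console locale:
#!C:/Python33/python
# -*- coding: UTF-8 -*-
import sys
import io
import cgi

# Rebind stdout to a UTF-8 text layer over the binary buffer.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

form = cgi.FieldStorage(encoding='utf-8')  # decode submitted fields as UTF-8
text_content = form.getvalue('textcontent', '')

print("Content-type: text/html; charset=utf-8\n")
print("<p>%s</p>" % text_content)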
