Encoding issue, from html form data, to python print - python-3.x

I'm getting in Python 3 the data from an HTML form. To simplify to the maximum, my Python code looks like this:
#!/usr/bin/python3
import cgi
form = cgi.FieldStorage()
print('Content-type: text/html; charset=utf-8\n')
data = form.getvalue('nom')
print(data)
Now it prints (like it's supposed to) the name filled in the HTML form, however when that name has an accent (for example Valérie), then the accented character is printed as a ? (in this case Python prints Val?rie).
I know it's a problem of encoding (Python being notorious for this), and I've searched quite a bit (encode, decode, locale, etc...) but didn't get it to work unfortunately. If anyone knows how to fix this and have it print Valérie, I'd really appreciate it ;-)
EDIT: got it to work using print(data.encode('utf-8').decode('latin-1'))
Take care.
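The workaround in the EDIT suggests that stdout was encoding with Latin-1: decoding the UTF-8 bytes as Latin-1 makes print emit the original UTF-8 bytes again. A less fragile option, sketched here on the assumption that the CGI process's stdout uses a non-UTF-8 encoding, is to wrap the binary stream in a UTF-8 text wrapper. The helper name utf8_writer and the in-memory stand-in for stdout are illustrative only.

```python
import io


def utf8_writer(binary_stream):
    """Return a text stream that always encodes to UTF-8 (hypothetical helper)."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# In the CGI script this would be: out = utf8_writer(sys.stdout.buffer)
# An in-memory stream stands in for stdout so the sketch runs anywhere.
fake_stdout = io.BytesIO()
out = utf8_writer(fake_stdout)
print("Content-type: text/html; charset=utf-8\n", file=out)
print("Valérie", file=out)
out.flush()

sent = fake_stdout.getvalue()  # the bytes the browser would receive

# Why the EDIT workaround works on a Latin-1 stdout: the round trip
# reproduces exactly the UTF-8 bytes of the name.
hack_bytes = "Valérie".encode("utf-8").decode("latin-1").encode("latin-1")
```

With the wrapper in place, print(data, file=out) needs no encode/decode round trip.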

Related

weird characters in Pandas dataframe - how to standardize to UTF-8?

I'm using Python + Camelot (OCR library) to read a PDF, clean up, and write to Excel or csv. There are some non-standard dashes that print out a weird character.
Using Camelot means I'm not calling "read_csv". It's coming from the PDF. A value that is supposed to be "1-4" prints out as 1–4.
I fixed this using a regular expression but a colleague mentioned I should standardize to UTF-8. I tried to do that for the header like this:
header = df.iloc[0, 1:].str.encode('utf-8')
but then that value becomes b'1\xe2\x80\x934'.
Any advice? The goal is to simply use standard text.
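For context, the values are already standard text: a Python str is Unicode, and b'1\xe2\x80\x934' is just the UTF-8 byte form of "1–4" written with an en dash (U+2013). Encoding only matters when the data is written to a file. If the goal is a plain ASCII hyphen, one option is a translation table. The sketch below uses a plain list instead of a DataFrame, and the names DASHES and normalize_dashes are made up for illustration; with pandas, the same function could presumably be applied through Series.map.

```python
# Map common Unicode dash characters to the ASCII hyphen-minus.
DASHES = {
    0x2010: "-",  # hyphen
    0x2011: "-",  # non-breaking hyphen
    0x2012: "-",  # figure dash
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
    0x2212: "-",  # minus sign
}


def normalize_dashes(value):
    """Replace Unicode dashes in a str with '-' (hypothetical helper)."""
    return value.translate(DASHES)


header = ["1\u20134", "5\u20138", "n/a"]  # stand-in for the header row
cleaned = [normalize_dashes(cell) for cell in header]
print(cleaned)
```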

Google Translate API returns non UTF8 characters

Resolved
See the end of this post for the solution.
Good evening.
I'm trying to play with the Google Translate v3 API,
and I've run into a mysterious encoding issue.
I do this:
def translate_text_langueTarget(texteToTranslate, langueTarget):
    parent = client.location_path(project_id, location)
    langueOrigin = detect_language(texteToTranslate)
    if (langueOrigin == "en" and langueTarget == "en"):
        return(texteToTranslate)
    try:
        response = client.translate_text(
            parent=parent,
            contents=[texteToTranslate],
            mime_type='text/plain',
            source_language_code=langueOrigin,
            target_language_code=langueTarget)
        translatedTexte = str(response.translations)[19:-3]
    except:
        translatedTexte = "Sorry my friend, the translation is lost on the internet"
    print(response)
    print(type(response))
    print(response.translations)
    print(type(response.translations))
    return(translatedTexte)
I call this with
stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)
And I expect to get "préféré" as the answer,
but I get:
"pr\303\251f\303\251rer"
I tried to track down the error by adding some debug output to my code:
print(response)
print(type(response))
print(response.translations)
print(type(response.translations))
I think it's an encoding problem, but I can't find an answer.
I'm working in Python, and my script starts with:
#! /usr/bin/env python3
# coding: utf-8
in the header.
Do you have any idea?
Resolved.
I use:
translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
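The two lines above rely on codecs.escape_decode, which is not part of the documented codecs API. As a rough alternative, and assuming the string contains only ASCII characters plus backslash-octal escapes (as in the output shown), the same result can be obtained with the documented unicode_escape codec. The function name below is hypothetical.

```python
def decode_octal_escapes(escaped):
    """Turn text like 'pr\\303\\251f' (literal backslashes) into real UTF-8 text.

    Assumes the input is pure ASCII; anything else raises UnicodeEncodeError.
    """
    # Each escape such as \303 becomes one character with that code point.
    one_char_per_byte = escaped.encode("ascii").decode("unicode_escape")
    # Map every character back to its byte value, then decode as UTF-8.
    return one_char_per_byte.encode("latin-1").decode("utf-8")


raw = r"pr\303\251f\303\251rer"  # what the sliced str(...) contained
print(decode_octal_escapes(raw))  # préférer
```

Reading the translated string from the response object itself, rather than slicing str(response.translations), would likely avoid the escapes altogether.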
Apparently, the response from the API is HTML-escaped: the text itself is UTF-8, but some characters (the apostrophe in this example) come back as HTML character references such as &#39;.
The solution is simple.
import html
sf = "Vinken rejoindra le conseil d&#39;administration en novembre."
print(sf)
# Vinken rejoindra le conseil d&#39;administration en novembre.
print(html.unescape(sf))
# Vinken rejoindra le conseil d'administration en novembre.
+Info https://stackoverflow.com/a/48805931/4752223
The Google Translate API gives you UTF-8 text.
You got the bytes c3 a9 (303 251 in octal), which is exactly é in UTF-8, as expected.
So your code receives correct UTF-8 data and then writes it out in a different, probably wrong, form.
This line does not help here:
# coding: utf-8
It only declares the encoding of the source file itself, which is already UTF-8 by default in Python 3; it has no effect on input or output. If you want your code to treat input and output as UTF-8, you have to say so explicitly. With your code, I assume that one problem is the use of print (it is better to write to a file). On Windows, terminals are not UTF-8 by default; they use a legacy code page such as Windows-1252 (often called "Windows ANSI").
So write to a file with an explicit UTF-8 encoding, or change the terminal settings to get a UTF-8 terminal. In addition, the result may contain escape sequences. Results written in octal look suspicious to me: standard Python does not print a str that way (it would rather complain about a wrong encoding). You may need to parse the response and translate the escape sequences.
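As a minimal sketch of the "write to a file with explicit UTF-8 encoding" advice: the file name and the temporary directory below are placeholders, not anything from the question.

```python
import os
import tempfile

text = "préférer"

# Placeholder location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "translation.txt")

# Passing encoding= explicitly avoids depending on the platform default.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading the raw bytes back shows the UTF-8 form (c3 a9 for each é).
with open(path, "rb") as handle:
    raw = handle.read()
print(raw)
```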

Pyx unicode text

So I am trying to generate postscript from Python.
Currently trying with PyX 0.14.1 on Python 3.4.2,
but I am open to suggestions, if you know something simpler.
I was following mostly the suggestions found on the PyX
mailing list in this thread. This was Python2 and is quite old.
The following shows my current code after many changes:
from pyx import *
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')
c = canvas.canvas()
c.text(5, 5, "Sören Sundstrøm".encode("utf8"))
p = document.page(c, paperformat=document.paperformat.A4,
centered=0)
d = document.document([p])
d.writePSfile('test.ps')
PyX stops with a TexResultError. The interesting part of the error
shows what's happening in TeX:
pyx.text.TexResultError: unhandled TeX response (might be an error)
The expression passed to TeX was:
\ProcessPyXBox{b'S\xc3\xb6ren Sundstr\xc3\xb8m'%
}{1}%
\PyXInput{7}%
After parsing the return message from TeX, the following was left:
*
*! Undefined control sequence.
<argument> b'S\xc
3\xb 6ren Sundstr\xc 3\xb 8m'
<*> }{1}
(cut after 5 lines; use errordetail.full for all output)
So it looks like LaTeX is not receiving UTF-8,
but the escaped representation of a bytes object.
My question: How do I pass the string to canvas.text correctly?
Or is my preamble wrong?
I also tried to follow this answer by wobsta here on SO,
but besides being much too complicated, it does not work for me either.
(Looks like PyX does not understand a metafont message in this case).
Running latex directly on a simple utf-8 input file with the same preamble
works fine by the way.
Looking into the PyX code revealed the problem.
The text module prepares an io.TextIOWrapper with utf-8 encoding to be used for TeX input. The string parameters in text.preamble and canvas.text are passed verbatim to the wrapper, so in Python 3 you just pass a string without any encoding necessary. Encoding will be done by the wrapper.
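The effect can be reproduced without PyX. The sketch below is not PyX's actual code; it only assumes, as described above, that the text ends up in a UTF-8 io.TextIOWrapper, and it contrasts a plain str with the string form of a bytes object.

```python
import io

name = "Sören Sundstrøm"

# Turning bytes into text yields their repr, which is the garbage
# that showed up in the TeX error message.
escaped = str(name.encode("utf8"))
print(escaped)

# A UTF-8 text wrapper given the plain str does the encoding itself.
sink = io.BytesIO()
tex_input = io.TextIOWrapper(sink, encoding="utf-8")
tex_input.write(name)
tex_input.flush()
print(sink.getvalue())
```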
My original unsimplified code had another problem which made it difficult to solve this first problem. So for completeness here's the second problem and its solution. My original code had this order of operations:
from pyx import *
c = canvas.canvas()
# doing other stuff with canvas
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{ucs}')
text.preamble(r'\usepackage[utf8x]{inputenc}')
c.text(5, 5, "Sören Sundstrøm")
p = document.page(c, paperformat=document.paperformat.A4,
centered=0)
d = document.document([p])
d.writePSfile('test.ps')
This does not work either, because a canvas, when it is created, keeps a reference to a text.defaulttexrunner that is set up with the text module's settings at that moment. Later changes to the text module settings never reach the canvas instance. So you have to set up the text module before you create the canvas you want to draw text on.
Thanks to anyone who looked into this.

python 3.3 basic error

I have Python 3.3 installed.
I'm using the example from their site:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
The only thing that happens when I run it is that I get this:
======RESTART=========
I know I am a rookie, but I figured the example from Python's own website should work.
It doesn't. What am I doing wrong? Eventually I want to run the script from the website below, but I think urllib will not work the way it is used on that site. Can someone tell me whether that code will work with Python 3.3?
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/
I think I see what's probably going on. You're likely using IDLE, and when it starts a new run of a program, it prints the
======RESTART=========
line to tell you that a fresh program is starting. That means that all the variables currently defined are reset and/or deleted, as appropriate.
Since your program didn't print any output, you didn't see anything.
The two lines I suggested adding were just tests to figure out what was going on, they're not needed in general. [Unless the window itself is automatically closing, which it shouldn't.] But as a rule, if you want to see output, you'll have to print what you're interested in.
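To make "print what you're interested in" concrete, here is a small sketch. The helper name fetch_text is made up, and a data: URL stands in for 'http://python.org/' so that the example runs without network access.

```python
import urllib.request


def fetch_text(url, fallback_charset="utf-8"):
    """Download url and return the body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, as in the original example
        charset = response.headers.get_content_charset() or fallback_charset
    return raw.decode(charset, errors="replace")


# Stand-in URL; replace it with 'http://python.org/' to fetch the real page.
html = fetch_text("data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E")
print(html[:200])  # without a print, IDLE shows only the RESTART line
```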
Your example works for me. However, I suggest using requests instead of urllib.
A simplified version of the example you linked to would look like:
from bs4 import BeautifulSoup
import requests
resp = requests.get("http://www.wunderground.com/history/airport/KBUF/2007/12/16/DailyHistory.html")
soup = BeautifulSoup(resp.text, "html.parser")

Python 3.1 server-side can't output Unicode string to client

I'm using a free web host but choosing not to work with any Python framework, and am stuck trying to print Chinese characters saved in the source file (using emacs to save file encoded in utf-8) to the resulting HTML page. I thought Unicode "just works" in Python 3.1 so I am baffled. I found three solutions that aren't working. I might just be missing a detail or two.
The host is Alwaysdata, and it has been straightforward to use, so I have little clue about details of how they put together the parts. All I do is upload or edit (with ssh) Python files to a www folder, change permissions, point a browser to the right URL, and it works.
My first attempt works in local IDLE and also in the server's interactive Python shell, which makes me even more confused about why it fails when the output goes to a browser:
#!/usr/bin/python3.1
mystr = "你好世界"
print("Content-Type: text/html\n\n")
print("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""")
print(mystr)
The error is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
Then I tried
print(mystr.encode("utf-8"))
resulting in no error, but the following undesired output to the browser:
b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Third, the following lines were added but got an error:
import sys
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Finally, replacing print with f.write:
import codecs
f = codecs.open(sys.stdout, "w", "utf-8")
mystr = "你好世界"
...
f.write(mystr)
error:
TypeError: invalid file: <_io.TextIOWrapper name='<stdout>'
encoding='ANSI_X3.4-1968'>
How do I get the output to work? Do I need to use a framework for a quick fix?
It sounds like you are using CGI, which is a stupid API because it uses stdout, a stream made for output to humans, to send output to the browser. This is the basic source of your problems.
You need to encode the text as UTF-8 and then write the resulting bytes to sys.stdout.buffer instead of sys.stdout.
And after that, get yourself a web framework. Really, you'll be a lot happier.
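The encode-then-write step could look roughly like this. It is only a sketch: send_page is a made-up helper, and an in-memory stream replaces sys.stdout.buffer so that it can be run outside a CGI environment.

```python
import io


def send_page(body, out):
    """Write a CGI response as bytes to a binary stream (hypothetical helper)."""
    out.write(b"Content-Type: text/html; charset=utf-8\r\n\r\n")
    out.write(body.encode("utf-8"))  # bytes bypass stdout's ASCII text layer
    out.flush()


page = '<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head>\n<body>你好世界</body></html>\n'

# In the real script: send_page(page, sys.stdout.buffer)
demo = io.BytesIO()
send_page(page, demo)
```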

Resources