cgi python3 problems with encoding - python-3.x

I created a CGI script (running on localhost with Apache) that reads text from a textarea so that I can process it. Characters such as š, ť, é are not displayed correctly. I have tried many approaches. Here is one version of my short script, in which I am just looking for the right way to handle this.
#!C:/Python33/python
# -*- coding: UTF-8 -*-
import cgi
import cgitb
cgitb.enable()
form = cgi.FieldStorage()
if form.getvalue('textcontent'):
    text_content = form.getvalue('textcontent')
else:
    text_content = ""
print ("Content-type:text/html")
print ()
print("<!DOCTYPE html>")
print ("<html>")
print ("<head>")
print("<meta charset='UTF-8'></meta>")
print ("</head>")
print ("<body>")
print ("<form>")
print ("text_area:<br />")
print ("<textarea name='textcontent' rows='5' cols='20'></textarea>")
print ("<br />")
print ("<input type='submit' value='submit form' />")
print ("</form>")
print("<p>")
print(text_content)
print("</p>")
print ("</body>")
print ("</html>")
This version uses UTF-8. When I type something into the textarea and submit, it looks like this:
čítam -> ��tam
When I use latin-1 as the Python encoding and utf-8 as the charset in the HTML part, it works like this:
časa -> časa (correctly)
but for characters with an acute accent (for example áno) it raises an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
sys.stdout.encoding reports cp1250 (I am working on Windows), and sys.getdefaultencoding() returns utf-8.
I also tried text_content = (form.getvalue('textcontent')).encode('utf-8'); for the word číslo, for example, the result is b'\xef\xbf\xbd\xef\xbf\xbdslo'.
I don't know how to handle this problem.
I need číslo -> číslo, for example.
UPDATE: Now I use UTF-8 both as the Python encoding and as the HTML encoding. Working with the text (comparing words with the dictionary, etc.) seems to go well, so the only remaining problem is that the output looks like ��tam; I need it to look like čítam instead.
UPDATE 2: When the encoding is UTF-8 and the browser is set to UTF-8 too, it displays �s. When I change the browser encoding to cp1250, it displays correctly, but when I refresh the page or click the Submit button, I get the error UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd'
UPDATE 3: I tried it on Linux, and after a few problems I found out that the Apache server is using the wrong encoding (ascii), but I have not managed to fix this yet. I modified /etc/apache2/envvars to PATH LANG="sk_SK.UTF-8", but gedit printed a warning in the terminal that the edit was not good. So the encoding is still ascii.

Write your form this way:
<form accept-charset="utf-8">
Putting accept-charset="utf-8" on your forms can solve this problem.
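To see why the form's charset matters, here is a minimal sketch of what the script receives. It assumes the form is submitted with GET, so the text arrives percent-encoded in the query string, and it uses urllib.parse.parse_qs in place of the question's cgi.FieldStorage. The helper name read_field is hypothetical.

```python
from urllib.parse import parse_qs


def read_field(query_string, name, encoding="utf-8"):
    """Decode one field of a percent-encoded query string (hypothetical helper)."""
    # Bytes that are not valid in `encoding` become U+FFFD instead of raising.
    fields = parse_qs(query_string, encoding=encoding, errors="replace")
    return fields.get(name, [""])[0]


# "čítam" percent-encoded as UTF-8, as a form with accept-charset="utf-8" would send it.
utf8_query = "textcontent=%C4%8D%C3%ADtam"
# The same word percent-encoded as cp1250, decoded here as if it were UTF-8.
cp1250_query = "textcontent=%E8%EDtam"

# ascii() keeps the demonstration independent of the terminal's own encoding.
print(ascii(read_field(utf8_query, "textcontent")))
print(ascii(read_field(cp1250_query, "textcontent")))
```

The second call produces replacement characters, which is consistent with the ��tam symptom in the question if the browser and the script disagree about the encoding.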

Related

Google Translate API returns non UTF8 characters

Resolved
See the end of this post for the solution.
Good evening.
I'm trying out the Google Translate v3 API,
and I have run into a mysterious encoding issue.
I do this:
def translate_text_langueTarget(texteToTranslate, langueTarget):
    parent = client.location_path(project_id, location)
    langueOrigin = detect_language(texteToTranslate)
    if (langueOrigin == "en" and langueTarget == "en"):
        return(texteToTranslate)
    try:
        response = client.translate_text(
            parent=parent,
            contents=[texteToTranslate],
            mime_type='text/plain',
            source_language_code=langueOrigin,
            target_language_code=langueTarget)
        translatedTexte = str(response.translations)[19:-3]
    except:
        translatedTexte = "Sorry my friend, the translation is lost on the internet"
    print(response)
    print(type(response))
    print(response.translations)
    print(type(response.translations))
    return(translatedTexte)
I call it with:
stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)
I expect to get "préféré" as the answer,
but I get:
"pr\303\251f\303\251rer"
I tried to investigate this with a bit of debug output in my code:
print(response)
print(type(response))
print(response.translations)
print(type(response.translations))
I think it's an encoding problem, but I can't find an answer.
I work in Python and my script is tagged with:
#! /usr/bin/env python3
# coding: utf-8
in the header.
Do you have any idea?
Resolved.
I use:
translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
Apparently, the response from the API is HTML-encoded: the text is UTF-8, but some characters are written as HTML character references.
The solution is simple.
import html
print(sf)
# Vinken rejoindra le conseil d&#39;administration en novembre.
print(html.unescape(sf))
# Vinken rejoindra le conseil d'administration en novembre.
+Info https://stackoverflow.com/a/48805931/4752223
The Google Translate API gives you UTF-8 text.
You got c3 a9 (303 251 as octal numbers), which really is é, as expected.
So your code takes correct UTF-8 text and probably writes it out with the wrong encoding.
This line does not do what you think; it is not useful here:
# coding: utf-8
It only declares the encoding of the source file itself (and UTF-8 is already the default in Python 3); it does not affect input or output. If you want your code to treat input and output as UTF-8, you should say so explicitly. In your code, I assume that one problem is that you use print (it is better to write to a file). On Windows, terminals are not UTF-8 by default but use an older "Windows ANSI" code page such as Windows-1252.
So write to a file (with an explicit UTF-8 encoding), or change the terminal settings to get a UTF-8 terminal. In addition, you may have escape sequences in the results. It looks suspicious to me that the results are written in octal. That is not something standard Python does with a str (it would rather complain about a wrong encoding). You may need to parse the response to translate the escape sequences.
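Translating the escape sequences could look like the sketch below. It assumes the string contains only ASCII characters plus backslash-octal escapes of UTF-8 bytes, as in the output shown in the question, and it uses the documented unicode_escape codec rather than codecs.escape_decode. The function name decode_octal_escapes is hypothetical.

```python
def decode_octal_escapes(escaped):
    """Turn text like 'pr\\303\\251f' (octal-escaped UTF-8 bytes) back into a str."""
    # Step 1: interpret each \ooo escape as the code point with that value.
    as_code_points = escaped.encode("ascii").decode("unicode_escape")
    # Step 2: map code points 0-255 back to single bytes.
    raw_bytes = as_code_points.encode("latin-1")
    # Step 3: decode those bytes as the UTF-8 they really are.
    return raw_bytes.decode("utf-8")


# The literal below holds real backslashes, like the string from the question.
fixed = decode_octal_escapes("pr\\303\\251f\\303\\251rer")
# ascii() keeps the demonstration independent of the terminal's own encoding.
print(ascii(fixed))
```

The round trip through latin-1 is there because unicode_escape produces code points rather than bytes, and latin-1 maps code points 0 to 255 onto the same byte values.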

Encoding issues related to Python and foreign languages

Here is a problem I am facing with encoding and decoding text.
I am trying to write code that finds a string (or a byte sequence) in a file and returns the path of the file.
The files I am opening are reported as 'windows-1252' (cp1252) encoded, so I have been trying to:
1. encode my string into bytes using the encoding of the file
2. match it against the file contents and get the path of that file
I have a file, say 'f', that is reported as 'windows-1252' encoded. It includes text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text)) # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you can see, the bytes for [跑Online農場] are [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I convert the string into bytes directly, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp-1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
and I am not sure how I would 'translate' '[跑Online農場]' into '[¶]Online¹A³õ]'. An answer to this might also solve the problem.
What should I do to 'encode' the Chinese/foreign characters correctly so that they match the 'rb' bytes that Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.
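The trial-and-error approach can be sketched as a small loop. This is only an illustration: the candidate list is an arbitrary selection, and the helper names candidate_codecs and contains_text are hypothetical.

```python
def candidate_codecs(raw, expected, codecs_to_try):
    """Return the codecs under which `raw` decodes to text containing `expected`."""
    hits = []
    for name in codecs_to_try:
        try:
            if expected in raw.decode(name):
                hits.append(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec; try the next one.
            continue
    return hits


def contains_text(raw, text, codec):
    """Search file bytes for `text` by encoding it with the file's real codec."""
    return text.encode(codec) in raw


# Byte string taken from the question.
sample = b'[\xb6]Online\xb9A\xb3\xf5]'
print(candidate_codecs(sample, '農場', ['cp1252', 'cp950', 'gbk', 'shift_jis']))
```

Once a plausible codec is found, contains_text(file_bytes, '[跑Online農場]', 'cp950') does the search the question asks for without decoding the whole file.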

Python 3 character encoding issue

I am selecting values from a MySQL/MariaDB database that uses the latin1 charset with latin1_swedish_ci collation. The data may contain characters from different European languages, such as Spanish ñ, German ä or Norwegian ø.
I get the data with
#!/usr/bin/env python3
# coding: utf-8
...
sql.execute("SELECT name FROM myTab")
for row in sql:
    print(row[0])
There is an error message:
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1'
Okay I have changed my print to
print(str(row[0].encode('utf8')))
and the result looks like this:
b'\xc3\xb1'
I looked at "Working with utf-8 encoding in Python source", but I have already declared the encoding in the header. decode('utf8').encode('cp1250') does not help either.
OK, the encoding issue has finally been solved. Coldspeed gave an important hint about the locale, so all kudos to him! Unfortunately, it was not that easy.
I found a workaround that fixes the problem.
import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
The solution comes from an answer by Jack O'Connor.
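The workaround above replaces sys.stdout with a text stream that always encodes as UTF-8. The sketch below shows the same idea against an in-memory byte stream so that the effect is visible; utf8_writer is a hypothetical helper name. On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') is another way to get the same result.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# An in-memory byte stream stands in for the real stdout here.
fake_stdout = io.BytesIO()
out = utf8_writer(fake_stdout)
out.write("ñ ä ø")
out.flush()
# Shows the UTF-8 bytes that would have been sent to the terminal or pipe.
print(fake_stdout.getvalue())
```

Because the wrapper fixes the encoding itself, the result no longer depends on the locale that Python detected at startup.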
Python3 tries to automatically decode this string based on your locale settings. If your locale doesn't match up with the encoding on the string, you get garbled text, or it doesn't work at all. You can forcibly try encoding it with your locale and then decoding to cp1252 (it seems this is the encoding on the string).
print(row[0].encode('latin-1').decode('cp1252'))

How to solve the ascii problems when I try to download pictures?

Here is my code:
import re,urllib
from urllib import request, parse
def gh(url):
    html=urllib.request.urlopen(url).read().decode('utf-8')
    return html
def gi(x):
    r=r'src="(.+?\.jpg)"'
    imgre=re.findall(r, x)
    y=0
    for iu in imgre:
        urllib.request.urlretrieve(iu, '%s.jpg' %y)
        y=y+1
va=gh('http://tieba.baidu.com/p/3497570603')
print(gi(va))
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode character '\u65e5' in position 873: ordinal not in range(128)
I decoded the content of the website with 'utf-8', which gives a string, so where does the 'ascii' codec problem come from?
The problem is that the HTML content of http://tieba.baidu.com/p/3497570603 contains references to .png images so the non-greedy regular expression matches long strings of text such as
http://static.tieba.baidu.com/tb/editor/images/client/image_emoticon28.png" ><br><br><br><br>
...
title="蓝钻"><img src="http://imgsrc.baidu.com/forum/pic/item/bede9735e5dde711c981db20a0efce1b9f1661d5.jpg
Calling the urlretrieve() method with URLs consisting of long strings containing non-ASCII characters results in the UnicodeEncodeError being thrown while trying to convert the URL argument to ASCII.
A better regular expression to avoid matching too much text would be
r=r'src="([^"]+?\.jpg)"'
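The difference between the two patterns can be checked on a small made-up HTML fragment. The example.com URLs below are placeholders, not taken from the page in the question.

```python
import re

# A made-up fragment: a .png image, some markup with non-ASCII text, then a .jpg image.
sample_html = (
    '<img src="http://example.com/a.png" ><br>'
    '<img title="蓝钻" src="http://example.com/b.jpg">'
)

# Original pattern: "." also matches quotes, so the match can run across several tags.
loose = re.findall(r'src="(.+?\.jpg)"', sample_html)

# Suggested pattern: [^"] keeps the match inside a single attribute value.
strict = re.findall(r'src="([^"]+?\.jpg)"', sample_html)

print(len(loose[0]), len(strict[0]))
```

With the original pattern the single match starts at the .png URL and swallows everything up to the .jpg, including the non-ASCII title text; with the stricter pattern only the .jpg URL is returned.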
Debugging
In the spirit of teaching someone to fish rather than simply giving them a fish for one day, I’d recommend that you use print statements to debug problems such as this. I was able to diagnose this particular issue by replacing the urllib.request.urlretrieve(iu, '%s.jpg' %y) line with print(iu).

Python 3.1 server-side can't output Unicode string to client

I'm using a free web host but choosing not to work with any Python framework, and am stuck trying to print Chinese characters saved in the source file (using emacs to save file encoded in utf-8) to the resulting HTML page. I thought Unicode "just works" in Python 3.1 so I am baffled. I found three solutions that aren't working. I might just be missing a detail or two.
The host is Alwaysdata, and it has been straightforward to use, so I have little clue about details of how they put together the parts. All I do is upload or edit (with ssh) Python files to a www folder, change permissions, point a browser to the right URL, and it works.
My first attempt, which works on local IDLE (and also the server's Python command line interactive shell, which makes me even more confused why it won't work when it's passed to a browser)
#!/usr/bin/python3.1
mystr = "你好世界"
print("Content-Type: text/html\n\n")
print("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""")
print(mystr)
The error is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
Then I tried
print(mystr.encode("utf-8"))
resulting in no error, but the following undesired output to the browser:
b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Third, the following lines were added but got an error:
import sys
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Finally, replacing print with f.write:
import codecs
f = codecs.open(sys.stdout, "w", "utf-8")
mystr = "你好世界"
...
f.write(mystr)
error:
TypeError: invalid file: <_io.TextIOWrapper name='<stdout>'
encoding='ANSI_X3.4-1968'>
How do I get the output to work? Do I need to use a framework for a quick fix?
It sounds like you are using CGI, which is a stupid API as it's using stdout, made for output to humans, to output to your browser. This is the basic source of your problems.
You need to encode it in UTF-8, and then write to sys.stdout.buffer instead of sys.stdout.
And after that, get yourself a web framework. Really, you'll be a lot happier.
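A minimal sketch of writing encoded bytes to sys.stdout.buffer might look like this. The helper name write_utf8 and the charset parameter in the header are assumptions for illustration; the demonstration writes to an in-memory stream so that the bytes can be inspected.

```python
import io
import sys


def write_utf8(text, binary_out=None):
    """Encode text as UTF-8 and write the bytes, bypassing stdout's own encoding."""
    if binary_out is None:
        # In a real CGI script this is the stream the web server reads from.
        binary_out = sys.stdout.buffer
    binary_out.write(text.encode("utf-8"))
    binary_out.flush()


# Demonstration against an in-memory stream instead of the real stdout.
demo = io.BytesIO()
write_utf8("Content-Type: text/html; charset=utf-8\r\n\r\n", demo)
write_utf8("<p>你好世界</p>", demo)
print(demo.getvalue())
```

Since the text is encoded before it reaches the stream, the 'ascii' (ANSI_X3.4-1968) encoding that the server assigned to sys.stdout no longer matters.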

Resources