Python 3 character encoding issue - python-3.x

i am selecting values from a MySQL // Maria DB that contains latin1 charset with latin1_swedish_ci collation. There are possible characters from different European language as Spanish ñ, German ä or Norwegian ø.
I get the data with
#!/usr/bin/env python3
# coding: utf-8
...
sql.execute("SELECT name FROM myTab")
for row in sql
print(row[0])
There is an error message:
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1'
Okay I have changed my print to
print(str(row[0].encode('utf8')))
and the result looks like this:
b'\xc3\xb1'
i looked at this Working with utf-8 encoding in Python source but i have declard the header. Also decode('utf8').encode('cp1250') does not help

okay the encoding issue has been solved finaly. Coldspeed gave a important hind with loacle. therefore all kudos for him! Unfortunately it was not that easy.
I found a workaround that fix the problem.
import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
The solution is from Jack O'Connor. posted in this answer:

Python3 tries to automatically decode this string based on your locale settings. If your locale doesn't match up with the encoding on the string, you get garbled text, or it doesn't work at all. You can forcibly try encoding it with your locale and then decoding to cp1252 (it seems this is the encoding on the string).
print(row[0].encode('latin-1').decode('cp1252'))

Related

Python: How to translate UTF8 String containing unicode decoded characters ("Ok\u00c9" to "Oké")

I'm trying to fix the string I'm getting from my python script.
I'm doing a call to an API, but it is returning me utf8 String that is still containing unicode encoded characters.
stuff like "Ok\u00c9" should be "Oké".
I tried converting it, but all efforts to fix it seem to result in errors or in the same result. is there someone who could fix this for me in Python 3?
print('\u00c9'.encode().decode('unicode-escape'))
>> é
print('Ok\u00c9'.encode().decode('unicode-escape'))
>> should print 'Oké'
>> but gives an error
hope you guys know the solution. thanks in advance!
Ive found the problem. The encoding decoding was wrong. The text came in as Windows-1252 encoding.
I've use
import chardet
chardet.detect(var3.encode())
to detect the proper encoding, and the did a
var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')
conversion to eventually get it in the right format!

Google Translate API returns non UTF8 characters

Resolve
See in the end of this post for the solution
Good evening.
Im trying to play with the google translate v3 api.
And I arrive on a mystical encoding issue.
I do this :
def translate_text_langueTarget(texteToTranslate, langueTarget):
parent = client.location_path(project_id, location)
langueOrigin = detect_language(texteToTranslate)
if (langueOrigin == "en" and langueTarget == "en"):
return(texteToTranslate)
try:
response = client.translate_text(
parent=parent,
contents=[texteToTranslate],
mime_type='text/plain',
source_language_code=langueOrigin,
target_language_code=langueTarget)
translatedTexte = str(response.translations)[19:-3]
except:
translatedTexte = "Sorry my friend, the translation is lost on the internet"
print(response)
print(type(response))
print(response.translations)
print(type(response.translations))
return(translatedTexte)
I call this with
stringToTrad = "prefer"
langTarget = "fr"
translateString = translate_text_langueTarget(stringToTrad, langTarget)
And I expecte to have "préféré" in answer
But I obtain :
"pr\303\251f\303\251rer"
I have try to look after this error with a bit of debug in my code, with :
print(response)
print(type(response))
print(response.translations)
print(type(response.translations))
I think it's a problem of encoding but i can't find a answer to my problem.
I work in python and my scrip is tag :
#! /usr/bin/env python3
# coding: utf-8
in the header
Do you have an idea ?
Resolve.
I use :
translatedTexte = codecs.escape_decode(translatedTexte)[0]
translatedTexte = translatedTexte.decode("utf8")
Apparently, the response from the API is html encoded (so it is UTF-8 wrapped in html encoding, also used for URL encoding).
The solution is simple.
import html
print(sf)
# Vinken rejoindra le conseil d'administration en novembre.
print(html.unescape(sf))
# Vinken rejoindra le conseil d'administration en novembre.
+Info https://stackoverflow.com/a/48805931/4752223
API of Google Translate gives you UTF-8 text.
You got c3 a9 (303 251 as octal numbers) which it is really é, as expected.
So your code take the correct UTF-8 file and it writes it as maybe wrong encoding.
This line is just a myth, not useful:
# coding: utf-8
If you want that your code interpret input and output as UTF-8, you should explicitly say so. With your code, I assume that (one problem) is that you use print (better to write into a file). On Windows, by default, terminals are not UTF-8, but old "Windows ANSI like and extended also know as Windows 1252" encoding.
So write into a file (with explicit UTF-8 encoding), or just change terminal settings, to have UTF-8 terminal. In addition, you may have escape sequences, on results. To me, it smell much, to have results written in octal way. Not a think of standard Python (and it will complain, about wrong encoding). You may need to parse the response, to translate escape sequences.

UnicodeDecodeError/ invalid continuation byte

Can someone say, why is the below item failing? Simple program, but I couldn't find answer anywhere
Python Code
from flask import Flask
app = Flask(__name__)
#app.route('/')
def hello_world():
return 'Hello, World!'
Results
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc8 in position 0: invalid continuation byte
refer to this topic (Automatic Conversion) from Flask docs
Flask expect you are using UTF-8 for encoding files (.py and .html at least)
depending on the code editor you are using you can enforce UTF-8 character encoding
for e.g have look at this thread on How do I convert an ANSI encoded file to UTF-8 with Notepad++?
Update
as per this thread the problem could be OS-related, change your server hostname to a string that only contains ASCII characters.

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding texts.
I am trying to write a code that finds a 'string' or a 'byte' in a file, and return the path of the file.
Currently, since the files I am opening have encoding of 'windows-1252' or 'cp-1252', so I have been trying to:
1. encode my string into a byte corresponding to the encoding of the file
2. match the file and get the path of that file
I have a file, say 'f', that has the encoding of 'windows-1252' or 'cp-1252'. It includes a text that is in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
text = f.read()
print(encoding(text)) # encoding() is a separate function that I wrote that returns the encoding of the file
print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you may see, the 'binary' texts for [跑Online農場] is [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I literally convert the string into bytes, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp-1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
which I am not sure how I would 'translate' '[跑Online農場]' into '[¶]Online¹A³õ]'. Answer to this may also solve the problem
What should I do to correctly 'encode' the Chinese/Foreign characters so that it matches the 'rb' bytes that the Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.

cgi python3 problems with encoding

I created a cgi script (running at localhost with apache) which will load text from textarea and then I will work with it. I have problems with characters like š,ť,é,.. that they are not displayed correctly. I tried it in many ways. Here is one version of my shortcode in which I am just searching for the right way to deal with it.
#!C:/Python33/python
# -*- coding: UTF-8 -*-
import cgi
import cgitb
cgitb.enable()
form = cgi.FieldStorage()
if form.getvalue('textcontent'):
text_content = form.getvalue('textcontent')
else:
text_content = ""
print ("Content-type:text/html")
print ()
print("<!DOCTYPE html>")
print ("<html>")
print ("<head>")
print("<meta charset='UTF-8'></meta>")
print ("</head>")
print ("<body>")
print ("<form>")
print ("text_area:<br />")
print ("<textarea name='textcontent' rows='5' cols='20'></textarea>")
print ("<br />")
print ("<input type='submit' value='submit form' />")
print ("</form>")
print("<p>")
print(text_content)
print("</p>")
print ("</body>")
print ("</html>")
This way is using UTF-8, when I try to write something, it looks like this (write to textarea and submit):
čítam -> ��tam
When I use latin-1 as python encoding and utf-8 as charset in html part it works like this:
časa -> časa (correctly)
but with characters with an accent mark (for example áno) it returns error:
UnicodeEncodeError: 'charmap' codec can't encode character '\\ufffd' in position 0: character maps to <undefined>\r
With sys.stdout.encoding it writes cp1250 encoding (work under windows) and with sys.getdefaultencoding() it returns utf-8
I tried also text_content = (form.getvalue('textcontent')).encode('utf-8') for example word číslo and result is b'\xef\xbf\xbd\xef\xbf\xbdslo'
I don't know how to handle this problem.
I need číslo -> číslo fo example.
UPDATE: Now I have UTF-8 for python as html encoding. It looks like work with text (comparing words with the dictionary,..) is going well, so the only one problem now is that output looks like ��tam, so I need to modify it to look like čítam instead of ��tam.
UPDATE 2: When encoding is UTF-8, and in browser UTF-8 too, it displays �s, when I change browser encoding to cp1250, it displays correctly, but when I refresh the site or click on Submit button it writes error UnicodeEncodeError: 'charmap' codec can't encode character '\\ufffd'
UPDATE 3: Tried it on linux and after a few problems I found out that apache server is using wrong encoding(ascii), but I can't accomplish this problem yet. Modified /etc/apache2/envvars to PATH LANG="sk_SK.UTF-8" but got some warning in the terminal by gedit that editing was not good. So encoding is still ascii.
write your form in this way:
<form accept-charset="utf-8">
put accept-charset = "utf-8" in your forms, it can solve this problems

Resources