ERROR: invalid byte sequence for encoding "UTF8" - linux

I looked at similar questions, but still have not found a suitable solution.
On my Ubuntu OS I created a database with:
createdb PADB -W
And created a table.
create table teacher(
    id_teacher integer PRIMARY KEY,
    name varchar(120),
    experience integer
);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "teacher_pkey" for table "teacher"
I want to add some data containing Cyrillic characters, but I get this error:
PADB=# insert into teacher (name, experience) values ("Пупкин Василий Иванович", 15);
ERROR: invalid byte sequence for encoding "UTF8": 0xd0d0
Here are my lc settings:
PADB=# select name, setting from pg_settings where name like 'lc_%';
name | setting
-------------+-------------
lc_collate | ru_RU.UTF-8
lc_ctype | ru_RU.UTF-8
lc_messages | ru_RU.UTF-8
lc_monetary | ru_RU.UTF-8
lc_numeric | ru_RU.UTF-8
lc_time | ru_RU.UTF-8
(6 rows)
What is wrong?
PostgreSQL 9.1.11

I suspect your client application is actually sending data in koi8-r or iso-8859-5 encoding, not utf-8, but your client_encoding is telling PostgreSQL to expect UTF-8.
Either convert the input data to utf-8, or change your client_encoding to match the input data.
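For example, if your terminal is really sending KOI8-R, you can tell PostgreSQL so for the current session (a sketch; substitute whatever encoding your client actually uses):
SET client_encoding TO 'KOI8R';
In psql, \encoding koi8r has the same effect.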
Decoding your data with different encodings produces:
>>> print "\xd0\xd0".decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
>>> print "\xd0\xd0".decode("koi8-r")
пп
>>> print "\xd0\xd0".decode("iso-8859-5")
аа
However, rather strangely, your input doesn't appear to contain any of these characters. I can't find any encoding of Пупкин Василий Иванович that produces the byte sequence \xd0\xd0, so I'm wondering if there's some double-encoding or similar mangling going on. I'd need to know more about your environment to say more; see the comments on the original question.

I solved the problem, but I don't really know which of my actions actually fixed it:
1) I rebuilt and reinstalled PostgreSQL with the readline and zlib libraries (previously I ran configure with the --without-zlib and --without-readline flags).
2) I started to use single quotes instead of double quotes.
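For the record, in SQL double quotes delimit identifiers (such as column names), while string literals take single quotes, so the second change is almost certainly what mattered. A working statement looks like this (the id value 1 is just an example; id_teacher is the primary key, so it must be supplied):
insert into teacher (id_teacher, name, experience) values (1, 'Пупкин Василий Иванович', 15);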
Thank you all anyway.

Workaround: Place your data in a UTF-8 encoded csv file, then import it (\copy).
You could use Notepad++: Encoding > Convert to UTF-8 to create the file.
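A sketch of the import from psql, assuming a UTF-8 file named teachers.csv with one teacher per row:
\copy teacher(id_teacher, name, experience) from 'teachers.csv' with (format csv, encoding 'utf8')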

Related

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding text.
I am trying to write code that finds a 'string' or 'bytes' in a file and returns the path of the file.
Since the files I am opening are encoded in 'windows-1252' (cp1252), I have been trying to:
1. encode my string into a byte corresponding to the encoding of the file
2. match the file and get the path of that file
I have a file, say 'f', whose encoding is 'windows-1252' (cp1252). It includes text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text))  # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you can see, the byte representation of [跑Online農場] is [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I literally convert the string into bytes, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
which I am not sure how to 'translate': how does '[跑Online農場]' become '[¶]Online¹A³õ]'? An answer to this may also solve the problem.
What should I do to correctly 'encode' the Chinese/foreign characters so that they match the raw ('rb') bytes that Python returns?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, searching for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.
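A minimal sketch of that trial-and-error loop, using the byte string and target text from the question (the file path at the end is hypothetical):
raw = b'[\xb6]Online\xb9A\xb3\xf5]'
target = '[跑Online農場]'

# Try each candidate codec; skip any that cannot decode the bytes at all.
for codec in ('cp1252', 'cp437', 'gbk', 'big5', 'cp950'):
    try:
        if raw.decode(codec) == target:
            print('match:', codec)
    except UnicodeDecodeError:
        pass

# Then read the file with the codec that matched:
with open('f.txt', encoding='cp950') as f:  # hypothetical path
    text = f.read()
Note that closely related codecs may both match; cp950 is essentially Big5 with Microsoft extensions, so 'big5' will likely match here too.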

Python 3 character encoding issue

I am selecting values from a MySQL/MariaDB database that uses the latin1 charset with latin1_swedish_ci collation. There are characters from different European languages, such as Spanish ñ, German ä, or Norwegian ø.
I get the data with
#!/usr/bin/env python3
# coding: utf-8
...
sql.execute("SELECT name FROM myTab")
for row in sql:
    print(row[0])
There is an error message:
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1'
Okay I have changed my print to
print(str(row[0].encode('utf8')))
and the result looks like this:
b'\xc3\xb1'
I looked at this source, Working with utf-8 encoding in Python, but I have already declared the header. Also, decode('utf8').encode('cp1250') does not help.
Okay, the encoding issue has finally been solved. Coldspeed gave an important hint with locale, so all kudos to him! Unfortunately it was not that easy.
I found a workaround that fixes the problem.
import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
The solution is from Jack O'Connor, posted in this answer:
Python 3 tries to automatically decode this string based on your locale settings. If your locale doesn't match up with the encoding of the string, you get garbled text, or it doesn't work at all. You can forcibly encode it back to bytes with latin-1 and then decode it as cp1252 (it seems this is the actual encoding of the string).
print(row[0].encode('latin-1').decode('cp1252'))
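On Python 3.7 and later there is also a built-in way to do what the sys.stdout workaround above does, via io.TextIOWrapper.reconfigure:
import sys
sys.stdout.reconfigure(encoding='utf-8')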

Decoding/Encoding using sklearn load_files

I'm following the tutorial here
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb
to learn about machine learning and text.
In my case, I'm using tweets I downloaded, with positive and negative tweets in the exact same directory structure they are using (trying to learn sentiment analysis).
Here in the IPython Notebook I load my data just like they do:
tweets_train = load_files('Path to my training Tweets')
And then I try to fit them with CountVectorizer
vect = CountVectorizer().fit(text_train)
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte
Is this because my Tweets have all sorts of non-standard text in them? I didn't do any cleanup of my Tweets (I assume there are libraries that help with that in order to make bag-of-words work?)
EDIT:
The code I use to download tweets with Twython:
def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user,
                count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet ids
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()
You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding.
If this means nothing to you, make sure you understand the basics of Unicode and text encoding, eg. with the official Python Unicode HOWTO.
First, you need to find out what encoding was used to store the tweets on disk.
When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. Check this, for example, in an interactive session:
>>> f = open('/tmp/foo', 'a')
>>> f
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
Here you can see that in my local environment the default encoding is set to UTF-8. You can also directly inspect the encoding that open() will default to:
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
There are other ways to find out what encoding was used for the files.
For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
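For example (the path is hypothetical and the output will vary):
$ file --mime-encoding tweets/some_user
tweets/some_user: utf-8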
Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:
tweets_train = load_files('path to tweets', encoding='latin-1')
... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.
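If a few undecodable bytes remain even with the right encoding, load_files also takes a decode_error parameter, so you can substitute offending bytes instead of failing (check your scikit-learn version's documentation):
tweets_train = load_files('path to tweets', encoding='utf-8', decode_error='replace')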

namelist() from ZipFile returns strings with an invalid encoding

The problem is that for some archives or files uploaded to the Python application, ZipFile's namelist() returns badly decoded strings.
from zipfile import ZipFile

for name in ZipFile('zipfile.zip').namelist():
    print('Listing zip files: %s' % name)
How do I fix that code so that file names are always decoded correctly (so Chinese, Russian and other languages are supported)?
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.
How do I fix that code so that file names are always decoded correctly (so Chinese, Russian and other languages are supported)?
Automatically? You can't. Filenames in a basic ZIP file are strings of bytes with no attached encoding information, so unless you know what the encoding was on the machine that created the ZIP you can't reliably get a human-readable filename back out.
There is an extension to the flags on modern ZIP files to tell you that the filename is UTF-8. Unfortunately, files you receive from Windows users typically don't have it, so you're left guessing with inherently unreliable methods like chardet.
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.
Python 2 would just give you raw bytes back. In Python 3 the new behaviour is:
if the UTF-8 flag is set, it decodes the filenames using UTF-8 and you get the correct string value back
otherwise, it decodes the filenames using DOS code page 437, which is pretty unlikely to be what was intended. However, you can re-encode the string back to the original bytes and then try to decode again using the code page you actually want, e.g. name.encode('cp437').decode('cp1252').
Unfortunately (again, because the unfortunatelies never end where ZIP is concerned), ZipFile does this decoding silently without telling you what it did. So if you want to switch and only do the transcode step when the filename is suspect, you have to duplicate the logic for sniffing whether the UTF-8 flag was set:
import chardet

ZIP_FILENAME_UTF8_FLAG = 0x800

for info in ZipFile('zipfile.zip').infolist():
    filename = info.filename
    if info.flag_bits & ZIP_FILENAME_UTF8_FLAG == 0:
        # Name was not flagged as UTF-8: recover the raw bytes and guess
        # the real encoding, falling back to cp1252.
        filename_bytes = filename.encode('cp437')
        guessed_encoding = chardet.detect(filename_bytes)['encoding'] or 'cp1252'
        filename = filename_bytes.decode(guessed_encoding, 'replace')
    ...
Here's the code that decodes filenames in zipfile.py, according to the zip spec, which supports only the cp437 and utf-8 character encodings:
if flags & 0x800:
    # UTF-8 file names extension
    filename = filename.decode('utf-8')
else:
    # Historical ZIP filename encoding
    filename = filename.decode('cp437')
As you can see, if the 0x800 flag is not set, i.e., if utf-8 is not used in your input zipfile.zip, then cp437 is used, and therefore the result for "Chinese, Russian and other languages" is likely to be incorrect.
In practice, ANSI or OEM Windows codepages may be used instead of cp437.
If you know the actual character encoding, e.g., cp866 (the OEM (console) codepage used on Russian Windows), then you can re-encode the filenames to recover the originals:
filename = corrupted_filename.encode('cp437').decode('cp866')
The best option is to create the zip archive using utf-8 so that you can support multiple languages in the same archive:
c:\> 7z.exe a -tzip -mcu archive.zip <files>..
or
$ python -mzipfile -c archive.zip <files>..
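For what it's worth, Python 3's zipfile does the equivalent automatically when writing: filenames that aren't pure ASCII are stored as UTF-8 with the 0x800 flag set. A minimal sketch (the filename is hypothetical):
from zipfile import ZipFile

with ZipFile('archive.zip', 'w') as zf:
    zf.write('Пупкин.txt')  # non-ASCII name: stored as UTF-8, flag 0x800 set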
I had the same problem, but with a known language (Russian).
The simplest solution is just to convert the archive with this utility: https://github.com/vlm/zip-fix-filename-encoding
For me it worked on 98% of archives (it failed on 317 files out of a corpus of 11388).
A more complex solution: use the Python module chardet together with zipfile. This depends on the Python version (2 or 3) you use, since zipfile differs between them. For Python 3 I wrote this code:
import chardet

original_name = name
try:
    name = name.encode('cp437')
except UnicodeEncodeError:
    name = name.encode('utf8')
encoding = chardet.detect(name)['encoding']
name = name.decode(encoding)
This code tries to handle old-style zips (named in CP437, possibly mangled); if that fails, it assumes the archive is new-style (UTF-8). After determining the proper encoding, you can extract files with code like:
from shutil import copyfileobj

with archive.open(original_name) as fp, open(name, 'wb') as fp_out:
    copyfileobj(fp, fp_out)
In my case, this resolved the last 2% of failed files.

Python's handling of shell strings

I still do not completely understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same issue.
What I know:
str is an older beast that saves strings encoded by one of the way too many encodings that history has forced us to work with.
unicode is a more standardised way of representing strings using a huge table of all possible characters, emojis, little pictures of dog poop and so on.
The decode function transforms strings to unicode, encode does the other way around.
If, in Python's shell, I simply say:
>>> my_string = "some string"
then my_string is a str variable encoded in ascii (and, because ascii is a subset of utf-8, it is also encoded in utf-8).
Therefore, for example, I can convert this into a unicode variable by saying one of the lines:
>>> my_string.decode('ascii')
u'some string'
>>> my_string.decode('utf-8')
u'some string'
What I don't know:
How does Python handle non-ascii strings that are passed in the shell, and, knowing this, what is the correct way of saving the word "kožušček"?
For example, I can say
>>> s1 = 'kožušček'
In which case s1 becomes a str instance that I am unable to convert into unicode:
>>> s1='kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')
Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)
Now, naturally I can't decode the string with ascii, but what encoding should I use instead? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when fed the line s1 = 'kožušček'?
Another thought I had was to say
>>> s2 = u'kožušček'
But then, when I printed s2, I got
>>> print s2
kouèek
which means that Python lost a whole letter. Can someone explain this to me?
str objects contain bytes. What those bytes represent Python doesn't dictate. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they contain bytes representing UTF-8 data they can be decoded as such. If they contain bytes representing an image, then you can decode that information and display an image somewhere. When you use repr() on a str object Python will leave any bytes that are ASCII printable as such, the rest are converted to escape sequences; this keeps debugging such information practical even in ASCII-only environments.
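To make that concrete, here are the same two bytes decoded under two different assumptions (a Python 2 session; the byte values are just an illustration):
>>> b = '\xc3\xa9'
>>> b.decode('utf-8')    # as UTF-8, these two bytes form one character
u'\xe9'
>>> b.decode('latin-1')  # as Latin-1, they form two characters
u'\xc3\xa9'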
Your terminal or console in which you are running the interactive interpreter writes bytes to the stdin stream that Python reads from when you type. Those bytes are encoded according to the configuration of that terminal or console.
In your case, your console encoded the input you typed to a Windows codepage, most likely. You'll need to figure out the exact codepage and use that codec to decode the bytes. Codepage 1252 seems to fit:
>>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
kožušèek
When you print those same bytes, your console is reading those bytes and interpreting them in the same codec it is already configured with.
Python can tell you what codec it thinks your console is set to; it tries to detect this information for Unicode literals, where the input has to be decoded for you. It uses the locale.getpreferredencoding() function to determine this, and the sys.stdin and sys.stdout objects have an encoding attribute; mine is set to UTF-8:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> 'kožušèek'
'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
>>> u'kožušèek'
u'ko\u017eu\u0161\xe8ek'
>>> print u'kožušèek'
kožušèek
Because my terminal has been configured for UTF-8 and Python has detected this, using a Unicode literal u'...' works. The data is automatically decoded by Python.
Why exactly your console lost a whole letter I don't know; I'd have to have access to your console and do some more experiments, see the output of print repr(s2), and test all bytes between 0x00 and 0xFF to see if this is on the input or output side of the console.
I recommend you read up on Python and Unicode:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Your system does not necessarily use the sys.getdefaultencoding() encoding; it is merely the default used when you convert without telling it the encoding, as in:
>>> sys.getdefaultencoding()
'ascii'
>>> unicode(s1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
Python's idea of your system locale is in the locale module:
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
And using this we can decode the string:
>>> u1=s1.decode(locale.getdefaultlocale()[1])
>>> u1
u'ko\u017eu\u0161\u010dek'
>>> print u1
kožušček
There's a chance the locale has not been set up, as is the case for the 'C' locale. That may cause the reported encoding to be None even though the default is 'ascii'. Normally figuring this out is the job of setlocale, which getpreferredencoding will automatically call. I would suggest calling it once in your program startup and saving the value returned for all further use. The encoding used for filenames may also be yet another case, reported in sys.getfilesystemencoding().
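A sketch of that startup step, using only standard locale calls:
import locale

# Adopt the user's locale once, early in program startup...
locale.setlocale(locale.LC_ALL, '')
# ...then save the preferred encoding for all further use.
PREFERRED_ENCODING = locale.getpreferredencoding()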
The Python-internal default encoding is set up by the site module, which contains:
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii"  # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)  # Needs Python Unicode build !
So if you want it set by default in every run of Python, you can change that first if 0 to if 1.
