why does file.tell() affect encoding? - python-3.x

Calling tell() while reading a GBK-encoded file of mine causes the next call to readline() to raise a UnicodeDecodeError. However, if I don't call tell(), it doesn't raise this error.
C:\tmp>hexdump badtell.txt
000000: 61 20 6B 0D 0A D2 BB B0-E3 a k......
C:\tmp>type test.py
with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        print(line)
C:\tmp>python test.py
a k
Traceback (most recent call last):
File "test.py", line 4, in <module>
line = f.readline();
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd2 in position 0: incomplete multibyte sequence
When I remove the f.tell() call, it decodes successfully. Why?
I tried Python 3.4/3.5 x64 on Win7/Win10; the behavior is the same everywhere.
Does anyone have any idea? Should I report a bug?
I have a big text file, and I really want to get file position ranges for this big text. Is there a workaround?

OK, there is a workaround; it works so far:
with open(r'c:\tmp\badtell.txt', "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        line = line.decode("gbk").strip('\n')
        print(line)
I submitted an issue here yesterday: http://bugs.python.org/issue26990
Still no response yet.

I just replicated this on Python 3.4 x64 on Linux. Looking at the docs for TextIOBase, I don't see anything that says tell() causes problems with reading a file, so maybe it is indeed a bug.
b'\xd2'.decode('gbk')
gives an error like the one that you saw, but in your file that byte is followed by the byte BB, and
b'\xd2\xbb'.decode('gbk')
gives a value equal to '\u4e00', not an error.
I found a workaround that works for the data in your original question, but not for other data, as you've since found. Wish I knew why! I called seek() after every tell(), with the value that tell() returned:
pos = f.tell()
f.seek(pos)
line = f.readline()
An alternative to f.seek(f.tell()) is to use the SEEK_CUR mode of seek() to give the position. With an offset of 0, this does the same as the above code: moves to the current position and gets that position.
pos = f.seek(0, io.SEEK_CUR)
line = f.readline()
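If tell()/seek() on the text wrapper keeps misbehaving, another hedged option is to do the bookkeeping yourself: read the file in binary mode (where tell() is a plain byte offset) and decode each line with GBK's incremental decoder, which buffers a trailing partial multibyte sequence instead of raising. A minimal sketch (the function name is mine, not from the question):

```python
import codecs

def lines_with_offsets(path, encoding="gbk"):
    """Yield (byte_offset, decoded_line) pairs from a file.

    Reading in binary mode keeps tell() cheap and exact, and the
    incremental decoder tolerates multibyte sequences split across reads.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # byte offset where this line starts
            raw = f.readline()
            if not raw:
                break
            yield pos, decoder.decode(raw)
```

This gives byte positions you can later pass to seek() on a binary handle, which is usually what "position ranges of a big text file" ends up needing anyway.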


Error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am a newbie in programming and have a question:
I am trying to edit some .vtt files, where I want to remove certain substrings from the text; the file should keep its structure. For this, I copied the .vtt files into the folder and changed the extension to .txt. Now I run this simple code:
import os

file_index = 0
all_text = []
path = "/Users/username/Documents/programming/IMS/Translate/files/"
new_path = "/Users/username/Documents/programming/IMS/Translate/new_files/"
for filename in os.listdir(path):
    if os.path.isfile(os.path.join(path, filename)):  # check that the entry is a file, using its full path
        with open(os.path.join(path, filename), 'r') as file:  # open in read-only mode
            for line in file.read().split("\n"):  # read the file and split it into lines
                line = " ".join(line.split())
                start_index = line.find("[")  # index of the first character to remove, or -1 if absent
                last_index = start_index + 11  # index one past the last character to remove
                if start_index != -1:
                    line = line[:start_index] + line[last_index:]  # slice out the unwanted substring
                    all_text.append(line)
                else:
                    all_text.append(line)
I get this error message:
> File "srt-files-strip.py", line 11, in <module>
>     for line in file.read().split("\n"): #read lines and split
> File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
I have searched through different forums and changed to encoding="utf16", but to no avail. The strange thing is that it did work earlier on. Then I wrote a program to rename my files automatically, and after that, it threw this error. I have cleared all files in the folder and copied the original ones in again ... can't get it to work. Would really appreciate your help, as I really have no idea where to look. Thx
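Before guessing at encodings, it can help to look at what actually sits at the offset the exception reports. A small diagnostic sketch (the helper name is mine): read the file in binary and inspect the bytes around the failing position. 0x80 is never a valid UTF-8 start byte, so the file is likely in another encoding (cp1252 and mac-roman are common suspects for subtitle files) or contains corruption.

```python
def bytes_around(data: bytes, position: int, context: int = 10) -> bytes:
    """Return the raw bytes surrounding a failing offset."""
    start = max(0, position - context)
    return data[start:position + context]

# In the real case you would use:
#   data = open(os.path.join(path, filename), "rb").read()
# with the position from the UnicodeDecodeError (3131 here).
sample = b"caf\x80e latte"            # \x80 is invalid as a UTF-8 start byte
print(bytes_around(sample, sample.index(0x80)))  # -> b'caf\x80e latte'
```

Seeing the surrounding bytes usually makes it obvious whether to pass a different encoding= to open or to repair the file.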

why does it give UnicodeDecodeError?

I'm doing a Python course on py4e and I'm almost done, but chapter 11 seems impossible because it gives me an error every time.
Error:
line 4, in <module>
lines = ffail.read()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
Code:
import re

ffail = open('regex_sum_340933.txt')
lines = ffail.read()
count = 0
match = re.findall('[0-9]+', lines)
for II in match:
    number = int(II)
    count = count + number
print(count)
Try this:
import re
lines = open('regex_sum_340933.txt', encoding='utf-8', errors='ignore').read()
count = sum(map(int, re.findall('[0-9]+', lines)))
You are not doing it right. First of all, you need to close the file; I would suggest just using with so you won't need to worry about closing the file.
Replace how you read the file with this:
ffail = ""
with open("regex_sum_340933.txt", mode="r", encoding='UTF-8', errors='ignore', buffering=-1) as some_file:
    ffail = some_file.read()
Make sure that regex_sum_340933.txt is in the same directory as the code file.
If you are still having difficulties, you could visit this question.
Thanks for helping; the code wasn't wrong, my Mac just didn't want to make it work. I tried on Windows and the answer came immediately.

Fixing AttributeError: 'file' object has no attribute 'buffer' (Python3)

Python 2.7 on Ubuntu. I tried to run a small Python script (a file converter) written for Python 3 and got this error:
$ python uboot_mdb_to_image.py < input.txt > output.bin
Traceback (most recent call last):
File "uboot_mdb_to_image.py", line 29, in <module>
ascii_stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='ascii', errors='strict')
AttributeError: 'file' object has no attribute 'buffer'
I suspect it's caused by differences between Python 3 and Python 2; here is the script itself:
#!/usr/bin/env python3
import sys
import io

BYTES_IN_LINE = 0x10  # Number of bytes to expect in each line

c_addr = None
hex_to_ch = {}
ascii_stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='ascii', errors='strict')
for line in ascii_stdin:
    line = line[:-1]  # Strip the linefeed (we can't strip all white
                      # space here, think of a line of 0x20s)
    data, ascii_data = line.split(" ", maxsplit=1)
    straddr, strdata = data.split(maxsplit=1)
    addr = int.from_bytes(bytes.fromhex(straddr[:-1]), byteorder='big')
    if c_addr != addr - BYTES_IN_LINE:
        if c_addr:
            sys.exit("Unexpected c_addr in line: '%s'" % line)
    c_addr = addr
    data = bytes.fromhex(strdata)
    if len(data) != BYTES_IN_LINE:
        sys.exit("Unexpected number of bytes in line: '%s'" % line)
    # Verify that the mapping from hex data to ASCII is consistent (sanity check for transmission errors)
    for b, c in zip(data, ascii_data):
        try:
            if hex_to_ch[b] != c:
                sys.exit("Inconsistency between hex data and ASCII data in line (or the lines before): '%s'" % line)
        except KeyError:
            hex_to_ch[b] = c
    sys.stdout.buffer.write(data)
Can anyone advice how to fix this please?
It's an old question, but since I've run into a similar issue and it came up first when googling the error...
Yes, it's caused by a difference between Python 3 and 2. In Python 3, sys.stdin is wrapped in io.TextIOWrapper. In Python 2 it's a file object, which doesn't have a buffer attribute. The same goes for stderr and stdout.
In this case, the same functionality in Python 2 can be achieved using codecs standard library:
ascii_stdin = codecs.getreader("ascii")(sys.stdin, errors="strict")
However, this snippet provides an instance of codecs.StreamReader, not io.TextIOWrapper, so may be not suitable in other cases. And, unfortunately, wrapping Python 2 stdin in io.TextIOWrapper isn't trivial - see Wrap an open stream with io.TextIOWrapper for more discussion on that.
The script in question has more Python 2 incompatibilities. Related to the issue in question, sys.stdout doesn't have a buffer attribute either, so the last line should be
sys.stdout.write(data)
Other things I can spot:
In Python 2, str.split doesn't accept maxsplit as a keyword argument; use line.split(" ", 1) instead.
int doesn't have a from_bytes method in Python 2. But int(straddr[:-1].encode('hex'), 16) seems to be equivalent.
The bytes type is Python 3 only. In Python 2, bytes is an alias for str.
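To illustrate the int.from_bytes point with something that runs on both sides: struct.unpack with a big-endian format code exists in Python 2 and 3 alike, so a struct-based helper (my sketch, shown in Python 3 syntax) can stand in for int.from_bytes(..., byteorder='big') on these short address fields:

```python
import struct

def be_int(raw):
    """Big-endian bytes -> int, a stand-in for int.from_bytes(raw, 'big').

    Assumes len(raw) <= 4, which holds for the short addresses in this dump.
    """
    padded = b"\x00" * (4 - len(raw)) + raw   # left-pad to 4 bytes
    return struct.unpack(">I", padded)[0]     # unsigned 32-bit, big-endian

print(be_int(bytes.fromhex("0010")))          # -> 16, same as int.from_bytes
```

In Python 2 you would build raw with '0010'.decode('hex') instead of bytes.fromhex, as the answer above notes.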

python3 UnicodeEncodeError: 'charmap' codec can't encode characters in position 95-98: character maps to <undefined>

A month ago I came across this GitHub project: https://github.com/taraslayshchuk/es2csv
I installed this package via pip3 on Ubuntu Linux. When I wanted to use it, I found that the package is written for Python 2. I dived into the code and soon found the problem:
    for line in open(self.tmp_file, 'r'):
        timer += 1
        bar.update(timer)
        line_as_dict = json.loads(line)
        line_dict_utf8 = {k: v.encode('utf8') if isinstance(v, unicode) else v for k, v in line_as_dict.items()}
        csv_writer.writerow(line_dict_utf8)
    output_file.close()
    bar.finish()
else:
    print('There is no docs with selected field(s): %s.' % ','.join(self.opts.fields))
The code did a check for unicode, which is not necessary in Python 3, so I changed it to the code below. As a result, the package worked properly under Ubuntu 16.
    for line in open(self.tmp_file, 'r'):
        timer += 1
        bar.update(timer)
        line_as_dict = json.loads(line)
        # line_dict_utf8 = {k: v.encode('utf8') if isinstance(v, unicode) else v for k, v in line_as_dict.items()}
        csv_writer.writerow(line_as_dict)
    output_file.close()
    bar.finish()
else:
    print('There is no docs with selected field(s): %s.' % ','.join(self.opts.fields))
But a month later, I needed to get the es2csv package working on Windows 10. After making the exact same adjustments to es2csv under Windows 10, I got the following error message when I tried to run it:
PS C:\> es2csv -u 192.168.230.151:9200 -i scrapy -o database.csv -q '*'
Found 218 results
Run query [#######################################################################################################################] [218/218] [100%] [0:00:00] [Time: 0:00:00] [ 2.3 Kidocs/s]
Write to csv [# ] [2/218] [ 0%] [0:00:00] [ETA: 0:00:00] [ 3.9 Kilines/s]
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python36\Scripts\es2csv-script.py", line 11, in <module>
load_entry_point('es2csv==5.2.1', 'console_scripts', 'es2csv')()
File "c:\users\admin\appdata\local\programs\python\python36\lib\site-packages\es2csv.py", line 284, in main
es.write_to_csv()
File "c:\users\admin\appdata\local\programs\python\python36\lib\site-packages\es2csv.py", line 238, in write_to_csv
csv_writer.writerow(line_as_dict)
File "c:\users\admin\appdata\local\programs\python\python36\lib\csv.py", line 155, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "c:\users\admin\appdata\local\programs\python\python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 95-98: character maps to <undefined>
Does anyone has an idea how to fix this error message?
It's due to the default behaviour of open in Python 3. By default, Python 3 opens files in text mode, which means it also has to apply a text encoding, such as UTF-8 or ASCII, to every character it reads or writes.
Python will use your locale to determine the most suitable encoding. On OS X and Linux, this is usually UTF-8. On Windows, it'll use an 8-bit character set, such as Windows-1252, to match the behaviour of Notepad.
Since an 8-bit character set only has a limited number of characters, it's very easy to end up trying to write a character it doesn't support; for example, a Hebrew character cannot be written with Windows-1252, the Western European character set.
To resolve your problem, you simply need to override the automatic encoding selection in open and hardcode it to use UTF-8. Note that your traceback shows the cp1252 encode failing inside csv_writer.writerow, so the output file the CSV is written to needs an explicit encoding='utf-8' as well, not just the temp file you read:
for line in open(self.tmp_file, 'r', encoding='utf-8'):
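A short sketch of the difference: the implicit default comes from the locale (commonly cp1252 on Windows), while passing encoding='utf-8' behaves the same on every platform. The file name here is made up for the demo:

```python
import locale
import os
import tempfile

# What open() uses when you don't pass encoding= (cp1252 on many Windows setups):
print(locale.getpreferredencoding(False))

text = "shalom \u05d0"                         # a Hebrew letter; cp1252 cannot encode it
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", encoding="utf-8") as f:   # explicit encoding: portable
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    assert f.read() == text                    # round-trips everywhere
```

With the default left in place on Windows, the f.write(text) line would be the one raising the UnicodeEncodeError from the question.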

how to skip enumerate encoding exception in python3?

I crafted a script to preprocess a large CSV for importing into a database:
with open(sys.argv[1], encoding='utf-16') as _f:
    for i, line in enumerate(_f):
        try:
            .... some stuff with line ...
        except Exception as e:
            ...
But at some point it raises an exception in enumerate:
...
File "/Users/elajah/PycharmProjects/untitled1/importer.py", line 94, in main
for i, line in enumerate(_f):
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/utf_16.py", line 69, in _buffer_decode
return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 0: truncated data
...
How can I skip broken lines in the file without interrupting the script flow?
You can pass the parameter errors="ignore" to open, to tell Python that you don't care about encoding errors when reading from the file.
with open(sys.argv[1], errors="ignore") as _f:
This may behave oddly, however, since it will just drop the invalid bytes, not skip the whole line the invalid bytes showed up on.
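That caveat is easy to demonstrate with bytes.decode, which is what the text wrapper calls under the hood: errors='ignore' drops just the offending bytes and keeps everything else on the line, while errors='replace' at least marks where the damage was.

```python
raw = b"abc\x80def"                            # \x80 is not valid UTF-8 here
print(raw.decode("utf-8", errors="ignore"))    # -> abcdef  (the bad byte silently vanishes)
print(raw.decode("utf-8", errors="replace"))   # -> abc\ufffddef  (U+FFFD marks the spot)
```

If you need to know which lines were damaged, errors='replace' plus a check for '\ufffd' is often a better fit than errors='ignore'.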
If the behavior you need is to ignore the whole line if anything goes wrong with the decoding, you might be better off reading the file in binary mode and trying the decoding yourself inside your try/except block, inside the loop:
with open(sys.argv[1], 'rb') as _f:
    for i, line_bytes in enumerate(_f):
        try:
            line = line_bytes.decode('utf-16')
            # do some stuff with line ...
        except UnicodeDecodeError:
            pass
A final idea is to fix whatever is wrong with your file's data so you don't get decoding errors when reading it. But who knows how easy that is. If you're getting the file from somewhere else, out of your control, there may not be any practical way to fix it ahead of time.
You ignore an exception by catching it and then doing nothing:
try:
    .... some stuff with line ...
except UnicodeDecodeError as e:
    pass
But it will depend on the situation if that is really what you want.
You can find the name of the exception in the last line of the stack trace
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 0: truncated data
