why does it give UnicodeDecodeError? - python-3.x

I'm taking a Python course on py4e and I'm almost done, but chapter 11 seems impossible because it gives me an error every time.
Error:
line 4, in <module>
    lines = ffail.read()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
Code:
import re

ffail = open('regex_sum_340933.txt')
lines = ffail.read()
count = 0
match = re.findall('[0-9]+', lines)
for II in match:
    number = int(II)
    count = count + number
print(count)

Try this:
import re
lines = open('regex_sum_340933.txt', encoding='utf-8', errors='ignore').read()
count = sum(map(int, re.findall('[0-9]+', lines)))

You are not doing it right. First of all, you need to close the file; I would suggest using with so you won't need to worry about closing it. Replace how you read the file with this:
ffail = ""
with open("regex_sum_340933.txt", mode = "r" ,encoding='UTF-8', errors='ignore', buffering=-1) as some_file:
ffail = some_file.read()
Make sure that regex_sum_340933.txt is in the same directory as the code file.
If you are still having difficulties, you could visit this question.

Thanks for helping; the code wasn't wrong, my Mac just didn't want to make it work. I tried Windows and the answer came immediately.
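For what it's worth, the platform difference is most likely the default encoding: open() without an encoding argument falls back to locale.getpreferredencoding(), which is usually UTF-8 on macOS but a legacy code page such as cp1252 on Windows, and cp1252 happens to decode the bytes in this particular file. A minimal sketch to check this on both machines (the last line just repeats the explicit-encoding suggestion from the answers above):

import locale

# What open() falls back to on this machine, e.g. 'UTF-8' on macOS or 'cp1252' on Windows
print(locale.getpreferredencoding())

# Passing an encoding explicitly makes the script behave the same everywhere
lines = open('regex_sum_340933.txt', encoding='utf-8', errors='ignore').read()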

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake

rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
  File "extract-keywords.py", line 11, in <module>
    text = infile.read()
  File "c:\python36\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th character of the file is a "u", so I don't know where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We would have to know about the code that wrote the files in the first place to tell the correct way to decode. However, the ’ character is the 0x92 byte in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
text = infile.read()
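If you're not sure the files really are cp1252, an encoding-detection library can help narrow it down. This is a sketch rather than part of the original answer; chardet and the file name here are assumptions:

import chardet

with open("files/example.txt", "rb") as f:  # hypothetical file name
    raw = f.read()

# Treat the guess as a hint to verify, not as ground truth
guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
text = raw.decode(guess["encoding"])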
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible; try that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace":
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
... f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
... print(f.read()) # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
... print(f.read()) # using incorrect encoding and replacing errors
this is a right quote: �

Error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am a newbie in programming and have a question:
I am trying to edit some .vtt files, where I want to remove certain substrings from the text. The file should keep its structure. For this, I copied the .vtt files into the folder and changed the extension to .txt. Now I run this simple code:
import os

file_index = 0
all_text = []
path = "/Users/username/Documents/programming/IMS/Translate/files/"
new_path = "/Users/username/Documents/programming/IMS/Translate/new_files/"
for filename in os.listdir(path):
    if os.path.isfile(filename):  # check if there is a file in the directory
        with open(os.path.join(path, filename), 'r') as file:  # open in read-only mode
            for line in file.read().split("\n"):  # read lines and split
                line = " ".join(line.split())
                start_index = line.find("[")  # index of the first character of the substring to remove
                last_index = start_index + 11  # index just past the last character to remove
                if start_index != -1:
                    line = line[:start_index] + line[last_index:]  # slice out the unwanted substring, keeping the rest
                    all_text.append(line)
                else:
                    all_text.append(line)
I get this error message:
> File "srt-files-strip.py", line 11, in <module>
> for line in file.read().split("\n"): #read lines and split File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
> (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position
> 3131: invalid start byte
I have searched through different forums and changed to encoding="utf16", but to no avail. The strange thing is that it did work earlier. Then I wrote a program to rename my files automatically, and after that it threw this error. I have cleared all the files in the folder and copied the original ones in again ... I can't get it to work. I would really appreciate your help, as I have no idea where to look. Thanks.
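One way to see what is actually at the failing offset is to read the file in binary and inspect the bytes around position 3131 (a diagnostic sketch, not a fix; the file name is hypothetical):

import os

path = "/Users/username/Documents/programming/IMS/Translate/files/"
filename = "example.txt"  # hypothetical: use the file that raises the error

with open(os.path.join(path, filename), "rb") as f:  # binary mode never raises UnicodeDecodeError
    raw = f.read()
print(raw[3110:3150])  # the bytes surrounding the offset from the error message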

How can I copy all PDF pages into a TXT file in Python?

I have written the following script, in order to extract the text of a PDF file into plain text and save it into a TXT file:
import PyPDF2

def pdfToTxt(pdfFile):
    pdfFileObject = open(pdfFile, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    numberOfPages = pdfReader.numPages
    tempFile = open(r"temp.txt", "a")
    for p in range(numberOfPages):
        pagesObject = pdfReader.getPage(p)
        text = pagesObject.extractText()
        tempFile.writelines(text)
    tempFile.close()

pdfToTxt("PdfFile.pdf")
The code works fine for the first 15 pages, which are successfully written to the temp.txt file, but after the 15th page I get the following error:
Traceback (most recent call last):
  File "PdfToTextExtractor.py", line 35, in <module>
    pdfToTxt("PdfFile.pdf")
  File "PdfToTextExtractor.py", line 30, in pdfToTxt
    tempFile.writelines(text)
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 0: character maps to <undefined>
It seems that the character '\ufb01' (the "fi" ligature, LATIN SMALL LIGATURE FI) is the problem.
In case you have any idea how to overcome this issue, please let me know.
To overcome this issue, you can replace the character with another one (say, a space) before you write it to the file.
In that case, add the following line inside the for loop:
text = text.replace('\ufb01', " ")
The method should then look like this:
def pdfToTxt(pdfFile):
    pdfFileObject = open(pdfFile, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    numberOfPages = pdfReader.numPages
    tempFile = open(r"temp.txt", "a")
    for p in range(numberOfPages):
        pagesObject = pdfReader.getPage(p)
        text = pagesObject.extractText()
        text = text.replace('\ufb01', " ")
        tempFile.writelines(text)
    tempFile.close()
When opening your tempFile, set the encoding like so:
tempFile = open(r"temp.txt","a", encoding='utf-8')
The issue is in the way you open the file, so replace
tempFile = open(r"temp.txt","a")
with the same open call plus an extra parameter:
tempFile = open(r"temp.txt","a", encoding="utf-8")
Additionally, I suggest using a context manager for any file operations, which ensures that the file is closed correctly even if an unexpected exception is raised:
with open(r"temp.txt","a") as tempFile:
...
Also, if you do this, you can remove the file-closing call after the for loop.
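Putting the two suggestions together, a sketch of the whole function using with blocks might look like this (same PyPDF2 calls as in the question, just restructured):

import PyPDF2

def pdfToTxt(pdfFile):
    # Both files are closed automatically, even if an exception is raised
    with open(pdfFile, 'rb') as pdfFileObject:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
        with open(r"temp.txt", "a", encoding="utf-8") as tempFile:
            for p in range(pdfReader.numPages):
                tempFile.writelines(pdfReader.getPage(p).extractText())

pdfToTxt("PdfFile.pdf")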

Web scraping Python program returns "'charmap' codec can't encode character"

import bs4 as bs
import urllib.request
import re
import os
from colorama import Fore, Back, Style, init

init()

# keywords and newurls are defined elsewhere in the full script
def highlight(word):
    if word in keywords:
        return Fore.RED + str(word) + Fore.RESET
    else:
        return str(word)

for newurl in newurls:
    url = urllib.request.urlopen(newurl)
    soup1 = bs.BeautifulSoup(url, 'lxml')
    paragraphs = soup1.findAll('p')
    print(Fore.GREEN + soup1.h2.text + Fore.RESET)
    print('')
    for paragraph in paragraphs:
        if paragraph != None:
            textpara = paragraph.text.strip().split(' ')
            colored_words = list(map(highlight, textpara))
            print(" ".join(colored_words).encode("utf-8"))  # encode("utf-8")
        else:
            pass
I have a list of keywords and URLs to go through.
After running a few keywords through a URL, I get output like this:
b'\x1b[31mthe desired \x1b[31mmystery corners \x1b[31mthe differential .
\x1b[31mthe back \x1b[31mpretends to be \x1b[31mthe'
I removed encode("utf-8") and I get an encoding error:
Traceback (most recent call last):
  File "C:\Users\resea\Desktop\Python Projects\Try 3.py", line 52, in <module>
    print(" ".join(colored_words)) #encode("utf-8")
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 41, in write
    self.__convertor.write(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 162, in write
    self.write_and_convert(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 190, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 195, in write_plain_text
    self.wrapped.write(text[start:end])
  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 23: character maps to <undefined>
Where am I going wrong?
I know what I'm going to suggest is more of a workaround than a "solution", but I've been frustrated, again and again, by all sorts of strange characters that had to be dealt with by "encode this" or "encode that", sometimes successfully and many times not.
Depending on the type of text used in your newurl, the universe of problematic characters is probably limited, so I deal with them on a case-by-case basis. Every time I get one of these errors, I do this:
import unicodedata
unicodedata.name('\u2019')
In your case, you'll get this:
'RIGHT SINGLE QUOTATION MARK'
The old, pesky right single quotation mark... So next, as suggested here, I just replace that pesky character with another that looks like it but does not raise the error; in your case
colored_words = [word.replace(u"\u2019", "'") for word in map(highlight, textpara)]  # or some other replacement character
should work (note that replace() is a string method, so it has to be applied to each word rather than to the list itself). And you rinse and repeat every time this error pops up. Admittedly, this is not the most elegant solution, but after a while all the possible strange characters in your newurl are captured and the errors stop.
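If the list of offenders keeps growing, the case-by-case replacements can live in one mapping so the rest of the script never changes. A minimal sketch of that idea (the helper name and the second entry are my own, hypothetical additions):

# Grows one entry at a time: whenever a new UnicodeEncodeError appears,
# identify the character with unicodedata.name() and add a look-alike here.
REPLACEMENTS = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK (hypothetical extra entry)
}

def sanitize(text):
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

print(sanitize("the car\u2019s engine"))  # -> the car's engine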

why does file.tell() affect encoding?

Calling tell() while reading a GBK-encoded file of mine causes the next call to readline() to raise a UnicodeDecodeError. However, if I don't call tell(), it doesn't raise this error.
C:\tmp>hexdump badtell.txt
000000: 61 20 6B 0D 0A D2 BB B0-E3 a k......
C:\tmp>type test.py
with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        print(line)
C:\tmp>python test.py
a k
Traceback (most recent call last):
File "test.py", line 4, in <module>
line = f.readline();
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd2 in position 0: incomplete multibyte sequence
When I remove the f.tell() call, it decodes successfully. Why?
I tried Python 3.4/3.5 x64 on Win7/Win10, and it is the same everywhere.
Anyone have any idea? Should I report a bug?
I have a big text file, and I really want to get the file position ranges of this big text; is there a workaround?
OK, there is a workaround; it works so far:
with open(r'c:\tmp\badtell.txt', "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        line = line.decode("gbk").strip('\n')
        print(line)
I submitted an issue yesterday here: http://bugs.python.org/issue26990. Still no response yet.
I just replicated this on Python 3.4 x64 on Linux. Looking at the docs for TextIOBase, I don't see anything that says tell() causes problems with reading a file, so maybe it is indeed a bug.
b'\xd2'.decode('gbk')
gives an error like the one that you saw, but in your file that byte is followed by the byte 0xBB, and
b'\xd2\xbb'.decode('gbk')
gives a value equal to '\u4e00', not an error.
I found a workaround that works for the data in your original question, but not for other data, as you've since found. Wish I knew why! I called seek() after every tell(), with the value that tell() returned:
pos = f.tell()
f.seek(pos)
line = f.readline()
An alternative to f.seek(f.tell()) is to use the SEEK_CUR mode of seek() to get the position. With an offset of 0, this does the same as the code above: it moves to the current position and returns that position. (SEEK_CUR lives in the io module, so this needs import io.)
pos = f.seek(0, io.SEEK_CUR)
line = f.readline()
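For completeness, here is a sketch of the question's loop with that workaround applied (assuming the same badtell.txt file and GBK encoding from the question; as noted above, this helped for this file but not for all data):

import io

with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        # seek(0, SEEK_CUR) both reports and re-establishes the current
        # position, which is the workaround described above
        pos = f.seek(0, io.SEEK_CUR)
        line = f.readline()
        if not line:
            break
        print(line)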
