Some 'utf-8' codec can't decode byte - python-3.x

I get an error when I download a web page using wget.
Code:
import threading
import urllib.request
import os
import re
import time
import json
def wget(url):
    #self.url = url
    data = os.popen('wget -qO- %s'% url).read()
    return data
print (wget("http://jamesholm.se/dj.php"))
Error:
Traceback (most recent call last):
File "stand-alone-check-url.py", line 13, in <module>
print (wget("http://jamesholm.se/dj.php"))
File "stand-alone-check-url.py", line 10, in wget
data = os.popen('wget -qO- %s'% url).read()
File "/usr/local/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 13133: invalid start byte
How can I overcome this error?

You can't decode arbitrary byte sequences as utf-8 encoded text:
>>> b'\xa9'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
The page indicates that it uses utf-8 but the actual data that the server sends is not utf-8. It happens.
There is bs4.UnicodeDammit that allows you to handle data with inconsistent encodings:
import bs4 # $ pip install beautifulsoup4
print(bs4.UnicodeDammit.detwingle(b'S\x9aben - Ostwind Rec').decode('utf-8'))
# -> Sšben - Ostwind Rec
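If you would rather stay with the standard library, one option is to try a short list of candidate encodings and only then fall back to replacement characters. This is a minimal sketch, not part of the original answer: the helper name decode_with_fallback and the candidate list (UTF-8 first, then Windows-1252) are assumptions made for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return text decoded with the first encoding that accepts every byte.

    Hypothetical helper: the candidate list is an assumption, so adjust it
    to the encodings the server is likely to send.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate decoded cleanly: mark the bad bytes with U+FFFD instead.
    return raw.decode(encodings[0], errors="replace")


text = decode_with_fallback(b"S\x9aben - Ostwind Rec")
```

Unlike detwingle, this decodes the whole byte string with a single encoding, so it will not repair a page that mixes UTF-8 and Windows-1252 in the same document.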

Instead of wget, use the requests Python module. Its .text attribute decodes the response body for you:
>>> import requests
>>> data = requests.get("http://jamesholm.se/dj.php").text
>>> print(data)
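If wget has to stay, the decoding step can be taken out of os.popen's hands by capturing raw bytes and decoding them yourself. The sketch below is only an illustration: run_and_capture_bytes and wget_bytes are hypothetical names, and it assumes the wget executable is on PATH.

```python
import subprocess


def run_and_capture_bytes(command):
    """Run a command (a list of arguments) and return its stdout as bytes."""
    completed = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    return completed.stdout


def wget_bytes(url):
    # Assumes wget is installed. No shell is involved, so the URL is passed
    # as one argument instead of being interpolated into a command string.
    return run_and_capture_bytes(["wget", "-qO-", url])
```

The caller then decides how to decode, for example wget_bytes(url).decode("utf-8", errors="replace"), so a stray byte no longer raises UnicodeDecodeError inside read().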

Related

Error while reading a csv file by using csv module in python3

When I am trying to read a csv file I am getting this type of error:
Traceback (most recent call last):
File "/root/Downloads/csvafa.py", line 4, in <module>
for i in a:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The code that I used:
import csv
with open('Book1.csv') as f:
    a=csv.reader(f)
    for i in a:
        print(i)
I even tried changing the encoding to latin1:
import csv
with open('Book1.csv',encoding='latin1') as f:
    a=csv.reader(f)
    for i in a:
        print(i)
After that I get this error message:
Traceback (most recent call last):
File "/root/Downloads/csvafa.py", line 4, in <module>
for i in a:
_csv.Error: line contains NUL
I am a beginner in Python.
This error is raised when Python tries to decode bytes that are not valid in the expected encoding. When the bytes in the file cannot be interpreted as UTF-8, Python raises a UnicodeDecodeError. You can try a different encoding, such as 'latin-1' (also known as 'iso-8859-1').
import pandas as pd
dataset = pd.read_csv('Book1.csv', encoding='ISO-8859-1')
It can also be that the data is compressed. Have a look at this answer.
I would try reading the file with UTF-8 encoding.
another solution might be this answer
It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get.
You could try decompressing the data on the fly:
import gzip
import pandas as pd
with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)
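To avoid guessing, you can check the first two bytes for the gzip magic number before deciding how to read the data. The following is a rough stdlib-only sketch; read_csv_rows is a hypothetical helper name, and it assumes the decompressed content is UTF-8 unless another encoding is passed in.

```python
import csv
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"


def read_csv_rows(raw, encoding="utf-8"):
    """Parse CSV rows from raw bytes, decompressing first if they are gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    text = raw.decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))
```

You would feed it the result of reading the file in binary mode, so the same call works whether or not the file turns out to be compressed.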

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 756

I'm unable to retrieve the data from a Microsoft Excel document. I've tried using the encodings 'Latin-1' and 'UTF-8', but then it gives me hundreds of \x00's in the terminal. Is there any way I can retrieve the data and output it to a text file?
This is what I'm running on the terminal and the error I get:
PS C:\Users\Andy-\Desktop> python.exe SRT411-Lab2.py Lab2Data.xlsx
Traceback (most recent call last):
File "SRT411-Lab2.py", line 9, in <module>
lines = file.readlines()
File "C:\ProgramFiles\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 756: character maps to <undefined>
Any help is greatly appreciated!
#!/usr/bin/python3
import sys
filename = sys.argv[1]
print(filename)
file = open(filename, 'r')
lines = file.readlines()
file.close()
print(lines)
I'd probably convert the Excel file to a CSV file and use pandas to parse it.
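The reason reading it as text fails is that an .xlsx file is a ZIP archive of XML parts, not plain text. As a rough sketch (looks_like_xlsx is a hypothetical helper, and checking for the conventional xl/workbook.xml part is an assumption about how the workbook was written), you can confirm that before picking a reader:

```python
import zipfile


def looks_like_xlsx(file):
    """Return True if `file` (a path or binary file object) is a ZIP archive
    containing the workbook part that .xlsx files conventionally have."""
    if not zipfile.is_zipfile(file):
        return False
    with zipfile.ZipFile(file) as archive:
        return "xl/workbook.xml" in archive.namelist()
```

If it returns True, a spreadsheet-aware reader is the right tool rather than open(filename, 'r'); pandas.read_excel is one option, provided an Excel engine such as openpyxl is installed.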

What is this error when I try to parse a simple pcap file?

import dpkt
f = open('gtp.pcap')
pcap = dpkt.pcap.Reader(f)
for ts, buf in pcap:
    eth = dpkt.ethernet.Ethernet(buf)
    print(eth)
Traceback (most recent call last):
File "new.py", line 4, in <module>
pcap = dpkt.pcap.Reader(f)
File "/home/user/gtp_gaurang/venv/lib/python3.5/site-packages/dpkt/pcap.py", line 244, in __init__
buf = self.__f.read(FileHdr.__hdr_len__)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 16: invalid start byte
I am running this simple pcap parser code, but it shows the error above. Can anyone please help?
Can you please check this link.
Related Answer
According to that answer, the UTF-8 decoder has hit a byte that it cannot decode. If you read the file in binary mode instead, no decoding takes place, the contents stay as bytes, and the error does not occur.
Open the file in binary mode
f = open('gtp.pcap', 'rb')
pcap = dpkt.pcap.Reader(f)
...
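To see why text mode cannot work here, it helps to look at what a capture file starts with. The sketch below builds and unpacks the 24-byte global header of a classic little-endian pcap file using struct. It is an illustration only: parse_pcap_header is a hypothetical name, the sample header is synthetic, and it does not cover pcapng or big-endian captures.

```python
import struct

# Classic pcap global header: magic, version major/minor, timezone offset,
# timestamp accuracy, snapshot length, link type (24 bytes, little-endian).
PCAP_HEADER = struct.Struct("<IHHiIII")


def parse_pcap_header(header):
    """Unpack the global header of a classic little-endian pcap file."""
    magic, major, minor, _zone, _sigfigs, snaplen, linktype = PCAP_HEADER.unpack(header)
    if magic != 0xA1B2C3D4:
        raise ValueError("not a little-endian, microsecond-resolution pcap header")
    return {"version": (major, minor), "snaplen": snaplen, "linktype": linktype}


# Synthetic header for illustration: version 2.4, snaplen 65535, Ethernet.
sample = PCAP_HEADER.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
info = parse_pcap_header(sample)
```

These bytes are packed integers, not characters, so asking a UTF-8 decoder to interpret them fails; with a real file you would pass the first 24 bytes read from open(path, 'rb').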

How to convert large binary file into pickle dictionary in python?

I am trying to convert a large binary file that contains Arabic words with 300-dimensional vectors into a pickled dictionary.
What I have written so far is:
import pickle
ArabicDict = {}
with open('cc.ar.300.bin', encoding='utf-8') as lex:
    for token in lex:
        for line in lex.readlines():
            data = line.split()
            ArabicDict[data[0]] = float(data[1])
pickle.dump(ArabicDict,open("ArabicDictionary.p","wb"))
The error which I am getting is:
Traceback (most recent call last):
File "E:\Dataset", line 4, in <module>
for token in lex:
File "E:\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
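The traceback suggests that cc.ar.300.bin is a binary file, so it cannot be iterated line by line as UTF-8 text. If a text export of the same vectors is available, with one word followed by its float components on each line (an assumption; the question does not say so), the dictionary could be built and pickled roughly like this. The function name and the three-dimensional sample lines are hypothetical.

```python
import pickle


def vectors_from_lines(lines):
    """Map each word to its list of floats; skip lines without a vector."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            continue  # a "<count> <dim>" header line or a blank line
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


# Made-up sample in the assumed text layout (3 dimensions instead of 300).
sample_lines = ["2 3", "كتاب 0.1 0.2 0.3", "قلم -0.4 0.5 0.6"]
table = vectors_from_lines(sample_lines)
restored = pickle.loads(pickle.dumps(table))
```

Note that the whole vector is stored per word, whereas float(data[1]) in the question would keep only the first component.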

Can't get correct representation for some Italian words

I need to normalize text from the Italian wiki using python3 and nltk, and I've run into one problem. Most of the words are OK, but some words are mapped incorrectly; more precisely, some of their symbols are.
For example:
'fruibilit\xe3', 'n\xe2\xba', 'citt\xe3'
I'm sure that the problem is in symbols like à, è.
Code:
# coding: utf8
import os
from nltk import corpus, word_tokenize, ConditionalFreqDist
it_sw_plus = corpus.stopwords.words('italian') + ['doc', 'https']
#it_folder_names = ['AA', 'AB', 'AC', 'AD', 'AE', 'AF']
it_path = os.listdir('C:\\Users\\1\\projects\\i')
it_corpora = []
def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw_plus and token.isalpha():
            token = token.lower().encode('utf8')
            norm_tokens.append(token)
    return norm_tokens
for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path)
        raw_text = text_file.read().decode('utf8')
        norm_tokens = normalize(raw_text)
        it_corpora.append(norm_tokens)
print(it_corpora)
How can I resolve this problem?
I'm running on Win7(rus).
When I try this code:
import io
with open('C:\\Users\\1\\projects\\i\\AA\\wiki_00', 'r', encoding='utf8') as fin:
    for line in fin:
        print (line)
In PowerShell:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
File "i.py", line 5, in <module>
print (line)
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>
In Python command line:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\1\projects\i.py", line 5, in <module>
print (line)
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
3: character maps to <undefined>
When I try the request:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
90: character maps to <undefined>
If you know the file's encoding, try specifying it when reading the file. In Python 2:
import io
with io.open(filename, 'r', encoding='latin-1') as fin:
    for line in fin:
        print line # line is a unicode string decoded from latin-1
In your case, though, the file you posted is a UTF-8 file, not a Latin-1 file. In Python 3:
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/GiteItAwayNow/TrueTry/master/it'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> text = data.decode('utf8')
>>> print (text) # this prints the file perfectly.
To read a 'utf8' file in Python 2:
import io
with io.open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a unicode string decoded from utf8
To read a 'utf8' file in Python 3:
with open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a str decoded from utf8
As a good practice, when dealing with text data, try to use unicode and python3 whenever possible. Do take a look at
https://docs.python.org/3.5/howto/unicode.html#the-string-type
What's the deal with Python 3.4, Unicode, different languages and Windows?
UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function
Additionally, if you haven't installed this module for printing UTF-8 on the Windows console, you should try it:
pip install win-unicode-console
Or download this: https://pypi.python.org/packages/source/w/win_unicode_console/win_unicode_console-0.4.zip and then python setup.py install
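If installing a module is not an option, a cruder workaround is to replace whatever the console encoding cannot represent before printing. This is only a sketch under stated assumptions: console_safe is a hypothetical helper, cp866 is taken from the traceback in the question, and the sample line is made up.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# Made-up sample line containing U+2013 (en dash), which cp866 lacks.
line = "Armonium \u2013 test"
print(console_safe(line, "cp866"))
```

This loses the unprintable characters on screen but leaves the data read from the file untouched, which matters if you go on to tokenize it.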

Resources