Decoding/Encoding using sklearn load_files - python-3.x

I'm following the tutorial here
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb
to learn about machine learning and text.
In my case, I'm using tweets I downloaded, with positive and negative tweets in the exact same directory structure they are using (trying to learn sentiment analysis).
Here in the IPython notebook I load my data just like they do:
tweets_train = load_files('Path to my training Tweets')
And then I try to fit them with CountVectorizer:
vect = CountVectorizer().fit(tweets_train.data)
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position
561: invalid continuation byte
Is this because my Tweets contain all sorts of non-standard text? I didn't do any cleanup of my Tweets (I assume there are libraries that help with that in order to make a bag of words work?)
EDIT:
The code I use with Twython to download tweets:
def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user,
                count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet ids
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()

You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding.
If this means nothing to you, make sure you understand the basics of Unicode and text encoding, e.g. with the official Python Unicode HOWTO.
First, you need to find out what encoding was used to store the tweets on disk.
When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. Check this, for example, in an interactive session:
>>> f = open('/tmp/foo', 'a')
>>> f
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
Here you can see that in my local environment the default encoding is set to UTF-8. You can also directly inspect the default encoding with
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
There are other ways to find out what encoding was used for the files.
For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:
tweets_train = load_files('path to tweets', encoding='latin-1')
... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.
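If you would rather detect the encoding than guess it, here is a minimal sketch using the third-party chardet package (the file paths are placeholders, as above):
import chardet
from sklearn.datasets import load_files

# sniff the encoding of one raw tweet file
with open('path to one tweet file', 'rb') as f:
    guess = chardet.detect(f.read())  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.7, ...}

tweets_train = load_files('path to tweets', encoding=guess['encoding'])
Detection is a statistical guess, so check the reported confidence before trusting the result.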

Related

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 10: invalid start byte

I am trying to read MIDI music files and process them a bit using the music21 library. I am using a self-defined read_midi function and getting this error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 10: invalid start byte"
import os
# array processing
import numpy as np

# specify the path
path = 'audio/'
# read all the filenames
files = [i for i in os.listdir(path) if i.endswith(".mid")]
# reading each midi file
notes_array = np.array([read_midi(path + i) for i in files])
Here is the read_midi function:
from music21 import converter, instrument, note, chord

def read_midi(file):
    print("Loading Music File:", file)
    notes = []
    notes_to_parse = None
    # parsing a midi file
    midi = converter.parse(file)
    # grouping based on different instruments
    s2 = instrument.partitionByInstrument(midi)
    # looping over all the instruments
    for part in s2.parts:
        # select elements of only piano
        if 'Piano' in str(part):
            notes_to_parse = part.recurse()
            # finding whether a particular element is a note or a chord
            for element in notes_to_parse:
                # note
                if isinstance(element, note.Note):
                    notes.append(str(element.pitch))
                # chord
                elif isinstance(element, chord.Chord):
                    notes.append('.'.join(str(n) for n in element.normalOrder))
    return np.array(notes)
Kindly suggest how I can get rid of this error.
An answer I got from the music21 Google Group that fixed my problem:
Hi, and thanks for the report. This is a regression caused by a new feature in 6.1.0 that creates Instrument objects from the text of MIDI track names. It's fixed in the next unreleased version (likely to be 6.2.0), which is available now on GitHub. If that's too cumbersome to install, you can also just edit your own copy of music21 to apply the fix found here: https://github.com/cuthbertLab/music21/pull/607/files
For the curious, the original feature wrongly assumed all MIDI track names would be encoded using utf-8. The files we found to fail each had a copyright symbol in the track name, and they were each created by "www.piano-midi.de". Would you mind sharing what MIDI writer created your file?
Also, I would very much appreciate you sharing this answer on Stack Overflow, since I'm not active there.
Cheers, and happy music21-ing,
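If you want to check whether your installed copy predates that fix, a minimal sketch (importlib.metadata is in the standard library from Python 3.8):
from importlib.metadata import version

# the regression was introduced in 6.1.0, so anything newer should be safe
print(version('music21'))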

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte while accessing csv file

I am trying to access a CSV file from an AWS S3 bucket and getting the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. My code is below; I am using Python 3.7.
from io import StringIO
import boto3
import gzip
import pandas as pd

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here; how can I resolve this issue?
Using gzip worked for me. (Byte 0x8b at position 1 matches the second byte of the gzip magic number \x1f\x8b, which suggests the object is a gzip-compressed CSV rather than mis-encoded text.)
client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket=####, Key=###)
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
The error you're getting means the CSV file you're getting from this S3 bucket is not encoded using UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
Finally, I suggest that, instead of using an explicit decode() and using a StringIO object for the contents of the file, store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can use that to write a copy of them to a local file (that you can then inspect with a text editor, or on Excel.)
Also, if you want to do detection of the encoding (using chardet, for example), you need to do so before you decode it, so again in that case you need the raw bytes, so that's yet another advantage to using the BytesIO here.
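For instance, here is a minimal sketch of the detection route, assuming response is the get_object result from the question and that the chardet package is installed:
import io
import chardet
import pandas as pd

raw = response['Body'].read()  # keep the raw, undecoded bytes around
guess = chardet.detect(raw)    # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
data = pd.read_csv(io.BytesIO(raw), encoding=guess['encoding'])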

Python 3 pickle load from Python 2

I have a pickle file that was created (I don't know how exactly) in python 2. It is intended to be loaded by the following python 2 lines, which when used in python 3 (unsurprisingly) do not work:
with open('filename', 'r') as f:
    foo, bar = pickle.load(f)
Result:
'ascii' codec can't decode byte 0xc2 in position 1219: ordinal not in range(128)
Manual inspection of the file indicates it is utf-8 encoded, therefore:
with open('filename', 'r', encoding='utf-8') as f:
    foo, bar = pickle.load(f)
Result:
TypeError: a bytes-like object is required, not 'str'
With binary encoding:
with open('filename', 'rb', encoding='utf-8') as f:
    foo, bar = pickle.load(f)
Result:
ValueError: binary mode doesn't take an encoding argument
Without binary encoding:
with open('filename', 'rb') as f:
    foo, bar = pickle.load(f)
Result:
UnpicklingError: invalid load key, '\n'.
Is this pickle file just broken? If not, how can I pry this thing open in python 3? (I have browsed the extensive collection of related questions and not found anything that works yet.)
Finally, note that the original
import cPickle as pickle
has been replaced with
import _pickle as pickle
Loading Python 2 pickles in Python 3 (version 3.7.2 in this example) can be helped by the fix_imports parameter of the pickle.load function, but in my case it also worked without setting that parameter to True.
I was attempting to load a scipy.sparse.csr.csr_matrix contained in a pickle generated with Python 2.
When inspecting the file format using the UNIX command file it says:
$ file -bi python2_generated.pckl
application/octet-stream; charset=binary
I could load the pickle in Python3 using the following code:
with open("python2_generated.pckl", "rb") as fd:
bh01 = pickle.load(fd, fix_imports=True, encoding="latin1")
Note that the loading was successful with and without setting fix_imports to True
As for the "latin1" encoding, the Python3 documentation (version 3.7.2) for the pickle.load function says:
Using encoding='latin1' is required for unpickling NumPy arrays and instances of datetime, date and time pickled by Python 2
Although this is specific to scipy matrices (or NumPy arrays), and Novak does not clarify what his pickle file contained,
I hope this is of help to other users :)
Two errors were conflating each other.
First: By the time the .p file reached me, it had almost certainly been corrupted in transit, likely by FTP-ing (or similar) in ASCII rather than binary mode. I was able to get my hands on a properly transmitted copy, which allowed me to discover...
Second: Whatever the file might have implied on the inside, the proper encoding was 'latin1' not 'utf-8'.
So in a sense, yes, the file was broken, and even after that I was doing it wrong. I leave this here as a reminder to whoever eventually has the next bizarre pickle/python2/python3 issue: there can be multiple things gone wrong, and they have to be solved in the correct order.
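As a quick sanity check for that first failure mode, the standard-library pickletools module can disassemble a pickle without executing it; if the stream was mangled in transit, the disassembly usually fails at the first bad opcode. A minimal sketch, reusing the placeholder file name from above:
import pickletools

with open('filename', 'rb') as f:
    pickletools.dis(f.read())  # raises ValueError early on a mangled stream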

Arabic text replaced with escape sequences when creating CSV files using python

I am trying to create a CSV file that contains Arabic tweets collected using tweepy for a project I am doing. All is fine gathering the data; however, when I am writing to the CSV file, all Arabic results are escaped with \xXXXX sequences
as follows:
b'#\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85_\xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a_\xd9\x84\xd9\x84\xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd9\x87_2017 \xd8\xa7\xd9\x84\xd8\xa5\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xad\xd9\x82\xd9\x8a\xd9\x82\xd9\x8a\xd8\xa9 \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x81\xd9\x83\xd8\xb1 \xd9\x88\xd9\x84\xd9\x8a\xd8\xb3\xd8\xaa \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9
I looked at many previously asked questions and all I could find was suggestions for python 2 or answers similar to the one I am writing. When I was creating JSON files instead I was using ensure_ascii=False but I couldn't find anything similar for CSV. Below is my code:
with codecs.open('tweets.csv', 'a', encoding='utf-8') as file:
    fieldnames = ['tweet', 'country']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    data = {'tweet': status.text, 'country': status.place.full_name}
    writer.writerow(data)
I tried adding .encode('utf-8') to status.text and status.place as well, but that also didn't work. Any suggestions?
You have to make sure the Arabic text is decoded from UTF-8 into a str before you write it. Assuming status.text is of type bytes, you should write text = status.text.decode('utf-8'). (Maybe you have to do this for status.place.full_name too.) But if it's of type str, then it won't have a decode() method. To avoid escape sequences in your file, a str object should be written anyway.
If you try to specify the encoding of a bytes object (like the one you presumably have) as 'utf-8', that won't work, because the text is already UTF-8 bytes. To write UTF-8 characters rather than UTF-8 bytes, you must call decode() on the bytes object.
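Putting it together, a minimal sketch of the fixed writer, assuming status is the Tweepy status object from the question (the isinstance check covers both cases described above):
import csv

with open('tweets.csv', 'a', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['tweet', 'country'])
    text = status.text.decode('utf-8') if isinstance(status.text, bytes) else status.text
    writer.writerow({'tweet': text, 'country': status.place.full_name})
The built-in open with newline='' replaces codecs.open here, which is what the csv module's documentation recommends.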

namelist() from ZipFile returns strings with an invalid encoding

The problem is that for some archives or files uploaded to the Python application, ZipFile's namelist() returns badly decoded strings.
from zipfile import ZipFile

for name in ZipFile('zipfile.zip').namelist():
    print('Listing zip files: %s' % name)
How can I fix this code so it always decodes file names to Unicode (so that Chinese, Russian, and other languages are supported)?
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or how to apply chardet to them.
How can I fix this code so it always decodes file names to Unicode (so that Chinese, Russian, and other languages are supported)?
Automatically? You can't. Filenames in a basic ZIP file are strings of bytes with no attached encoding information, so unless you know what the encoding was on the machine that created the ZIP you can't reliably get a human-readable filename back out.
There is an extension to the flags on modern ZIP files to tell you that the filename is UTF-8. Unfortunately, files you receive from Windows users typically don't have it, so you'll be left guessing with inherently unreliable methods like chardet.
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or how to apply chardet to them.
Python 2 would just give you raw bytes back. In Python 3 the new behaviour is:
if the UTF-8 flag is set, it decodes the filenames using UTF-8 and you get the correct string value back
otherwise, it decodes the filenames using DOS code page 437, which is pretty unlikely to be what was intended. However, you can re-encode the string back to the original bytes, and then try to decode again using the code page you actually want, e.g. name.encode('cp437').decode('cp1252').
Unfortunately (again, because the unfortunatelies never end where ZIP is concerned), ZipFile does this decoding silently without telling you what it did. So if you want to switch and only do the transcode step when the filename is suspect, you have to duplicate the logic for sniffing whether the UTF-8 flag was set:
import chardet
from zipfile import ZipFile

ZIP_FILENAME_UTF8_FLAG = 0x800

for info in ZipFile('zipfile.zip').infolist():
    filename = info.filename
    if info.flag_bits & ZIP_FILENAME_UTF8_FLAG == 0:
        filename_bytes = filename.encode('cp437')
        guessed_encoding = chardet.detect(filename_bytes)['encoding'] or 'cp1252'
        filename = filename_bytes.decode(guessed_encoding, 'replace')
    ...
Here's the code in zipfile.py that decodes filenames; per the zip spec, only the cp437 and utf-8 character encodings are supported:
if flags & 0x800:
# UTF-8 file names extension
filename = filename.decode('utf-8')
else:
# Historical ZIP filename encoding
filename = filename.decode('cp437')
As you can see, if the 0x800 flag is not set, i.e., if utf-8 is not used in your input zipfile.zip, then cp437 is used, and the result for "Chinese, Russian and other languages" is therefore likely to be incorrect.
In practice, ANSI or OEM Windows codepages may be used instead of cp437.
If you know the actual character encoding, e.g., cp866 (the OEM/console codepage used on Russian Windows), then you can re-encode the filenames to recover the originals:
filename = corrupted_filename.encode('cp437').decode('cp866')
The best option is to create the zip archive using utf-8 so that you can support multiple languages in the same archive:
c:\> 7z.exe a -tzip -mcu archive.zip <files>..
or
$ python -mzipfile -c archive.zip <files>..`
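When writing archives, Python 3's zipfile module handles this automatically: a name that fits in ASCII is stored as ASCII, and anything else is stored as UTF-8 with the 0x800 flag set. A minimal sketch (the Russian filename is hypothetical):
from zipfile import ZipFile, ZIP_DEFLATED

with ZipFile('archive.zip', 'w', ZIP_DEFLATED) as zf:
    zf.write('файл.txt')  # non-ASCII name, so it is stored with the UTF-8 flag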
I got the same problem, but with a known language (Russian).
The simplest solution is to convert the archive with this utility: https://github.com/vlm/zip-fix-filename-encoding
For me it works on 98% of archives (it failed on 317 files out of a corpus of 11,388).
A more complex solution: use the Python module chardet together with zipfile. But it depends on which Python version (2 or 3) you use, as zipfile differs between them. For Python 3 I wrote this code:
import chardet

original_name = name
try:
    name = name.encode('cp437')
except UnicodeEncodeError:
    name = name.encode('utf8')
encoding = chardet.detect(name)['encoding']
name = name.decode(encoding)
This code tries to treat the zip as old-style (CP437-encoded filenames, possibly broken); if that fails, it assumes the archive is new-style (UTF-8). After determining the proper encoding, you can extract files with code like:
from shutil import copyfileobj
fp = archive.open(original_name)
fp_out = open(name, 'wb')
copyfileobj(fp, fp_out)
In my case, this resolved the last 2% of failed files.
