Loading pre-trained word embeddings - python-3.x

I am trying to load the pre-trained word2vec model using the command below, but I get a Unicode error. I need some help getting to the bottom of it. I googled around but could not find a working solution.
python -m spacy init-model en /tmp/google_news_vectors --vectors-loc ~/Downloads/GoogleNews-vectors-negative300.bin.gz
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 7: invalid start byte

spaCy expects the vectors to be in text format rather than binary format:
https://spacy.io/api/cli#init-model
For how to convert the binary model, see: https://stackoverflow.com/a/33183634/461847
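If you want to keep using the GoogleNews vectors, a minimal sketch of that conversion using gensim (assuming gensim is installed; the file names are the ones from the question) could look like this:

from gensim.models import KeyedVectors

# Load the binary GoogleNews vectors (this needs several GB of RAM).
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# Write them back out in the plain-text word2vec format.
vectors.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)

The resulting .txt file (optionally gzipped) can then be passed to --vectors-loc.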

Related

How to import a pkl file saved in Python 2 into Python 3

I am running some demo code from vehicle-prediction. For the models in the models directory, I can load model_2000_car_100_iter_v.pkl in Python 2, but my integration environment is Python 3. When I run the code to load the model in Python 3 using joblib.load(), these errors are raised:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1024: ordinal not in range(128)
and
raise new_exc
ValueError: You may be trying to read with python 3 a joblib pickle generated with python 2. This is not feature supported by joblib.
I tried to figure this out by referring to pickle, since joblib relies on pickle. The pickle docs say encoding='latin1' can avoid this issue, but that did not work for me. I also tried the 'iso-8859-1' encoding, which failed as well.
import pickle
picklefile=open('model_2000_car_100_iter_v.pkl','rb')
data=pickle.load(picklefile,encoding='iso-8859-1') # or 'latin1'
I can see that pickle allows a user to save a model in Python 3 and load it in Python 2 by using the protocol parameter, but how can I do this in reverse?
Is there a way to load the model in different python version using joblib?
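No accepted answer is shown here, but for completeness, a hedged sketch of one more option from the pickle docs that the question does not try, encoding='bytes'; whether it works depends on what objects are inside the pickle:

import pickle

# Leave Python 2 8-bit strings as bytes instead of decoding them to str.
with open('model_2000_car_100_iter_v.pkl', 'rb') as f:
    data = pickle.load(f, encoding='bytes')

If that also fails, the usual remaining route is to go back to Python 2, load the model there, and re-export the underlying data in a version-neutral format (for example NumPy .npy files) before rebuilding the model in Python 3.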

Why does using Gensim with Glove continue to give a 'utf-8' UnicodeDecodeError?

I am trying to use Gensim with GloVe instead of word2vec. To make the GloVe file compatible with Gensim so I can use it, I am running the following lines of code:
import gensim
from gensim.scripts.glove2word2vec import glove2word2vec
glove_in = 'glove.840B.300d.txt'
word2vec_format_out = 'glove.840B.300d.txt.word2vec'
glove2word2vec(glove_in, word2vec_format_out)
model = gensim.models.KeyedVectors.load_word2vec_format(
    word2vec_format_out, encoding='utf-8', binary=True)
However, this last line of code gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 0: invalid start byte
I have tried opening the GloVe file first, writing it out as a CSV file, and re-opening it with encoding='utf-8' specified. I also tried several other things mentioned here, but the error keeps coming back. Does anyone know a solution for this?
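One detail that stands out in the snippet above: glove2word2vec only prepends a header line, so its output is still a plain-text file, and loading it with binary=True could itself produce a decode error. A hedged sketch with binary=False (same file names as in the question; this is a guess at the cause, not a confirmed fix):

import gensim
from gensim.scripts.glove2word2vec import glove2word2vec

glove_in = 'glove.840B.300d.txt'
word2vec_format_out = 'glove.840B.300d.txt.word2vec'

# Prepend the "<vocab size> <vector size>" header; the output stays plain text.
glove2word2vec(glove_in, word2vec_format_out)

# Load as text, not binary.
model = gensim.models.KeyedVectors.load_word2vec_format(
    word2vec_format_out, binary=False)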

Decoding/Encoding using sklearn load_files

I'm following the tutorial here
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb
to learn about machine learning and text.
In my case, I'm using tweets I downloaded, with positive and negative tweets in the exact same directory structure they are using (trying to learn sentiment analysis).
Here in the iPython Notebook I load my data just like they do:
tweets_train = load_files('Path to my training Tweets')
And then I try to fit them with CountVectorizer
vect = CountVectorizer().fit(text_train)
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte
Is this because my tweets have all sorts of non-standard text in them? I didn't do any cleanup of my tweets (I assume there are libraries that help with that in order to make a bag of words work?).
EDIT:
The code I use with Twython to download tweets:
def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user, count=200,
                                                  include_retweets=False,
                                                  max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet id's
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()
You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding.
If this means nothing to you, make sure you understand the basics of Unicode and text encodings, e.g. with the official Python Unicode HOWTO.
First, you need to find out what encoding was used to store the tweets on disk.
When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. Check this, for example, in an interactive session:
>>> f = open('/tmp/foo', 'a')
>>> f
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
Here you can see that in my local environment the default encoding is set to UTF-8. You can also directly inspect the default encoding with
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
There are other ways to find out what encoding was used for the files.
For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
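A cross-platform alternative is the third-party chardet package, which guesses the encoding from a sample of raw bytes; a small sketch (the path is a placeholder for one of your tweet files):

import chardet  # pip install chardet

with open('path/to/one/tweet/file', 'rb') as f:
    raw = f.read(100000)  # a sample of the raw bytes is enough

print(chardet.detect(raw))
# e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}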
Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:
tweets_train = load_files('path to tweets', encoding='latin-1')
... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.

Gensim - Trying to load a text file in gensim

Trying to load a file in gensim with this line of code:
model = gensim.models.KeyedVectors.load_word2vec_format(r"C:/Users/dan/txt_sentoken/pos/cv000_29590.tx", binary=False)
However, I am getting this error:
ValueError: invalid literal for int() with base 10:'films'
How do I solve this error?
A word2vec-format file needs to start with a line containing the vocabulary size and the vector size, in that order.
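For illustration, a toy sketch of what a valid word2vec text file looks like and how it loads (the file name and vectors are made up):

from gensim.models import KeyedVectors

# The first line is the required header: "<vocab size> <vector size>".
with open('toy_vectors.txt', 'w', encoding='utf-8') as f:
    f.write('2 3\n')
    f.write('films 0.1 0.2 0.3\n')
    f.write('movie 0.4 0.5 0.6\n')

vectors = KeyedVectors.load_word2vec_format('toy_vectors.txt', binary=False)
print(vectors['films'])

The file in the question appears to be a plain movie-review text file with no such header, which is why gensim fails when it tries to parse the first token ('films') as an integer.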

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5

I am trying to read a file using the following code.
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'))
features = precomputed['features']
But I get this error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 5: ordinal not in range(128)
The file I am trying to read contains image features that were extracted using deep neural networks. The file content looks like the following.
(dp0
S'imageIds'
p1
(lp2
I262145
aI131074
aI131075
aI393221
aI393223
aI393224
aI524297
aI393227
aI393228
aI262146
aI393230
aI262159
aI524291
aI322975
aI131093
aI524311
....
....
....
Please note that this is a big file, about 2.8 GB in size.
I know this looks like a duplicate question, but I followed the solutions suggested in other Stack Overflow posts and could not solve it. Any help would be appreciated!
Finally I found the solution. The problem was actually about unpickling a Python 2 object with Python 3, which I didn't realize at first because the pickle file I received was written by a Python 2 program.
Thanks to this answer, which solved the problem. All I needed to do was set the encoding parameter of the pickle.load() function to 'latin1', because latin1 works for any input: it maps the byte values 0-255 directly onto the first 256 Unicode code points.
So, the following worked for me!
precomputed = pickle.load(open('test/vgg16_features.p', 'rb'), encoding='latin1')
