Gensim - Trying to load a text file in gensim - python-3.x

I am trying to load a file in gensim with this line of code:
model = gensim.models.KeyedVectors.load_word2vec_format(r"C:/Users/dan/txt_sentoken/pos/cv000_29590.tx", binary=False)
However, I am getting this error:
ValueError: invalid literal for int() with base 10: 'films'
How do I solve this error?

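load_word2vec_format expects a file in the word2vec format, which must start with a header line containing the vocabulary size and the vector size, in that order. Your file appears to be raw text rather than trained vectors, so the loader tries to parse its first word, 'films', as an integer and fails.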
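To make the header requirement concrete, here is a minimal standard-library sketch of the check that fails. The function name read_word2vec_header and the sample file contents are made up for illustration; this is not gensim's actual code, only the same kind of parse applied to the first line.

```python
import os
import tempfile


def read_word2vec_header(path, encoding="utf-8"):
    """Parse the first line of a word2vec text file as (vocab_size, vector_size)."""
    with open(path, encoding=encoding) as handle:
        header = handle.readline()
    # Every token on the header line must be an integer, otherwise int() raises ValueError.
    vocab_size, vector_size = (int(token) for token in header.split())
    return vocab_size, vector_size


def write_temp_file(text):
    """Write text to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


# A tiny, made-up vectors file with a valid header: 2 words, 3 dimensions.
vectors_path = write_temp_file("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
# A made-up raw-text file standing in for a plain review document.
review_path = write_temp_file("films adapted from comic books\n")

header = read_word2vec_header(vectors_path)

try:
    read_word2vec_header(review_path)
    review_error = None
except ValueError as exc:
    # The first word of the text cannot be parsed as an integer.
    review_error = str(exc)

os.remove(vectors_path)
os.remove(review_path)
```

If the goal is to train vectors from raw text, the text has to go through a training step first; load_word2vec_format only reads vectors that already exist.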

Related

Operator translate error occurs when I try to convert onnx file to caffe2

I trained an object detection model in PyTorch and exported it to an ONNX file.
Now I want to convert it to a Caffe2 model:
import numpy as np
import onnx
import caffe2.python.onnx.backend as onnx_caffe2_backend
# Load the ONNX ModelProto object. model is a standard Python protobuf object.
model = onnx.load("CPU4export.onnx")
# Prepare the Caffe2 backend for executing the model. This converts the ONNX model into a
# Caffe2 NetDef that can execute it. Other ONNX backends, like one for CNTK, will be
# available soon.
prepared_backend = onnx_caffe2_backend.prepare(model)
# Run the model in Caffe2.
# Construct a map from input names to Tensor data.
# The graph of the model itself contains inputs for all weight parameters, after the input image.
# Since the weights are already embedded, we just need to pass the input image.
# Set the first input. x is assumed to be the input tensor from the export step (not shown).
W = {model.graph.input[0].name: x.data.numpy()}
# Run the Caffe2 net:
c2_out = prepared_backend.run(W)[0]
# Verify the numerical correctness up to 3 decimal places.
# torch_out is assumed to be the PyTorch model's output for x (not shown).
np.testing.assert_almost_equal(torch_out.data.cpu().numpy(), c2_out, decimal=3)
print("Exported model has been executed on Caffe2 backend, and the result looks good!")
I always get this error:
RuntimeError: ONNX conversion failed, encountered 1 errors:
Error while processing node: input: "90"
input: "91"
output: "92"
op_type: "Resize"
attribute {
name: "mode"
s: "nearest"
type: STRING
}
. Exception: Don't know how to translate op Resize
How can I solve it?
The problem is that the Caffe2 ONNX backend does not yet support the Resize operator, so it cannot translate that node of your model into a Caffe2 op.
Please raise an issue on the Caffe2 / PyTorch GitHub. There is an active community of developers who should be able to address this use case.
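Before filing an issue, it can help to list which operator types the exported graph actually contains, so the report can name every op that might be affected. The sketch below is a hypothetical helper (count_op_types is not part of onnx or Caffe2); it only assumes each node exposes an op_type attribute, as ONNX graph nodes do, and uses stand-in objects so it runs without a model file.

```python
from collections import Counter
from types import SimpleNamespace


def count_op_types(nodes):
    """Count how often each operator type occurs in a sequence of graph nodes."""
    return Counter(node.op_type for node in nodes)


# Stand-in nodes for illustration only; a real graph would come from onnx.load(...).
demo_nodes = [
    SimpleNamespace(op_type="Conv"),
    SimpleNamespace(op_type="Relu"),
    SimpleNamespace(op_type="Resize"),
    SimpleNamespace(op_type="Conv"),
]

op_counts = count_op_types(demo_nodes)
has_resize = "Resize" in op_counts
```

With the real model, the same helper could be called as count_op_types(onnx.load("CPU4export.onnx").graph.node).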

Loading pre-trained word embeddings

I am trying to load the pre-trained word2vec model using the command below, but I get a UnicodeDecodeError. I need some help getting to the bottom of it. I searched around but could not find a working solution.
python -m spacy init-model en /tmp/google_news_vectors --vectors-loc ~/Downloads/GoogleNews-vectors-negative300.bin.gz
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 7: invalid start byte
spaCy expects the vectors to be in the word2vec text format rather than the binary format:
https://spacy.io/api/cli#init-model
For how to convert the binary model, see: https://stackoverflow.com/a/33183634/461847
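As a rough illustration of what such a conversion involves, here is a standard-library sketch that reads the word2vec binary layout (a text header line, then for each word the word bytes, a space, and the vector as little-endian float32 values) and writes the text format. The function name word2vec_bin_to_text is made up for this example, and the layout is stated as an assumption about the standard word2vec binary format; it is not taken from the linked answer.

```python
import gzip
import struct


def word2vec_bin_to_text(bin_path, txt_path):
    """Convert a word2vec binary file (optionally gzipped) to the text format."""
    opener = gzip.open if bin_path.endswith(".gz") else open
    with opener(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        # Header line: "<vocab_size> <vector_size>\n"
        vocab_size, vector_size = map(int, src.readline().split())
        dst.write(f"{vocab_size} {vector_size}\n")
        for _ in range(vocab_size):
            # Read the word byte by byte until the separating space.
            word = bytearray()
            while True:
                ch = src.read(1)
                if ch == b" ":
                    break
                if ch == b"":
                    raise EOFError("unexpected end of file while reading a word")
                if ch != b"\n":  # skip the newline that may end the previous entry
                    word.extend(ch)
            # Then read vector_size little-endian float32 values.
            values = struct.unpack(f"<{vector_size}f", src.read(4 * vector_size))
            text_values = " ".join(repr(value) for value in values)
            dst.write(word.decode("utf-8", errors="replace") + " " + text_values + "\n")
    return vocab_size, vector_size
```

For a file as large as the GoogleNews vectors, the text output will be several times larger than the binary input, so check the available disk space first.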

Why does using Gensim with Glove continue to give a 'utf-8' UnicodeDecodeError?

I am trying to use Gensim with GloVe instead of word2vec. To make the GloVe file format compatible with Gensim, I am using the following lines of code:
import gensim
from gensim.scripts.glove2word2vec import glove2word2vec
glove_in = 'glove.840B.300d.txt'
word2vec_format_out = 'glove.840B.300d.txt.word2vec'
glove2word2vec(glove_in, word2vec_format_out)
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_format_out, encoding='utf-8', binary=True)
However, this last line of code gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 0: invalid start byte
I have tried opening the GloVe file first, writing it out as a CSV file, and then re-opening it with encoding='utf-8'. I also tried several other things mentioned here, but the error keeps coming back. Does anyone know a solution for this?
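This thread has no answer, so the following is only a likely explanation. glove2word2vec writes a plain-text file: it prepends a "vocab_size vector_size" header to the GloVe lines. Loading that output with binary=True makes the loader interpret text as packed floats, which can surface as a decode error; binary=False is the setting that matches the file. The sketch below reproduces the conversion with the standard library only, under that assumption. The function name glove_to_word2vec_text is made up for illustration.

```python
def glove_to_word2vec_text(glove_path, out_path, encoding="utf-8"):
    """Copy a GloVe text file, prepending the word2vec header line."""
    # First pass: count the lines and infer the vector size from the first line.
    with open(glove_path, encoding=encoding) as src:
        first_line = src.readline()
        if not first_line:
            raise ValueError("GloVe file is empty")
        vector_size = len(first_line.split()) - 1
        vocab_size = 1 + sum(1 for _ in src)
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding=encoding) as src, \
            open(out_path, "w", encoding=encoding) as dst:
        dst.write(f"{vocab_size} {vector_size}\n")
        for line in src:
            dst.write(line)
    return vocab_size, vector_size
```

Because the result is text, the matching gensim call would pass binary=False to load_word2vec_format.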

DataLossError when running pretrained model inception_v4

I am trying to run vectorize_pretrained.py with the InceptionV4 model type from Oliver Edholm's Image-Retrieval repository, see https://github.com/OliverEdholm/Image-Retrieval
I get the following error:
DataLossError (see above for traceback): Unable to open table file ./embedding/extraction/inception_v4.py: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
     [[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
When I run the same file with the VGG16 model type, I get the same error.
Does anyone have a suggestion?

Word2vec saved model is not UTF-8 encoded but the sentence input to the Word2vec model is UTF-8 encoded

I trained a word2vec model using the gensim package and saved it under the following name.
model_name = "300features_1minwords_10context"
model.save(model_name)
I got these log messages while the model was being trained and saved.
INFO : not storing attribute syn0norm
INFO : not storing attribute cum_table
Then, I tried to load the model using this,
from gensim.models import Word2Vec
model = Word2Vec.load("300features_1minwords_10context")
I got the following error.
2017-06-22 21:27:14,975 : INFO : loading Word2Vec object from 300features_1minwords_10context
2017-06-22 21:27:15,496 : INFO : loading wv recursively from 300features_1minwords_10context.wv.* with mmap=None
2017-06-22 21:27:15,497 : INFO : setting ignored attribute syn0norm to None
2017-06-22 21:27:15,498 : INFO : setting ignored attribute cum_table to None
2017-06-22 21:27:15,499 : INFO : loaded 300features_1minwords_10context
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-9d90db0f07c0> in <module>()
1 from gensim.models import Word2Vec
2 model = Word2Vec.load("300features_1minwords_10context")
----> 3 model.syn0.shape
AttributeError: 'Word2Vec' object has no attribute 'syn0'
Also, opening the file "300features_1minwords_10context" shows the following message:
"300features_1minwords_10context" is not UTF-8 encoded
Saving disabled.
Open console for more details
To fix the above attribute error, I also tried the following, taken from the Google forum:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format("300features_1minwords_10context")
model.syn0.shape
It resulted in another error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The model was trained with UTF-8 encoded sentences. I am not sure why it is throwing this error.
More info:
df = pd.read_csv('UNSPSCdataset.csv',encoding='mac_roman',low_memory=False)
features = ['MaterialDescription']
temp_features = df[features]
temp_features.to_csv('materialDescription', encoding='UTF-8')
X = pd.read_csv('materialDescription',encoding='UTF-8')
Here, I had to use the 'mac_roman' encoding in order to read the file into a pandas dataframe. Since the text has to be in UTF-8 when training the model, I saved that particular feature to a separate CSV file encoded as UTF-8 and later read that column back in.
Any help is appreciated.
Are you using the latest gensim? If not, be sure to try it – there have sometimes been save()/load() bugs in older versions.
The INFO "not storing" log lines are normal – they're not indicative of any problem (and thus could be deleted from your question.)
Are you getting the "has no attribute" error directly upon the load()? (A full error stack here would be useful, and clarify this.)
UPDATE: From the now-shown error-stack, the error is not occurring in the load() line, but on the following line, when you attempt to access model.syn0.shape. Recent versions of gensim no longer have a syn0 as a property of Word2Vec class objects – that info is moved to a constituent KeyedVectors object, in the wv property. So model.wv.syn0.shape is likely to access what you're seeking, without an error.
When your model is largish, save() can generate multiple files on the side, with extra extensions, for the model's large array properties (like syn0). These files must be kept alongside the main filename for the model to be re-load()ed. Is it possible you've moved the 300features_1minwords_10context file, but not any such accompanying files, to a new location where the load() is then incomplete?
You can't load_word2vec_format() a file that was native-gensim save()d. They're different formats entirely, so the encoding error is just an artifact of trying to read a binary Python pickle file (from save()) as if it were a word2vec-format file.
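The "byte 0x80 in position 0" detail is consistent with that explanation. A small standard-library sketch, which does not use gensim and pickles a made-up dictionary instead of a real model, shows that pickle data (protocol 2 and later) starts with the byte 0x80 and that decoding it as UTF-8 fails at position 0:

```python
import pickle

# Pickle a small stand-in object; a saved gensim model is a much larger pickle.
payload = pickle.dumps({"vectors": [0.1, 0.2]}, protocol=pickle.HIGHEST_PROTOCOL)
first_byte = payload[:1]  # pickle protocols 2 and later begin with this opcode byte


def utf8_decode_error(data):
    """Return the UnicodeDecodeError raised when decoding data as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


error = utf8_decode_error(payload)
```

So a file written by save() should be read back with load(), and load_word2vec_format() should be reserved for files written in the word2vec format, for example by save_word2vec_format().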
