I downloaded Wikipedia word vectors from here. I loaded the vectors with:
model_160 = KeyedVectors.load_word2vec_format(wiki_160_path, binary=False)
and then want to train them with:
model_160.train()
I get the error back:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-22a9f6312119> in <module>()
----> 1 model.train()
AttributeError: 'KeyedVectors' object has no attribute 'train'
My question now is:
It seems KeyedVectors has no train function, but I want to continue training the vectors on my own sentences instead of just using the Wikipedia vectors. How is this possible?
Thanks in advance, Jan
You can't use KeyedVectors for that.
From the documentation:
Word vector storage and similarity look-ups.
The word vectors are considered read-only in this class.
And also:
The word vectors can also be instantiated from an existing file on disk in the word2vec C format as a KeyedVectors instance.
[...]
NOTE: It is impossible to continue training the vectors loaded from the C format because hidden weights, vocabulary frequency and the binary tree is missing.
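For contrast, here is a minimal sketch of what a trainable model looks like: a full Word2Vec model built from your own sentences. my_sentences is a hypothetical tokenized corpus, and the parameter names follow gensim 4.x (where size became vector_size):

from gensim.models import Word2Vec

my_sentences = [["my", "first", "sentence"], ["another", "one"]]  # your own tokenized corpus

# A full Word2Vec model keeps the hidden weights and vocabulary statistics
# that KeyedVectors discards, so train() is available.
model = Word2Vec(vector_size=160, min_count=1)
model.build_vocab(my_sentences)
model.train(my_sentences, total_examples=model.corpus_count, epochs=5)

If you specifically want to start from the pretrained Wikipedia vectors rather than from scratch, look into gensim's intersect_word2vec_format, but note that its location and behaviour vary between gensim versions.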
I have a pretrained fastText model. I have loaded it into my notebook and want to fit it to my free-form text to train an ML classifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import FastText
import pickle
import numpy as np
from numpy.linalg import norm
from gensim.utils import tokenize
model_2 = FastText.load(model_path + 'itsm_fasttext_embeddings_100_dim.model')

tokens = list()

def get_column_vector(model, list_corpus):
    for i in list_corpus:
        svec = np.zeros(100)
        tok_sent = list(tokenize(i))
        count = 0
        for word in tok_sent:
            vec = model.wv[word]
            norm_vec = norm(vec)
            if norm_vec > 0:
                vec = np.multiply(vec, (1 / norm_vec))
                svec = np.add(svec, vec)
                count += 1
        if count > 0:
            averaged_vec = np.multiply(svec, (1 / count))
            tokens.append(averaged_vec)
    return tokens

list_corpus = df["freeformtext_col"].tolist()

# lst = array of vectors for each row of free form text
lst = get_column_vector(model_2, list_corpus)

x_text_train, x_text_test, y_train, y_test = train_test_split(lst, y, test_size=0.2, random_state=42)

model_2.fit(x_text_train, y_train, validation_split=0.1, shuffle=True)
I get the error of
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [59], in <cell line: 1>()
----> 1 model_2.fit(x_text_train, y_train, validation_split=0.1,
shuffle=True)
AttributeError: 'FastText' object has no attribute 'fit'
Other documentation showing the initial training of fastText has a fit function.
I am having trouble finding documentation from others who have taken a pre-trained fastText gensim model and fit it to their text data to ultimately use a classifier.
The Gensim FastText implementation offers no .fit() method. (I also don't see any such method in Facebook's Python wrapper of its original C++ FastText implementation. Even in its supervised-classification mode, it has its own train_supervised() method rather than a scikit-learn-style fit() method.)
If you saw some online example using such a method, it must have been using a different FastText implementation - so you should consult the full details of that other example to see which library they were using.
I don't know of any good online examples showing how to 'fine-tune' a pretrained FastText model to a smaller set of new texts, much less any demonstrating benefits, gotchas, & rules-of-thumb for performing such an operation.
If you did see an online example suggesting such an approach, & demonstrating some benefits over other less-complicated approaches, then that source-of-inspiration would also be the model to follow - or to mention/link when trying to debug their approach. Without someone's full working examples as a guide/template, you're in improvised-innovation mode.
Note you don't have to start with someone else's pre-trained model. You can train your own FastText models with your own training texts – and for many domains, & tasks, this could work better than a generic model trained from public sources like Wikipedia texts or large web crawls.
And when you do, you have the option of simply using FastText in its base unsupervised mode – as a way to featurize text – then pass those FastText-modeled features to some other explicit classifier option (such as the many classifiers in scikit-learn with .fit() methods).
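For instance, here's a minimal sketch of that featurize-then-classify route, using gensim's unsupervised FastText to average word-vectors per text and a scikit-learn classifier for the .fit() step. The texts and labels lists are hypothetical placeholders and the hyperparameters are arbitrary (gensim 4.x parameter names):

import numpy as np
from gensim.models import FastText
from gensim.utils import tokenize
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for your free-form text column and the classes to predict.
texts = ["server is down again", "please reset my password"]
labels = ["outage", "account"]

def average_vector(model, text):
    # Average the FastText vectors of the tokens in one text.
    words = list(tokenize(text))
    vecs = [model.wv[w] for w in words]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Unsupervised FastText used purely as a featurizer.
ft = FastText(sentences=[list(tokenize(t)) for t in texts], vector_size=100, min_count=1, epochs=10)

X = np.vstack([average_vector(ft, t) for t in texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)   # any sklearn classifier with .fit() works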
FastText's own -supervised mode builds a different kind of model that combines the word-training with the classification-training. A general FastText language model you find online is unlikely to be a specific -supervised mode model, unless it is explicitly declared to be one. If it's a standard unsupervised model, there's no straightforward way to adapt it into a -supervised model. And if it is already a -supervised model, it will have already been trained for someone else's fixed set of known-labels.
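For reference, that -supervised mode lives in Facebook's separate fasttext package, not in Gensim. A minimal hedged sketch, where the training-file name is an assumption and each line follows that library's documented "__label__" convention:

import fasttext

# train.txt: one example per line, formatted as "__label__<class> <text>",
# e.g. "__label__account please reset my password"
clf = fasttext.train_supervised(input="train.txt", epoch=10)

print(clf.predict("please reset my password"))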
I'm trying to use fasttext word embeddings as input for a SVM for a text classification task. I averaged the word vectors over each sentence, and for each sentence I want to predict a certain class. But, when I simply try to use the vectors as input for the SVM, I get the following error:
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
*some traceback stuff*
ValueError: setting an array element with a sequence.
I suspect I have to convert the word embedding vectors into some other kind of format, but I'm not really sure what that would have to be. I find the documentation on sklearn confusing.
Does anyone know how to use the fasttext embedding vectors as input for a SVM?
Thanks in advance. If there is anything you need to know, let me know.
CptBaas
They have to be a list.
In Python that would be:
list(your_data)
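For example, assuming sentence_vectors is your list of averaged fastText vectors (one 1-D array per sentence, all the same length) and labels holds the class per sentence (both hypothetical names), a minimal sketch would be:

from sklearn.svm import SVC

# sentence_vectors: list of equal-length 1-D numpy arrays, one per sentence (assumed name)
# labels: the class to predict for each sentence (assumed name)
clf = SVC()
clf.fit(list(sentence_vectors), labels)

print(clf.predict(list(sentence_vectors[:2])))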
I'm facing a Gensim training problem using Word2Vec.
model.wv.vocab is not getting any further words from the trained corpus;
the only words in it are the ones from the initialization instruction!
In fact, after many tries with my own code, even the official site's example didn't work!
I tried saving the model at many spots in my code.
I even tried saving and reloading the corpus alongside the train instruction.
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
print(len(model.wv.vocab))
model.train([["hello", "world"]], total_examples=1, epochs=1)
model.save("word2vec.model")
print(len(model.wv.vocab))
The first print statement gives 12, which is right.
The second also gives 12, when it's supposed to give 14 (the original vocabulary plus 'hello' and 'world').
Additional calls to train() don't expand the known vocabulary. So, there is no way that the value of len(model.wv.vocab) will change after another call to train(). (Either 'hello' and 'world' are already known to the model, in which case they were in the original count of 12, or they weren't known, in which case they were ignored.)
The vocabulary is only established during a specific build_vocab() phase, which happens automatically if, as your code shows, you supplied a training corpus (common_texts) in model instantiation.
You can use a call to build_vocab() with the optional added parameter update=True to incrementally update a model's vocabulary, but this is best considered an advanced/experimental technique that introduces added complexities. (Whether such vocab-expansion, and then followup incremental training, actually helps or hurts will depend on getting a lot of other murky choices about alpha, epochs, corpus-sizing, training modes, and corpus-contents correct.)
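For completeness, a minimal sketch of that build_vocab(update=True) route, continuing the question's gensim 3.x snippet (whether it actually helps is subject to the caveats above):

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
print(len(model.wv.vocab))   # 12

new_sentences = [["hello", "world"]]
model.build_vocab(new_sentences, update=True)            # expand the vocabulary first
model.train(new_sentences, total_examples=1, epochs=1)   # then do the incremental training
print(len(model.wv.vocab))   # now 14: the original 12 plus 'hello' and 'world'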
I'm new to the machine-learning domain, and I have some doubts about linear regression.
1: While practicing prediction with the sklearn linear regression model, I get the error below.
Code:
sklearn.linear_model.LinearRegression.predict(25)
Error:
"ValueError: Expected 2D array, got scalar array instead: array=25. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Do I need to pass a 2-D array? I checked the sklearn documentation page and haven't found anything about a version update.
Running my code on Kaggle:
https://www.kaggle.com/aman9d/bikesharingdemand-upx/
2: Is the index of the dataset going to affect the model's score (weights)?
First of all, you should post your code the way you actually use it:
# import, instantiate, fit
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
# use the predict method
linreg.predict(25)
What you posted in the question is not properly executable: the predict method is not static for the LinearRegression class, so it has to be called on a fitted instance.
When you fit a model, the first step is to recognize what kind of data the input will be; in your case it will be something shaped like X. That means that if you pass something with a different shape than X to the model, it will raise an error.
In your example X seems to be a pd.DataFrame() instance with only one column. It is interchangeable with a 2-dimensional array of shape (number of examples, number of features), so if you try:
linreg.predict([[25]])
should work.
For example, if you were trying a regression with more than one feature (i.e. column), say temp and humidity, your input would look like this:
linreg.predict([[25, 56]])
I hope this helps you; always keep in mind the shape of your data.
Documentation: LinearRegression fit
X : array-like or sparse matrix, shape (n_samples, n_features)
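Putting it together, a small self-contained sketch with hypothetical data (the numbers are made up; only the shapes matter):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature (e.g. temperature) and one target (e.g. rental count).
X = np.array([[20], [22], [25], [30]])   # shape (n_samples, n_features)
y = np.array([100, 120, 150, 200])

linreg = LinearRegression()
linreg.fit(X, y)

print(linreg.predict([[25]]))                         # one sample with one feature
print(linreg.predict(np.array([25]).reshape(-1, 1)))  # the reshape suggested by the error message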
In order to run an NB classifier on about 400 MB of text data, I need to use a vectorizer.
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(X_data)
But it is giving an out-of-memory error. I am using Linux64 and a 64-bit Python version. How do people work through the vectorization process in scikit-learn for large text datasets?
Traceback (most recent call last):
File "ParseData.py", line 234, in <module>
main()
File "ParseData.py", line 211, in main
classifier = MultinomialNB().fit(X_train, y_train)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 313, in fit
Y = labelbin.fit_transform(y)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
neg_label=self.neg_label)
File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
Y = np.zeros((len(y), len(classes)), dtype=np.int)
Edited (ogrisel): I changed the title from "Out of Memory Error in Scikit Vectorizer" to "Out of Memory Error in Scikit-learn MultinomialNB" to make it more descriptive of the actual problem.
Let me summarize the outcome of the discussion in the comments:
The label preprocessing machinery used internally in many scikit-learn classifiers does not scale well memory wise w.r.t. the number of classes. This is a known issue and there is ongoing work to tackle it.
The MultinomialNB class itself will probably not be suitable to classify in a label space with cardinality 43K even if the label preprocessing limitation is fixed.
To address the large cardinality classification problem you could try:
fit binary SGDClassifier(loss='log', penalty='elasticnet') instances independently on the columns of y_train converted to numpy arrays, then call clf.sparsify() and finally wrap those sparse models as a final one-vs-rest classifier (or rank the predictions of the binary classifiers by proba); see the sketch after these options. Depending on the value of the regularizer parameter alpha you might get sparse models that are small enough to fit in memory. You can also try to do the same with LogisticRegression, that is something like:
clf_label_i = LogisticRegression(penalty='l1').fit(X_train, y_train[:, label_i].toarray()).sparsify()
alternatively try to do a PCA of the target labels y_train, then cast your classification problem as a multi-output regression problem in the reduced label PCA space, and then decode the regressor's output by looking for the nearest class encoding in the label PCA space.
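Here is a minimal sketch of the first option (per-label SGDClassifier with sparsify()), assuming y_train is a scipy sparse label-indicator matrix as suggested in the question; the alpha value is arbitrary:

import numpy as np
from sklearn.linear_model import SGDClassifier

label_models = []
for label_i in range(y_train.shape[1]):
    # One binary classifier per label column, trained independently.
    target = np.asarray(y_train[:, label_i].todense()).ravel()
    clf = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-4)
    clf.fit(X_train, target)
    label_models.append(clf.sparsify())   # store the coefficients as a sparse matrix

def predict_top_label(x):
    # Rank labels by each binary model's positive-class probability.
    scores = [m.predict_proba(x)[0, 1] for m in label_models]
    return int(np.argmax(scores))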
You can also have a look at
Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification, implemented in lightning, but I am not sure it is suitable for label cardinality 43K either.