Fitting a Gensim Fasttext pretrained model to my text - nlp

I have a pretrained fastText model. I have loaded it into my notebook and want to fit it to my free-form text so that I can train an ML classifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import FastText
import pickle
import numpy as np
from numpy.linalg import norm
from gensim.utils import tokenize
model_2 = FastText.load(model_path + 'itsm_fasttext_embeddings_100_dim.model')
tokens = list()
def get_column_vector(model, list_corpus):
    for i in list_corpus:
        svec = np.zeros(100)
        tok_sent = list(tokenize(i))
        count = 0
        for word in tok_sent:
            vec = model.wv[word]
            norm_vec = norm(vec)
            if (norm_vec > 0):
                vec = np.multiply(vec, (1/norm_vec))
            svec = np.add(svec, vec)
            count += 1
        if (count > 0):
            averaged_vec = np.multiply(svec, (1/count))
        tokens.append(averaged_vec)
    return tokens
list_corpus = df["freeformtext_col"].tolist()
# lst = array of vectors for each row of free form text
lst = get_column_vector(model, list_corpus)
x_text_train, x_text_test, y_train, y_test = train_test_split(lst, y, test_size=0.2, random_state=42)
model_2.fit(x_text_train, y_train, validation_split=0.1, shuffle=True)
I get the error of
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [59], in <cell line: 1>()
----> 1 model_2.fit(x_text_train, y_train, validation_split=0.1,
shuffle=True)
AttributeError: 'FastText' object has no attribute 'fit'
Other documentation showing the initial training of fastText uses a fit function.
I am having trouble finding documentation from others who have taken a pretrained Gensim FastText model and fit it to their own text data in order to ultimately train a classifier.

The Gensim FastText implementation offers no .fit() method. (I also don't see any such method in Facebook's Python wrapper of its original C++ FastText implementation. Even in its supervised-classification mode, it has its own train_supervised() method rather than a scikit-learn-style fit() method.)
If you saw some online example using such a method, it must have been using a different FastText implementation - so you should consult the full details of that other example to see which library they were using.
I don't know of any good online examples showing how to 'fine-tune' a pretrained FastText model to a smaller set of new texts, much less any demonstrating benefits, gotchas, & rules-of-thumb for performing such an operation.
If you did see an online example suggesting such an approach, & demonstrating some benefits over other less-complicated approaches, then that source-of-inspiration would also be the model to follow - or to mention/link when trying to debug their approach. Without someone's full working examples as a guide/template, you're in improvised-innovation mode.
Note you don't have to start with someone else's pre-trained model. You can train your own FastText models with your own training texts – and for many domains, & tasks, this could work better than a generic model trained from public sources like Wikipedia texts or large web crawls.
And when you do, you have the option of simply using FastText in its base unsupervised mode – as a way to featurize text – then passing those FastText-modeled features to some other explicit classifier option (such as the many classifiers in scikit-learn with .fit() methods).
FastText's own -supervised mode builds a different kind of model that combines the word-training with the classification-training. A general FastText language model you find online is unlikely to be a specific -supervised mode model, unless it is explicitly declared to be one. If it's a standard unsupervised model, there's no straightforward way to adapt it into a -supervised model. And if it is already a -supervised model, it will have already been trained for someone else's fixed set of known-labels.
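The featurize-then-classify route described above can be sketched roughly as follows. This is only an illustration under assumptions: average_vector is a hypothetical helper, and toy_wv is a tiny stand-in dictionary used so the snippet runs on its own. With a loaded Gensim model you would pass model_2.wv instead, which also synthesizes vectors for unseen words from character n-grams, something the plain dictionary here cannot do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def average_vector(wv, tokens, dim):
    """Mean of unit-normalised word vectors; a zero vector for an empty text."""
    total = np.zeros(dim)
    count = 0
    for word in tokens:
        vec = np.asarray(wv[word], dtype=float)
        length = np.linalg.norm(vec)
        if length > 0:
            vec = vec / length
        total += vec
        count += 1
    return total / count if count else total


# Hypothetical stand-in for `model_2.wv`: maps each word to a 3-dimensional vector.
toy_wv = {
    "printer": np.array([1.0, 0.0, 0.0]),
    "jammed": np.array([0.9, 0.1, 0.0]),
    "password": np.array([0.0, 1.0, 0.0]),
    "reset": np.array([0.1, 0.9, 0.0]),
}

texts = ["printer jammed", "password reset", "jammed printer", "reset password"]
labels = [0, 1, 0, 1]

# One fixed-size feature row per text, then an ordinary scikit-learn classifier.
X = np.vstack([average_vector(toy_wv, text.split(), dim=3) for text in texts])
clf = LogisticRegression().fit(X, labels)
predictions = clf.predict(X)
```

The .fit() call belongs to the scikit-learn classifier, not to the FastText model, which is only used to look up vectors.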

Related

is it possible to set the splitting strategy for GridSearchCv?

I'm optimizing a model's hyperparameters with GridSearchCV. Because the data I'm working with is very imbalanced, I need to choose how the algorithm splits the train/test sets, to ensure that the underrepresented points are in both sets.
By reading scikit-learn's documentation, I have the idea that it's possible to set the splitting strategy for GridSearch but I'm not sure how or if this is the case.
I would be very grateful if someone could help me with this.
Yes: pass a StratifiedKFold object to GridSearchCV as its cv argument.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
By default, when you tune a classifier with GridSearchCV and cv is an integer or None, the splitter is already StratifiedKFold, which preserves the class proportions of the target variable in every fold.
If your dataset is imbalanced in some other respect (not the target variable), you can choose another criterion for the split. Read the GridSearchCV documentation carefully and select an appropriate CV splitter.
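To check what stratification does on imbalanced labels, here is a small sketch. The 90/10 label array and the all-zero feature matrix are made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features are irrelevant to how the split is made

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority-class samples land in each test fold.
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_per_fold)
```

Each of the five test folds receives the same share of the minority class, so the underrepresented points appear on both sides of every split.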
In the scikit-learn documentation of model selection, there are many Splitter Classes that you could use. Or you can define your own splitter class according to your criteria, but it would be more difficult.
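If none of the built-in splitters fits, one lighter alternative to writing a splitter class is to pass cv an iterable of (train indices, test indices) pairs. A minimal sketch, in which the groups array and the leave_one_group_out helper are hypothetical and only stand in for whatever criterion you actually need:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical criterion: each sample belongs to one of three made-up groups.
groups = np.arange(len(y)) % 3


def leave_one_group_out(groups):
    """Yield (train_indices, test_indices), holding out one group at a time."""
    for g in np.unique(groups):
        test_idx = np.where(groups == g)[0]
        train_idx = np.where(groups != g)[0]
        yield train_idx, test_idx


custom_splits = list(leave_one_group_out(groups))
search = GridSearchCV(SVC(), {"C": [1, 10]}, cv=custom_splits)
search.fit(X, y)
print(search.best_params_)
```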

Streaming corpus to a vectorizer in a pipeline

I have a large language corpus and I use sklearn tfidf vectorizer and gensim Doc2Vec to compute language models. My total corpus has about 100,000 documents and I realized that my Jupyter notebook stops computing once I cross a certain threshold. I guess that the memory is full after applying the grid-search and cross-validation steps.
Even the following example script stops for Doc2Vec at some point:
%%time
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=1)
model_names = [
    'TfidfVectorizer',
    'Doc2Vec_PVDM',
]
models = [
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None, min_df = 5),
    D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=1),
]
parameters = [
    {
        'model__smooth_idf': (True, False),
        'model__norm': ('l1', 'l2', None)
    },
    {
        'model__size': [200],
        'model__window': [4]
    }
]
for params, model, name in zip(parameters, models, model_names):
    pipeline = Pipeline([
        ('model', model),
        ('clf', LogisticRegression())
    ])
    grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)
    cval = cross_val_score(grid.best_estimator_, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1)
    print("Cross-Validation (Train):", np.mean(cval))
print("Finished.")
Is there a way to "stream" each line in a document, instead of loading the full data into memory? Or another way to make it more memory efficient? I read a few articles on the topic but could not discover any that included a pipeline example.
With just 100,000 documents, unless they're gigantic, it's not necessarily the loading-of-data into memory that's causing you problems. Note especially:
- loading & tokenizing the docs has already succeeded before you even begin the scikit-learn pipelines/grid-search, and the further multiplication of memory usage is in the necessarily-repeated alternate models, not the original docs
- scikit-learn APIs tend to assume the training data is fully in memory – so even though the innermost gensim classes (Doc2Vec) are happy with streamed data of arbitrary size, it's harder to adapt that into scikit-learn
So you should look elsewhere, and there are other issues with your shown code.
I've often had memory or lockup issues with scikit-learn's attempts at parallelism (as enabled through n_jobs-like parameters), especially inside Jupyter notebooks. It forks full OS processes, which tend to blow up memory usage. (Each sub-process gets a full copy of the parent process's memory, which might be efficiently shared – until the subprocess starts moving/changing things.) Sometimes one process, or inter-process communication, fails and the main process is just left waiting for a response – which seems to especially confuse Jupyter notebooks.
So, unless you have tons of memory and absolutely need scikit-learn parallelism, I'd recommend trying to get things working with n_jobs=1 first – and only later experimenting with more jobs.
In contrast, the workers of the Doc2Vec class (and D2VTransformer) use lighter-weight threads, and you should use at least workers=3, and perhaps 8 (if you have at least that many cores), rather than the workers=1 you're using now.
But also: you're doing a bunch of redundant actions of unclear value in your code. The test set from initial train-test split isn't ever used. (Perhaps you were thinking of keeping it aside as a final validation set? That's the most rigorous way to get a good estimate of your final result's performance on future unseen data, but in many contexts data is limited, and that estimate isn't as important as just doing the best possible with limited data.)
The GridSearchCV itself does a 5-way train/test split as part of its work, and its best results are remembered in its properties when it's done.
So you don't need to do the cross_val_score() again - you can read the results from GridSearchCV.
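As a rough illustration of reading the scores back instead of re-running cross-validation, here is a sketch on the iris data with a made-up pipeline and parameter grid (not the Doc2Vec setup from the question), using n_jobs=1 as suggested above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid only; n_jobs=1 keeps everything in a single process.
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                    scoring="accuracy", cv=5, n_jobs=1)
grid.fit(X, y)

# Mean cross-validated accuracy of the best candidate, already computed by the search.
best_mean = grid.best_score_
# Mean cross-validated accuracy of every candidate, in grid order.
per_candidate = grid.cv_results_["mean_test_score"]
print(grid.best_params_, best_mean)
```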

Load and use saved Keras model.h5

I am trying to save a KerasClassifier (wrapper) to final_model.h5:
validator = GridSearchCV(estimator=clf, param_grid=param_grid)
grid_result = validator.fit(train_images, train_labels)
best_estimator = grid_result.best_estimator_
best_estimator.model.save("final_model.h5")
And then I want to reuse the model
from keras.models import load_model
loaded_model = load_model("final_model.h5")
But it seems that loaded_model is now a Sequential object. In other words, it is not a KerasClassifier object like best_estimator.
I want to reuse methods such as score, which is available on KerasClassifier but not on a Sequential model. What should I do?
Also, I would like to know how to continue the training process where it left off with final_model.h5. What can I do next?
Yes, in the end you saved the Keras model as HDF5, not the KerasClassifier, which is just an adapter for use with scikit-learn.
But you don't really need the KerasClassifier instance. What you want is the score function, and in Keras this is called evaluate. So just call model.evaluate(X, Y), which returns a list containing first the loss and then any metrics the model was compiled with (most likely accuracy).
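If only the mean accuracy that KerasClassifier.score reports is needed, it can also be computed directly from the model's predictions. A minimal sketch, assuming loaded_model.predict(X) returns one row of class probabilities per sample (a softmax output) and the labels are integer class indices. The accuracy_from_probabilities helper and the sample arrays are hypothetical:

```python
import numpy as np


def accuracy_from_probabilities(probabilities, true_labels):
    """Fraction of samples whose highest-probability class equals the true label."""
    predicted = np.argmax(probabilities, axis=1)
    return float(np.mean(predicted == np.asarray(true_labels)))


# Stand-in for `loaded_model.predict(X)`: 4 samples, 2 classes.
probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])
true_labels = [0, 1, 1, 1]

accuracy = accuracy_from_probabilities(probabilities, true_labels)
print(accuracy)
```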
To continue training the model, just load it and call model.fit with the new training set and that's it.

Show model layout / design (with all connections) in Keras

I see major differences between testing a Keras LSTM model right after training it and testing the same trained model after loading it from a .h5 file. (Accuracy of the former is always > 0.85, but accuracy of the latter is always < 0.2, i.e. a random guess.)
I checked the weights and they are identical, and the sparse layout Keras gives me via plot_model is also the same. But since that only provides a rough overview:
Is there a way to show the full layout of a Keras model (especially node connections)?
If you're using tensorflow backend, apart from plot_model, you can also use keras.callbacks.TensorBoard callback to visualize the whole graph in tensorboard. Example:
callback = keras.callbacks.TensorBoard(log_dir='./graph',
                                       histogram_freq=0,
                                       write_graph=True,
                                       write_images=True)
model.fit(..., callbacks=[callback])
Then run tensorboard --logdir ./graph from the same directory.
This is a quick shortcut, but you can go even further with that.
For example, add tensorflow code to define (load) the model within custom tf.Graph instance, like this:
from keras.layers import LSTM
import tensorflow as tf
my_graph = tf.Graph()
with my_graph.as_default():
    # All ops / variables in the LSTM layer are created as part of our graph
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)
.. after which you can list all graph nodes with dependencies, evaluate any variable, display the graph topology and so on, to compare the models.
Personally, I think the simplest way is to set up your own session. It works in all cases with minimal patching:
import tensorflow as tf
from keras import backend as K
sess = tf.Session()
K.set_session(sess)
...
# Now can evaluate / access any node in this session, e.g. `sess.graph`

Use gensim Random Projection in sklearn SVM

Is it possible to use a gensim Random Projection to train a SVM in sklearn?
I need to use gensim's tfidf implementation because it's better at dealing with large inputs and then want to put that into a random projection on which I will train my SVM. I'd also be happy to just pass the tfidf model generated by gensim to sklearn and use their random projection, if that makes things easier.
But so far I haven't found a way to get either model out of gensim into sklearn.
I have tried using gensim.matutils.corpus2csc, but of course that doesn't work: neither TfidfModel nor RpModel is a corpus, so now I'm clueless about what to try next.
This is now very easy thanks to an awesome gensim contribution from Chinmaya Pancholi (see post here).
Simply import the sklearn wrapper from gensim:
from gensim.sklearn_api import RpTransformer
Then you can use the model as a step in a pipeline, as you would any other sklearn transformer:
from sklearn import svm
from sklearn.pipeline import Pipeline
model = RpTransformer(num_topics=2)
clf = svm.SVC()
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(X_train, y_train)
One thing to be aware of, when using the gensim models, is that you still need to perform the dictionary and corpus steps. So instead of fitting your model on X_train, you'll have to do something along the following lines:
from gensim.corpora import Dictionary
dictionary = Dictionary(X_train)
corpus_train = [dictionary.doc2bow(text) for text in X_train]
corpus_test = [dictionary.doc2bow(text) for text in X_test]
Then fit/predict your model on corpus_train or corpus_test.
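For the alternative mentioned in the question, staying entirely inside scikit-learn, a rough sketch might look like the following. Note that this swaps in scikit-learn's own TfidfVectorizer and SparseRandomProjection rather than the gensim models, so it does not address the large-input concern. The documents, labels and n_components=4 are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.random_projection import SparseRandomProjection
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with two classes.
docs = [
    "disk full on server",
    "server disk failure",
    "reset my password",
    "password login reset",
]
labels = [0, 0, 1, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Project the sparse tf-idf features down to 4 dimensions.
    ("rp", SparseRandomProjection(n_components=4, random_state=0)),
    ("svm", LinearSVC()),
])
pipe.fit(docs, labels)
predictions = pipe.predict(docs)
```

Because every step follows the scikit-learn API, no dictionary or bag-of-words conversion is needed in this variant.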
