Streaming corpus to a vectorizer in a pipeline - scikit-learn

I have a large language corpus and I use sklearn tfidf vectorizer and gensim Doc2Vec to compute language models. My total corpus has about 100,000 documents and I realized that my Jupyter notebook stops computing once I cross a certain threshold. I guess that the memory is full after applying the grid-search and cross-validation steps.
Even the following example script already stops for Doc2Vec at some point:
%%time
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess

np.random.seed(1)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=1)

model_names = [
    'TfidfVectorizer',
    'Doc2Vec_PVDM',
]

models = [
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None, min_df=5),
    D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=1),
]

parameters = [
    {
        'model__smooth_idf': (True, False),
        'model__norm': ('l1', 'l2', None)
    },
    {
        'model__size': [200],
        'model__window': [4]
    }
]

for params, model, name in zip(parameters, models, model_names):
    pipeline = Pipeline([
        ('model', model),
        ('clf', LogisticRegression())
    ])
    grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)

    cval = cross_val_score(grid.best_estimator_, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1)
    print("Cross-Validation (Train):", np.mean(cval))

print("Finished.")
Is there a way to "stream" each line in a document, instead of loading the full data into memory? Or another way to make it more memory efficient? I read a few articles on the topic but could not discover any that included a pipeline example.

With just 100,000 documents, unless they're gigantic, it's not necessarily the loading-of-data into memory that's causing you problems. Note especially:
loading & tokenizing the docs has already succeeded before you even begin the scikit-learn pipelines/grid-search, and the further multiplication of memory usage is in the necessarily-repeated alternate models, not the original docs
scikit-learn APIs tend to assume the training data is fully in memory – so even though the innermost gensim classes (Doc2Vec) are happy with streamed data of arbitrary size, it's harder to adapt that into scikit-learn (a gensim-only streaming sketch follows below)
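For the gensim side on its own, a re-iterable object is enough to stream documents from disk instead of building a big in-memory list. A minimal sketch, assuming one document per line in a file named corpus.txt (both the file name and layout are assumptions); note this bypasses the scikit-learn pipeline entirely:
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class StreamingCorpus:
    """Re-iterable corpus: yields one TaggedDocument per line of a text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(simple_preprocess(line), [i])

# Doc2Vec happily consumes the stream, re-reading the file on each training pass
d2v = Doc2Vec(StreamingCorpus('corpus.txt'), vector_size=200, window=4, min_count=5, workers=3)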
So you should look elsewhere, and there are other issues with your shown code.
I've often had memory or lockup issues with scikit-learn's attempts at parallelism (as enabled through n_jobs-like parameters), especially inside Jupyter notebooks. It forks full OS processes, which tend to blow up memory usage. (Each sub-process gets a full copy of the parent process's memory, which might be efficiently shared – until the subprocess starts moving/changing things.) Sometimes one process, or inter-process communication, fails and the main process is just left waiting for a response – which seems to especially confuse Jupyter notebooks.
So, unless you have tons of memory and absolutely need scikit-learn parallelism, I'd recommend trying to get things working with n_jobs=1 first – and only later experimenting with more jobs.
In contrast, the workers of the Doc2Vec class (and D2VTransformer) use lighter-weight threads, so you should use at least workers=3, and perhaps 8 (if you have at least that many CPU cores), rather than the workers=1 you're using now.
But also: you're doing a bunch of redundant actions of unclear value in your code. The test set from initial train-test split isn't ever used. (Perhaps you were thinking of keeping it aside as a final validation set? That's the most rigorous way to get a good estimate of your final result's performance on future unseen data, but in many contexts data is limited, and that estimate isn't as important as just doing the best possible with limited data.)
The GridSearchCV itself does a 5-way train/test split as part of its work, and its best results are remembered in its properties when it's done.
So you don't need to do the cross_val_score() again - you can read the results from GridSearchCV.
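For example, a small sketch using only attributes GridSearchCV exposes after grid.fit(...) has run:
print(grid.best_params_)                      # best parameter combination found
print(grid.best_score_)                       # mean cross-validated score of that combination
print(grid.cv_results_['mean_test_score'])    # mean CV score for every combination tried

# grid.best_estimator_ has already been refit on the full training set (refit=True by default),
# so a separate cross_val_score() pass repeats work you've already paid for.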

Related

Tree based algorithm different behavior with duplicated features

I don't understand why I have three different behaviors depending on the classifier I use, even though they should go hand in hand.
Here is the code to dig into the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np

# load data
wine = datasets.load_wine()
X = wine.data
y = wine.target

# some helper functions
def repeat_feature(X, which=1, times=1):
    return np.hstack([X, np.hstack([X[:, :which]] * times)])

def do_the_job(X, y, clf):
    return np.mean(cross_validate(clf, X, y, cv=5)['test_score'])

# define the classifiers
clf1 = DecisionTreeClassifier(max_depth=25, random_state=42)
clf2 = RandomForestClassifier(n_estimators=5, random_state=42)
clf3 = LGBMClassifier(n_estimators=5, random_state=42)

# repeat the same feature up to 50 times and test the classifiers
clf1_result = []
clf2_result = []
clf3_result = []

for i in range(1, 50):
    my_x = repeat_feature(X, times=i)
    clf1_result.append(do_the_job(my_x, y, clf1))
    clf2_result.append(do_the_job(my_x, y, clf2))
    clf3_result.append(do_the_job(my_x, y, clf3))

# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12, 7))
plt.plot(clf1_result, label='tree')
plt.plot(clf2_result, label='forest')
plt.plot(clf3_result, label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (like a repeated feature) I would get a decrease in the score (which happens as expected for random forest).
The question is: why does this not happen with the other two classifiers?
Why do their scores remain stable?
Am I missing something from the theoretical point of view?
Thanks, all.
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.
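As a quick check of this explanation, one could rerun the question's loop with a forest whose per-split feature subsampling is turned off. A sketch reusing the question's repeat_feature/do_the_job helpers; max_features=None makes every split consider all features, unlike the default square-root subsampling of RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier

# With max_features=None, duplicated columns no longer shrink the pool of
# candidate features at each split, so the score should stay roughly stable,
# much like the single-tree and LightGBM curves.
clf_full = RandomForestClassifier(n_estimators=5, max_features=None, random_state=42)
clf_full_result = [do_the_job(repeat_feature(X, times=i), y, clf_full) for i in range(1, 50)]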

Fitting a Gensim Fasttext pretrained model to my text

I have a pretrained FastText model. I have loaded it into my notebook and want to fit it to my free-form text to train an ML classifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import FastText
import pickle
import numpy as np
from numpy.linalg import norm
from gensim.utils import tokenize

model_2 = FastText.load(model_path + 'itsm_fasttext_embeddings_100_dim.model')

tokens = list()

def get_column_vector(model, list_corpus):
    for i in list_corpus:
        svec = np.zeros(100)
        tok_sent = list(tokenize(i))
        count = 0
        for word in tok_sent:
            vec = model.wv[word]
            norm_vec = norm(vec)
            if (norm_vec > 0):
                vec = np.multiply(vec, (1/norm_vec))
                svec = np.add(svec, vec)
                count += 1
        if (count > 0):
            averaged_vec = np.multiply(svec, (1/count))
            tokens.append(averaged_vec)
    return tokens

list_corpus = df["freeformtext_col"].tolist()

# lst = array of vectors for each row of free form text
lst = get_column_vector(model, list_corpus)

x_text_train, x_text_test, y_train, y_test = train_test_split(lst, y, test_size=0.2, random_state=42)

model_2.fit(x_text_train, y_train, validation_split=0.1, shuffle=True)
I get the error of
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [59], in <cell line: 1>()
----> 1 model_2.fit(x_text_train, y_train, validation_split=0.1, shuffle=True)
AttributeError: 'FastText' object has no attribute 'fit'
Other documentation showing the initial training of FastText has a fit function.
I am having trouble finding documentation from others who have taken a pre-trained gensim FastText model and fit it to their own text data to ultimately use with a classifier.
The Gensim FastText implementation offers no .fit() method. (I also don't see any such method in Facebook's Python wrapper of its original C++ FastText implementation. Even in its supervised-classification mode, it has its own train_supervised() method rather than a scikit-learn-style fit() method.)
If you saw some online example using such a method, it must have been using a different FastText implementation - so you should consult the full details of that other example to see which library they were using.
I don't know of any good online examples showing how to 'fine-tune' a pretrained FastText model to a smaller set of new texts, much less any demonstrating benefits, gotchas, & rules-of-thumb for performing such an operation.
If you did see an online example suggesting such an approach, & demonstrating some benefits over other less-complicated approaches, then that source-of-inspiration would also be the model to follow - or to mention/link when trying to debug their approach. Without someone's full working examples as a guide/template, you're in improvised-innovation mode.
Note you don't have to start with someone else's pre-trained model. You can train your own FastText models with your own training texts – and for many domains, & tasks, this could work better than a generic model trained from public sources like Wikipedia texts or large web crawls.
And when you do, you have the option of simply using FastText in its base unsupervised mode – as a way to featurize text – then pass those FastText-modeled features to some other explicit classifier option (such as the many classifiers in scikit-learn with .fit() methods).
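A rough sketch of that featurize-then-classify route, reusing the question's get_column_vector helper and model_2 (the label column name label_col is hypothetical):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Turn each document into an averaged 100-dimensional FastText vector
doc_vectors = get_column_vector(model_2, df["freeformtext_col"].tolist())
labels = df["label_col"]  # hypothetical label column

x_tr, x_te, y_tr, y_te = train_test_split(doc_vectors, labels, test_size=0.2, random_state=42)

# The .fit() lives on the scikit-learn classifier, not on the FastText model
clf = LogisticRegression(max_iter=1000)
clf.fit(x_tr, y_tr)
print(clf.score(x_te, y_te))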
FastText's own -supervised mode builds a different kind of model that combines the word-training with the classification-training. A general FastText language model you find online is unlikely to be a specific -supervised mode model, unless it is explicitly declared to be one. If it's a standard unsupervised model, there's no straightforward way to adapt it into a -supervised model. And if it is already a -supervised model, it will have already been trained for someone else's fixed set of known-labels.

Is it possible to set the splitting strategy for GridSearchCV?

I'm optimizing a model's hyperparameters with GridSearchCV. And because the data I'm working with is very imbalanced, I need to "choose" how the algorithm splits the train/test sets in order to ensure that the underrepresented points are in both sets.
By reading scikit-learn's documentation, I have the idea that it's possible to set the splitting strategy for GridSearch but I'm not sure how or if this is the case.
I would be very grateful if someone could help me with this.
Yes: pass a StratifiedKFold object to GridSearchCV as its cv parameter.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
By default, if you are training a classification model with GridSearchCV, the splitting method is StratifiedKFold, which preserves the proportions of the target classes in each fold.
If your dataset is imbalanced for some other reason (not the target variable), you can choose another criterion to perform the split. Carefully read the documentation of GridSearchCV and select an appropriate CV splitter.
In the scikit-learn documentation on model selection, there are many splitter classes that you could use. Or you can define your own splitter class according to your criteria, though that is more involved.
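For reference, GridSearchCV accepts any object that implements split(X, y, groups) and get_n_splits(); here is a toy custom-splitter sketch (the alternating-index logic is only a placeholder for whatever criterion you actually need):
import numpy as np

class EveryOtherSplit:
    """Toy splitter: alternates samples between train and test folds."""
    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        for k in range(self.n_splits):
            test_mask = (indices % self.n_splits) == k
            yield indices[~test_mask], indices[test_mask]

clf = GridSearchCV(svc, parameters, cv=EveryOtherSplit())
clf.fit(iris.data, iris.target)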

How do I restrict the number of processors used by the ridge regression model in sklearn?

I want to make a fair comparison between different machine learning models. However, I find that the ridge regression model automatically uses multiple processors, and there is no parameter (such as n_jobs) with which I can restrict the number of processors used. Is there any way to solve this problem?
A minimal example:
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
features, target = make_regression(n_samples=10000, n_features=1000)
r = RidgeCV()
r.fit(features, target)
print(r.score(features, target))
If you set the environment variable OMP_NUM_THREADS to n, you will get the expected behaviour. E.g. on Linux, run export OMP_NUM_THREADS=1 in the terminal to restrict the computation to 1 CPU.
Depending on your system, you can also set it directly in python. See e.g. How to set environment variables in Python?
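For instance, a minimal sketch of setting it from Python; the variable has to be set before numpy (and its BLAS backend) is first imported, otherwise the thread pool may already have been sized:
import os
os.environ["OMP_NUM_THREADS"] = "1"   # must come before the first numpy import

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

features, target = make_regression(n_samples=10000, n_features=1000)
r = RidgeCV()
r.fit(features, target)
print(r.score(features, target))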
To expand further on @PV8's answer: whenever you instantiate RidgeCV() without explicitly setting the cv parameter (as in your case), an efficient Leave-One-Out cross-validation is run (according to the algorithms referenced here, implementation here).
On the other hand, when explicitly passing the cv parameter to RidgeCV(), this happens:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

model = Ridge()
parameters = {'alpha': [0.1, 1.0, 10.0]}
gs = GridSearchCV(model, param_grid=parameters)
gs.fit(features, target)
print(gs.best_score_)
(as you can see here), namely that you'll use GridSearchCV with the default n_jobs=None.
Most importantly, as pointed out by one of the sklearn core devs here, the issue you are experiencing might not depend on sklearn, but rather on
[...] your numpy setup performing vectorized operations with parallelism.
(where the vectorized operations are performed within the computationally efficient LOO cross-validation procedure that you are implicitly calling by not passing cv to RidgeCV()).
Based on the docs for RidgeCV:
Ridge regression with built-in cross-validation.
By default, it performs efficient Leave-One-Out Cross-Validation.
And by default cv=None, which means the efficient Leave-One-Out cross-validation is used.
An alternative approach with ridge regression and cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

clf = Ridge(alpha=1.0)
scores = cross_val_score(clf, features, target, cv=5, n_jobs=1)
print(scores)
See also the docs of Ridge and cross_val_score.
Also take a look at sklearn.utils.parallel_backend; I think you can set the number of cores used for the computation via its n_jobs parameter.
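A hedged sketch of that suggestion (note that parallel_backend only governs joblib-level parallelism inside scikit-learn; BLAS-level threading is still controlled by environment variables such as OMP_NUM_THREADS, mentioned above):
from sklearn.utils import parallel_backend

with parallel_backend('loky', n_jobs=1):
    # joblib-parallel work inside this block is limited to a single worker
    r.fit(features, target)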

Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.
Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.
The leakage comes in because, if I'm doing 5-fold cross-validation, I'm training on 80% of my train data and testing it on the remaining 20% of the train data.
I really should just be imputing the 20% based on the 80% of train (whereas I was using 100% of the data before).
1) Is this the right way to think about cross-validation?
2) I've been looking at the Pipeline class in sklearn.pipeline and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in float64 columns with the mean", "impute all other data with the mode", etc.
There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a Pipeline? Would I just make my own subclass of BaseEstimator?
Any guidance here would be great!
1) Yes, you should impute the 20% test data using the 80% training data.
2) I wrote a blog post that answers your second question, but I'll include the core parts here.
With sklearn.pipeline, you can apply separate preprocessing rules to different feature types (e.g., numeric, categorical). In the example code below, I impute the median of numeric features before scaling them. The categorical and boolean features are imputed with the mode -- the categorical features are one-hot encoded.
You can include an estimator at the end of the pipeline for regression, classification, etc.
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, Imputer, StandardScaler

preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            Imputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            Imputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            Imputer(strategy="most_frequent")
        ))
    ])
)
The TypeSelector portion of the pipeline assumes the object X is a pandas DataFrame. The subset of columns with the given data type are selected with TypeSelector.transform.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
I recommend thinking of 5-fold cross-validation as simply splitting the data into 5 parts (or folds). You hold out one fold for testing and use the other 4 together as your training set. You repeat this process another 4 times until each fold has had the chance to be tested.
For your imputation to work correctly and not be subject to contamination, you would need to determine the mean from the 4 folds used for training, and use it to impute that value in both the training folds and the held-out fold.
I like to implement the CV split with StratifiedKFold. This will ensure you have the same number of samples for each class in the folds.
To answer your question about using Pipelines, I would say you should probably subclass the BaseEstimator with your custom Imputation transformer. Inside of your loop for the CV-split, you should compute the mean from your training set then set this mean as a parameter in your transformer. Then you can call fit or transform.
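A minimal sketch of that idea; rather than setting the mean by hand inside an explicit CV loop, the transformer can learn it in fit() and live inside a Pipeline, so each StratifiedKFold split refits it on the training folds only (the tiny toy DataFrame is just for illustration):
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

class MeanImputer(BaseEstimator, TransformerMixin):
    """Impute missing values with column means learned from the training folds only."""
    def fit(self, X, y=None):
        self.means_ = X.mean()   # computed on the training folds during each CV split
        return self

    def transform(self, X):
        return X.fillna(self.means_)

# Toy data with a few missing values
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
                   "b": [2.0, 2.0, np.nan, 4.0, 5.0, 5.0]})
y = pd.Series([0, 1, 0, 1, 0, 1])

# The imputer is refit on the training folds of every split,
# so the held-out fold never leaks into the imputed values
pipe = make_pipeline(MeanImputer(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, df, y, cv=StratifiedKFold(n_splits=2)))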
