I am using GridSearchCV like this:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import joblib

corpus = load_files('corpus')
with open('stopwords.txt', 'r') as f:
    stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

x = corpus.data
y = corpus.target

pipeline = Pipeline([
    ('vec', CountVectorizer(stop_words=stop_words)),
    ('classifier', MultinomialNB())])

parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3],
              'classifier__fit_prior': [True, False]}

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf = gs_clf.fit(x, y)
joblib.dump(gs_clf.best_estimator_, 'MultinomialNB.pkl', compress=1)
Then, in another file, to classify new documents (not from the corpus), I do this:
classifier = joblib.load(filepath) # path to .pkl file
result = classifier.predict(tokenlist)
My question is: Where do I get the values needed for the classification_report?
In many other examples, I have seen people split the corpus into a training set and a test set.
However, since I am using GridSearchCV with k-fold cross-validation, I don't need to do that.
So how can I get those values from GridSearchCV?
If you have a GridSearchCV object:
from sklearn.metrics import classification_report
clf = GridSearchCV(....)
clf.fit(x_train, y_train)
classification_report(y_test, clf.best_estimator_.predict(x_test))
If you have saved the best estimator and loaded it then:
classifier = joblib.load(filepath)
classification_report(y_test, classifier.predict(x_test))
The best model is in clf.best_estimator_. Fit it on the training data, then predict on your test data and pass y_test and the predictions to classification_report.
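If you want a classification report even though GridSearchCV does its own cross-validation, the usual pattern is to hold out a test set first and let the grid search cross-validate only on the training part. A minimal sketch, reusing x, y, pipeline and parameters from the question:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# hold out a test set; the CV folds then only ever see the training portion
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf.fit(x_train, y_train)

# best_estimator_ is refit on all of x_train, so predict() already uses the tuned model
print(classification_report(y_test, gs_clf.predict(x_test)))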
Related
I'm practicing on the Kaggle news headline dataset with DJIA prices as exported from Yahoo Finance: https://www.kaggle.com/aaron7sun/stocknews#Combined_News_DJIA.csv
There are not many discussions on NLP with time series. I attempted to use this article's code with CountVectorizer(), but was unsuccessful. I was wondering if anyone has any resources or suggestions?
My code below is based on the headlines in the dataset above:
def modeller(vect, X_tr, y_tr, X_te):
    X_train_dtm = vect.fit_transform(X_tr.unstack())
    X_test_dtm = vect.fit_transform(X_te.unstack())
    X_tr_arima = [x for x in X_train_dtm]
    print('done with count vectorizer. now modelling.')
    model = ARIMA(X_tr_arima, order=(1,1,1))
    print('done modelling. now fitting')
    model_fit = model.fit(X_tr_arima, y_tr)
    y_hat = model.predict(x_te_arima)
    return y_hat
vect = CountVectorizer(stop_words='english')
X_train, X_test, y_train, y_test = X.iloc[0:100], X.iloc[100:X.shape[0]], y[0:100], y[100:len(y)]
modeller(vect, X_train, y_train, X_test)
Output (error from ARIMA line):
ValueError: setting an array element with a sequence.
I had the same problem and was able to fix it using this approach.
Try changing
from pmdarima.pipeline import Pipeline
to
from pmdarima.pipeline import Pipeline as arimaPip
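For context, a small sketch of what the alias buys you when both libraries are needed in the same file (the alias name is just the one suggested above):

from sklearn.pipeline import Pipeline               # sklearn's Pipeline keeps its usual name
from pmdarima.pipeline import Pipeline as arimaPip  # pmdarima's Pipeline no longer shadows it

# both classes can now be used side by side without one import overwriting the other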
What I'm trying to do:
Get the K-fold cross-validated scores of an SVM. The data has all-numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature-engine.
My understanding of the matter:
The independent variables have NA values, all of them below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test set using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid over-fitting and data leakage.
When it comes to K-fold cross-validation, the train and test sets change with each fold.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test set based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed; the same random seed is used in the K-fold cross-validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation:
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM:
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is: \n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))

interactive(svm1, gam = G1, C = cc1)
I then merge the train and test sets to get back a transformed dataset:
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit X_Final, which is the concatenated train and test set, to get the K-fold cross-validated score.
kfold = KFold(n_splits = 10, random_state = 3)
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and the NA values re-imputed in each fold using only that fold's train set, to avoid over-fitting / data leakage.
To impute and scale the data with the parameters derived from each fold in the CV, you first need to set up the engineering steps in a pipeline, and then do the CV over the entire pipeline. For example, something like this:
Set up the engineering pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from feature_engine import missing_data_imputers as mdi

my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # Gradient Boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
Then the CV:
param_grid = {
# try different gradient boosted tree model parameters
'gbm__max_depth': [None, 1, 3],
}
# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
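With the pipeline in place, both the grid search and a plain cross_val_score re-fit the imputer and the scaler on the training part of each fold only. A usage sketch with the X_train / y_train split from the question (variable names assumed):

grid_search.fit(X_train, y_train)
print(grid_search.best_score_)

# the same idea without a parameter search:
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=10, shuffle=True, random_state=3)
scores = cross_val_score(my_pipe, X_train, y_train, cv=kfold, scoring='accuracy')
print(scores.mean(), scores.std())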
I used scikit-learn's DictVectorizer to make a feature vector:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
import numpy as np
import joblib

X = dataset.drop('Tag', axis=1)
y = dataset['Tag']
v = DictVectorizer(sparse=False)
X = v.fit_transform(X.to_dict('records'))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
classes = np.unique(y)
classes = classes.tolist()
per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes)
joblib.dump(per, 'saved_model.pkl')
and save the trained model to a file.
Then I load the model in another file to classify new data:
new_X=df
v = DictVectorizer(sparse=False)
new_X = v.fit_transform(new_X.to_dict('records'))
#Load model
per_load = joblib.load('saved_model2.pkl')
per_load.predict(new_X)
When I try to predict the new data by executing this code, the output is:
ValueError: X has 43 features per sample; expecting 983
How do I save the model?
You need to save a pickled object for the vectorizer as well, and apply transform rather than fit_transform, because your vectorizer has already learned the vocabulary, and that vocabulary needs to be used when predicting unseen data.
# save the fitted vectorizer
import joblib
joblib.dump(v, 'vectorizer.pkl')

# load the pickle at prediction time
v = joblib.load('vectorizer.pkl')
per_load.predict(v.transform(df.to_dict('records')))  # don't use fit_transform, use transform only
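Putting it together, the prediction file from the question could look roughly like this (df and the file names are taken from the question and the snippet above):

import joblib

# load the vectorizer that was fitted on the training data, plus the trained model
v = joblib.load('vectorizer.pkl')
per_load = joblib.load('saved_model.pkl')

# transform (not fit_transform) keeps the original 983-column feature space
new_X = v.transform(df.to_dict('records'))
print(per_load.predict(new_X))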
I want to implement GridSearchCV for an XGBoost model in a pipeline. I have a preprocessor for the data (shown in the full code below) and some grid params:
XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
('preprocess', preprocessor),
('XGBmodel', XGBmodel)
])
And I want to pass these fit params:
fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)],
"XGBmodel__early_stopping_rounds": 10,
"XGBmodel__verbose": False}
I am trying to fit the model:
searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)
but I get an error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool.
I guess it is because the validation data isn't going through the preprocessing, but when I google it, I find it done this way everywhere, and it seems like it should work.
I also tried to find a way to apply the preprocessor to the validation data separately, but I cannot transform the validation data without first fitting on the train data.
Full code:
columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()
num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
('imputer', SimpleImputer(strategy = 'most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
('num', num_preprocessor, num_cols),
('cat', cat_preprocessor, cat_cols)
])
XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
('preprocess', preprocessor),
('XGBmodel', XGBmodel)
])
param_grid = {
"XGBmodel__n_estimators": [10, 50, 100, 500],
"XGBmodel__learning_rate": [0.1, 0.5, 1],
}
fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)],
"XGBmodel__early_stopping_rounds": 10,
"XGBmodel__verbose": False}
searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)
Is there any way to preprocess the validation data in the pipeline? Or maybe a completely different way to implement this?
There is no good way. If you have a long pipeline of transformers before fitting a model, you can consider fitting those in the pipeline and then applying the model separately.
The underlying issue is that a pipeline has no notion of a validation set used during model fitting. You can see a discussion on the LightGBM GitHub here. Their proposal is to pre-train the transformers and apply them to the validation data before you fit the full pipeline. This can be fine if you use fast transformers, but it can double CPU time in an extreme scenario.
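A rough sketch of that proposal with the objects from the question: pre-fit the preprocessor on the training data, transform the validation set with it, then fit the full pipeline while routing the fit parameters to the model step (this fits the preprocessor twice, and assumes an xgboost version that still accepts early_stopping_rounds in fit()):

# pre-train the preprocessor and transform the validation data with it
preprocessor.fit(X_train)
X_valid_trans = preprocessor.transform(X_valid)

# fit the full pipeline; step-prefixed kwargs are routed to the XGBmodel step
pipe.fit(X_train, y_train,
         XGBmodel__eval_set=[(X_valid_trans, y_valid)],
         XGBmodel__early_stopping_rounds=10,
         XGBmodel__verbose=False)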
One way to train a pipeline that uses early stopping is to train the preprocessing and the regressor separately.
The steps are the following:
fit_transform() the transformers
transform() the validation data.
fit() the model with the XGBoost fit parameters
dump the fitted pipeline
as follows:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
import numpy as np
import joblib
rng = np.random.RandomState(0)
X_train, X_val = rng.randn(50, 3), rng.randn(20, 3)
y_train, y_val = rng.randn(50, 1), rng.randn(20, 1)
pipeline = Pipeline([
('scaler', StandardScaler()),
('regressor', XGBRegressor(random_state=0)),
])
X_train_transformed = pipeline[:-1].fit_transform(X_train)
x_val_transformed = pipeline[:-1].transform(X_val)
pipeline[-1].fit(
X=X_train_transformed,
y=y_train,
eval_set=[(x_val_transformed, y_val)],
early_stopping_rounds=10,
)
joblib.dump(pipeline, 'pipeline.pkl')
pipe = joblib.load('pipeline.pkl')
pipe.score(X_val, y_val)
Notes: This will work if you want to fit the pipeline. However, if you want to perform a grid search using early stopping, you will have to write your own grid search, like in this article.
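For reference, a hand-rolled search over the same kind of pipeline could look roughly like this; it is only a sketch, tuning a single hypothetical parameter, and again assumes an xgboost version that accepts early_stopping_rounds in fit():

from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

param_grid = {'regressor__n_estimators': [50, 100, 200]}
best_score, best_pipe = float('-inf'), None

for params in ParameterGrid(param_grid):
    candidate = clone(pipeline).set_params(**params)
    # fit the transformers, then the regressor with early stopping on the validation split
    X_tr = candidate[:-1].fit_transform(X_train)
    X_va = candidate[:-1].transform(X_val)
    candidate[-1].fit(X_tr, y_train,
                      eval_set=[(X_va, y_val)],
                      early_stopping_rounds=10,
                      verbose=False)
    score = candidate.score(X_val, y_val)  # R^2 of the full pipeline on the validation data
    if score > best_score:
        best_score, best_pipe = score, candidate

print(best_score, best_pipe[-1].n_estimators)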
I found and successfully tested the following script, which applies Pipeline and GridSearchCV to classifier selection. The script outputs the best classifier and its accuracy.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10] # toy test data taken from the training data
y_test = iris.target[:10] # toy test data taken from the training data
#Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])
# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
'classifier__penalty': ['l1', 'l2'],
'classifier__C': np.logspace(0, 4, 10)},
{'classifier': [RandomForestClassifier()],
'classifier__n_estimators': [10, 100, 1000],
'classifier__max_features': [1, 2, 3]}]
# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Fit grid search
best_model = clf.fit(X_train, y_train)
print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))
from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred,
average='weighted'))
How can I adjust the script so that it not only outputs the best classifier overall, which is LogReg in our example, but also the best model selected for each of the other classifiers? In the example above, I would like to see the output from RandomForestClassifier(), too.
Ideally, the solution would show the best classifier for each algorithm (LogReg, RandomForest, ...) and sort each of those best classifiers into a table. The first column or index should be the model, with the precision_recall_fscore_support values in the columns to the right. The table should then be sorted by F-score.
PS: Though the script works, I'm still unsure what the function of LogisticRegression() in the Pipeline is, as it is defined again in the search space later.
Solution (simplified):
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]
seed=1
models = [
'RFC',
'logisticRegression'
]
clfs = [
RandomForestClassifier(random_state=seed,n_jobs=-1),
LogisticRegression()
]
params = {
models[0]:{'n_estimators':[100]},
models[1]: {'C':[1000]}
}
for name, estimator in zip(models, clfs):
    print(name)
    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit='True', n_jobs=-1, cv=5)
    clf.fit(X_train, y_train)
    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Accuracy: {:.4%}".format(acc))
    print(classification_report(y_test, y_pred, digits=4))
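To get the requested table, the same loop can collect one row per model and sort by F-score; a sketch along those lines (the column names are my own):

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

rows = []
for name, estimator in zip(models, clfs):
    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit=True, n_jobs=-1, cv=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    rows.append({'model': name, 'precision': p, 'recall': r, 'f1': f})

# one row per algorithm's best estimator, sorted by weighted F-score
table = pd.DataFrame(rows).set_index('model').sort_values('f1', ascending=False)
print(table)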
If I understood correctly, this should work fine.
import pandas as pd
import numpy as np
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the test_score of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
# The first line contains the best model and its parameters
df_final.to_csv('sorted_table.csv')
# OR, to avoid writing the index
df_final.to_csv('sorted_table2.csv',index=False)
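cv_results_ also holds the scores themselves, so the table can carry them next to the parameters if you want them in the CSV (a small optional addition; the file name is just an example):

df['mean_test_score'] = best_model.cv_results_['mean_test_score']
df_final = df.iloc[sorting]
df_final.to_csv('sorted_table_with_scores.csv', index=False)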
However, in this case the ordering is not based on the F values. To sort by F-score, set the scoring attribute in GridSearchCV to 'f1_weighted' and repeat my code.
Example:
...
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0,scoring='f1_weighted')
best_model = clf.fit(X_train, y_train)
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the F values of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
df_final.to_csv('F_sorted_table.csv')