Pipelining StandardScaler, recursive feature selection, and a classifier - scikit-learn

I have a dataset, X and Y.
I want to implement the following steps using a pipeline:
- StandardScaler
- Recursive feature selection
- RandomForestClassifier
- Cross-validated prediction
I implemented it as follows:
import numpy as np
from sklearn.feature_selection import RFE, RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
Y = data.target
print(X.shape)
print(Y.shape)
clf = RandomForestClassifier(n_estimators=50,max_features=None,n_jobs=-1,random_state=0)
kf = KFold(n_splits=2, shuffle=True, random_state=0)
pipeline = Pipeline([('standardscaler', StandardScaler()),
('rfecv', RFECV(estimator=clf, step=1, cv=kf, scoring='accuracy', n_jobs=7)),
('clf', clf)])
pipeline.fit(X,Y)
ypredict = cross_val_predict(pipeline, X, Y, cv=kf)
accuracy = accuracy_score(Y, ypredict)
print (accuracy)
Please look closely at my implementation and let me know what is wrong with my code. Thank you.

This works. The final estimator in the pipeline only needs to implement fit, which RFECV does. RFECV also exposes predict, which cross_val_predict needs. Here's the code:
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
Y = data.target
clf = RandomForestClassifier()
# create pipeline
estimators = [('standardize' , StandardScaler()),
('rfecv', RFECV(estimator=clf, scoring='accuracy'))]
# build the pipeline
pipeline = Pipeline(estimators)
# run the pipeline
kf = KFold(n_splits=2, shuffle=True, random_state=0)
ypredict = cross_val_predict(pipeline, X, Y, cv=kf)
accuracy = accuracy_score(Y, ypredict)
print (accuracy)
'Output':
0.96
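To make the point about the final step concrete, here is a minimal sketch (not part of the original answer) showing that a fitted RFECV exposes predict, and that calling it amounts to predicting with the refitted inner estimator on the selected columns. The random_state values and variable names are my own assumptions, chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

X, Y = load_iris(return_X_y=True)

# random_state values are assumptions, used only for repeatability
kf = KFold(n_splits=2, shuffle=True, random_state=0)
rfecv = RFECV(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
              cv=kf, scoring='accuracy')
rfecv.fit(X, Y)

# predict() on the selector delegates to the estimator refit on the kept features
direct = rfecv.predict(X)
manual = rfecv.estimator_.predict(rfecv.transform(X))
same_predictions = np.array_equal(direct, manual)

print(same_predictions)
print(rfecv.support_)  # boolean mask of the selected features
```

This is why RFECV can sit at the end of a Pipeline and still be used with cross_val_predict.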

Related

xgboost feature importance high but doesn't produce a better model

I am using XGBoost for a binary prediction problem. I tested my model with several features and got some good results.
After adding one feature to the model and calculating the feature importance, the importance of this feature turned out to be very high, far above that of the other features.
However, when testing the model, the test score drops considerably.
Is there an explanation for this kind of behaviour?
There are at least a few ways to run feature importance experiments.
# Let's load the packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
# Note: load_boston was removed in scikit-learn 1.2, so this snippet needs an older release
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
# 1. Impurity-based importance built into the random forest
rf.feature_importances_
plt.barh(boston.feature_names, rf.feature_importances_)
sorted_idx = rf.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], rf.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")
# 2. Permutation importance, computed on the held-out test set
perm_importance = permutation_importance(rf, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
# 3. SHAP values
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
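One possible explanation for the behaviour in the question (an assumption on my part, since the question does not give enough detail to confirm it) is that impurity-based importance is computed from the training data, so it can rate a feature highly even when that feature does not help on unseen data. The sketch below illustrates the gap on synthetic data with a RandomForestClassifier rather than XGBoost: a pure-noise column typically still receives a non-zero impurity-based importance, while its permutation importance on the test set stays close to zero. The dataset, sizes, and seeds are all invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 3 informative features
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Append a pure-noise column with many distinct values
rng = np.random.RandomState(0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])
noise_idx = X.shape[1] - 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importance comes from the training data only
mdi = rf.feature_importances_
# Permutation importance measures the score drop on held-out data
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

print("impurity-based importance of the noise column:", mdi[noise_idx])
print("permutation importance of the noise column:   ", perm.importances_mean[noise_idx])
```

If the newly added feature looks important by the first measure but not by the second, it is worth checking whether the model is overfitting to it.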
Also, you can certainly add more data to your model. Models generally produce more accurate results when they 'see' more data. Finally, you can always test other models on your dataset and see how they perform. Today at work I tested an XGBoost model and a RandomForestRegressor model. I expected the former to perform better, but the latter actually performed much better. It's almost impossible to guess which model will perform better on any given dataset; you have to try multiple models, check the predictive capabilities of each, and pick the one (or maybe two) that performs best. Having said that, you can try something like this.
import time
import numpy as np
import pandas as pd  # needed for pd.set_option below
import matplotlib.pyplot as plt
from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
np.random.seed(0)
pd.set_option('display.max_columns', 500)
#df = pd.read_csv('C:\\your_path_here\\test.csv')
#print('done!')
#df = df[:10000]
#df = df.fillna(0)
#df = df.dropna()
X = df[['RatingScore',
'Par',
'Term',
'TimeToMaturity',
'LRMScore',
'Coupon',
'Price']]
#select your target variable
y = df[['Spread']]
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
clustering_names = [
'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
'SpectralClustering', 'Ward', 'AgglomerativeClustering',
'DBSCAN', 'Birch']
plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
hspace=.01)
plot_num = 1
# n_samples was undefined in the original snippet; 1500 is an assumed value (blobs is not used below)
blobs = datasets.make_blobs(n_samples=1500, random_state=8)
# normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)
# estimate bandwidth for mean shift
bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
# create clustering estimators
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=2)
ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
connectivity=connectivity)
spectral = cluster.SpectralClustering(n_clusters=2,
eigen_solver='arpack',
affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=.2)
affinity_propagation = cluster.AffinityPropagation(damping=.9,
preference=-200)
average_linkage = cluster.AgglomerativeClustering(
linkage="average", affinity="cityblock", n_clusters=2,
connectivity=connectivity)
birch = cluster.Birch(n_clusters=2)
clustering_algorithms = [
two_means, affinity_propagation, ms, spectral, ward, average_linkage,
dbscan, birch]
for name, algorithm in zip(clustering_names, clustering_algorithms):
    # predict cluster memberships
    t0 = time.time()
    algorithm.fit(X)
    t1 = time.time()
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)  # np.int was removed in NumPy 1.24
    else:
        y_pred = algorithm.predict(X)
    # plot (only one dataset is used here, so every panel gets a title)
    plt.subplot(4, len(clustering_algorithms), plot_num)
    plt.title(name, size=18)
    plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)
    if hasattr(algorithm, 'cluster_centers_'):
        centers = algorithm.cluster_centers_
        center_colors = colors[:len(centers)]
        plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.xticks(())
    plt.yticks(())
    # annotate each panel with the fit time
    plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
             transform=plt.gca().transAxes, size=15,
             horizontalalignment='right')
    plot_num += 1
plt.show()
Finally, consider looping through several regression, or classification, models in one go, and getting the results for each.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
from sklearn import linear_model
import statsmodels.api as sm
# Note the difference in argument order: statsmodels expects (y, X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model
# Print out the statistics
model.summary()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
regressors = [
LinearRegression(),
SGDRegressor(),
KNeighborsRegressor(),
DecisionTreeRegressor(),
RandomForestRegressor(),
GradientBoostingRegressor(),
TweedieRegressor(),
PoissonRegressor(),
Ridge(),
Lasso()
]
import pandas as pd
# Logging for Visual Comparison
log_cols=["Regressor", "RMSE", "MAE"]
log = pd.DataFrame(columns=log_cols)
for reg in regressors:
    reg.fit(X_train, y_train)
    name = reg.__class__.__name__
    print(reg.score(X_test, y_test))
    y_pred = reg.predict(X_test)
    lr_mse = mean_squared_error(y_test, y_pred)
    lr_rmse = np.sqrt(lr_mse)
    print(name + ' RMSE: %.4f' % lr_rmse)
    lin_mae = mean_absolute_error(y_test, y_pred)
    print(name + ' MAE: %.4f' % lin_mae)
    log_entry = pd.DataFrame([[name, lr_rmse, lin_mae]], columns=log_cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    log = pd.concat([log, log_entry], ignore_index=True)
print("="*30)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_color_codes("muted")
sns.barplot(x='RMSE', y='Regressor', data=log, color="b")
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.datasets import load_iris
iris = load_iris()
iris
# Step 2: Separating the data into dependent and independent variables
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
# Step 3: Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
classifiers = [
GaussianNB(),
MLPClassifier(),
KNeighborsClassifier(),
GaussianProcessClassifier(),
DecisionTreeClassifier(),
RandomForestClassifier(),
AdaBoostClassifier(),
GradientBoostingClassifier(),
QuadraticDiscriminantAnalysis()]
import pandas as pd
# Logging for Visual Comparison
log_cols=["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)
for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    print("="*30)
    print(name)
    print('****Results****')
    test_predictions = clf.predict(X_test)  # predictions on the held-out test set
    acc = accuracy_score(y_test, test_predictions)
    print("Accuracy: {:.4%}".format(acc))
    log_entry = pd.DataFrame([[name, acc*100]], columns=log_cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    log = pd.concat([log, log_entry], ignore_index=True)
print("="*30)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
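The loops above score each model on a single train/test split. A hedged alternative, not taken from the original answer, is to compare models with cross_val_score so that each score is an average over several folds. The choice of models, the 5 folds, and the random_state are assumptions made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# An arbitrary subset of models, chosen only for illustration
models = {
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold accuracy per model; averaging over folds is less sensitive
# to one lucky or unlucky split than a single train/test split
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in models.items()}

# Print the models from best to worst mean accuracy
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```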

AttributeError: 'numpy.ndarray' object has no attribute 'lower' - how to fix it?

The full error is shown below. I am not sure how to fix it. I'm trying to predict the link between gender and aggressiveness in tweets.
(https://i.stack.imgur.com/T4Ual.png)
This is the whole script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# The specific ones we know we are going to use
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB # Allows more than 2 classes
data = pd.read_csv('/work/90301/Individual project/TheClimateChangeTwitterDataset.csv')
#corpus=data['text']
#corpus=text.loc[:,['aggressiveness', 'gender']]
cv=CountVectorizer() #Take some text and turn it into a matrix
X = cv.fit_transform(data.values).toarray()
#x = X['aggressiveness'].values
#y = X['gender'].values
y=data['gender'].values
print(X.shape)
print(y.shape)
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
#Instantiate and train Naive Bayes
classifier = MultinomialNB(fit_prior=True)
classifier.fit(X_train, y_train)
#test model
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(f'Relative accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Accuracy in instances: {accuracy_score(y_test, y_pred, normalize=False)}')
#Infer the label (spam/ham) of a message
aggressiveness=[corpus]
#print(email)
aggressiveness_array = cv.transform(aggressiveness).toarray()
print(classifier.predict(aggressiveness_array))
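No answer is included for this question, but the error message is consistent with CountVectorizer receiving arrays instead of strings: data.values is a 2-D NumPy array, so each "document" handed to the vectorizer is a whole row (an ndarray), and the default preprocessing step calls .lower() on it. Below is a minimal sketch of a likely fix, assuming the tweets live in a column named text. The toy DataFrame is invented for illustration and stands in for the CSV.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the real CSV; column names and rows are assumptions
data = pd.DataFrame({
    "text": ["climate change is a hoax", "we must act on climate now",
             "stop lying about the climate", "great climate march today"],
    "gender": ["male", "female", "male", "female"],
})

cv = CountVectorizer()
# Pass one column of strings, not data.values (a 2-D array of rows)
X = cv.fit_transform(data["text"].astype(str))
y = data["gender"].values

classifier = MultinomialNB().fit(X, y)

# New text must go through the same fitted vectorizer, as a list of strings
new_X = cv.transform(["the climate march was great"])
prediction = classifier.predict(new_X)
print(X.shape, prediction)
```

The same applies to the last lines of the script: cv.transform expects a list of strings, so the text to classify should be passed as something like ["some tweet"] rather than a wrapped DataFrame.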

AttributeError: 'RFECV' object has no attribute 'ranking_'

I tried to get the feature ranking using the following steps:
1. StandardScaler
2. RandomForestClassifier
3. Recursive feature selection
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
Y = data.target
clf = RandomForestClassifier()
estimators = [('standardize' , StandardScaler()),
('rfecv', RFECV(estimator=clf, scoring='accuracy'))]
pipeline = Pipeline(estimators)
ranking_features = pipeline.named_steps['rfecv'].ranking_
print (ranking_features)
AttributeError: 'RFECV' object has no attribute 'ranking_'
Any best practice to do this is welcomed.
The RFECV step has to be fitted to the data before its ranking_ attribute exists. Try running this code:
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
Y = data.target
clf = RandomForestClassifier()
estimators = [('standardize' , StandardScaler()),
('rfecv', RFECV(estimator=clf, scoring='accuracy'))]
# create pipeline
pipeline = Pipeline(estimators)
# fit rfecv to data
rfecv_data = pipeline.named_steps['rfecv'].fit(X, Y)
# get the feature ranking
ranking_features = rfecv_data.ranking_
print (ranking_features)
'Output':
[2 3 1 1]
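Attributes ending in an underscore, such as ranking_, are set by fit, which is why they are missing on a freshly constructed estimator. A small sketch of that behaviour follows. Unlike the snippet above, it fits the whole pipeline, so the RFECV step sees the scaled data. The random_state is an assumption added for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('rfecv', RFECV(estimator=RandomForestClassifier(random_state=0),
                    scoring='accuracy')),
])

# Before fitting, the learned attribute does not exist yet
has_ranking_before = hasattr(pipeline.named_steps['rfecv'], 'ranking_')

# Fitting the whole pipeline also fits the rfecv step, on scaled data
pipeline.fit(X, Y)
has_ranking_after = hasattr(pipeline.named_steps['rfecv'], 'ranking_')
ranking = pipeline.named_steps['rfecv'].ranking_

print(has_ranking_before, has_ranking_after)  # False True
print(ranking)
```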

Feature-selection and prediction

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
I have X and Y data.
data = load_iris()
X = data.data
Y = data.target
I would like to implement RFECV feature selection and prediction using k-fold cross-validation.
Code corrected following the answer by https://stackoverflow.com/users/3374996/vivek-kumar:
clf = RandomForestClassifier()
kf = KFold(n_splits=2, shuffle=True, random_state=0)
estimators = [('standardize' , StandardScaler()),
('clf', clf)]
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)
print ('no. of selected features =', rfecv_data.n_features_)
EDIT (remaining step):
X_new = rfecv.transform(X)
print(X_new.shape)
y_predicts = cross_val_predict(clf, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)
Instead of wrapping StandardScaler and RFECV in the same pipeline, wrap StandardScaler and RandomForestClassifier in a pipeline and pass that pipeline to RFECV as the estimator. This way the scaler is refit on the training part of each cross-validation fold, so no information leaks from the held-out data.
estimators = [('standardize' , StandardScaler()),
('clf', RandomForestClassifier())]
pipeline = Pipeline(estimators)
rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)
Update: About the error 'RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes'
Yes, that's a known issue with scikit-learn's Pipeline. You can look at my other answer here for more details and use the new pipeline I created there.
Define a custom pipeline like this:
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
And use that:
pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)
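On newer scikit-learn releases (0.24 and later, as far as I know), RFE and RFECV accept an importance_getter argument, which may make the custom subclass unnecessary. Here is a minimal sketch, assuming the final pipeline step is named clf. The KFold settings and random_state values are assumptions added for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)

# A plain Pipeline; no subclass exposing feature_importances_ is defined
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=0)),
])

kf = KFold(n_splits=2, shuffle=True, random_state=0)
# The dotted path tells RFECV where to find importances inside the pipeline
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy',
              importance_getter='named_steps.clf.feature_importances_')
rfecv.fit(X, Y)

print('no. of selected features =', rfecv.n_features_)
print('ranking =', rfecv.ranking_)
```

On older releases without importance_getter, the Mypipeline subclass remains the way to go.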
Update 2:
@brute, for your data and code, the algorithm completes within a minute on my PC. This is the complete code I use:
import numpy as np
import glob
from sklearn.utils import resample
files = glob.glob('/home/Downloads/Untitled Folder/*')
outs = []
for fi in files:
    data = np.genfromtxt(fi, delimiter='|', dtype=float)
    # drop rows that contain any NaN
    data = data[~np.isnan(data).any(axis=1)]
    data = resample(data, replace=False, n_samples=1800, random_state=0)
    outs.append(data)
X = np.vstack(outs)
print(X.shape)
Y = np.repeat([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1800)
print(Y.shape)
#from sklearn.utils import shuffle
#X, Y = shuffle(X, Y, random_state=0)
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
clf = RandomForestClassifier()
kf = KFold(n_splits=10, shuffle=True, random_state=0)
estimators = [('standardize' , StandardScaler()),
('clf', RandomForestClassifier())]
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)
print ('no. of selected features =', rfecv_data.n_features_)
Update 3: For cross_val_predict
X_new = rfecv.transform(X)
print(X_new.shape)
# Here change clf to pipeline,
# because RFECV has found features according to scaled data,
# which is not present when you pass clf
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)
Here's how we'll do it:
Fit on the training set
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
data = load_iris()
X = data.data
Y = data.target
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, shuffle=True)
# create model
clf = RandomForestClassifier()
# instantiate K-Fold
kf = KFold(n_splits=10, shuffle=True, random_state=0)
# pipeline estimators
estimators = [('standardize' , StandardScaler()),
('rfecv', RFECV(estimator=clf, cv=kf, scoring='accuracy'))]
# instantiate pipeline
pipeline = Pipeline(estimators)
# fit rfecv to train model
rfecv_model = pipeline.fit(X_train, y_train)
# print number of selected features
print ('no. of selected features =', pipeline.named_steps['rfecv'].n_features_)
# print feature ranking
print ('ranking =', pipeline.named_steps['rfecv'].ranking_)
'Output':
no. of selected features = 3
ranking = [1 2 1 1]
Predict on the test set
# make predictions on the test set
predictions = rfecv_model.predict(X_test)
# evaluate the model performance using accuracy metric
print("Accuracy on test set: ", accuracy_score(y_test, predictions))
'Output':
Accuracy on test set:  0.9736842105263158

ValueError while in SVC

This is a cancer dataset with 10 features and a class.
X=df.iloc[:,1:10].values
y=df.iloc[:,[-1]].values
from sklearn.preprocessing import Imputer
imputer=Imputer(missing_values='NaN',strategy='mean',axis=1)
imputer=imputer.fit(X)
X=imputer.transform(X)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.svm import SVC
classifier=SVC (kernel='rbf',random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(y_test)
When I execute this I get
ValueError: X.shape[1] = 1 should be equal to 9, the number of features at training time
Your error was caused by the following line, where you passed y_test instead of X_test:
classifier.predict(y_test)
Full code:
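import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaces it
from sklearn.preprocessing import Imputer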
from sklearn.svm import SVC
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
X=df.iloc[:,1:10]
y = data.target
imputer=Imputer(strategy='mean',axis=1)
X = imputer.fit_transform(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
clf = SVC(kernel='rbf').fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(clf.score(X_test, y_test))
yields:
0.6842105263157895
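The code above relies on Imputer, which no longer exists in scikit-learn 0.22 and later. A sketch of the same flow for current releases follows, using SimpleImputer as a stand-in (my substitution, not the original answer's code). SimpleImputer always imputes column by column, so there is no equivalent of axis=1. The printed score will likely differ from the figure quoted above, since library defaults have changed since the answer was written.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.iloc[:, 1:10]
y = data.target

# SimpleImputer fills missing values per column; it has no axis argument
X = SimpleImputer(strategy='mean').fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = SVC(kernel='rbf').fit(X_train, y_train)

# predict() takes the features (X_test), never the labels (y_test)
y_pred = clf.predict(X_test)
print(clf.score(X_test, y_test))
```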
