SKLEARN // Combine GridsearchCV with column transform and pipeline - scikit-learn

I am struggling with a machine learning project in which I am trying to combine:
- a sklearn column transformer to apply different transformers to my numerical and categorical features
- a pipeline to apply my different transformers and estimators
- a GridSearchCV to search for the best parameters.
As long as I fill in the parameters of my different transformers manually in my pipeline, the code works perfectly.
But as soon as I try to pass lists of different values to compare in my grid search parameters, I get all kinds of "invalid parameter" error messages.
Here is my code:
First I divide my features into numerical and categorical:
import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)
Then I create 2 different preprocessing pipelines for numerical and categorical features:
numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),OneHotEncoder(handle_unknown='ignore'))
I combined both into another pipeline, set my parameters, and ran my GridSearchCV code:
model=make_pipeline(preprocessor, LinearRegression() )
params={
'columntransformer__numerical_pipeline__knnimputer__n_neighbors':[1,2,3,4,5,6,7]
}
grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv,scoring='r2')
I tried different ways to declare the parameters, but never found the proper one. I always get an "invalid parameter" error message.
Could you please help me understand what went wrong?
Thanks a lot for your support, and take good care!

I am assuming that you defined preprocessor as follows:
preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
In that case you need to change your parameter name to the following (make_pipeline names each step after its lowercased class name, so the preprocessor step is called pipeline):
pipeline__numerical_pipeline__knnimputer__n_neighbors
There are a couple of other problems with the code, though:
- You don't have to call cross_val_score after running GridSearchCV. The fitted search already stores the cross-validation result for each combination of hyperparameters (in cv_results_).
- KNNImputer does not work when your data contains strings, so cat_pipeline has to be applied before numerical_pipeline.
Complete example:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})
y = [1,0,1,1]
numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)
numerical_pipeline= make_pipeline(KNNImputer())
# sparse=False returns a dense array; scikit-learn 1.2+ names this argument sparse_output
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),
                           OneHotEncoder(handle_unknown='ignore', sparse=False))
# cat_pipeline runs first so that KNNImputer only sees numeric data
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                         ('numerical_pipeline',numerical_pipeline)])
model=make_pipeline(preprocessor, LinearRegression() )
params={
    'pipeline__numerical_pipeline__knnimputer__n_neighbors':[1,2]
}
grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=2)
grid.fit(X, y)
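Since the question mentions a column transformer, here is a minimal sketch of how the same search could look with an explicit ColumnTransformer, so that each sub-pipeline only sees its own columns. The step names (preprocessor, num, cat, regressor), the extra visits column and the toy values are assumptions made for illustration; they do not come from the question. Listing model.get_params() is a handy way to find the exact parameter name to put in the grid.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values, only there to make the sketch runnable)
X = pd.DataFrame({
    "city": pd.Series(["London", "London", "Paris", np.nan,
                       "Paris", "London", "Paris", "London"], dtype=object),
    "rating": [5.0, 3.0, 4.0, 5.0, np.nan, 2.0, 4.0, 3.0],
    "visits": [10.0, 8.0, 9.0, 12.0, 7.0, 5.0, 9.0, 6.0],
})
y = [1.0, 0.0, 1.0, 1.0, 0.5, 0.2, 0.9, 0.3]

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

# Each sub-pipeline is applied only to the columns listed for it
preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, ["rating", "visits"]),
    ("cat", cat_pipeline, ["city"]),
])
model = Pipeline([("preprocessor", preprocessor),
                  ("regressor", LinearRegression())])

# Discover the valid parameter names instead of guessing them
knn_keys = [k for k in model.get_params() if k.endswith("n_neighbors")]
print(knn_keys)

params = {"preprocessor__num__knnimputer__n_neighbors": [1, 2]}
grid = GridSearchCV(model, param_grid=params, scoring="r2", cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

The parameter name is simply the chain of step names joined by double underscores: outer pipeline step, transformer name inside the ColumnTransformer, step inside the sub-pipeline, then the argument.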

Related

How to tune quantile_range in RobustScaler in sklearn Pipeline?

pipeline = Pipeline([
    ('scale', RobustScaler(quantile_range=())),
    ('classify', OneVsRestClassifier(SVC()))
],
memory=self.memory)
Given that pipeline, how to tune the quantile_range in RobustScaler using GridSearchCV? The default quantile_range is (25.0, 75.0). Alternatives I want to try are something like (5.0, 95.0), (10.0, 90.0), ..., (25.0, 75.0). How to achieve that?
I guess the params_grid should look like this:
params_grid = [{'scale__quantile_range': ??}]
But I don't know what to put in place of the question marks.
Each entry in the parameter grid must be an iterable of candidate values; here every candidate is itself a (low, high) tuple. Try:
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
pipeline = Pipeline([
('scale', RobustScaler(quantile_range=())),
('classify', OneVsRestClassifier(SVC()))
],
memory=None)
params = {"scale__quantile_range":[(25.0,75.0),(10.0,90.0),(1.0,99.0)]}
grid_cf = GridSearchCV(pipeline, param_grid=params)
X,y = make_classification(1000,10,n_classes=2,random_state=42)
grid_cf.fit(X,y)
grid_cf.best_params_
{'scale__quantile_range': (1.0, 99.0)}
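For the exact series asked about, (5.0, 95.0), (10.0, 90.0), ..., (25.0, 75.0), the candidates can be generated rather than typed out. This is a small sketch under the assumption that symmetric ranges in steps of 5 are wanted; the dataset size and cv=3 are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Symmetric candidate ranges: (5, 95), (10, 90), ..., (25, 75)
quantile_ranges = [(float(q), 100.0 - q) for q in range(5, 30, 5)]

pipeline = Pipeline([
    ("scale", RobustScaler()),
    ("classify", OneVsRestClassifier(SVC())),
])
params_grid = [{"scale__quantile_range": quantile_ranges}]

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
grid = GridSearchCV(pipeline, param_grid=params_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```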

Python 3 and Sklearn: Difficulty to use a NOT-sklearn model as a sklearn model

The code below works. It is just a routine that runs a cross-validation scheme using a linear model already defined in sklearn, and I have no problem with it. My problem is this: if I replace model=linear_model.LinearRegression() with model=RBF('multiquadric') (please see lines 14 and 15 in the __main__ block), it no longer works. So my problem is actually in the class RBF, where I try to mimic a sklearn model.
If I make the replacement described above, I get the following error:
FitFailedWarning)
/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: All arrays must be equal length.
FitFailedWarning)
1) Should I define a score function in the class RBF?
2) How do I do that? I am lost. Since I inherit from BaseEstimator and RegressorMixin, I expected this to be handled internally.
3) Is there something else missing?
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin
class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,x,y):
        self.rbf = Rbf(x, y,function=self.function)
    def predict(self,x):
        return self.rbf(x)
if __name__ == "__main__":
    # Load Data
    targetName='HousePrice'
    data=datasets.load_boston()
    featuresNames=list(data.feature_names)
    featuresData=data.data
    targetData = data.target
    df=pd.DataFrame(featuresData,columns=featuresNames)
    df[targetName]=targetData
    independent_variable_list=featuresNames
    dependent_variable=targetName
    X=df[independent_variable_list].values
    y=np.squeeze(df[[dependent_variable]].values)
    # Model Definition
    model=linear_model.LinearRegression()
    #model=RBF('multiquadric')
    # Cross validation routine
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
Let's look at the documentation of scipy.interpolate.Rbf:
*args : arrays
x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes
So it takes a variable number of arguments, with the last one being the values, which is y in your case. The k-th argument holds the k-th coordinate of all the data points (and likewise for the other arguments x, y, z, …).
Following the documentation, your code should be:
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin
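class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,X,y):
        # Rbf expects one argument per coordinate, followed by the values
        self.rbf = Rbf(*X.T, y,function=self.function)
        return self  # scikit-learn convention: fit returns the estimator
    def predict(self,X):
        return self.rbf(*X.T)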
# Load Data
data=datasets.load_boston()
X = data.data
y = data.target
number_splits=5
score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
scalar = StandardScaler()
model = RBF(function='multiquadric')
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
for score in score_list:
    print(score+':')
    print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
    print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
Output
neg_mean_squared_error:
Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
Test: Mean -23.007377210596463 Standard Error 4.254629143836107
neg_mean_absolute_error:
Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
r2:
Train: Mean 1.0 Standard Error 0.0
Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
Why *X.T: as we saw, each argument corresponds to one axis (coordinate) of all the data points. X has shape (n_samples, n_features), so we transpose it to get one row per feature and then use the * operator to unpack those rows, passing each sub-array as a separate argument to the variable-length function.
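Newer SciPy versions also give Rbf a mode parameter ('1-D' or 'N-D'), but as far as I can tell from the documentation it describes the shape of the values d rather than the coordinates, so the coordinates still have to be passed as separate arguments.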
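To make the *X.T unpacking concrete, here is a minimal sketch on synthetic data. The array sizes and the target y (the row sums of X) are arbitrary assumptions for the illustration, not part of the original problem.

```python
import numpy as np
from scipy.interpolate import Rbf

rng = np.random.default_rng(0)
X = rng.random((20, 3))   # 20 samples, 3 features
y = X.sum(axis=1)         # synthetic target, one value per sample

columns = X.T             # shape (3, 20): one row per feature
# Rbf(*columns, y) is the same call as Rbf(X[:, 0], X[:, 1], X[:, 2], y)
rbf = Rbf(*columns, y, function="multiquadric")

# Prediction needs the same unpacking: one argument per coordinate
pred = rbf(*X.T)
print(columns.shape, pred.shape)
```

With the default settings the interpolant passes through the training nodes, which is consistent with the near-zero training error in the output above.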

XGBoost error: /workspace/src/metric/elementwise_metric.cc:28: Check failed: preds.size() == info.labelsSize() (

I am new to machine learning and am trying to solve the housing prices problem from a Kaggle competition. I am trying to run this code and fit this model, but it outputs an error. Please help and explain, as I am a novice. Thanks in advance.
I tried searching on Google, but the results talk about a multiclass error, which I don't understand, and suggest "mlogloss" or "merror" as the solution.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from learntools.core import *
from xgboost import XGBRegressor
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath',
'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
iowa_model = XGBRegressor(n_estimators=1000,learning_rate=0.05)
iowa_model.fit(train_X, train_y,early_stopping_rounds=5,eval_set=
[(train_X,val_y)],verbose=False)
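You have a typo: eval_set pairs train_X with val_y, so the number of predictions does not match the number of labels. Try: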
iowa_model.fit(train_X, train_y,early_stopping_rounds=5,eval_set= [(val_X,val_y)],verbose=False)
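The size mismatch behind the failed check can be seen without XGBoost at all. The sketch below uses made-up arrays in place of the Kaggle data and only looks at the lengths produced by train_test_split with its default 75/25 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# (train_X, val_y) mixes the two splits, so the row counts differ
print(len(train_X), len(val_y))
# (val_X, val_y) come from the same split, so they line up
print(len(val_X), len(val_y))
```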

LabelEncoder in sklearn_pandas mapper with pipeline after cross_val_score returns error

I have a strange error that I cannot understand.
I have this data:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn_pandas import DataFrameMapper
test = pd.DataFrame({"a": ['a','c','-','9','c','a','a','c','b','i','c','r'],
                     "b": [0,0,1,0,0,1, 0,0,1,0,0,1] })
Then I make a DataFrameMapper()
Mapper = DataFrameMapper([ ('a', LabelEncoder()) ])
Then a Pipeline()
pipeline = Pipeline([('featurize', Mapper),('forest',RandomForestClassifier())])
X = test[test.columns.drop('b')]
y = test['b']
model = pipeline.fit(X = X, y = y)
Everything works fine; I can predict with this model.
But when I run cross_val_score
cross_val_score(pipeline, X, y, scoring='accuracy', cv=2)
it returns an error:
a: y contains new labels: ['-' '9']
How can I avoid this, and why does it work this way? I thought that LabelEncoder is fitted on the data first and cross-validation happens afterwards. I have tried fitting the encoder first
enc = LabelEncoder()
enc.fit(test['a'])
on the entire column and then inserting it in the Mapper, but that doesn't work either.
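No answer is recorded for this question here, so the following is only a possible explanation with a sketch, not a confirmed fix. cross_val_score clones the pipeline and refits it on each training fold, so the encoder only learns the labels present in that fold, and '-' and '9' can end up only in the held-out part. One way around it, assuming it is acceptable to swap LabelEncoder and DataFrameMapper for scikit-learn's own OrdinalEncoder and ColumnTransformer, is an encoder that tolerates unseen categories:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

test = pd.DataFrame({"a": ['a', 'c', '-', '9', 'c', 'a', 'a', 'c', 'b', 'i', 'c', 'r'],
                     "b": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]})
X = test[["a"]]
y = test["b"]

# Categories not seen in the training fold are encoded as -1 instead of raising
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
featurize = ColumnTransformer([("a", encoder, ["a"])])

pipeline = Pipeline([
    ("featurize", featurize),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])

scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=2)
print(scores)
```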

sklearn GridSearchCV, SelectKBest, and SVM

I am trying to make a classifier that uses feature selection via a function I have written, golub, which returns two np arrays as SelectKBest requires. I want to link this to an SVM classifier with a linear kernel and optimize over the possible combinations of k and C. However, what I have tried so far has not succeeded and I am not sure why. The code is as follows:
import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from golub_mod import golub
class SVM_golub_linear:
    def __init__(self,X,y):
        self.X=X
        self.y=y
    def Golub_SVM(self):
        X=self.X
        y=self.y
        kbest=SelectKBest(golub,k=1)
        k_vals=np.linspace(100,1000,10,dtype=int)
        k_vals=k_vals.tolist()
        c_vals=[0.00001,0.0001,0.001,0.01,.1,1,10,100,1000]
        clf=svm.LinearSVC(penalty='l2')
        steps=[('feature_selection',kbest),('svm_linear',clf)]
        pipeline=make_pipeline(steps)
        params=dict(feature_selection__k=k_vals,
                    svm_linear__C=c_vals)
        best_model=GridSearchCV(pipeline,param_grid=params)
        self.model=best_model.fit(X,y)
        print(best_model.best_params_)
    def svmpredict(self,X_n):
        y_vals=self.model.predict(X_n)
        return y_vals
when I try to run this:
model=SVM_golub_linear(X,y)
model.Golub_SVM()
I get the following error:
TypeError: Last step of chain should implement fit '[('feature_selection',
SelectKBest(k=1, score_func=<function golub at 0x105f2c398>)), ('svm_linear', LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0))]' (type <type 'list'>) doesn't)
I do not understand this because LinearSVC does have a fit method. Thanks
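In the code above, make_pipeline expects the estimators themselves as separate positional arguments and names the steps automatically, so passing it the list of (name, estimator) tuples makes it treat that whole list as a single final step. If you replace
pipeline=make_pipeline(steps)
with
pipeline=Pipeline(steps)
the code works as is.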
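The difference between the two constructors can be sketched as follows. The golub scoring function is not available here, so f_classif stands in for it, and the synthetic data and small grids are likewise assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=5, random_state=0)

kbest = SelectKBest(f_classif, k=1)   # f_classif is a stand-in for golub
clf = LinearSVC(penalty="l2")

# Pipeline takes a list of (name, estimator) tuples
named = Pipeline([("feature_selection", kbest), ("svm_linear", clf)])

# make_pipeline takes the estimators themselves and derives the names
auto = make_pipeline(kbest, clf)
print(list(auto.named_steps))

# The grid keys must use the step names of the pipeline being searched
params = {"feature_selection__k": [5, 10], "svm_linear__C": [0.1, 1.0]}
search = GridSearchCV(named, param_grid=params, cv=3)
search.fit(X, y)
print(search.best_params_)
```

With make_pipeline the grid keys would instead have to start with selectkbest__ and linearsvc__, which is why keeping the explicit names via Pipeline is the smaller change to the original code.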
