DataType of InputField is double although in the PMMLPipeline it is string - scikit-learn

I am exporting a PMMLPipeline with a categorical string feature day_of_week as a PMML file. When I open the file in Java and list the InputFields I see that the data type of day_of_week field is double:
InputField{name=day_of_week, fieldName=day_of_week, displayName=null, dataType=double, opType=categorical}
Hence when I evaluate an input I get the error:
org.jpmml.evaluator.InvalidResultException: Field "day_of_week" cannot accept user input value "tuesday"
On the Python side the pipeline works with a string column:
data = pd.DataFrame(data=[{"age": 10, "day_of_week": "tuesday"}])
y = trained_model.predict(X=data)
Minimal example for creating the PMML file:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
if __name__ == '__main__':
    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }
    data = pd.DataFrame(data_dict, columns=list(data_dict))
    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])
    categorical_features = ['day_of_week']
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])
    pipeline = PMMLPipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])
    X = data.drop(labels=['y'], axis=1)
    y = data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)
    trained_model = pipeline.fit(X=X_train, y=y_train)
    sklearn2pmml(pipeline=pipeline, pmml='RandomForestRegressor2.pmml', with_repr=True)
EDIT:
sklearn2pmml creates a PMML file with a DataDictionary whose DataField "day_of_week" has dataType="double". I think it should be "string". Do I have to set the dataType somewhere to correct this?
<DataDictionary>
<DataField name="day_of_week" optype="categorical" dataType="double">

You can assist SkLearn2PMML by providing "feature type hints" using sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain decorators (see here for more details).
In the current case, you should prepend a CategoricalDomain step to the pipeline that deals with categorical features:
from sklearn2pmml.decoration import CategoricalDomain
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype = str)),
    ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))
])
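With the CategoricalDomain step in place, the exported DataDictionary should declare the field as a string. A sketch of the expected output (not verified against your exact pipeline):
<DataField name="day_of_week" optype="categorical" dataType="string">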

Thanks for your reply @user1808924.
The given solution works. Now, to add to this answer, I would like to note that CategoricalDomain works for a single feature only.
Problem:
So, when you use it in a pipeline like:
# pipeline creation
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype = str)),
    ('onehot', OrdinalEncoder())
])
# fit and transform of `df` with 3 features
categorical_transformer.fit_transform(df)
### >>> ERROR: Expected 1d array, got 2d array of shape (1000, 3)
This means you will need multiple CategoricalDomains in there.
NOTE: We often use it in the ColumnTransformer as well. You need to know beforehand how many categorical features there are.
What can we do?
We will simply use the MultiDomain from the same library.
from sklearn.preprocessing import OrdinalEncoder
from sklearn2pmml.decoration import MultiDomain
categorical_transformer = Pipeline(steps=[
    ('domain', MultiDomain([CategoricalDomain(dtype = str) for _ in range(3)])),
    ('onehot', OrdinalEncoder())
])
Note that 3 is the number of categorical columns here; there is one CategoricalDomain per categorical column.
Then performing the transformation will work.
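To make the note above concrete, here is a minimal self-contained sketch (toy dataframe and column names are assumed) of the MultiDomain fix inside a ColumnTransformer:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn2pmml.decoration import CategoricalDomain, MultiDomain

# Toy data: three categorical columns (hypothetical names)
df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'size': ['s', 'm', 'l'],
    'day': ['mon', 'tue', 'wed'],
})
categorical_features = ['color', 'size', 'day']
# One CategoricalDomain per categorical column, wrapped in a MultiDomain
categorical_transformer = Pipeline(steps=[
    ('domain', MultiDomain([CategoricalDomain(dtype=str) for _ in categorical_features])),
    ('onehot', OrdinalEncoder())
])
preprocessor = ColumnTransformer(transformers=[
    ('categorical', categorical_transformer, categorical_features)])
print(preprocessor.fit_transform(df))  # shape (3, 3), ordinal-encoded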

Related

Issues with One Hot Encoding for model with values not in training data

I would like to use One Hot Encoding for my simple model, yet it seems to trigger an error no matter how I set it up. First, One Hot Encoding is not converting strings to floats even though I have version 1.0.2 of sklearn. Now the issue is that the values in my training data do not cover the same categories as the test data: training only has 2 values, testing has all three. How do I fix that? The exact error is "The truth value of a Series is ambiguous"; my other attempt instead raised an error asking me to reshape the data.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
    transformers=[('ohc', ohc, [0])],
    remainder = 'passthrough')
model = lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
                            ('model', model)])
params = {'model__learning_rate': [0.1],
          'model__n_estimators': [2]}
lgbm_gs = GridSearchCV(
    estimator = mymodel, param_grid=params, n_jobs = -1,
    cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X,y)
The issue should be related to the fact that you're passing categories as a list rather than as a list of array-likes (e.g. a list of lists), as the doc states. Therefore, the following adjustment should fix it.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']] # observe you were also misspelling categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model = lgbm.LGBMClassifier()
mymodel = Pipeline(steps=[('preprocessor', pp),
                          ('model', model)])
params = {'model__learning_rate': [0.1], 'model__n_estimators': [2]}
lgbm_gs = GridSearchCV(
    estimator=mymodel, param_grid=params, n_jobs=-1,
    cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
As a further remark, observe what the guide suggests when dealing with cases where test data has categories that cannot be found in the training set.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
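A quick illustration of that quoted behaviour (a minimal sketch with toy data; sparse=False is used only to make the output readable):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['apple'], ['banana']])
ohc = OneHotEncoder(handle_unknown='ignore', sparse=False)
ohc.fit(X_train)
# An unseen category encodes to an all-zeros row instead of raising
print(ohc.transform(np.array([['pineapple']])))  # [[0. 0.]]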
Finally, you can observe that the attribute categories_ (which specifies the categories of each feature determined during fitting) is a list of arrays, too (a single array here, as you're one-hot encoding one column only). Example with categories='auto':
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with your custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]

Error saving model in sklearn2pmml using VotingClassifier

I'm new to programming and I'm having a little trouble saving a model in PMML. I have a database and I need to perform attribute selection, then use majority voting, and finally save the model in PMML. The majority voting part even works, but when I save the model on the last line using sklearn2pmml it gives an error.
from pandas import read_csv
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.metrics import accuracy_score
from sklearn2pmml import make_pmml_pipeline
from sklearn2pmml import sklearn2pmml
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn.ensemble import VotingClassifier
import joblib
url = 'D:/treinamento.CSV'
df = read_csv(url, header=None)
data = df.values
url_test = 'D:/TESTE.CSV'
df_test = read_csv(url_test, header=None)
data_test = df_test.values
X = data[:, :-1]
y = data_test[:, -1]
X_train = data[:, :-1]
X_test = data_test[:, :-1]
y_train = data[:, -1]
y_test = y
#features selection
features1 = [2, 5, 7]
features2 = [0, 1, 4, 5, 7]
features3 = [0, 1, 4, 5, 6]
features4 = [1, 4]
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor1 = ColumnTransformer(transformers=[('numerical', numeric_transformer, features1)])
preprocessor2 = ColumnTransformer(transformers=[('numerical', numeric_transformer, features2)])
preprocessor3 = ColumnTransformer(transformers=[('numerical', numeric_transformer, features3)])
preprocessor4 = ColumnTransformer(transformers=[('numerical', numeric_transformer, features4)])
pipe1 = PMMLPipeline(steps=[('preprocessor', preprocessor1),('classifier', DecisionTreeClassifier(min_samples_split = 2))])
pipe2 = PMMLPipeline(steps=[('preprocessor', preprocessor2),('classifier', DecisionTreeClassifier(min_samples_split = 2))])
pipe3 = PMMLPipeline(steps=[('preprocessor', preprocessor3),('classifier', DecisionTreeClassifier(min_samples_split = 2))])
pipe4 = PMMLPipeline(steps=[('preprocessor', preprocessor4),('classifier', DecisionTreeClassifier(min_samples_split = 2))])
eclf = VotingClassifier(estimators=[
    ('pipe1', PMMLPipeline(steps=[('preprocessor', preprocessor1), ('classifier', DecisionTreeClassifier(min_samples_split = 2))])),
    ('pipe2', PMMLPipeline(steps=[('preprocessor', preprocessor2), ('classifier', DecisionTreeClassifier(min_samples_split = 2))])),
    ('pipe3', PMMLPipeline(steps=[('preprocessor', preprocessor3), ('classifier', DecisionTreeClassifier(min_samples_split = 2))])),
    ('pipe4', PMMLPipeline(steps=[('preprocessor', preprocessor4), ('classifier', DecisionTreeClassifier(min_samples_split = 2))]))],
    voting='hard', weights=[1,1,1,1])
eclf.fit(X_train, y_train)
yhat = eclf.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy * 100))
sklearn2pmml(eclf, "D:/Mestrado/ARTIGO DRC/dados_pos_revisao/cross validation - dados reavaliados/4 revisao/5 FOLDS/1 FOLD/eclf.pmml", with_repr = True)
Code error
65 sklearn2pmml(eclf, "D:/mest/eclf.pmml", with_repr = True)
~\anaconda3\lib\site-packages\sklearn2pmml\__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug, java_encoding)
222 print("{0}: {1}".format(java_version[0], java_version[1]))
223 if not isinstance(pipeline, PMMLPipeline):
--> 224 raise TypeError("The pipeline object is not an instance of " + PMMLPipeline.__name__ + ". Use the 'sklearn2pmml.make_pmml_pipeline(obj)' utility function to translate a regular Scikit-Learn estimator or pipeline to a PMML pipeline")
225 estimator = pipeline._final_estimator
226 cmd = ["java", "-cp", os.pathsep.join(_classpath(user_classpath)), "org.jpmml.sklearn.Main"]
TypeError: The pipeline object is not an instance of PMMLPipeline. Use the 'sklearn2pmml.make_pmml_pipeline(obj)' utility function to translate a regular Scikit-Learn estimator or pipeline to a PMML pipeline
The pipeline object is not an instance of PMMLPipeline
Did you read the SkLearn2PMML error message or not? Probably not, because it clearly states what the issue is!
You're using the PMMLPipeline class in completely wrong places. It should be used only as the topmost wrapper around the VotingClassifier estimator.
Please reorganize your code like this:
pipeline = PMMLPipeline([
    ("classifier", VotingClassifier([
        ("pipe1", Pipeline(...)),
        ("pipe2", Pipeline(...)),
        ("pipe3", Pipeline(...))
    ]))
])
sklearn2pmml(pipeline, "pipeline.pmml")
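A fuller sketch of that reorganization, reusing preprocessor1 through preprocessor4 and X_train/y_train from the question (plain sklearn Pipelines inside the ensemble, PMMLPipeline only at the top):
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

preprocessors = [preprocessor1, preprocessor2, preprocessor3, preprocessor4]
estimators = [('pipe%d' % (i + 1),
               Pipeline(steps=[('preprocessor', prep),
                               ('classifier', DecisionTreeClassifier(min_samples_split=2))]))
              for i, prep in enumerate(preprocessors)]
pipeline = PMMLPipeline([
    ('classifier', VotingClassifier(estimators=estimators, voting='hard', weights=[1, 1, 1, 1]))
])
pipeline.fit(X_train, y_train)
sklearn2pmml(pipeline, 'eclf.pmml', with_repr=True)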

TypeError: __init__() got an unexpected keyword argument 'categorical_features'

Spyder (Python 3.7)
I am facing the following error here. I have already updated all libraries from the Anaconda prompt, but I can't find a solution to the problem.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):
File "<ipython-input-4-05deb1f02719>", line 2, in <module>
onehotencoder = OneHotEncoder(categorical_features = [1])
TypeError: __init__() got an unexpected keyword argument 'categorical_features'
So based on your code, you'd have to:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Country column
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
# Male/Female
labelencoder_X = LabelEncoder()
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])
Notice how the first LabelEncoder was removed; you no longer need to apply both a label encoder and a one hot encoder on the column.
(I've kinda assumed your example came from the ML Udemy course, and the first column was a list of countries, while the second one a male/female binary choice)
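To see why the LabelEncoder pass is no longer needed, here is a minimal sketch (a toy country column is assumed) showing that OneHotEncoder handles string categories directly:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['France'], ['Spain'], ['Germany']])
# String input is one-hot encoded without a prior integer-encoding step
print(OneHotEncoder(sparse=False).fit_transform(X))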
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype=str)
The latest builds of the sklearn library removed the categorical_features parameter from the OneHotEncoder class, so it is advised to use the ColumnTransformer class for categorical datasets. Refer to sklearn's official documentation for further clarification.
According to the documentation this is the __init__ line:
class sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
As you can see, __init__ does not accept the parameter categorical_features.
You have a categories flag:
categories : 'auto' or a list of array-like, default='auto'
Categories (unique values) per feature:
'auto' : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
Attributes:
categories_ : list of arrays
The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x_1 = LabelEncoder()
X[:, 2] = label_encoder_x_1.fit_transform(X[:, 2])
transformer = ColumnTransformer(
    transformers=[
        ("OneHot",        # Just a name
         OneHotEncoder(), # The transformer class
         [1]              # The column(s) to be applied on.
         )
    ],
    remainder='passthrough' # do not apply anything to the remaining columns
)
X = transformer.fit_transform(X.tolist())
X = X.astype('float64')
Working like a charm :)
Assuming this is the problem from the ML course on Udemy, here is the complete code. I replaced label encoder 1 with a ColumnTransformer, as suggested by Antoine Jaussoin in the comment above.
Categorical Data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("Geography", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
Your Gender column will have index 4 now
labelencoder_x_2=LabelEncoder()
X[:,4]=labelencoder_x_2.fit_transform(X[:,4])
To avoid the dummy variable trap:
X=X[:, 1:]
You need to call another sklearn class, which will eliminate one column to avoid the dummy variable trap.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer # Here is the one
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
#onehotencoder = OneHotEncoder(categorical_features = [1]) Not this one
# use this instead
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
X = X[:, 1:]
Happy Helping!!!
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("Geography",OneHotEncoder(),[1])], remainder= 'passthrough')
X = ct.fit_transform(X)
labelencoder_X2 = LabelEncoder()
X[:, 4] = labelencoder_X2.fit_transform(X[:, 4])
X = X[: , 1:]
X = np.array(X, dtype=float)
Just add an extra line to convert it from an array of objects.
Replace the following code:
# onehotencoder = OneHotEncoder(categorical_features = [1])
# X = onehotencoder.fit_transform(X).toarray()
# X = X[:, 1:]
with the following chunk, and your code should work:
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder = 'passthrough')
X = np.array(columnTransformer.fit_transform(X), dtype = np.float64)
X = X[:, 1:]
Assuming you're learning Deep Learning from Udemy.
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# remove categorical_features, it works 100% perfectly
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]
Here is one more extension for OneHotEncoder, for when X has a lot of columns. Instead, use:
ct = ColumnTransformer([("encoder", OneHotEncoder(), list(categorical_features))], remainder = 'passthrough')
X = ct.fit_transform(X)
Another solution, including the conversion of the X object from an array of objects to float64:
import numpy as np
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=float)
one_hot_encode = OneHotEncoder(categorical_features=[0]) works in scikit-learn 0.20.3, and the parameter is removed from scikit-learn 0.24.2 (the versions I checked).
Either downgrade your scikit-learn version,
or use:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
"""2 classes- Known/unknown Face"""
ct = ColumnTransformer([("Faces", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
"""Country column"""
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)

How to make a GridSearchCV with a proper FunctionTransformer in a pipeline?

I'm trying to make a Pipeline with GridSearchCV to filter data (with iForest) and perform a regression with StandardScaler+MLPRegressor.
I made a FunctionTransformer to include my iForest filter in the pipeline. I also define a parameter grid for the iForest filter (using the kw_args argument).
All seems OK, but when I run the fit, nothing happens... No error message. Nothing.
Afterwards, when I try to predict, I get the message: "This RandomizedSearchCV instance is not fitted yet"
from scipy.stats import randint as sp_randint, uniform as sp_rand
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

#Definition of the function auto_filter using the iForest algo
def auto_filter(DF, conta=0.1):
    #iForest made on the DF dataframe
    iforest = IsolationForest(behaviour='new', n_estimators=300, max_samples='auto', contamination=conta)
    iforest = iforest.fit(DF)
    # The DF (dataframe in input) is filtered taking into account only the inlier observations
    data_filtered = DF[iforest.predict(DF) == 1]
    # Only few variables are kept for the next step (regression by MLPRegressor)
    # this function delivers X_filtered and y
    X_filtered = data_filtered[['SessionTotalTime','AverageHR','MaxHR','MinHR','EETotal','EECH','EEFat','TRIMP','BeatByBeatRMSSD','BeatByBeatSD','HFAverage','LFAverage','LFHFRatio','Weight']]
    y = data_filtered['MaxVO2']
    return (X_filtered, y)

#Pipeline definition ('auto_filter' --> 'scaler' --> 'MLPRegressor')
pipeline_steps = [('auto_filter', FunctionTransformer(auto_filter)), ('scaler', StandardScaler()), ('MLPR', MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000))]
#Grid search definition with different values of 'conta' for the first stage of the pipeline ('auto_filter')
parameters = {'auto_filter__kw_args': [{'conta': 0.1}, {'conta': 0.2}, {'conta': 0.3}], 'MLPR__hidden_layer_sizes':[(sp_randint.rvs(1, nb_features, 1),), (sp_randint.rvs(1, nb_features, 1), sp_randint.rvs(1, nb_features, 1))], 'MLPR__alpha':sp_rand.rvs(0, 1, 1)}
pipeline = Pipeline(pipeline_steps)
estimator = RandomizedSearchCV(pipeline, parameters, cv=5, n_iter=10)
estimator.fit(X_train, y_train)
You can try to run the steps manually one by one to find the problem:
auto_filter_transformer = FunctionTransformer(auto_filter)
X_train = auto_filter_transformer.fit_transform(X_train)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
MLPR = MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000)
MLPR.fit(X_train, y_train)
If each of the steps works fine, build a pipeline. Check the pipeline. If it works fine, try to use RandomizedSearchCV.
The func parameter of FunctionTransformer should be a callable that accepts the same arguments as the transform method (an array-like X of shape (n_samples, n_features) and kwargs for func) and returns a transformed X of the same shape. Your function auto_filter doesn't meet these requirements.
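For contrast, here is a minimal FunctionTransformer whose func does satisfy that contract (a sketch with toy data):
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# func maps X of shape (n_samples, n_features) to a transformed X of the same shape
log_transformer = FunctionTransformer(np.log1p)
print(log_transformer.fit_transform(np.array([[0.0, 1.0], [2.0, 3.0]])))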
Additionally, anomaly/outlier detection techniques from scikit-learn cannot be
used as intermediate steps in scikit-learn pipelines since a pipeline assembles
one or more transformers and an optional final estimator. IsolationForest or,
say, OneClassSVM is not a transformer: it implements fit and predict.
Thus, a possible solution is to cut off possible outliers separately and build
a pipeline composing of transformers and a regressor:
>>> import warnings
>>> from sklearn.exceptions import ConvergenceWarning
>>> warnings.filterwarnings(category=ConvergenceWarning, action='ignore')
>>> import numpy as np
>>> from scipy import stats
>>> from sklearn.datasets import make_regression
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.neural_network import MLPRegressor
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_regression(n_samples=50, n_features=2, n_informative=2)
>>> detect = IsolationForest(contamination=0.1, behaviour='new')
>>> inliers_mask = detect.fit_predict(X) == 1
>>> pipe = Pipeline([('scale', StandardScaler()),
... ('estimate', MLPRegressor(max_iter=500, tol=1e-5))])
>>> param_distributions = dict(estimate__alpha=stats.uniform(0, 0.1))
>>> search = RandomizedSearchCV(pipe, param_distributions,
... n_iter=2, cv=3, iid=True)
>>> search = search.fit(X[inliers_mask], y[inliers_mask])
The problem is that you won't be able to optimize the hyperparameters of
IsolationForest. One way to handle it is to define hyperparameter space
for the forest, sample hyperparameters with ParameterSampler or
ParameterGrid, predict inliers and fit randomized search:
>>> from sklearn.model_selection import ParameterGrid
>>> forest_param_dict = dict(contamination=[0.1, 0.15, 0.2])
>>> forest_param_grid = ParameterGrid(forest_param_dict)
>>> for sample in forest_param_grid:
... detect = detect.set_params(contamination=sample['contamination'])
... inliers_mask = detect.fit_predict(X) == 1
... search.fit(X[inliers_mask], y[inliers_mask])
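Since each call to search.fit overwrites the previous search results, a small extension (a sketch) is to record the best score per forest setting so the samples stay comparable:
>>> results = []
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search = search.fit(X[inliers_mask], y[inliers_mask])
...     results.append((sample['contamination'], search.best_score_, search.best_params_))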

Sklearn's SimpleImputer doesn't work in a pipeline?

I have a pandas dataframe that has some NaN values in a particular column:
1291 NaN
1841 NaN
2049 NaN
Name: some column, dtype: float64
And I have made the following pipeline in order to deal with it:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
scaler = StandardScaler(with_mean = True)
imputer = SimpleImputer(strategy = 'median')
logistic = LogisticRegression()
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])
Now when I pass this pipeline to a RandomizedSearchCV, I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
It's actually quite a bit longer than that -- I can post the entire error in an edit if necessary. Anyway, I am quite sure that this column is the only column that contains NaNs. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works just fine in my RandomizedSearchCV. I checked the documentation, but it seems that SimpleImputer is supposed to behave in (nearly) the exact same way as Imputer. What is the difference in behavior? How do I use an imputer in my pipeline without using the deprecated Imputer?
SimpleImputer in make_pipeline
# ColumnSelector below is assumed to be a custom/third-party helper transformer (it is not part of sklearn)
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            OneHotEncoder(sparse=False)
        ))
    ])
)
SimpleImputer in Pipeline
# TypeSelector is likewise assumed to be a custom helper transformer
('features', FeatureUnion([
    ('Numeric Columns', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))
I had the same issue but this addressed it:
imputer = SimpleImputer(strategy = 'median', fill_value = 0)
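Note that fill_value only takes effect with strategy='constant', so the median strategy alone is what matters there. For reference, a minimal self-contained sketch (toy data assumed) of SimpleImputer running inside a pipeline like the question's:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([0, 0, 1, 1])
pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler()),
                 ('logistic', LogisticRegression())])
pipe.fit(X, y)  # no "Input contains NaN" error: the imputer runs before the scaler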
