I have an imbalanced classification problem and I am using make_pipeline from imblearn.
The steps are the following:
kf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
params = {
'max_depth': [2,3,5],
# 'max_features':['auto', 'sqrt', 'log2'],
# 'min_samples_leaf': [5,10,20,50,100,200,300],
'n_estimators': [10,25,30,50]
# 'bootstrap': [True, False]
}
from imblearn.pipeline import make_pipeline
imba_pipeline = make_pipeline(SMOTE(random_state = 42), RobustScaler(), RandomForestClassifier(random_state=42))
imba_pipeline
out:Pipeline(steps=[('smote', SMOTE(random_state=42)),
('robustscaler', RobustScaler()),
('randomforestclassifier',
RandomForestClassifier(random_state=42))])
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
return_train_score=True, n_jobs=-1, verbose=2)
grid_imba.fit(X_train, y_train)
Everything goes fine and I reach the end of my problem (i.e. I can see the classification report).
However, when I try to look inside the black box with eli5 via eli5.explain_weights(imba_pipeline),
I get the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=42)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't
I know that this is a common problem and I have read the related questions, but I am confused because the problem occurs after the end of my classification procedure.
Any suggestions?
Your pipeline has two fitted steps besides the scaler: the SMOTE resampler and the random forest. This seems to confuse eli5, which assumes that only the last step of a pipeline is a fitted estimator and every intermediate step is a transformer. To get the weight explanation of the random forest you could call eli5 on that step of the pipeline only:
from eli5 import explain_weights
explain_weights(imba_pipeline['randomforestclassifier'])
provided the pipeline itself is fitted; since in your code you fitted the grid search instead,
explain_weights(grid_imba.best_estimator_['randomforestclassifier'])
would be more appropriate.
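If indexing the pipeline by step name is not available in your scikit-learn version, named_steps works as well, and passing feature names makes the output easier to read. A minimal sketch, assuming X_train is a pandas DataFrame (otherwise supply your own list of names):
rf = grid_imba.best_estimator_.named_steps['randomforestclassifier']
explain_weights(rf, feature_names=list(X_train.columns))  # X_train as a DataFrame is an assumption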
Just wanted to point out that SMOTE generally doesn't improve prediction quality. See https://arxiv.org/abs/2201.08528
Related
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
import eli5
from eli5.sklearn import PermutationImportance
model = Sequential()
model.add(LSTM(units=30,return_sequences= True, input_shape=(X.shape[1],421)))
model.add(Dropout(rate=0.2))
model.add(LSTM(units=30, return_sequences=True))
model.add(LSTM(units=30))
model.add(Dense(units=1, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='adam')  # compile step assumed; not shown in the original post
perm = PermutationImportance(model, scoring='accuracy', random_state=1).fit(X, y, epochs=500, batch_size=8)
eli5.show_weights(perm, feature_names = X.columns.tolist())
I am running an LSTM just to see the feature importance of my dataset containing 400+ features. I used the Keras scikit-learn wrapper to use eli5's PermutationImportance function. But the code is returning
ValueError: Found array with dim 3. Estimator expected <= 2.
The code runs smoothly if I use model.fit(), but I can't debug the error from the permutation importance. Does anyone know what is wrong?
eli5's scikit-learn implementation of permutation importance can only process 2D arrays, while Keras' LSTM layers require 3D arrays. This error is a known issue, but there appears to be no solution yet.
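To make the mismatch concrete, a minimal sketch with made-up sizes (the 421 matches the feature dimension in your model):
import numpy as np
X_2d = np.zeros((100, 421))      # the tabular shape eli5's PermutationImportance can validate
X_3d = np.zeros((100, 10, 421))  # the (samples, timesteps, features) shape a Keras LSTM layer expects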
I understand this does not really answer your question of getting eli5 to work with LSTM (because it currently can't), but I encountered the same problem and used another library called SHAP to get the feature importance of my LSTM model. Here is some of my code to help you get started:
import shap
DE = shap.DeepExplainer(model, X_train) # X_train is 3d numpy.ndarray
shap_values = DE.shap_values(X_validate_np, check_additivity=False) # X_validate is 3d numpy.ndarray
shap.initjs()
shap.summary_plot(
    shap_values[0],
    X_validate,
    feature_names=list_of_your_columns_here,
    max_display=50,
    plot_type='bar')
Here is an example of the graph which you can get:
Hope this helps.
I am working in scikit-learn and I am trying to tune my XGBoost model.
I attempted nested cross-validation, using a pipeline to rescale the training folds (to avoid data leakage and overfitting), GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler  # import added; needed for the pipeline below
std_scaling = StandardScaler()
algo = XGBClassifier()
steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]
pipeline = Pipeline(steps)
parameters = {'algo__min_child_weight': [1, 2],
'algo__subsample': [0.6, 0.9],
'algo__max_depth': [4, 6],
'algo__gamma': [0.1, 0.2],
'algo__learning_rate': [0.05, 0.5, 0.3]}
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
clf_auc = GridSearchCV(pipeline, cv = cv1, param_grid = parameters, scoring = 'roc_auc', n_jobs=-1, return_train_score=False)
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv = cv1, scoring = 'roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline does it make sense to include the X_train in the cross_val_score or should I use a standardized form of the X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a more high-level score: "under the process of reasonable model selection, how well will my selected model generalize".
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit the inner estimator (the grid search) over the entire training data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalize on new data?", the answer is the score you got during your nested CV, which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
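Putting it together, a minimal sketch of the whole flow (variable names follow your code; X_train and y_train are assumed to be defined):
# outer CV: estimates how well the whole selection procedure generalizes
nested_scores = cross_val_score(clf_auc, X_train, y_train, cv=cv1, scoring='roc_auc')
print(nested_scores.mean())

# final model: run the inner grid search once over all the training data
clf_auc.fit(X_train, y_train)
print(clf_auc.best_params_)            # hyper-parameters chosen by the inner CV
final_model = clf_auc.best_estimator_  # fitted pipeline (scaler + XGBClassifier), ready for prediction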
from sklearn.preprocessing import PolynomialFeatures
polyreg = PolynomialFeatures(degree = 4)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(polyreg, param_grid, cv = 5)
grid_search_polyreg.fit(x_train, y_train)
grid_search_polyreg.score(x_test, y_test)
print("Best Parameters for polynomial regression: {}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression: {:.2f}".format(grid_search_polyreg.best_score_))
TypeError: If no scoring is specified, the estimator passed should
have a 'score' method. The estimator PolynomialFeatures(degree=4,
include_bias=True, interaction_only=False) does not.
1) I understand that alpha is not a parameter of PolynomialFeatures. But when I tried to remove alpha and fit the data, it still did not work.
2) Does that mean that I am not supposed to use grid search to get scores for KNN regression, linear and kernel SVM?
I am new to python and any suggestion is much appreciated. Thanks in advance.
sklearn.preprocessing.PolynomialFeatures() doesn't have a scoring function. It's not actually an estimator or machine learning model, it just transforms a matrix. You can have it as part of your pipeline and test its parameters, but you have to pass an actual estimator with a scoring function to GridSearchCV.
Fitting to data means something different for transformers than for estimators; only in the latter case does it mean "train".
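For example, a minimal sketch that assumes the alpha you were tuning belongs to a Ridge regression stacked on top of the polynomial features (swap in whichever estimator you actually intended):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(PolynomialFeatures(), Ridge())
param_grid = {
    'polynomialfeatures__degree': [2, 3, 4],         # transformer parameters can be searched too
    'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100],  # estimator parameter
}
grid_search_polyreg = GridSearchCV(pipe, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)
print(grid_search_polyreg.best_params_)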
Passing the sample_weight parameter to GridSearchCV raises an error due to an incorrect shape. My suspicion is that cross-validation is not able to split the sample_weight array in accordance with the dataset splits.
First part: Using sample_weight as a model parameter works beautifully
Let's consider a simple example, first without GridSearch:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'
x = pd.read_csv(dataURL, usecols=["x"]).x
y = pd.read_csv(dataURL, usecols=["y"]).y
occurrences = pd.read_csv(dataURL, usecols=["Occurrences"]).Occurrences
my_sample_weights = (1 - occurrences/10000)**3
my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
Let's train a neural network, first without using the information contained in my_sample_weights:
def make_model(number_of_hidden_neurons=1):
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x,y, epochs=1000)
plt.scatter(x, y, )
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
As the following picture shows, the neural network tries to fit the shape of the sinusoid, but the background noise prevents a good fit.
Now, using the information in my_sample_weights, the quality of the prediction is much better.
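The weighted fit itself is not shown above; presumably it looked something like this (a sketch reusing make_model and my_sample_weights from the code above):
net_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_using_sample_weight.fit(x, y, epochs=1000, sample_weight=np.asarray(my_sample_weights))
plt.scatter(x, y)
plt.scatter(x, net_using_sample_weight.predict(x), c="green")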
Second part: Using sample_weight as a GridSearchCV parameter raises an error
my_Regressor = KerasRegressor(make_model)
validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500],
                                     },
                         fit_params={'sample_weight': [my_sample_weights]},
                         n_jobs=1,
                         )
validator.fit(x, y)
Trying to pass the sample_weights as a parameter gives the following error:
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
It seems that the sample_weight vector has not been split in a similar manner to the input array.
For what it's worth:
import sklearn
print(sklearn.__version__)
0.18.1
import keras
print(keras.__version__)
2.0.5
The problem is that, by default, GridSearchCV uses 3-fold cross-validation unless explicitly stated otherwise. This means that 2/3 of the data points are used for training and 1/3 for validation, which fits the error message: the length of 1000 in the fit_params doesn't match the number of examples actually used for fitting (666). Adjust the size and the code will run:
my_sample_weights = np.random.uniform(size=666)
We developed PipeGraph, an extension to the scikit-learn Pipeline that lets you access intermediate data, build graph-like workflows and, in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph).
I am using Random Forest with scikit-learn.
The RF overfits the data and the prediction results are bad.
The overfitting does NOT depend on the parameters of the RF (NBtree, Depth_Tree); it happens with many different parameter values (tested across grid search).
To remedy this, I:
- tweak the initial data / down-sample some observations in order to affect the fitting (manually pre-process noisy samples),
- loop over randomly generated RF fits,
- get the RF predictions on the data used for prediction,
- select the model which best fits the "predicted data" (not the calibration data).
This Monte Carlo procedure is very expensive.
I am just wondering whether there is another way to do cross-validation on a Random Forest (i.e. NOT hyper-parameter optimization).
EDITED
Cross-Validation with any classifier in scikit-learn is really trivial:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
clf = RandomForestClassifier()  # initialize with whatever parameters you want

# 10-fold cross-validation
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
If you wish to run Grid Search, you can easily do it via the GridSearchCV class. In order to do so you will have to provide a param_grid, which according to the documentation is
Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
So maybe, you could define your param_grid as follows:
param_grid = {
'n_estimators': [5, 10, 15, 20],
'max_depth': [2, 5, 7, 9]
}
Then you can use the GridSearchCV class as follows
from sklearn.model_selection import GridSearchCV
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)
You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the grid scores using grid_clf.cv_results_.
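For instance, a quick way to inspect the grid scores (a minimal sketch; assumes pandas is available):
import pandas as pd

results = pd.DataFrame(grid_clf.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))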
Hope this helps!