Sklearn MLP Feature Selection

Recursive Feature Elimination with Cross Validation (RFECV) does not work with the Multi Layer Perceptron estimator (along with several other classifiers).
I wish to use a feature selection method that works across many classifiers and performs cross validation to verify its feature selection. Any suggestions?

There is a feature selection method independent of the model choice for structured data; it is called Permutation Importance. It is well explained here and elsewhere.
You should have a look at it. It is currently being implemented in sklearn.
There is no current implementation for MLP, but one could easily be written with something like this (adapted from the article):
import numpy as np

def permutation_importances(rf, X_train, y_train, metric):
    # Score the fitted model on the unmodified data as a baseline.
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        # Shuffle one column at a time and measure the drop in the metric.
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save  # restore the original column
        imp.append(baseline - m)
    return np.array(imp)
Note that here the training set is used for computing the feature importances, but you could choose to use the test set, as discussed here.
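As a hedged illustration (not part of the original answer), the same helper could be applied to a fitted MLPClassifier with an accuracy-based metric; X_train and y_train are assumed to be a pandas DataFrame and a label vector:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Hypothetical metric: higher is better, so a larger drop means a more important feature.
def accuracy_metric(model, X, y):
    return accuracy_score(y, model.predict(X))

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

importances = permutation_importances(mlp, X_train, y_train, accuracy_metric)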

Related

How to get the features selected by the RandomizedSearchCV for LGBMClassifier model?

I'm using the RandomizedSearchCV (sklearn) model selection to find the best fit for a LightGBM LGBMClassifier model, but I'm having trouble figuring out which features have been selected for it.
I can print out the importance of each one with:
lgbm_clf = lgbm.LGBMClassifier(boosting_type='gbdt',....
lgbm_clf.fit(X_train, y_train)
importance_type = lgbm_clf.importance_type
lgbm_clf.importance_type = "gain"
gain = lgbm_clf.feature_importances_
lgbm_clf.importance_type = "split"
split = lgbm_clf.feature_importances_
lgbm_clf.importance_type = importance_type
feature_importance = pd.DataFrame(
    dict(snp=data.columns, zgain=zscore(gain), zsplit=zscore(split))
)
feature_importance
But how do I know which features have been used in the model?
e.g.: If I try:
lgbm.plot_split_value_histogram(lgbm_clf, 1)
I get the error: ValueError: Cannot plot split value histogram, because feature 1 was not used in splitting
This question is part of a broader question that has been asked at How to compare feature selection regression-based algorithm with tree-based algorithms?.
Thank you!
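As a hedged sketch (not part of the original question), one way to list the features that were actually used is to assume that a feature with a zero split count never appeared in any tree, reusing the importance_type switch from the code above:

# Features with a zero "split" importance were never used in any split.
lgbm_clf.importance_type = "split"
used_mask = lgbm_clf.feature_importances_ > 0
used_features = data.columns[used_mask]
print(used_features)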

Cross Validation in Sklearn using a Custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=custom_cv, n_jobs=-1, verbose=0, scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X and y are, respectively, the feature matrix and the vector of labels.
Example: Suppose that I only have one hyperparameter alpha that belongs to the set {1, 2, 3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3, and choose the alpha that minimizes the error.
EDIT: Actual answer to the question. Try passing the cv argument a generator of the indices:
def index_gen(listTrain, listTest):
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest), n_jobs=-1,
                           verbose=0, scoring=errorType)
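As a hedged alternative (not in the original answer), a plain list containing a single (train, test) tuple of index arrays also satisfies the "iterable yielding (train, test) splits" requirement from the documentation:

# A single hold-out split expressed as a list of one (train, test) tuple.
custom_cv = [(listTrain, listTest)]

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=custom_cv, n_jobs=-1,
                           verbose=0, scoring=errorType)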
EDIT: Original answer (before edits):
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           n_jobs=-1, verbose=0, scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain])
Now, your training set will be split into 5 folds (you can choose the number here); for each specific set of hyperparameters, the model is trained on 4 of those folds and tested on the fold that was left out. This is repeated 5 times, until every training example has been part of a left-out fold. This whole procedure is done for each hyperparameter setting you are testing (5x3 in this case).
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain], y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

Can RandomizedSearchCV output feature importance based on the best model?

After using RandomizedSearchCV to find the best hyperparameters, is there a way to obtain the following outputs?
1. save the best model as an object
2. output feature importance
gbm = GradientBoostingClassifier()
rand = RandomizedSearchCV(gbm, param_distributions=param_dist, cv=10,
                          scoring='roc_auc', n_iter=10, random_state=5)
rand.fit(X_train, y_train_num)
Use the best_params_ attribute and save it into a dictionary. From the dictionary, retrain the model and retrieve the values by their keys.
top_params = rand.best_params_
gbm_model = GradientBoostingClassifier(learning_rate=top_params['learning_rate'], max_depth=top_params["max_depth"], ...)
gbm_model.fit(X_train, y_train_num)
gbm_model.feature_importances_
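As a hedged shortcut not mentioned in the original answer, when refit=True (the default) the search object already keeps the best model refit on the full training data, so the manual retraining step can be skipped:

best_gbm = rand.best_estimator_         # the best model, refit on all training data
print(best_gbm.feature_importances_)    # feature importances of that model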

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work; it complains about y having zero shape. Does that mean there's no way to apply the trained and cross-validated model from cross_val_predict to new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate predictions for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
from sklearn import linear_model

logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
I have the same concern as #user3490622. If we can only use cross_val_predict on training and testing sets, why is y (target) None by default? (sklearn page)
To partially achieve the desired result of multiple predicted probabilities, one could use the fit-then-predict approach repeatedly to mimic cross-validation.
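A minimal sketch of that idea, assuming x_old, y_old and x_new are available as in the question; it relies on cross_validate with return_estimator=True, which is not mentioned in the original answer:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Fit one LogisticRegression per fold and keep the fitted estimators.
cv_results = cross_validate(LogisticRegression(), x_old, y_old, cv=10,
                            return_estimator=True)

# Average the per-fold probabilities to score the new dataset.
fold_probs = [est.predict_proba(x_new) for est in cv_results["estimator"]]
myprobs_test = np.mean(fold_probs, axis=0)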

Feature_importances in scikit-learn, how to choose correct parameters?

My task is to understand which features (situated in the columns of the X dataset) are the best at predicting the target variable y. I've decided to use feature_importances_ in RandomForestClassifier. RandomForestClassifier has the best score (AUC-ROC) when max_depth=10 and n_estimators=50. Is it correct to use feature_importances_ with the best parameters, or the default parameters? Why? How does feature_importances_ work?
Here are two models, with best and default parameters, for example.
1)
model = RandomForestClassifier(max_depth=10,n_estimators = 50)
model.fit(X, y)
feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])
2)
model = RandomForestClassifier()
model.fit(X, y)
feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])
I think you should use feature_importances_ with the best parameters, since that is the model you are going to use. There is nothing special about the default parameters that deserves special treatment. As for how feature_importances_ works, you can refer to the answer from the scikit-learn authors here: How are feature_importances in RandomForestClassifier determined?
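As a hedged complement to this answer, the impurity-based importances could also be cross-checked with sklearn's permutation_importance (which ties back to the first answer in this thread); X and y are assumed to be the same DataFrame and labels as above:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

model = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=0)
model.fit(X, y)

# Permutation importance is model-agnostic and less biased toward
# high-cardinality features than impurity-based importances.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)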
