Specific Cross Validation with Random Forest - scikit-learn

I am using Random Forest with scikit-learn.
The RF overfits the data and the prediction results are bad.
The overfitting does NOT depend on the parameters of the RF (NBtree, Depth_Tree);
it happens with many different parameters (tested across grid_search).
To remedy this, I tweak the initial data / down-sample some results
in order to affect the fitting (manually pre-process noisy samples), then:
Loop over random generations of RF fits,
Get the RF prediction on the prediction data,
Select the model which best fits the "predicted data" (not the calibration data).
This Monte Carlo approach is very time-consuming.
I am just wondering if there is another way to do
cross-validation on a Random Forest (i.e. NOT hyper-parameter optimization)?

Cross-Validation with any classifier in scikit-learn is really trivial:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # Initialize with whatever parameters you want

# 10-fold cross-validation
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
If you wish to run Grid Search, you can easily do it via the GridSearchCV class. In order to do so you will have to provide a param_grid, which according to the documentation is
Dictionary with parameters names (string) as keys and lists of
parameter settings to try as values, or a list of such dictionaries,
in which case the grids spanned by each dictionary in the list are
explored. This enables searching over any sequence of parameter
settings.
So maybe, you could define your param_grid as follows:
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
Then you can use the GridSearchCV class as follows:
from sklearn.model_selection import GridSearchCV
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)
You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the grid scores using grid_clf.cv_results_.
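For example, a minimal sketch of reading those attributes once the search has been fitted (same grid_clf as above):
# After grid_clf.fit(X_train, y_train):
best_rf = grid_clf.best_estimator_               # the refitted RandomForestClassifier
print(grid_clf.best_params_)                     # e.g. {'max_depth': ..., 'n_estimators': ...}
print(grid_clf.cv_results_['mean_test_score'])   # mean CV score for each parameter combination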
Hope this helps!

Related

oversampling (SMOTE) does not work properly when fitted inside a pipeline

I have an imbalanced classification problem and I am using make_pipeline from imblearn.
The steps are the following:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

kf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
params = {
    'max_depth': [2, 3, 5],
    # 'max_features': ['auto', 'sqrt', 'log2'],
    # 'min_samples_leaf': [5, 10, 20, 50, 100, 200, 300],
    'n_estimators': [10, 25, 30, 50]
    # 'bootstrap': [True, False]
}
imba_pipeline = make_pipeline(SMOTE(random_state=42), RobustScaler(), RandomForestClassifier(random_state=42))
imba_pipeline
out: Pipeline(steps=[('smote', SMOTE(random_state=42)),
                     ('robustscaler', RobustScaler()),
                     ('randomforestclassifier',
                      RandomForestClassifier(random_state=42))])
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True, n_jobs=-1, verbose=2)
grid_imba.fit(X_train, y_train)
Everything goes OK and I reach the end of my problem (i.e. I can see the classification report).
However, when I try to see inside the black box with eli5, calling explain_weights(imba_pipeline),
I get back the error
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=42)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't
I know that this is a common problem and I have read the related questions, but I am confused because the problem occurs after the end of my classification procedure.
Any suggestions?
Your pipeline has two fitted steps besides the scaler: the SMOTE augmentation and the random forest. It looks like this confuses eli5, which expects only the last step to be the fitted estimator. To get the weight explanation of the random forest, you could try calling eli5 only on that step of the pipeline with
from eli5 import explain_weights
explain_weights(imba_pipeline['randomforestclassifier'])
provided the pipeline is fitted, but in your code you were fitting the grid search so
explain_weights(grid_imba.best_estimator_['randomforestclassifier'])
would be more appropriate.
Just wanted to point out that SMOTE generally doesn't improve prediction quality. See https://arxiv.org/abs/2201.08528

How to plot the random forest tree corresponding to best parameter

Python: 3.6
Windows: 10
I have a few questions regarding Random Forest and the problem at hand:
I am using a randomized grid search (RandomizedSearchCV) to run a regression problem using Random Forest. I want to plot the tree corresponding to the best-fit parameters that the search has found. Here is the code.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 50, cv = 5, verbose=2, random_state=56, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
The best parameters came out to be:
{'n_estimators': 1000,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 5,
 'bootstrap': True}
How can I plot this tree using above parameter?
My dependent variable y lies in the range [0, 1] (continuous) and all predictor variables are either binary or categorical. Which algorithm in general can work well for this input and output feature space? I tried Random Forest (it didn't give that good a result). Note that the y variable here is a kind of ratio, therefore it lies between 0 and 1. Example: expense on food / total expense.
The above data is skewed, meaning the dependent (y) variable has value = 1 in 60% of the data and somewhere between 0 and 1 in the rest, like 0.66, 0.87 and so on.
Since my data has only binary {0,1} and categorical {A,B,C} variables, do I need to convert them into one-hot encoded variables in order to use a random forest?
Regarding the plot (I am afraid that your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):
Fitting your RandomizedSearchCV has resulted in an rf_random.best_estimator_, which in itself is a random forest with the parameters shown in your question (including 'n_estimators': 1000).
According to the docs, a fitted RandomForestRegressor includes an attribute:
estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
So, to plot any individual tree of your Random Forest, you should use either
from sklearn import tree
tree.plot_tree(rf_random.best_estimator_.estimators_[k])
or
from sklearn import tree
tree.export_graphviz(rf_random.best_estimator_.estimators_[k])
for the desired k in [0, 999] in your case ([0, n_estimators-1] in the general case).
Allow me to take a step back before answering your questions.
Ideally one should drill down further on the best_params_ of the RandomizedSearchCV output through GridSearchCV. RandomizedSearchCV goes over your parameters without trying out all the possible options. Then, once you have the best_params_ of RandomizedSearchCV, you can investigate all the possible options across a narrower range.
You did not include random_grid parameters in your code input, but I would expect you to do a GridSearchCV like this:
# Create the parameter grid based on the results of RandomizedSearchCV
param_grid = {
    'max_depth': [4, 5, 6],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [990, 1000, 1010]
}
# Fit the grid search model (note: GridSearchCV takes no random_state parameter)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
What the above will do is go through all the possible combinations of parameters in param_grid and give you the best parameters.
Now coming to your questions:
Random forests are a combination of multiple trees, so you do not have only one tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function. Have a read of the documentation and this SO question to understand it more.
Did you try a simple linear regression first?
This would impact what kind of accuracy metrics you would utilize to assess your model's fit/accuracy. Precision, recall and F1 scores come to mind when dealing with unbalanced/skewed data.
Yes, categorical variables need to be converted to dummy variables before fitting a random forest (a minimal sketch is shown below).
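As an illustration (not from the original answer), a minimal sketch of converting categorical columns to dummy variables with pandas before fitting a random forest; the column names here are hypothetical:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical frame: 'color' is categorical {A, B, C}, 'flag' is already binary {0, 1}
df = pd.DataFrame({'color': ['A', 'B', 'C', 'A'],
                   'flag': [0, 1, 1, 0],
                   'y': [0.66, 0.87, 1.0, 1.0]})

# One-hot encode the categorical column; binary columns can stay as they are
X = pd.get_dummies(df[['color', 'flag']], columns=['color'])
y = df['y']

rf = RandomForestRegressor(random_state=0).fit(X, y)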

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high-level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict_proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but not sure. Also, this issue only occurs if I standardize training set data. So far I've tried both StandardScaler and MinMaxScaler, same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
                   max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
                   random_state=58, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd because the data is not a uniform scale across all features. For instance, some features are represented as percents, others are represented as dollar values, and others are dummy coded representations.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with a unit variance of 1.0 (X_scaled), the variance of the data it's being tested on (X_test) is much higher than expected. It's no surprise then that this results in very extreme probabilities.
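One way to see the mismatch directly (an illustrative check, not part of the original example) is to compare the feature spread the model saw at fit time with that of the raw test set:
# X_scaled was standardized, so each feature has roughly unit standard deviation
print(X_scaled.std(axis=0).mean())   # ~1.0
# X_test is still in raw units, so its spread is far larger than the model expects
print(X_test.std(axis=0).mean())     # much larger than 1 for this dataset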
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.

How can this feature ranking problem be implemented with Support Vector Classification?

If I want the classifier to be SVM (using scikit-learn), how can I modify the 'clf' variable such that the SVM classifier used for feature ranking results in high accuracy? What parameters/arguments do I need to add? Which SVC kernel type ('linear', 'rbf', 'sigmoid' or others) would you suggest for best accuracy?
The code is adapted from the following GitHub link:
https://github.com/CynthiaKoopman/Network-Intrusion-Detection/blob/master/RandomForest_IDS.ipynb
I have 10 features from the DoS attacks of the NSL-KDD dataset, ranked from 1 to 10 (with RecursiveFeatureElimination from scikit-learn) using RandomForestClassifier with 99% accuracy (using RFC as the prediction model).
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
#from sklearn.svm import SVC
# Create a random forest classifier. clf is the 'variable for classifier'
clf = RandomForestClassifier(n_jobs = 2)
# If classifier used is svm
#clf = SVC(kernel = "linear")
#rank all features, i.e continue the elimination until the last one
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)
print ("DoS Features sorted by their rank:")
#print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS)))
sorted_newcolname_DoS = sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS))
sorted_newcolname_DoS
I expect more or less 99% similarity between the ranked features of the two classifiers, which I didn't observe.
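As an illustrative sketch only (not a tuned solution), the swap hinted at by the commented-out clf = SVC(kernel = "linear") line could look like this, reusing X_newDoS, Y_DoS and newcolname_DoS from the code above. RFE needs an estimator that exposes coef_ or feature_importances_, so among 'linear', 'rbf' and 'sigmoid' only the linear kernel works with RFE directly:
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Linear-kernel SVC exposes coef_, which RFE uses to rank features;
# 'rbf' and 'sigmoid' kernels do not, so RFE cannot rank with them directly.
clf = SVC(kernel="linear")
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)

print("DoS features sorted by their SVC-RFE rank:")
print(sorted(zip(rfe.ranking_, newcolname_DoS)))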

Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working in scikit-learn and I am trying to tune my XGBoost model.
I am attempting nested cross-validation: a pipeline to rescale the training folds (to avoid data leakage and overfitting), GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

std_scaling = StandardScaler()
algo = XGBClassifier()
steps = [('std_scaling', std_scaling), ('algo', algo)]
pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# Inner CV: used by GridSearchCV for hyper-parameter tuning
cv1 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=cv1, param_grid=parameters, scoring='roc_auc',
                       n_jobs=-1, return_train_score=False)

# Outer CV: used by cross_val_score to estimate generalization performance
cv2 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=cv2, scoring='roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline, does it make sense to pass X_train to cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a more high-level score: "under the process of reasonable model selection, how well will my selected model generalize".
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your internal model over the entire data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalizes on new data?" the answer is the score you got during your nested CV - which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
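For completeness, a minimal sketch of how the two steps described above fit together (reusing clf_auc, cv2 and the raw X_train / y_train from the question; nothing new is assumed):
# 1. Nested CV: scores the whole model-selection procedure, answering
#    "how well will the model I eventually select generalize?"
nested_scores = cross_val_score(clf_auc, X_train, y_train, cv=cv2, scoring='roc_auc')
print(nested_scores.mean())

# 2. Final model: run the inner grid search once over all the data you keep
#    (the answer uses clf_auc.fit(X, y); X_train / y_train stand in here)
clf_auc.fit(X_train, y_train)
final_model = clf_auc.best_estimator_  # pipeline: StandardScaler + tuned XGBClassifier

# The raw (unstandardized) X_train is passed in both steps: the StandardScaler inside
# the pipeline is refitted on each training fold, so no external scaling is needed.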
