How to plot the random forest tree corresponding to best parameter - python-3.x

Python: 3.6
Windows: 10
I have a few questions regarding Random Forest and the problem at hand:
I am using grid search to run a regression problem with Random Forest. I want to plot the tree corresponding to the best-fit parameters that the search found. Here is the code.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 5-fold cross validation,
# search across 50 different combinations, and use all available cores
# (random_grid is defined elsewhere and not shown here)
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=50,
                               cv=5, verbose=2, random_state=56, n_jobs=-1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
The best parameters came out to be:
{'n_estimators': 1000,
'min_samples_split': 5,
'min_samples_leaf': 1,
'max_features': 'auto',
'max_depth': 5,
'bootstrap': True}
How can I plot the tree using the above parameters?
My dependent variable y lies in the range [0, 1] (continuous) and all predictor variables are either binary or categorical. Which algorithm can, in general, work well for this input and output feature space? I tried Random Forest, but it didn't give very good results. Note that y is a kind of ratio, hence it lies between 0 and 1; for example, expense on food / total expense.
The above data is also skewed: the dependent (y) variable equals 1 in 60% of the data and lies somewhere between 0 and 1 (e.g. 0.66, 0.87 and so on) in the rest.
Since my data has only binary {0,1} and categorical {A,B,C} variables, do I need to convert them into one-hot encoded variables to use Random Forest?

Regarding the plot (I am afraid that your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):
Fitting your RandomizedSearchCV has resulted in an rf_random.best_estimator_, which in itself is a random forest with the parameters shown in your question (including 'n_estimators': 1000).
According to the docs, a fitted RandomForestRegressor includes an attribute:
estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
So, to plot any individual tree of your Random Forest, you should use either
from sklearn import tree
tree.plot_tree(rf_random.best_estimator_.estimators_[k])
or
from sklearn import tree
tree.export_graphviz(rf_random.best_estimator_.estimators_[k])
for the desired k in [0, 999] in your case ([0, n_estimators-1] in the general case).
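Putting that together, here is a minimal sketch (assuming rf_random has been fitted as above; the figure size, the max_depth truncation, and the use of X_train.columns for feature names are illustrative assumptions):
from sklearn import tree
import matplotlib.pyplot as plt

k = 0  # any index in [0, n_estimators - 1]
fig, ax = plt.subplots(figsize=(20, 10))
tree.plot_tree(
    rf_random.best_estimator_.estimators_[k],  # the k-th fitted DecisionTreeRegressor
    feature_names=list(X_train.columns),       # assumes X_train is a pandas DataFrame
    filled=True,
    max_depth=3,                               # truncate the drawing for readability
    ax=ax,
)
plt.show()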

Allow me to take a step back before answering your questions.
Ideally, one should drill down further on the best_params_ from the RandomizedSearchCV output using GridSearchCV. RandomizedSearchCV goes over your parameters without trying out all the possible options. Then, once you have the best_params_ from RandomizedSearchCV, you can investigate all the possible options across a narrower range.
You did not include the random_grid parameters in your code, but I would expect you to follow up with a GridSearchCV like this:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of RandomizedSearchCV
param_grid = {
    'max_depth': [4, 5, 6],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [990, 1000, 1010]
}
# Instantiate and fit the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
What the above will do is go through all the possible combinations of parameters in the param_grid and give you the best parameters.
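Once that grid search has been fitted, you can inspect the result like this (a small sketch using the names from the snippet above; the 3 * 2 * 3 * 3 = 54 combinations are each evaluated with 5-fold CV, i.e. 270 fits; treating the test-set score as a final check is an assumption about your workflow):
print(grid_search.best_params_)        # the winning combination
best_rf = grid_search.best_estimator_  # refit on the full training set by default (refit=True)
print(best_rf.score(X_test, y_test))   # R^2 of the tuned forest on the held-out test set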
Now coming to your questions:
Random forests are a combination of multiple trees, so you do not have only one tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function. Have a read of the documentation and this SO question to understand it more.
Did you try a simple linear regression first?
This would impact what kind of accuracy metrics you would utilize to assess your model's fit. Precision, recall and F1 scores come to mind when dealing with unbalanced/skewed data.
Yes, categorical variables need to be converted to dummy variables before fitting a random forest (a sketch follows below).
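A minimal sketch of such an encoding with pandas (the column names 'b' and 'c' are hypothetical, just to illustrate one binary and one categorical predictor):
import pandas as pd

df = pd.DataFrame({'b': [0, 1, 1], 'c': ['A', 'B', 'C']})
X_encoded = pd.get_dummies(df, columns=['c'])  # adds c_A, c_B, c_C indicator columns; 'b' stays as-is
print(X_encoded)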

Related

F1 metric and LeaveOneOut validation strategy in scikit-learn

I want to use GridSearchCV to find the optimal n_neighbors parameter of KNeighborsClassifier.
I want to use the 'f1_score' metric AND the 'leave one out' strategy.
But this code
clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 2, 3]}, cv=LeaveOneOut(), scoring='f1')
clf.fit(x_train, y_train)
leads to an error
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
I want to compute the f1 score not for each fold of the cross-validation (it is not possible to compute an f1 score on a single test example), but based on the whole set of left-out predictions for each n_neighbors = n.
Is it possible using GridSearchCV?
Not sure if this functionality is directly available in Scikit-Learn, but you can implement the following function to get the desired outcome.
In particular, we will make a dummy scorer which just returns the predicted class instead of computing any score from the ground truth and the prediction. This way we can access the predictions of each hyperparameter combination on the different examples in the LOO CV.
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# dummy "scorer": simply return the prediction for the single left-out sample
def get_pred(y_true, y_predicted):
    return y_predicted

get_pred_scorer = make_scorer(get_pred)

clf = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 2, 3]},
    cv=LeaveOneOut(),
    refit=False,  # the "scores" are predictions, so there is no best model to refit
    scoring=get_pred_scorer
)
clf.fit(X_train, y_train)
The problem with this approach is that certain results available in the cv_results_ dictionary (and in certain attributes of GridSearchCV) won't have any meaning, but that probably is not a problem. We just have to remember to set refit=False, since GridSearchCV has no way to determine the best model here.
Now we can access the predictions through cv_results_ and use f1_score to compute the metric for each hyperparameter configuration.
def print_params_f1_scores(clf, y_true):
    y_preds = []  # will contain the predictions of each params combination
    results = clf.cv_results_
    params = results["params"]  # all params combinations
    for j in range(len(params)):  # for each combination
        y_preds.append([])
        for i in range(clf.n_splits_):  # for each split (sample in loo)
            prediction_of_j_on_i = results[f"split{i}_test_score"][j]
            y_preds[j].append(prediction_of_j_on_i)
    # show the f1-scores of each combination
    for j in range(len(y_preds)):
        score = f1_score(y_true, y_preds[j])
        print(f"KNeighborsClassifier with {params[j]} obtained f1-score of {score}")

print_params_f1_scores(clf, y_train)
The function prints the following output:
KNeighborsClassifier with {'n_neighbors': 1} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 2} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 3} obtained f1-score of 0.92
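Since refit=False, choosing the final n_neighbors is left to you. A hedged follow-up sketch reusing clf.cv_results_ (the variable names below are mine, not from the original answer):
import numpy as np

results = clf.cv_results_
params = results["params"]
scores = []
for j in range(len(params)):
    preds = [results[f"split{i}_test_score"][j] for i in range(clf.n_splits_)]
    scores.append(f1_score(y_train, preds))
best_params = params[int(np.argmax(scores))]       # e.g. {'n_neighbors': 1}
final_clf = KNeighborsClassifier(**best_params).fit(X_train, y_train)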

How to use an explicit validation set with a predefined split fold?

I have explicit train, test and validation sets as 2d arrays:
X_train.shape
(1400, 38785)
X_val.shape
(200, 38785)
X_test.shape
(400, 38785)
I am tuning the alpha parameter and need advice about how I can use the predefined validation set in it:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, PredefinedSplit
nb = MultinomialNB()
nb.fit(X_train, y_train)
params = {'alpha': [0.1, 1, 3, 5, 10,12,14]}
# how to use on my validation set?
# ps = PredefinedSplit(test_fold=?)
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X_train, y_train)
My results so far are as follows.
# on my validation set, alpha = 5
gs.fit(X_val, y_val)
print('Grid best parameter', gs.best_params_)
Grid best parameter: {'alpha': 5}
# on my training set, alpha = 10
Grid best parameter: {'alpha': 10}
I have read the following questions and documentation, yet I am not sure how to use PredefinedSplit() in my case. Thank you.
Order between using validation, training and test sets
https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets
You can achieve your desired outcome by merging X_train and X_val and passing PredefinedSplit an array of fold labels, with -1 marking training samples (never used as a test fold) and 0 marking validation samples. I.e.,
import numpy as np

X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))
test_fold = np.concatenate((-np.ones(len(X_train)), np.zeros(len(X_val))))
ps = PredefinedSplit(test_fold)
gs = GridSearchCV(nb, param_grid=params, cv=ps, return_train_score=True, scoring='f1')
gs.fit(X, y)  # note: the merged X and y, not X_train and y_train
However, unless there is a very good reason for holding out a separate validation set, you will likely have less overfitting if you use k-fold cross-validation for your hyperparameter tuning rather than a dedicated validation set.
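If you go that route, a minimal sketch would be to drop PredefinedSplit altogether and cross-validate on the combined data (the choice of 5 folds is just an illustration):
gs_kfold = GridSearchCV(nb, param_grid=params, cv=5, return_train_score=True, scoring='f1')
gs_kfold.fit(X, y)  # X, y as constructed above from train + validation
print(gs_kfold.best_params_)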

Does oversampling happen before or after cross-validation using imblearn pipelines?

I have split my data into train/test before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have established a pipeline using imblearn.
My understanding is that oversampling should be done after dividing the data into k-folds to prevent information leaking. Is this order of operations (data split into k-folds, k-1 folds oversampled, predict on remaining fold) preserved when using Pipeline in the setup below?
import numpy as np
import xgboost as xgb
from scipy import stats
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])

param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              'sampling__ratio': np.linspace(0.25, 0.5, 10)
              }

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)  # custom scorer, defined elsewhere
random_search.fit(X_train.values, y_train)
Your understanding is right. When you pass the pipeline as the model, fitting is done on the k-1 training folds and testing on the k-th fold, so the sampling is applied only to the training folds.
The documentation for imblearn.pipeline.Pipeline.fit() says:
Fit the model
Fit all the transforms/samplers one after the other and transform/sample the data,
then fit the transformed/sampled data using the final estimator.
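To make that order of operations concrete, here is a rough sketch of what each CV iteration effectively does (an illustration only, not the library's internal code; it assumes binary labels, the same X_train/y_train as above, and a recent imblearn with fit_resample):
import numpy as np
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold

X_arr, y_arr = X_train.values, np.asarray(y_train)
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_arr, y_arr):
    # SMOTE sees only the k-1 training folds...
    X_res, y_res = SMOTE().fit_resample(X_arr[train_idx], y_arr[train_idx])
    clf = xgb.XGBClassifier().fit(X_res, y_res)
    # ...while the held-out fold is scored untouched
    print(clf.score(X_arr[test_idx], y_arr[test_idx]))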

Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working in scikit-learn and I am trying to tune my XGBoost.
I am attempting nested cross-validation, using a pipeline to rescale the training folds (to avoid data leakage and overfitting), GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

std_scaling = StandardScaler()
algo = XGBClassifier()

steps = [('std_scaling', std_scaling), ('algo', algo)]
pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# inner CV: hyper-parameter tuning
cv1 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=cv1, param_grid=parameters, scoring='roc_auc',
                       n_jobs=-1, return_train_score=False)

# outer CV: score of the whole selection procedure
cv1 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=cv1, scoring='roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline, does it make sense to include X_train in cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a more high-level score: "under the process of reasonable model selection, how well will my selected model generalize".
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your internal model over the entire data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalizes on new data?" the answer is the score you got during your nested CV - which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
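For completeness, a short sketch of retrieving the final model once clf_auc.fit(X, y) has run (best_params_ and best_estimator_ are standard GridSearchCV attributes; the variable name final_model is mine):
print(clf_auc.best_params_)            # the winning hyper-parameter combination
final_model = clf_auc.best_estimator_  # a fitted Pipeline(StandardScaler -> XGBClassifier)
# final_model.predict(...) / predict_proba(...) can now be used on new data;
# its expected generalization is the nested-CV score from outer_clf_auc above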

Specific Cross Validation with Random Forest

I am using Random Forest with scikit-learn.
RF overfits the data and the prediction results are bad.
The overfitting does NOT depend on the RF parameters (NBtree, Depth_Tree); it happens with many different parameter settings (tested across grid_search).
To remedy this, I currently:
tweak the initial data / down-sample some records in order to affect the fitting (manually pre-process noisy samples),
loop over randomly generated RF fits,
get the RF predictions on the prediction data,
and select the model which best fits the "predicted data" (not the calibration data).
This Monte Carlo approach is very time-consuming.
I am just wondering if there is another way to do cross-validation with Random Forest (i.e. NOT hyper-parameter optimization).
EDITED
Cross-Validation with any classifier in scikit-learn is really trivial:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # initialize with whatever parameters you want to

# 10-fold cross validation
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
If you wish to run Grid Search, you can easily do it via the GridSearchCV class. In order to do so you will have to provide a param_grid, which according to the documentation is
Dictionary with parameters names (string) as keys and lists of
parameter settings to try as values, or a list of such dictionaries,
in which case the grids spanned by each dictionary in the list are
explored. This enables searching over any sequence of parameter
settings.
So maybe, you could define your param_grid as follows:
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
Then you can use the GridSearchCV class as follows
from sklearn.model_selection import GridSearchCV
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)
You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the grid scores using grid_clf.cv_results_.
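If you want more than the single best combination, here is a small sketch for summarizing grid_clf.cv_results_ (assuming grid_clf has been fitted as above; the column names follow the param_grid keys):
import pandas as pd

results = pd.DataFrame(grid_clf.cv_results_)
cols = ['param_n_estimators', 'param_max_depth', 'mean_test_score', 'std_test_score']
print(results[cols].sort_values('mean_test_score', ascending=False).head())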
Hope this helps!
