Does oversampling happen before or after cross-validation using imblearn pipelines? - python-3.x

I have split my data into train/test before doing cross-validation on the training data to validate my hyperparameters. I have an unbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have established a pipeline using imblearn.
My understanding is that oversampling should be done after dividing the data into k folds to prevent information leakage. Is this order of operations (split the data into k folds, oversample the k-1 training folds, predict on the remaining fold) preserved when using Pipeline in the setup below?
import numpy as np
import xgboost as xgb
from scipy import stats
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', xgb.XGBClassifier())
])

param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              'sampling__ratio': np.linspace(0.25, 0.5, 10)  # 'ratio' was renamed to 'sampling_strategy' in newer imblearn releases
              }

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)  # custom scorer defined elsewhere
random_search.fit(X_train.values, y_train)

Your understanding is correct. When you pass the pipeline as the model, .fit() is called on the k-1 training folds and the remaining fold is used for testing, so SMOTE resampling is applied only to the training folds.
The documentation for imblearn.pipeline.Pipeline.fit() says:
Fit the model
Fit all the transforms/samplers one after the other and transform/sample the data,
then fit the transformed/sampled data using the final estimator.
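To make the order of operations concrete, here is a minimal sketch of what happens on each fold (toy imbalanced data, with LogisticRegression swapped in for XGBoost purely to keep the example self-contained): SMOTE resamples only the k-1 training folds inside fit(), and the held-out fold is scored untouched.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Imbalanced toy data: roughly a 90% / 10% class split.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([('sampling', SMOTE()),
                 ('classification', LogisticRegression(max_iter=1000))])

for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # SMOTE is fitted and applied only to the k-1 training folds inside fit()...
    pipe.fit(X[train_idx], y[train_idx])
    # ...while the held-out fold is scored as-is (no resampling at predict time).
    print(pipe.score(X[test_idx], y[test_idx]))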

Related

How to plot the random forest tree corresponding to best parameter

Python: 3.6
Windows: 10
I have a few questions regarding Random Forest and the problem at hand:
I am using grid search to run a regression problem with Random Forest. I want to plot the tree corresponding to the best parameters that the search found. Here is the code.
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()

# Random search of parameters, using 5-fold cross validation,
# searching across 50 different combinations, and using all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,  # random_grid defined earlier (not shown)
                               n_iter=50, cv=5, verbose=2, random_state=56, n_jobs=-1)

# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
The best parameters came out to be:
{'n_estimators': 1000,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 5,
 'bootstrap': True}
How can I plot this tree using the above parameters?
My dependent variable y lies in the range [0,1] (continuous) and all predictor variables are either binary or categorical. Which algorithm can work well in general for this input and output feature space? I tried Random Forest, but it didn't give very good results. Note that y is a kind of ratio, which is why it lies between 0 and 1. Example: expense on food / total expense.
The data is skewed: the dependent variable y has the value 1 in 60% of the rows and lies somewhere between 0 and 1 (e.g. 0.66, 0.87 and so on) in the rest.
Since my data has only binary {0,1} and categorical {A,B,C} variables, do I need to convert them into one-hot encoded variables to use a random forest?
Regarding the plot (I am afraid your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):
Fitting your RandomizedSearchCV has resulted in an rf_random.best_estimator_, which in itself is a random forest with the parameters shown in your question (including 'n_estimators': 1000).
According to the docs, a fitted RandomForestRegressor includes an attribute:
estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
So, to plot any individual tree of your Random Forest, you should use either
from sklearn import tree
tree.plot_tree(rf_random.best_estimator_.estimators_[k])
or
from sklearn import tree
tree.export_graphviz(rf_random.best_estimator_.estimators_[k])
for the desired k in [0, 999] in your case ([0, n_estimators-1] in the general case).
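For completeness, here is a self-contained sketch (toy regression data and a small forest, not your fitted rf_random) of plotting one of the fitted sub-estimators:
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

plt.figure(figsize=(12, 6))
tree.plot_tree(rf.estimators_[0], filled=True)   # first of the 10 fitted trees
plt.show()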
Allow me to take a step back before answering your questions.
Ideally, one should drill down further on the best_params_ of the RandomizedSearchCV output using GridSearchCV. RandomizedSearchCV goes over your parameters without trying out all the possible options. Once you have the best_params_ from RandomizedSearchCV, you can investigate all the possible options across a narrower range.
You did not include the random_grid parameters in your code, but I would expect you to do a GridSearchCV like this:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of RandomizedSearchCV
param_grid = {
    'max_depth': [4, 5, 6],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [990, 1000, 1010]
}

# Fit the grid search model (note: GridSearchCV has no random_state parameter)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
The above will go through all the possible combinations of parameters in param_grid and give you the best combination.
Now coming to your questions:
Random forests are a combination of multiple trees, so you do not have only one tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function; have a read of the documentation and this SO question to understand it more.
Did you try a simple linear regression first?
This would impact which accuracy metrics you use to assess your model's fit. Precision, recall and F1 scores come to mind when dealing with unbalanced/skewed data.
Yes, categorical variables need to be converted to dummy variables before fitting a random forest (a minimal encoding sketch follows below).
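A minimal encoding sketch, assuming a pandas DataFrame with one binary column and one categorical column (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({'binary_flag': [0, 1, 0, 1],
                   'category': ['A', 'B', 'C', 'A']})

# Binary {0,1} columns can stay as they are; string categories get one 0/1 column per level.
encoded = pd.get_dummies(df, columns=['category'])
print(encoded.columns.tolist())
# ['binary_flag', 'category_A', 'category_B', 'category_C']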

How to make prediction with single sample in sklearn model.predict?

I trained a logistic regression model with some data.
I applied a standard scaler to the train and test data and trained the model.
But if I want to make predictions on data outside the train and test sets, I have to apply the standard scaler to the new data as well. What if I only have a single sample? How can I apply the standard scaler to that single new sample that I want to use as input?
What should the procedure be to predict results with new data, especially a single sample at a time?
The predict() method always expects a 2D array of shape [n_samples, n_features]. This means that if you want to predict even for a single data point, you will have to convert it into a 2D array.
Converting the data into a 2D array using reshape:
import numpy as np

# Sample data
arr = np.array([1, 2, 3, 4])

# Reshaping into 2D
arr.reshape(1, -1)
# Result: array([[1, 2, 3, 4]])
This 2D array can now be transformed with the standard scaler's transform() method before being passed to the model to generate a prediction.
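Putting the pieces together, here is a hedged end-to-end sketch on toy data: the scaler fitted on the training data is reused to transform the single new sample before predict() is called.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=100, n_features=4, random_state=0)

scaler = StandardScaler().fit(X_train)                        # fit the scaler on the training data only
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

new_sample = np.array([1.0, 2.0, 3.0, 4.0]).reshape(1, -1)    # single sample -> shape (1, n_features)
print(model.predict(scaler.transform(new_sample)))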

Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working in scikit-learn and I am trying to tune my XGBoost model.
I attempted nested cross-validation, using a pipeline to rescale the training folds (to avoid data leakage and overfitting), GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

std_scaling = StandardScaler()
algo = XGBClassifier()

steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]
pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

cv1 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=cv1, param_grid=parameters, scoring='roc_auc',
                       n_jobs=-1, return_train_score=False)

cv1 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=cv1, scoring='roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline does it make sense to include the X_train in the cross_val_score or should I use a standardized form of the X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a more high-level score: "under the process of reasonable model selection, how well will my selected model generalize".
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your internal model over the entire data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalize to new data?", the answer is the score you got during your nested CV, which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
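A compact sketch of that flow, using toy data and a simpler classifier in place of XGBoost purely to keep it self-contained:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([('std_scaling', StandardScaler()), ('algo', LogisticRegression(max_iter=1000))])
inner = GridSearchCV(pipe, {'algo__C': [0.1, 1.0, 10.0]}, cv=3, scoring='roc_auc')

outer_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
nested_scores = cross_val_score(inner, X_train, y_train, cv=outer_cv, scoring='roc_auc')
print(nested_scores.mean())    # estimate of how well the whole selection process generalizes

inner.fit(X_train, y_train)    # the model you actually keep, tuned on all the training data
print(inner.best_params_)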

Multi-Feature Sequence Padding and Masking in RNN using Keras

I have constructed an LSTM architecture using Keras, but I am not certain whether duplicating time steps is a good approach for dealing with variable sequence lengths.
I have a multidimensional data set with multi-feature sequences and varying numbers of time steps. It is multivariate time series data with multiple examples to train the LSTM on, and Y is either 0 or 1. Currently, I am duplicating the last time step of each sequence to ensure timesteps = 3.
I would appreciate it if someone could answer the following questions or concerns:
1. Is creating additional time steps with feature values represented by zeroes more suitable?
2. What is the right way to frame this problem, pad the sequences, and mask them for evaluation?
3. I am duplicating the last time step in the Y variable as well for prediction, and the value 1 in Y only appears at the last time step, if at all.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense
from keras.constraints import maxnorm  # max_norm in newer Keras versions

# The input sequences are
trainX = np.array([
    # Datapoint 1
    [
        # Input features at timestep 1
        [1, 2, 3],
        # Input features at timestep 2
        [5, 2, 3]  # <------ duplicate this to ensure compliance
    ],
    # Datapoint 2
    [
        # Features at timestep 1
        [1, 8, 9],
        # Features at timestep 2
        [9, 8, 9],
        # Features at timestep 3
        [7, 6, 1]
    ]
])

# The desired model outputs are as follows:
trainY = np.array([
    # Datapoint 1
    [
        # Target class at timestep 1
        [0],
        # Target class at timestep 2
        [1]  # <---------- duplicate this to ensure compliance
    ],
    # Datapoint 2
    [
        # Target class at timestep 1
        [0],
        # Target class at timestep 2
        [0],
        # Target class at timestep 3
        [0]
    ]
])

timesteps = 3
model = Sequential()
model.add(LSTM(3, kernel_initializer='uniform', return_sequences=True,
               batch_input_shape=(None, timesteps, trainX.shape[2]),
               kernel_constraint=maxnorm(3), name='LSTM'))
model.add(Dropout(0.2))
model.add(LSTM(3, return_sequences=True, kernel_constraint=maxnorm(3), name='LSTM-2'))
model.add(Flatten(name='Flatten'))
model.add(Dense(timesteps, activation='sigmoid', name='Dense'))
model.compile(loss="mse", optimizer="sgd", metrics=["mse"])
model.fit(trainX, trainY, epochs=2000, batch_size=2)
predY = model.predict(testX)  # testX defined elsewhere
In my opinion there are two solutions to your problem (duplicating time steps is not one of them):
1. Use the pad_sequences utility in combination with a Masking layer. This is the common approach: thanks to padding, every sample has the same number of time steps. The good thing about this method is that it's very easy to implement, and the Masking layer also gives you a little performance boost.
The downside of this approach: if you train on a GPU, CuDNNLSTM is the layer to go with, as it is highly optimized for GPUs and therefore a lot faster. But it does not work with a Masking layer, and if your dataset has a wide range of sequence lengths, you lose performance.
2. Set your timesteps shape to None and write a Keras generator which groups your batches by number of time steps (I think you will also have to use the functional API). Now you can use CuDNNLSTM, and every sample is computed with only the relevant time steps (instead of padded ones), which is much more efficient.
If you're new to Keras and performance is not so important, go with option 1. If you have a production environment where you often have to train the network and costs are relevant, try option 2. A sketch of option 1 follows below.
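A hedged sketch of option 1, reusing the toy sequences from the question: pad_sequences brings every sample to the same length, and the Masking layer tells the LSTM to skip the padded time steps (layer sizes here are arbitrary).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

# Two samples with 2 and 3 time steps, 3 features each (as in the question, before any duplication).
sequences = [[[1, 2, 3], [5, 2, 3]],
             [[1, 8, 9], [9, 8, 9], [7, 6, 1]]]
labels = [[[0], [1]], [[0], [0], [0]]]

X = pad_sequences(sequences, maxlen=3, padding='post', value=0.0, dtype='float32')
Y = pad_sequences(labels, maxlen=3, padding='post', value=0.0, dtype='float32')

model = Sequential([
    Masking(mask_value=0.0, input_shape=(3, 3)),   # padded time steps are skipped downstream
    LSTM(8, return_sequences=True),
    Dense(1, activation='sigmoid')                 # one prediction per time step
])
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, Y, epochs=2, batch_size=2, verbose=0)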

Specific Cross Validation with Random Forest

I am using Random Forest with scikit-learn.
RF overfits the data and the prediction results are bad.
The overfitting does NOT depend on the parameters of the RF: NBtree, Depth_Tree.
Overfitting happens with many different parameters (tested across grid_search).
To remedy this:
I tweak the initial data / down-sample some observations in order to affect the fitting (manually pre-process noisy samples).
I loop over random generations of RF fits, get the RF predictions on the data to be predicted, and select the model which best fits the "predicted data" (not the calibration data).
This Monte Carlo approach is very time-consuming.
I am just wondering if there is another way to do cross-validation on a Random Forest (i.e. NOT hyper-parameter optimization).
Cross-Validation with any classifier in scikit-learn is really trivial:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # initialize with whatever parameters you want to

# 10-fold cross-validation
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
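If the goal is to diagnose overfitting rather than tune hyper-parameters, a small sketch with cross_validate and return_train_score=True (toy data below) makes the train/validation gap visible:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X_train, y_train = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier()

scores = cross_validate(clf, X_train, y_train, cv=10, return_train_score=True)
# A large gap between mean train and validation scores points to overfitting.
print(np.mean(scores['train_score']), np.mean(scores['test_score']))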
If you wish to run Grid Search, you can easily do it via the GridSearchCV class. In order to do so you will have to provide a param_grid, which according to the documentation is
Dictionary with parameters names (string) as keys and lists of
parameter settings to try as values, or a list of such dictionaries,
in which case the grids spanned by each dictionary in the list are
explored. This enables searching over any sequence of parameter
settings.
So maybe, you could define your param_grid as follows:
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
Then you can use the GridSearchCV class as follows:
from sklearn.model_selection import GridSearchCV
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)
You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the grid scores using grid_clf.cv_results_.
Hope this helps!
