Why does GridSearchCV get a higher score with a subset of parameters? - decision-tree

Below is my code, which I ran twice: first with "criterion": ['gini', 'entropy'], then with only 'entropy' ('gini' removed). Nothing else changed. I expected that with fewer combinations the score would be equal or lower, but it was higher. How is that possible? There is no randomness, and the numbers are the same on every run.
Using "criterion": ['gini', 'entropy'] gave a score of 0.850
Using "criterion": ['entropy'] gave a score of 0.871 (higher)
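from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# dataX, dataY and model are defined elsewhere (not shown)
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.2, random_state=10, stratify=dataY)
gs_params = {
    "criterion": ['gini', 'entropy'],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_samples_split": range(2, 9, 2),
    "min_samples_leaf": range(1, 5)
}
gs = GridSearchCV(model, param_grid=gs_params, n_jobs=-1, verbose=1,
                  cv=5, scoring='f1_weighted', refit=True)
# Three feature-selection stages, then the grid search as the final step
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('features1', SelectFromModel(RandomForestClassifier(random_state=80), threshold='median')),
    ('features2', SelectFromModel(RandomForestClassifier(random_state=81), threshold='median')),
    ('features3', SelectFromModel(RandomForestClassifier(random_state=82), threshold='median')),
    ('gs', gs)
])
clf.fit(X_train.values, y_train.values)
# Score on the held-out test set (not the cross-validation score)
test_score_opt = clf.score(X_test.values, y_test.values)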
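One plausible explanation, offered as a sketch rather than a confirmed diagnosis: GridSearchCV picks the winner by its cross-validation score (best_score_), while the number reported above comes from clf.score on the held-out test set. A larger grid can only match or improve the best cross-validation score, but the candidate it selects is not guaranteed to do better on the test set. The example below uses synthetic data from make_classification and a DecisionTreeClassifier with a fixed random_state; both are assumptions, since the question does not show the data or the definition of model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the real dataX/dataY are not shown).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y)


def run_search(criteria):
    """Return (best cross-validation score, test score) for a given criterion list."""
    grid = {"criterion": criteria, "max_depth": [3, 5, 8]}
    gs = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=grid,
                      cv=5, scoring="f1_weighted")
    gs.fit(X_train, y_train)
    return gs.best_score_, gs.score(X_test, y_test)


cv_full, test_full = run_search(["gini", "entropy"])
cv_sub, test_sub = run_search(["entropy"])

# The full grid's best CV score can never be lower than the subset's ...
print("best CV score:", cv_full, "vs", cv_sub)
# ... but the test scores may go either way.
print("test score:   ", test_full, "vs", test_sub)
```

Comparing gs.best_score_ between the two runs, rather than the test score, is the comparison for which "more combinations cannot be worse" actually holds.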

Related

Sklearn gridsearchcv score not matching

The variable names for my training and test data are X_train, X_test, Y_train, Y_test.
I ran a GridSearchCV instance from sklearn to tune the hyperparameters of my random forest model.
param_grid = {
'n_estimators': [500],
'max_features': ['sqrt', None],
'max_depth': [ 6 ],
'max_leaf_nodes': [8],
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2]
}
grid_search= GridSearchCV(RandomForestClassifier(
criterion='gini',
min_weight_fraction_leaf=0.0,
bootstrap=True,
n_jobs=-1,
random_state=1, verbose=0,
warm_start=False, class_weight='balanced',
ccp_alpha=0.0,
max_samples=None),
param_grid=param_grid,verbose=50,cv=2,n_jobs=-1,scoring='balanced_accuracy')
grid_search.fit(X_train, Y_train)
All the scores I see while the grid search is running are in the range 0.4-0.6.
The following is the output for the best score:
[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time= 15.4s
My question: when I manually calculate balanced_accuracy using
from sklearn.metrics import balanced_accuracy_score and running print('training accuracy', balanced_accuracy_score(grid_search.predict(X_train), Y_train,adjusted=False)), I get a value of about 0.96, which is very different from what GridSearchCV shows during the run. Why is this? And what does the score in GridSearchCV mean, then? Note that I passed scoring='balanced_accuracy' to GridSearchCV to make sure both calculate the same thing.
The score you get from GridSearchCV is the validation score: for each fold, it is measured on the part of X_train that was not used to train the model.
Your manually calculated score is the training score: the model is fitted and evaluated on the same data, X_train.
The large gap between the two is a sign of overfitting.
You can try changing param_grid:
param_grid = {
'n_estimators': [100, 200], # 500 seems high and might take too long for no reason
'max_features': ['sqrt','log2', None], # Fewer features can reduce overfitting
'max_depth': [3, 4, 5, 6 ], # Lower depth can reduce overfitting
'max_leaf_nodes': [4, 6, 8], # Lower max_leaf_nodes can reduce overfitting
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2],
'min_samples_leaf': [5, 10, 20] # Higher values can reduce overfitting
}
Also using cv=3 or cv=5 in GridSearchCV could help.
See this post about solving Random Forest overfitting.
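The difference between the two scores can be reproduced on synthetic data. The sketch below is only an illustration: the dataset from make_classification and the small grid are assumptions, not the asker's setup. Note also that balanced_accuracy_score takes its arguments as (y_true, y_pred), so the true labels go first.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV

# Noisy synthetic data (assumption) so that overfitting is easy to see.
X_train, Y_train = make_classification(n_samples=300, n_features=20,
                                       n_informative=4, flip_y=0.2,
                                       random_state=0)

grid_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=1),
    param_grid={"max_features": ["sqrt", None]},
    cv=2,
    scoring="balanced_accuracy",
)
grid_search.fit(X_train, Y_train)

# Mean score on the held-out folds for the best parameter combination.
validation_score = grid_search.best_score_

# Score of the refitted model on the very data it was trained on.
train_score = balanced_accuracy_score(Y_train, grid_search.predict(X_train))

print("validation score:", validation_score)
print("training score:  ", train_score)
```

The training score is computed on samples the forest has already seen, so it is expected to be higher than the cross-validated score.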

Ensembling KNeighbours and Decision Tree Using Voting Classifier

I have a classification problem for which I am trying to build an ensemble of two classifiers, for example KNeighbors and Decision Tree. In addition, I want to implement it using a Pipeline. This is my attempt:
steps = [('scaler', StandardScaler()),
('regressor', VotingClassifier(estimators=[
('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())],voting='soft'))]
pipeline = Pipeline(steps)
parameters = [{'knn__n_neighbors': np.arange(1, 50)}, {
'clf__n_estimators': [10, 20, 30],
'clf__criterion': ['gini', 'entropy'],
'clf__max_features': [5, 10, 15],
'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]
X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
Running this produces the following error:
Invalid parameter knn for estimator
Pipeline(steps=[('scaler', StandardScaler()),
('regressor', VotingClassifier(
estimators=[('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())
]
)
)
]
).
Check the list of available parameters with `estimator.get_params().keys()`.
I believe there is an error in how I have defined the parameter grid. Please help me out with this.
Since it's nested, you'll need to specify both prefixes, like this:
parameters = [{'regressor__knn__n_neighbors': np.arange(1, 5), # the "}, {" split was removed here; you probably want a single grid
'regressor__clf__n_estimators': [10, 20, 30],
'regressor__clf__criterion': ['gini', 'entropy'],
'regressor__clf__max_depth': [5, 10, 15],
'regressor__clf__max_features': ['log2', 'sqrt', None]}]
Also, your max_depth and max_features values had somehow swapped places, so I fixed that. (And 'auto' does the same as 'sqrt', at least in recent versions.)
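As the error message suggests, the valid names can be read straight from get_params(). The following sketch rebuilds the pipeline from the question with default estimators and lists the keys that mention n_neighbors; it only inspects parameter names and does not fit anything.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', VotingClassifier(
        estimators=[('knn', KNeighborsClassifier()),
                    ('clf', RandomForestClassifier())],
        voting='soft')),
])

# Every tunable parameter, with the full step prefix that GridSearchCV expects.
param_names = sorted(pipeline.get_params().keys())

# Keys for the nested KNN estimator carry both prefixes: step name, then estimator name.
knn_params = [name for name in param_names if 'n_neighbors' in name]
print(knn_params)
```

Any key used in param_grid has to appear in this list of names, which is why the un-prefixed 'knn__n_neighbors' is rejected.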

Trying to understand SKLearn regression predictions array

I'm very new to Python programming, and even newer to SKLearn and ML.
So please forgive my ignorance on these subjects.
I've started to experiment with SKLearn regression models and code, but hit a fundamental problem understanding the results of this experimental code.
Given the code below, I'm trying to figure out what the result of the LinearRegression model predict() function is, in relation to the hypothetical daily sales figures of an item, stored in the sales_data array.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
sales_data = [3, 7, 2, 4, 6, 8, 5, 10, 9, 6, 4, 7, 11, 6, 3, 1, 4, 5, 8, 10, 7] # May be a much larger array in int's
x_train = []
y_train = []
x_test = []
y_test = []
X = sales_data
Y = sales_data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, train_size = 0.75, random_state = 1)
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
lregressor = LinearRegression()
lregressor.fit(x_train, y_train)
lregressor_pred = lregressor.predict(x_test) # Trying to understand what the predicted array represents to sales_data
1) Does the predicted array represent possible outcomes for the next day's sales of the item?
2) Is the predicted array ordered from the most likely to the least likely sales figure?
If neither of the above is true, could you please explain in simple terms what the predicted array does represent, and how it could be used to forecast the next day's item sales, or guess the next integer that might be added to the sales_data array?
I've also used similar code with LogisticRegression and RandomForest regression models, but still don't understand the prediction results, and how to use them.
Many Thanks
1) Does the predicted array represent possible outcomes for the next day's sales of the item?
No. It's an array of predictions, one for each sample in x_test.
2) Is the predicted array ordered from the most likely to the least likely sales figure?
No. It's in the same order as the samples in x_test.
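To make this concrete, here is a minimal sketch that reuses the question's setup and prints each test input next to its prediction. Because the question uses the same list for X and Y, the fitted line is effectively y = x, so each prediction simply echoes its input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

sales_data = [3, 7, 2, 4, 6, 8, 5, 10, 9, 6, 4, 7, 11, 6, 3, 1, 4, 5, 8, 10, 7]

# Same split as in the question: features and targets are both sales_data.
x_train, x_test, y_train, y_test = train_test_split(
    sales_data, sales_data, test_size=0.25, random_state=1)
x_train = np.array(x_train).reshape(-1, 1)
x_test = np.array(x_test).reshape(-1, 1)

model = LinearRegression().fit(x_train, np.array(y_train))
pred = model.predict(x_test)

# One prediction per row of x_test, in the same order as x_test.
for x, p in zip(x_test.ravel(), pred):
    print(f"input={x}  prediction={p:.2f}")
```

The array therefore says nothing about the next day; it only maps each held-out input value to the model's output for that value.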
I think I've figured out the answer to my own question, but I'm still not sure I've got it right.
I'm sure eagle-eyed readers will point out the error of my ways, which would be appreciated.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
sales_data = [3, 7, 2, 4, 6, 8, 5, 10, 9, 6, 4, 7, 11, 6, 3, 1, 4, 5, 8, 10, 7] # May be a much larger array in int's
X = list(range(0, len(sales_data)))
Y = sales_data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, train_size = 0.75, random_state = 1)
x_train = np.array(x_train).reshape(-1, 1)
y_train = np.array(y_train)
x_test = np.array(x_test).reshape(-1, 1)
y_test = np.array(y_test)
x_day = len(sales_data) # Days are indexed 0..len(sales_data)-1, so the next day has index len(sales_data)
x_prediction = np.array([x_day]).reshape(-1, 1)
lregressor = LinearRegression().fit(x_train, y_train)
lregressor_pred = lregressor.predict(x_prediction)
lregressor_pred_list = list(np.rint(lregressor_pred)) # The returned list contains a prediction of next day's sales

Y label shape for time_distributed lstm

The shapes of my data (samples, window, number of features):
X_train (3620, 3, 43)
y_train (3620, 1)
X_test (905, 3, 43)
y_test (905, 1)
This is my model:
model = Sequential()
model.add(Bidirectional(LSTM(448, input_shape = (3, 43), activation = 'relu',
return_sequences=True)))
model.add(Dropout(dropout_rate1))
model.add(Bidirectional(LSTM(256, activation = 'relu', return_sequences = True)))
model.add(Dropout(dropout_rate2))
model.add(TimeDistributed(Dense(64, kernel_initializer = 'uniform',
activation = 'relu')))
model.add(TimeDistributed(Dense(nOut, kernel_initializer = 'uniform',
activation = 'linear',
kernel_regularizer = regularizers.l2(regu))))
model.compile(optimizer = 'adam', loss = 'mse', metrics = ['accuracy'])
net_history = model.fit(X_train, y_train, batch_size = batch_size, epochs = num_epochs,
verbose = 0, validation_split = val_split, shuffle =
True, callbacks = [best_model, early_stop])
I get this error:
ValueError:
Error when checking target: expected time_distributed_4 to have 3 dimensions, but got array with shape (3620, 1)
My X_train was built using a moving window of 3, so there are 3 steps of X for every 1 y_train label. The error seems to be telling me that y_train should have shape (3620, 3, 1). Did I read that right?
If so, what is the logic I should apply? Every 3 steps in X_train map to 1 y_train value, so how do I change that to 3 steps mapping to 3 y values? Should all 3 y values be the same? Here is an example to explain what I mean.
currently X_train =
[[[1, 2, 3 .....43]
[1, 2, 3 .....43]
[1, 2, 3 .....43]]
...
[[1, 2, 3 .....43]
[1, 2, 3 .....43]
[1, 2, 3 .....43]]]
currently y_train =
[[1].....[3620]]
should y_train become the below for it to work?
[[[1],[1],[1]].....[[3620],[3620],[3620]]]
Thanks a lot.
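The reshaping described in the question can be sketched with NumPy alone. The labels below are stand-in values with the question's shape (3620, 1), and whether repeating the label at every time step is the right modeling choice is a separate matter. An alternative design that keeps y_train at (3620, 1) is to set return_sequences=False on the last LSTM and use plain Dense layers instead of TimeDistributed; that variant is not shown here.

```python
import numpy as np

window = 3

# Stand-in labels (assumption) with the same shape as in the question.
y_train = np.arange(1, 3621).reshape(-1, 1)

# Insert a time axis, then repeat each label once per time step.
y_train_seq = np.repeat(y_train[:, np.newaxis, :], window, axis=1)

print(y_train_seq.shape)  # (3620, 3, 1)
print(y_train_seq[0].tolist())  # [[1], [1], [1]]
```

With this target, a TimeDistributed output layer is trained to predict the same label at each of the 3 steps.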

Dividing my dataset (CSV format) using stratified k-fold sampling and saving the output of each fold in a separate CSV file

My dataset has around 5000 samples and 3 classes (one hot encoded) and I am interested in creating samples using stratified K fold. Moreover, in the end, I want to split each output file (from the K fold) into train and test.
I tried the following suggestion from sklearn documentation but I want to retain the shape of my dataset.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Index the arrays with the row indices of each split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
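One way to keep the dataset's shape is to index a pandas DataFrame with the fold indices, so every column is carried along. The following is a rough sketch, not a definitive recipe: the column names, the synthetic data, and the output file names are all assumptions, and it uses StratifiedKFold rather than the StratifiedShuffleSplit shown above so that the folds do not overlap.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical stand-in for the real CSV: two features plus three one-hot class columns.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)  # 60 rows, 20 per class
class_cols = ["class_a", "class_b", "class_c"]  # assumed column names
df = pd.concat(
    [
        pd.DataFrame(rng.normal(size=(60, 2)), columns=["f1", "f2"]),
        pd.DataFrame(np.eye(3, dtype=int)[labels], columns=class_cols),
    ],
    axis=1,
)

# StratifiedKFold expects one label per row, so collapse the one-hot columns.
y = df[class_cols].to_numpy().argmax(axis=1)

out_dir = Path(tempfile.mkdtemp())  # replace with a real output directory
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for i, (_, fold_idx) in enumerate(skf.split(df, y)):
    fold = df.iloc[fold_idx]  # all original columns are kept
    fold.to_csv(out_dir / f"fold_{i}.csv", index=False)
    # Split this fold again into train and test, still stratified by class.
    train, test = train_test_split(
        fold, test_size=0.25, random_state=0, stratify=y[fold_idx]
    )
    train.to_csv(out_dir / f"fold_{i}_train.csv", index=False)
    test.to_csv(out_dir / f"fold_{i}_test.csv", index=False)
    fold_sizes.append(len(fold))
```

Each fold here is the held-out part of one StratifiedKFold split, so the folds are disjoint and together cover every row once.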
