GridSearch / make_scorer strange results with xgboost model - scikit-learn

I am trying to use sklearn's grid search with a model created by xgboost. To do this, I am creating a custom scorer based on NDCG evaluation. I can successfully use Snippet 1, but it is too messy / hacky; I would prefer to use good old sklearn to simplify the code. I tried to implement GridSearchCV and the result is completely off: for the same X and y sets I get NDCG@k = 0.8 with Snippet 1 versus 0.5 with Snippet 2. Obviously there is something I am not doing right here...
The following pieces of code return very different results:
Snippet 1:
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
max_depth = [6]
learning_rate = [0.22]
n_estimators = [43]
reg_alpha = [0.1]
reg_lambda = [10]
for md in max_depth:
    for lr in learning_rate:
        for ne in n_estimators:
            for ra in reg_alpha:
                for rl in reg_lambda:
                    xgb = XGBClassifier(objective='multi:softprob',
                                        max_depth=md,
                                        learning_rate=lr,
                                        n_estimators=ne,
                                        reg_alpha=ra,
                                        reg_lambda=rl,
                                        subsample=0.6, colsample_bytree=0.6, seed=0)
                    print([md, lr, ne])
                    score = []
                    for train_index, test_index in kf:
                        X_train, X_test = X[train_index], X[test_index]
                        y_train, y_test = y[train_index], y[test_index]
                        xgb.fit(X_train, y_train)
                        y_pred = xgb.predict_proba(X_test)
                        score.append(ndcg_scorer(y_test, y_pred))
                    print('all scores: %s' % score)
                    print('average score: %s' % np.mean(score))
Snippet 2:
from sklearn.grid_search import GridSearchCV
params = {
    'max_depth': [6],
    'learning_rate': [0.22],
    'n_estimators': [43],
    'reg_alpha': [0.1],
    'reg_lambda': [10],
    'subsample': [0.6],
    'colsample_bytree': [0.6]
}
xgb = XGBClassifier(objective='multi:softprob',seed=0)
scorer = make_scorer(ndcg_scorer, needs_proba=True)
gs = GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False)
gs.fit(X,y)
gs.best_score_
While Snippet 1 gives me the expected result, the score returned by Snippet 2 is not consistent with ndcg_scorer.

The problem is with cv in GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False). It can receive a KFold / StratifiedKFold object instead of an int. Contrary to what the docs say, it seems that by default an argument of type int does not call StratifiedKFold but another splitter, maybe KFold.
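A minimal sketch of that fix, assuming the xgb, params, scorer, X and y objects defined in Snippet 2 and the old sklearn.cross_validation / sklearn.grid_search API used above:
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
# pass the same StratifiedKFold object used in Snippet 1 instead of the int 5
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
gs = GridSearchCV(xgb, params, cv=kf, scoring=scorer, verbose=10, refit=False)
gs.fit(X, y)
print(gs.best_score_)
With this, the grid search should reproduce the folds (and hence the scores) of Snippet 1.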

Related

Unable to calculate Model performance for Decision Tree Regressor

Although my code runs fine on repl and gives me results, it fails miserably on the Katacoda testing environment.
I am attaching the repl file here for your review as well; it also contains the question, commented just above the code I have written.
Kindly review and let me know what mistakes I am making here.
Repl Link
https://repl.it/repls/WarmRobustOolanguage
Also sharing the code below.
The question instructions are included as comments.
# Import two modules sklearn.datasets, and sklearn.model_selection.
# Import numpy and set random seed to 100.
# Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
# Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test.
# Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30.
# Print the shape of X_train dataset.
# Print the shape of X_test dataset.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(100)
max_depth = range(2, 6)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
# Import required module from sklearn.tree.
# Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
# Evaluate the model accuracy on training data set and print its score.
# Evaluate the model accuracy on testing data set and print its score.
# Predict the housing price for first two samples of X_test set and print them. (Hint: Use predict() function)
dt_reg = DecisionTreeRegressor(random_state=1)
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', cross_val_score(dt_reg, X_train, Y_train, cv=10))
print('Accuracy of Test Data :', cross_val_score(dt_reg, X_test, Y_test, cv=10))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
# Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to 5.
# Evaluate each model accuracy on testing data set.
# Hint: Make use of for loop
# Print the max_depth value of the model with highest accuracy.
dt_reg = DecisionTreeRegressor()
random_grid = {'max_depth': max_depth}
dt_random = RandomizedSearchCV(estimator=dt_reg, param_distributions=random_grid,
                               n_iter=90, cv=3, verbose=2, random_state=42, n_jobs=-1)
dt_random.fit(X_train, Y_train)
dt_random.best_params_
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy
best_random = dt_random.best_estimator_
random_accuracy = evaluate(best_random, X_test,Y_test)
print("Accuracy Scores of the Model ",random_accuracy)
best_parameters = dt_random.best_params_['max_depth']
print(best_parameters)
The question is asking for default values. Try removing random_state=1.
Current line:
dt_reg = DecisionTreeRegressor(random_state=1)
Updated line:
dt_reg = DecisionTreeRegressor()
I think it should work!
# ================================================================================
# Machine Learning Using Scikit-Learn | 3 | Decision Trees
# ================================================================================
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(100)
# Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
boston = datasets.load_boston()
# print(boston)
# Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(boston.data, boston.target, random_state=30)
# Print the shape of X_train dataset
print(X_train.shape)
# Print the shape of X_test dataset.
print(X_test.shape)
# Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg
dt_Regressor = DecisionTreeRegressor()
dt_reg = dt_Regressor.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
# Get the max depth
maxdepth = 2
maxscore = 0
for x in range(2, 6):
    dt_Regressor = DecisionTreeRegressor(max_depth=x)
    dt_reg = dt_Regressor.fit(X_train, Y_train)
    score = dt_reg.score(X_test, Y_test)
    if maxscore < score:
        maxdepth = x
        maxscore = score
print(maxdepth)

How to test unseen test data with cross validation and predict labels?

1. The CSV that contains data (i.e. text description) along with categorized labels
df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']
2. This CSV contains unseen data (i.e. text description) for which labels need to be predicted
df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']
Cross-validation function that operates on the training data (item #1) above:
def cross_val():
    cv = KFold(n_splits=20)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(X)
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    scores = cross_val_score(clf, X_train, y, cv=cv)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
cross_val()
I need to know how to pass the unseen data (item #2) to the cross-validation function and how to predict the labels.
Using scores = cross_val_score(clf, X_train, y, cv=cv) you can only get the cross-validated scores of the model. cross_val_score will internally split the data into training and testing based on the cv parameter.
So the values that you get are the cross-validated accuracy of the SVC.
To get the score on the unseen data, you can first fit the model e.g.
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y) # the model is trained now
and then do clf.score(X_unseen, y_unseen)
This will return the accuracy of the model on the unseen data, assuming you have its true labels.
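If the unseen data has no labels at all (as with X2 in the question), call predict instead of score. A minimal sketch, assuming the X, y and X2 variables from the question and keeping the vectorizer inside the pipeline so the unseen text gets the exact same transform:
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
# vectorizer + scaler + SVC in one pipeline, fitted only on the labeled data (item #1)
clf = make_pipeline(TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english'),
                    preprocessing.StandardScaler(with_mean=False),
                    svm.SVC(C=1))
clf.fit(X, y)
predicted_labels = clf.predict(X2)  # labels for the unseen descriptions (item #2)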
EDIT: The best way to do what you want is the following: use a GridSearch to first find the best model using the training data, and then evaluate that best model using the unseen (test) data:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# load some data
iris = datasets.load_iris()
X, y = iris.data, iris.target
#split data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# hyperparameter tuning of the SVC model
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
# fit the GridSearch using the TRAINING data
grid_searcher = GridSearchCV(svc, parameters)
grid_searcher.fit(X_train, y_train)
#recover the best estimator (best parameters for the SVC, based on the GridSearch)
best_SVC_model = grid_searcher.best_estimator_
# Now, check how this best model behaves on the test set
cv_scores_on_unseen = cross_val_score(best_SVC_model, X_test, y_test, cv=5)
print(cv_scores_on_unseen.mean())

How to plot the learning curves in lightgbm and Python?

I have trained a lightgbm model and I would like to plot the learning curves. How can I do that? In Keras, for example, the history object returns the metrics so that I can plot them once training is over. How is this task handled here?
My code is the following:
def f_lgboost(data, params):
    model = lgb.LGBMClassifier(**params)
    X_train = data['X_train']
    y_train = data['y_train']
    X_dev = data['X_dev']
    y_dev = data['y_dev']
    X_test = data['X_test']
    categorical_feature = ['Ticker_code', 'Category_code']
    X_train[categorical_feature] = X_train[categorical_feature].astype('category')
    X_dev[categorical_feature] = X_dev[categorical_feature].astype('category')
    X_test[categorical_feature] = X_test[categorical_feature].astype('category')
    feature_name = X_train.columns.to_list()
    model.fit(X_train, y_train, eval_set=[(X_dev, y_dev)], eval_metric='auc', early_stopping_rounds=20,
              categorical_feature=categorical_feature, feature_name=feature_name)
    y_pred_train = model.predict_proba(X_train)[:, 1].ravel()
    y_pred_dev = model.predict_proba(X_dev)[:, 1].ravel()
    from sklearn.metrics import roc_auc_score
    auc_train = roc_auc_score(y_train, y_pred_train)
    auc_dev = roc_auc_score(y_dev, y_pred_dev)
    from sklearn.metrics import precision_recall_fscore_support
    precision, recall, fscore, support = precision_recall_fscore_support(y_dev, (y_pred_dev > 0.5).astype(int), beta=0.5)
    y_pred_test = model.predict_proba(X_test)[:, 1].ravel()
    print(f'auc_train: {auc_train}, auc_dev : {auc_dev}, precision : {precision}, recall: {recall}, fscore : {fscore}')
    Results = {
        'params': params,
        'data': data,
        'lg_boost_model': bst,
        'y_pred_train': y_pred_train,
        'y_pred_dev': y_pred_dev,
        'y_pred_test': y_pred_test,
        'auc_train': auc_train,
        'auc_dev': auc_dev,
        'precision_dev': precision,
        'recall_dev': recall,
        'fscore_dev': fscore,
        'support_dev': support
    }
    return Results
In the scikit-learn API, the learning curves are available via the attribute lightgbm.LGBMModel.evals_result_. It will include the metrics computed on the datasets specified in the eval_set argument of the fit method (so you would normally want to specify both the training and the validation sets there). There is also a built-in plotting function, lightgbm.plot_metric, which accepts model.evals_result_ or the model directly.
Here is a complete minimal example:
import lightgbm as lgb
import sklearn.datasets, sklearn.model_selection
X, y = sklearn.datasets.load_boston(return_X_y=True)
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X, y, random_state=7054)
model = lgb.LGBMRegressor(objective='mse', seed=8798, num_threads=1)
model.fit(X_train, y_train, eval_set=[(X_val, y_val), (X_train, y_train)], verbose=10)
lgb.plot_metric(model)
The resulting plot shows the evaluation metric on both the training and validation sets over the boosting iterations.

Error encountered: Classification metrics can't handle a mix of multiclass-multioutput and binary targets

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
file = './BBC.csv'
df = read_csv(file)
array = df.values
X = array[:, 0:11]
Y = array[:, 11]
test_size = 0.30
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = RandomForestClassifier()
model.fit(X_train, Y_train)
result = model.score(X_test, X_test)
print("Accuracy: %.3f%%") % (result*100.0)
dataset: https://www.dropbox.com/s/ar1c9yuv5x774cv/BBC.csv?dl=0
I have encountered this error:
Classification metrics can't handle a mix of multiclass-multioutput and binary targets
If I'm not wrong, RandomForest should be able to handle both classes (classification) and continuous values (regression). Am I wrong?
Edit:
I checked your dataset. For a classification task, the problem lies in your code:
result = model.score(X_test, X_test)
Note that the parameters here should be X_test and Y_test.
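With the question's variable names, the corrected call would be:
result = model.score(X_test, Y_test)  # score() expects the test features and the true test labels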
-----kind of off-topic-----
If you want to use RandomForest for regression, you should probably use RandomForestRegressor instead.
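A minimal sketch of that regression variant, assuming a continuous target; the variable names mirror the question and are only illustrative:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(random_state=seed)
reg.fit(X_train, Y_train)
print("R^2 on test set: %.3f" % reg.score(X_test, Y_test))  # for regressors, score() returns R^2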

Why is Random Search showing better results than Grid Search?

I'm playing with the RandomizedSearchCV function from scikit-learn. Some academic papers claim that Randomized Search can provide 'good enough' results compared with a whole grid search, while saving a lot of time.
Surprisingly, on one occasion, RandomizedSearchCV provided me better results than GridSearchCV. I think GridSearchCV is supposed to be exhaustive, so the result has to be better than RandomizedSearchCV's, supposing they search through the same grid.
For the same dataset and mostly the same settings, GridSearchCV returned the following result:
Best cv accuracy: 0.7642857142857142
Test set score: 0.725
Best parameters: 'C': 0.02
RandomizedSearchCV returned the following result:
Best cv accuracy: 0.7428571428571429
Test set score: 0.7333333333333333
Best parameters: 'C': 0.008
To me the test score of 0.733 is better than 0.725, and the difference between the test score and the training score for RandomizedSearchCV is smaller, which to my knowledge means less overfitting.
So why did GridSearchCV return worse results?
GridSearchCV code:
def linear_SVC(x, y, param, kfold):
    param_grid = {'C': param}
    k = KFold(n_splits=kfold, shuffle=True, random_state=0)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    return grid.fit(x, y)
#high C means more chance of overfitting
start = timer()
param = [i/1000 for i in range(1,1000)]
param1 = [i for i in range(1,101)]
param.extend(param1)
#progress = progressbar.bar.ProgressBar()
clf = linear_SVC(x=x_train, y=y_train, param=param, kfold=3)
print('LinearSVC:')
print('Best cv accuracy: {}' .format(clf.best_score_))
print('Test set score: {}' .format(clf.score(x_test, y_test)))
print('Best parameters: {}' .format(clf.best_params_))
print()
duration = timer() - start
print('time to run: {}' .format(duration))
RandomizedSearchCV code:
from sklearn.model_selection import RandomizedSearchCV
def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)
    return randsearch.fit(x, y)
start = timer()
param = [i/1000 for i in range(1,1000)]
param1 = [i for i in range(1,101)]
param.extend(param1)
#progress = progressbar.bar.ProgressBar()
clf = Linear_SVC_Rand(x=x_train, y=y_train, param=param, kfold=3, n=100)
print('LinearSVC:')
print('Best cv accuracy: {}' .format(clf.best_score_))
print('Test set score: {}' .format(clf.score(x_test, y_test)))
print('Best parameters: {}' .format(clf.best_params_))
print()
duration = timer() - start
print('time to run: {}' .format(duration))
First, try to understand this:
https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
So you should know that StratifiedKFold is better than KFold.
Use StratifiedKFold in both GridSearchCV and RandomizedSearchCV. Make sure to set shuffle=False and not to use the random_state parameter. That way the dataset you are using will not be shuffled, so your results won't change each time you train. You might get what you expect.
GridSearchCV code:
def linear_SVC(x, y, param, kfold):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    return grid.fit(x, y)
RandomizedSearchCV code:
def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C': param}
    k = StratifiedKFold(n_splits=kfold)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)
    return randsearch.fit(x, y)
From what it looks like, it is performing properly. Your train/CV set accuracy in grid search is higher than the train/CV set accuracy in randomized search. The hyperparameters should not be tuned using the test set, so assuming you're doing that properly, it might just be a coincidence that the hyperparameters chosen by randomized search performed better on the test set.
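A minimal sketch of that selection logic, assuming the linear_SVC / Linear_SVC_Rand helpers and the x_train, y_train, x_test, y_test, param variables from the snippets above:
# pick the search with the higher cross-validated score on the TRAINING data,
# and only then look at the held-out test score of that single choice
grid_clf = linear_SVC(x=x_train, y=y_train, param=param, kfold=3)
rand_clf = Linear_SVC_Rand(x=x_train, y=y_train, param=param, kfold=3, n=100)
best = grid_clf if grid_clf.best_score_ >= rand_clf.best_score_ else rand_clf
print('chosen C: {}'.format(best.best_params_['C']))
print('final test score: {}'.format(best.score(x_test, y_test)))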

Resources