How to plot the learning curves in lightgbm and Python? - python-3.x

I have trained a lightgbm model and I would like to plot the learning curves. How can I do that? In Keras, for example, fit returns a History object with the metrics so that I can plot them once training is over. How is this handled here?
My code is the following:
def f_lgboost(data, params):
    model = lgb.LGBMClassifier(**params)
    X_train = data['X_train']
    y_train = data['y_train']
    X_dev = data['X_dev']
    y_dev = data['y_dev']
    X_test = data['X_test']
    categorical_feature = ['Ticker_code', 'Category_code']
    X_train[categorical_feature] = X_train[categorical_feature].astype('category')
    X_dev[categorical_feature] = X_dev[categorical_feature].astype('category')
    X_test[categorical_feature] = X_test[categorical_feature].astype('category')
    feature_name = X_train.columns.to_list()
    model.fit(X_train, y_train, eval_set=[(X_dev, y_dev)], eval_metric='auc', early_stopping_rounds=20,
              categorical_feature=categorical_feature, feature_name=feature_name)
    y_pred_train = model.predict_proba(X_train)[:, 1].ravel()
    y_pred_dev = model.predict_proba(X_dev)[:, 1].ravel()
    from sklearn.metrics import roc_auc_score
    auc_train = roc_auc_score(y_train, y_pred_train)
    auc_dev = roc_auc_score(y_dev, y_pred_dev)
    from sklearn.metrics import precision_recall_fscore_support
    precision, recall, fscore, support = precision_recall_fscore_support(y_dev, (y_pred_dev > 0.5).astype(int), beta=0.5)
    y_pred_test = model.predict_proba(X_test)[:, 1].ravel()
    print(f'auc_train: {auc_train}, auc_dev: {auc_dev}, precision: {precision}, recall: {recall}, fscore: {fscore}')
    Results = {
        'params': params,
        'data': data,
        'lg_boost_model': model,  # was `bst`, which is undefined in this function
        'y_pred_train': y_pred_train,
        'y_pred_dev': y_pred_dev,
        'y_pred_test': y_pred_test,
        'auc_train': auc_train,
        'auc_dev': auc_dev,
        'precision_dev': precision,
        'recall_dev': recall,
        'fscore_dev': fscore,
        'support_dev': support
    }
    return Results

In the scikit-learn API, the learning curves are available via the attribute lightgbm.LGBMModel.evals_result_. It contains the metrics computed on the datasets specified in the eval_set argument of the fit method (so you would normally want to pass both the training and the validation sets there). There is also a built-in plotting function, lightgbm.plot_metric, which accepts either model.evals_result_ or the fitted model directly.
Here is a complete minimal example:
import lightgbm as lgb
import sklearn.datasets, sklearn.model_selection
X, y = sklearn.datasets.load_boston(return_X_y=True)
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X, y, random_state=7054)
model = lgb.LGBMRegressor(objective='mse', seed=8798, num_threads=1)
model.fit(X_train, y_train, eval_set=[(X_val, y_val), (X_train, y_train)], verbose=10)
lgb.plot_metric(model)
The resulting plot shows the l2 metric on both the training and validation sets against the number of boosting iterations.
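If you prefer to build the plot yourself (for instance for the LGBMClassifier in the question), here is a minimal sketch using matplotlib on evals_result_; the exact keys ('training', 'valid_0', metric names) depend on the eval_set and eval_metric you passed to fit:
import matplotlib.pyplot as plt

results = model.evals_result_            # {dataset name: {metric name: [value per iteration]}}
for dataset_name, metrics in results.items():
    for metric_name, values in metrics.items():
        plt.plot(values, label=f'{dataset_name} {metric_name}')
plt.xlabel('Boosting iteration')
plt.ylabel('Metric value')
plt.legend()
plt.show()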

Related

Prediction with linear regression is very inaccurate

This is the csv I'm using: https://gist.github.com/netj/8836201. I'm trying to predict the variety, which is categorical data, with linear regression, but the prediction is very inaccurate. The actual labels are just combinations of 0.0 and 1.0, but the predictions are fractional and even negative numbers, which in my opinion is very inaccurate. Where did I make a mistake, and what is the solution for this inaccuracy? This is an assignment my teacher gave me; he said we could predict categorical data with linear regression, not only logistic regression.
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn import metrics
path= r"D:\python projects\iris.csv"
df = pd.read_csv(path)
array = df.values
X = array[:,0:3]
y = array[:,4]
le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder(categorical_features=[0])
y = le.fit_transform(y)
y = y.reshape(-1,1)
y = ohe.fit_transform(y).toarray()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
model = LinearRegression(n_jobs=-1).fit(X_train, y_train)
y_pred = model.predict(X_test)
df = pd.DataFrame({'Actual': X_test.flatten(), 'Predicted': y_pred.flatten()})
the output :
y_pred
Out[46]:
array([[-0.08676055, 0.43120144, 0.65555911],
[ 0.11735424, 0.72384335, 0.1588024 ],
[ 1.17081347, -0.24484483, 0.07403136],
X_test
Out[61]:
array([[-0.09544771, -0.58900572, 0.72247648],
[ 0.14071157, -1.98401928, 0.10361279],
[-0.44968663, 2.66602591, -1.35915595],
Linear Regression is used to predict continuous output data. As you correctly said, you are trying to predict categorical (discrete) output data. Essentially, you want to be doing classification instead of regression - linear regression is not appropriate for this.
As you also said, logistic regression can and should be used instead as it is applicable to classification tasks.
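For illustration, here is a minimal sketch of the classification approach with LogisticRegression, assuming the same iris.csv layout as above (feature columns first, the variety label in the last column); the file path is the one from the question:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv(r"D:\python projects\iris.csv")
X = df.iloc[:, :-1].values                # all feature columns
y = df.iloc[:, -1].values                 # the categorical label (variety)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)       # scale the features only, never the labels
X_test = sc.transform(X_test)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)              # predictions are class labels, not fractional numbers
print(accuracy_score(y_test, y_pred))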

GridSearchCV with Invalid parameter gamma for estimator LogisticRegression

I need to perform a grid search on the parameters listed below for a Logistic Regression classifier, using recall for scoring and three-fold cross-validation.
The data is in a csv file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
I have grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]}
I need to apply the L1 and L2 penalties in a Logistic Regression.
I can't verify whether the scoring runs, because I get the following error:
Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
This is my code:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]}
    # train the model with many parameters for "C" and penalty='l1'
    lr_l1 = LogisticRegression(penalty='l1')
    grid_lr_l1 = GridSearchCV(lr_l1, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_l1.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l1.decision_function(X_test)
    lr_l2 = LogisticRegression(penalty='l2')
    grid_lr_l2 = GridSearchCV(lr_l2, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_l2.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l2.decision_function(X_test)
    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    results = pd.DataFrame()
    results['l1_results'] = pd.DataFrame(grid_lr_l1.cv_results_)
    results['l1_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    results['l2_results'] = pd.DataFrame(grid_lr_l2.cv_results_)
    results['l2_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    return results
LogisticR_penalty()
From .cv_results_ I expected to get the average test score of each parameter combination, which I thought would be available as mean_test_precision_score, but I'm not sure.
The output is: ValueError: Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
The error message contains the answer to your question. You can use estimator.get_params().keys() to see all available parameters for your estimator:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
print(lr.get_params().keys())
Output:
dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])
According to scikit-learn's documentation, LogisticRegression has no parameter gamma, but it does have a parameter C that controls the regularization strength.
If you change grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]} to grid_values = {'C':[0.01, 0.1, 1, 10, 100]}, your code should work.
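For reference, a minimal sketch of the corrected grid search (reusing X_train and y_train from the question; the default l2 penalty is assumed):
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

grid_values = {'C': [0.01, 0.1, 1, 10, 100]}      # C instead of gamma
grid_lr = GridSearchCV(LogisticRegression(), param_grid=grid_values, cv=3, scoring='recall')
grid_lr.fit(X_train, y_train)
print(grid_lr.best_params_, grid_lr.best_score_)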
My code contained some errors; the main one was using param_grid incorrectly. I had to search over the L1 and L2 penalties together with the values 0.01, 0.1, 1, 10, 100 (which belong to C, since LogisticRegression has no gamma). The right way to do this is:
grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
Then it was necessary to correct the way I was training my logistic regression, and the way I retrieved the scores from cv_results_ and averaged them.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
    # train the model over both penalties and several values of "C"
    lr = LogisticRegression()  # with newer scikit-learn, solver='liblinear' is needed for the 'l1' penalty
    # We use GridSearchCV to find the value in the grid that optimizes a given metric.
    grid_lr_recall = GridSearchCV(lr, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_recall.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_recall.decision_function(X_test)
    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    CVresults = pd.DataFrame(grid_lr_recall.cv_results_)
    # test scores of the three folds and their mean per parameter combination
    split_test_scores = np.vstack((CVresults['split0_test_score'],
                                   CVresults['split1_test_score'],
                                   CVresults['split2_test_score']))
    mean_scores = split_test_scores.mean(axis=0).reshape(5, 2)
    return mean_scores
LogisticR_penalty()
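As a side note, cv_results_ already exposes the mean over the three splits, so inside the function the vstack / reshape step could be replaced by something like the following sketch (same grid_lr_recall variable as above):
    mean_scores = pd.DataFrame(grid_lr_recall.cv_results_)[['param_penalty', 'param_C', 'mean_test_score']]
    return mean_scores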

Perform cross validation without cross_val_score

In order to have full access to the inner and outer scores, I would like to create a nested cross-validation and grid search without using cross_val_score.
I have followed examples I found online, like this one: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb.
I have doubts about whether the inner loop is OK. I am not sure if I have to split the data myself before calling GridSearchCV:
for train_index_inner, test_index_inner in inner_cv.split(X_train_outer, y_train_outer):
    X_train_inner = X_train_outer[train_index_inner]
    y_train_inner = y_train_outer[train_index_inner]
    X_test_inner = X_train_outer[test_index_inner]
    y_test_inner = y_train_outer[test_index_inner]
    # inner cross-validation
    for name, gs_est in sorted(gridcvs.items()):
        #print(gs_est)
        gs_est.fit(X_train_inner, y_train_inner)
        y_pred = gs_est.predict(X_test_inner)
        #print(y_test_inner.shape)
        inner_score = r2_score(y_true=y_test_inner, y_pred=y_pred)
        cv_scores[name].append(inner_score)
        #for mean_score, params in zip(gs_est.cv_results_['mean_test_score'],
        #                              gs_est.cv_results_['params']):
        #    print(name, params, mean_score)
print('print cvscores for model:', cv_scores)
outer_counter = outer_counter + 1
The whole code:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import operator

perf_list = []      # list with the performance
hp_list = []        # hyperparameter list
algo_familiy = []   # algorithm family list
##################################################################################################
randomState = 1
average_scores_across_outer_folds_for_each_model = dict()
X, y = make_regression(n_samples=1000, n_features=10)
##################################################################################################
# Create X_gtest, y_gtest = TEST SET
# Create X_train, y_train = TRAIN & VALIDATION SET
X_train, X_gtest, y_train, y_gtest = train_test_split(X, y, train_size=0.8, random_state=randomState)
print(X_train.shape)
#print(X_train.shape)
#print(X_gtest.shape)
#print(y_train.shape)
#print(y_gtest.shape)
##################################################################################################
##################################################################
# Regressors you want to use
reg1 = KNeighborsRegressor()
reg2 = RandomForestRegressor()
# Building the pipelines (Transformer, Regressor)
pipe1 = Pipeline([('std', StandardScaler()),
                  ('reg1', reg1)])
pipe2 = Pipeline([('std', StandardScaler()),
                  ('reg2', reg2)])
# Setting up parameters for the grids
param_grid1 = [{'reg1__n_neighbors': list(range(7, 10))}]
param_grid2 = [{'reg2__max_depth': [50, 20]}]
# outer cross-validation
outer_counter = 1
outer_cv = KFold(n_splits=3, shuffle=True)
inner_cv = KFold(n_splits=2, shuffle=True, random_state=randomState)
##################################################################
###########################
gridcvs = {}
for pgrid, est, name in zip((param_grid1, param_grid2),
                            (pipe1, pipe2),
                            ('KNN', 'RF')):
    regressor_that_optimizes_its_hyperparams = GridSearchCV(estimator=est,
                                                            param_grid=pgrid,
                                                            scoring='r2',
                                                            n_jobs=1,
                                                            cv=inner_cv,
                                                            verbose=0,
                                                            refit=True)
    gridcvs[name] = regressor_that_optimizes_its_hyperparams
##################################################################
##################################################################
for train_index_outer, test_index_outer in outer_cv.split(X_train, y_train):
    print('outer_cv', outer_counter)
    X_train_outer = X_train[train_index_outer]
    y_train_outer = y_train[train_index_outer]
    X_test_outer = X_train[test_index_outer]
    y_test_outer = y_train[test_index_outer]
    # print(X_train_outer.shape)
    # print(X_test_outer.shape)
    cv_scores = {name: [] for name, gs_est in gridcvs.items()}
    for train_index_inner, test_index_inner in inner_cv.split(X_train_outer, y_train_outer):
        X_train_inner = X_train_outer[train_index_inner]
        y_train_inner = y_train_outer[train_index_inner]
        X_test_inner = X_train_outer[test_index_inner]
        y_test_inner = y_train_outer[test_index_inner]
        # inner cross-validation
        for name, gs_est in sorted(gridcvs.items()):
            #print(gs_est)
            gs_est.fit(X_train_inner, y_train_inner)
            y_pred = gs_est.predict(X_test_inner)
            #print(y_test_inner.shape)
            inner_score = r2_score(y_true=y_test_inner, y_pred=y_pred)
            cv_scores[name].append(inner_score)
            #for mean_score, params in zip(gs_est.cv_results_['mean_test_score'],
            #                              gs_est.cv_results_['params']):
            #    print(name, params, mean_score)
    print('print cvscores for model:', cv_scores)
    outer_counter = outer_counter + 1
# Looking at the results
#####################################################################
for name in cv_scores:
    print('%-8s | outer CV acc. %.2f%% +/- %.3f' % (
        name, 100 * np.mean(cv_scores[name]), 100 * np.std(cv_scores[name])))
many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)
# Fitting a model to the whole training set
# using the "best" algorithm
best_algo = gridcvs['RF']
best_algo.fit(X_train, y_train)
train_acc = r2_score(y_true=y_train, y_pred=best_algo.predict(X_train))
test_acc = r2_score(y_true=y_gtest, y_pred=best_algo.predict(X_gtest))
print('Accuracy %.2f%% (average over CV test folds)' %
      (100 * best_algo.best_score_))
print('Best Parameters: %s' % gridcvs['RF'].best_params_)
print('Training Accuracy: %.2f%%' % (100 * train_acc))
print('Test Accuracy: %.2f%%' % (100 * test_acc))
# Fitting a model to the whole dataset
# using the "best" algorithm and hyperparameter settings
best_clf = best_algo.best_estimator_
final_model = best_clf.fit(X, y)
In general, you can obtain nested cross-validation with the code you posted:
for train_index_outer, test_index_outer in outer_cv.split(X_train, y_train):
    print('outer_cv', outer_counter)
    X_train_outer = X_train[train_index_outer]
    y_train_outer = y_train[train_index_outer]
    X_test_outer = X_train[test_index_outer]
    y_test_outer = y_train[test_index_outer]
    for train_index_inner, test_index_inner in inner_cv.split(X_train_outer, y_train_outer):
        X_train_inner = X_train_outer[train_index_inner]
        y_train_inner = y_train_outer[train_index_inner]
        X_test_inner = X_train_outer[test_index_inner]
        y_test_inner = y_train_outer[test_index_inner]
        # fit something on X_train_inner
        # evaluate it on X_test_inner
Or you could do the following:
If you pass inner_cv as the cv argument of GridSearchCV, then GridSearchCV will automatically perform the inner split when you call its .fit() method. When the fit is complete you can explore .cv_results_ to get each model's score on the automatically generated inner folds.
for train_index_outer, test_index_outer in outer_cv.split(X_train, y_train):
    X_train_outer = X_train[train_index_outer]
    y_train_outer = y_train[train_index_outer]
    X_test_outer = X_train[test_index_outer]
    y_test_outer = y_train[test_index_outer]
    cv = GridSearchCV(estimator=est,
                      param_grid=pgrid,
                      scoring='r2',
                      n_jobs=1,
                      cv=inner_cv,
                      verbose=0,
                      refit=True)
    cv.fit(X_train_outer, y_train_outer)
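A follow-up sketch of what could come right after cv.fit(...) inside the outer loop (it assumes import pandas as pd at the top of the script; est, pgrid and inner_cv are the names from your code):
    inner_results = pd.DataFrame(cv.cv_results_)            # one row per parameter combination
    print(inner_results[['params', 'mean_test_score']])     # mean inner-fold score of each candidate
    # with refit=True the search is refit on the whole outer-training fold,
    # so it can also be scored on the held-out outer test fold
    print('outer fold r2:', cv.score(X_test_outer, y_test_outer))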

Python different results for manual and cross_val_score prediction

I have one question: I'm trying to implement KFold and cross_val_score.
My goal is to calculate mean_squared_error, and for this purpose I used the following code:
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
x = np.random.random((10000,20))
y = np.random.random((10000,1))
x_train = x[7000:]
y_train = y[7000:]
x_test = x[:7000]
y_test = y[:7000]
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train)
y_predicted = Model.predict(x_test)
MSE = mean_squared_error(y_test,y_predicted)
print(MSE)
kfold = KFold(n_splits = 100, random_state = None, shuffle = False)
results = cross_val_score(Model,x,y,cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())
I think it's all right here, I got the following results:
Results: 0.0828856459279 and -0.083069435946
But when I try to do this on another example (data from the Kaggle House Prices competition), it does not work properly, at least I think so:
train = pd.read_csv('train.csv')
# insert missing values, etc.
# ...
train = pd.get_dummies(train)
y = train['SalePrice']
train = train.drop(['SalePrice'], axis = 1)
x_train = train[:1000].values.reshape(-1,339)
y_train = y[:1000].values.reshape(-1,1)
y_train_normal = np.log(y_train)
x_test = train[1000:].values.reshape(-1,339)
y_test = y[1000:].values.reshape(-1,1)
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train_normal)
y_predicted = Model.predict(x_test)
y_predicted_transform = np.exp(y_predicted)
MSE = mean_squared_error(y_test, y_predicted_transform)
print(MSE)
kfold = KFold(n_splits = 10, random_state = None, shuffle = False)
results = cross_val_score(Model,train,y, cv = kfold, scoring = "neg_mean_squared_error")
print(results.mean())
Here I get the following results: 0.912874946869 and -6.16986926564e+16
Apparently, the mean_squared_error calculated 'manually' is not the same as the one calculated with KFold and cross_val_score.
Where did I make a mistake?
The discrepancy is because, in contrast to your first approach (training/test set), in your CV approach you use the unnormalized y data for fitting the regression, hence your huge MSE. To get comparable results, you should do the following:
y_normal = np.log(y)
y_test_normal = np.log(y_test)
MSE = mean_squared_error(y_test_normal, y_predicted) # NOT y_predicted_transform
results = cross_val_score(Model, train, y_normal, cv = kfold, scoring = "neg_mean_squared_error")
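Put together, a minimal sketch of the comparison on a consistent (log) scale, assuming the variables (train, x_train, x_test, y_train, y_test, kfold) and imports from your snippets:
y_normal = np.log(y)                                   # log-transform the whole target once
Model = linear_model.LinearRegression()
Model.fit(x_train, np.log(y_train))                    # fit on the log-scale target
y_predicted = Model.predict(x_test)
MSE = mean_squared_error(np.log(y_test), y_predicted)  # manual MSE, log scale
results = cross_val_score(Model, train, y_normal, cv=kfold,
                          scoring='neg_mean_squared_error')
print(MSE, -results.mean())                            # both are now MSE of log(SalePrice)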

GridSearch / make_scorer strange results with xgboost model

I am trying to use sklearn's grid search with a model created by xgboost. To do this, I am creating a custom scorer based on the NDCG evaluation metric. I can successfully use Snippet 1, but it is too messy / hacky; I would prefer to use good old sklearn to simplify the code. I tried to implement GridSearchCV, and the results are completely off: for the same X and y sets I get NDCG@k = 0.8 with Snippet 1 versus 0.5 with Snippet 2. Obviously there is something I am not doing right here ...
The following pieces of code return very different results:
Snippet1:
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
max_depth = [6]
learning_rate = [0.22]
n_estimators = [43]
reg_alpha = [0.1]
reg_lambda = [10]
for md in max_depth:
    for lr in learning_rate:
        for ne in n_estimators:
            for ra in reg_alpha:
                for rl in reg_lambda:
                    xgb = XGBClassifier(objective='multi:softprob',
                                        max_depth=md,
                                        learning_rate=lr,
                                        n_estimators=ne,
                                        reg_alpha=ra,
                                        reg_lambda=rl,
                                        subsample=0.6, colsample_bytree=0.6, seed=0)
                    print([md, lr, ne])
                    score = []
                    for train_index, test_index in kf:
                        X_train, X_test = X[train_index], X[test_index]
                        y_train, y_test = y[train_index], y[test_index]
                        xgb.fit(X_train, y_train)
                        y_pred = xgb.predict_proba(X_test)
                        score.append(ndcg_scorer(y_test, y_pred))
                    print('all scores: %s' % score)
                    print('average score: %s' % np.mean(score))
Snippet2:
from sklearn.grid_search import GridSearchCV
params = {
    'max_depth': [6],
    'learning_rate': [0.22],
    'n_estimators': [43],
    'reg_alpha': [0.1],
    'reg_lambda': [10],
    'subsample': [0.6],
    'colsample_bytree': [0.6]
}
xgb = XGBClassifier(objective='multi:softprob', seed=0)
scorer = make_scorer(ndcg_scorer, needs_proba=True)
gs = GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False)
gs.fit(X, y)
gs.best_score_
While Snippet 1 gives me the expected result, the score returned by Snippet 2 is not consistent with the ndcg_scorer.
The problem is with cv in GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False). It can receive a KFold / StratifiedKFold object instead of an int. Unlike what the docs say, it seems that by default an argument of type int does not use StratifiedKFold but another splitter, perhaps KFold.
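A hedged sketch of the workaround, using the old-style API from the question (sklearn.grid_search / pre-0.18 StratifiedKFold) and the params, scorer, X, y objects defined above, so that GridSearchCV uses exactly the same folds as Snippet 1:
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)   # same folds as Snippet 1
xgb = XGBClassifier(objective='multi:softprob', seed=0)
gs = GridSearchCV(xgb, params, cv=kf, scoring=scorer, verbose=10, refit=False)
gs.fit(X, y)
print(gs.best_score_)    # should now be comparable to Snippet 1's average score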
