XGRegressor not fitting data - scikit-learn

I would like to run a CV for an XGBoost tree regression on my X_train, y_train data. My target is of integer values from 25 to 40. I tried to run this code on my training dataset
# A parameter grid for XGBoost
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
cv_params = {
'min_child_weight': [1, 3, 5],
'gamma': [0.5, 1, 2, 3],
'subsample': [i/10.0 for i in range(6,11)],
'colsample_bytree': [i/10.0 for i in range(6,11)],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.02, 0.1]
}
# Initialize XGB
xgb_for_gridsearch = XGBRegressor(
n_estimators = 1000,
objective = 'reg:logistic',
seed = 7
)
# Initialize GridSearch
xgb_grid = GridSearchCV(
estimator = xgb_for_gridsearch,
param_grid = cv_params,
scoring = 'explained_variance',
cv = 5,
n_jobs = -1
)
xgb_grid.fit(X_train, y_train)
xgb_grid.grid_scores_
I get an error the fit().
I kinda expected that the CV would just take forever, but not really an error. The error output is a couple of thousand lines long, so I will just put the only part that relates to my code:
During handling of the above exception, another exception occurred:
JoblibXGBoostError Traceback (most recent call last)
<ipython-input-44-a5c1d517107d> in <module>()
25 )
26
---> 27 xgb_grid.fit(X_train, y_train)
Does anyone know what this relates to?
Am I using conflicting parameters?
Would it be better to use xgboost.cv()?
I can also add the whole error code if that would help, should I just add it at the bottom of this question?
UPDATE: added error to a Gist, as suggested XGRegressor_not_fitting_data, since the error is too long.

Thanks for adding the full error code, it is easier to help you.
A github repo is fine, yet you may find it easier to use https://gist.github.com/ or https://pastebin.com/
Note that the most helpfull line of the full error is generally the last one, which contains here:
label must be in [0,1] for logistic regression
It seems you have used logistic regression (objective = 'reg:logistic', in your code), which is a classification loss, and so it requires y_train to be an array of either 0 or 1.
You can easily fix it with something like
y_train_bin = (y_train == 1).astype(int)
xgb_grid.fit(X_train, y_train_bin)

Related

Using GridSearchCV with xgbranker

I am trying to use GridSearchCV with xgbranker estimator from xgboost. I am trying to use GroupKFold and passing qid (group_ids) parameter to the grid fit method but it's not straightforward. After a bit of hit and trial with solutions already suggested on the web, I finally zeroed on a approach. I am still getting an error which seems to be in the scoring method passed. Any help or working example would be great?
Sample code:
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import make_scorer, ndcg_score
ndcg_scorer = make_scorer(ndcg_score)
param_grid = {
'learning_rate': [0.001, 0.01, 0.02],
'n_estimators': [10, 50]
}
splits = 3
gkf = GroupKFold(n_splits=splits)
cv_group = gkf.split(X_train, y_train, qids_train)
def group_gen():
for ids,_ in cv_group:
yield ids
grid = GridSearchCV(my_model, param_grid, cv=splits, scoring=ndcg_scorer, refit=False)
grid.fit(X_train, y_train, qid=next(group_gen()))
I get below error:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got multiclass instead
The error seems to be related to the scoring method you use, but you didn't share anything about your data. so it's hard to say what exactly is the problem.
It seems to me that you're using for the scoring a method that expects something else then you're providing as a label.

How to pass group information in sklearn Random Grid Search for XGBRanker?

When I'm tryingto perform random grid search on XGBRanker model, I keep getting an error as follows:
/workspace/src/objective/rank_obj.cc:52: Check failed: gptr.size() != 0 && gptr.back() == info.labels_.Size(): group structure not consistent with #rows
The error seems to be regarding the structure of the group information passed. I'm passing the size of each group. If there are N rows and 2 groups then the array passed would be [g1_size, g2_size].
I'm not sure where I'm going wrong since I'm able to fit the model without any issues. Only when I try to perform RandomGridSearchCV, am I facing this error. The code snippet is as follows:
model = xgb.XGBRanker(
objective="rank:ndcg",
max_depth= 10,
n_estimators=100,
verbosity=1)
param_dist = {'n_estimators': [100,200,300],
'learning_rate': [1e-3,1e-4,1e-5],
'subsample': [0.8,0.9,1],
'max_depth': [5, 6, 7]
}
fit_params = {"group": groups}
scoring = make_scorer(ndcg_score, greater_is_better=True)
clf = RandomizedSearchCV(model,
param_distributions=param_dist,
cv =5,
n_iter=5,
scoring=scoring,
error_score=0,
verbose=3,
n_jobs=-1)
clf.fit(X_train, Y_train,**fit_params)

StratifiedKFold split doesn't seem to work

I am trying to perform KFold cross-validation via Keras but due to some reason, the KFold split isn't working.
from sklearn.model_selection import StratifiedKFold
X = train_data[features]
y = train_data['price']
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, test in kfold.split(X,y):
print(X[train])
I was actually fitting the model subsequently but that didn't work, so I tried printing the results, which produced the following warning and output.
Warning: /opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672: UserWarning: The least populated class in y has only 1 member, which is less than n_splits=10.
% (min_groups, self.n_splits)), UserWarning)
Error: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 9,\n 10,\n ...\n 39989, 39990, 39991, 39992, 39993, 39994, 39995, 39996, 39997,\n 39998],\n dtype='int64', length=36000)] are in the [columns]"
The error is self-explanatory:
Warning:
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672:
UserWarning: The least populated class in y has only 1 member, which
is less than n_splits=10. % (min_groups, self.n_splits)), UserWarning)
This means that, for the underrepresented class, you have only one sample, hence the stratified split is unable to work.
I recommend that you check your dataset again in order to verify/correct the labels.
Using x.iloc[test_index] worked for me

Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Does anyone used any optimization models on fitted sklearn models?
What I'd like to do is fit model based on train data and using this model try to find the best combination of parameters for which model would predict the biggest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
'temperature': [10, 15, 30, 20, 25, 30],
'working_hours': [10, 12, 12, 10, 30, 15],
'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop and predict all combination of variables:
results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
import numpy as np
for temp in np.arange(1,100.01,1):
for work_hours in np.arange(1,60.01,1):
results = pd.concat([
results,
pd.DataFrame({
'temperature': temp,
'working_hours': work_hours,
'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1,-1))
}
)
]
)
print(results.sort_values(by='sales_predicted', ascending=False))
Using that way it's difficult or impossible to:
* do it fast (brute method)
* implement constraint concerning two or more variables dependency
We tried PuLP library and PyOmo library, but both doesn't allow to put model.predict function as an objective function returning error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Do anyone have any idea how we can get rid off loop and use some other stuff?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. So if you are trying to maximize your predicted value, you can definitely improve your code to achieve it more efficiently, like below.
You are collecting all the predictions in a big results dataframe, and then sorting it in ascending order. Instead, you can just search for an increase in your target variable (sales_predicted) on-the-fly, using a simple if logic. So just change your loop into this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
for work_hours in np.arange(1, 60.01, 1):
sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))
if sales_predicted > max_sales_predicted:
max_sales_predicted = sales_predicted
desired_temp = temp
desired_work_hours = work_hours
So that you can only take into account any specification that produces a predictiong that exceeds your current target, and else, do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the specification that produce that maxima. Hope this helps.

Problems using poly kernel in GridSearchCV and SVM classifier

I am trying to do a grid search using a SVM classifier.
Consider my data and target that have been parsed from file and input to numpy arrays.
I then preprocess them.
# Transform the data to have zero mean and unit variance.
zeroMeanUnitVarianceScaler = preprocessing.StandardScaler().fit(data)
zeroMeanUnitVarianceScaler.transform(data)
scaledData = data
# Transform the target to have range [-1, 1].
scaledTarget = np.empty([161L,], dtype=int)
for i in range(len(target)):
if(target[i] == 'Malignant'):
scaledTarget[i] = 1
if(target[i] == 'Benign'):
scaledTarget[i] = -1
I now try to set up my grid and fit the scaled data to targets.
# Generate parameters for parameter grid.
CValues = np.logspace(-3, 3, 7)
GammaValues = np.logspace(-3, 3, 7)
kernelValues = ('poly', 'sigmoid')
# kernelValues = ('linear', 'rbf', 'sigmoid')
degreeValues = np.array([0, 1, 2, 3, 4])
coef0Values = np.logspace(-3, 3, 7)
# Generate the parameter grid.
paramGrid = dict(C=CValues, gamma=GammaValues, kernel=kernelValues,
coef0=coef0Values)
# Create and train a SVM classifier using the parameter grid and with
stratified shuffle split.
stratifiedShuffleSplit = StratifiedShuffleSplit(n_splits = 10, test_size =
0.25, train_size = None, random_state = 0)
clf = GridSearchCV(estimator=svm.SVC(), param_grid=paramGrid,
cv=stratifiedShuffleSplit, n_jobs=1)
clf.fit(scaledData, scaledTarget)
If I uncomment the line kernelValues = ('linear', 'rbf', 'sigmoid'), then the code runs in approximately 50 seconds on my 16 GB i7-4950 3.6 GHz machine running windows 10.
However, if I try to run the code as is with 'poly' as a possible kernel value, then the code hangs forever. For example, I ran it yesterday overnight and it did not return anything when I got back in the office today.
Interestingly enough, if I try to create a SVM classifier with a poly kernel, it returns a result immediately
clf = svm.SVC(kernel='poly',degree=2)
clf.fit(data, target)
It hangs up when I do the above code. I have not tried other cv methods to see if that changes anything.
Is this a bug in sci-kit learn? Am I doing things properly? On a side note, is my method of doing gridsearch/cross validation using GridSearchCV and StratifiedShuffleSplit sensible? It seems to me the most brute force (i.e. time consuming) but robust method.
Thank you!

Resources