I'm trying to optimize the hyperparameters of my XGBoost Ranker model, but I can't get it to work.
Here is what my table (df in the code) looks like:
query  relevance  features
1      5          5.4.7....
1      3          6........
2      5          3........
2      3          8........
3      2          1........
Then I split my table into train and test sets, with only one query in the test set:
from sklearn.model_selection import GroupShuffleSplit

# test_size=1 (an int) puts exactly one query group in the test set
gss = GroupShuffleSplit(test_size=1, n_splits=1).split(df, groups=df['query'])
X_train_inds, X_test_inds = next(gss)

train_data = df.iloc[X_train_inds]
X_train = train_data.drop(columns=["relevance"])
Y_train = train_data.relevance

test_data = df.iloc[X_test_inds]
X_test = test_data.drop(columns=["relevance"])
Y_test = test_data.relevance
and build the groups array, which holds the number of rows per query:
groups = train_data.groupby('query').size().to_frame('size')['size'].to_numpy()
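For the sample table above, assuming queries 1 and 2 land in the training split, this yields:
groups  # array([2, 2]): two rows for query 1, two for query 2; the sizes must sum to len(train_data)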
And then I run my model and try to optimize the hyperparameters with a RandomizedSearchCV:
from scipy.stats import randint, uniform

param_dist = {'n_estimators': randint(40, 1000),
              'learning_rate': uniform(0.01, 0.59),
              'subsample': uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': uniform(0.5, 0.4),
              'min_child_weight': [0.05, 0.1, 0.02]
              }
import sklearn.metrics
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

scoring = sklearn.metrics.make_scorer(sklearn.metrics.ndcg_score, k=10,
                                      greater_is_better=True)
model = xgb.XGBRanker(tree_method='hist',
                      booster='gbtree',
                      objective='rank:ndcg')
clf = RandomizedSearchCV(model,
param_distributions=param_dist,
cv=5,
n_iter=5,
scoring=scoring,
error_score=0,
verbose=3,
n_jobs=-1)
clf.fit(X_train, Y_train, group=groups)
Then I get the following error message, which seems to be related to my construction of groups, but I don't see why (note that without the random search the model works):
Check failed: group_ptr_.back() == num_row_ (11544 vs. 9235) : Invalid group structure. Number of rows obtained from groups doesn't equal to actual number of rows given by data.
Same problem as here: (Tuning XGBRanker produces error for groups)
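The mismatch happens because RandomizedSearchCV re-splits the rows of X_train for every CV fold, while the group array passed to fit keeps describing the full training set, so the group sizes no longer sum to the fold's row count (11544 vs. 9235 above). Below is a minimal sketch of a manual loop that recomputes the groups for each fold; it assumes train_data still carries its 'query' column and that rows belonging to one query are contiguous:

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for fold_train_idx, fold_valid_idx in gkf.split(train_data, groups=train_data['query']):
    fold_train = train_data.iloc[fold_train_idx]
    fold_groups = fold_train.groupby('query').size().to_numpy()  # group sizes recomputed per fold
    fold_model = xgb.XGBRanker(tree_method='hist', objective='rank:ndcg')
    fold_model.fit(fold_train.drop(columns=['relevance', 'query']),
                   fold_train['relevance'],
                   group=fold_groups)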
Related
When I'm trying to perform a random grid search on an XGBRanker model, I keep getting an error as follows:
/workspace/src/objective/rank_obj.cc:52: Check failed: gptr.size() != 0 && gptr.back() == info.labels_.Size(): group structure not consistent with #rows
The error seems to concern the structure of the group information passed. I'm passing the size of each group: if there are N rows and 2 groups, then the array passed would be [g1_size, g2_size].
I'm not sure where I'm going wrong, since I'm able to fit the model without any issues. Only when I try to perform RandomizedSearchCV do I face this error. The code snippet is as follows:
model = xgb.XGBRanker(
    objective="rank:ndcg",
    max_depth=10,
    n_estimators=100,
    verbosity=1)
param_dist = {'n_estimators': [100, 200, 300],
              'learning_rate': [1e-3, 1e-4, 1e-5],
              'subsample': [0.8, 0.9, 1],
              'max_depth': [5, 6, 7]
              }
fit_params = {"group": groups}
scoring = make_scorer(ndcg_score, greater_is_better=True)
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
clf.fit(X_train, Y_train,**fit_params)
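One common workaround is to carry the query id inside X as its first column and subclass XGBRanker, so that every CV fold rebuilds its own group sizes after the rows are re-split. A sketch only: the wrapper name below is hypothetical, and it assumes rows may be freely reordered.

import numpy as np
import xgboost as xgb

class QueryAwareRanker(xgb.XGBRanker):
    # Hypothetical wrapper: assumes the first column of X holds the query id.
    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        order = np.argsort(X[:, 0], kind='stable')  # rows of one query must be contiguous
        X, y = X[order], y[order]
        _, group_sizes = np.unique(X[:, 0], return_counts=True)
        return super().fit(X[:, 1:], y, group=group_sizes)  # drop the query-id column

    def predict(self, X):
        return super().predict(np.asarray(X)[:, 1:])

With this, RandomizedSearchCV can call fit on arbitrary row subsets without the group counts going stale, since the groups are derived from the rows actually present in each fold.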
I have a multiclass classification problem with 3 classes:
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before 12PM (noon)
2 - on a given day my laptop battery died at or after 12PM (noon)
(Note that these categories are mutually exclusive; the battery is not recharged once it has died.)
I am interested to know the predicted probability for each 3 classes. More specifically, I intend to derive 2 types of warning:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier
X = np.array([
    [10, 10],
    [8, 10],
    [-5, 5.5],
    [-5.4, 5.5],
    [-20, -20],
    [-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])

clf1 = XGBClassifier(objective='multi:softprob', num_class=3, seed=42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this with sklearn.multiclass.OneVsRestClassifier, which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective='multi:softprob', num_class=3, seed=42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier is also based on a one-vs-rest approach in a multiclass case, since there are 3 probabilities in the output and they sum up to 1.
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to solve my problem?
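If I understand the two normalizations correctly, the difference may come from here (illustrated with hypothetical numbers, not the actual XGBoost internals): 'multi:softprob' trains a single model whose per-class margins go through a softmax, whereas OneVsRestClassifier fits three independent binary classifiers and simply rescales their positive-class probabilities to sum to 1.

import numpy as np

margins = np.array([0.2, 1.0, 1.4])                 # hypothetical per-class raw scores
softprob = np.exp(margins) / np.exp(margins).sum()  # multi:softprob: softmax over margins

binary_p = np.array([0.15, 0.40, 0.60])             # hypothetical independent OvR probabilities
ovr = binary_p / binary_p.sum()                     # OneVsRestClassifier: renormalize to sum to 1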
I'm using MulticlassClassificationEvaluator to retrieve metrics like F1 score or accuracy in a cross-validation in PySpark:
cross_result = CrossValidator(estimator=RandomForestClassifier(),
                              estimatorParamMaps=ParamGridBuilder().build(),
                              evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                              numFolds=5,
                              parallelism=2  # Spark requires parallelism >= 1
                              ).fit(df)      # avgMetrics lives on the fitted CrossValidatorModel; df is the training DataFrame
f1_score = cross_result.avgMetrics[0]
Now, my question is: why is avgMetrics a list if it only has one value? Shouldn't it be a scalar? Am I missing something about this attribute?
Following the source code, I realized that avgMetrics is a list holding, for each parameter combination defined in the ParamGrid, the metric averaged over all cross-validation folds. So:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
# Note that there are three values for maxIter: 0, 1 and 5
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build()
evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
cv = CrossValidator(
estimator=lr,
estimatorParamMaps=grid,
evaluator=evaluator,
parallelism=2
)
cvModel = cv.fit(dataset)
cvModel.avgMetrics[0] # Average accuracy for maxIter = 0
cvModel.avgMetrics[1] # Average accuracy for maxIter = 1
cvModel.avgMetrics[2] # Average accuracy for maxIter = 5
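To match each average back to its parameter map, you can zip the grid with the metrics (a small usage sketch):

for params, metric in zip(grid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)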
Optimizing parameters of SVR() using the pyswarms PSO function.
My dataset has 200 samples with 9 features each, and I have to predict one output parameter. I already did this by calling SVR() with its default parameters, but the results are not satisfactory. Now I want to optimize its parameters using the PSO algorithm, but I am unable to do it.
from sklearn.svm import SVR
import pyswarms as ps

model = SVR()
model.fit(Xtrain, ytrain)
pred_y = model.predict(Xtest)

param = {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
         'C': [1, 5, 10],
         'degree': [3, 8],
         'coef0': [0.01, 10, 0.5],
         'gamma': ('auto', 'scale')}

optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2, options=param)
best_cost, best_pos = optimizer.optimize(model, iters=100)
2019-08-13 12:19:48,551 - pyswarms.single.global_best - INFO - Optimize for 100 iters with {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10], 'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
pyswarms.single.global_best: 0%| |0/100
TypeError: 'SVR' object is not callable
Note: in the original snippet the first two lines of code had been merged into one (model = SVR()model.fit(Xtrain,ytrain)) by mistake; they are shown above as the two separate lines they should be:
1. model = SVR()
2. model.fit(Xtrain, ytrain)
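As for the TypeError itself: optimizer.optimize expects a cost function, not a fitted estimator, and GlobalBestPSO searches a continuous space whose options dict holds the PSO coefficients (c1, c2, w), not an sklearn-style parameter grid. A rough sketch of how it could be wired up for two continuous SVR hyperparameters (C and gamma), assuming Xtrain and ytrain are already defined:

import numpy as np
import pyswarms as ps
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def objective(positions):                  # positions has shape (n_particles, 2)
    costs = []
    for C, gamma in positions:
        model = SVR(kernel='rbf', C=C, gamma=gamma)
        score = cross_val_score(model, Xtrain, ytrain, cv=3,
                                scoring='neg_mean_squared_error').mean()
        costs.append(-score)               # PSO minimizes, so return the MSE
    return np.array(costs)

bounds = (np.array([0.1, 1e-4]), np.array([100.0, 1.0]))  # lower/upper bounds for C, gamma
options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9}                # required swarm coefficients
optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2,
                                    options=options, bounds=bounds)
best_cost, best_pos = optimizer.optimize(objective, iters=100)

Categorical parameters like the kernel cannot be searched this way directly; one common approach is to run a separate swarm per kernel, or to keep grid/random search for the discrete choices.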
I would like to run a CV for an XGBoost tree regression on my X_train, y_train data. My target consists of integer values from 25 to 40. I tried to run this code on my training dataset:
# A parameter grid for XGBoost
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
cv_params = {
    'min_child_weight': [1, 3, 5],
    'gamma': [0.5, 1, 2, 3],
    'subsample': [i/10.0 for i in range(6, 11)],
    'colsample_bytree': [i/10.0 for i in range(6, 11)],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.02, 0.1]
}
# Initialize XGB
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:logistic',
    seed=7
)
# Initialize GridSearch
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)
xgb_grid.fit(X_train, y_train)
xgb_grid.grid_scores_  # note: grid_scores_ was removed in newer scikit-learn; use cv_results_
I get an error on the fit(). I kind of expected that the CV would just take forever, but not an error. The error output is a couple of thousand lines long, so I will just put the only part that relates to my code:
During handling of the above exception, another exception occurred:
JoblibXGBoostError Traceback (most recent call last)
<ipython-input-44-a5c1d517107d> in <module>()
25 )
26
---> 27 xgb_grid.fit(X_train, y_train)
Does anyone know what this relates to?
Am I using conflicting parameters?
Would it be better to use xgboost.cv()?
I can also add the whole error code if that would help; should I just add it at the bottom of this question?
UPDATE: added the error to a Gist, as suggested (XGRegressor_not_fitting_data), since the error is too long.
Thanks for adding the full error code; it makes it easier to help you.
A GitHub repo is fine, but you may find it easier to use https://gist.github.com/ or https://pastebin.com/
Note that the most helpful line of the full error is generally the last one, which here contains:
label must be in [0,1] for logistic regression
It seems you have used logistic regression (objective = 'reg:logistic' in your code), which requires y_train to contain values in [0, 1]. If a binary target is really what you want, you can easily fix it with something like
y_train_bin = (y_train == 1).astype(int)
xgb_grid.fit(X_train, y_train_bin)
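That said, since your target consists of integer values from 25 to 40, this looks like an ordinary regression problem, so switching the objective is probably the real fix. A sketch (note that 'reg:squarederror' is the modern name; older XGBoost releases spelled it 'reg:linear'):

xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:squarederror',  # regression loss instead of 'reg:logistic'
    seed=7
)
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)
xgb_grid.fit(X_train, y_train)  # the original integer-valued target now works unchanged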