I am trying to use GridSearchCV with the XGBRanker estimator from xgboost. I want to use GroupKFold and pass the qid (group ids) parameter to the grid's fit method, but it is not straightforward. After some trial and error with solutions already suggested on the web, I finally settled on an approach, but I am still getting an error that seems to come from the scoring method. Any help or a working example would be great.
Sample code:
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import make_scorer, ndcg_score

ndcg_scorer = make_scorer(ndcg_score)

param_grid = {
    'learning_rate': [0.001, 0.01, 0.02],
    'n_estimators': [10, 50]
}

splits = 3
gkf = GroupKFold(n_splits=splits)
cv_group = gkf.split(X_train, y_train, qids_train)

def group_gen():
    for ids, _ in cv_group:
        yield ids

grid = GridSearchCV(my_model, param_grid, cv=splits, scoring=ndcg_scorer, refit=False)
grid.fit(X_train, y_train, qid=next(group_gen()))
I get below error:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got multiclass instead
The error seems to be related to the scoring method you use, but since you didn't share anything about your data, it's hard to say what exactly the problem is.
It looks like the scoring function you chose expects labels in a different format than the ones you are providing.
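If it helps, here is a minimal sketch of a workaround I would try (an assumption on my part, not a verified fix): sklearn's ndcg_score expects 2D arrays of shape (n_queries, n_documents), so wrapping it to reshape the 1D fold labels at least gets past that format check, although it treats each CV fold as a single query and ignores the qid grouping:
import numpy as np
from sklearn.metrics import make_scorer, ndcg_score

def ndcg_wrapper(y_true, y_pred):
    # ndcg_score wants 2D inputs; treat the whole fold as one query.
    # This ignores the per-query grouping, so it is only a rough sanity check.
    return ndcg_score(np.asarray(y_true).reshape(1, -1),
                      np.asarray(y_pred).reshape(1, -1))

ndcg_scorer = make_scorer(ndcg_wrapper)
A more faithful setup would score each query group separately and average the results, but that requires passing the group information through to the scorer as well.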
I changed the code from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html a little bit, so it looks like this:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear','rbf'), 'C':[10,20, 15, 4]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
clf.best_params_
Then the result is:
{'C': 10, 'kernel': 'rbf'}
But if I change the code to:
parameters = {'kernel':('linear','rbf'), 'C':[4, 10,20, 15]}
You can see that the only change is the order of the C list. But the result is:
{'C': 4, 'kernel': 'rbf'}
It looks like GridSearchCV just uses the first parameter combination.
So I have a few questions about this:
In this case, scoring is the default (None), so which scoring function is actually used here? And why does the above situation happen?
As far as I know, when we use LatentDirichletAllocation with GridSearchCV, the scoring function is the log likelihood even when scoring=None. If I understand correctly, does that mean GridSearchCV automatically picks a scoring function depending on the model it is combined with?
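For reference, here is a minimal sketch (reusing the clf fitted above) of how to inspect the scores GridSearchCV actually computed. With scoring=None it falls back to the estimator's own score method (mean accuracy for SVC), and when several candidates tie on mean_test_score, best_params_ is simply the first of the tied candidates in iteration order, which would explain why reordering the C list changes the reported best parameters:
import pandas as pd

results = pd.DataFrame(clf.cv_results_)
# Look for ties in mean_test_score across the tried C values.
print(results[['param_C', 'param_kernel', 'mean_test_score', 'rank_test_score']])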
I am trying to perform KFold cross-validation with Keras, but for some reason the KFold split isn't working.
from sklearn.model_selection import StratifiedKFold

X = train_data[features]
y = train_data['price']

kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, test in kfold.split(X, y):
    print(X[train])
I was actually fitting the model afterwards, but that didn't work, so I tried printing the splits instead, which produced the following warning and error.
Warning: /opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672: UserWarning: The least populated class in y has only 1 member, which is less than n_splits=10.
% (min_groups, self.n_splits)), UserWarning)
Error: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 9,\n 10,\n ...\n 39989, 39990, 39991, 39992, 39993, 39994, 39995, 39996, 39997,\n 39998],\n dtype='int64', length=36000)] are in the [columns]"
The warning is self-explanatory:
Warning:
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672:
UserWarning: The least populated class in y has only 1 member, which
is less than n_splits=10. % (min_groups, self.n_splits)), UserWarning)
This means that, for the underrepresented class, you have only one sample, hence the stratified split is unable to work.
I recommend that you check your dataset again in order to verify/correct the labels.
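As a quick check (assuming y is a pandas Series), you can list the classes that have fewer members than n_splits; note also that the target here is a continuous price, and StratifiedKFold is meant for classification targets, so plain KFold may be the more appropriate splitter:
counts = y.value_counts()
print(counts[counts < 10])   # classes with too few samples for 10 stratified folds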
Using x.iloc[test_index] (and x.iloc[train_index]) worked for me.
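A minimal sketch of that fix, assuming X is a pandas DataFrame and y a Series; kfold.split returns positional row indices, which is why .iloc works where plain X[train] does not:
for train_index, test_index in kfold.split(X, y):
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    print(X_train_fold.shape, X_test_fold.shape)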
I get the following error when I train a LightGBM model:
# Train the model
import lightgbm as lgb

lgb_train = lgb.Dataset(x_train, y_train)
lgb_val = lgb.Dataset(x_test, y_test)

parameters = {
    'application': 'binary',
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

model = lgb.train(parameters,
                  lgb_train,
                  valid_sets=lgb_val,
                  num_boost_round=5000,
                  early_stopping_rounds=100)
y_pred = model.predict(x_test)
If you used the cut or qcut functions for binning and did not encode the result afterwards (one-hot encoding, label encoding, ...), this may be the cause of the error. Try using an encoding.
I hope it works.
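A minimal sketch of that idea; df and the 'price' column are placeholders here, assuming a column that was binned with pd.cut and left as an interval-valued categorical:
import pandas as pd

df['price_bin'] = pd.cut(df['price'], bins=5)   # interval-valued categorical
df['price_bin'] = df['price_bin'].cat.codes     # label-encode the bins as integers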
I had what might be the same problem.
Post the whole traceback to make sure.
For me it was a problem serializing to JSON, which LightGBM does under the hood to save the booster for later use.
Check your dataset for any date/datetime columns, or anything that remotely looks like a date, and either drop it or convert to something JSON can handle.
Mine had all been converted to categorical dtype by some poorly written Pandas code of mine; I usually do the initial GBM run fairly fast-and-dirty to see which variables show up as important. LightGBM let me build the training data binaries (it would have thrown an error up front if they had still been datetime or timedelta dtypes), ran the training just fine and reported an AUC, then failed after the last training step when it dumped the categoricals to JSON. It was maddening, with a cryptic traceback.
Hope this helps.
If you have any timedelta variable in the dataset, convert it into an int using the dt.days attribute. I faced the same issue; it is reported in the LightGBM GitHub issues.
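A minimal sketch of those conversions, assuming df is the feature frame; the datetime handling follows the same spirit as the answer above and is my addition:
import pandas as pd

for col in df.select_dtypes(include=['timedelta64[ns]']).columns:
    df[col] = df[col].dt.days              # timedelta -> integer number of days
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].astype('int64')      # datetime -> integer nanoseconds since epoch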
I would like to run cross-validation for an XGBoost tree regression on my X_train, y_train data. My target consists of integer values from 25 to 40. I tried to run this code on my training dataset:
# A parameter grid for XGBoost
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

cv_params = {
    'min_child_weight': [1, 3, 5],
    'gamma': [0.5, 1, 2, 3],
    'subsample': [i/10.0 for i in range(6, 11)],
    'colsample_bytree': [i/10.0 for i in range(6, 11)],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.02, 0.1]
}

# Initialize XGB
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:logistic',
    seed=7
)

# Initialize GridSearch
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)

xgb_grid.fit(X_train, y_train)
xgb_grid.grid_scores_
I get an error at the fit() step.
I kind of expected that the CV would just take forever, but not an error. The error output is a couple of thousand lines long, so I will just put the only part that relates to my code:
During handling of the above exception, another exception occurred:
JoblibXGBoostError Traceback (most recent call last)
<ipython-input-44-a5c1d517107d> in <module>()
25 )
26
---> 27 xgb_grid.fit(X_train, y_train)
Does anyone know what this relates to?
Am I using conflicting parameters?
Would it be better to use xgboost.cv()?
I can also add the whole error code if that would help, should I just add it at the bottom of this question?
UPDATE: added the error to a Gist, as suggested (XGRegressor_not_fitting_data), since it is too long.
Thanks for adding the full error code; it makes it easier to help you.
A github repo is fine, yet you may find it easier to use https://gist.github.com/ or https://pastebin.com/
Note that the most helpful line of a full error message is generally the last one, which here contains:
label must be in [0,1] for logistic regression
It seems you have used logistic regression (objective = 'reg:logistic' in your code), which is a classification loss, so it requires y_train to be an array of either 0 or 1.
You can easily fix it with something like
y_train_bin = (y_train == 1).astype(int)
xgb_grid.fit(X_train, y_train_bin)
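Since the target described in the question is integer values from 25 to 40 rather than binary labels, another option (my suggestion, not part of the answer above) is to keep this a regression problem and switch to a regression objective. 'reg:squarederror' assumes a reasonably recent xgboost version; older releases call it 'reg:linear':
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:squarederror',   # regression loss, no [0, 1] label restriction
    seed=7
)
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)
xgb_grid.fit(X_train, y_train)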
Is there a way to set different class weights for xgboost classifier? For example in sklearn RandomForestClassifier this is done by the "class_weight" parameter.
For sklearn version < 0.19
Just assign each entry of your training data its class weight. First get the class weights with sklearn's class_weight.compute_class_weight, then assign each row of the training data its appropriate weight.
I assume here that the training data has a column class containing the class number, and that there are nb_classes classes numbered from 1 to nb_classes.
import numpy as np
from sklearn.utils import class_weight

classes_weights = list(class_weight.compute_class_weight('balanced',
                                                          np.unique(train_df['class']),
                                                          train_df['class']))

weights = np.ones(y_train.shape[0], dtype='float')
for i, val in enumerate(y_train):
    weights[i] = classes_weights[val - 1]

xgb_classifier.fit(X, y, sample_weight=weights)
Update for sklearn version >= 0.19
There is a simpler solution:
from sklearn.utils import class_weight

classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']
)

xgb_classifier.fit(X, y, sample_weight=classes_weights)
When using the sklearn wrapper, there is a parameter for sample weights.
example:
import xgboost as xgb

exgb_classifier = xgb.XGBClassifier()
exgb_classifier.fit(X, y, sample_weight=sample_weights_data)
where the parameter should be array-like of length N, i.e. equal to the length of the target.
I recently ran into this problem, so I thought I would leave the solution I tried.
import numpy as np
from xgboost import XGBClassifier

# Manually handling imbalance. Below is the same as computing
# float(18501) / 392318 on the training dataset.
# We are going to inversely assign the weights.
weight_ratio = float(len(y_train[y_train == 0])) / float(len(y_train[y_train == 1]))

# Use a float array, otherwise the fractional weights would be truncated to ints.
w_array = np.ones(y_train.shape[0], dtype=float)
w_array[y_train == 1] = weight_ratio
w_array[y_train == 0] = 1 - weight_ratio

xgc = XGBClassifier()
xgc.fit(x_df_i_p_filtered, y_train, sample_weight=w_array)
Not sure why, but the results were pretty disappointing. Hope this helps someone.
Reference: https://www.programcreek.com/python/example/99824/xgboost.XGBClassifier
from sklearn.utils.class_weight import compute_sample_weight
xgb_classifier.fit(X, y, sample_weight=compute_sample_weight("balanced", y))
The answers here are outdated. The sample_weight parameter is no longer supported; it's replaced with scale_pos_weight.
Rather, just do scale_pos_weight = sum(negative instances) / sum(positive instances).
You can alternatively use the scale_pos_weight hyperparameter, as discussed in the XGBoost docs. The advantage of this approach is that you don't have to construct the sample weight vector, and don't have to pass in the sample weight vector at fit time.
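A minimal sketch of that approach, assuming y_train is a binary 0/1 array:
import numpy as np
from xgboost import XGBClassifier

neg, pos = np.bincount(y_train)                    # counts of class 0 and class 1
clf = XGBClassifier(scale_pos_weight=neg / pos)    # up-weight the positive class
clf.fit(X_train, y_train)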
Similar to the answers by #Firas Omrane and #Pramit, but I think it is slightly more pythonic:
import numpy as np
from sklearn.utils import class_weight

class_weights = dict(
    zip(
        [0, 1],
        class_weight.compute_class_weight(
            'balanced', classes=np.unique(train['class']), y=train['class']
        ),
    )
)

# sample_weight expects one weight per row, so map each label to its class weight.
sample_weights = train['class'].map(class_weights)
xgb_classifier.fit(X, train['class'], sample_weight=sample_weights)