I am trying to perform KFold cross-validation via Keras, but for some reason the KFold split isn't working.
from sklearn.model_selection import StratifiedKFold
X = train_data[features]
y = train_data['price']
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, test in kfold.split(X, y):
    print(X[train])
I was actually fitting the model afterwards, but that didn't work, so I tried printing the results instead, which produced the following warning and error.
Warning: /opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672: UserWarning: The least populated class in y has only 1 member, which is less than n_splits=10.
% (min_groups, self.n_splits)), UserWarning)
Error: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 9,\n 10,\n ...\n 39989, 39990, 39991, 39992, 39993, 39994, 39995, 39996, 39997,\n 39998],\n dtype='int64', length=36000)] are in the [columns]"
The warning is self-explanatory:
Warning:
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672:
UserWarning: The least populated class in y has only 1 member, which
is less than n_splits=10. % (min_groups, self.n_splits)), UserWarning)
This means that, for the underrepresented class, you have only one sample, hence the stratified split is unable to work.
I recommend that you check your dataset again in order to verify/correct the labels.
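For example, a quick check (assuming y is the pandas Series from the question) to spot classes with fewer members than n_splits:
# classes with fewer than 10 members cannot be split into 10 stratified folds
print(y.value_counts().sort_values().head(10))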
Using X.iloc[test_index] (and X.iloc[train_index]) worked for me; the indices returned by kfold.split are positional, not DataFrame labels.
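A minimal sketch of the loop with positional indexing (assuming the X, y, and kfold objects from the question):
for train_index, test_index in kfold.split(X, y):
    # .iloc selects rows by position, which is what the KFold indices are
    X_tr, X_te = X.iloc[train_index], X.iloc[test_index]
    y_tr, y_te = y.iloc[train_index], y.iloc[test_index]
    print(X_tr.shape, X_te.shape)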
I am trying to use GridSearchCV with the xgbranker estimator from xgboost. I am trying to use GroupKFold and pass the qid (group_ids) parameter to the grid's fit method, but it's not straightforward. After some trial and error with solutions already suggested on the web, I finally zeroed in on an approach. I am still getting an error, which seems to come from the scoring method passed. Any help or a working example would be great.
Sample code:
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import make_scorer, ndcg_score
ndcg_scorer = make_scorer(ndcg_score)
param_grid = {
    'learning_rate': [0.001, 0.01, 0.02],
    'n_estimators': [10, 50]
}
splits = 3
gkf = GroupKFold(n_splits=splits)
cv_group = gkf.split(X_train, y_train, qids_train)

def group_gen():
    for ids, _ in cv_group:
        yield ids

grid = GridSearchCV(my_model, param_grid, cv=splits, scoring=ndcg_scorer, refit=False)
grid.fit(X_train, y_train, qid=next(group_gen()))
I get below error:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got multiclass instead
The error seems to be related to the scoring method you use, but you didn't share anything about your data, so it's hard to say what exactly the problem is.
It looks like the scorer you pass expects labels in a different format than the one you are providing.
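If the scorer is indeed the issue, one possible workaround (only a sketch, not tested against your data) is to wrap ndcg_score so it accepts the 1D arrays that make_scorer passes, e.g. by treating each CV fold as a single query; note that this ignores the qid grouping inside the fold, so it is an approximation:
import numpy as np
from sklearn.metrics import make_scorer, ndcg_score

def ndcg_1d(y_true, y_score):
    # ndcg_score expects 2D arrays of shape (n_queries, n_documents),
    # so reshape the 1D vectors into a single "query" row
    return ndcg_score(np.asarray(y_true).reshape(1, -1),
                      np.asarray(y_score).reshape(1, -1))

ndcg_scorer = make_scorer(ndcg_1d)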
Is there a way to return the standard deviation of the predictions using sklearn.ensemble.BaggingRegressor?
In all the examples I have found, only the mean prediction is returned.
You can always get the underlying predictions from each estimator of the ensemble; the estimators are accessible through the estimators_ attribute of the ensemble, and you can handle their predictions however you like (compute the mean, standard deviation, etc.).
Adapting the example from the documentation, with an ensemble of 10 SVR base estimators:
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=4,
                       n_informative=2, n_targets=1,
                       random_state=0, shuffle=False)
regr = BaggingRegressor(base_estimator=SVR(),
                        n_estimators=10, random_state=0).fit(X, y)
regr.predict([[0, 0, 0, 0]]) # get (mean) prediction for a single sample, [0, 0, 0, 0]
# array([-2.87202411])
# get the predictions from each individual member of the ensemble using a list comprehension:
raw_pred = [x.predict([[0, 0, 0, 0]]) for x in regr.estimators_]
raw_pred
# result:
[array([-2.13003431]),
array([-1.96224516]),
array([-1.90429596]),
array([-6.90647796]),
array([-6.21360547]),
array([-1.84318744]),
array([1.82285686]),
array([4.62508622]),
array([-5.60320499]),
array([-8.60513286])]
# get the mean, and ensure that it is the same with the one returned above with the .predict method of the ensemble:
np.mean(raw_pred)
# -2.8720241079257436
np.mean(raw_pred) == regr.predict([[0, 0, 0, 0]]) # sanity check
# array([ True])
# get the standard deviation:
np.std(raw_pred)
# 3.865135037828279
There is no built-in way to do that, no.
The fitted estimators are available in the estimators_ attribute, along with estimators_features_ if you've set max_features < 1.0, so you can reproduce the individual predictions manually.
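For example, a minimal sketch (assuming the fitted regr ensemble from the example above) that combines the two attributes; with the default max_features=1.0 every subset simply contains all features:
import numpy as np
x_new = np.array([[0, 0, 0, 0]])
# predict with each member on the feature subset it was trained on
raw_pred = [est.predict(x_new[:, feats])
            for est, feats in zip(regr.estimators_, regr.estimators_features_)]
print(np.mean(raw_pred), np.std(raw_pred))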
Has anyone used optimization models on top of fitted sklearn models?
What I'd like to do is fit a model on training data and then, using this model, find the combination of input values for which the model predicts the largest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
    'temperature': [10, 15, 30, 20, 25, 30],
    'working_hours': [10, 12, 12, 10, 30, 15],
    'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop that predicts over all combinations of the variables:
import numpy as np
results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1, -1))
            })
        ])
print(results.sort_values(by='sales_predicted', ascending=False))
Doing it that way, it's difficult or impossible to:
* do it fast (it's a brute-force method)
* implement constraints that involve dependencies between two or more variables
We tried the PuLP and Pyomo libraries, but neither allows passing model.predict as an objective function; they return the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have an idea how we can get rid of the loop and use something else?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. So if you are trying to maximize your predicted value, you can definitely improve your code to achieve it more efficiently, like below.
You are collecting all the predictions in a big results dataframe and then sorting it. Instead, you can track increases in your target variable (sales_predicted) on the fly, using a simple if check. So just change your loop into this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        # take the scalar out of the length-1 prediction array
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))[0]
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
This way you only keep a specification when it produces a prediction that exceeds the current maximum, and otherwise do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the specification that produces that maximum. Hope this helps.
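As a further note, if the nested loop itself is the bottleneck, one option (just a sketch, using the model and column names from the question) is to build the whole grid of candidate inputs once and call predict a single time, which is vectorised inside scikit-learn; this only addresses speed, not constraints between variables:
import numpy as np
import pandas as pd
# build every (temperature, working_hours) combination as one DataFrame
temps = np.arange(1, 101, 1)
hours = np.arange(1, 61, 1)
grid = pd.DataFrame([(t, h) for t in temps for h in hours],
                    columns=['temperature', 'working_hours'])
# a single vectorised predict call over the full grid
grid['sales_predicted'] = model.predict(grid)
print(grid.loc[grid['sales_predicted'].idxmax()])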
I am plotting a 2D plot for SVC Bernoulli output.
I converted the text to vectors with averaged word2vec and standardised the data,
then split the data into train and test sets.
Through grid search I found the best C and gamma (RBF kernel):
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

clf = SVC(C=100, gamma=0.0001)
clf.fit(X_train1, y_train)
plot_decision_regions(X_train, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
I receive the error:
ValueError: y must be a NumPy array. Found
I also tried converting y to a NumPy array. Then it prompts the error:
ValueError: y must be an integer array. Found object. Try passing the array as y.astype(np.integer)
Finally I converted it to an integer array.
Now it prompts the error:
ValueError: Filler values must be provided when X has more than 2 training features.
You can use PCA to reduce your multi-dimensional data to two dimensions. Then pass the result to plot_decision_regions and there will be no need for filler values:
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt

clf = SVC(C=100, gamma=0.0001)
pca = PCA(n_components=2)
X_train2 = pca.fit_transform(X_train)
clf.fit(X_train2, y_train)
plot_decision_regions(X_train2, y_train, clf=clf, legend=2)
plt.xlabel('PC 1', size=14)  # after PCA the axes are principal components, not the original features
plt.ylabel('PC 2', size=14)
plt.title('SVM Decision Region Boundary', size=16)
I've spent some time with this too as plot_decision_regions was then complaining ValueError: Column(s) [2] need to be accounted for in either feature_index or filler_feature_values and there's one more parameter needed to avoid this.
So, say, you have 4 features and they come unnamed:
X_train_std.shape[1] = 4
We can refer to each feature by its index: 0, 1, 2, 3. You can only plot 2 features at a time; say you want 0 and 2.
You'll need to specify one additional parameter (to those specified in #sos.cott's answer), feature_index, and fill the rest with fillers:
value = 1.5
width = 0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
                            feature_index=[0, 2],                        # these will be plotted
                            filler_feature_values={1: value, 3: value},  # these will be ignored
                            filler_feature_ranges={1: width, 3: width})
For the NumPy array problem, you can just do the following (assuming X_train and y_train are still pandas DataFrames):
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
For the filler_feature issue, you have to specify the number of features so you do the following:
import matplotlib.pyplot as plt

value = 1.5
width = 0.75
fig, ax = plt.subplots()
plot_decision_regions(X_train.values, y_train.values, clf=clf,
                      filler_feature_values={2: value, 3: value, 4: value},
                      filler_feature_ranges={2: width, 3: width, 4: width},
                      legend=2, ax=ax)
You need to add one filler entry for each feature beyond the two being plotted.
I would like to run a CV for an XGBoost tree regression on my X_train, y_train data. My target consists of integer values from 25 to 40. I tried to run this code on my training dataset:
# A parameter grid for XGBoost
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
cv_params = {
    'min_child_weight': [1, 3, 5],
    'gamma': [0.5, 1, 2, 3],
    'subsample': [i/10.0 for i in range(6, 11)],
    'colsample_bytree': [i/10.0 for i in range(6, 11)],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.02, 0.1]
}
# Initialize XGB
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:logistic',
    seed=7
)
# Initialize GridSearch
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)
xgb_grid.fit(X_train, y_train)
xgb_grid.grid_scores_
I get an error on fit().
I kinda expected that the CV would just take forever, but not an actual error. The error output is a couple of thousand lines long, so I will just include the part that relates to my code:
During handling of the above exception, another exception occurred:
JoblibXGBoostError Traceback (most recent call last)
<ipython-input-44-a5c1d517107d> in <module>()
25 )
26
---> 27 xgb_grid.fit(X_train, y_train)
Does anyone know what this relates to?
Am I using conflicting parameters?
Would it be better to use xgboost.cv()?
I can also add the whole error code if that would help, should I just add it at the bottom of this question?
UPDATE: added error to a Gist, as suggested XGRegressor_not_fitting_data, since the error is too long.
Thanks for adding the full error code, it is easier to help you.
A github repo is fine, yet you may find it easier to use https://gist.github.com/ or https://pastebin.com/
Note that the most helpful line of the full error is generally the last one, which here contains:
label must be in [0,1] for logistic regression
It seems you have used logistic regression (objective='reg:logistic' in your code), which is a classification loss, so it requires y_train to be an array of 0s and 1s.
You can easily fix it with something like
y_train_bin = (y_train == 1).astype(int)
xgb_grid.fit(X_train, y_train_bin)
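Alternatively, since the question's target is integer values from 25 to 40 rather than 0/1 labels, here is a sketch (assuming a reasonably recent xgboost) of keeping y_train unchanged and switching to a regression objective instead:
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:squarederror',  # plain regression loss instead of 'reg:logistic'
    seed=7
)
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)
xgb_grid.fit(X_train, y_train)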