XGboost python - classifier class weight option? - scikit-learn

Is there a way to set different class weights for xgboost classifier? For example in sklearn RandomForestClassifier this is done by the "class_weight" parameter.

For sklearn version < 0.19
Just assign each entry of your train data its class weight. First get the class weights with class_weight.compute_class_weight of sklearn then assign each row of the train data its appropriate weight.
I assume here that the train data has the column class containing the class number. I assumed also that there are nb_classes that are from 1 to nb_classes.
from sklearn.utils import class_weight
classes_weights = list(class_weight.compute_class_weight('balanced',
np.unique(train_df['class']),
train_df['class']))
weights = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train):
weights[i] = classes_weights[val-1]
xgb_classifier.fit(X, y, sample_weight=weights)
Update for sklearn version >= 0.19
There is simpler solution
from sklearn.utils import class_weight
classes_weights = class_weight.compute_sample_weight(
class_weight='balanced',
y=train_df['class']
)
xgb_classifier.fit(X, y, sample_weight=classes_weights)

when using the sklearn wrapper, there is a parameter for weight.
example:
import xgboost as xgb
exgb_classifier = xgboost.XGBClassifier()
exgb_classifier.fit(X, y, sample_weight=sample_weights_data)
where the parameter shld be array like, length N, equal to the target length

I recently ran into this problem, so thought will leave a solution I tried
from xgboost import XGBClassifier
# manually handling imbalance. Below is same as computing float(18501)/392318
on the trainig dataset.
# We are going to inversely assign the weights
weight_ratio = float(len(y_train[y_train == 0]))/float(len(y_train[y_train ==
1]))
w_array = np.array([1]*y_train.shape[0])
w_array[y_train==1] = weight_ratio
w_array[y_train==0] = 1- weight_ratio
xgc = XGBClassifier()
xgc.fit(x_df_i_p_filtered, y_train, sample_weight=w_array)
Not sure, why but the results were pretty disappointing. Hope this helps someone.
[Reference link] https://www.programcreek.com/python/example/99824/xgboost.XGBClassifier

from sklearn.utils.class_weight import compute_sample_weight
xgb_classifier.fit(X, y, sample_weight=compute_sample_weight("balanced", y))

The answers here are outdated. THe sample_weight parameter is no longer supported. Its replaced with scale_pos_weight
Rather just do scale_pos_weight = sum(negative instances) / sum(positive instances)

You can alternatively use the scale_pos_weight hyperparameter, as discussed in the XGBoost docs. The advantage of this approach is that you don't have to construct the sample weight vector, and don't have to pass in the sample weight vector at fit time.

Similar to #Firas Omrane and #Pramit answer, but I think it is slightly more pythonic
from sklearn.utils import class_weight
class_weights = dict(
zip(
[0,1],
class_weight.compute_class_weight(
'balanced', classes=np.unique(train['class']), y=train['class']
),
)
)
xgb_classifier.fit(X, train['class'], sample_weight=class_weights)

Related

probability difference between categorical target and one-hot encoding target using OneVsRestClassifier

A bit confused with the probability between categorical target and one-hot encoding target from OneVsRestClassifier of sklean. Using iris data with simple logistic regression as an example. When I use original iris class[0,1,2], the calculated OneVsRestClassifier() probability for each observation will always add up to 1. However, if I converted the target to dummies, this is not the case. I understand that OneVsRestClassifier() compares one vs rest (class 0 vs non class 0, class 1 vs non class 1, etc). It makes more sense that the sum of these probabilities has no relation with 1. Then why I see the difference and how so?
import numpy as np
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
np.set_printoptions(suppress=True)
iris = datasets.load_iris()
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
X = iris.data[perm]
y = iris.target[perm]
# categorical target with no conversion
X_train, y_train1 = X[:80], y[:80]
X_test, y_test1 = X[80:], y[80:]
m3 = LogisticRegression(random_state=0)
clf1 = OneVsRestClassifier(m3).fit(X_train, y_train1)
y_pred1 = clf1.predict(X_test)
print(np.sum(y_pred1 == y_test))
y_prob1 = clf1.predict_proba(X_test)
y_prob1[:5]
#output
array([[0.00014508, 0.17238549, 0.82746943],
[0.03850173, 0.79646817, 0.1650301 ],
[0.73981106, 0.26018067, 0.00000827],
[0.00016332, 0.32231163, 0.67752505],
[0.00029197, 0.2495404 , 0.75016763]])
# one hot encoding for categorical target
y2 = pd.get_dummies(y)
y_train2 = y2[:80]
y_test2 = y2[80:]
clf2 = OneVsRestClassifier(m3).fit(X_train, y_train2)
y_pred2 = clf2.predict(X_test)
y_prob2 = clf2.predict_proba(X_test)
y_prob2[:5]
#output
array([[0.00017194, 0.20430011, 0.98066319],
[0.02152246, 0.44522562, 0.09225181],
[0.96277892, 0.3385952 , 0.00001076],
[0.00023024, 0.45436925, 0.95512082],
[0.00036849, 0.31493725, 0.94676348]])
When you encode the targets, sklearn interprets your problem as a multilabel one instead of just multiclass; that is, that it is possible for a point to have more than one true label. And in that case, it is perfectly acceptable for the total sum of probabilities to be greater (or less) than 1. That's generally true for sklearn, but OneVsRestClassifier calls it out specifically in the docstring:
OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit.
As for the first approach, there are indeed three independent models, but the predictions are normalized; see the source code. Indeed, that's the only difference:
(y_prob2 / y_prob2.sum(axis=1)[:, None] == y_prob1).all()
# output
True
It's probably worth pointing out that LogisticRegression also natively supports multiclass. In that case, the weights for each class are independent, so it's similar to three separate models, but the resulting probabilities are the result of a softmax application, and the loss function minimizes the loss for each class simultaneously, so that the resulting coefficients and hence predictions can be different from those obtained from OneVsRestClassifier:
m3.fit(X_train, y_train1)
y_prob0 = m3.predict_proba(X_test)
y_prob0[:5]
# output:
array([[0.00000494, 0.01381671, 0.98617835],
[0.02569699, 0.88835451, 0.0859485 ],
[0.95239985, 0.04759984, 0.00000031],
[0.00001338, 0.04195642, 0.9580302 ],
[0.00002815, 0.04230022, 0.95767163]])

How to deal with imbalanced classes in Keras

I am working on a multi-label image classification problem with Keras and so I utilize functions flow_from_dataframe() and fit_generator().
I have about 2000 classes and as you can guess they are highly skewed / imbalanced. After searching a bit I came across with arguments class_weight and classes and I decided to give them a try. My problem is, I am not sure if I use them correctly. Here is an example:
Let's assume that I have flatten all class occurrences so that I get the following list of (duplicated) labels:
labels = ['classD', 'classA', 'classA', 'classC', 'classD', 'classD']
And this is the function that computes classes and class_weight:
from collections import Counter
def get_classes_weights(l, n):
counter = Counter(l).most_common(n)
classes = [cls for cls, ocu in counter]
majority = max([ocu for cls, ocu in counter])
weights = {idx: float(majority/ocu) for idx, (cls, ocu) in enumerate(counter)}
return classes, weights
Let's also assume that I what to consider the top-2 classes only:
classes, class_weight = get_classes_weights(labels, 2)
This gives:
classes: ['classD', 'classA']
and:
class_weight: {0: 1.0, 1: 1.5}
And finally, this is how I use them within the functions:
generator_train.flow_from_dataframe(
classes=classes,
)
model.fit_generator(
class_weight=class_weight
)
So my question are:
Is the above the right way to apply weights given that I work on a multi-label image classification problem?
Does my validation set need to be balanced or it is OK if it has been taken from the same distribution as the training set (20% and 80% random selection, respectively)?

best-found PCA estimator to be used as the estimator in RFECV

This works (mostly from the demo sample at sklearn):
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform
lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])
# Plot the PCA spectrum
pca.fit(data_num)
plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
random_state=42).astype(int)
# Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator_pca = GridSearchCV(pipe,
dict(pca__n_components=n_components)
)
estimator_pca.fit(data_num, data_labels)
plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen ' +
str(estimator_pca.best_estimator_.named_steps['pca'].n_components))
plt.legend(prop=dict(size=12))
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)
plt.show()
And this works:
from sklearn.feature_selection import RFECV
estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()
but this gives me the error "RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes" on the line "selector1 = selector1.fit"
pca_est = estimator_pca.best_estimator_
selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector1.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()
How do I get my best-found PCA estimator to be used as the estimator in RFECV?
This is a known issue in pipeline design. Refer to the github page here:
Accessing fitted attributes:
Moreover, some fitted attributes are used by meta-estimators;
AdaBoostClassifier assumes its sub-estimator has a classes_ attribute
after fitting, which means that presently Pipeline cannot be used as
the sub-estimator of AdaBoostClassifier.
Either meta-estimators such as AdaBoostClassifier need to be
configurable in how they access this attribute, or meta-estimators
such as Pipeline need to make some fitted attributes of sub-estimators
accessible.
Same goes for other attributes like coef_ and feature_importances_. They are parts of last estimator so not exposed by pipeline.
Now you can try to follow the last para here and try to circumvent this to include it in pipeline, by doing something like this:
class Mypipeline(Pipeline):
#property
def coef_(self):
return self._final_estimator.coef_
#property
def feature_importances_(self):
return self._final_estimator.feature_importances_
And then using this new pipeline class in your code instead of original Pipeline.
This should work in most cases but not yours. You are doing feature reduction using PCA inside the pipeline. But want to do feature selection using RFECV. This in my opinion is not a good combination.
RFECV will keep on decreasing the number of features to be used. But the n_components in your best selected pca from above grid-search will be fixed. Then it will again throw an error when number of features become less than n_components. You cannot do anything in that case.
So I would advise you to think over your use case and code.

k means cluster method score negative

guys. I am yet a beginner trying to learn ML so do forgive me for such a simple question. I had a dataset from UCI ML Repository. So, started applying all kinds of unsupervised algorithm in which i also applied K Means Cluster algorithm. When I printed out the accuracy score it was negative, not just once but many times. As far as I know scores aren't negative. So could you please help me as to why it's negative.
Any help is appreciated.
import pandas as pd
import numpy as np
a = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', names = ["a", "b", "c", "d","e","f","g","h","i"])
b = a
c = b.filter(a.columns[[8]], axis=1)
a.drop(a.columns[[8]], axis=1, inplace=True)
from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
le1.fit(a.a)
a.a = le1.transform(a.a)
from sklearn.preprocessing import OneHotEncoder
x = np.array(a)
y = np.array(c)
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit(x)
x = ohe.transform(x).toarray()
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(x,y,test_size=0.2)
from sklearn import cluster
kmean = cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
kmean.fit(xtr,ytr)
print(kmean.score(xts,yts))
Thank you!!
The k-means score is an indication of how far the points are from the centroids.
In scikit learn, the score is better the closer to zero it is.
Bad scores will return a large negative number, whereas good scores return close to zero. Generally, you will want to take the absolute value of the output from the scores method for better visualization.
Clustering is not classification.
Note that the 'y' argument of fit is ignored. Kmeans will always predict 0,1,...,k-1. So it will never make a correct label on this data set, because it doesn't even know what a label is supposed to look like. It really doesn't work to transfer what you did in classification to clustering. You need to relearn this from scratch. Different workflow, different evaluation.
It was explained in a book called "Hands-on Machine Learning with Scikit Learn Keras and TensorFlow" by Geron Aurelien.
On page 243 of the book (Chapter 9), it says that "The score() method returns the negative inertia. Why negative? Because a predictor’s score() method must always respect Scikit-Learn’s “greater is better” rule: if a predictor is better than another, its score() method should return a greater score."
Hope this helped!

Random forest regression severely overfits single variable data

I am trying to use sklearn's random forest regression for a toy example. I generated 500 uniform random numbers between 1 and 100 as the predictor variables, and then took their logs and added Gaussian noise to form the response variables.
I've heard that random forests typically work well out of the box, so I was expecting a reasonable looking curve, but this is what I got:
I don't understand why the random forest seems to hit each data point. Because of bagging, each tree is missing some fraction of the data, so when all of the trees are averaged it seems like the curve should be more smoothed out, and not hit the outliers.
I'd appreciate any help in understanding why this model overfits so much.
Here's the code I used to generate the plot:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt
def create_design_matrix(x_array):
return x_array.reshape((x_array.shape[0],1))
N = 1000
x_array = np.random.uniform(1, 100, N)
y_array = np.log(x_array) + np.random.normal(0, 0.5, N)
model = RandomForestClassifier(n_estimators=100)
model = model.fit(create_design_matrix(x_array), y_array)
test_x = np.linspace(1.0, 100.0, num=10000)
test_y = model.predict(create_design_matrix(test_x))
plt.plot(x_array, y_array, 'ro', linewidth=5.0)
plt.plot(test_x, test_y)
plt.show()
Thank you!
First of all, this is a regression problem, not classification.
If you have as many classes as samples, a decision tree will fit it,
for me this is not an overfit.
If you think this is an overfit, you may reduce the depth of the decision tree.

Resources